Comp 271 Weeks 14 & 15
Compiler
Using tokens: see class Tokenizer in smachines2.
public enum Tokens { PLUS, MINUS, TIMES, DIV, MOD, ASSIGN, EQUAL, LESS, LESSEQUAL,
GREATER, GREATEREQUAL, ...
DO, WHILE, IF, ELSE, RETURN }
Converting from a String to a Token
Greater simplicity in comparisons (== works on enum values)
switch() statements may not work on Strings (String cases require Java 7 or later), but they always work on enums
construction of static strToTokenMap: a singleton?
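A hedged sketch of how strToTokenMap might be built (the initialization style and method name toToken are assumptions, not the actual course code; the static initializer runs once when the class is loaded, so the map is effectively a singleton):

import java.util.HashMap;
import java.util.Map;

public class Tokenizer {
    private static final Map<String, Tokens> strToTokenMap = new HashMap<>();
    static {
        // runs exactly once, at class-load time
        strToTokenMap.put("+", Tokens.PLUS);
        strToTokenMap.put("-", Tokens.MINUS);
        strToTokenMap.put("*", Tokens.TIMES);
        strToTokenMap.put("/", Tokens.DIV);
        strToTokenMap.put("%", Tokens.MOD);
        strToTokenMap.put("=", Tokens.ASSIGN);
        strToTokenMap.put("==", Tokens.EQUAL);
        strToTokenMap.put("<", Tokens.LESS);
        strToTokenMap.put("<=", Tokens.LESSEQUAL);
        strToTokenMap.put("while", Tokens.WHILE);
        // ... and so on for the remaining tokens
    }

    public static Tokens toToken(String s) {
        return strToTokenMap.get(s);   // null if s is not a known token
    }
}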
CompileExpr2
The strategy is to have compileExpr return an Integer object if the expr is constant (either a final int or a numeric literal).
Problem: 1+n, where n is a variable. The constant's push is deferred, so by the time we discover the expression is not constant after all, the code for n has already been emitted.
One fix is to push the numeric value we want at that point, and then emit a SWAP instruction to get the operands into the right order if the operator is not commutative.
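For example, compiling 5-n this way might emit the following (a sketch in the same stack-machine notation used later in these notes; the SUB opcode is an assumption alongside ADD and MUL):

LOAD n      // the constant 5 was held back; n is compiled first
LOADI 5     // now push the deferred constant: stack holds n, 5
SWAP        // subtraction is not commutative, so reorder to 5, n
SUB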
CompileExpr3
Here we build the actual expression tree.
Note that the expression tree is different from the true parse tree. The original EBNF grammar is
E -> T { addop T }
In "plain" BNF grammar, this is:
E -> T MT
MT -> addop T MT | empty
This builds a somewhat weird tree (draw some examples). However, we can do better:
Define ExprNode, ConstNode, VarNode, UnopNode, BinopNode. Note that the
latter four are subclasses of the abstract class ExprNode. We have one
abstract method in ExprNode: compile() (the only method at all!)
Then modify compileSExpr, compileTerm so that, if both operands are
ConstNodes, then we create a new ConstNode to hold the result, instead of generating lots of code.
Note how the object hierarchy helps us here, and also how we check node types with instanceof.
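A minimal sketch of the hierarchy and the constant-folding check (UnopNode omitted; apply() is an assumed compile-time evaluator, and emit() here just prints, standing in for whatever the real code stream is):

abstract class ExprNode {
    abstract void compile();            // emit stack-machine code for this subtree

    static void emit(String instr) {    // assumed: appends to the code stream;
        System.out.println(instr);      // printed here just to keep the sketch runnable
    }
}

class ConstNode extends ExprNode {
    int value;
    ConstNode(int value) { this.value = value; }
    void compile() { emit("LOADI " + value); }
}

class VarNode extends ExprNode {
    String name;
    VarNode(String name) { this.name = name; }
    void compile() { emit("LOAD " + name); }
}

class BinopNode extends ExprNode {
    ExprNode left, right;
    Tokens op;
    BinopNode(ExprNode left, Tokens op, ExprNode right) {
        this.left = left; this.op = op; this.right = right;
    }
    void compile() {
        left.compile();                 // operands first...
        right.compile();
        emit(op.name());                // ...then the operator (mapping PLUS -> ADD etc. elided)
    }
}

Then, inside compileSExpr/compileTerm, the folding step looks something like:

// fold two constant operands instead of generating code for them
if (left instanceof ConstNode && right instanceof ConstNode)
    return new ConstNode(apply(op, ((ConstNode) left).value,
                                   ((ConstNode) right).value));
return new BinopNode(left, op, right);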
Common subexpressions
Suppose we want to optimize away common subexpressions:
- z1 = x+y; z2 = x+y;
- z = (x+y)*(x+y)
The first example is an optimization among multiple statements; the second is an optimization within a single statement.
How do you detect these?
We construct a Map of subexpressions, making sure that we look for == matches instead of .equals() matches; this means that we cannot use HashMap in the usual way.
We can, however, still use hashing, if we pay careful attention to the hashCode() method, e.g.
BinopNode.hashCode = mixup(left.hashCode(), right.hashCode(), (int) operator);
Each subexpression is first looked up in the hashtable; if we find
an exact match there, we return the hashed entry. Note that we really
want to be using == on the pointers, all the way, not .equals().
Example: (x+y)*(x+y)
1. Enter x, y, BinopNode(x, PLUS, y) in the table.
2. Now do it again
second x: find entry for previous x, and return a duplicate pointer to the same node.
second y: find entry for previous y, and return a duplicate pointer to the same node.
Now we want to see if we can find an instance of (x, PLUS, y). We search our subexpression table for an exact match, using == on all three fields. We find one! Therefore, we do not create a new BinOpNode; we just return a reference to the existing cell.
Technically, we want to do the lookup before actually creating the node.
One way to achieve this is to modify the node constructors, but that can be massively confusing.
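One alternative is a static factory method that does the lookup before construction. A hedged sketch (mixup(), the bucket structure, and the name makeBinop are all assumptions; because subnodes are already shared, comparing the fields with == suffices):

import java.util.*;

static final Map<Integer, List<BinopNode>> exprTable = new HashMap<>();

static BinopNode makeBinop(ExprNode left, Tokens op, ExprNode right) {
    // subnodes are already hash-consed, so identity-based hashes are stable
    int h = mixup(System.identityHashCode(left),
                  System.identityHashCode(right), op.ordinal());
    List<BinopNode> bucket = exprTable.computeIfAbsent(h, k -> new ArrayList<>());
    for (BinopNode n : bucket)
        if (n.left == left && n.op == op && n.right == right)
            return n;                   // exact == match on all three fields: reuse it
    BinopNode n = new BinopNode(left, op, right);
    bucket.add(n);
    return n;
}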
Now we have to generate code. The first time we come to a common subexpression, we evaluate it and create a new storage location to hold the result. That is, we ALLOC a new variable temp on the stack frame before compiling the expression. (Note that if we try this after starting to compile the expression, we have already emitted code that pushes things onto the stack, and we will muddle things up.) That is, the code for
z = (x+y)*(x+y)
compiles as if it were
temp = x+y;
z = temp * temp;
Rather than reloading temp for the first use, though, we would probably do something like this, say, for (x+y)*3*(x+y) (the intervening 3 makes a superficially obvious optimization less obvious):
LOAD x
LOAD y
ADD
DUP          // now two copies of x+y are on the stack
STORF temp   // pops one copy
LOADI 3
MUL
LOADF temp
MUL
Tree Balancing
Tree Rotations
Right rotation:

        A                        B
       / \                      / \
      B   T3        =>        T1   A
     / \                          / \
    T1  T2                      T2   T3
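In code, a right rotation is just a few pointer swaps. A minimal sketch (Node is an assumed BST node class, not the course's actual one):

class Node {
    int value;
    Node left, right;
}

// Rotate right about a: a's left child b becomes the new subtree root.
static Node rotateRight(Node a) {
    Node b = a.left;
    a.left = b.right;       // subtree T2 moves across
    b.right = a;
    return b;               // caller links b in where a used to be
}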
Splay Trees
Splay trees are binary search trees where, on every access, we move the value in question (if found) to the root, in an operation known as splaying.
Note that accesses are now mutation operations! The idea is that
frequently accessed values will gravitate over time towards the root.
Splay Trees modify the rotate-to-root strategy with the addition of "grandfather" rotations. Rotations occur in pairs (except possibly the last, if x ends up a child of the root). Let x be a node, p its parent, and g its grandparent. If x is the left subnode of p and p is the right subnode of g (or x is right and p is left), we rotate around p and then g as in the rotate-to-root method. This is sometimes called the zig-zag step, as the two rotations are in different directions. However, if x is left and p is left, we rotate first around g and then around p (note the reversed order!), in a step sometimes known as "zig-zig".
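A hedged sketch of one double-rotation step, using the Node class from the rotation sketch above (rotateRightAbout/rotateLeftAbout are assumed helpers that also re-link the rotated subtree into its parent; only two of the four cases are shown):

static void splayStep(Node x, Node p, Node g) {
    if (x == p.left && p == g.left) {           // zig-zig: rotate around g first,
        rotateRightAbout(g);                    // then around p
        rotateRightAbout(p);
    } else if (x == p.left && p == g.right) {   // zig-zag: rotate around p first,
        rotateRightAbout(p);                    // then around g
        rotateLeftAbout(g);
    }
    // the right-right and right-left cases are mirror images
}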
Consider all the nodes of the tree on the path between x and the root.
After splaying x up to the root, the depth of x is, of course, now 0.
However, the average depth of all the nodes on that root-to-x path is
now halved. Thus, while splaying doesn't necessarily improve the tree
balance, it does move a number of nodes closer to the root.
Examples:
- splaying a leaf in a balanced tree (eg 6 3 8 2 5 7 9)
- splaying a leaf in a degenerate tree (2 3 5 6 7 8 9)
To insert into a splay tree, we first find the node y below which, under ordinary insertion, x would be inserted. We then splay y to the root, at which point we insert x above y. Assume that x would have been inserted to the right of y. This means x > y, but also that x < z for every existing z > y. So in inserting x at the root, we make y its left subnode, and move y's right subtree to the right of x.
Splaying wreaks havoc on iterators. The main problem is that normally accesses are iterator-safe (that is, it is safe to access values in a data structure while an iterator is "in progress"; you just can't insert), but here accesses are not safe.
AVL Trees
These are named for Adelson-Velskii and Landis, from their 1962 paper.
The idea behind AVL trees is that at each node we store a value
representing depth(left_subtree) - depth(right_subtree); we'll call
this the balance factor.
We will then use rotations to keep the balance factor small.
We can compute balance factors easily enough using recursion, but it is
better to cache the value at each node to avoid excessive computational
time. Our
goal is to maintain the balance factor for every node as -1, 0, or 1.
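A minimal sketch of such a node (the field names are assumptions):

class AvlNode {
    int value;
    int balance;          // depth(left) - depth(right); kept in {-1, 0, 1}
    AvlNode left, right;
}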
When we insert a node, we have to do appropriate rotations to maintain the balance factor for every ancestor of the new node (and also be sure that the rotations do not introduce any unbalancing of their own).
As we work up the path from the newly inserted value to the tree root,
we consider the new balance factor of each node. If it is -1, 0, or 1,
we do nothing. If it is -2 or 2, we do rotations.
Let X be the node in question, with left and right subnodes L and R. If the balance factor of X is +2, then the left subtree L is too deep. We know L has a balance factor of -1, 0, or 1, but it matters which. Let the left child of L be LL, and the right child be LR. The tree now looks like this:
        X
       / \
      L   R
     / \
    LL   LR
(Note that LL, LR, and R are entire subtrees, not just nodes. However, we know that their depths are all similar, because of the balance-factor requirement.)
We will eventually do a right rotation about X; however, we might first have to do a left rotation about L. Doing the right rotation about X would leave us with:
        X                        L
       / \                      / \
      L   R         =>        LL   X
     / \                          / \
    LL   LR                     LR   R
If the balance factor of L had been 0 or 1, this is sufficient. Assume for a moment that the balance factor of L is 1, so depth(LL) = depth(LR)+1. Because X has balance factor +2, we know depth(R) = depth(LL)-2. After the rotation, the depth of LL has decreased by 1, the total depth of LR is unchanged, and the total depth of R has increased by 1 to match exactly the depth of LR. So the balance factor of L (the new root) is now 0. If the original balance factor of L had been 0, the post-rotation balance factor of L becomes -1. Either way, it still works.
But if LR is deeper than LL (depth(LL)+1 = depth(LR)), then the new L is unbalanced. So we first do a left rotation about L to move some of LR up:
         X                         X                           LR
        / \                       / \                        /    \
       L   R         =>         LR   R         =>          L        X
      / \                      /  \                       / \      / \
    LL    LR                  L    LRR                  LL  LRL  LRR   R
         /  \                / \
       LRL   LRR           LL   LRL
We argue that LR now has balance factor 0 or 1, and so when we do the right rotation about X as the next step, the previous argument shows that we have achieved AVL balance at the new root (which will be LR). We know that depth(LL) + 1 = max(depth(LRL), depth(LRR)).
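Putting the two cases together for the left-heavy side, in the style of the rotateRight sketch earlier (rotateLeft is its mirror image, and balanceFactor() is an assumed helper reading the cached balance field):

// Rebalance a node x whose balance factor has become +2.
static Node rebalanceLeftHeavy(Node x) {
    if (balanceFactor(x.left) < 0)       // left-right case: LR deeper than LL
        x.left = rotateLeft(x.left);     // preliminary rotation about L
    return rotateRight(x);               // then the rotation about X itself
}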
Example
Start with the tree
         6
       /   \
      3     8
     / \   / \
    2   5 7   9
Now add 10, 11, 12, 13, 14, 15, which I'll do in hex with A B C D E F.
Add A; no rotations are needed:
         6
       /   \
      3     8
     / \   / \
    2   5 7   9
               \
                A
After adding B, below A above, we need to rotate around node 9 above:
         6
       /   \
      3     8
     / \   / \
    2   5 7   A
             / \
            9   B
After adding C, below B above, we will need to rotate around node 8 above:
         6
       /   \
      3     A
     / \   / \
    2   5 8   B
         / \   \
        7   9   C
Now we add D below C, and rotate around B to get
         6
       /   \
      3     A
     / \   / \
    2   5 8   C
         / \ / \
        7  9 B  D
Now we add E below D, and this time the node we rotate around is all the way at the root, 6:
           A
         /   \
        6     C
       / \   / \
      3   8 B   D
     / \ / \     \
    2  5 7  9     E
Actually, this is just a little misleading; all the rebalancing rotations above involved having the deepest subtree to the "right-right" of the pivot node; that is, the right subtree of the right subtree. In that case, a simple rotation (to the left) about the pivot node is all we need. The "left-left" case is similar. However, the right-left (and left-right) cases are not quite as simple: here, we need to do a preliminary rotation around the right subtree first.
Suppose the tree is:

       6
      / \
     3   9
        / \
       8   A
We now insert a 7, making the tree unbalanced (BF = +/- 2) at node 6:
       6
      / \
     3   9
        / \
       8   A
      /
     7
Rotation about 6 alone gives:
9
/ \
6 A
/ \
3 8
/
7
This still has a balance factor of +/- 2! This rotation did not help! However, the AVL rule in this case is to first rotate about the 9 (the parent), and then about the 6. Also note that the rotations are in different directions.
After rotating right about the 9 we get:
       6
      / \
     3   8
        / \
       7   9
            \
             A
Then rotating left about the 6, we get
       8
      / \
     6   9
    / \   \
   3   7   A
This tree has had its balance improved.
Bayer Trees
Mostly these are known as B-trees. Bayer named them that in his paper,
though he did not spell out what the B stood for. B-trees are not binary trees; in fact, you might look at them as evidence that being binary makes life much harder.
B-trees have what I will call a degree, d. (Some books, and Wikipedia, call this degree 2d.) Each interior node (other than the root) holds k values, d <= k <= 2d, and has k+1 children.
The k values a0, ..., a(k-1) divide the data into k+1 categories: x < a0; a(i) < x < a(i+1) for i = 0, ..., k-2; and a(k-1) < x. These categories form the k+1 children; thus, a B-tree is still an ordered tree even if it is not binary. All leaf nodes are at the same depth in the tree.
Examples of how ordered trees with more than one data item per node might look
Bayer's idea is that when a node becomes full, we split it in half, and push the median element up
a level. The pushed-up node may cause a split in the parent as well; we
keep pushing until the process stops or we end up pushing a new root
node.
2,3-tree example (B-tree of degree 1)
B-tree of degree 2 (2-4 values per node)
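As a small worked instance of the 2,3-tree case (degree 1, so 1-2 values per node), here is what inserting 1, 2, 3 looks like:

insert 1:   [1]
insert 2:   [1 2]
insert 3:   the node would hold [1 2 3], which is overfull; split it in half
            and push the median 2 up as a new root:

       [2]
      /   \
    [1]   [3]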
Red-Black Trees
These are binary trees that, like AVL trees, remain reasonably balanced because we do restructuring on each insert. The restructuring runs in time O(log n). Here's the Wikipedia definition:
1. A node is either red or black.
2. The root is black.
3. All leaves are black.
4. Both children of every red node are black.
5. Every simple path from a given node to any of its descendant leaves contains the same number of black nodes.
A black node can have black subnodes. Note that property 4 implies that the parent of a red node is black.
Let the black-height of the tree be the number of black nodes in any
path from the root to a leaf, constant as per property 5. Property 4
ensures that, along any path, the path length is no more than twice the black-height.
B-Tree analogy:
We create a B-tree of order 4 (1-3 values per node, 2-4 children) by moving the red-node values up into their black parents. There's a good picture of this on Wikipedia.
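For instance (a sketch of the idea, not Wikipedia's exact picture), a black node B with two red children A and C,

        B (black)
       /         \
   A (red)     C (red)

collapses into the single B-tree node [A B C], whose four children are the black-rooted subtrees hanging below A and C.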
The B-Tree implementation may be less efficient, however, because the
nodes have more unused space and/or we have to combine the overall tree
search with binary array search of the nodes.