Comp 271 Weeks 14 & 15
Compiler
Using tokens: see class Tokenizer in smachines2.
public enum Tokens { PLUS, MINUS, TIMES, DIV, MOD, ASSIGN, EQUAL, LESS, LESSEQUAL,
GREATER, GREATEREQUAL, ...
DO, WHILE, IF, ELSE, RETURN }
Converting from a String to a Token
Greater simplicity in comparisons (== works on enum values)
switch() statements may not work on Strings (String cases require Java 7 or later), but they always work on enums
construction of static strToTokenMap: a singleton?
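A hedged sketch of how strToTokenMap might be built (the initialization style and method name toToken are assumptions, not the actual course code; the static initializer runs once when the class is loaded, so the map is effectively a singleton):

import java.util.HashMap;
import java.util.Map;

public class Tokenizer {
    private static final Map<String, Tokens> strToTokenMap = new HashMap<>();
    static {
        // runs exactly once, at class-load time
        strToTokenMap.put("+", Tokens.PLUS);
        strToTokenMap.put("-", Tokens.MINUS);
        strToTokenMap.put("*", Tokens.TIMES);
        strToTokenMap.put("/", Tokens.DIV);
        strToTokenMap.put("%", Tokens.MOD);
        strToTokenMap.put("=", Tokens.ASSIGN);
        strToTokenMap.put("==", Tokens.EQUAL);
        strToTokenMap.put("<", Tokens.LESS);
        strToTokenMap.put("<=", Tokens.LESSEQUAL);
        strToTokenMap.put("while", Tokens.WHILE);
        // ... and so on for the remaining tokens
    }

    public static Tokens toToken(String s) {
        return strToTokenMap.get(s);   // null if s is not a known token
    }
}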
CompileExpr2
The strategy is to have compileExpr return an Integer object if the expr is constant (either a final int or a numeric literal).
Problem: 1+n, where n is a variable. The constant's push is deferred, so by the time we discover the expression is not constant after all, the code for n has already been emitted.
One fix is to push the numeric value we want at that point, and then emit a SWAP instruction to get the operands into the right order if the operator is not commutative.
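For example, compiling 5-n this way might emit the following (a sketch in the same stack-machine notation used later in these notes; the SUB opcode is an assumption alongside ADD and MUL):

LOAD n      // the constant 5 was held back; n is compiled first
LOADI 5     // now push the deferred constant: stack holds n, 5
SWAP        // subtraction is not commutative, so reorder to 5, n
SUB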
CompileExpr3
Here we build the actual expression tree.
Note that the expression tree is different from the true parse tree. The original EBNF grammar is
E -> T { addop T }
In "plain" BNF grammar, this is:
E -> T MT
MT -> addop T MT | empty
This builds a somewhat weird tree (draw some examples). However, we can do better:
Define ExprNode, ConstNode, VarNode, UnopNode, BinopNode. Note that the
latter four are subclasses of the abstract class ExprNode. We have one
abstract method in ExprNode: compile() (the only method at all!)
Then modify compileSExpr, compileTerm so that, if both operands are
ConstNodes, then we create a new ConstNode to hold the result, instead of generating lots of code.
Note how the object hierarchy helps us here, and also how we check node types with instanceof.
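A minimal sketch of the hierarchy and the constant-folding check (UnopNode omitted; apply() is an assumed compile-time evaluator, and emit() here just prints, standing in for whatever the real code stream is):

abstract class ExprNode {
    abstract void compile();            // emit stack-machine code for this subtree

    static void emit(String instr) {    // assumed: appends to the code stream;
        System.out.println(instr);      // printed here just to keep the sketch runnable
    }
}

class ConstNode extends ExprNode {
    int value;
    ConstNode(int value) { this.value = value; }
    void compile() { emit("LOADI " + value); }
}

class VarNode extends ExprNode {
    String name;
    VarNode(String name) { this.name = name; }
    void compile() { emit("LOAD " + name); }
}

class BinopNode extends ExprNode {
    ExprNode left, right;
    Tokens op;
    BinopNode(ExprNode left, Tokens op, ExprNode right) {
        this.left = left; this.op = op; this.right = right;
    }
    void compile() {
        left.compile();                 // operands first...
        right.compile();
        emit(op.name());                // ...then the operator (mapping PLUS -> ADD etc. elided)
    }
}

Then, inside compileSExpr/compileTerm, the folding step looks something like:

// fold two constant operands instead of generating code for them
if (left instanceof ConstNode && right instanceof ConstNode)
    return new ConstNode(apply(op, ((ConstNode) left).value,
                                   ((ConstNode) right).value));
return new BinopNode(left, op, right);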
Common subexpressions
Suppose we want to optimize away common subexpressions:
- z1 = x+y; z2 = x+y;
- z = (x+y)*(x+y)
The first example is an optimization among multiple statements; the second is an optimization within a single statement.
How do you detect these?
We construct a Map of subexpressions, making sure that we look for == matches instead of .equals() matches; this means that we cannot use HashMap in the usual way.
We can, however, still use hashing, if we pay careful attention to the hashCode() method, e.g.
BinopNode.hashCode = mixup(left.hashCode(), right.hashCode(), (int) operator);
Each subexpression is first looked up in the hashtable; if we find
an exact match there, we return the hashed entry. Note that we really
want to be using == on the pointers, all the way, not .equals().
Example: (x+y)*(x+y)
1. Enter x, y, BinopNode(x, PLUS, y) in the table.
2. Now do it again
second x: find entry for previous x, and return a duplicate pointer to the same node.
second y: find entry for previous y, and return a duplicate pointer to the same node.
Now we want to see if we can find an instance of (x, PLUS, y). We search our subexpression table for an exact match, using == on all three fields. We find one! Therefore, we do not create a new BinOpNode; we just return a reference to the existing cell.
Technically, we want to do the lookup before actually creating the node.
One way to achieve this is to modify the node constructors, but that can be massively confusing.
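One alternative is a static factory method that does the lookup before construction. A hedged sketch (mixup(), the bucket structure, and the name makeBinop are all assumptions; because subnodes are already shared, comparing the fields with == suffices):

import java.util.*;

static final Map<Integer, List<BinopNode>> exprTable = new HashMap<>();

static BinopNode makeBinop(ExprNode left, Tokens op, ExprNode right) {
    // subnodes are already hash-consed, so identity-based hashes are stable
    int h = mixup(System.identityHashCode(left),
                  System.identityHashCode(right), op.ordinal());
    List<BinopNode> bucket = exprTable.computeIfAbsent(h, k -> new ArrayList<>());
    for (BinopNode n : bucket)
        if (n.left == left && n.op == op && n.right == right)
            return n;                   // exact == match on all three fields: reuse it
    BinopNode n = new BinopNode(left, op, right);
    bucket.add(n);
    return n;
}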
Now we have to generate code. The first time we come to a common subexpression, we evaluate it and create a new storage location to hold the result. That is, we ALLOC a new variable temp on the stack frame before compiling the expression. (Note that if we try this after starting to compile the expression, we have already emitted code that pushes things onto the stack, and we will muddle things up.) That is, the code for
z = (x+y)*(x+y)
compiles as if it were
temp = x+y;
z = temp * temp;
Rather than reloading temp for the first use, though, we would probably do something like this, say, for (x+y)*3*(x+y) (the intervening 3 makes a superficially obvious optimization less obvious):
LOAD x
LOAD y
ADD
DUP          // now two copies of x+y are on the stack
STORF temp   // pops one copy
LOADI 3
MUL
LOADF temp
MUL
Tree Balancing
Tree Rotations
Right rotation:

        A                        B
       / \                      / \
      B   T3        =>        T1   A
     / \                          / \
    T1  T2                      T2   T3
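In code, a right rotation is just a few pointer swaps. A minimal sketch (Node is an assumed BST node class, not the course's actual one):

class Node {
    int value;
    Node left, right;
}

// Rotate right about a: a's left child b becomes the new subtree root.
static Node rotateRight(Node a) {
    Node b = a.left;
    a.left = b.right;       // subtree T2 moves across
    b.right = a;
    return b;               // caller links b in where a used to be
}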
Splay Trees
Splay trees are binary search trees where, on every access, we move the value in question (if found) to the root, in an operation known as splaying.
Note that accesses are now mutation operations! The idea is that
frequently accessed values will gravitate over time towards the root.
Splay Trees modify the rotate-to-root strategy with the addition of "grandfather" rotations. Rotations occur in pairs (except possibly the last, if x ends up a child of the root). Let x be a node, p its parent, and g its grandparent. If x is the left subnode of p and p is the right subnode of g (or x is right and p is left), we rotate around p and then g as in the rotate-to-root method. This is sometimes called the zig-zag step, as the two rotations are in different directions. However, if x is left and p is left, we rotate first around g and then around p (note the reversed order!), in a step sometimes known as "zig-zig".
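A hedged sketch of one double-rotation step, using the Node class from the rotation sketch above (rotateRightAbout/rotateLeftAbout are assumed helpers that also re-link the rotated subtree into its parent; only two of the four cases are shown):

static void splayStep(Node x, Node p, Node g) {
    if (x == p.left && p == g.left) {           // zig-zig: rotate around g first,
        rotateRightAbout(g);                    // then around p
        rotateRightAbout(p);
    } else if (x == p.left && p == g.right) {   // zig-zag: rotate around p first,
        rotateRightAbout(p);                    // then around g
        rotateLeftAbout(g);
    }
    // the right-right and right-left cases are mirror images
}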
Consider all the nodes of the tree on the path between x and the root.
After splaying x up to the root, the depth of x is, of course, now 0.
However, the average depth of all the nodes on that root-to-x path is
now halved. Thus, while splaying doesn't necessarily improve the tree
balance, it does move a number of nodes closer to the root.
Examples:
- splaying a leaf in a balanced tree (eg 6 3 8 2 5 7 9)
- splaying a leaf in a degenerate tree (2 3 5 6 7 8 9)
To insert into a splay tree, we first find the node y below which, under ordinary insertion, x would be inserted. We then splay y to the root, at which point we insert x above y. Assume that x would have been inserted to the right of y. This means x > y, but also that x < z for every existing z > y. So in inserting x at the root, we make y its left subnode, and move y's right subtree to the right of x.
Splaying wreaks havoc on iterators. The main problem is that normally accesses are iterator-safe (that is, it is safe to access values in a data structure while an iterator is "in progress"; you just can't insert), but here accesses are not safe.
AVL Trees
These are named for Adelson-Velskii and Landis, from their 1962 paper.
The idea behind AVL trees is that at each node we store a value
representing depth(left_subtree) - depth(right_subtree); we'll call
this the balance factor.
We will then use rotations to keep the balance factor small.
We can compute balance factors easily enough using recursion, but it is
better to cache the value at each node to avoid excessive computational
time. Our
goal is to maintain the balance factor for every node as -1, 0, or 1.
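A minimal sketch of such a node (the field names are assumptions):

class AvlNode {
    int value;
    int balance;          // depth(left) - depth(right); kept in {-1, 0, 1}
    AvlNode left, right;
}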
When we insert a node, we have to do appropriate rotations to maintain the balance factor for every ancestor of the new node (and also be sure that the rotations do not introduce any unbalancing of their own).
As we work up the path from the newly inserted value to the tree root,
we consider the new balance factor of each node. If it is -1, 0, or 1,
we do nothing. If it is -2 or 2, we do rotations.
Let X be the node in question, with left and right subnodes L and R. If the balance factor of X is +2, then the left subtree L is too deep. We know L has a balance factor of -1, 0, or 1, but it matters which. Let the left child of L be LL, and the right child be LR. The tree now looks like this:
        X
       / \
      L   R
     / \
    LL   LR
(Note that LL, LR, and R are entire subtrees, not just nodes. However, we know that their depths are all similar, because of the balance-factor requirement.)
We will eventually do a right rotation about X; however, we might first have to do a left rotation about L. Doing the right rotation about X would leave us with:
        X                        L
       / \                      / \
      L   R         =>        LL   X
     / \                          / \
    LL   LR                     LR   R
If the balance factor of L had been 0 or 1, this is sufficient. Assume for a moment that the balance factor of L is 1, so depth(LL) = depth(LR)+1. Because X has balance factor +2, we know depth(R) = depth(LL)-2. After the rotation, the depth of LL has decreased by 1, the total depth of LR is unchanged, and the total depth of R has increased by 1 to match exactly the depth of LR. So the balance factor of L (the new root) is now 0. If the original balance factor of L had been 0, the post-rotation balance factor of L becomes -1. Either way, it still works.
But if LR is deeper than LL (depth(LL)+1 = depth(LR)), then the new L is unbalanced. So we first do a left rotation about L to move some of LR up:
         X                         X                           LR
        / \                       / \                        /    \
       L   R         =>         LR   R         =>          L        X
      / \                      /  \                       / \      / \
    LL    LR                  L    LRR                  LL  LRL  LRR   R
         /  \                / \
       LRL   LRR           LL   LRL
We argue that LR now has balance factor 0 or 1, and so when we do the right rotation about X as the next step, the previous argument shows that we have achieved AVL balance at the new root (which will be LR). We know that depth(LL) + 1 = max(depth(LRL), depth(LRR)).
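Putting the two cases together for the left-heavy side, in the style of the rotateRight sketch earlier (rotateLeft is its mirror image, and balanceFactor() is an assumed helper reading the cached balance field):

// Rebalance a node x whose balance factor has become +2.
static Node rebalanceLeftHeavy(Node x) {
    if (balanceFactor(x.left) < 0)       // left-right case: LR deeper than LL
        x.left = rotateLeft(x.left);     // preliminary rotation about L
    return rotateRight(x);               // then the rotation about X itself
}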
Example
Start with the tree
         6
       /   \
      3     8
     / \   / \
    2   5 7   9
Now add 10, 11, 12, 13, 14, 15, which I'll do in hex with A B C D E F.
Add A; no rotations are needed:
         6
       /   \
      3     8
     / \   / \
    2   5 7   9
               \
                A
After adding B, below A above, we need to rotate around node 9 above:
         6
       /   \
      3     8
     / \   / \
    2   5 7   A
             / \
            9   B
After adding C, below B above, we will need to rotate around node 8 above:
         6
       /   \
      3     A
     / \   / \
    2   5 8   B
         / \   \
        7   9   C
Now we add D below C, and rotate around B to get
         6
       /   \
      3     A
     / \   / \
    2   5 8   C
         / \ / \
        7  9 B  D
Now we add E below D, and this time the node we rotate around is all the way at the root, 6:
           A
         /   \
        6     C
       / \   / \
      3   8 B   D
     / \ / \     \
    2  5 7  9     E
Actually, this is just a little misleading; all the rebalancing rotations above involved having the deepest subtree to the "right-right" of the pivot node; that is, the right subtree of the right subtree. In that case, a simple rotation (to the left) about the pivot node is all we need. The "left-left" case is similar. However, the right-left (and left-right) cases are not quite as simple: here, we need to do a preliminary rotation around the right subtree first.
Suppose the tree is:

       6
      / \
     3   9
        / \
       8   A
We now insert a 7, making the tree unbalanced (BF = +/- 2) at node 6:
       6
      / \
     3   9
        / \
       8   A
      /
     7
Rotation about 6 alone gives:
9
/ \
6 A
/ \
3 8
/
7
This still has a balance factor of +/- 2! This rotation did not help! However, the AVL rule in this case is to first rotate about the 9 (the parent), and then about the 6. Also note that the rotations are in different directions.
After rotating right about the 9 we get:
       6
      / \
     3   8
        / \
       7   9
            \
             A
Then rotating left about the 6, we get
       8
      / \
     6   9
    / \   \
   3   7   A
This tree has had its balance improved.
Bayer Trees
Mostly these are known as B-trees. Bayer named them that in his paper,
though he did not spell out what the B stood for. B-trees are not binary trees; in fact, you might look at them as evidence that being binary makes life much harder.
B-trees have what I will call a degree, d. (Some books, and Wikipedia, call this degree 2d.) Each interior node (other than the root) holds k values, d <= k <= 2d, and has k+1 children.
The k values a0, ..., a(k-1) divide the data into k+1 categories: x < a0; a(i) < x < a(i+1) for i = 0, ..., k-2; and a(k-1) < x. These categories form the k+1 children; thus, a B-tree is still an ordered tree even if it is not binary. All leaf nodes are at the same depth in the tree.
Examples of how ordered trees with more than one data item per node might look
Bayer's idea is that when a node becomes full, we split it in half, and push the median element up
a level. The pushed-up node may cause a split in the parent as well; we
keep pushing until the process stops or we end up pushing a new root
node.
2,3-tree example (B-tree of degree 1)
B-tree of degree 2 (2-4 values per node)
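As a small worked instance of the 2,3-tree case (degree 1, so 1-2 values per node), here is what inserting 1, 2, 3 looks like:

insert 1:   [1]
insert 2:   [1 2]
insert 3:   the node would hold [1 2 3], which is overfull; split it in half
            and push the median 2 up as a new root:

       [2]
      /   \
    [1]   [3]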
Red-Black Trees
These are binary trees that, like AVL trees, remain reasonably balanced because we do restructuring on each insert. The restructuring runs in time O(log n). Here's the Wikipedia definition:
1. A node is either red or black.
2. The root is black.
3. All leaves are black.
4. Both children of every red node are black.
5. Every simple path from a given node to any of its descendant leaves contains the same number of black nodes.
A black node can have black subnodes. Note that property 4 implies that the parent of a red node is black.
Let the black-height of the tree be the number of black nodes in any
path from the root to a leaf, constant as per property 5. Property 4
ensures that, along any path, the path length is no more than twice the black-height.
B-Tree analogy:
We create a B-tree of order 4 (1-3 values per node, 2-4 children) by moving the red-node values up into their black parents. There's a good picture of this on Wikipedia.
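For instance (a sketch of the idea, not Wikipedia's exact picture), a black node B with two red children A and C,

        B (black)
       /         \
   A (red)     C (red)

collapses into the single B-tree node [A B C], whose four children are the black-rooted subtrees hanging below A and C.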
The B-Tree implementation may be less efficient, however, because the
nodes have more unused space and/or we have to combine the overall tree
search with binary array search of the nodes.