Trees

In recursion.html#exprtrees we looked at some simple tree structures used for expressing "hierarchical" information, such as expressions:

2*(3+5)

A tree consists of nodes, each of which has zero or more subnodes.

A binary tree is one in which each node has up to two subnodes.

Binary tree nodes thus contain:

data
a left subtree
a right subtree

The code might look like this (from bintree.cs):

    class treenode<T> {
        private T data_;
        private treenode<T> left_, right_;
        public treenode(T d, treenode<T> l, treenode<T> r) {data_ = d; left_ = l; right_ = r;}
        public T data() {return data_;}
        public treenode<T> left()  {return left_;}
        public treenode<T> right() {return right_;}
    }

A binary tree is ordered if there is an ordering relation on the data and, for each node, every data value appearing in the left subtree is less than the node's own data, and every data value appearing in the right subtree is greater than the node's own data.

Quite often, eg for expression trees, no ordering is involved.

The root node is at the top of the tree. Nodes with no subtrees (ie in which the subtree fields are null) are leaf nodes. Other nodes are interior nodes.

The depth of a node is how far it is, in node-to-node links, from the root. Level n of a tree consists of all nodes with depth n.

Examples:

Abstract expression trees
Parse trees
Ancestry trees
Decision trees (Bailey p 288)

None of these is ordered!

Tree Traversal

To traverse a tree is to visit each node, perhaps printing out the data values or perhaps taking some other action. Traversal methods are usually recursive, and usually the initial call is on the tree's root.

Three common forms of traversal are preorder, inorder, and postorder, depending on whether a node's data is visited before, between, or after the visits to the left and right subtrees. All of these are considered to be depth-first traversals.

Demo:

traverse the expression tree for 4+x*(y+1)

inorder

postorder (cf Jan Łukasiewicz)

traverse the following tree from inttree.cs:

            6
          /   \
         4     8
        / \   / \
       1   5 7   37

We can also do breadth-first traversal, by going along levels: 6, 4, 8, 1, 5, 7, 37.
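A breadth-first traversal can be sketched with a FIFO queue: dequeue a node, visit it, and enqueue its children. The Node class and the names BreadthFirst, Traverse and SampleTree below are hypothetical, chosen for this sketch; they are not taken from inttree.cs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal node type for this sketch (not the class from inttree.cs).
class Node {
    public int Data; public Node Left, Right;
    public Node(int d, Node l, Node r) { Data = d; Left = l; Right = r; }
}

static class BreadthFirst {
    // Visit nodes level by level, left to right, using a FIFO queue.
    public static List<int> Traverse(Node root) {
        var result = new List<int>();
        if (root == null) return result;
        var q = new Queue<Node>();
        q.Enqueue(root);
        while (q.Count > 0) {
            Node t = q.Dequeue();
            result.Add(t.Data);
            // Children join the back of the queue, so a whole level
            // is finished before the next level starts.
            if (t.Left != null) q.Enqueue(t.Left);
            if (t.Right != null) q.Enqueue(t.Right);
        }
        return result;
    }

    // The seven-node tree drawn above.
    public static Node SampleTree() {
        return new Node(6,
            new Node(4, new Node(1, null, null), new Node(5, null, null)),
            new Node(8, new Node(7, null, null), new Node(37, null, null)));
    }

    static void Main() {
        Console.WriteLine(string.Join(", ", Traverse(SampleTree())));  // 6, 4, 8, 1, 5, 7, 37
    }
}
```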

Example code for inorder traversal:
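What follows is a minimal sketch rather than the code from inttree.cs: it assumes a hypothetical Node class with public fields, and collects values into a list instead of printing them. The preorder and postorder versions are included for comparison; only the position of the "visit" step changes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal node type for this sketch.
class Node {
    public int Data; public Node Left, Right;
    public Node(int d, Node l, Node r) { Data = d; Left = l; Right = r; }
}

static class DepthFirst {
    // Inorder: left subtree, then the node, then the right subtree.
    public static void InOrder(Node t, List<int> output) {
        if (t == null) return;
        InOrder(t.Left, output);
        output.Add(t.Data);
        InOrder(t.Right, output);
    }

    // Preorder: the node first, then both subtrees.
    public static void PreOrder(Node t, List<int> output) {
        if (t == null) return;
        output.Add(t.Data);
        PreOrder(t.Left, output);
        PreOrder(t.Right, output);
    }

    // Postorder: both subtrees first, then the node.
    public static void PostOrder(Node t, List<int> output) {
        if (t == null) return;
        PostOrder(t.Left, output);
        PostOrder(t.Right, output);
        output.Add(t.Data);
    }

    // The seven-node tree from the demo above.
    public static Node SampleTree() {
        return new Node(6,
            new Node(4, new Node(1, null, null), new Node(5, null, null)),
            new Node(8, new Node(7, null, null), new Node(37, null, null)));
    }

    static void Main() {
        var output = new List<int>();
        InOrder(SampleTree(), output);
        Console.WriteLine(string.Join(" ", output));  // 1 4 5 6 7 8 37
    }
}
```

Because the sample tree is ordered, the inorder traversal produces the values in sorted order.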

Tree Building

How do we build trees? One way is from the bottom up; that is much easier if the class treenode is public. We could do this (assuming treenode is equivalent to treenode<string>):

    treenode n1 = new treenode("a", null, null);
    treenode n2 = new treenode("3", null, null);
    treenode n3 = new treenode("*", n1, n2);
    treenode n4 = new treenode("1", null, null);
    treenode root = new treenode("+", n3, n4);

What does this build?

We can build ordered trees by inserting in order; see inttree.cs:

    public void insert(int val) {
	if (root_ == null) {
		root_ = new treenode(val, null, null);
		return;
	}
	rinsert(val, root_);
    }

    // recursive insert: does not get called with t==null
    private void rinsert(int val, treenode t) {
	if (val == t.data() ) return;	// should we do something else?
	if (val < t.data()) {
	    if (t.left() == null) {
		t.setleft(new treenode(val, null, null));
	    } else {
		rinsert(val, t.left());
	    }
	} else { // val > t.data()
	    if (t.right() == null) {
		t.setright(new treenode(val, null, null));
	    } else {
		rinsert(val, t.right());
	    }
	}
    }

If a treenode is null, we cannot modify it; we can only modify the parent. Bailey addresses this in some contexts by including a pointer to the parent node, but that won't help us here.

Note that the tree we get by inserting values depends on the order of insertion.

In-class demos:

try inserting the values in order
what orders lead to the same tree?
convert to preorder traversal

Lab 4: use this idea to implement a StrList that has O(log(n)) times for both search and insert.

n-ary trees and ordered n-ary trees

A tree doesn't have to be binary to be ordered. But the number of data elements at each node must be one less than the number of subtrees.

An n-ary tree has, for each node, a list of subnodes:

class Treenode<T> {
    private T data;
    private List<Treenode<T>> children;
    ...
}

Different nodes of a tree can have different numbers of subnodes.

An ordered n-ary tree means that if a node has n child nodes it also has a List of n-1 data values ⟨d₀,..., d_n-2⟩, and all data values within child node k lie between d_k-1 and d_k (values in child 0 are below d₀, and values in child n-1 are above d_n-2). That is, the ⟨d₀,..., d_n-2⟩ divide the data into n intervals, and the subtrees correspond to these intervals.

Consider:

                          37
                     /          \
          (23 34)                  (43 56 71 88)
         /   |   \               /   |   |   |   \
       16    29    35          39    47  60  75   103

html documents as a data structure (Document Object Model, or DOM)

The structure is basically that of an n-ary tree

Recursion on trees

Let us suppose we have an inttree. We want one of the following:

The depth of the tree
The maximum value in the tree
The sum of all the numbers in the tree
Whether 7 is in the tree

The last one can be done efficiently (O(log N)) if we know the tree is ordered and reasonably balanced. Otherwise, it is the same as the others: we proceed recursively, checking the left and right subtrees and then the root node.

private static int sum(treenode t) {
    if (t==null) return 0;
    else return sum(t.left()) + t.data() + sum(t.right());
}
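The other three questions follow the same recursive pattern as sum(). The sketch below assumes a hypothetical Node class with public fields (not the treenode from inttree.cs), and it adopts the convention that the empty tree has depth -1, so that a tree consisting of a lone root has depth 0, matching the definition of node depth given earlier.

```csharp
using System;

// Hypothetical minimal node type; inttree.cs may differ.
class Node {
    public int Data; public Node Left, Right;
    public Node(int d, Node l = null, Node r = null) { Data = d; Left = l; Right = r; }
}

static class TreeRecursion {
    // Depth of the deepest node; the empty tree gets -1 (an assumed convention).
    public static int Depth(Node t) {
        if (t == null) return -1;
        return 1 + Math.Max(Depth(t.Left), Depth(t.Right));
    }

    // Largest value in a non-empty tree; makes no ordering assumption.
    public static int Max(Node t) {
        if (t == null) throw new ArgumentNullException(nameof(t));
        int m = t.Data;
        if (t.Left != null) m = Math.Max(m, Max(t.Left));
        if (t.Right != null) m = Math.Max(m, Max(t.Right));
        return m;
    }

    // Search without assuming order: may have to visit every node, O(N).
    public static bool Contains(Node t, int val) {
        if (t == null) return false;
        return t.Data == val || Contains(t.Left, val) || Contains(t.Right, val);
    }

    // Search in an ordered tree: follows a single root-to-leaf path.
    public static bool ContainsOrdered(Node t, int val) {
        if (t == null) return false;
        if (val == t.Data) return true;
        return val < t.Data ? ContainsOrdered(t.Left, val)
                            : ContainsOrdered(t.Right, val);
    }

    public static Node Sample() {
        return new Node(6,
            new Node(4, new Node(1), new Node(5)),
            new Node(8, new Node(7), new Node(37)));
    }

    static void Main() {
        Node t = Sample();
        Console.WriteLine(Depth(t));               // 2
        Console.WriteLine(Max(t));                 // 37
        Console.WriteLine(ContainsOrdered(t, 7));  // True
    }
}
```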

Traversing Trees

What if we want to implement traversing a tree until we find a particular value, n? The interface method might be

    public string Get(int n) {
        if (n>=count_) return null;
        getstr_ = null;
        get(n,root_);
        return getstr_;
    }

Now the recursive helper method get(n, treenode p) must be defined. How would we do this?
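One possible answer, offered as a sketch rather than as the course's solution: do an inorder traversal, count the nodes visited so far, and stop once the count reaches n. This reads n as a zero-based position in sorted order. The class StrTree, its nested Node class, the Insert method and the counter field seen_ are all assumptions made for this sketch; only Get, get, root_, count_ and getstr_ come from the fragment above.

```csharp
using System;

class StrTree {
    // Hypothetical node type for this sketch.
    class Node {
        public string Data; public Node Left, Right;
        public Node(string d) { Data = d; }
    }

    Node root_;
    int count_;        // number of values stored
    string getstr_;    // result slot filled in by get()
    int seen_;         // hypothetical: nodes visited so far in the current Get()

    // Ordinary ordered insertion (iterative); duplicates are ignored.
    public void Insert(string s) {
        if (root_ == null) { root_ = new Node(s); count_ = 1; return; }
        Node p = root_;
        while (true) {
            int c = string.CompareOrdinal(s, p.Data);
            if (c == 0) return;
            if (c < 0) {
                if (p.Left == null) { p.Left = new Node(s); count_++; return; }
                p = p.Left;
            } else {
                if (p.Right == null) { p.Right = new Node(s); count_++; return; }
                p = p.Right;
            }
        }
    }

    public string Get(int n) {
        if (n < 0 || n >= count_) return null;
        getstr_ = null;
        seen_ = 0;
        get(n, root_);
        return getstr_;
    }

    // Inorder traversal that stops as soon as the n-th value has been found.
    void get(int n, Node p) {
        if (p == null || getstr_ != null) return;
        get(n, p.Left);
        if (getstr_ != null) return;          // found somewhere in the left subtree
        if (seen_ == n) { getstr_ = p.Data; return; }
        seen_++;
        get(n, p.Right);
    }

    static void Main() {
        var t = new StrTree();
        foreach (string s in new[] { "m", "c", "x", "a" }) t.Insert(s);
        Console.WriteLine(t.Get(2));  // m
    }
}
```

This takes O(n) steps per call even in a balanced tree; storing subtree sizes in each node would allow a faster lookup, at the cost of more bookkeeping on insert.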

In traversers.cs, I have implemented several different ways to traverse a tree.

1. inTraverse(): recursive traversal, now with a depth parameter

2. MyEnumerator: using IEnumerable and yield return, with recursion. This makes use of the idea of coroutines. There is a demo in demoEnumerator(int n).
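The recursive yield return idea can be sketched as follows. This is not the MyEnumerator from traversers.cs; the Node class and the names Enumerators, InOrder and Sample are hypothetical. Each call to InOrder is suspended at its yield return until the consumer asks for the next value, which is the coroutine-like behavior.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal node type for this sketch.
class Node {
    public int Data; public Node Left, Right;
    public Node(int d, Node l = null, Node r = null) { Data = d; Left = l; Right = r; }
}

static class Enumerators {
    // Lazily produces the values of the tree in inorder.
    public static IEnumerable<int> InOrder(Node t) {
        if (t == null) yield break;
        foreach (int v in InOrder(t.Left)) yield return v;
        yield return t.Data;
        foreach (int v in InOrder(t.Right)) yield return v;
    }

    public static Node Sample() {
        return new Node(6,
            new Node(4, new Node(1), new Node(5)),
            new Node(8, new Node(7), new Node(37)));
    }

    static void Main() {
        // The consumer can stop early; nodes after the break are never visited.
        foreach (int v in InOrder(Sample())) {
            if (v > 6) break;
            Console.Write(v + " ");
        }
        Console.WriteLine();
    }
}
```

One design caveat: each value passes through one foreach per level of the tree, so enumerating a deep tree this way costs more than a plain recursive traversal.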

3. inTraverseStack(): stack-based iterative traversal, using a "trick"

A slightly more natural version of the stack-based iteration is the following; however, it requires being able to push both treenodes and data as separate entities.

4. inTraverseStackAlt()

    s.pushNode(root_);
    while (!s.isEmpty()) {
        Object x = s.pop();
        if (x is treenode) {
            treenode t = x as treenode;
            if (t.right() != null) s.pushNode(t.right());
            s.pushData(t.data());
            if (t.left() != null) s.pushNode(t.left());
        } else {
            string str = x as string;
            Console.WriteLine(str);
        }
    }

Next we replace the Stack with a Queue. (Normally the queue operations are not called push() and pop() but rather enqueue() and dequeue(), but I wanted a very simple drop-in replacement for a stack.)

5. inTraverseQueue
6. inTraverseQueueAlt

In what order do these visit the nodes of the tree? (#5 is a little broken; the "real" version is #6).

As a related question, why are there so few actual problems that are solved with Queues?

Then we return to the iterator approach, this time using the stack, with

7. MyEnumerator2

The final mechanism uses a stack and a simple iterator-like interface:

8. NextSetup()/Next():

Tree Balancing

The problem with ordered trees is that the worst-case depth is O(N), rather than O(log N). We would like to make sure that the tree remains at least somewhat balanced as nodes are added (remaining balanced as nodes are deleted is a related problem, which we will not consider). Some algorithms in widespread use are AVL trees, Red-Black trees, Bayer trees (B-trees), and Splay trees.

Tree Rotations

Tree rotations are a simple reorganization of a tree or subtree. The idea is to consider a (subtree) root node A and one child B (say the left child); the child B moves up to the root and the former root A moves down. The root node A may be a subnode of a larger tree; since none of the values in the A-subtree change, any such larger tree would remain ordered. The transformation shown here, from the left to the right diagram, is a right rotation about A (the topmost node A moves down to the right); an example of a left rotation (about B) would be the reverse.

Right rotation:

          A                         B
       /    \                    /    \
      B      T3         =>     T1      A
    / \                             / \
   T1 T2                          T2    T3

This is legal because we have T1 ≤ B ≤ T2 ≤ A ≤ T3 (where T1,T2,T3 represent all values in those subtrees)

It is straightforward to write code to implement a rotation. The ordered node data type is D. Note that we swap the data in the nodes occupied by A and B; we have to preserve the node occupied originally by A because nodes above may point to it. For that reason the method returns nothing: the caller's reference to the top node remains valid.

void rotateRight(TreeNode A) {
    if (A==null) return;
    TreeNode B = A.left();
    if (B==null) return;    // no left child to rotate up
    TreeNode T1 = B.left(), T2 = B.right(), T3 = A.right();

    // swap the data, so the node at the top of the subtree stays where it is
    D temp = B.data();
    B.setData(A.data());
    A.setData(temp);

    B.setLeft(T2);       // this is now the node that holds A's data
    B.setRight(T3);
    A.setLeft(T1);       // this is now the node that holds B's data
    A.setRight(B);
}

Compare Bailey, p 355.
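The reverse operation, a left rotation, can be sketched in the same data-swapping style. The Node class with public fields and the names Rotations, RotateLeft and Sample are assumptions for this sketch, and int data stands in for the generic type D.

```csharp
using System;

// Hypothetical minimal node type; int stands in for the data type D.
class Node {
    public int Data; public Node Left, Right;
    public Node(int d, Node l = null, Node r = null) { Data = d; Left = l; Right = r; }
}

static class Rotations {
    //      B                  A
    //    /   \              /   \
    //   T1    A     =>     B     T3
    //        / \          / \
    //       T2  T3       T1  T2
    // The node object at the top stays in place; only its data changes.
    public static void RotateLeft(Node top) {
        if (top == null) return;
        Node child = top.Right;
        if (child == null) return;           // no right child to rotate up
        Node T1 = top.Left, T2 = child.Left, T3 = child.Right;

        int temp = child.Data;               // swap the two data values
        child.Data = top.Data;
        top.Data = temp;

        child.Left = T1;                     // child now holds the old top's data
        child.Right = T2;
        top.Left = child;                    // top now holds the old child's data
        top.Right = T3;
    }

    public static Node Sample() {
        return new Node(2, new Node(1), new Node(4, new Node(3), new Node(5)));
    }

    static void Main() {
        Node top = Sample();
        RotateLeft(top);
        Console.WriteLine(top.Data);         // 4
        Console.WriteLine(top.Left.Data);    // 2
    }
}
```

In the sample, the inorder sequence 1 2 3 4 5 is the same before and after the rotation, which is what keeps the tree ordered.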

Splay Trees

Splay trees are binary search trees where, on every access, we move the value in question (if found) to the root, in an operation known as splaying. Note that accesses are now mutation operations! The idea is that frequently accessed values will gravitate over time towards the root.

A predecessor to splay trees was rotate-to-root, in which we can move a node up to the root through a series of rotations, each time rotating the node with its (new) parent. This does succeed in bringing a given value to the root.

Splay Trees modify the rotate-to-root strategy with the addition of "grandfather" rotations. Rotations occur in pairs (except the last). Let x be a node, p the parent, and g the grandparent. If x is the left subnode of p and p is the right subnode of g (or x is right and p is left), we rotate around p and then g as in the rotate-to-root method. This is sometimes called the zig-zag step, as the two rotations are in different directions. However, if x is left and p is left, we rotate first around g and then around p (note the reversed order!), in a step sometimes known as "zig-zig".

Consider all the nodes of the tree on the path between x and the root. After splaying x up to the root, the depth of x is, of course, now 0. However, the average depth of all the nodes on that root-to-x path is now roughly halved. Thus, while splaying doesn't necessarily improve the tree balance, it does move a number of nodes closer to the root.

Examples:

splaying a leaf in a balanced tree (eg 6 3 8 2 5 7 9)
splaying a leaf in a degenerate tree (2 3 5 6 7 8 9)

To insert a value x into a splay tree, we first find the node y which, under ordinary insertion, x would be inserted immediately below. Then y is splayed to the root, at which time we insert x above y. Assume that x would have been inserted to the right of y. This means x>y, but also that x<z for all z>y. In inserting x at the root, we make y its left subnode, and move y's right subtree to the right of x.

Splaying wreaks havoc on iterators. The main problem is that normally accesses are iterator-safe (that is, it is safe to access values in a data structure while an iterator is "in progress"; you just can't insert), but here accesses are not safe.

AVL Trees

These are named for Adelson-Velskii and Landis, from their 1962 paper. The idea behind AVL trees is that at each node we store a value representing depth(left_subtree) - depth(right_subtree); we'll call this the balance factor. We will then use rotations to keep the balance factor small.

We can compute balance factors easily enough using recursion, but it is better to cache the value at each node to avoid excessive computational time. Our goal is to maintain the balance factor for every node as -1, 0, or 1. When we insert a node, we have to do appropriate rotations to maintain the balance factor for every ancestor to the new node (and also be sure that the rotations do not introduce any unbalancing of their own).

As we work up the path from the newly inserted value to the tree root, we consider the new balance factor of each node. If it is -1, 0, or 1, we do nothing. If it is -2 or 2, we do rotations.
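The uncached, recursive computation of the balance factor can be sketched as below. The Node class and the names Avl, Height, BalanceFactor, IsAvl and Unbalanced are hypothetical. Here the depth of a subtree is taken to be its number of levels, so the empty subtree has depth 0; this convention is an assumption, though any consistent convention gives the same balance factors.

```csharp
using System;

// Hypothetical minimal node type for this sketch.
class Node {
    public int Data; public Node Left, Right;
    public Node(int d, Node l = null, Node r = null) { Data = d; Left = l; Right = r; }
}

static class Avl {
    // Number of levels in the subtree; 0 for the empty subtree.
    public static int Height(Node t) {
        if (t == null) return 0;
        return 1 + Math.Max(Height(t.Left), Height(t.Right));
    }

    // depth(left_subtree) - depth(right_subtree)
    public static int BalanceFactor(Node t) {
        if (t == null) return 0;
        return Height(t.Left) - Height(t.Right);
    }

    // True if every node has balance factor -1, 0 or 1.
    // Recomputes heights at every node, which is why a real AVL tree
    // caches the balance factor (or the height) in the node instead.
    public static bool IsAvl(Node t) {
        if (t == null) return true;
        return Math.Abs(BalanceFactor(t)) <= 1 && IsAvl(t.Left) && IsAvl(t.Right);
    }

    // 6 with left child 3 and right child 9; 9 has children 8 and 10; 8 has left child 7.
    public static Node Unbalanced() {
        return new Node(6, new Node(3),
            new Node(9, new Node(8, new Node(7)), new Node(10)));
    }

    static void Main() {
        Node t = Unbalanced();
        Console.WriteLine(BalanceFactor(t));        // -2
        Console.WriteLine(BalanceFactor(t.Right));  // 1
        Console.WriteLine(IsAvl(t));                // False
    }
}
```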

Let X be the node in question, with right and left subnodes R and L. If the balance factor of X is +2, then the left subtree L is too deep. We know L has a balance factor of -1, 0, or 1, but it matters which. Let the left child of L be LL, and the right child be LR. The tree now looks like this:

          X
       /    \
      L      R
    / \
   LL LR

(Note that LL, LR, and R are entire subtrees, not just nodes. However, we know that their depths are all similar, because of the balance-factor requirement.)

We will eventually do a right rotation about X; however we might first have to do a left rotation about L. Doing the right rotation about X would leave us with:

          X                         L
       /    \                    /    \
      L      R         =>      LL      X
    / \                             / \
   LL LR                          LR    R

If the balance factor of L had been 0 or 1, this is sufficient. Assume for a moment that the balance factor of L is 1, so depth(LL) = depth(LR)+1, where the depth of a subtree here means the total depth, measured from the top of the diagram, of its deepest node. Because X has balance-factor +2, we know depth(R) = depth(LL)-2. After the rotation, the depth of LL has decremented by 1, the total depth of LR is unchanged, and the total depth of R has incremented by 1 to match exactly the depth of LR. So the balance factor of L (the new root) is now 0. If the original balance factor of L had been 0, the post-rotation balance factor of L becomes -1. Either way, it still works.

But if LR is deeper than LL (depth(LL)+1 = depth(LR)), then the new L is unbalanced. So we first do a left rotation about L to move some of LR up:

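          X                         X                          LR
       /    \                    /    \                     /      \
      L      R         =>      LR      R        =>        L          X
    / \                      / \                        / \        / \
   LL LR                   L    LRR                   LL  LRL   LRR   R
      / \                / \
    LRL LRR            LL   LRL

After the left rotation about L, the subtree now rooted at LR may have balance factor +1 or even +2, so the previous argument does not apply word for word; instead we check the final tree directly. Write h for the height (number of levels) of LL. Because LR was one level deeper than LL, and X had balance factor +2, we know that
    max(height(LRL), height(LRR)) = h   and   height(R) = h
and, because LR was itself AVL-balanced, LRL and LRR each have height h or h-1. In the final tree, L has subtrees LL and LRL, and X has subtrees LRR and R. So L and X each have balance factor -1, 0, or 1, both have height h+1, and the new root LR has balance factor 0.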

Example
Start with the tree
           6
         /   \
        3      8
       / \    / \
      2   5 7   9

Now add 10, 11, 12, 13, 14, 15, which I'll do in hex with A B C D E F.
Add A; no rotations are needed:
           6
         /   \
        3      8
       / \    / \
      2   5 7   9
                  \
                   A

After adding B, below A above, we need to rotate around node 9 above:
           6
         /   \
        3      8
       / \    / \
      2   5 7   A
                / \
               9   B

After adding C, below B above, we will need to rotate around node 8 above:
           6
         /   \
        3      A
       / \    / \
      2   5 8   B
            / \   \
           7   9   C

Now we add D below C, and rotate around B to get

           6
         /   \
        3      A
       / \    / \
      2   5 8    C
            / \   / \
           7   9 B D

Now we add E below D, and this time the node we rotate around is all the way at the root, 6:

              A
           /     \
         6         C
       /   \      / \
      3     8    B   D
     / \   / \        \
    2   5 7   9        E

Actually, this is just a little misleading; all the rebalancing rotations above involved having the deepest tree to the "right-right" of the pivot node; that is, the right subtree of the right subtree. In that case, a simple rotation (to the left) about the pivot node is all we need. The "left-left" case is similar. However, the right-left (and left-right) cases are not quite as simple: here, we need to do a preliminary rotation around the right subtree (right child) first. Suppose the tree is:

           6
         /   \
        3      9
              / \
             8   A

We now insert a 7, making the tree unbalanced (BF = - 2) at node 6:
           6
         /   \
        3      9
              / \
             8   A
            /
           7

Rotation about 6 alone gives:

           9
         /   \
        6      A
      / \
     3    8
         /
        7

This still has a balance factor of +/- 2! This rotation did not help! However, the AVL rule in this case is to first rotate about the 9 (the right child), and then about the 6. Also note that the rotations are in different directions, and the BF sign changes (it is +1 at the node labeled 9, and -2 at the node labeled 6).

After rotating right about the 9 we get:

           6
         /   \
        3      8
              / \
             7   9
                  \
                   A

Then rotating left about the 6, we get

           8
         /   \
        6      9
       / \      \
      3   7      A

This tree has had its balance improved.

Bayer Trees

Generally these are known as B-trees. Bayer named them that in his paper, though he did not spell out what the B stood for. B-trees are not binary trees; in fact, you might look at them as evidence that being binary makes life much harder.

B-trees have what I will call an order, B. (Some books, and Wikipedia, call this degree 2B.) Each interior node (other than the root) has k values, B<=k<=2B, and k+1 children. The k values, a₀...a_k-1, divide the leaf data into k+1 categories: x<a₀, a_i<x<a_i+1 for i=0..k-2, and a_k-1<x. These categories form the k+1 children; thus, a B-tree is still an ordered tree even if it is not binary. All leaf nodes are the same depth in the tree.

Some B-tree visualizations:

Examples of how ordered trees with more than one data item per node might look

Bayer's idea is that when a node becomes full, we split it in half, and push the median element up a level. The pushed-up node may cause a split in the parent as well; we keep pushing until the process stops or we end up pushing a new root node.

Bayer trees grow deeper only when a new root node is pushed up. Insertion of new values always starts by finding the appropriate leaf node for that value. If that leaf node still has room, the value is inserted; otherwise, push-up is used as necessary.

B-tree of order 1

Insert 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 into a B tree of order 1
Insert 4,17,8,7,2,19,13,15,3,10,1,16,20,9,18,11,14,5,12,6

B-tree of order 2 (2-4 values per node; compare the 2-4 tree, in which it is the number of children per node that ranges from 2 to 4)

Red-Black Trees

These are binary trees that, like AVL trees, remain reasonably balanced because we do restructuring on each insert. The restructuring runs in time proportional to the height of the tree, which is O(log N). Here's the Wikipedia definition:

1. A node is either red or black.
2. The root is black.
3. All leaves are black.
4. Both children of every red node are black.
5. Every simple path from a given node to any of its descendant leaves contains the same number of black nodes.

A black node can have black subnodes. Note that 4 implies that the parent of a red node is black.

Let the black-height of the tree be the number of black nodes in any path from the root to a leaf, constant as per property 5. Property 4 ensures that, along any path, the path length is no more than twice the black-height.
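The properties can be checked mechanically, which is a useful way to test an implementation. The sketch below assumes a hypothetical RBNode class with a boolean Red field, treats null references as the black leaves of property 3, and counts each such leaf as one black node. It checks the root-is-black property, property 4 and property 5 only; it does not check that the tree is ordered.

```csharp
using System;

// Hypothetical node type: Red == false means the node is black.
class RBNode {
    public int Data; public bool Red; public RBNode Left, Right;
    public RBNode(int d, bool red, RBNode l = null, RBNode r = null) {
        Data = d; Red = red; Left = l; Right = r;
    }
}

static class RedBlackCheck {
    public static bool IsRed(RBNode t) { return t != null && t.Red; }

    // Returns the black-height of the subtree (null leaves count as one
    // black node), or -1 if property 4 or property 5 fails somewhere in it.
    public static int BlackHeight(RBNode t) {
        if (t == null) return 1;
        // Property 4: a red node may not have a red child.
        if (t.Red && (IsRed(t.Left) || IsRed(t.Right))) return -1;
        int lh = BlackHeight(t.Left);
        int rh = BlackHeight(t.Right);
        // Property 5: both sides must contain the same number of black nodes.
        if (lh < 0 || rh < 0 || lh != rh) return -1;
        return lh + (t.Red ? 0 : 1);
    }

    public static bool IsRedBlack(RBNode root) {
        if (IsRed(root)) return false;       // the root must be black
        return BlackHeight(root) >= 0;
    }

    static void Main() {
        var good = new RBNode(10, false, new RBNode(5, true), new RBNode(15, true));
        var bad = new RBNode(10, false, new RBNode(5, true, new RBNode(3, true)));
        Console.WriteLine(IsRedBlack(good));  // True
        Console.WriteLine(IsRedBlack(bad));   // False
    }
}
```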

B-Tree analogy

We create a B-tree-like structure by consolidating each black node together with any red offspring. Let us refer to the consolidation of a black node with its red children as a B-node; it then follows from rule 5 that the tree made from the B-nodes has all paths of the same length. It follows from rule 4 that the offspring of any red offspring are black, and so if we identify a B-node in the original tree, the immediate children of this node must all be black.

The catch is that the tree we create this way isn't a proper B-tree. A B-tree of order 1 has 1 or 2 data values per node, and 2 or 3 subtrees; a B-tree of order 2 has 2, 3 or 4 data values per non-root node with 3, 4 or 5 children. The B-like tree here has 1, 2 or 3 data values per node, and 2, 3 or 4 subtrees.

Given a B-like tree with 1-3 data values per node, maintaining the B-tree property of all paths having equal length, we can convert it to a red-black tree as follows:

If a B-node has 1 value, it becomes a single black node
If a B-node has two values, it becomes two nodes. The upper one will be black and the lower one red; it does not matter which is which.
If a B-node has three values, the middle value becomes a black node and the two other values become red nodes representing the roots of the left and right subtrees.

Realistically, there is no reason to prefer the red-black formulation to a proper B-tree formulation. There is no especial benefit to having trees be binary.

Inserting into a B-like tree with 1-3 data values per node is similar to insertion into a regular B-tree, except that when a node is "full" with three values, and we split it and push up a middle value, the split will be into pieces of sizes 1 and 2. It does not matter which piece is which; if a node contains values (10,15,20) and we add 18, we can push up either the 15, with split nodes (10) and (18,20), or the 18, with split nodes (10,15) and (20).

Red-Black Node Insertion

Let us refer to the sibling of the parent of a node as the uncle; the uncle of a node, if it exists, is always unique because the tree is binary. The first step is to add the new node to the tree in the usual binary-tree-insertion manner; we must then rotate and recolor in order to restore red-blackness. The important red-black requirements are 4 and especially 5.

As we work up the tree, we will denote the current node by N. Originally, N is the leaf node where the new data value is inserted, but N then takes values along the path from that leaf to the root. When we originally install the leaf node, we will color it red. As we move upwards, the node designated current will always be red.

We will also let P be the parent of N, G the grandparent and U the uncle. We will follow the five cases of Wikipedia.

            G
         /    \
        U      P
             /   \
            N

Case 1: N is the root of the tree. We color it black and are done.

Case 2: P is black. In this case we can leave N red and be done.

Case 3: P and U are both red. Then we can change P and U both to black and change G to red. Property 4 now holds for G, and property 5 holds for the tree because all paths formerly through black node G now pass through exactly one of the newly black nodes U and P. However, G is now red and its own parent may be red too, so we make G the current node N and start again from case 1.

Case 4: P is red but U is black, and N is the right child of P and P is the left child of G (or N is the left child of P and P is the right child of G):
            G
         /    \
        P      U
      /   \
           N
          / \

In this case we start with a left rotation about P:
            G
         /    \
        N      U
      /   \
     P

The number of black nodes along any path through P or N is unchanged. Rule 4, however, fails. We fix this by setting the just-lowered P to current and moving to case 5, below.

Case 5: P is red, U is black, N is the left child of P which is the left child of G (or N is right of P which is right of G):
            G
         /    \
        P      U
      /   \
     N     T

In this case we do a right rotation about G, changing P to black and G to red:

            P
         /    \
        N       G
              /   \
             T     U

Property 4 now holds, and so does property 5 for the tree now rooted at P: the number of black nodes along any path through N, T or U is unchanged although which specific nodes are black has changed.