Trees
In recursion.html#exprtrees we
looked at some simple tree structures used for expressing "hierarchical"
information, such as expressions:
2*(3+5)

      *
     / \
    2   +
       / \
      3   5
A tree consists of nodes, each of which has zero or more subnodes.
A binary tree is one in which each node has up to two
subnodes.
Binary tree nodes thus contain:
- data
- a left subtree
- a right subtree
The code might look like this (from bintree.cs):
    class treenode<T> {
        private T data_;
        private treenode<T> left_, right_;
        public treenode(T d, treenode<T> l, treenode<T> r) {data_ = d; left_ = l; right_ = r;}
        public T data() {return data_;}
        public treenode<T> left() {return left_;}
        public treenode<T> right() {return right_;}
    }
A binary tree is ordered if there is an ordering
relation on the data and, for each node, every data value appearing in the
left subtree is less than the node's own data, and every data value
appearing in the right subtree is greater than the node's own data.
Quite often, eg for expression trees, no ordering is involved.
The root node is at the top of the tree. Nodes with no
subtrees (ie in which the subtree fields are null) are leaf
nodes. Other nodes are interior nodes.
The depth of a node is how far it is, in node-to-node
links, from the root. Level n of a tree consists of all
nodes with depth n.
Examples:
- Abstract expression trees
- Parse trees
- Ancestry trees
- Decision trees (Bailey p 288)
None of these is ordered!
Tree Traversal
To traverse a tree is to visit each node, perhaps
printing out the data values or perhaps taking some other action.
Traversal methods are usually recursive, and usually the initial call is
on the tree's root.
Three common forms of traversal are preorder, inorder,
and postorder, depending on whether a node's data is
visited before, between or after the visits to the left and right subtrees.
All of these are considered to be depth-first traversal.
Here is a sample inorder traversal
Demo:
traverse the expression tree for 4+x*(y+1)
inorder
postorder (cf Jan Łukasiewicz)
traverse the following tree from simpletrees.zip:
        6
      /   \
     4     8
    / \   / \
   1   5 7   37
We can also do breadth-first traversal, by going along
levels: 6, 4, 8, 1, 5, 7, 37.
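A minimal sketch of breadth-first traversal using a queue. The Node class and the LevelOrder.BreadthFirst name are illustrative assumptions, not code from simpletrees.zip; the idea is only that each dequeued node enqueues its children, so nodes come out level by level.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal node type, standing in for the notes' treenode class.
public class Node {
    public int Data;
    public Node Left, Right;
    public Node(int d, Node l = null, Node r = null) { Data = d; Left = l; Right = r; }
}

public static class LevelOrder {
    // Visit nodes level by level: dequeue a node, record it, enqueue its children.
    public static List<int> BreadthFirst(Node root) {
        var result = new List<int>();
        if (root == null) return result;
        var q = new Queue<Node>();
        q.Enqueue(root);
        while (q.Count > 0) {
            Node t = q.Dequeue();
            result.Add(t.Data);
            if (t.Left != null) q.Enqueue(t.Left);
            if (t.Right != null) q.Enqueue(t.Right);
        }
        return result;
    }
}
```

Applied to the tree above, BreadthFirst returns the values in the order 6, 4, 8, 1, 5, 7, 37.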
Example code for inorder traversal:
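A minimal sketch of a recursive inorder traversal, assuming a simple generic node class (TNode and Traversal are illustrative names, not the ones in bintree.cs). Moving the output.Add line before both recursive calls would give preorder, and moving it after both would give postorder.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical generic node type for this sketch.
public class TNode<T> {
    public T Data;
    public TNode<T> Left, Right;
    public TNode(T d, TNode<T> l = null, TNode<T> r = null) { Data = d; Left = l; Right = r; }
}

public static class Traversal {
    // Inorder: left subtree, then the node's own data, then right subtree.
    public static void Inorder<T>(TNode<T> t, List<T> output) {
        if (t == null) return;
        Inorder(t.Left, output);
        output.Add(t.Data);
        Inorder(t.Right, output);
    }
}
```

For the expression tree of 4+x*(y+1), the inorder visit order is 4 + x * y + 1; note that the parentheses are not recovered by the traversal itself.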
Tree Building
How do we build trees? One way is from the bottom up; that is much easier
if the class treenode is public. We could do this (assuming treenode is
equivalent to treenode<string>):
    treenode n1 = new treenode("a", null, null);
    treenode n2 = new treenode("3", null, null);
    treenode n3 = new treenode("*", n1, n2);
    treenode n4 = new treenode("1", null, null);
    treenode root = new treenode("+", n3, n4);
What does this build?
We can build ordered trees by inserting in order:
    public void insert(int val) {
        if (root_ == null) {
            root_ = new treenode(val, null, null);
            return;
        }
        rinsert(val, root_);
    }
    // recursive insert: does not get called with t==null
    private void rinsert(int val, treenode t) {
        if (val == t.data() ) return;   // should we do something else?
        if (val < t.data()) {
            if (t.left() == null) {
                t.setleft(new treenode(val, null, null));
            } else {
                rinsert(val, t.left());
            }
        } else {    // val > t.data()
            if (t.right() == null) {
                t.setright(new treenode(val, null, null));
            } else {
                rinsert(val, t.right());
            }
        }
    }
If a treenode is null, we cannot modify it; we can only modify the
parent. Bailey addresses this in some contexts by including a pointer to
the parent node, but that won't help us here.
Note that the tree we get by inserting values depends on the order of
insertion.
In-class demos:
- try inserting the values in order
- what orders lead to the same tree?
- convert to preorder traversal
n-ary trees and ordered n-ary trees
A tree doesn't have to be binary to be ordered. But the number of data
elements at each node must be one less than the number of subtrees.
An n-ary tree has, for each node, a list of subnodes:
    class Treenode<T> {
        private T data;
        private List<Treenode<T>> children;    // the subnodes, not data values
        ...
    }
Different nodes of a tree can have different numbers of subnodes.
An ordered n-ary tree means that if a node has n child
nodes it also has a List of n-1 data values ⟨d_0,..., d_{n-2}⟩,
and all data values within child node k lie between d_{k-1}
and d_k (child 0 holds the values below d_0, and child n-1 the values
above d_{n-2}). That is, the ⟨d_0,..., d_{n-2}⟩
divide the data into n intervals, and the subtrees correspond to these
intervals.
As an example, consider:

                     37
                /          \
         (23 34)            (43 56 71 88)
         /  |  \           /  /   |   \   \
       16   29  35       39  47   60   75  103
HTML documents as a data structure (Document Object Model, or DOM):
the structure is basically that of an n-ary tree.
Recursion on trees
Let us suppose we have an inttree. We want one of the following:
- The depth of the tree
- The maximum value in the tree
- The sum of all the numbers in the tree
- Whether 7 is in the tree
The last one can be done efficiently (O(log N)) if we know the tree
is ordered and reasonably balanced. Otherwise, it is the same as the others: we proceed
recursively, checking the left and right subtrees and then the root
node.
    private static int sum(treenode t) {
        if (t==null) return 0;
        else return sum(t.left()) + t.data() + sum(t.right());
    }
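The other three items in the list follow the same recursive pattern as sum(). Here is a minimal sketch; IntNode and the method names are illustrative assumptions, and two conventions are assumed as well: the depth of an empty tree is -1 (so a single node has depth 0, matching the definition of node depth above), and the maximum of an empty tree is int.MinValue.

```csharp
using System;

// Hypothetical integer tree node for this sketch.
public class IntNode {
    public int Data;
    public IntNode Left, Right;
    public IntNode(int d, IntNode l = null, IntNode r = null) { Data = d; Left = l; Right = r; }
}

public static class TreeRecursion {
    // Depth of the deepest node; -1 for an empty tree (assumed convention).
    public static int Depth(IntNode t) {
        if (t == null) return -1;
        return 1 + Math.Max(Depth(t.Left), Depth(t.Right));
    }

    // Largest value anywhere in the tree; int.MinValue for an empty tree.
    public static int Max(IntNode t) {
        if (t == null) return int.MinValue;
        return Math.Max(t.Data, Math.Max(Max(t.Left), Max(t.Right)));
    }

    // Works for any tree: may have to look at every node.
    public static bool Contains(IntNode t, int val) {
        if (t == null) return false;
        return t.Data == val || Contains(t.Left, val) || Contains(t.Right, val);
    }

    // Only valid for an ordered tree: follows a single root-to-leaf path.
    public static bool ContainsOrdered(IntNode t, int val) {
        while (t != null) {
            if (val == t.Data) return true;
            t = val < t.Data ? t.Left : t.Right;
        }
        return false;
    }
}
```

ContainsOrdered visits at most one node per level, which is where the O(log N) bound comes from when the tree is balanced.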
Traversing Trees
What if we want to traverse a tree only until we reach the
n-th value in traversal order? The interface method might be
    public String Get(int n) {
        if (n>=count_) return null;
        getstr_ = null;
        get(n,root_);
        return getstr_;
    }
Now the recursive helper method get(n, treenode p) must be defined.
How would we do this?
In the BlueJ project traversers,
I have implemented several different ways to traverse a tree.
1. inTraverse(): recursive traversal, now with a depth parameter
3. inTraverseStack(): stack-based iterative traversal, using a "trick"
A slightly more natural version of the stack-based iteration is
this; however, it requires being able to push both treenodes and
data as separate entities. This is
4. inTraverseStackAlt()
    s.pushNode(root_);
    while (!s.isEmpty()) {
        Object x = s.pop();
        if (x is treenode) {
            treenode t = x as treenode;
            // push right first so that left is popped (and so visited) first
            if (t.right() != null) s.pushNode(t.right());
            s.pushData(t.data());
            if (t.left() != null) s.pushNode(t.left());
        } else {
            string str = x as string;
            Console.WriteLine(str);
        }
    }
Next we replace the Stack with a Queue. (Normally the queue
operations are not called push() and pop() but rather enqueue() and
dequeue(), but I wanted a very simple drop-in replacement for a
stack.)
5. inTraverseQueue
6. inTraverseQueueAlt
What order do these visit the nodes of the tree? (#5 is a little
broken; the "real" version is #6).
As a related question, why are there so few actual problems that are
solved with Queues?
For the next traverser, we define a Next() method that uses the
stack-based approach:
7. NextSetup()/Next()
The drawback to that is there can be only one active iteration at a
time. The final version solves that, and also enables for-each
loops:
8. Java Iterator (allowing a for-each loop), internally based on a
stack.
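As a sketch of that last idea in C#, the analogue of a Java Iterator is IEnumerable<T>, which is what foreach consumes. The class names below (BNode, IntTree) are assumptions for illustration and are not the code in the traversers project. Each call to GetEnumerator() gets its own stack, so several iterations can be in progress at once.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical node type for this sketch.
public class BNode {
    public int Data;
    public BNode Left, Right;
    public BNode(int d, BNode l = null, BNode r = null) { Data = d; Left = l; Right = r; }
}

public class IntTree : IEnumerable<int> {
    private BNode root_;
    public IntTree(BNode root) { root_ = root; }

    // Stack-based inorder traversal; the stack holds nodes whose left
    // subtree is being (or has been) visited but whose data is still pending.
    public IEnumerator<int> GetEnumerator() {
        var s = new Stack<BNode>();
        BNode t = root_;
        while (t != null || s.Count > 0) {
            while (t != null) { s.Push(t); t = t.Left; }   // run down the left spine
            t = s.Pop();
            yield return t.Data;                            // visit the node
            t = t.Right;                                    // then its right subtree
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

With this in place, `foreach (int v in tree)` visits the values in inorder, and nested foreach loops over the same tree do not interfere with each other.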
Tree Balancing
The problem with ordered trees is that the worst-case depth is O(N),
rather than O(log N). We would like to make sure that the tree remains
at least somewhat balanced as nodes are added (remaining balanced as
nodes are deleted is a related problem, which we will not consider).
Some algorithms in widespread use are AVL trees, red-black trees,
Bayer trees (B-trees), and Skew trees.
Tree Rotations
Tree rotations are a simple reorganization of a tree or subtree.
The idea is to consider a (subtree) root node A and one child B (say
the left child); the child B moves up to the root and the former root
A moves down. The root node A
may be a subnode of a larger tree; since none of the values in the
A-subtree change, any such larger tree would remain ordered. The
transformation shown here, from the left to the right diagram, is a right rotation about A (the
topmost node A moves down to the right); an example of a left rotation
(about B) would be the reverse.
Right rotation:

        A                   B
       / \                 / \
      B   T3      =>     T1   A
     / \                     / \
    T1  T2                 T2   T3
This is legal because we have T1 ≤ B ≤ T2 ≤ A ≤ T3 (where T1,T2,T3
represent all values in those subtrees)
It is straightforward to write code to implement a rotation. The
ordered node data type is D. Note that we swap the data in the node
occupied by A and B; we have to preserve the node occupied originally
by A because nodes above may point to it.
    void rotateRight(TreeNode A) {      // nothing is returned: the rotation is done in place
        if (A==null) return;
        TreeNode B = A.left();
        if (B==null) return;
        TreeNode T1 = B.left(), T2 = B.right(), T3 = A.right();
        D temp = B.data();              // swap the data of the two nodes
        B.setData(A.data());
        A.setData(temp);
        B.setLeft(T2);                  // B is now the node that holds A's data
        B.setRight(T3);
        A.setLeft(T1);                  // A is now the node that holds B's data
        A.setRight(B);
    }
Compare Bailey, p 355.
Splay Trees
Splay trees are binary search trees where, on every access,
we move the value in question (if found) to the root, in an
operation known as splaying.
Note that accesses are now mutation operations! The idea is
that frequently accessed values will gravitate over time
towards the root.
A predecessor to splay trees was rotate-to-root,
in which we can move a node up to the root through a series of
rotations, each time rotating the node with its (new) parent.
This does succeed in bringing a given value to the root.
Splay Trees modify the rotate-to-root strategy with the
addition of "grandfather" rotations. Rotations occur in pairs
(except the last). Let x be a node, p the parent, and g the
grandparent. If x is the left subnode of p and p is the right
subnode of g (or x is right and p is left), we rotate around p
and then g as in the rotate-to-root method. This is sometimes
called the zig-zag step, as the two rotations are in different
directions. However, if x is left and p is left, we rotate
first around g and then around p (note the reversed order!),
in a step sometimes known as "zig-zig".
Consider all the nodes of the tree on the path between x and
the root. After splaying x up to the root, the depth of x is,
of course, now 0. However, the average depth of all the nodes
on that root-to-x path is now roughly halved. Thus, while splaying
doesn't necessarily improve the tree balance, it does move a
number of nodes closer to the root.
Examples:
- splaying a leaf in a balanced tree (eg 6 3 8 2 5 7 9)
- splaying a leaf in a degenerate tree (2 3 5 6 7 8 9)
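Here is a hedged sketch of one common recursive way to code splaying. It is an assumption-laden illustration, not the author's implementation: SNode and SplayDemo are made-up names, the rotations are pointer-style (they return the new subtree root, unlike the data-swapping rotateRight above), and the recursion pairs up levels starting from the root, so when the path has odd length the lone single rotation happens at the bottom of the path rather than at the top.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for this sketch.
public class SNode {
    public int Key;
    public SNode Left, Right;
    public SNode(int k) { Key = k; }
}

public static class SplayDemo {
    // Pointer-style rotations: each returns the new root of the rotated subtree.
    static SNode RotateRight(SNode a) { SNode b = a.Left; a.Left = b.Right; b.Right = a; return b; }
    static SNode RotateLeft(SNode a) { SNode b = a.Right; a.Right = b.Left; b.Left = a; return b; }

    // Brings key (or the last node on its search path) to the root of the
    // subtree rooted at g, and returns the new subtree root.
    public static SNode Splay(SNode g, int key) {
        if (g == null || g.Key == key) return g;
        if (key < g.Key) {
            SNode p = g.Left;
            if (p == null) return g;
            if (key < p.Key) {                      // zig-zig: rotate about g first...
                p.Left = Splay(p.Left, key);
                g = RotateRight(g);                 // g now refers to p
            } else if (key > p.Key) {               // zig-zag: rotate about p first...
                p.Right = Splay(p.Right, key);
                if (p.Right != null) g.Left = RotateLeft(p);
            }
            return g.Left == null ? g : RotateRight(g);   // ...then the second rotation
        } else {
            SNode p = g.Right;
            if (p == null) return g;
            if (key > p.Key) {                      // mirror-image zig-zig
                p.Right = Splay(p.Right, key);
                g = RotateLeft(g);
            } else if (key < p.Key) {               // mirror-image zig-zag
                p.Left = Splay(p.Left, key);
                if (p.Left != null) g.Right = RotateRight(p);
            }
            return g.Right == null ? g : RotateLeft(g);
        }
    }

    // Depth of the deepest node; -1 for an empty tree.
    public static int Depth(SNode t) {
        return t == null ? -1 : 1 + Math.Max(Depth(t.Left), Depth(t.Right));
    }

    public static void Inorder(SNode t, List<int> output) {
        if (t == null) return;
        Inorder(t.Left, output); output.Add(t.Key); Inorder(t.Right, output);
    }
}
```

For the degenerate tree built by inserting 2 3 5 6 7 8 9 in order (a chain of right children, depth 6), splaying 9 with this sketch makes 9 the root and reduces the depth of the tree to 4, while the inorder sequence is unchanged.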
To insert into a splay tree, we first find the node y which,
under ordinary insertion, x would be inserted immediately
below. y is then splayed to the root, at which time we insert
x above y. Assume that x would have been inserted to the right
of y. This means x>y, but also that x<z for every other value z>y
in the tree. In inserting x at the root, we make y its left
subnode, and move y's right subtree to the right of x.
Splaying wreaks havoc on iterators. The main problem is that
normally accesses are iterator-safe (that is, it is safe to
access values in a data structure while an iterator is "in
progress"; you just can't insert), but here accesses are not safe.
AVL Trees
These are named for Adelson-Velskii and Landis, from their
1962 paper. The idea behind AVL trees is that at each node
we store a value representing depth(left_subtree) -
depth(right_subtree); we'll call this the balance
factor. We will use rotations to keep the balance
factor small during insertions.
We can compute balance factors easily enough using
recursion, but it is better to cache the
value at each node to avoid excessive computational time. (Actually, we store at each node the depth of
the subtree rooted at that node; the balance factor at node p is then
p.left().depth() - p.right().depth().) Our goal is to
maintain the balance factor for every node as -1, 0, or 1.
When we insert a node, we have to do appropriate rotations
to maintain the balance factor for every ancestor to the new
node (and also be sure that the rotations do not introduce
any unbalancing of their own).
As we work up the path from the newly inserted value to the
tree root, we consider the new balance factor of each node.
If it is -1, 0, or 1, we do nothing. If it is -2 or 2 (the
worst it can be after a single insertion), we do rotations.
Let X be the node in question, with right and left subnodes
R and L. If the balance factor of X is +2, then the left
subtree L is too deep. We know L has a balance factor of -1,
0, or 1, but it matters which. Let the left child of L be
LL, and the right child be LR. The tree now looks like this:
        X
       / \
      L   R
     / \
    LL  LR
(Note that LL, LR, and R are entire
subtrees, not just nodes. However, we know that
their depths are all similar, because of the balance-factor
requirement.)
We will eventually do a right rotation about X; however
we might first have to do a left
rotation about L. Doing the right rotation about X would
leave us with:
        X                   L
       / \                 / \
      L   R       =>     LL   X
     / \                     / \
    LL  LR                 LR   R
If the balance factor of L had been 0 or 1, this is
sufficient. Assume for a moment that the balance factor of L
is 1, so depth(LL) = depth(LR)+1. Because X has
balance-factor +2, we know depth(R) = depth(L)-2 = depth(LL)-1. After the
rotation, the depth of LL has decremented by 1, the total
depth of LR is unchanged, and the total depth of R has
incremented by 1 to match exactly the depth of LR. So the
balance factor of L (the new root) is now 0. If the original
balance factor of L had been 0, the post-rotation balance
factor of L becomes -1. Either way, it still works.
But if LR is deeper than LL (depth(LL)+1 = depth(LR)), then
the new L is unbalanced. So we first do a left
rotation about L to move some of LR up:
        X                    X                     LR
       / \                  / \                  /    \
      L   R       =>      LR   R       =>       L      X
     / \                 /  \                  / \    / \
    LL  LR              L   LRR              LL  LRL LRR  R
       /  \            / \
     LRL  LRR        LL  LRL
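We argue that after this first rotation the deeper side of LR is its left
side, so when we do the right rotation about X as the next step, we
achieve AVL balance at the new root (which will be LR). We know that
depth(LL) + 1 = depth(LR) = 1 + max(depth(LRL), depth(LRR)),
so neither LRL nor LRR is deeper than LL, and at least one of them is
exactly as deep. Since depth(R) = depth(L) - 2 = depth(LL), the final
subtrees rooted at L (holding LL and LRL) and at X (holding LRR and R)
have the same depth.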
The general rule is that we need to do two rotations (about L and then X
in the first diagram) if the balance factors at L and at X have different
sign (eg +2 at X and -1 at L, or -2 at X and +1 at L).
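The rule above can be sketched in code as follows. This is an illustrative sketch under assumptions, not the course's implementation: AvlNode and Avl are made-up names, each node caches the depth of its own subtree in a Height field (1 for a leaf, 0 for null), and the rotations are pointer-style, returning the new subtree root.

```csharp
using System;

// Hypothetical AVL node caching the depth (height) of its own subtree.
public class AvlNode {
    public int Key;
    public int Height = 1;
    public AvlNode Left, Right;
    public AvlNode(int k) { Key = k; }
}

public static class Avl {
    static int H(AvlNode n) { return n == null ? 0 : n.Height; }
    static void Update(AvlNode n) { n.Height = 1 + Math.Max(H(n.Left), H(n.Right)); }

    // Balance factor as defined in the notes: depth(left) - depth(right).
    public static int Balance(AvlNode n) { return n == null ? 0 : H(n.Left) - H(n.Right); }

    static AvlNode RotateRight(AvlNode x) {
        AvlNode l = x.Left; x.Left = l.Right; l.Right = x;
        Update(x); Update(l);
        return l;
    }
    static AvlNode RotateLeft(AvlNode x) {
        AvlNode r = x.Right; x.Right = r.Left; r.Left = x;
        Update(x); Update(r);
        return r;
    }

    // Ordinary ordered insertion, followed by rebalancing on the way back up.
    public static AvlNode Insert(AvlNode x, int key) {
        if (x == null) return new AvlNode(key);
        if (key < x.Key) x.Left = Insert(x.Left, key);
        else if (key > x.Key) x.Right = Insert(x.Right, key);
        else return x;                                   // duplicates are ignored
        Update(x);
        int bf = Balance(x);
        if (bf == 2) {                                   // left side too deep
            if (Balance(x.Left) < 0)                     // signs differ: rotate about L first
                x.Left = RotateLeft(x.Left);
            return RotateRight(x);
        }
        if (bf == -2) {                                  // right side too deep
            if (Balance(x.Right) > 0)                    // signs differ: rotate about R first
                x.Right = RotateRight(x.Right);
            return RotateLeft(x);
        }
        return x;
    }
}
```

Inserting 6, 3, 9, 8, 10 and then 7 with this sketch triggers the double rotation (balance factor -2 at 6, +1 at 9) and leaves 8 at the root with children 6 and 9.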
Example
Start with the tree
        6
      /   \
     3     8
    / \   / \
   2   5 7   9
Now add 10, 11, 12, 13, 14, 15, which I'll do in hex with A
B C D E F.
Add A; no rotations are needed:
        6
      /   \
     3     8
    / \   / \
   2   5 7   9
              \
               A
After adding B, below A above, we need to rotate around node
9 above:
        6
      /   \
     3     8
    / \   / \
   2   5 7   A
            / \
           9   B
After adding C, below B above, we will need to rotate around
node 8 above:
        6
      /   \
     3      A
    / \    /  \
   2   5  8    B
         / \    \
        7   9    C
Now we add D below C, and rotate around B to get
        6
      /   \
     3      A
    / \    /  \
   2   5  8    C
         / \  / \
        7   9 B  D
Now we add E below D, and this time the node we rotate
around is all the way at the root, 6:
           A
        /     \
       6       C
      / \     / \
     3   8   B   D
    / \ / \       \
   2  5 7  9       E
Actually, this is just a
little misleading; all the rebalancing rotations above
involved having the deepest tree to the "right-right" of the
pivot node; that is, the right subtree of the right subtree.
In that case, a simple rotation (to the left) about the
pivot node is all we need. The "left-left" case is similar.
However, the right-left (and left-right) cases are not quite
as simple: here, we need to do a preliminary rotation around
the right subtree (right child) first. Suppose the tree is:
     6
    / \
   3   9
      / \
     8   A
We now insert a 7, making the tree unbalanced (BF = -2) at
node 6:
     6
    / \
   3   9
      / \
     8   A
    /
   7
Rotation about 6 alone gives:
       9
      / \
     6   A
    / \
   3   8
      /
     7
This still has a balance factor of +/- 2! This rotation did
not help! However, the AVL rule in this case is to first
rotate about the 9 (the right child), and then
about the 6. Also note that the rotations are in different
directions, and the BF sign changes (it is +1 at the node
labeled 9, and -2 at the node labeled 6).
After rotating right about the 9 we get:
     6
    / \
   3   8
      / \
     7   9
          \
           A
Then rotating left about the 6, we get
       8
      / \
     6   9
    / \   \
   3   7   A
This tree has had its balance improved.
Bayer Trees
Generally these are known as B-trees.
Bayer named them that in his paper, though he did not
spell out what the B stood for. B-trees are not
binary trees; in fact, you might look at them as evidence
that being binary makes life much harder.
Bayer's co-author was McCreight, who in 2013 said the following:
Bayer and I were in a lunchtime where we get
to think [of] a name. And ... B is, you know ... We were working for
Boeing at the time, we couldn't use the name without talking to lawyers.
So, there is a B. [The B-tree] has to do with balance, another B. Bayer
was the senior author, who [was] several years older than I am and had
many more publications than I did. So there is another B. And so, at the
lunch table we never did resolve whether there was one of those that made
more sense than the rest. What really lives to say is: the more you think
about what the B in B-trees means, the better you understand B-trees.
So maybe the B stands for Boeing.
B-trees have what I will call an order,
d. Each interior node (other than the root) has k values,
d<=k<=2d, and k+1 children. The k values, a_0...a_{k-1},
divide the leaf data into k+1 categories: x<a_0,
a_i<x<a_{i+1} for i=0..k-2, and a_{k-1}<x.
These categories form the k+1 children; thus, a B-tree is
still an ordered tree
even if it is not binary. All leaf nodes are at the same
depth in the tree.
The nomenclature for B-tree order is not
entirely consistent; some people use 2d, or even 2d+1, as
the order of what is called an order-d tree above. For
greater consistency, we can identify B-trees by their
maximum degree, which is the maximum
number of children a node can have. A B-tree of order d,
as above, has degree 2d+1.
Some B-tree visualizations:
Examples of how ordered trees with more than one data item
per node might look
Bayer's idea is that when a node becomes overfull, we split
it in half, and push the median element up
a level. The pushed-up node may cause a split in the parent
as well; we keep pushing until the process stops or we end
up pushing a new root node.
Bayer trees grow deeper only when a new root node
is pushed up. Insertion of new values always starts by
finding the appropriate leaf node for that value. If that
leaf node still has room, the value is inserted; otherwise,
push-up is used as necessary.
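A compact sketch of insertion with push-up, for a B-tree of order d as defined above. The names (BTreeSketch, BNode, InsertInto) are assumptions for illustration; duplicates are simply ignored, and deletion is not handled. An overfull node has 2d+1 values: the median moves up, and the d values on either side become two separate nodes.

```csharp
using System;
using System.Collections.Generic;

public class BTreeSketch {
    private class BNode {
        public List<int> Keys = new List<int>();
        public List<BNode> Kids = new List<BNode>();     // empty for a leaf
        public bool IsLeaf { get { return Kids.Count == 0; } }
    }

    private readonly int d;                              // order: at most 2d values per node
    private BNode root = new BNode();
    public BTreeSketch(int order) { d = order; }

    public void Insert(int key) {
        var split = InsertInto(root, key);
        if (split == null) return;
        var newRoot = new BNode();                       // the tree grows deeper only here
        newRoot.Keys.Add(split.Value.median);
        newRoot.Kids.Add(root);
        newRoot.Kids.Add(split.Value.right);
        root = newRoot;
    }

    // Returns null if the node absorbed the key; otherwise the pushed-up
    // median and the new right-hand node produced by the split.
    private (int median, BNode right)? InsertInto(BNode node, int key) {
        int i = 0;
        while (i < node.Keys.Count && key > node.Keys[i]) i++;
        if (i < node.Keys.Count && node.Keys[i] == key) return null;   // duplicate
        if (node.IsLeaf) {
            node.Keys.Insert(i, key);
        } else {
            var childSplit = InsertInto(node.Kids[i], key);
            if (childSplit == null) return null;
            node.Keys.Insert(i, childSplit.Value.median);
            node.Kids.Insert(i + 1, childSplit.Value.right);
        }
        if (node.Keys.Count <= 2 * d) return null;       // still fits
        var right = new BNode();                         // overfull: 2d+1 values
        int median = node.Keys[d];
        right.Keys.AddRange(node.Keys.GetRange(d + 1, d));
        node.Keys.RemoveRange(d, d + 1);
        if (!node.IsLeaf) {
            right.Kids.AddRange(node.Kids.GetRange(d + 1, d + 1));
            node.Kids.RemoveRange(d + 1, d + 1);
        }
        return (median, right);
    }

    public List<int> RootKeys() { return new List<int>(root.Keys); }

    public List<int> InOrder() { var r = new List<int>(); Walk(root, r); return r; }
    private static void Walk(BNode n, List<int> r) {
        for (int i = 0; i < n.Keys.Count; i++) {
            if (!n.IsLeaf) Walk(n.Kids[i], r);
            r.Add(n.Keys[i]);
        }
        if (!n.IsLeaf) Walk(n.Kids[n.Keys.Count], r);
    }
}
```

Inserting 1 through 7 into a tree of order 1 with this sketch ends with the single value 4 in the root: the seventh insertion splits a leaf, which overfills the old root (2, 4, 6), which in turn splits and pushes 4 up into a new root.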
B-tree of order 1
Insert 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 into a B-tree
of order 1
Insert 4,17,8,7,2,19,13,15,3,10,1,16,20,9,18,11,14,5,12,6
B-tree of order 2 (2-4 values per node; not to be confused with a 2-4
tree, which has 2-4 children per node)
2-4 Trees
Before proceeding to red-black trees we consider 2-4 trees, which are a
form of B-trees. The tree is ordered, and each node is allowed to have
between two and four children. The number of data values at a node is one
less than the number of children. An additional requirement is that all
leaf nodes must be at the same depth.
Insertion is similar to a B-tree. We start by adding the new value u to
the appropriate leaf node, finding that leaf node using the tree ordering.
Leaf nodes can contain between 1 and 3 data values. If adding u would give
the leaf node four data values, we push the median value of the node up to
the parent, and split the node into two (those less than the median and
those greater than the median, which go on the left and right respectively
of the pushed-up value).
This may cause the parent node to be too big, in which case we repeat the
process. If we must add a new node at every level, then we end up
splitting the existing root node and creating a new root above it.
Red-Black Trees
These are binary trees that, like AVL
trees, remain reasonably balanced because we do
restructuring on each insert. The restructuring runs in time
O(height), that is, O(log N). Here's the Wikipedia
definition:
1. A node is either red or black.
2. The root is black.
3. All leaves are black.
4. Both children of every red node are black (that is, a
parent and child cannot both be red)
5. Every simple path from a given node to any of its
descendant leaves contains the same number of black
nodes.
A black node can have black subnodes. Note that property 4 implies
that the parent of a red node is black.
Let the black-height of the tree be the
number of black nodes in any path from the root to a leaf,
constant as per property 5. Property 4 ensures that, along
any path, the
path length is no more than twice the black-height.
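The properties can be checked mechanically. Below is a minimal sketch, assuming a hypothetical RbNode class with a boolean Red field and treating null subtrees as the (black) leaves of property 3; BlackHeight and IsValid are illustrative names. The black-height here counts the null leaf as one black node.

```csharp
using System;

// Hypothetical red-black node for this sketch; null children play the role of black leaves.
public class RbNode {
    public int Key;
    public bool Red;
    public RbNode Left, Right;
    public RbNode(int k, bool red, RbNode l = null, RbNode r = null) {
        Key = k; Red = red; Left = l; Right = r;
    }
}

public static class RedBlackCheck {
    static bool IsRed(RbNode n) { return n != null && n.Red; }

    // Returns the black-height of the subtree (null leaves count as one
    // black node), or -1 if property 4 or property 5 fails somewhere in it.
    public static int BlackHeight(RbNode n) {
        if (n == null) return 1;                                   // property 3
        if (n.Red && (IsRed(n.Left) || IsRed(n.Right))) return -1; // property 4
        int lh = BlackHeight(n.Left);
        int rh = BlackHeight(n.Right);
        if (lh < 0 || rh < 0 || lh != rh) return -1;               // property 5
        return lh + (n.Red ? 0 : 1);
    }

    // Property 2 (black root) plus properties 4 and 5 for the whole tree.
    public static bool IsValid(RbNode root) {
        return !IsRed(root) && BlackHeight(root) > 0;
    }
}
```

This checks only the coloring rules; it does not check that the keys are ordered.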
2-4 Tree analogy
From a red-black tree we can create a 2-4 tree by merging each red node
into its black parent, so that the black children of a red child become
direct children of the merged node. If a node N is
black and has one red child, this will give N three black children. If N
has two red children, it will now have four black children.
We can convert a 2-4 tree back to a red-black tree in a unique way if we
require that the red-black tree be left-leaning; that
is, that if a black node has a red and a black child, then the red child
is the left one. Given a node N in the 2-4 tree, we do the following:
- if N has two children, we leave it alone.
- if N has three children n1, n2 and n3, we create a new red child node
to N's left and move n1 and n2 so they are children of the red node. N's
right child is n3.
- if N has four children, we create two red child nodes. The first red
node gets the first two of N's children, and the second red node gets
the second two.
We know how to insert into a 2-4 tree so as to maintain the 2-4 property.
Because the 2-4 property is functionally equivalent to the red-black
property for the corresponding red-black tree, we can now maintain
red-black trees.
B-Tree analogy
We create a B-tree-like structure by consolidating each black node
together with any red offspring. Let us refer to the consolidation of a
black node with its red children as a B-node; a B-node will now have
between 1 and 3 data values, and between 2 and 4 children. It then follows
from rule 5 that the tree made from the B-nodes has all paths of the same
length. It follows from rule 4 that the offspring of any red offspring are
black, and so if we identify a B-node in the original tree, the immediate
children of this node must all be black.
The catch is that the tree we create this way isn't quite a
proper B-tree. A B-tree of order 1 has 1 or 2 data values per node, and 2
or 3 subtrees; a B-tree of order 2 has 2, 3 or 4 data values per non-root
node with 3, 4 or 5 children. The B-like tree here has 1, 2 or 3
data values per node, and 2, 3 or 4 subtrees.
Given a B-like tree with 1-3 data values per node, maintaining the B-tree
property of all paths having equal length, we can convert it to a
red-black tree as follows:
- If a B-node has 1 value, it becomes a single black node
- If a B-node has two values, it becomes two nodes. The upper one will
be black and the lower one red; it does not matter which of the two values goes in the upper node.
- If a B-node has three values, the middle value becomes a black node
and the two other values become red nodes representing the roots of the
left and right subtrees.
Inserting into a B-like tree with 1-3 data values per node is similar to
insertion into a regular B-tree, except that when a node is "full" with
three values, and we split it and push up a middle value, the split will
be into pieces of sizes 1 and 2. It does not matter which piece is which;
if a node contains values (10,15,20) and we add 18, we can push up either
the 15, with split nodes (10) and (18,20), or the 18, with split nodes
(10,15) and (20).
Realistically, there is no reason to prefer the red-black formulation to
a proper B-tree formulation. There is no especial benefit to having trees
be binary.
Red-Black Node Insertion
Here is the algorithm for "directly" inserting into a red-black tree,
without converting to a 2-4 tree or a B-like tree
Let us refer to the sibling of the parent
of a node as the uncle; the uncle of a
node, if it exists, is always unique because the tree is
binary. The first step is to add the new node to the tree
in the usual binary-tree-insertion manner; we must then
rotate and recolor in order to restore red-blackness. The
important red-black requirements are 4 and especially 5.
As we work up the tree, we will denote the
current node by N. Originally, N is the
leaf node where the new data value is inserted, but N then
takes values along the path from that leaf to the root. When
we originally install the leaf node, we will color it red.
As we move upwards, the node designated current
will always be red.
We will also let P be the parent of N, G the grandparent and
U the uncle. We will follow the five cases of Wikipedia.
        G
       / \
      U   P
         / \
        N
Case 1: N is the root of the tree. We color it black and are
done.
Case 2: P is black. In this case we can leave N red and be
done.
Case 3: P and U are both red. Then we can change P and U
both to black and change G to red. Property 4 now holds for
G, and property 5 holds for the tree because all paths
formerly through black node G now pass through exactly one
of the newly black nodes U and P. However, G's own parent may
be red (or G may be the root), so we make G the current node
and continue up the tree.
Case 4: P is red but U is black, and N is the right child of
P and P is the left child of G (or N is the left child of P
and P is the right child of G):
        G
       / \
      P   U
     / \
        N
       / \
In this case we start with a left rotation about P:
        G
       / \
      N   U
     / \
    P
The number of black nodes along any path through P or N is
unchanged. Rule 4, however, fails. We fix this by setting
the just-lowered P to current and moving
to case 5, below.
Case 5: P is red, U is black, N is the left child of P which
is the left child of G (or N is right of P which is right of
G):
        G
       / \
      P   U
     / \
    N   T
In this case we do a right rotation about G, changing P to
black and G to red:
      P
     / \
    N   G
       / \
      T   U
Property 4 now holds, and so does property 5 for the tree
now rooted at P: the number of black nodes along any path
through N, T or U is unchanged although which specific
nodes are black has changed.