Comp 271 Week 12

Lab 9: Treemaps

Timings with and without the optimization:
durations with optimization in effect:

duration = 6790108
duration = 5405435
duration = 4422699
duration = 5219938
duration = 5340764
duration = 5062445

sum: 32241389; avg 5373564


durations with optimization commented out:

duration = 6103430
duration = 5471016
duration = 6332788
duration = 5435676
duration = 5641989
duration = 4668541

sum: 33653440; avg 5608906

The difference seems dwarfed by the natural variation! I'd need to collect a lot more data.
Here's more data:
no-cache avg duration, 10 runs, is   11,294,353
cache avg duration, 10 runs, is       7,342,483

no-cache avg duration, 100 runs, is   5,823,755
cache avg duration, 100 runs, is      3,476,205

no-cache avg duration, 1000 runs, is  3,665,824
cache avg duration, 1000 runs, is     2,896,411

This makes the cache seem more worth the effort.

I implemented this by creating separate put() and putcache() methods; the former did not use the cache while the latter did.

Hash maps and hash sets

Basic strategy: an array of buckets, called hashTable; each bucket is a pointer to a singly linked list of Nodes holding the data values that hash to that bucket. We place a value with key key into the list hashTable[index] after computing
    index = key.hashCode() % hashTable.length
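
Here's a minimal sketch of that strategy in Java (the Node and BucketMap names, the size field, and the masking of the hashCode's sign bit are my own illustration, not the lab code; Java hashCodes can be negative, so a plain % could give a negative index):

class Node<K, V> {
    K key; V value; Node<K, V> next;
    Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
}

class BucketMap<K, V> {
    @SuppressWarnings("unchecked")
    private Node<K, V>[] hashTable = (Node<K, V>[]) new Node[10];
    private int size = 0;

    private int indexOf(K key) {
        // mask off the sign bit so the index is never negative
        return (key.hashCode() & 0x7fffffff) % hashTable.length;
    }

    public void put(K key, V value) {
        int index = indexOf(key);
        // walk the bucket's list; replace the value if the key is already there
        for (Node<K, V> n = hashTable[index]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; }
        }
        // otherwise prepend a new node to the bucket's list
        hashTable[index] = new Node<>(key, value, hashTable[index]);
        size++;
    }
}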

The hashCode() function is part of the java.lang.Object class; every Java class has this. If this were not the case, then defining a hashCode() function would be up to the developer of each hash-based class; as a result, it would be straightforward to define a hash map specific to String, or specific to some other fixed type, but it would be hard to define a general HashMap<K>.

Here is the formula for hashCode of a string:
   
	s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
= (...((s[0]*31 + s[1])*31 + s[2])*31 + ...)*31 + s[n-1]
If the string s is "eat", then s[0] = 'e' = 101, s[1] = 'a' = 97, and s[2] = 't' = 116, and the hashcode should be:
          (101*31+97)*31+116 = 100184.
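
A quick sanity check (a throwaway snippet; the loop is just the Horner form above, which is what String.hashCode() computes):

int h = 0;
for (char c : "eat".toCharArray())
    h = 31 * h + c;                       // (...(s[0]*31 + s[1])*31 + ...) + s[n-1]
System.out.println(h);                    // prints 100184
System.out.println("eat".hashCode());     // same value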

In java.util.HashMap, when the size() of the Map reaches 75% of hashTable.length, we resize the table and rehash everything. This gives us important flexibility in keeping the table relatively "sparse".
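
Here is a sketch of what the resize-and-rehash step can look like, using the hashTable, Node, and size names from the sketch above (the doubling growth factor is illustrative; this is not java.util.HashMap's actual code):

@SuppressWarnings("unchecked")
private void maybeResize() {
    if (size < 0.75 * hashTable.length) return;    // still sparse enough
    Node<K, V>[] old = hashTable;
    hashTable = (Node<K, V>[]) new Node[2 * old.length];
    for (Node<K, V> bucket : old) {                // rehash every existing node
        for (Node<K, V> n = bucket; n != null; ) {
            Node<K, V> next = n.next;
            int index = (n.key.hashCode() & 0x7fffffff) % hashTable.length;
            n.next = hashTable[index];             // relink the node into its new bucket
            hashTable[index] = n;
            n = next;
        }
    }
}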

In general, when we put K things into a hash table of size N, randomly, then the average bucket list will have length λ = K/N. However, we'll have some empty buckets, and also some larger buckets. The expected fraction of buckets that will have size i is around λ^i e^(-λ) / i!, from the Poisson Distribution.

For i=0, this gives N·e^(-λ) as the number of empty buckets. For λ=0.75, this is 47.2% empty. We also have about 35.4% of the buckets with 1 element, 13.3% with two elements, 3.3% with three elements, and 0.6% with four elements.

The average list length is 0*0.472 + 1*0.354 + 2*0.133 + 3*0.033 + 4*0.006 + ... = 75% = λ
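
Those percentages are easy to reproduce with a throwaway loop (nothing here is specific to the lab code):

double lambda = 0.75;
double p = Math.exp(-lambda);                // λ^0 e^(-λ)/0! : the empty-bucket fraction
for (int i = 0; i <= 4; i++) {
    System.out.printf("buckets of size %d: %.3f%n", i, p);
    p *= lambda / (i + 1);                   // step from λ^i e^(-λ)/i! to the next term
}
// prints roughly 0.472, 0.354, 0.133, 0.033, 0.006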

For searching for things not in the table, this is a pretty good measure of the average lookup time; if we resize regularly, this means that lookup is O(1) (though we have a time/space tradeoff here!). For looking up things that are in the table, we should instead not count the 0-length lists, which means we should divide by 1 − e^(-λ) (the fraction of nonempty buckets) so as to adjust the weighting properly, but then also divide by 2 because on average we'll find what we're looking for after searching just half the list. Mostly this is not a real-life concern. For the theoretically inclined, however, once λ ≥ ln(2), searching a nonempty list for a value that is present is as fast or faster than searching for something that is not there.

Bad table sizes


Should you avoid hash tables of size a power of 2? The concern is that then hashCode % length just means throwing away the high-order bits of the hashCode, which means they were "wasted". There's some merit in choosing table sizes other than a power of 2, and once upon a time an effort was made to choose table sizes that were prime numbers, but mostly Java's hashCode() methods are sufficiently "pseudorandom" that this isn't necessary. My initialTableSize=10 is probably enough to avoid having to worry about any of this.

Thursday

Hashing as the last refuge of singly linked lists? We have a lot of small lists! Replacing them with ArrayLists, which start at capacity 10, would clearly be a loss.

Hash Sets: just drop the data field.
Applications:
The hashing technique I used is called bucket hashing: each hashTable[i] is a "bucket", implemented as a linked list. If two data values hash to the same bucket, we just put them into the same linked list.

Another approach to hashing is the "open addressing" hash technique, in which buckets are not used. We must have the table size significantly larger than the amount of data (that is, load_factor < 1). The central method, in Bailey, is locate(). Note how, if we don't find what we're looking for at index = hashCode % table size, we increment the index (wrapping around at the end of the table). This is our rehashing approach. (More sophisticated rehashing has been used, but if that is an issue, bucket hashing might be more appropriate.)

Note also that, to support deletion, we need to be able to mark positions as "reserved". We're allowed to insert into reserved locations, but in searches we skip over them and keep going.
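
Here is a sketch of what a locate() with linear probing and reserved slots can look like (this is my own illustration, not Bailey's exact code; the RESERVED marker and the field names are assumptions, and the table is assumed to be kept well under full by resizing):

class OpenHashSet<E> {
    private Object[] table = new Object[31];
    private static final Object RESERVED = new Object();   // marks a deleted slot

    private int locate(E value) {
        int index = (value.hashCode() & 0x7fffffff) % table.length;
        int firstReserved = -1;
        while (table[index] != null) {
            if (table[index] == RESERVED) {
                if (firstReserved < 0) firstReserved = index;   // remember it for insertion
            } else if (table[index].equals(value)) {
                return index;                                   // found the value
            }
            index = (index + 1) % table.length;                 // probe the next slot
        }
        // not found: an insertion should reuse the first reserved slot if we saw one
        return firstReserved >= 0 ? firstReserved : index;
    }
}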

Note also that collisions are pretty inevitable. With table size N, K entries, and load factor λ=K/N, the expected number of positions with exactly two entries is Nλ^2 e^(-λ)/2. As a practical matter, we need N ~ K^2 before the probability of some collisions drops to a modest level. Compare to the Birthday Paradox: if you have 24 people chosen at random, the probability that two have the same birthday (a birthday "collision") is greater than 50%.
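
The birthday figure is a quick loop to check (assuming only that the 365 birthdays are equally likely):

int n = 24;
double pNoCollision = 1.0;
for (int i = 0; i < n; i++)
    pNoCollision *= (365.0 - i) / 365.0;     // person i avoids the first i birthdays
System.out.println(1 - pNoCollision);        // about 0.54; it first passes 0.50 at n = 23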

Binary Tree Deletion

How would we delete a node X from a binary tree? If X is a leaf node, this is easy. Otherwise, we are going to have to "promote" one of the subnodes of X (not necessarily one of the immediate subnodes) to the place where X had been; we're then going to have to restructure the remaining subnodes so that they again form a binary tree.

As far as the rest of the tree (the portion above X) is concerned, we can restructure the remaining nodes however we want; all of the values are smaller than the first (smallest) parent-tree node that is to the right, and greater than the first (largest) parent-tree node that is to the left.

One possible restructuring is to sort all the data values below X, and build a degenerate linear "tree" where each node has only a right (or left) child. The problem with this approach is that the depth of the subtree is now as large as possible (equal to the number of nodes); our goal in deletion is thus to arrange for the depth not to get worse. A second problem is that the processing here is proportional to the number of nodes below X, and we'd rather be proportional to the depth of the tree below X.

Here are the two easy cases: if X is a leaf, we simply remove it, and if X has exactly one child, we promote that child into X's position.
Now what do we do if the node has two children, which I'll call L and R? One strategy is to promote L, and insert R below the rightmost subnode of L. Alternatively, we can promote R and insert L below the leftmost subnode of R. The drawback of these is that the new tree will typically have depth = depth(L) + depth(R), while we previously had depth 1 + max(depth(L),depth(R)).

The best approach is to search down the rightmost path from L, to get the largest value in the subtree less than X (that is, the largest value below L), and/or to search down the leftmost path from R to get the smallest subtree value greater than X (that is, the smallest value below R), and promote one of those. Considering the first case, let Y be the largest value below L. Y must not have a right child, and so we can delete Y from where it had been using the second case above. We now put Y where X was, and leave L (which we just modified) and R as the left and right children of Y.

It is not exactly clear when we should take the largest value below L and when we should take the smallest value below R. It is known that if we always do one consistently, then over time the tree will become spindly and have excessive depth. One strategy is to take whichever of the two nodes has greater depth, on the theory that that will lead to depth reduction. Another strategy is simply to alternate.

See also page 352 of Bailey. The approach there is removing the root of a subtree. However, finding the predecessor (rightmost below L) or successor (leftmost below R) is still the heart of the primary case.
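
Here is a sketch of the two-child case using the largest value below L (the TreeNode methods are the ones assumed in the rotation code below; the caller is responsible for making X's old parent point at the returned node):

TreeNode<D> removeTwoChildCase(TreeNode<D> X) {
    TreeNode<D> L = X.left(), R = X.right();
    if (L.right() == null) {
        // L itself is the predecessor: it has no right child, so it can absorb R
        L.setRight(R);
        return L;                         // L replaces X in the parent tree
    }
    // walk the rightmost path below L to find the predecessor Y and its parent
    TreeNode<D> parent = L;
    while (parent.right().right() != null) parent = parent.right();
    TreeNode<D> Y = parent.right();
    parent.setRight(Y.left());            // Y has no right child; promote its left subtree
    Y.setLeft(L);
    Y.setRight(R);
    return Y;                             // Y replaces X in the parent tree
}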

Tree Balancing


The problem with ordered trees is that the worst-case depth is O(N), rather than O(log N). We would like to make sure that the tree remains at least somewhat balanced as nodes are added (remaining balanced as nodes are deleted is a related problem, which we will not consider). Some algorithms in widespread use are AVL trees, red-black trees, Bayer trees (B-trees), and splay trees.

Tree Rotations

Tree rotations are a simple reorganization of a tree or subtree.  The idea is to consider a (subtree) root node A and one child B (say the left child); the child B moves up to the root and the former root A moves down. The root node A may be a subnode of a larger tree; since none of the values in the A-subtree change, any such larger tree would remain ordered. The transformation shown here, from the left to the right diagram, is a right rotation about A (the topmost node A moves down to the right); an example of a left rotation (about B) would be the reverse.

          A                         B
       /    \                    /    \
      B      T3         =>     T1      A
    /  \                             /  \
   T1  T2                          T2    T3
   
This is legal because we have T1 ≤ B ≤ T2 ≤ A ≤ T3 (where T1,T2,T3 represent all values in those subtrees)

It is straightforward to write code to implement a rotation. The (ordered) node data type is D. Note that we swap the data held by the nodes A and B; we have to preserve the node originally occupied by A because nodes above may point to it.

TreeNode<D> rotateRight(TreeNode<D> A) {
    if (A == null) return null;
    TreeNode<D> B = A.left();
    if (B == null) return A;          // nothing to rotate
    TreeNode<D> T1 = B.left(), T2 = B.right(), T3 = A.right();

    D temp = B.data();                // swap the data held by A and B,
    B.setData(A.data());              // so the subtree root node stays put
    A.setData(temp);

    B.setLeft(T2);       // this is now the node that holds A's data
    B.setRight(T3);
    A.setLeft(T1);       // this is now the node that holds B's data
    A.setRight(B);
    return A;            // same node as before, now holding B's data
}

Compare Bailey, p 355.
   

Splay Trees

Splay trees are binary search trees where, on every access, we move the value in question (if found) to the root, in an operation known as splaying. Note that accesses are now mutation operations! The idea is that frequently accessed values will gravitate over time towards the root.

A predecessor to splay trees was rotate-to-root, in which we can move a node up to the root through a series of rotations, each time rotating the node with its (new) parent. This does succeed in bringing a given value to the root.

Splay trees modify this simple strategy with the addition of "grandfather" rotations. Let x be a node, p its parent, and g its grandparent. If x is the left subnode of p and p is the right subnode of g (or x is right and p is left), we rotate around p and then around g, as in the rotate-to-root method. This is sometimes called the zig-zag step, as the two rotations are in different directions. However, if x is left and p is left (or both are right), we rotate first around g and then around p (note the reversed order!), in a step sometimes known as "zig-zig". If x's parent is the root, a single rotation (the "zig" step) finishes the splay.
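
Here is a sketch of one splay step for the case where the target value lies below g's left child. It uses the data-swapping rotateRight above plus a mirror-image rotateLeft (assumed, not shown), requires D to be Comparable, and assumes the target sits exactly two levels below g; the two right-side cases are mirror images.

TreeNode<D> splayStepLeft(TreeNode<D> g, D target) {
    TreeNode<D> p = g.left();
    if (target.compareTo(p.data()) < 0) {
        // zig-zig: x is the left child of p and p is the left child of g.
        // Rotate about g first (p's data comes up to node g), then rotate
        // there again, which is the rotation "about p" that brings x's data up.
        rotateRight(g);
        return rotateRight(g);
    } else {
        // zig-zag: x is the right child of p and p is the left child of g.
        // Rotate about p first, then about g, just as in rotate-to-root.
        rotateLeft(p);
        return rotateRight(g);
    }
}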