Comp 271 Week 12
Lab 9: Treemaps
Timings with and without the optimization:
durations with optimization in effect:
duration = 6790108
duration = 5405435
duration = 4422699
duration = 5219938
duration = 5340764
duration = 5062445
sum: 32241389; avg 5373564
durations with optimization commented out:
duration = 6103430
duration = 5471016
duration = 6332788
duration = 5435676
duration = 5641989
duration = 4668541
sum: 33653440; avg 5608906
The difference seems dwarfed by the natural variation! I'd need to collect a lot more data.
Here's more data:
no-cache avg duration, 10 runs, is 11,294,353
cache avg duration, 10 runs, is 7,342,483
no-cache avg duration, 100 runs, is 5,823,755
cache avg duration, 100 runs, is 3,476,205
no-cache avg duration, 1000 runs, is 3,665,824
cache avg duration, 1000 runs, is 2,896,411
This makes the cache seem more worth the effort.
I implemented this by creating separate put() and putcache() methods; the former did not use the cache while the latter did.
Hash maps and hash sets
Basic strategy: an array of buckets, called hashTable; each bucket is a
pointer to a singly linked list of Nodes holding data values that hash
to that bucket. We place the value key in the list hashTable[index] after computing
index = key.hashCode() % hashTable.length
The hashCode() function is part of the java.lang.Object class;
every Java class has this. If this were not the case, then defining a
hashCode() function would be up to the developer of each hash-based
class; as a result, it would be straightforward to define a hashmap
specific to String, or specific to some other fixed type, but it would
be hard to define a general hashmap<K>.
Here is the formula for hashCode of a string:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
= (...((s[0]*31 + s[1])*31 + s[2])*31 + ...)*31 + s[n-1]
If the string s is "eat", then s[0] = 'e' = 101, s[1] = 'a' = 97, and s[2] = 't' = 116, and the hashCode should be:
(101*31 + 97)*31 + 116 = 100184.
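The formula is easy to check against Java's own String.hashCode(). A small sketch using Horner's rule, as in the second form of the formula above (class and method names are mine):

```java
public class StringHashDemo {
    // Horner's-rule evaluation of s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * 31 + s.charAt(i);   // (...((s[0]*31 + s[1])*31 + s[2])*31 + ...)
        }
        return h;
    }

    public static void main(String[] args) {
        // 'e'=101, 'a'=97, 't'=116: (101*31 + 97)*31 + 116 = 100184
        System.out.println(hash("eat"));        // 100184
        System.out.println("eat".hashCode());   // 100184; same formula
    }
}
```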
In java.util.HashMap, when the size() of the Map reaches 75% of
hashTable.length, we resize the table and rehash everything. This gives
us important flexibility in keeping the table relatively "sparse".
In general, when we put K things into a hash table of size N, randomly, then the average
bucket list will have length λ = K/N. However, we'll have some empty
buckets, and also some larger buckets. The expected fraction of buckets
that will have size i is around λ^i * e^(-λ) / i!, from the Poisson distribution.
For i=0, this gives N*e^(-λ) as the number of empty buckets.
For λ=0.75, this is 47.2% empty. We also have about 35.4% of the
buckets with 1 element, and 13.3% with two elements, and 3.3%
with three elements, and 0.6% with four elements.
The average list length is 0*0.472 + 1*0.354 + 2*0.133 + 3*0.033 + 4*0.006 + ... = 75% = λ
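The percentages above can be reproduced directly from the Poisson formula; a quick sketch (names are mine):

```java
public class PoissonBuckets {
    // expected fraction of buckets holding exactly i entries, at load factor lambda:
    // lambda^i * e^(-lambda) / i!
    static double fraction(double lambda, int i) {
        double fact = 1.0;
        for (int k = 2; k <= i; k++) fact *= k;
        return Math.pow(lambda, i) * Math.exp(-lambda) / fact;
    }

    public static void main(String[] args) {
        double lambda = 0.75;
        double avg = 0.0;
        for (int i = 0; i <= 4; i++) {
            System.out.printf("size %d: %.1f%%%n", i, 100 * fraction(lambda, i));
            avg += i * fraction(lambda, i);
        }
        // summing i * fraction(lambda, i) approximately recovers lambda itself
        // (the tail beyond i=4 is tiny)
        System.out.printf("average list length ~ %.3f%n", avg);
    }
}
```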
For searching for things not in the list, this is a pretty good measure
of the average time for lookup; if we resize regularly, this means that
lookup is O(1) (though we have a time/space tradeoff here!). For
looking up things that are in the list, we should instead not count the 0-length lists, which means we should divide by (1 - e^(-λ)), the fraction of nonempty buckets,
so as to adjust the weighting properly, but then also divide by 2
because on average we'll find what we're looking for after searching
just half the list. Mostly this is not a real-life concern. For the
theoretically inclined, however, λ ≥ ln(2) ≈ 0.693 means that searching
the nonempty lists for a value that is present is as fast or faster than
searching for something that is not there.
Bad table sizes
Should you avoid hash tables of size a power of 2? The concern is that
then hashCode % length just means throwing away the high-order bits of
the hashCode, which means they were "wasted". There's some merit in
choosing table sizes other than a power of 2, and once upon a time an
effort was made to choose tables for which the size was a prime number,
but mostly java's hashCode() methods are sufficiently "pseudorandom"
that this isn't necessary. My initialTableSize=10 is probably enough to
avoid having to worry about any of this.
Thursday
Hashing as the last refuge of singly linked lists? We have a lot of
small lists! Replacing them with ArrayLists, which start at capacity 10,
would clearly be a loss.
Hash Sets: just drop the data field.
Applications:
- determining if something has been seen before
- weeding out duplicates
- recognizing keywords
The hashing technique I used is called bucket
hashing: each of the hashTable[i] is a "bucket", implemented as a
linked list. If two data values hash to the same bucket, we just put
them into the same linked list.
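A minimal bucket-hashing set along these lines, written as a sketch (class and field names are mine, java.util.LinkedList stands in for hand-rolled singly linked nodes, and there is no resizing):

```java
import java.util.LinkedList;

public class BucketHashSet<K> {
    private LinkedList<K>[] hashTable;   // each bucket is a linked list

    @SuppressWarnings("unchecked")
    public BucketHashSet(int tableSize) {
        hashTable = new LinkedList[tableSize];
        for (int i = 0; i < tableSize; i++) hashTable[i] = new LinkedList<>();
    }

    // hashCode() can be negative in Java; mask off the sign bit before taking %
    private int index(K key) {
        return (key.hashCode() & 0x7fffffff) % hashTable.length;
    }

    public boolean contains(K key) {
        return hashTable[index(key)].contains(key);
    }

    public void add(K key) {
        LinkedList<K> bucket = hashTable[index(key)];
        if (!bucket.contains(key)) bucket.add(key);   // weed out duplicates
    }
}
```

For the keyword-recognition application, for example, one would add() each keyword once and then call contains() on every identifier scanned.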
Another approach to hashing is the "open addressing" hash technique, in which buckets are not used. We must
have the table size significantly larger than the amount of data (that
is, load factor λ < 1). The central method, in Bailey, is locate().
Note how, if we don't find what we're looking for at index=hashcode, we
increment the index. This is our rehashing approach. (More sophisticated rehashing has been used, but if that is an issue, bucket hashing might be more appropriate.)
Note also that, to support deletion, we need to be able to mark
positions as "reserved". We're allowed to insert into reserved
locations, but in searches we skip over them and keep going.
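A sketch of these ideas for a set of strings, with linear probing and "reserved" markers; this locate() is my simplified version of the idea, not Bailey's actual code:

```java
public class OpenAddressSet {
    // a unique marker object: identity comparison (==) distinguishes it
    // from any user-supplied string, even one that equals "RESERVED"
    private static final String RESERVED = new String("RESERVED");
    private String[] table;

    public OpenAddressSet(int size) { table = new String[size]; }

    // walk forward from the hash index; skip over reserved slots and keep going
    private int locate(String key) {
        int i = (key.hashCode() & 0x7fffffff) % table.length;
        for (int step = 0; step < table.length; step++) {
            if (table[i] == null) return -1;   // truly empty: key is not present
            if (table[i] != RESERVED && table[i].equals(key)) return i;
            i = (i + 1) % table.length;        // rehash by incrementing the index
        }
        return -1;
    }

    public boolean contains(String key) { return locate(key) >= 0; }

    // assumes the table is not full (load factor < 1)
    public void add(String key) {
        if (contains(key)) return;
        int i = (key.hashCode() & 0x7fffffff) % table.length;
        // we are allowed to insert into reserved locations
        while (table[i] != null && table[i] != RESERVED) i = (i + 1) % table.length;
        table[i] = key;
    }

    public void remove(String key) {
        int i = locate(key);
        if (i >= 0) table[i] = RESERVED;   // mark, don't null, so later probes continue past
    }
}
```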
Note also that collisions are pretty inevitable. With table size N, K
entries, and load factor λ=K/N, the expected number of positions with
exactly two entries is N*λ^2*e^(-λ)/2. As a practical matter, we need N ~ K^2 before the probability of some
collisions drops to a modest level. Compare to the Birthday Paradox: if
you have 24 people chosen at random, the probability that two have the
same birthday (a birthday "collision") is greater than 50%.
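The birthday figure is easy to verify by multiplying out the probability that all birthdays are distinct:

```java
public class Birthday {
    // probability that k people with uniformly random birthdays
    // have at least one shared birthday
    static double collisionProbability(int k) {
        double pAllDistinct = 1.0;
        for (int i = 0; i < k; i++) {
            pAllDistinct *= (365.0 - i) / 365.0;   // person i avoids the first i birthdays
        }
        return 1.0 - pAllDistinct;
    }

    public static void main(String[] args) {
        System.out.printf("24 people: %.3f%n", collisionProbability(24));
    }
}
```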
Binary Tree Deletion
How would we delete a node X from a binary tree? If X is a leaf
node, this is easy. Otherwise, we are going to have to "promote" one of
the subnodes of X (not necessarily one of the immediate subnodes) to
the place where X had been; we're then going to have to restructure the
remaining subnodes so that they again form a binary tree.
As far as the rest of the tree (the portion above X) is concerned, we
can restructure the remaining nodes however we want; all of the values
are smaller than the first (smallest) parent-tree node that is to the
right, and greater than the first (largest) parent-tree node that is to
the left.
One possible restructuring is to sort all the data values below X, and
build a degenerate linear "tree" where each node has only a right (or
left) child. The problem with this approach is that the depth of the
subtree is now as large as possible (equal to the number of nodes); our
goal in deletion is thus to arrange for the depth not to get worse. A
second problem is that the processing here is proportional to the
number of nodes below X, and we'd rather be proportional to the depth of the tree below X.
Here are the two easy cases:
- The node has no children. Just delete it.
- The node has one child.
Call the child node Y. Note that Y might have a very large tree below
it! However, that doesn't matter; we simply promote Y to where X had
been. More specifically, if X was the right subtree of its parent node
P, we simply do P.setRight(Y); similarly if X was the root or P.left().
Now what do we do if the node has two children, which I'll call L and
R? One strategy is to promote L, and insert R below the rightmost
subnode of L. Alternatively, we can promote R and insert L below the
leftmost subnode of R. The drawback of these is that the new tree will
typically have depth = depth(L) + depth(R), while we previously had
depth 1 + max(depth(L),depth(R)).
The best approach is to search down the rightmost path from L, to get
the largest value in the subtree less than X (that is, the largest
value below L), and/or to search down the leftmost path from R to get
the smallest subtree value greater than X (that is, the smallest value
below R), and promote one of those. Considering the first case, let Y
be the largest value below L. Y must not have a right child, and so we
can delete Y from where it had been using the second case above. We now
put Y where X was, and leave L (which we just modified) and R as the
left and right children of Y.
It is not exactly clear when we should take the largest value below L and when we should take the smallest value below R. It is
known that if we always do one consistently, then over time the tree
will become spindly and have excessive depth. One strategy is to take
whichever of the two nodes has greater depth, on the theory that that
will lead to depth reduction. Another strategy is simply to alternate.
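The two-child case using the largest value below L (the in-order predecessor) can be sketched as follows; this Node class is a minimal stand-in of my own, not the course's TreeNode, and it copies Y's value into X's node rather than relinking the node itself, which is the same idea:

```java
public class BSTDelete {
    static class Node {
        int value; Node left, right;
        Node(int v) { value = v; }
    }

    static Node insert(Node t, int v) {
        if (t == null) return new Node(v);
        if (v < t.value) t.left = insert(t.left, v);
        else if (v > t.value) t.right = insert(t.right, v);
        return t;
    }

    // delete v from the subtree rooted at t; return the new subtree root
    static Node delete(Node t, int v) {
        if (t == null) return null;
        if (v < t.value) { t.left = delete(t.left, v); return t; }
        if (v > t.value) { t.right = delete(t.right, v); return t; }
        // found it: the two easy cases first
        if (t.left == null) return t.right;   // covers the leaf and one-child cases
        if (t.right == null) return t.left;
        // two children: find Y, the largest value below the left child L
        Node y = t.left;
        while (y.right != null) y = y.right;
        t.value = y.value;                    // put Y's value where X's was
        t.left = delete(t.left, y.value);     // Y has no right child: an easy case
        return t;
    }

    static String inorder(Node t) {
        return t == null ? "" : inorder(t.left) + t.value + " " + inorder(t.right);
    }
}
```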
See also page 352 of Bailey. The approach there is removing the root of
a subtree. However, finding the left or right predecessor is still part
of the primary case.
Tree Balancing
The problem with ordered trees is that the worst-case depth is O(N),
rather than O(log N). We would like to make sure that the tree remains
at least somewhat balanced as nodes are added (remaining balanced as
nodes are deleted is a related problem, which we will not consider).
Some algorithms in widespread use are AVL trees, red-black trees, Bayer
trees (B-trees), and splay trees.
Tree Rotations
Tree rotations are a simple reorganization of a tree or subtree.
The idea is to consider a (subtree) root node A and one child B (say
the left child); the child B moves up to the root and the former root A
moves down. The root node A may be a subnode of a larger tree; since
none of the values in the A-subtree change, any such larger tree would
remain ordered. The transformation shown here, from the left to the
right diagram, is a right rotation about A (the topmost node A moves down to the right); an example of a left rotation (about B) would be the reverse.
      A                     B
     / \                   / \
    B   T3      =>       T1   A
   / \                       / \
  T1  T2                   T2   T3
This is legal because we have T1 ≤ B ≤ T2 ≤ A ≤ T3 (where T1,T2,T3 represent all values in those subtrees)
It is straightforward to write code to implement a rotation. The
ordered node data type is D. Note that we swap the data in the nodes
occupied by A and B; we have to preserve the node originally occupied
by A, because nodes above may point to it.
TreeNode<D> rotateRight(TreeNode<D> A) {
    if (A == null) return null;
    TreeNode<D> B = A.left();
    if (B == null) return A;   // no left child: nothing to rotate
    TreeNode<D> T1 = B.left(), T2 = B.right(), T3 = A.right();
    D temp = B.data();         // swap data so the root node object is unchanged
    B.setData(A.data());
    A.setData(temp);
    B.setLeft(T2);             // B's node now holds A's data
    B.setRight(T3);
    A.setLeft(T1);             // A's node now holds B's data
    A.setRight(B);
    return A;                  // same node object, now holding B's data
}
Compare Bailey, p 355.
Splay Trees
Splay trees are binary search trees where, on every access, we move the value in question (if found) to the root, in an operation known as splaying.
Note that accesses are now mutation operations! The idea is that
frequently accessed values will gravitate over time towards the root.
A predecessor to splay trees was rotate-to-root, in which we can move a
node up to the root through a series of rotations, each time rotating
the node with its (new) parent. This does succeed in bringing a given
value to the root.
Splay Trees modify this simple strategy with the addition of
"grandfather" rotations. Let x be a node, p the parent, and g the
grandparent. If x is the left subnode of p and p is the right subnode
of g (or x is right and p is left), we rotate around p and then g as in
the rotate-to-root method. This is sometimes called the zig-zag step,
as the two rotations are in different directions. However, if x is left
and p is left, we rotate first around g and then around p (note the
reversed order!), in a step sometimes known as "zig-zig".
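The zig-zig and zig-zag cases can be sketched with a recursive splay on a minimal int-keyed node; this is one common recursive formulation (class and helper names are mine), and if the key is absent it simply splays some nearby node on the search path rather than doing nothing:

```java
public class SplayDemo {
    static class Node {
        int key; Node left, right;
        Node(int k) { key = k; }
    }

    static Node insert(Node t, int k) {
        if (t == null) return new Node(k);
        if (k < t.key) t.left = insert(t.left, k);
        else if (k > t.key) t.right = insert(t.right, k);
        return t;
    }

    static Node rotateRight(Node a) {   // left child moves up
        Node b = a.left;
        a.left = b.right;
        b.right = a;
        return b;
    }

    static Node rotateLeft(Node a) {    // right child moves up
        Node b = a.right;
        a.right = b.left;
        b.left = a;
        return b;
    }

    // move key (if present) to the root of the subtree; root plays g, root.left/right plays p
    static Node splay(Node root, int key) {
        if (root == null || root.key == key) return root;
        if (key < root.key) {
            if (root.left == null) return root;
            if (key < root.left.key) {                    // zig-zig: rotate around g first...
                root.left.left = splay(root.left.left, key);
                root = rotateRight(root);
            } else if (key > root.left.key) {             // zig-zag: two opposite rotations
                root.left.right = splay(root.left.right, key);
                if (root.left.right != null) root.left = rotateLeft(root.left);
            }
            return (root.left == null) ? root : rotateRight(root);   // ...then around p
        } else {
            if (root.right == null) return root;
            if (key > root.right.key) {                   // zig-zig, mirrored
                root.right.right = splay(root.right.right, key);
                root = rotateLeft(root);
            } else if (key < root.right.key) {            // zig-zag, mirrored
                root.right.left = splay(root.right.left, key);
                if (root.right.left != null) root.right = rotateRight(root.right);
            }
            return (root.right == null) ? root : rotateLeft(root);
        }
    }

    static String inorder(Node t) {
        return t == null ? "" : inorder(t.left) + t.key + " " + inorder(t.right);
    }
}
```

Note that splay() returns the new root, so the caller must reassign: root = splay(root, key). This is the mutation-on-access behavior described above.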