Comp 271 Lab 3 - Your Own Hash Map
Goals
- Use a hash table to implement a Map (Dictionary)
- Implement the hash table with linked lists
- Count the occurrence of each word in a file
- Print the words and their occurrences (not necessarily in alphabetical
order)
Overview
The end result is to count the occurrence of each word in the file
wirth.text (a classic essay on software engineering). An alternative file,
for when you need to print debugging information, is tiny.text. A third file
is spacedoctor.text, which is an explanation of general relativity using
only the 1000 most common English words. Verify that not more than 1000
different words are used! (All three files are in the project zipfile
below.)
The idea is to read each word, and look it up in a hash table
that uses linked lists as buckets (that is, the bins that things
are put into by hashcode). If the word is already present, increment its
count; otherwise install the word with a count of 1. The linked lists are
based on class Cell, below, which has space for the word, its count, and the
next Cell. This form of implementation exposes all the internals of Cell,
but as everything is still "hidden" in class HashDict, that's ok. Arguably
ok, at least.
To read words we will use the Tokenizer class in Tokenizer.java; the token()
method returns each word as a string. To initialize it, use
Tokenizer t = new Tokenizer(filename);
It also happens to return punctuation and (if the punctuation isn't
consistent with mini-java) the string "ILLEGAL TOKEN"; you should skip
these. So to get the individual (alphabetic) words, you can do this:
string word = t.token();
while (word == null || word.equals("ILLEGAL TOKEN") || !Character.isLetter(word.charAt(0))) { // why check word==null?
if (word == null) break;
word = t.token();
}
The project is in three java files (plus wirth.text and tiny.text), all
within an IntelliJ project in countwords1.zip.
- CountWords.java: contains the top-level Main()
- HashDict.java: contains the
hash-based counting structure
- Tokenizer.java: as above
The basic idea of counting words is as follows, where getword()
returns the next alphabetic word as above or, if the end of the file has
been reached, null. The code shown here, however, uses the built-in HashMap
class:
HashMap<String,Integer> d = new HashMap<String,Integer>();
...
String word = getword();
while (word != null) {
if (d.containsKey(word)) {
int c = d.get(word);
d.put(word, c+1);
} else {
d.put(word,1);
}
word = getword();
}
// the Dark Runes:
for (Map.Entry<String,Integer> entry : d.entrySet()) System.out.printf("%s: %d%n", entry.getKey(), entry.getValue());
Your job is for the above loop to work after d is replaced with a HashDict,
which you will define. This is a hashtable-based Map/Dictionary
class that uses linked lists for the "buckets". You are to implement the
following:
- boolean containsKey(String w):
returns true if w is found
as a word in the dictionary.
- int get(string w):
returns the int corresponding to w.
You could return 0 if not found, but a return type of int
cannot be null. (You could return an Integer,
which can be null.) The get()
method should return the current count corresponding to the given word.
- void put(String w, int count):
adds or updates the count associated with the word w. You may, if you
wish, make a distinction between put(w,c)
for new entries and update(w,c)
for existing entries.
- void print(): prints each
word and its corresponding count. Do this with an outer i<HMAX
loop that goes through each element of HTable,
and an inner p!=null loop
that goes through each element of the linked list.
The get/put interface is optional, actually. Note that in the word-found
case in the code above, the hashmap is searched for the word three
times! If you can improve on that, go for it. One approach is to have
get() return a reference to the Cell containing the word, which can then be
incremented without a second put(). (The drawback here is that Cell, below,
isn't really meant to be exposed to the user programmer.) Another approach
is an increment() method to do a second search and increment the count in
one swoop. A third approach might be a method install_or_increment_count(),
which would be called without a previous containsKey() search.
Be sure you don't call any methods on a null word!
Implementing a Dictionary with a Hash Table
Our hash table can't be just the words themselves; we also need a place
to keep count. Therefore the objects in the linked lists will ultimately
be the following:
private class Cell {
private String data_;
private int count_;
private Cell next_;
public Cell(String s, Cell n) {data_ = s; next_ = n;}
public String data() {return data_;}
public int count() {return count_;}
public void count(int c) {count_ = c;}
public Cell next() {return next_;}
}
The "buckets" of the hash table will now be of type Cell, which
corresponds to a linked list of Cell. The table itself is declared like
this:
Cell [] HTable
To initialize this,
see the code in HashDict.java. If word w, with hash(w) = h, is not
found, you will add it to the front of list HTable[h]. You do
that like this:
HTable[h] = new Cell(w, 1, HTable[h]);
The 1 here is the initial count. The old list is the second HTable[h],
which now plays the role of the list following the newly
inserted Cell.
The hash function will be the following (as before, we have a little extra
work to make sure it returns values >= 0):
private int hash(string s) {
int val = s.GetHashCode() % HMAX;
if (val < 0) val += HMAX;
return val;
}
I have provided a search method for containsKey().
You must write get() , put() and print().
The first two will have a structure similar to containsKey(); print will
have a structure like this:
for (int i=0; i<HMAX; i++) {
p = HTable[i];
while (p!= null) {
// print p.data(), p.count()
p = p.next();
}
}
Here is the output of a run on the file tiny.text:
trust:
1
a: 2
or: 1
depend: 1
confidence: 1
for: 1
worthy: 1
is: 2
your: 1
an: 1
the: 1
reliable: 1
that: 3
can: 2
person: 1
organization: 1
of: 1
attribute: 1
mechanism: 1
you: 2
on: 1
Because this project involves more than one file, your best bet for
submission is as a .zip file.