Comp 271 Lab 3: implementing dictionaries with hashing

Comp 271 Lab 3 - Your Own Hash Map

Goals

Use a hash table to implement a Map (Dictionary)
Implement the hash table with linked lists
Count the occurrence of each word in a file
Print the words and their occurrences (not necessarily in alphabetical order)

Overview

The end result is to count the occurrence of each word in the file wirth.text (a classic essay on software engineering). An alternative file, for when you need to print debugging information, is tiny.text. A third file is spacedoctor.text, which is an explanation of general relativity using only the 1000 most common English words. Verify that not more than 1000 different words are used! (All three files are in the project zipfile below.)

The idea is to read each word, and look it up in a hash table that uses linked lists as buckets (that is, the bins that things are put into by hashcode). If the word is already present, increment its count; otherwise install the word with a count of 1. The linked lists are based on class Cell, below, which has space for the word, its count, and the next Cell. This form of implementation exposes all the internals of Cell, but as everything is still "hidden" in class HashDict, that's ok. Arguably ok, at least.

To read words we will use the Tokenizer class in Tokenizer.java; the token() method returns each word as a string. To initialize it, use

    Tokenizer t = new Tokenizer(filename);

It also happens to return punctuation and (if the punctuation isn't consistent with mini-java) the string "ILLEGAL TOKEN"; you should skip these. So to get the individual (alphabetic) words, you can do this:

    string word = t.token();
    while (word == null  || word.equals("ILLEGAL TOKEN") || !Character.isLetter(word.charAt(0))) {   // why check word==null?
         if (word == null) break;
         word = t.token();
    }

The project is in three java files (plus wirth.text and tiny.text), all within an IntelliJ project in countwords1.zip.

CountWords.java: contains the top-level Main()
HashDict.java: contains the hash-based counting structure
Tokenizer.java: as above

The basic idea of counting words is as follows, where getword() returns the next alphabetic word as above or, if the end of the file has been reached, null. The code shown here, however, uses the built-in HashMap class:

    HashMap<String,Integer> d = new HashMap<String,Integer>();
    ...
    String word = getword();
    while (word != null) {
	if (d.containsKey(word)) {
            int c = d.get(word);
            d.put(word, c+1);
	} else {
	     d.put(word,1);
	}
	word = getword();
    }
    // the Dark Runes:
    for (Map.Entry<String,Integer> entry : d.entrySet()) System.out.printf("%s:  %d%n", entry.getKey(), entry.getValue());

Your job is for the above loop to work after d is replaced with a HashDict, which you will define. This is a hashtable-based Map/Dictionary class that uses linked lists for the "buckets". You are to implement the following:

boolean containsKey(String w): returns true if w is found as a word in the dictionary.
int get(string w): returns the int corresponding to w. You could return 0 if not found, but a return type of int cannot be null. (You could return an Integer, which can be null.) The get() method should return the current count corresponding to the given word.
void put(String w, int count): adds or updates the count associated with the word w. You may, if you wish, make a distinction between put(w,c) for new entries and update(w,c) for existing entries.
void print(): prints each word and its corresponding count. Do this with an outer i<HMAX loop that goes through each element of HTable, and an inner p!=null loop that goes through each element of the linked list.

The get/put interface is optional, actually. Note that in the word-found case in the code above, the hashmap is searched for the word three times! If you can improve on that, go for it. One approach is to have get() return a reference to the Cell containing the word, which can then be incremented without a second put(). (The drawback here is that Cell, below, isn't really meant to be exposed to the user programmer.) Another approach is an increment() method to do a second search and increment the count in one swoop. A third approach might be a method install_or_increment_count(), which would be called without a previous containsKey() search.

Be sure you don't call any methods on a null word!

Implementing a Dictionary with a Hash Table

Our hash table can't be just the words themselves; we also need a place to keep count. Therefore the objects in the linked lists will ultimately be the following:

    private class Cell {
       private String data_;
       private int count_;
       private Cell next_;
       public Cell(String s, Cell n) {data_ = s; next_ = n;}
       public String data() {return data_;}
       public int count() {return count_;}
       public void count(int c) {count_ = c;}
       public Cell next() {return next_;}
    }

The "buckets" of the hash table will now be of type Cell, which corresponds to a linked list of Cell. The table itself is declared like this:

Cell [] HTable

To initialize this, see the code in HashDict.java. If word w, with hash(w) = h, is not found, you will add it to the front of list HTable[h]. You do that like this:

HTable[h] = new Cell(w, 1, HTable[h]);

The 1 here is the initial count. The old list is the second HTable[h], which now plays the role of the list following the newly inserted Cell.

The hash function will be the following (as before, we have a little extra work to make sure it returns values >= 0):

    private int hash(string s) {
	int val = s.GetHashCode() % HMAX;
	if (val < 0) val += HMAX;
	return val;
    }

I have provided a search method for containsKey(). You must write get() , put() and print(). The first two will have a structure similar to containsKey(); print will have a structure like this:

        for (int i=0; i<HMAX; i++) {
            p = HTable[i];
            while (p!= null) {
		// print p.data(), p.count()
                p = p.next();
            }
       }

Here is the output of a run on the file tiny.text:

trust: 1
a: 2
or: 1
depend: 1
confidence: 1
for: 1
worthy: 1
is: 2
your: 1
an: 1
the: 1
reliable: 1
that: 3
can: 2
person: 1
organization: 1
of: 1
attribute: 1
mechanism: 1
you: 2
on: 1

Because this project involves more than one file, your best bet for submission is as a .zip file.