Comp 388 Lab 4: implementing dictionaries with hashing

Comp 388 Lab 4 - Hash lookup

Goals

use a hash table to implement a Dictionary
Count the occurrence of each word in a file
Print the words and their occurrences (not necessarily in alphabetical order)

Overview

The end result is to count the occurrence of each word in the file wirth.text (a classic essay on software engineering). An alternative file, for when you need to print debugging information, is tiny.text.

To read words we will use the Tokenizer class in tokenizer.cs; the token() method returns each word as a string. To initialize it, use

    Tokenizer t = new Tokenizer("wirth.text");

It also happens to return punctuation and (if the punctuation isn't consistent with mini-java) the string "ILLEGAL TOKEN"; you should skip these. So to get the individual (alphabetic) words, you can do this:

    string word = t.token();
    while (word == null  || word == "ILLEGAL TOKEN" || !Char.IsLetter(word[0])) {   // why check word==null?
         if (word == null) break;
         word = t.token();
    }

The project is in three files:

countwords.cs: contains the top-level Main()
hashdict.cs: contains the hash-based counting structure
tokenizer.cs: as above

The basic idea of counting words is as follows, where getword() returns the next alphabetic word as above or, if the end of the file has been reached, null. The code shown here, however, uses the built-in Dictionary class:

    Dictionary<string,int> d = new Dictionary<string,int>();
    string word = getword();
    while (word != null) {
	if (d.ContainsKey(word)) {
	    d[word] += 1;
	} else {
	     d.Add(word,1);
	}
	word = getword();
    }
    foreach (KeyValuePair<string,int> entry in d) Console.WriteLine("{0}:  {1}", entry.Key, entry.Value);

The built-in Dictionary class defines operator[] for its purposes; you can do that, but a simpler idea is to use a conventional interface. We will also avoid using generic types like <string,int>. Here is one possibility:

bool ContainsKey(string w): returns true if w is found as a word in the dictionary.
int Get(string w): returns the int corresponding to w. You could return 0 if not found, but int cannot be null (alternatively, you could use int?). The Get() method would look somewhat like Dictionary.TryGetValue(), but would return the result as an ordinary function return rather than by using an out parameter.
void Put(string w, int count): adds or updates the count associated with the word w. Many implementations, though, prefer to keep adding a new (K,V) pair separate from updating the V of an existing (K,V) pair.
void Print(): prints each word and its corresponding count.

Be sure you don't call any methods on a null word!

Implementing a Dictionary with a Hash Table

Our hash table can't be just the words themselves; we also need a place to keep count. Therefore the objects in the table will ultimately be the following:

class StrIntPair {
    private string _word;
    private int _count;
    public StrIntPair(string w, int c) {_word = w; _count = c;}
    public string getWord() { return _word; }
    public int getCount() { return _count; }
    public void setCount(int c) { _count = c; }
}

The "buckets" of the hash table will now be of type List<StrIntPair>. The table itself will be declared like this:

List<StrIntPair>[] HTable

To initialize this, we set HTable = new List<StrIntPair>[HMAX] and then initialize each HTable[i] (why do we need this?)

for (int i=0; i<HMAX; i++) HTable[i] = new List<StrIntPair>();

The hash function will be the following (we have a little extra work to make sure it returns values >= 0):

    private int hash(string s) {
	int val = s.GetHashCode() % HMAX;
	if (val < 0) val += HMAX;
	return val;
    }

I have provided a search method for ContainsKey(). You must write Get() , Put() and Print(). The first two will have a structure similar to ContainsKey(); Print will have a structure like this:

        for (int i=0; i<HMAX; i++) {
            foreach(StrIntPair sip in HTable[i]) {
		// print sip.getWord(), sip.getCount()

            }
       }