Comp 388 Lab 5: implementing dictionaries

Comp 388 Lab 5 - Tree-based lookup

Goals

use a tree to implement a Dictionary
Count the occurrence of each word in a file
Print the words and their occurrences in alphabetical order

Overview

The end result is to count the occurrence of each word in the file wirth.text, as with Lab 4. Now, however, we will build the dictionary with a tree.

As an alternative input file, here's an article by Randall Munro (of xkcd fame) entitled The Space Doctor's Big Idea. It is about Einstein's theory of General Relativity, and is supposedly written using only the 1,000 most common English words. How many unique words does it contain? (I've removed contractions.)

To read words we will use the Tokenizer class in tokenizer.cs; the token() method returns each word as a string. To initialize it, use

    Tokenizer t = new Tokenizer("wirth.text");

It also happens to return punctuation and (if the punctuation isn't consistent with mini-java) the string "ILLEGAL TOKEN"; you should skip these. So to get the individual (alphabetic) words, you can do this:

    string word = t.token();
    while (word == null  || word == "ILLEGAL TOKEN" || !Char.IsLetter(word[0])) {   // why check word==null?
         if (word == null) break;
         word = t.token();
    }

The project's files are the following:

countwords2.cs contains the top-level Main
strtree.cs implements an ordered binary tree of strings
tokenizer.cs as above (and as before)

The previous lab outlined the basic idea of counting words. Now we're doing it with a different underlying data structure.

The strtree.cs file contains a class strtree.cs. Nodes have a string field called data_ and an integer (count) field called wordcount_. It also contains the following methods:

inTraverse(): inorder traversal of all the strings
search(string s): finds if the given string is in the tree
insert(string s): inserts a new string in the correct position. If the string is already there, do nothing.

You will have to build an interface more suitable for counting words. Here is a suggestion.

bool ContainsKey(string w): returns true if w is found as a word in the dictionary. This would be based on search(s).
int Get(string w): returns the int (representing count) corresponding to w. This would also be based on search(s), but returning the count_ field instead of true/false. You could return 0 if not found, but int cannot be null (alternatively, you could use int?).
void Put(string w, int count): adds or updates the count associated with the word w. This might be based on insert(s), now incrementing the count if s is found (rather than doing nothing). Alternatively, many implementations prefer to keep adding a new (K,V) pair separate from updating the V of an existing (K,V) pair; if you wanted to go that route, you should have separate insert(s) (inserts with count=1) and increment(s) (increments existing count).
As an alternative to the three methods above, you can write one method updateCount(string s), based on insert(s). It would search the tree for string s; if it does not find it then the string would be inserted with a count of 1, while if it does find it then its count would be incremented.

void Print(): prints each word (in order, since you used a tree!) and its corresponding count. This will be based on inTraverse().

You do not have to make your dictionary a generic one, with <K,V> types.

Your Print() method should print the words in alphabetical order. This is the main reason for the ordered-tree implementation; last week's hash-dictionary implementation didn't maintain the words in any sensible order.

Be sure you don't call any methods on a null word!