Comp 388 Lab 5 - Tree-based lookup
Goals
- use a tree to implement a Dictionary
- Count the occurrence of each word in a file
- Print the words and their occurrences in alphabetical order
Overview
The end result is to count the occurrence of each word in the file wirth.text,
as with Lab 4. Now, however, we will build the dictionary with a tree.
As an alternative input file, here's an article by Randall Munro (of xkcd
fame) entitled The Space Doctor's
Big Idea. It is about Einstein's theory of General Relativity, and is
supposedly written using only the 1,000 most common English words. How many
unique words does it contain? (I've removed contractions.)
To read words we will use the Tokenizer class in tokenizer.cs;
the token() method returns each word as a string. To initialize it, use
Tokenizer t = new Tokenizer("wirth.text");
It also happens to return punctuation and (if the punctuation isn't
consistent with mini-java) the string "ILLEGAL TOKEN"; you should skip
these. So to get the individual (alphabetic) words, you can do this:
string word = t.token();
while (word == null || word == "ILLEGAL TOKEN" || !Char.IsLetter(word[0])) { // why check word==null?
if (word == null) break;
word = t.token();
}
The project's files are the following:
The previous lab outlined the basic idea of counting words. Now we're doing
it with a different underlying data structure.
The strtree.cs file contains a class strtree.cs. Nodes have a string field
called data_ and an integer
(count) field called wordcount_.
It also contains the following methods:
- inTraverse(): inorder traversal of all the strings
- search(string s): finds if the given string is in the tree
- insert(string s): inserts a new string in the correct position. If the
string is already there, do nothing.
You will have to build an interface more suitable for counting
words. Here is a suggestion.
- bool ContainsKey(string w): returns true if w is found as a word in
the dictionary. This would be based on search(s).
- int Get(string w): returns the int (representing count) corresponding
to w. This would also be based on search(s), but returning the count_
field instead of true/false. You could return 0 if not found,
but int cannot be null (alternatively, you could use int?).
- void Put(string w, int count): adds or updates the count associated
with the word w. This might be based on insert(s), now incrementing the
count if s is found (rather than doing nothing). Alternatively, many
implementations prefer to keep adding a new (K,V) pair
separate from updating the V of an existing (K,V) pair; if you
wanted to go that route, you should have separate insert(s) (inserts
with count=1) and increment(s) (increments existing count).
- As an alternative to the three methods above, you can write one
method updateCount(string s), based on insert(s). It would search the
tree for string s; if it does not find it then the
string would be inserted with a count of 1, while if it does
find it then its count would be incremented.
- void Print(): prints each word (in order, since you used a tree!) and
its corresponding count. This will be based on inTraverse().
You do not have to make your dictionary a generic one,
with <K,V> types.
Your Print() method should print the words in alphabetical order.
This is the main reason for the ordered-tree implementation; last week's
hash-dictionary implementation didn't maintain the words in any sensible
order.
Be sure you don't call any methods on a null word!