Comp 271 Lab 5: implementing dictionaries

Comp 271 Lab 5 - Tree-based dictionaries

Goals

use a tree to implement a Map (dictionary)
Count the occurrence of each word in a file
Print the words and their occurrences in alphabetical order

Overview

The end result is again to count the occurrence of each word in the file. This time we'll use a tree as the underlying data structure.

The project is in countwords2.zip, together with three data files: wirth.text, tiny.text, and spacedoctor.text (a summary of General Relativity using just the 1000 most common English words).

To read words we will use the Tokenizer class in tokenizer.java; the token() method returns each word as a string. To initialize it, use

    Tokenizer t = new Tokenizer("wirth.text");

It also happens to return punctuation and (if the punctuation isn't consistent with mini-java) the string "ILLEGAL TOKEN"; you should skip these. So to get the individual (alphabetic) words, you can do this:

    string word = t.token();
    while (word == null  || word == "ILLEGAL TOKEN" || !Char.IsLetter(word[0])) {   // why check word==null?
         if (word == null) break;
         word = t.token();
    }

The basic idea of counting words is as follows, where getword() returns the next alphabetic word as above or, if the end of the file has been reached, null. The code shown here, however, uses the built-in HashMap class:

    HashMap<String,Integer> d = new HashMap<String,Integer>();
    ...
    String word = getword();
    while (word != null) {
	if (d.containsKey(word)) {
            int c = d.get(word);
            d.put(word, c+1);
	} else {
	     d.put(word,1);
	}
	word = getword();
    }

Your job is for the above loop to work after d is replaced with a TreeDict , which you will implement. This is a tree-based Map/Dictionary class.

Implementing a Dictionary with an Ordered Binary Tree

You are to implement the following:

boolean containsKey(String w): returns true if w is found as a word in the dictionary.
int get(string w): returns the int corresponding to w. You could return 0 if not found, but int cannot be null. The get() method should return the current count corresponding to the given word.
void put(String w, int count): adds or updates the count associated with the word w. You may, if you wish, make a distinction between put(w,c) for new entries and update(w,c) for existing entries.
void print(): prints each word and its corresponding count. This is actually done for you; it is inTraverse(), except you have to get rid of the indentation.

I've started all of these for you; typically the public version calls the private version that also has a TreeNode as a parameter. containsKey() is probably the simplest: if you get to a null TreeNode, the word is not found. If you find the word at TreeNode p, return true. Otherwise search the left or right subnodes of p, recursively. Note the use of k.compareTo(p.key()), which returns a negative value, zero, or a positive value depending on whether k<p.key(), k==p.key() or k>p.key(), respectively (but you can't use <, == or > on String objects in Java).

get() is a lot like containsKey(), except that if you find the word at TreeNode p you return the value stored there (p.value()), rather than true. For a null node, I returned -1. Again, if you're at a non-null node and the word you're looking for is not at that node, search the left or the right subtree, recursively, as appropriate. For containsKey() and get(), I added return statements necessary to make it compile; edit these out.

update() is more like rinsert(), in that the recursive version, with parameter p of type TreeNode, should never be called with p==null. The problem is that you may have to insert the new word as p's left or right child, which don't exist if p is null.

For print(), use inTraverse() but get rid of the indentation. Here is the output for tiny.text:

filename: tiny.text
a (2)
an (1)
attribute (1)
can (2)
confidence (1)
depend (1)
for (1)
is (2)
mechanism (1)
of (1)
on (1)
or (1)
organization (1)
person (1)
reliable (1)
that (3)
the (1)
trust (1)
worthy (1)
you (2)
your (1)

Because this project involves more than one file, your best bet for submission is as a .zip file.