Comp 150-001, MWF 12:35, Crown 105

Week 10

Homework 8: Moby Dick and Pride and Prejudice

How can you tell who wrote a book? Did Shakespeare write the plays attributed to him, or was it someone else with the same name?

The field of digital humanities, and, more specifically, stylometry, tries to determine who wrote what by using statistics about word usage.

Here are the four texts we will analyze:

And here is the starter file:

There are some rather sophisticated statistical techniques that have been applied to stylometry, but we're going to do something relatively simple: for each text, find the set of its N most frequent words, and then compare two texts by taking the intersection of their two sets.

If the intersection is large, that suggests similar authorship. As we will see, though, the numbers can be suggestive, but they are not necessarily compelling.

To read in the text, as one gigantic string, use this:

file = open(filename, 'r')        # 'r' for reading the file
fulltext = file.read()            # the entire file as one big string
file.close()

Moby Dick is over a megabyte, but reading it into one big string is no problem for Python. To split the text into a list of separate words (separated by "whitespace"), do this:

words = fulltext.split()

To clean up the individual words, getting rid of punctuation and then converting everything to lower case, this works. It is possible that the result is the empty string.

def cleanup(s):
    result = ""
    for ch in s:
        if ch.isalpha() or ch == "'": result += ch
    return result.lower()
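A quick check of cleanup on a few sample words (the function is repeated here so the snippet runs on its own):

```python
def cleanup(s):
    # keep letters and apostrophes, then lower-case the result
    result = ""
    for ch in s:
        if ch.isalpha() or ch == "'": result += ch
    return result.lower()

print(cleanup("Whale!"))   # whale
print(cleanup("don't,"))   # don't
print(cleanup("1851"))     # the empty string
```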

To get the relative word frequencies, we use a dictionary. The key is the word, and the value is the count. For each cleaned-up non-empty word cw (check for empties with cw == ""), do the following, where d is a new dictionary:

if cw in d: d[cw] += 1        # increment the count of word cw
else: d[cw] = 1               # new word
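Putting the splitting, cleaning, and counting steps together, the whole loop might look like this sketch (word_counts is my name for it, not part of the assignment; cleanup is repeated so the snippet is self-contained):

```python
def cleanup(s):
    result = ""
    for ch in s:
        if ch.isalpha() or ch == "'": result += ch
    return result.lower()

def word_counts(fulltext):
    d = {}                         # word -> count
    for w in fulltext.split():
        cw = cleanup(w)
        if cw == "": continue      # skip words that were all punctuation or digits
        if cw in d: d[cw] += 1     # increment the count of word cw
        else: d[cw] = 1            # new word
    return d

print(word_counts("The whale; the WHALE!"))
```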

Now we have a large dictionary d of (word, count) pairs. We need to convert it to a list of pairs, sort that list by count from largest to smallest, take the first N elements, and then drop the counts, leaving only a list of words. Here's a complete function to do this:

def topN(d, N):
    def getcount(pair):
        return pair[1]
    wlistpairs = []                                # convert dictionary to list of pairs
    for key in d:
        wlistpairs.append((key, d[key]))
    wlistpairs.sort(key=getcount, reverse=True)    # sort that list, largest count first
    wlistpairs = wlistpairs[0:N]                   # take first N
    wlist = []
    for (word, count) in wlistpairs:               # now drop the counts
        wlist.append(word)
    return wlist
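A small sanity check on a toy dictionary (topN repeated so the snippet runs on its own):

```python
def topN(d, N):
    def getcount(pair):
        return pair[1]
    wlistpairs = []                                # convert dictionary to list of pairs
    for key in d:
        wlistpairs.append((key, d[key]))
    wlistpairs.sort(key=getcount, reverse=True)    # sort that list, largest count first
    wlistpairs = wlistpairs[0:N]                   # take first N
    wlist = []
    for (word, count) in wlistpairs:               # now drop the counts
        wlist.append(word)
    return wlist

d = {"whale": 7, "the": 12, "sea": 5, "ship": 3}
print(topN(d, 2))   # ['the', 'whale']
```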

For the four texts, print out the intersection size for the two same-author pairs, and also for the four pairs of books with different authors.

It's hard to know what N to use. If it's too small, all you get is the most popular words. Try N=300 and N=500 to start.

If s is a set (or a list or a dictionary), its size can be found with len(s).

Here are two more books if you wish to add them:

Here is a rather small file for testing your code and then examining it visually:

Here is a link to a Python lab in which more sophisticated techniques were applied to try to figure out who wrote which of The Federalist Papers. Alexander Hamilton is known to have written 51, and James Madison is known to have written 14, and John Jay is known to have written 5. But that still leaves 15 unattributed.

Organizational strategy

(I'm adding this later.) Maybe the simplest approach is to focus on creating a function set_topN that takes a filename, and perhaps N (the alternative is that N is global) and returns the set of the top N words in the named file. That is, you apply each of the steps above, sequentially.
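A sketch of set_topN along these lines, applying each step in sequence (cleanup and topN are repeated from above so the snippet is self-contained):

```python
def cleanup(s):
    result = ""
    for ch in s:
        if ch.isalpha() or ch == "'": result += ch
    return result.lower()

def topN(d, N):
    pairs = []
    for key in d:
        pairs.append((key, d[key]))
    pairs.sort(key=lambda pair: pair[1], reverse=True)   # largest count first
    return [word for (word, count) in pairs[0:N]]

def set_topN(filename, N):
    file = open(filename, 'r')
    fulltext = file.read()            # the whole file as one big string
    file.close()
    d = {}                            # word -> count
    for w in fulltext.split():
        cw = cleanup(w)
        if cw != "":
            if cw in d: d[cw] += 1
            else: d[cw] = 1
    return set(topN(d, N))
```

With the texts downloaded, something like set_topN('mobydick.txt', 300) would then produce one of the sets to intersect (the filename here is a placeholder, not the assignment's actual file name).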

Once you have this, you can create the sets like this:

You can then take the intersections manually:
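In Python, the & operator on two sets gives their intersection; a tiny illustration with two stand-in sets:

```python
melville = {"the", "whale", "sea", "ship"}
austen = {"the", "sea", "garden", "ball"}
common = melville & austen      # set intersection: {'the', 'sea'}
print(len(common))              # size of the overlap
```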

How topN() works

Choosing N

map(function, list):    given a function f and a list [x1, x2, x3, x4], it returns [f(x1), f(x2), f(x3), f(x4)]. (Actually, it returns an "iterable"; to get an actual list, apply list() to the result.)
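For instance:

```python
def square(x):
    return x * x

result = map(square, [1, 2, 3, 4])   # an iterable, not yet a list
print(list(result))                  # [1, 4, 9, 16]
```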