Week of March 28
Continue with Poisson distribution: what is it good for?
Poisson and the birthday problem
Hashing
Python's built-in hash function is hash().
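For example (a quick sketch; the actual values differ from run to run, since Python randomizes string hashing per process unless PYTHONHASHSEED is set):

print(hash('apple'))        # some large (possibly negative) integer
print(hash('apple') % 100)  # reduced to a table slot between 0 and 99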
def readveg(fname):
    # read a file and return its list of whitespace-separated words
    file = open(fname, 'r')
    fulltext = file.read()
    words = fulltext.split()
    return words

flist = readveg('fruits.text')
# n is the size of the table, wlist is a list of words
def makehash(n, wlist):
    htable = [[] for _ in range(n)]    # n independent empty chains
    for w in wlist:
        h = hash(w) % n                # slot index for this word
        htable[h].append(w)
        # print('Adding word "{}" to position {}'.format(w, h))
    return htable
def printhash(ht):
    # print only the nonempty slots
    for i in range(len(ht)):
        lis = ht[i]
        if lis: print('ht[{}] = {}'.format(i, lis))
def histogram(ht):
    # buckets[k] counts how many slots hold exactly k words;
    # trailing zero counts are trimmed off before returning
    n = len(ht)
    buckets = [0] * n
    for lis in ht:
        buckets[len(lis)] += 1
    pos = n - 1
    while pos > 0 and buckets[pos] == 0:
        pos -= 1
    return buckets[0:pos + 1]
n = 100
ht = makehash(n, flist)
print(histogram(ht))
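This is where the Poisson distribution comes in: with len(flist) words dropped into n slots, the length of each slot's chain is approximately Poisson with rate λ = len(flist)/n, so about n·P(X=k) slots should hold exactly k words. A quick check (a sketch; it reuses flist and n from above):

import math
lam = len(flist) / n
expected = [n * math.pow(lam, k) * math.exp(-lam) / math.factorial(k) for k in range(8)]
print([round(e, 1) for e in expected])   # compare with print(histogram(ht)) above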
Poisson example: choose N points uniformly in the interval [0,1]. The uniform distribution (all locations equally likely) is not really Poisson, but for N large, the rate is approximately 1 point per subinterval of length 1/N. This is never exactly Poisson, since the counts in disjoint subintervals are not independent: for example, we can never have more than N points in any subinterval. Still, the approximation is good. This might be used, for example, to model telephone calls or network connections.
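A quick simulation of this (a sketch; N = 10,000 points is chosen just for illustration): drop N uniform points into [0,1], count how many land in each of the N subintervals of length 1/N, and compare the distribution of counts with Poisson(λ = 1).

import random, math
N = 10000
counts = [0] * N
for _ in range(N):
    counts[int(random.random() * N)] += 1          # which subinterval this point lands in
observed = [counts.count(k) for k in range(5)]     # how many subintervals got exactly k points
poisson = [round(N * math.exp(-1) / math.factorial(k)) for k in range(5)]
print(observed)    # close to the Poisson row below
print(poisson)     # [3679, 3679, 1839, 613, 153]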
Poisson example for significance testing
Suppose that 30% of the people with a certain illness respond to a drug. In a trial of 40 patients, 40% responded to a different drug. Is it better?
Let's ask this question: in a trial of N=40 patients, how likely is it that 40% or more (that is, 16 or more of the 40 patients) will respond to the original drug? 30% is, after all, only the average.
Let's set this up as a Poisson distribution with p = 0.3, and so λ = pN = 12; that is, the rate is 12 responses per 40 patients on average. Then P(X = k) = λ^k e^(-λ) / k!, and P(X = 16) works out to around 0.054. Small, but not impossible. But let's add this up for X = 16 to, say, X = 30:
import math

def fact(n):
    # recursive factorial; returns 1 for n <= 1
    if n <= 1: return 1
    return n * fact(n - 1)

def P(k, lam):
    # Poisson probability P(X = k) for rate lam
    return math.pow(lam, k) * math.exp(-lam) / fact(k)

l = list(map(lambda k: P(k, 12), range(16, 31)))
so that l is the list
[0.054293, 0.0383247, 0.0255498, 0.0161367, 0.0096820, 0.0055326, 0.00301778, 0.0015745, 0.0007872, 0.0003779, 0.0001744, 7.751e-05, 3.3220e-05, 1.374e-05, 5.498e-06]
and sum(l) comes to 0.15558.
That means there's almost a 16% chance of seeing a 40% response or better with the original drug. So, loosely speaking, there's a 16% chance we would see a result this good even if the new drug is no better at all.
This works, but the use of the Poisson distribution in this way is uncommon. A much more common approach is to compute the standard deviation of the Poisson distribution and see how many standard deviations high the X = 16 result is. The standard deviation of a Poisson distribution is sqrt(λ) = sqrt(12) ≈ 3.46. At 16 successes we are 16 - 12 = 4 above the mean, or 4/3.46 ≈ 1.16 standard deviations high.
We can use the so-called erf(x) function, which is the probability that a value with mean 0 and standard deviation 1/sqrt(2) is less than x in absolute value. We want the probability of a value being less than 1.16 in absolute value, with std dev 1; dividing by sqrt(2), we're asking for erf(1.16/1.414) = erf(0.82) ≈ 0.75. But this is the probability of being neither too high nor too low. The probability of being too high or too low is then 1 - 0.75 = 0.25, and the probability of being too high is half that, or about 0.12. That's 12%, a little below the 16% from the exact Poisson sum (the normal approximation slightly undercounts the upper tail here), but still nowhere near significance.
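Python's math module has erf built in, so we can check this directly (a sketch using the numbers above):

import math
lam = 12
sigma = math.sqrt(lam)                   # standard deviation, about 3.46
z = (16 - lam) / sigma                   # about 1.16 standard deviations high
p_inside = math.erf(z / math.sqrt(2))    # about 0.75
p_too_high = (1 - p_inside) / 2          # about 0.12
print(p_too_high)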
Generally speaking, we need to repeat our drug test with a larger sample. It is possible to calculate a sample size such that, if the observed response rate is again 40%, the probability of that being due to chance is below 5% (the usual threshold for significance).
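One way to do this, sticking with the Poisson model (a sketch; the step size of 10 patients and the stopping rule are just for illustration):

import math

def poisson_tail(k, lam):
    # P(X >= k) for a Poisson random variable with rate lam
    return 1 - sum(math.pow(lam, i) * math.exp(-lam) / math.factorial(i) for i in range(k))

# find a trial size N where a 40% response rate would be significant at the 5% level,
# assuming the true response rate is still 30% (so lam = 0.3 * N)
N = 40
while poisson_tail(math.ceil(0.4 * N), 0.3 * N) >= 0.05:
    N += 10
print(N)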
Königsberg bridges; distill the map as a graph. Note the potential issue of multiple edges between the same pair of vertices.
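One simple representation that keeps the parallel bridges is a plain edge list (a sketch; the vertex labels A-D for the four land masses are my own):

bridges = [('A', 'C'), ('A', 'C'),               # two bridges between one bank and the island
           ('B', 'C'), ('B', 'C'),               # two bridges between the other bank and the island
           ('A', 'D'), ('B', 'D'), ('C', 'D')]   # the remaining three bridges

degree = {}
for u, v in bridges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(degree)   # every vertex has odd degree, which is why no walk crosses each bridge exactly once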
Zork graph (playclassic.games/games/adventure-dos-games-online/play-zork-great-underground-empire-online)
Basic graph terminology