Week of March 28

Continue with the Poisson distribution: what is it good for?

**Poisson and the birthday problem**

**Hashing**

Python's built-in hash function is hash().

def readveg(fname):
    file = open(fname, 'r')
    fulltext = file.read()
    words = fulltext.split()
    return words

flist = readveg('fruits.text')

# n is size of table, wlist is list of words
def makehash(n, wlist):
    # n independent empty lists; note that [[]]*n would instead
    # make n references to one shared list
    htable = [[] for i in range(n)]
    for w in wlist:
        h = hash(w) % n
        htable[h] = htable[h] + [w]
        # print('Adding word "{}" to position {}'.format(w, h))
    return htable

def printhash(ht):
    for i in range(len(ht)):
        lis = ht[i]
        if lis: print('ht[{}] = {}'.format(i, lis))

def histogram(ht):
    n = len(ht)
    buckets = [0]*n   # buckets[k] counts slots holding exactly k words
    for lis in ht:
        buckets[len(lis)] += 1
    # trim trailing zeros
    pos = n-1
    while pos > 0 and buckets[pos] == 0:
        pos -= 1
    return buckets[0:pos+1]

n = 100
ht = makehash(n, flist)
print(histogram(ht))
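If hashing scatters words into slots independently and uniformly, the bucket sizes should be approximately Poisson with λ = (number of words)/n. A quick check is sketched below; the 200 random 8-letter "words" are made up as a stand-in for the real flist, since the fruit list itself isn't available here.

```python
import math
import random
import string

def poisson_pmf(k, lam):
    # P(X = k) for a Poisson random variable with rate lam
    return lam**k * math.exp(-lam) / math.factorial(k)

random.seed(1)
n = 100   # hash table size, as above
# 200 random 8-letter "words" standing in for flist
words = [''.join(random.choices(string.ascii_lowercase, k=8))
         for _ in range(200)]

sizes = [0] * n
for w in words:
    sizes[hash(w) % n] += 1

lam = len(words) / n   # expected words per slot = 2.0
for k in range(5):
    observed = sizes.count(k) / n   # fraction of slots with exactly k words
    print(k, round(observed, 2), round(poisson_pmf(k, lam), 2))
```

The observed fractions should land near the Poisson probabilities e^{-2}·2^k/k!, though the exact counts vary from run to run because Python randomizes string hashing between interpreter runs.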

**Poisson example**: choosing N points in the interval [0,1]
chosen *uniformly* (all locations equally likely). This is not really
Poisson, but for N large, the rate is approximately 1 point for every
subinterval of length 1/N. The count in a subinterval is never *exactly*
Poisson, because the total number of points is fixed at N: we can, for
example, never have more than N points in any subinterval. Still, the
approximation is good. This might be used, for example, to model telephone
calls, or network connections.
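A quick simulation of this (a sketch; the value of N and the seed are arbitrary): drop N uniform points in [0,1], count how many land in each of the N subintervals of length 1/N, and compare the fraction of subintervals holding k points with the Poisson(1) probability e^{-1}/k!.

```python
import math
import random

random.seed(0)
N = 10000
points = [random.random() for _ in range(N)]

# count points in each of the N subintervals [i/N, (i+1)/N)
counts = [0] * N
for x in points:
    counts[min(int(x * N), N - 1)] += 1

# fraction of subintervals holding k points vs. the Poisson(1) prediction
for k in range(5):
    observed = counts.count(k) / N
    predicted = math.exp(-1) / math.factorial(k)   # e^{-1}/k!
    print(k, round(observed, 3), round(predicted, 3))
```

The agreement is already close at N = 10000, even though the true distribution of each count is Binomial(N, 1/N), not Poisson.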

**Poisson example** for significance testing

Suppose that 30% of the people with a certain illness respond to a drug. In a trial of 40 patients, 40% responded to a different drug. Is it better?

Let's ask this question: in a trial of N=40 patients, how likely is it
that 40% will respond to the **original** drug? 30% is,
after all, only the average.

Let's set this up as a Poisson distribution with p = 0.3, and so λ = pN = 12; that is, the rate is 12 responses per 40 patients on average. Since 40% of 40 is 16 responses, we want

P(X=16) = λ^{16}e^{-λ}/16!

This works out to around 0.054. Small, but
not impossible. But let's add this up for X=16 up to, say, X=30:

import math

def fact(n):
    if n <= 1: return 1   # n <= 1 so that fact(0) also works
    return n*fact(n-1)

def P(k, lam):
    return math.pow(lam, k)*math.exp(-lam)/fact(k)

list(map(lambda k: P(k, 12), range(16, 31)))

returns [0.054293, 0.0383247, 0.0255498, 0.0161367, 0.0096820, 0.0055326, 0.00301778, 0.0015745, 0.0007872, 0.0003779, 0.0001744, 7.751e-05, 3.3220e-05, 1.374e-05, 5.498e-06]

sum(l) # where l is the list above

0.15558

That means there's almost a 16% chance of getting a 40% response or better with the original drug. So, loosely speaking, there's about a 16% chance of seeing a result this good even if our new drug is no better at all.
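For comparison, the exact model for this trial is Binomial(N=40, p=0.3) rather than Poisson, since each of the 40 patients either responds or doesn't. The sketch below (using math.comb, which is not part of the notes above) computes both tail probabilities:

```python
import math

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return lam**k * math.exp(-lam) / math.factorial(k)

n, p = 40, 0.3
lam = n * p   # 12

binom_tail = sum(binom_pmf(k, n, p) for k in range(16, n + 1))
poisson_tail = sum(poisson_pmf(k, lam) for k in range(16, 31))
print(round(binom_tail, 4), round(poisson_tail, 4))
```

The binomial tail comes out somewhat smaller than the Poisson tail of about 0.156, because the binomial is more tightly concentrated around its mean (its variance is np(1−p) = 8.4 rather than λ = 12).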

This works, but the use of the Poisson distribution in this way is
uncommon. A much more common approach is to calculate the **standard
deviation** of the Poisson distribution, and see how many
standard deviations high the X=16 result is. The standard deviation
of a Poisson distribution is sqrt(λ) = sqrt(12) ≈ 3.46. At 16 successes,
we are (16 − 12)/3.46 = 4/3.46 ≈ 1.16 standard deviations high.

We can use the so-called erf(x) function, which is the probability of an
event with mean 0 and standard deviation 1/sqrt(2) being less than x in
absolute value. We want the probability of a value being less than 1.16,
with std dev 1; we can divide by sqrt(2) and so we're asking for
erf(1.16/1.414) = erf(0.82) ≈ 0.75. But this is the probability of not
being too high *or* too low. The probability of being too high or
too low is then 1 − 0.75 = 0.25, and the probability of being too high is
half that, or about 0.12. That's 12%, in the same ballpark as the 16%
above; the gap comes from approximating a discrete count with a continuous
curve. (A continuity correction, treating "16 or more" as "above 15.5",
gives 3.5/3.46 ≈ 1.01 standard deviations and a tail probability of about
0.156, very close to the Poisson sum.)
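This arithmetic can be checked directly with Python's built-in math.erf (a sketch; note that 16 responses is 16 − 12 = 4 above the mean):

```python
import math

lam = 12
sigma = math.sqrt(lam)        # Poisson std dev = sqrt(lambda), about 3.46
z = (16 - lam) / sigma        # about 1.16 standard deviations high

# P(|Z| < z) for a standard normal Z is erf(z / sqrt(2))
two_sided = math.erf(z / math.sqrt(2))
upper_tail = (1 - two_sided) / 2
print(round(upper_tail, 3))
```

The upper tail comes out to roughly 0.12, somewhat below the 0.156 from summing the Poisson terms directly, since the normal curve ignores the discreteness of the counts.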

Generally speaking, we need to repeat our drug test with a larger sample. It is possible to calculate a sample size such that, if the response rate is again 40%, the probability of that being due to chance falls below 5% (the usual threshold for significance).

Königsberg bridges; distill the map as a graph. Note the potential issue of multiple edges between the same pair of vertices (this makes it a multigraph).
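The distilled graph can be sketched in Python as follows (the vertex names A through D for the four land masses are my labels, not from the notes). Storing the edges as a list, rather than a set, is what allows multiple edges between the same pair of vertices:

```python
from collections import Counter

# The seven bridges of Königsberg as a multigraph: vertices are the
# four land masses; parallel edges are allowed, so edges are a list.
bridges = [('A', 'B'), ('A', 'B'), ('A', 'C'), ('A', 'C'),
           ('A', 'D'), ('B', 'D'), ('C', 'D')]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

print(dict(degree))   # {'A': 5, 'B': 3, 'C': 3, 'D': 3}

# A walk crossing every bridge exactly once needs 0 or 2 vertices of
# odd degree; here all four are odd, so (as Euler showed) none exists.
odd = [v for v, d in degree.items() if d % 2 == 1]
print(len(odd))       # 4
```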

Zork graph (playclassic.games/games/adventure-dos-games-online/play-zork-great-underground-empire-online)

Basic graph terminology

- graph (simple graph)
- multigraph
- edges
- vertices
- adjacent vertices
- connected graph
- complete graph
- degree of vertex
- subgraph, induced subgraph
- bipartite graph
- isomorphic graphs
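Several of these terms can be illustrated with a small adjacency-list representation (a sketch; the example graph itself is made up):

```python
from collections import deque

# A small simple graph as an adjacency list
graph = {
    'a': ['b', 'c'],
    'b': ['a', 'c'],
    'c': ['a', 'b', 'd'],
    'd': ['c'],
}

# degree of a vertex = number of adjacent vertices
degrees = {v: len(nbrs) for v, nbrs in graph.items()}
print(degrees)   # {'a': 2, 'b': 2, 'c': 3, 'd': 1}

# the degrees always sum to twice the number of edges
num_edges = sum(degrees.values()) // 2
print(num_edges)   # 4

# connected graph: every vertex is reachable from any start vertex (BFS)
def is_connected(g):
    start = next(iter(g))
    seen = {start}
    q = deque([start])
    while q:
        for w in g[q.popleft()]:
            if w not in seen:
                seen.add(w)
                q.append(w)
    return len(seen) == len(g)

print(is_connected(graph))   # True
```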