Week of March 28
Continue with Poisson distribution: what is it good for?
Poisson and the birthday problem
Hashing
Python's built-in hash function is hash().
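For example (a quick sketch; the actual values differ from run to run, since Python randomizes string hashing per process unless PYTHONHASHSEED is set):

print(hash('apple'))        # some large (possibly negative) integer
print(hash('apple') % 100)  # reduced to a table slot between 0 and 99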
def readveg(fname):
    # read a file and return its list of whitespace-separated words
    file = open(fname, 'r')
    fulltext = file.read()
    words = fulltext.split()
    return words

flist = readveg('fruits.text')
# n is the size of the table, wlist is a list of words
def makehash(n, wlist):
    htable = [[] for _ in range(n)]    # n independent empty chains
    for w in wlist:
        h = hash(w) % n                # slot index for this word
        htable[h].append(w)
        # print('Adding word "{}" to position {}'.format(w, h))
    return htable
def printhash(ht):
    # print only the nonempty slots
    for i in range(len(ht)):
        lis = ht[i]
        if lis: print('ht[{}] = {}'.format(i, lis))
def histogram(ht):
    # buckets[k] counts how many slots hold exactly k words;
    # trailing zero counts are trimmed off before returning
    n = len(ht)
    buckets = [0] * n
    for lis in ht:
        buckets[len(lis)] += 1
    pos = n - 1
    while pos > 0 and buckets[pos] == 0:
        pos -= 1
    return buckets[0:pos + 1]
n = 100
ht = makehash(n, flist)
print(histogram(ht))
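This is where the Poisson distribution comes in: with len(flist) words dropped into n slots, the length of each slot's chain is approximately Poisson with rate λ = len(flist)/n, so about n·P(X=k) slots should hold exactly k words. A quick check (a sketch; it reuses flist and n from above):

import math
lam = len(flist) / n
expected = [n * math.pow(lam, k) * math.exp(-lam) / math.factorial(k) for k in range(8)]
print([round(e, 1) for e in expected])   # compare with print(histogram(ht)) above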
Poisson example: choose N points uniformly in the interval [0,1]. The uniform distribution (all locations equally likely) is not really Poisson, but for N large, the rate is approximately 1 point per subinterval of length 1/N. This is never exactly Poisson, since the counts in disjoint subintervals are not independent: for example, we can never have more than N points in any subinterval. Still, the approximation is good. This might be used, for example, to model telephone calls or network connections.
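A quick simulation of this (a sketch; N = 10,000 points is chosen just for illustration): drop N uniform points into [0,1], count how many land in each of the N subintervals of length 1/N, and compare the distribution of counts with Poisson(λ = 1).

import random, math
N = 10000
counts = [0] * N
for _ in range(N):
    counts[int(random.random() * N)] += 1          # which subinterval this point lands in
observed = [counts.count(k) for k in range(5)]     # how many subintervals got exactly k points
poisson = [round(N * math.exp(-1) / math.factorial(k)) for k in range(5)]
print(observed)    # close to the Poisson row below
print(poisson)     # [3679, 3679, 1839, 613, 153]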
Poisson example for significance testing
Suppose that 30% of the people with a certain illness respond to a drug. In a trial of 40 patients, 40% responded to a different drug. Is it better?
Let's ask this question: in a trial of N=40 patients, how likely is it that 40% or more (that is, 16 or more of the 40 patients) will respond to the original drug? 30% is, after all, only the average.
Let's set this up as a Poisson distribution with p = 0.3, and so λ = pN = 12; that is, the rate is 12 responses per 40 patients on average. Then P(X = k) = λ^k e^(-λ) / k!, and P(X = 16) works out to around 0.054. Small, but not impossible. But let's add this up for X = 16 to, say, X = 30:
import math

def fact(n):
    # recursive factorial; returns 1 for n <= 1
    if n <= 1: return 1
    return n * fact(n - 1)

def P(k, lam):
    # Poisson probability P(X = k) for rate lam
    return math.pow(lam, k) * math.exp(-lam) / fact(k)

l = list(map(lambda k: P(k, 12), range(16, 31)))
so that l is the list
[0.054293, 0.0383247, 0.0255498, 0.0161367, 0.0096820, 0.0055326, 0.00301778, 0.0015745, 0.0007872, 0.0003779, 0.0001744, 7.751e-05, 3.3220e-05, 1.374e-05, 5.498e-06]
and sum(l) comes to 0.15558.
That means there's almost a 16% chance of seeing a 40% response or better with the original drug. So, loosely speaking, there's a 16% chance we would see a result this good even if the new drug is no better at all.
This works, but the use of the Poisson distribution in this way is uncommon. A much more common approach is to compute the standard deviation of the Poisson distribution and see how many standard deviations high the X = 16 result is. The standard deviation of a Poisson distribution is sqrt(λ) = sqrt(12) ≈ 3.46. At 16 successes we are 16 - 12 = 4 above the mean, or 4/3.46 ≈ 1.16 standard deviations high.
We can use the so-called erf(x) function, which is the probability that a value with mean 0 and standard deviation 1/sqrt(2) is less than x in absolute value. We want the probability of a value being less than 1.16 in absolute value, with std dev 1; dividing by sqrt(2), we're asking for erf(1.16/1.414) = erf(0.82) ≈ 0.75. But this is the probability of being neither too high nor too low. The probability of being too high or too low is then 1 - 0.75 = 0.25, and the probability of being too high is half that, or about 0.12. That's 12%, a little below the 16% from the exact Poisson sum (the normal approximation slightly undercounts the upper tail here), but still nowhere near significance.
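Python's math module has erf built in, so we can check this directly (a sketch using the numbers above):

import math
lam = 12
sigma = math.sqrt(lam)                   # standard deviation, about 3.46
z = (16 - lam) / sigma                   # about 1.16 standard deviations high
p_inside = math.erf(z / math.sqrt(2))    # about 0.75
p_too_high = (1 - p_inside) / 2          # about 0.12
print(p_too_high)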
Generally speaking, we need to repeat our drug test with a larger sample. It is possible to calculate a sample size such that, if the observed response rate is again 40%, the probability of that being due to chance is below 5% (the usual threshold for significance).
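One way to do this, sticking with the Poisson model (a sketch; the step size of 10 patients and the stopping rule are just for illustration):

import math

def poisson_tail(k, lam):
    # P(X >= k) for a Poisson random variable with rate lam
    return 1 - sum(math.pow(lam, i) * math.exp(-lam) / math.factorial(i) for i in range(k))

# find a trial size N where a 40% response rate would be significant at the 5% level,
# assuming the true response rate is still 30% (so lam = 0.3 * N)
N = 40
while poisson_tail(math.ceil(0.4 * N), 0.3 * N) >= 0.05:
    N += 10
print(N)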
Königsberg bridges; distill the map as a graph. Note the potential issue of multiple edges between the same pair of vertices.
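One simple representation that keeps the parallel bridges is a plain edge list (a sketch; the vertex labels A-D for the four land masses are my own):

bridges = [('A', 'C'), ('A', 'C'),               # two bridges between one bank and the island
           ('B', 'C'), ('B', 'C'),               # two bridges between the other bank and the island
           ('A', 'D'), ('B', 'D'), ('C', 'D')]   # the remaining three bridges

degree = {}
for u, v in bridges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(degree)   # every vertex has odd degree, which is why no walk crosses each bridge exactly once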
Zork graph (playclassic.games/games/adventure-dos-games-online/play-zork-great-underground-empire-online)
Basic graph terminology