Probability

I don't have a great textbook source for this. There's some material in Lovasz & Vesztergombi on pp 51-54, and Aspnes 224-??. The math starts getting pretty dense, though, after page 235.

Typically, we start with a set of all possible outcomes, which has probability 1.0. Various subsets have their own probabilities. If A and B are disjoint sets, then P(A∪B) = P(A) + P(B). In general, P(A∪B) = P(A) + P(B) - P(A∩B). Assuming the set of all possible outcomes is finite, these rules imply that all probabilities are determined by listing the probabilities of the individual outcomes.

Often, if there are multiple outcomes, it is reasonable to assume that all outcomes are equally likely (though this is definitely not always the case). For example, with coin tosses, if we assume P(Heads) = P(Tails), then each is 1/2, because we know they are equal and their sum is 1.0. For a single uniform six-sided die, P(1) = P(2) = ... = P(6) = 1/6. For drawing a playing card from a well-shuffled deck, any given card has probability 1/52 of being drawn.

In these notes I will outline some basic facts about probability that are often used in computer science.

The field of statistics is, loosely speaking, analysis of data to determine the probability that a given result could be attributed to chance. The lower that probability, the more likely we are to believe the result.

Poker

Each card has a rank 2,3,4,5,6,7,8,9,10,J,Q,K,A, and a suit. I'm going to assume, for simplicity, that an ace is always high; that is, (A,2,3,4,5) is not a straight.

We will calculate some five-card poker-hand probabilities. There are (52 choose 5) = 2,598,960 hands in all.

How many ways can you get a straight flush (5 cards in sequence, one suit)? There are 9 choices for the rank of the lowest card [2 through 10], times 4 possible suits, which gives 9*4 = 36 hands. (Here we are explicitly using the hypothesis that (A,2,3,4,5) is not a straight.)

How many ways can you get four-of-a-kind? 13 ranks for the four, times 12x4 possibilities for the fifth card, is 624.

How many ways can you get a full house (three of a kind plus two of another kind)? For the three cards of the same rank, we have 13 possible ranks * 4 choices for the odd suit (the suit not included in the three) = 52. For the two-of-a-kind, we now have 12 possible ranks * (4 choose 2) suit choices (choose 2 suits out of 4) = 72. This is a grand total of 3744.

How many ways can you draw a flush (5 cards, same suit): (13 choose 5) for one suit, times 4 suits, = 1287 * 4 = 5148.

How many ways can you draw a straight (5 cards in sequence)? 9 choices for the rank of the lowest card, times 4⁵ choices for the suits of the 5 cards, is 9*1024 = 9216.

Three of a kind? Four choices of the odd suit, times 13 values. There are now two more cards that have to have different ranks. The number of combinations of the two ranks is (12 choose 2), and there are 4*4 choices for the suits. So that's 4*13*(12 choose 2) * 16 = 54912.

Two pair? There are (13 choose 2) ways to choose the two ranks. For each rank, we can choose suits in (4 choose 2) ways. There are now 11*4 possibilities for the fifth card, for 78*6*6*44 = 123,552 choices.
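The counts above can be double-checked with a few lines of Python. This is only a sketch of the arithmetic in this section; the dictionary name counts and its labels are illustrative, and it keeps the same assumption that an ace is always high. As in the text, the flush and straight counts include the straight flushes.

```python
from math import comb

# Hand counts under the notes' assumption that an ace is always high,
# so there are 9 possible lowest ranks (2 through 10) for a straight.
total_hands = comb(52, 5)

counts = {
    "straight flush": 9 * 4,
    "four of a kind": 13 * (12 * 4),
    "full house": (13 * 4) * (12 * comb(4, 2)),
    "flush": comb(13, 5) * 4,            # includes straight flushes
    "straight": 9 * 4**5,                # includes straight flushes
    "three of a kind": 4 * 13 * comb(12, 2) * 4 * 4,
    "two pair": comb(13, 2) * comb(4, 2) ** 2 * (11 * 4),
}

# Print each count and the corresponding probability.
for name, count in counts.items():
    print(f"{name:16s} {count:7d}  {count / total_hands:.6f}")
```

Dividing each count by total_hands gives the probability of drawing that kind of hand.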

Multiple choices

Let's look at the probability distribution of the result of rolling two dice. Outcomes will be represented as an ordered pair, e.g. (3,5). Note that we are assuming the two dice are distinct; that is, we can view the roll as rolling the first die and then the second die. This is a real thing: there is only one way to roll two 2's, but there are two ways to roll a 2 and a 3, so the set {2,3} is indeed twice as likely as getting two 2's.

Now let's look at the sum of the two dice. There are 36 outcomes. How many outcomes yield a total of 2? Just (1,1). How many outcomes yield a 3? (1,2) and (2,1). For a 4, there are three outcomes: (1,3), (2,2), (3,1). In general, for sum N<=7 there are N-1 outcomes: (1,N-1) through (N-1,1). For N>7, the number of outcomes is the same as for 14 - N; for example, a sum of 8 has the same 5 outcomes as a sum of 6.

The probability of getting a 4 is thus 3/36. Here is a table of all the probabilities.

Number	2	3	4	5	6	7	8	9	10	11	12
Probability	1/36	2/36	3/36	4/36	5/36	6/36	5/36	4/36	3/36	2/36	1/36
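The table can be reproduced by brute force: list all 36 ordered rolls and count the totals. The sketch below is one way to do that; the names rolls, sum_counts and p_sum are made up for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes (first die, second die).
rolls = list(product(range(1, 7), repeat=2))

# Count how many ordered outcomes produce each total.
sum_counts = Counter(a + b for a, b in rolls)

def p_sum(total):
    """Probability of a given total, as an exact fraction."""
    return Fraction(sum_counts[total], len(rolls))

for total in range(2, 13):
    print(total, f"{sum_counts[total]}/36")
```

Note that Fraction reduces automatically, so p_sum(4) prints as 1/12 rather than 3/36.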

How about the probability of rolling two dice and getting a 4 on at least one die? Again, the order of the dice is important. To get a 4 on the first die we can roll one of the six outcomes (4,1), (4,2), (4,3), (4,4), (4,5), or (4,6). To get a 4 on the second die, but not the first (which we have already counted), we can roll (1,4), (2,4), (3,4), (5,4), or (6,4). Note that we only count (4,4) once. There are 11 outcomes, for a probability of 11/36.

If I have two children, and one is a boy, what is the probability the other is a boy?

Here the real question is what does probability mean. Are we talking about my children? That result is fixed (one boy and one girl, btw). Are we talking about the frequentist approach? That is, we sample 4,000 people who have two children, and discard the data from the ~1,000 who have two girls. This is the "normal" approach to probability. Or are we talking about the Bayesian (bayz'-e-in) approach, where probability reflects the strength of your belief? That approach might apply to an individual with two children where you simply do not know the full data.

If I flip two coins and at least one is a head, what is the probability the other is also a head?

Here is the frequentist analysis. There are four outcomes: (H,H), (H,T), (T,H), (T,T). Knowing that there is at least one head eliminates (T,T). The other three are equally likely, and so there is 1 chance in 3 that we have two heads (H,H).

How about the probability that, if I roll two dice and one is a 4, the other is also a 4? Here we have our 11 ways to roll two dice and get at least one 4. There is only one outcome that represents two 4's. So the probability is 1/11.
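Both the two-coins and the two-dice answers come from the same counting recipe: throw away the outcomes ruled out by what we know, then count within what is left. Here is a small sketch of that recipe; the helper name conditional is hypothetical, and it assumes all outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

def conditional(outcomes, event, given):
    """P(event | given) over equally likely outcomes, by counting."""
    kept = [o for o in outcomes if given(o)]
    hits = [o for o in kept if event(o)]
    return Fraction(len(hits), len(kept))

coins = list(product("HT", repeat=2))
dice = list(product(range(1, 7), repeat=2))

# Two coins: both heads, given at least one head.
p_two_heads = conditional(coins, lambda o: o == ("H", "H"), lambda o: "H" in o)

# Two dice: both 4, given at least one 4.
p_two_fours = conditional(dice, lambda o: o == (4, 4), lambda o: 4 in o)

# Unconditional probability of at least one 4.
p_some_four = Fraction(sum(4 in o for o in dice), len(dice))

print(p_two_heads, p_two_fours, p_some_four)
```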

The Monty Hall problem.

You are on a game show, and there are three doors, A, B and C. There is a prize behind exactly one of the doors; behind the others, there is nothing. You are to guess at a door. The probability that the prize is behind your door is 1/3.

At this point the host (who knows what is where) opens one of the remaining doors, revealing no prize. The host can always do this, because there is a prize behind only one door; there are always two "dud" doors. Your choice: should you switch doors to the remaining closed door?

Somewhat surprisingly, the answer is yes. If your choice had been door A, and the host opens door B, then the probability that the treasure is behind door A remains at 1/3. So the probability it is behind door C is now 2/3.

The usual argument against this is that the host can always pick an empty door, so that doesn't reveal any material information. But "revealing information" arguments are fundamentally imprecise. Another way to look at it is that opening the door does not reveal any information and so the probability of the prize being behind your first-picked door must remain the same.
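If the argument still feels slippery, the cases can simply be enumerated. The sketch below fixes the first pick as door A and walks through every prize location and every door the host could open. It assumes (this is not stated above) that when the host has two empty doors to choose from, he picks each with probability 1/2.

```python
from fractions import Fraction

doors = ["A", "B", "C"]
choice = "A"                      # the contestant's first pick
stay = Fraction(0)                # probability of winning by staying
switch = Fraction(0)              # probability of winning by switching

for prize in doors:               # each prize location has probability 1/3
    # The host opens a door that is neither the pick nor the prize.
    openable = [d for d in doors if d != choice and d != prize]
    for opened in openable:
        # Assumption: the host chooses uniformly among the doors he may open.
        weight = Fraction(1, 3) / len(openable)
        remaining = next(d for d in doors if d not in (choice, opened))
        if prize == choice:
            stay += weight
        if prize == remaining:
            switch += weight

print(stay, switch)
```

Staying wins 1/3 of the time and switching wins 2/3 of the time, matching the argument above.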

Monty Hall versus Two Heads

In the Multiple Choices section, knowing that two coins did not result in TT meant that P(HH) was now 1/3. That is, partial information changed the probability of HH. However, in the Monty Hall section, the partial information provided by Monty's opening of Door 2 or Door 3 did not affect the probability P(prize is behind Door 1).

Why the difference?

To better compare these, let's come up with a pair of closer scenarios. We'll assume we have a 3-sided die (or a 6-sided die with 1 on two faces, 2 on two faces and 3 on two faces). Somebody else rolls the die, and the prize is placed behind the door with the corresponding number. We want to know the probability that the prize is behind Door 1, P(door1). At this point it is 1/3.

Scenario 1 (multiple choices): Monty now opens door 3 -- always door 3, even if the prize is there -- and reveals whether the prize is there.

Scenario 2 (Monty Hall): Monty, knowing where the prize is, picks one of doors 2 or 3 to open, revealing no prize. For definiteness, let's suppose that if the prize is behind door 3, Monty picks door 2, and if the prize is behind door 2, Monty picks door 3, and if the prize is behind door 1, Monty picks door 2 with 50% probability (and hence the same for door 3).

Let A denote the event that the prize is behind door 1. Monty's actions can be viewed as a pair of events. In scenario 1, the outcome is whether or not the prize is behind door 3. There are two possible events: M1 = E(prize not behind door 3) and M2 = E(prize behind door 3). In scenario 2, the outcome of no prize occurs with certainty; the event is really whether Monty opens door 2 or door 3. We have M3 = E(Monty opens door 2) and M4 = E(Monty opens door 3).

In scenario 1, after Monty's information, if there is no prize behind door 3 then the probability that the prize is behind door 1 is now 1/2. But in scenario 2, the probability that the prize is now behind door 1 remains at 1/3, and therefore the probability that it is behind the other unopened door is now 2/3.

One way to understand this is to use the frequentist approach. Suppose we run scenario 1 300 times. M1 is expected to occur 200 times, and M2 100 times. Of those 200 times when M1 occurred, the prize is behind door 1 half the time, that is, 100 times. Given M1, the probability that the prize is behind door 1 becomes 100/200 = 1/2. Given M2, P(door 1) = 0; one of the big differences between the two scenarios is that in scenario 1 we are ignoring M2, while in scenario 2 both Monty events are considered. Here's the tabular analysis that includes M2:

Prize is behind door 1		Prize is behind door 2		Prize is behind door 3
M1	M2	M1	M2	M1	M2
100 times	0 times	100 times	0 times	0 times	100 times

If we just look at the entries in the final row that don't include M2, we get 100/200 (= number of times prize is behind door 1 / total number).

In 300 runs of scenario 2, by comparison, M3 and M4 each occur about 150 times, and, for each of those 150 times, the prize is behind door 1 about 50 times. So P(door 1) remains at 1/3. Here's that in tabular form:

Prize is behind door 1		Prize is behind door 2		Prize is behind door 3
M3	M4	M3	M4	M3	M4
50 times	50 times	0 times	100 times	100 times	0 times

So there are 100 times in all, out of 300, that the prize is behind door 1. This time, it doesn't matter if we ignore M3 or M4: if we only look at the bottom-row entries for M3, we get 50/150 = 1/3, and similarly for M4.

Another way to visualize the difference is that, in scenario 1, we are ignoring the outcome M2. If we add that in, then we can reason as follows: M1 occurs 2/3 of the time, and M2 occurs 1/3 of the time. If M1 occurs, P(A) = 1/2. If M2 occurs, P(A) = 0. So, taking the weighted average,

P(A) = 1/2*2/3 + 0*1/3 = 1/3

where this probability is regardless of the Monty outcome.

For one more way to look at this, imagine knowing that two coins have been tossed behind door #1. The probability of HH is 1/4. Now you are told that the result of the tosses is not TT. The probability of HH has changed; it is now 1/3. For the Monty Hall problem, the probability that the prize was behind Door 1 did not change. But in the two-heads case, the information you were given directly related to what was behind Door 1, and for the Monty Hall case, the information you were given has nothing to do with Door 1.

The final way to view the difference is in terms of conditional probability. Recall that the conditional probability of A given B, or P(A|B), is defined to be P(A∩B)/P(B).

In light of this, P(A|M1) = 1/2, straightforwardly (using the tables if necessary). Similarly, P(A|M2) = 0.

For scenario 2, however, P(A|M3) = P(A|M4) = 1/3.

If P(A) = P(A|B), we say A and B are independent. So, intuitively, in scenario 1, A is not independent of Monty's door opening, and so the conditional probability of A changes. In scenario 2, by contrast, A is independent of Monty's door opening. By design, Monty's scenario-2 door opening conveys no information about A. (Note that, as a consequence, Monty's scenario-2 door opening does convey information about the probability that the prize is behind the unopened one of door 2 or 3; that now rises to 2/3.)
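The conditional probabilities for the two scenarios can be computed directly from the definition P(A|B) = P(A∩B)/P(B). The sketch below writes out the joint probabilities by hand from the scenario descriptions; the dictionary layout and the helper name p_door1_given are illustrative choices, not anything from the notes.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Scenario 1: Monty always opens door 3.
# Keys are (door hiding the prize, Monty event); values are joint probabilities.
scenario1 = {
    (1, "M1"): third, (2, "M1"): third, (3, "M2"): third,
}

# Scenario 2: Monty opens an empty door among 2 and 3,
# flipping a fair coin when the prize is behind door 1.
scenario2 = {
    (1, "M3"): third / 2, (1, "M4"): third / 2,
    (2, "M4"): third,     (3, "M3"): third,
}

def p_door1_given(joint, event):
    """P(prize behind door 1 | event) = P(A and event) / P(event)."""
    p_event = sum(p for (_, e), p in joint.items() if e == event)
    p_both = sum(p for (door, e), p in joint.items() if door == 1 and e == event)
    return p_both / p_event

for name, joint, events in (("scenario 1", scenario1, ("M1", "M2")),
                            ("scenario 2", scenario2, ("M3", "M4"))):
    for event in events:
        print(name, event, p_door1_given(joint, event))
```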

Birthdays

What is the probability of, in a group of N people, two people having the same birthday?

Let's look at the probability of all N people having different birthdays. There are 365 choices for the first, times 364 for the second, and so on to 365-N+1 for the last. Multiplying, we get 365!/(365-N)!. As a probability, divide this by 365^N:

(365/365)*(364/365)*(363/365)*...*(365-N+1)/365

Here's a table of probabilities that everyone has a different birthday, for N=1 to N=25. To get the probability that two people have the same birthday, subtract from 1.

1	1.000
2	0.997
3	0.992
4	0.984
5	0.973
6	0.960
7	0.944
8	0.926
9	0.905
10	0.883
11	0.859
12	0.833
13	0.806
14	0.777
15	0.747
16	0.716
17	0.685
18	0.653
19	0.621
20	0.589
21	0.556
22	0.524
23	0.493
24	0.462
25	0.431

At what point are the odds of no birthdays in common less than 50%?
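The table, and the answer to the question, can be generated with a short loop over the product formula. This is a sketch under the usual simplifying assumptions (365 equally likely birthdays, no leap years); the function name p_all_different is made up here.

```python
def p_all_different(n, days=365):
    """Probability that n people all have different birthdays."""
    p = 1.0
    for i in range(n):
        p *= (days - i) / days
    return p

for n in range(1, 26):
    print(n, f"{p_all_different(n):.3f}")

# Smallest group size for which a shared birthday is more likely than not.
first_n = next(n for n in range(1, 366) if p_all_different(n) < 0.5)
print(first_n)
```

The probability of no shared birthday first drops below 50% at N=23, as the table shows.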

How about a different problem: we have 1,000,000 possible network addresses. Each network card gets its address assigned at random.

Question 1. How many addresses can we allocate randomly before the probability of an address conflict exceeds 50%?

Answer: For N cards, probability of no conflict is

(M/M)*(M-1)/M*(M-2)/M...*(M-N+1)/M

= 1*(1-1/M)*(1-2/M)*...*(1-(N-1)/M)

For N small, this is near to 100%, because all the factors are near to 1.0. Let's use the following approximation, good when a and b are small:

(1-a)(1-b) ≃ 1-(a+b)

The exact answer is 1 - (a+b) + ab, but when a and b are small, say < 1/100, then ab is smaller still: less than 1/10,000.

Using this approximation with the second form above, this becomes

1 - (1+2+...+(N-1))/M.

The exact value of the sum 1+2+...+(N-1) is (N-1)N/2, which is roughly N²/2. Plugging this into the formula above, our estimate for the probability of no conflict is 1 - N²/2M. Again, for small N, this is nearly 100%. For M=365 and N=23, this formula gives 0.275, which is noticeably smaller than the 0.493 we got before. But it's a good start. It works well when N² is much less than M. Here, N² is about the same size as M.

If 1 - N²/2M is the probability of no conflict, then the probability of a collision (or conflict) is N²/2M. If we choose N so large that this yields a number greater than 1.0, then we're definitely past the useful range of the approximation.

For the network-addresses problem, trying to solve N²/2M = 0.5, we get N = sqrt(M), which would mean 1000 addresses. The exact probability of having a collision in 1000 addresses is ~39.3%; we get a 50% chance at around 1177 addresses.

Having the probability of a network meltdown be 50% seems unwieldy. That brings us to

Question 2. How many addresses do we have to choose before the probability of an address conflict exceeds 0.001?

Using the same formula, we are solving N²/2M = 1/1000, or N² = 2,000, or N ≃ 45. Not a very big network. Here, for N=45 addresses the probability of collision is just under 0.001 and for N=46 it is just over. So the approximation works quite well. This is because N² is much smaller than M.
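To see where the N²/2M estimate is trustworthy, it helps to print it next to the exact product. The sketch below does that for M = 1,000,000; the function names are hypothetical, and the exact version just multiplies out the factors (1 - i/M) in floating point.

```python
def p_collision_exact(n, m):
    """Exact probability that n random picks from m addresses include a repeat."""
    p_none = 1.0
    for i in range(n):
        p_none *= (m - i) / m
    return 1.0 - p_none

def p_collision_approx(n, m):
    """The estimate N^2 / (2M) from the text; only useful when N^2 << M."""
    return n * n / (2 * m)

M = 1_000_000
for n in (45, 46, 1000, 1177):
    print(n, f"{p_collision_exact(n, M):.5f}", f"{p_collision_approx(n, M):.5f}")
```

For N around 45 the two columns agree closely; by N=1000 the estimate says 0.5 while the exact value is noticeably lower.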

Binomial distribution

Let's flip a coin N=100 times and look at how many heads we get. On average, we'll get 50, but how wide is the range?

Demo in Python, using binomial.py. Start with 20.

How many ways can we get k heads, k<=20? That's just choosing a subset of {1,2,...,20} of size k. So it's (20 choose k). The probability of this outcome is (20 choose k)/2²⁰. The probability of 10 heads (and 10 tails) is thus (20 choose 10)/2²⁰ = 184,756/1,048,576 = 17.62%

Try for N=1000 also?

Now let's do the same, but this time with an "unfair coin" that has heads with a fixed probability p, 0<p<1.

Demo for p=0.2, p=0.1

Connection to binomial coefficients: now P(count=k) = (N choose k) * p^k * (1-p)^(N-k). This is one term in the binomial expansion of (p+(1-p))^N, which is why these probabilities sum to 1.

We can calculate the mean: Np. We can also find the standard deviation (which I am not defining here, but which reflects the width of the probability curve); it turns out to be sqrt(Np(1-p))

For coins, with N=100, the mean is 50 and the standard deviation is sqrt(25) = 5. For p=0.2, stdev = 4. For p=0.1, it is 3.
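The formulas in this section are easy to evaluate directly. The following is a minimal sketch, not the binomial.py used for the demo (which is not reproduced in these notes); the function names are invented for illustration.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips, each heads with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_mean_stdev(n, p):
    """Mean Np and standard deviation sqrt(Np(1-p))."""
    return n * p, sqrt(n * p * (1 - p))

# Probability of exactly 10 heads in 20 fair flips.
print(f"{binomial_pmf(10, 20):.4f}")

# Mean and standard deviation for N=100 and several values of p.
for p in (0.5, 0.2, 0.1):
    mean, stdev = binomial_mean_stdev(100, p)
    print(p, f"mean={mean:.1f}", f"stdev={stdev:.3f}")
```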

Poisson Distribution

The Poisson distribution is all about counting some kind of event, with two rules:

The events should happen with a constant average rate, λ>0 (this rate can be per time or per spatial interval)
The events are independent, so the occurrence of one event in an interval doesn't affect the probability of any other events.

The basic Poisson question is what is the probability of getting k events in an interval of length 1? If X represents the count, it turns out that P(X=k) = λ^ke^-λ/k!.

The average value of these counts is just λ, the average rate (as expected!). The standard deviation is sqrt(λ).

Another way to look at the Poisson rate λ is that the underlying events are sprinkled along an axis, and you are counting the number in a spatial interval of length 1.

Examples (including some that are technically approximations)

911 calls per hour
alpha-particle decays per hour
earthquakes per year
If function rand() returns a random value in the range 0 to N-1, and we call it K times, on average each value occurs K/N times (true for K<N and also for K>N). This is a Poisson distribution with λ = K/N.
Choosing N uniformly distributed random points in the interval 0≤x≤1, and then sorting and plotting them. Seen in increasing order, the points are Poisson distributed, with λ=N, with the conditional-probability caveat that we just happened to get N points (P(X=N)).

The second rule above is that the underlying events being counted are completely independent. In particular, the occurrence of one event does not affect the occurrence of another. If the rate is λ events per unit time interval, the average time between events is 1/λ. This means that, at any instant, the average time until the next event is 1/λ, and this must be true even if an event has just occurred. The Poisson events are each "infinitesimally" likely, but they occur "infinitely often", with a finite rate λ per unit time (or unit space).

Flipping coins is sort of like Poisson. Suppose we flip coins at a fixed rate of 100 times/minute, and events are heads. The average number of heads is 50 per minute. It's not quite Poisson, because you cannot possibly get more than 100 heads. Once 100 heads occurs, the probability of another is zero.

A much closer binomial approximation to the Poisson distribution is if the probability of a head is small. For example, suppose the probability of a head is 0.1, and we flip the coin 500 times per minute. The average number of heads per minute, λ, is still 50. In general, the Poisson approximation to the binomial distribution becomes very good if p is small, < 0.05 (a fair coin, with p=0.5, is far from this) and N is large (N>20). If we're counting heads after N=100 flips of a fair coin, we take λ=50 = (1/2)*100. In general, for a binomial distribution with probability p and N tries, λ = pN.

Note that events in this binomial approach to the Poisson distribution are the heads, from unfair coins with a very small rate of heads. However, the flipping rate is correspondingly much higher, so that the average rate of heads per minute, λ, remains a reasonable value. The Poisson process is sometimes described as a binomial distribution where the head probability p is infinitesimal, but the count N is infinite, in such a way that the rate λ = pN remains fixed at an ordinary value.

Why the sum of λ^ke^-λ/k! is 1.0. This is because of the power-series formula for e^λ: it is Σ λ^k/k!, where the sum goes from k=0 to ∞. Multiplying by e^-λ yields 1 = Σ λ^ke^-λ/k!

For the binomial distribution the mean is pN and the standard deviation is sqrt(Np(1-p)). For the Poisson distribution, λ = pN is the mean, same as for the binomial distribution. The Poisson standard deviation is sqrt(λ) = sqrt(pN). The two standard deviations differ by a factor of sqrt(1-p), which is close to 1 for p close to 0. For example, if p=0.1, then sqrt(1-p) = 0.949, which is ~5% less than 1. But for p=1/2, sqrt(1-p) ≃ 0.707, and the Poisson curve is too wide by about 40%.
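A quick numerical comparison makes the point about small p concrete. The sketch below keeps the mean at 50 heads and shrinks p; for each case it reports the largest gap between the binomial and Poisson probabilities over k = 0..100. The helper names, the third test case (N=5000, p=0.01), and the choice of "largest gap" as a yardstick are assumptions made for this illustration.

```python
from math import comb, exp, factorial, sqrt

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return lam**k * exp(-lam) / factorial(k)

def binomial_pmf(k, n, p):
    """P(count = k) for n tries, each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def largest_gap(n, p, k_max=100):
    """Largest |binomial - Poisson| probability over k = 0..k_max, with lam = n*p."""
    lam = n * p
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(k_max + 1))

# Same mean (50 heads), smaller and smaller head probability.
for n, p in ((100, 0.5), (500, 0.1), (5000, 0.01)):
    print(f"N={n} p={p} binomial stdev={sqrt(n * p * (1 - p)):.3f} "
          f"Poisson stdev={sqrt(n * p):.3f} largest gap={largest_gap(n, p):.5f}")
```

The gap shrinks as p gets smaller, and the two standard deviations move closer together.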

Poisson's birthday

With 23 people, let's consider events to be birthdates. The average number λ of birthdays on a given date is 23/365. But these are likely to be Poisson-distributed. So the expected number of dates with k=2 birthdays should be around 365*λ²e^-λ/2!, which is 0.68. This doesn't quite give a probability of a collision, but if the expected number of collisions is 0.68, that means that at least one collision is reasonably likely.
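The 0.68 figure is a one-line calculation. Here is a sketch of it, treating the number of birthdays on each date as Poisson with λ = 23/365, exactly as in the paragraph above; the function name expected_dates_with is made up.

```python
from math import exp, factorial

people, days = 23, 365
lam = people / days    # average number of birthdays per calendar date

def expected_dates_with(k):
    """Poisson estimate of how many dates are shared by exactly k people."""
    return days * lam**k * exp(-lam) / factorial(k)

print(f"{expected_dates_with(2):.2f}")
```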

Normal distribution

The ultimate approximation here is the normal distribution (y = e^-x²), the classic "bell-shaped curve" of probability. I'm not going to consider it, except to say that it is completely characterized by the mean (the center-point of the peak) and the standard deviation: how wide is the peak. Specifically, the probability of being within ±1.0 standard deviations of the mean is 0.68268949213. (That value is erf(1/sqrt(2)), which is defined as an integral involving y = e^-x².)

Hashing

Suppose you reduce each data value to an integer h in the range 0..N-1, using some "pseudorandom" process, and then put that data value in slot A[h] of an array of size N. What is the probability of no "collisions"?

For example, we have K=100 strings we want to store in an array of size N=200. Each string s is converted to a pseudorandom integer < 200, using a "hash" function h(s). The Poisson events are now the 100 different values of h(s), distributed "spatially" among the numbers < 200. Think of throwing 100 darts at a dartboard with 200 spaces. Each space will receive, on average, 1/2 a dart. How many spaces can we expect to receive 2 darts?

We can analyze this using the Birthday technique above; here, we have a ~50% chance of a collision if the number of data values is roughly sqrt(N). So, either collisions are likely or the array A is not very full. In the case above with K=100 and N=200, the probability of collisions is quite high.

We can also use the Poisson distribution (as an approximation, but a good one). If there are K data values, then each array slot has on average λ = K/N occupants. Therefore, applying the Poisson distribution (really a Binomial distribution but with a small p of 1/N), we get the probability of k occupants for a cell is λ^ke^-λ/k!. Out of N cells, the expected count of cells with k occupants is then Nλ^ke^-λ/k!

Number of empty array cells: Ne^-λ, equal to 200*0.6065 = 121.306 ≃ 121

Number of array cells with one value: Nλe^-λ, which for N=200 and λ=0.5 works out to about 61

Number of array cells with two values: Nλ²e^-λ/2!. This works out to about 15. We can store two strings at the same location by making our array of size 200 be an array of lists (often so-called "linked lists"), so the collisions themselves are not the issue. However, note that this means that, for ~30 strings (15 slots, times 2 strings in each slot), we have a list to search.

Number of array cells with three values: Nλ³e^-λ/3! ≃ 2.5.

At this point, we have accounted for about 199.6 out of 200 array slots; having more than three strings that hash to the same slot is unlikely.

Once we calculate the number of k=0 (empty) cells, we can get subsequent ones by multiplying by λ/k:

k	expected count
0	121.306
1	121.306*0.5/1 = 60.65
2	60.65*0.5/2 = 15.16
3	15.16*0.5/3 = 2.53

The factor of λ moves the numerator from λ^(k-1) to λ^k; the k in the denominator moves that from 1/(k-1)! to 1/k!
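The multiply-by-λ/k recurrence translates directly into a loop. The sketch below computes the expected slot counts for K=100 strings and N=200 slots; the function and parameter names are illustrative, and the numbers are Poisson estimates, not measurements of any real hash function.

```python
from math import exp

def expected_slot_counts(num_items, num_slots, k_max=3):
    """Poisson estimate of how many slots hold exactly k items, for k = 0..k_max."""
    lam = num_items / num_slots
    counts = [num_slots * exp(-lam)]          # k = 0: empty slots
    for k in range(1, k_max + 1):
        counts.append(counts[-1] * lam / k)   # multiply the previous entry by lam/k
    return counts

for k, count in enumerate(expected_slot_counts(100, 200)):
    print(k, f"{count:.2f}")
```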

Bayes Formula

Suppose A and B are events. The notation P(A|B) is the probability of A, given that we know B. Formally, P(A|B) is defined as P(A∩B)/P(B). The point is that the interrelatedness of A and B is important. If P(A|B) = P(A), then A and B are said to be independent: knowing B does not influence the probability of A. Another way to describe this is that P(A)*P(B) = P(A∩B) (this follows from the definition of P(A|B) and the definition of independence).

Here is Bayes' formula:

P(A|B) = P(B|A)*P(A)/P(B)

One of the classic applications is in testing. Suppose you have a medical test. The false-positive rate is 1%, and the false-negative rate is 1%. Sounds good, right?

But suppose the underlying condition occurs with frequency 1 in 1,000. Let's let A be the event that someone has the condition, and B the event that the test showed positive for that person. We want P(A|B): if someone tests positive, how likely is it that they really are positive?

We know P(A) = 0.001.

Also, P(B|A) = 0.99, the probability of a positive test given that someone is sick. This is just 1.0 minus the false-negative rate.

It takes a little more thought to get P(B), the probability of a positive test. If someone is not sick, the probability of a positive test is 0.01, and the probability of not being sick is 0.999. Multiplying these gives 0.00999, which is the probability that someone is well and tests positive.

We could round this off to 0.01 and stop there, but we really should consider the outcome for those who are sick. For those, the probability of a positive test is 0.99, and the probability of being sick is 0.001. So, the probability that someone is sick and tests positive is 0.99*0.001 = 0.00099. The exact value for P(B) is the sum of P(positive test and not sick) and P(positive test and sick), which is 0.01*0.999 + 0.99*0.001 = 0.01098.

Maybe a better way to do this is to make a table, based on 100,000 people, of whom 100 are sick:

	sick	not sick	row totals
positive test	99	999	1098
negative test	1	98901	98902
	100	99,900	100,000

Again, we get P(B) = 1,098/100,000.

All this leads to

P(A|B) = 0.99*0.001/0.01098 = 0.09016, or 9%.

So if we test a bunch of people, and get 100 who test positive, only about 9 of them will actually have the condition.
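The whole calculation fits in one small function, which also makes it easy to try other prevalence values. This is a sketch of Bayes' formula as used above; the function name posterior and its parameter names are hypothetical.

```python
def posterior(prior, false_positive_rate, false_negative_rate):
    """P(condition | positive test) via Bayes' formula."""
    p_pos_given_sick = 1.0 - false_negative_rate
    # P(B): positive tests come from both the sick and the well.
    p_pos = p_pos_given_sick * prior + false_positive_rate * (1.0 - prior)
    return p_pos_given_sick * prior / p_pos

# The example from the text: 1-in-1,000 prevalence, 1% error rates.
print(f"{posterior(0.001, 0.01, 0.01):.5f}")

# The same test when the condition is much more common.
print(f"{posterior(0.2, 0.01, 0.01):.5f}")
```

With the same test, raising the prior P(A) raises the posterior sharply, which is why prevalence matters so much.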

This has serious implications for testing of rare diseases. It is widely believed that we'd all be better off with widespread use of an early-screening test for cancer. But most cancers are rare, and most tests have a false-positive rate much higher than 1%. So the large majority of people who would be treated would in fact be free of cancer! And since cancer treatments tend to have very significant side effects, it is possible that people would, on average, be worse off with more widespread testing.

(This probably does not apply to testing for SARS-CoV-2, because P(A) is, in exposed populations, often much higher.)