I don't have a great textbook source for this. There's some material in Lovasz & Vesztergombi on pp 51-54, and Aspnes 224-??. The math starts getting pretty dense, though, after page 235.

A lot of probability is closely related to sets. We have the set of all outcomes, which has probability 1.0. Various subsets have their own probabilities. If A and B are disjoint sets, then P(A∪B) = P(A) + P(B). In general, P(A∪B) = P(A) + P(B) - P(A∩B).

Often, if there are multiple outcomes, it is reasonable to assume that
all outcomes are equally likely (though this is definitely not *always*
the case). For example, with coin tosses, if we assume P(Heads) =
P(Tails), then each is 1/2. For a single six-sided die, P(1) = P(2) = ...
= P(6) = 1/6. For drawing a playing card from a well-shuffled deck, any
given card has probability 1/52 of being drawn.

In these notes I will outline some basic facts about probability that are often used in computer science.

The field of *statistics* is, loosely speaking, analysis of data
to determine the probability that a given result could be attributed to
chance. The lower that probability, the more likely we are to believe the
result.

Each card has a *rank* (2,3,4,5,6,7,8,9,10,J,Q,K,A) and one of four *suits*.
I'm going to assume, for simplicity, that an ace is always high; that is,
(A,2,3,4,5) is not a straight.

We will calculate some five-card poker-hand probabilities. There are (52 choose 5) = 2,598,960 hands in all.

How many ways can you get a **straight flush** (5 cards in
sequence, one suit)? There are 9 choices for the rank of the lowest card
[2 through 10], times 4 possible suits, which gives 9*4 = 36 hands. (Here
we are explicitly using the hypothesis that (A,2,3,4,5) is *not* a
straight.)

How many ways can you get **four-of-a-kind**? 13 ranks for
the four, times 12x4 possibilities for the fifth card, is 624.

How many ways can you get a **full house** (three of a kind
plus two of another kind)? For the three cards of the same rank, we have
13 possible ranks * 4 choices for the odd suit (the suit not included in
the three) = 52. For the two-of-a-kind, we now have 12 possible ranks * (4
choose 2) suit choices (choose 2 suits out of 4) = 72. This is a grand
total of 3744.

How many ways can you draw a **flush** (5 cards, same
suit): (13 choose 5) for one suit, times 4 suits, = 1287 * 4 = 5148.

How many ways can you draw a **straight** (5 cards in
sequence)? There are 9 choices for the rank of the lowest card, times 4^{5}
choices for the suits of the 5 cards, which is 9*1024 = 9216. (Like the
flush count, this includes the 36 straight flushes.)

**Three of a kind**? There are 13 choices for the rank, times 4
choices for the odd suit (the suit missing from the three). There are now
two more cards that have to have different ranks. The number of
combinations of the two ranks is (12 choose 2), and there
are 4*4 choices for the suits. So that's 13*4*(12 choose 2) * 16 = 54,912.

**Two pair**? There are (13 choose 2) ways to choose the two
ranks. For each rank, we can choose suits in (4 choose 2) ways. There are
now 11*4 possibilities for the fifth card, for 78*6*6*44 = 123,552
choices.
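All of these counts can be checked with a few lines of Python; here is a sketch using `math.comb` (variable names are mine):

```python
from math import comb

total = comb(52, 5)                            # 2,598,960 five-card hands

straight_flush = 9 * 4                         # low rank 2..10, one suit
four_of_a_kind = 13 * 12 * 4                   # rank of the four, then the fifth card
full_house     = (13 * 4) * (12 * comb(4, 2))  # triple, then the pair
flush          = 4 * comb(13, 5)               # includes the straight flushes
straight       = 9 * 4**5                      # includes the straight flushes
three_of_kind  = 13 * 4 * comb(12, 2) * 4 * 4  # triple, then two odd cards
two_pair       = comb(13, 2) * comb(4, 2)**2 * 11 * 4
```

Dividing any of these counts by `total` gives the corresponding probability; for example, the full-house probability is 3744/2,598,960 ≈ 0.00144.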

Let's look at the probability distribution of the result of rolling two
dice. Outcomes will be represented as an ordered pair, *eg* (3,5).
Note that we are treating the two dice as distinct; that is, we can view
the roll as rolling the first die and then the second die. This
distinction is real: there is only one way to roll two 2's, but there are
two ways to roll a 2 and a 3, so the set {2,3} is indeed twice as likely
as two 2's.

Now let's look at the **sum** of the two dice. There are 36
outcomes. How many outcomes yield a total of 2? Just (1,1). How many
outcomes yield a 3? (1,2) and (2,1). For a 4, there are three outcomes:
(1,3), (2,2), (3,1). In general, for sum N<=7 there are N-1 outcomes:
(1,N-1) through (N-1,1). For N>7, the number of outcomes is the same as
for 12 - N.

The **probability** of getting a 4 is thus 3/36. Here is a
table of all the probabilities.

| Number | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|--------|------|------|------|------|------|------|------|------|------|------|------|
| Probability | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |
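The table can be generated by enumerating all 36 ordered outcomes; a quick sketch:

```python
from collections import Counter
from fractions import Fraction

# count each possible sum over the 36 equally likely ordered rolls
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
probs = {s: Fraction(n, 36) for s, n in sorted(counts.items())}
```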

*If I have two children, and one is a boy, what is the probability the
other is a boy?*

Here the real question is what probability means. Are we talking
about *my* children? That result is fixed (one boy and one girl,
btw). Are we talking about the **frequentist** approach?
That is, we sample 4,000 people who have two children, and discard the
data from the ~1,000 who have two girls. This is the "normal" approach to
probability. Or are we talking about the **Bayesian**
(bayz'-e-in) approach, where probability reflects the strength of your
belief? That approach might apply to an individual with two children where
you simply do not know the full data.

If I flip two coins and at least one is a head, what is the probability the other is also a head?

Here is the frequentist analysis. There are four outcomes: (H,H), (H,T), (T,H), (T,T). Knowing that there is at least one head eliminates (T,T). The other three are equally likely, and so there is 1 chance in 3 that we have two heads (H,H).

How about the probability that, if I roll two dice and one is a 4, the other is also a 4? Here we have our 11 ways to roll two dice and get at least one 4. There is only one outcome that represents two 4's. So the probability is 1/11.
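Both of these conditional probabilities can be verified by direct enumeration of the outcomes; a sketch:

```python
from itertools import product

# coins: keep the outcomes with at least one head, then count (H,H)
coins = [o for o in product("HT", repeat=2) if "H" in o]
p_both_heads = coins.count(("H", "H")) / len(coins)   # 1 outcome out of 3

# dice: keep the rolls containing at least one 4, then count (4,4)
rolls = [r for r in product(range(1, 7), repeat=2) if 4 in r]
p_both_fours = rolls.count((4, 4)) / len(rolls)       # 1 outcome out of 11
```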

You are on a game show, and there are three doors, A, B and C. There is a prize behind exactly one of the doors; behind the others, there is nothing. You are to guess at a door. The probability that the prize is behind your door is 1/3.

At this point the host (who knows what is where) opens one of the remaining doors, revealing no prize. The host can always do this, because there is a prize behind only one door; there are always two "dud" doors. Your choice: should you switch doors to the remaining closed door?

Somewhat surprisingly, the answer is yes. If your choice had been door A, and the host opens door B, then the probability that the treasure is behind door A remains at 1/3. So the probability it is behind door C is now 2/3.

The usual argument *against* switching is that the host can always
pick an empty door, so opening one doesn't reveal any material
information. But "revealing information" arguments are fundamentally
imprecise. Turned around, though, the same observation explains the
answer: because opening the door reveals no information about *your*
door, the probability of the prize being behind your first-picked door
must remain the same, 1/3.
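The 1/3-versus-2/3 split is easy to check by simulation. A minimal sketch (door numbering, seed, and trial count are arbitrary):

```python
import random

def monty_trial(switch, rng):
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # the host opens a door that is neither the pick nor the prize
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

rng = random.Random(1)
n = 100_000
p_stay   = sum(monty_trial(False, rng) for _ in range(n)) / n  # about 1/3
p_switch = sum(monty_trial(True, rng) for _ in range(n)) / n   # about 2/3
```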

In the Multiple Choices section, knowing that two coins did *not*
result in TT meant that P(HH) was now 1/3. That is, partial information
changed the probability of HH. However, in the Monty Hall section, the
partial information provided by Monty's opening of Door 2 or Door 3 did *not*
affect the probability P(prize is behind Door 1).

Why the difference?

To better compare these, let's come up with a pair of closer scenarios. We'll assume we have a 3-sided die (or a 6-sided die with 1 on two faces, 2 on two faces and 3 on two faces). Somebody else rolls the die, and the prize is placed behind the door with the corresponding number. We want to know the probability that the prize is behind Door 1, P(door1). At this point it is 1/3.

**Scenario 1** (multiple choices): Monty now opens door 3,
and reveals whether the prize is there.

**Scenario 2** (Monty Hall): Monty, knowing where the prize
is, picks one of doors 2 or 3 to open, revealing no prize. For
definiteness, let's suppose that if the prize is behind door 3, Monty
picks door 2, and if the prize is behind door 2, Monty picks door 3, and
if the prize is behind door 1, Monty picks door 2 with 50% probability
(and hence the same for door 3).

Let A denote the **event** that the prize is behind door 1.
Monty's actions can be viewed as a pair of events. In scenario 1, the
outcome is whether or not the prize is behind door 3. There are two
possible events: M1 = E(prize **not** behind door 3) and M2
= E(prize behind door 3). In scenario 2, the outcome of no prize
occurs with certainty; the event is really whether Monty opens door 2 or
door 3. We have M3 = E(Monty opens door 2) and M4 = E(Monty opens door 3).

In scenario 1, after Monty's information, if there is no prize behind door 3 then the probability that the prize is behind door 1 is now 1/2. But in scenario 2, the probability that the prize is now behind door 1 remains at 1/3, and therefore the probability that it is behind the other unopened door is now 2/3.

One way to understand this is to use the **frequentist**
approach. Suppose we run **scenario 1** 300 times. M1 is
expected to occur 200 times, and M2 100 times. Of those 200 times when M1
occurred, the prize is behind door 1 half the time, that is, 100 times.
Given M1, the probability that the prize is behind door 1 becomes 100/200
= 1/2. Given M2, P(door 1) = 0; one of the big differences between the two
scenarios is that in scenario 1 we are ignoring M2, while in scenario 2
both Monty events are considered. Here's the tabular analysis that
includes M2:

| | Prize behind door 1 | Prize behind door 2 | Prize behind door 3 |
|----|-----------|-----------|-----------|
| M1 | 100 times | 100 times | 0 times |
| M2 | 0 times | 0 times | 100 times |

If we look only at the runs in which M1 occurred, the prize is behind door 1 in 100 of those 200 runs.

In 300 runs of **scenario 2**, by comparison, M3 and M4
each occur about 150 times, and, for each of those 150 times, the prize is
behind door 1 about 50 times. So P(door 1) remains at 1/3. Here's that in
tabular form:

| | Prize behind door 1 | Prize behind door 2 | Prize behind door 3 |
|----|-----------|-----------|-----------|
| M3 | 50 times | 0 times | 100 times |
| M4 | 50 times | 100 times | 0 times |

So there are 100 times in all, out of 300, that the prize is behind door 1. This time, it doesn't matter whether we restrict attention to M3 or M4: looking only at the M3 runs, we get 50/150 = 1/3, and similarly for M4.

Another way to visualize the difference is that, in scenario 1, we are
ignoring the outcome M2. If we add that in, then we can reason as follows:
M1 occurs 2/3 of the time, and M2 occurs 1/3 of the time. If M1 occurs,
P(A) = 1/2. If M2 occurs, P(A) = 0. So, taking the weighted average,

P(A) = 1/2*2/3 + 0*1/3 = 1/3

this being the overall probability of A, regardless of which Monty outcome occurred.

The final way to view the difference is in terms of **conditional
probability**. Recall that the conditional probability of A given
B, or P(A|B), is defined to be P(A∩B)/P(B).

In light of this, P(A|M1) = 1/2, straightforwardly (using the tables if necessary). Similarly, P(A|M2) = 0.

For scenario 2, however, P(A|M3) = P(A|M4) = 1/3.

If P(A) = P(A|B), we say A and B are **independent**. So,
intuitively, in scenario 1, A is *not* independent of Monty's door
opening, and so the conditional probability of A changes. In scenario 2,
by contrast, A *is* independent of Monty's door opening. By
design, Monty's scenario-2 door opening conveys no information about A.
(Note that, as a consequence, Monty's scenario-2 door opening *does*
convey information about the probability that the prize is behind the
unopened one of door 2 or 3; that now rises to 2/3.)

What is the probability of, in a group of N people, two people having the same birthday?

Let's look at the probability of all N people having different birthdays.
There are 365 choices for the first, times 364 for the second, and so on
to 365-N+1 for the last. Multiplying, we get 365!/(365-N)!. As a
probability, divide this by 365^{N}:

(365/365)*(364/365)*(363/365)*...*((365-N+1)/365)
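The product is straightforward to compute directly; a sketch (function name is mine):

```python
def p_all_distinct(n, days=365):
    """Probability that n people all have different birthdays."""
    p = 1.0
    for i in range(n):
        p *= (days - i) / days   # the i-th person avoids the first i birthdays
    return p
```

For example, `p_all_distinct(23)` is about 0.493, matching the table entries below.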

Here's a table of probabilities that everyone has a different birthday, for N=1 to N=25. To get the probability that two people have the same birthday, subtract from 1.

| N | P(all birthdays different) |
|----|-------|
| 1 | 1.000 |
| 2 | 0.997 |
| 3 | 0.992 |
| 4 | 0.984 |
| 5 | 0.973 |
| 6 | 0.960 |
| 7 | 0.944 |
| 8 | 0.926 |
| 9 | 0.905 |
| 10 | 0.883 |
| 11 | 0.859 |
| 12 | 0.833 |
| 13 | 0.806 |
| 14 | 0.777 |
| 15 | 0.747 |
| 16 | 0.716 |
| 17 | 0.685 |
| 18 | 0.653 |
| 19 | 0.621 |
| 20 | 0.589 |
| 21 | 0.556 |
| 22 | 0.524 |
| 23 | 0.493 |
| 24 | 0.462 |
| 25 | 0.431 |

At what point are the odds of *no* birthdays in common less than
50%?

How about a different problem: we have 1,000,000 possible network addresses. Each network card gets its address assigned at random.

**Question 1**. How many addresses can we allocate randomly
before the probability of an address conflict exceeds 50%?

Answer: For N cards, probability of **no** conflict is

(M/M)*((M-1)/M)*((M-2)/M)*...*((M-N+1)/M)

= 1*(1-1/M)*(1-2/M)*...*(1-(N-1)/M)

For N small, this is near to 100%, because all the factors are near to 1.0. Let's use the following approximation, good when a and b are small:

(1-a)(1-b) ≃ 1-(a+b)

The exact answer is 1 - (a+b) + ab, but when a and b are small, say < 1/100, then ab is smaller still: less than 1/10,000.

Using this approximation with the second form above, this becomes

1 - (1+2+...+(N-1))/M.

The exact value of the sum 1+2+...+(N-1) is (N-1)N/2, which is roughly N^{2}/2.
Plugging this into the formula above, our estimate for the probability of
**no** conflict is 1 - N^{2}/2M. Again, for small N,
this is nearly 100%. For M=365 and N=23, this formula gives 1 - 529/730 ≈
0.275, which is noticeably smaller than the 0.493 we got before. But it's
a good start. The approximation works well when N^{2} is much less than
M; here, N^{2} is about the same size as M.

If 1 - N^{2}/2M is the probability of **no conflict**,
then the probability of a **collision** (or conflict) is N^{2}/2M.
If we choose N so large that this yields a number greater than 1.0, then
we're definitely past the useful range of the approximation.

For the network-addresses problem, trying to solve N^{2}/2M =
0.5, we get N = sqrt(M), which would mean 1000 addresses. The exact
probability of having a collision among 1000 addresses is ~39.3%; we get
a 50% chance at around 1177 addresses.

Having the probability of a network meltdown be 50% seems unwieldy. That brings us to

**Question 2**. How many addresses do we have to choose
before the probability of an address conflict exceeds 0.001?

Using the same formula, we are solving N^{2}/2M = 1/1000, or N^{2}
= 2,000, or N ≃ 45. Not a very big network. Here, for N=45 addresses the
probability of collision is just under 0.001 and for N=46 it is just over.
So the approximation works quite well. This is because N^{2} is
much smaller than M.
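Both questions can be checked against the exact product, without the N^{2}/2M approximation. A sketch (function name is mine):

```python
def p_conflict(n, m=1_000_000):
    """Exact probability of at least one address conflict among n random addresses."""
    p_none = 1.0
    for i in range(n):
        p_none *= (m - i) / m    # the i-th address avoids the first i
    return 1.0 - p_none
```

`p_conflict(1000)` is about 0.393, the exact 50% point falls near N=1177, and `p_conflict(45)` is just under 0.001 while `p_conflict(46)` is just over.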

Let's flip a coin N=100 times and look at how many heads we get. On average, we'll get 50, but how wide is the range?

Demo in Python, using binomial.py. Start with 20.

How many ways can we get k heads, k<=20? That's just choosing a subset
of {1,2,...,20} of size k. So it's (20 choose k). The *probability*
of this outcome is (20 choose k)/2^{20}. The probability of 10
heads (and 10 tails) is thus (20 choose 10)/2^{20} =
184,756/1,048,576 = 17.62%.
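This calculation generalizes to any k; a sketch (function name is mine):

```python
from math import comb

def coin_pmf(n, k):
    """Probability of exactly k heads in n fair-coin flips."""
    return comb(n, k) / 2**n
```

`coin_pmf(20, 10)` is about 0.1762, and the twenty-one values `coin_pmf(20, 0)` through `coin_pmf(20, 20)` sum to 1, as they must.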

Try for N=1000 also?

Now let's do the same, but this time with an "unfair coin" that has heads with a fixed probability p, 0<p<1.

Demo for p=0.2, p=0.1

Connection to binomial coefficients: now P(count=k) = (N choose k)*p^{k}(1-p)^{N-k}.
This is a term in the expansion of (p+(1-p))^{N}.

We can calculate the **mean**: Np. We can also find the **standard
deviation** (which I am not defining here, but which reflects the
width of the probability curve); it turns out to be sqrt(Np(1-p))

For coins, with N=100, the mean is 50 and the standard deviation is sqrt(25) = 5. For p=0.2, stdev = 4. For p=0.1, it is 3.

The Poisson distribution is all about counting some kind of event, with two rules:

- The events should happen with a constant average rate, λ>0 (this rate can be per time or per spatial interval).
- The events are **independent**, so the occurrence of one event in an interval doesn't affect the probability of any other events.

The basic Poisson question is what is the probability of getting k
events in an interval of length 1? If X represents the count, it
turns out that P(X=k) = λ^{k}e^{-λ}/k!.

The average value of these counts is just λ, the average rate (as expected!). The standard deviation is sqrt(λ).
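A quick numerical check of these facts (the rate λ=3 is an arbitrary example, and the function name is mine):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with average rate lam."""
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0
# the probabilities sum to 1, and the mean of the counts equals lam
total = sum(poisson_pmf(k, lam) for k in range(60))
mean  = sum(k * poisson_pmf(k, lam) for k in range(60))
```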

Another way to look at the Poisson rate λ is that the underlying events are sprinkled along an axis, and you are counting the number in a spatial interval of length 1.

Examples (including some that are technically approximations):

- 911 calls per hour
- alpha-particle decays per hour
- earthquakes per year
- If function rand() returns a random value in the range 0 to N-1, and we call it K times, on average each value occurs K/N times (true for K<N and also for K>N). This is a Poisson distribution with λ = K/N.
- Choosing N uniformly distributed random points in the interval 0≤x≤1, then sorting and plotting them. Seen in increasing order, the points look like a Poisson process with λ=N, conditioned on the event that we got exactly N points (P(X=N)).

The second rule above is that the underlying events being counted are completely independent. In particular, the occurrence of one event does not affect the occurrence of another. If the rate is λ events per unit time interval, the average time between events is 1/λ. This means that, at any instant, the average time until the next event is 1/λ, and this must be true even if an event has just occurred. The Poisson events are each "infinitesimally" likely, but they occur "infinitely often", with a finite rate λ per unit time (or unit space).

Flipping coins is sort of like Poisson. Suppose we flip coins at a **fixed**
rate of 100 times/minute, and events are heads. The average number of
heads is 50 per minute. It's not quite Poisson, because you cannot
possibly get more than 100 heads. Once 100 heads occurs, the probability
of another is zero.

A much closer binomial approximation to the Poisson distribution comes when the probability of a head is small. For example, suppose the probability of a head is 0.1, and we flip the coin 500 times per minute. The average number of heads per minute, λ, is still 50. In general, the Poisson approximation to the binomial distribution becomes very good when p is small (say p < 0.05; for fair-coin flips, p = 0.5) and N is large (N > 20). If we're counting heads after N=100 flips of a fair coin, we take λ = 50 = (1/2)*100. In general, for a binomial distribution with probability p and N tries, λ = pN.

Note that events in this binomial approach to the Poisson distribution are the heads, from unfair coins with a very small rate of heads. However, the flipping rate is correspondingly much higher, so that the average rate of heads per minute, λ, remains a reasonable value. The Poisson process is sometimes described as a binomial distribution where the head probability p is infinitesimal, but the count N is infinite, in such a way that the rate λ = pN remains fixed at an ordinary value.
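We can measure how good the approximation is by comparing the two distributions directly. A sketch, reaching λ=50 two different ways (function names are mine):

```python
from math import comb, exp, factorial

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

# 500 flips with p=0.1: lam = 50 with a small p, so the Poisson fit is close
err_small_p = max(abs(binom_pmf(500, k, 0.1) - poisson_pmf(k, 50))
                  for k in range(101))

# 100 flips of a fair coin: lam = 50 again, but p=0.5 is not small
err_fair = max(abs(binom_pmf(100, k, 0.5) - poisson_pmf(k, 50))
               for k in range(101))
```

Here `err_small_p` comes out several times smaller than `err_fair`, reflecting the small-p rule above.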

**Why the sum of λ ^{k}e^{-λ}/k! is 1.0**.
This is because of the power-series formula for e

For the binomial distribution the mean is pN and the standard deviation is sqrt(Np(1-p)). For the Poisson distribution, λ = pN is the mean, same as for the binomial distribution. The Poisson standard deviation is sqrt(λ) = sqrt(pN). The two standard deviations differ by a factor of sqrt(1-p), which is close to 1 for p close to 0. For example, if p=0.1, then sqrt(1-p) = 0.949, about 5% less than 1. But for p=1/2, sqrt(1-p) ≃ 0.71, and the Poisson curve is too wide by about 40%.

With 23 people, let's consider events to be birthdates. The average
number λ of birthdays on a given date is 23/365. These counts are
approximately Poisson-distributed. So the expected number of dates with
k=2 birthdays should be around 365*λ^{2}e^{-λ}/2!, which is 0.68. This
doesn't quite give the probability of a collision, but if the expected
number of collisions is 0.68, then at least one collision is reasonably
likely.
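The 0.68 figure is a one-liner to verify (a sketch; variable names are mine):

```python
from math import exp

lam = 23 / 365                          # average birthdays per date
# expected number of dates holding exactly 2 of the 23 birthdays
pairs = 365 * lam**2 * exp(-lam) / 2
```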

The ultimate approximation here is the normal distribution (the standard
bell curve y = e^{-x²/2}, suitably scaled), the classic "bell-shaped
curve" of probability. I'm not going to consider it, except to say that
it is completely characterized by the mean (the center-point of the peak)
and the standard deviation (how wide the peak is). Specifically, the
probability of being within ±1.0 standard deviations of the mean is
0.68268949213.
(That value is erf(1/sqrt(2)), where erf is defined as an integral
involving y = e^{-x²}.)

Suppose you reduce each data value to an integer h in the range 0..N-1, using some "pseudorandom" process, and then put that data value in slot A[h] of an array of size N. What is the probability of no "collisions"?

For example, we have K=100 strings we want to store in an array of size N=200. Each string s is converted to a pseudorandom integer < 200, using a "hash" function h(s). The Poisson events are now the 100 different values of h(s), distributed "spatially" among the numbers < 200. Think of throwing 100 darts at a dartboard with 200 spaces. Each space will receive, on average, 1/2 a dart. How many spaces can we expect to receive 2 darts?

We can analyze this using the Birthday technique above; here, we have a
~50% chance of a collision if the number of data values is roughly
sqrt(N). So, either collisions are likely *or* the array A is not
very full. In the case above with K=100 and N=200, the probability of
collisions is quite high.

We can also use the Poisson distribution (as an approximation, but a good
one). If there are K data values, then each array slot has on average λ =
K/N occupants. Therefore, applying the Poisson distribution (really a
Binomial distribution but with a small p of 1/N), we get the probability
of k occupants for a cell is λ^{k}e^{-λ}/k!. Out of N
cells, the expected *count* of cells with k occupants is then Nλ^{k}e^{-λ}/k!

Number of empty array cells: Ne^{-λ}, which for N=200 and λ=0.5 is
200*0.60653 = 121.306 ≃ **121**

Number of array cells with one value: Nλe^{-λ}, which for N=200
and λ=0.5 works out to about **61**

Number of array cells with two values: Nλ^{2}e^{-λ}/2!.
This works out to about **15**. We can store two strings at
the same location by making our array of size 200 be an array of *lists*
(often so-called "linked lists"), so the collisions themselves are not the
issue. However, note that this means that, for ~30 strings (15 slots,
times 2 strings in each slot), we have a list to search.

Number of array cells with three values: Nλ^{3}e^{-λ}/3!
≃ **2.5**.

At this point, we have accounted for about 199.6 out of 200 array slots; having more than three strings that hash to the same slot is unlikely.

Once we calculate the number of k=0 (empty) cells, we can get subsequent ones by multiplying by λ/k:

| k | expected count |
|---|----------------|
| 0 | 121.306 |
| 1 | 121.306*0.5/1 = 60.65 |
| 2 | 60.65*0.5/2 = 15.16 |
| 3 | 15.16*0.5/3 = 2.53 |

The factor of λ moves the numerator from λ^{k-1} to λ^{k};
the k in the denominator moves that from 1/(k-1)! to 1/k!
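A simulation confirms these expected counts: throw K=100 "darts" at N=200 slots many times and average the occupancy counts. A sketch (seed and trial count are arbitrary):

```python
import random
from math import exp, factorial

K, N = 100, 200
lam = K / N

# Poisson prediction for the number of slots holding exactly k values
predicted = [N * lam**k * exp(-lam) / factorial(k) for k in range(4)]

rng = random.Random(1)
trials = 2000
observed = [0.0] * 4
for _ in range(trials):
    slots = [0] * N
    for _ in range(K):
        slots[rng.randrange(N)] += 1   # hash one value into a random slot
    for k in range(4):
        observed[k] += slots.count(k) / trials
```

The averaged `observed` counts land close to `predicted`, which is approximately [121.3, 60.7, 15.2, 2.5].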

Suppose A and B are events. The notation P(A|B) is the probability of A,
*given* that we know B. Formally, P(A|B) is defined as P(A∩B)/P(B).
The point is that the interrelatedness of A and B is important. If P(A|B)
= P(A), then A and B are said to be *independent*: knowing B does
not influence the probability of A. Another way to describe this is that
P(A)*P(B) = P(A∩B) (this follows from the definition of P(A|B) and the
definition of independence).

Here is Bayes' formula:

P(A|B) = P(B|A)*P(A)/P(B)

One of the classic applications is in testing. Suppose you have a medical test. The false-positive rate is 1%, and the false-negative rate is 1%. Sounds good, right?

But suppose the underlying condition occurs with frequency 1 in 1,000.
Let's let A be the event that someone has the condition, and B the
event that the test shows positive for that person. We want P(A|B):
if someone tests positive, how likely is it that they really *are*
positive?
We know P(A) = 0.001.

Also, P(B|A) = 0.99, the probability of a positive test *given*
that someone is sick. This is just 1.0 minus the false-negative rate.

It takes a little more thought to get P(B), the probability of a positive
test. If someone is not sick, the probability of a positive test is 0.01,
and the probability of not being sick is 0.999. Multiplying these gives
0.00999, which is the probability that someone is well *and* tests
positive.

We *could* round this off to 0.01 and stop there, but we really
should also count those who are sick. For those, the probability of a
positive test is 0.99, and the probability of *being* sick is 0.001.
So, the probability that someone is sick *and* tests positive is
0.99*0.001 = 0.00099. The exact value for P(B) is the sum of these two
joint probabilities, P(positive ∩ not sick) + P(positive ∩ sick), which
is 0.01*0.999 + 0.99*0.001 = 0.01098.

Maybe a better way to do this is to make a table, based on 100,000 people, of whom 100 are sick:

| | sick | not sick | row totals |
|---|------|----------|-----------|
| positive test | 99 | 999 | 1,098 |
| negative test | 1 | 98,901 | 98,902 |
| column totals | 100 | 99,900 | 100,000 |

Again, we get P(B) = 1,098/100,000.

All this leads to

P(A|B) = 0.99*0.001/0.01098 = 0.09016, or 9%.

So if we test a bunch of people, and get 100 who test positive, only 9 of
them actually will *be* positive.
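Bayes' formula here is one line of code; a sketch (function name is mine):

```python
def p_sick_given_positive(prior, false_pos, false_neg):
    """P(condition | positive test), via Bayes' formula."""
    p_pos_given_sick = 1.0 - false_neg
    # total probability of a positive test, over sick and well people
    p_pos = p_pos_given_sick * prior + false_pos * (1.0 - prior)
    return p_pos_given_sick * prior / p_pos
```

`p_sick_given_positive(0.001, 0.01, 0.01)` is about 0.090, the 9% above; with a prior of 0.5 instead, the very same test gives 0.99.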

This has serious implications for testing of rare diseases. It is widely
believed that we'd all be better off with widespread use of an
early-screening test for cancer. But most cancers are rare, and most tests
have a false-positive rate much higher than 1%. So the large majority of
people who would be treated would in fact be free of cancer! And since
cancer treatments tend to have very significant side effects, *it is
possible that people would, on average, be worse off with more
widespread testing*.

(This probably does not apply to testing for SARS-CoV-2, because P(A) is,
in exposed populations, often much higher.)