The birthday paradox is the surprising result that if you have 23 people in a room, there is a 50% chance that two of them share a birthday. I want to explore this topic and verify that this is true.
Instead of delving into the maths with formulas, we will be using Monte Carlo methods with random simulations.
When answering questions about a group of people having shared birthdays, we need to assume one of two things.
- The birthdays are distributed uniformly, and each day is equally likely to be someone’s birthday.
- Some birthdays are more likely than others, and people tend to have kids more at certain times of the year.
First of all, I will assume that any day of the year is equally likely to be a birthday. In reality, some days are most likely more common than others. But this would only make the probability of same birthdays higher, so it’s not a bad assumption.
def make_room(N): days = range(365) return random.choices(days, k=N) make_room(5)
[21, 174, 343, 350, 119]
Let’s make a small function that checks if a room has a shared birthday.
def check_shared_birthday(room): bdays = set() for b in room: if b in bdays: return True bdays.add(b) return False check_shared_birthday([1, 2, 3, 4]) check_shared_birthday([4, 2, 3, 4])
Now we just do a bunch of trials to figure out the probability to some accuracy.
def birthday_probability(N, runs): results = np.zeros(runs) for i in range(runs): room = make_room(N) shared = check_shared_birthday(room) results[i] = shared return np.mean(results)
Looks reasonable. Let’s plot this over a range of
uniform_probs = [birthday_probability(x, 100_000) for x in range(75)] _ = plt.plot(uniform_probs) # Plot 0.5 probability _ = plt.axhline(0.5, color='orange', alpha=0.45) _ = plt.axvline(23, color='orange', alpha=0.45)
This worked well, but it’s not convenient to run a bunch of simulations every time you want to use this graph. Plus it only work for integers. Let’s fix those by approximating the function with a regression.
p = approx_poly(list(range(70)), uniform_probs, 7) p.weights
[0.003, -0.294, 8.822, -8.145, -12.648, 22.073, -8.815]
_ = plt.plot(uniform_probs) plot_fn(p, 0, 75, 0.01)
Much smoother, and much faster to execute.
Inverse of the function
Now we can use this approximation to calculate how many people are needed in a room for exactly fifty percent chance of a birthday collision.
f = lambda x: abs(p(x) - 0.5) x = 0 for i in np.arange(22.5, 23, 0.000001): if f(i) <= f(x): x = i x p(x)
Having 22.8 people in a room means the probability of having a birthday collision is 50%. Let’s plot this as well.
def birthday_reverse(x): f = lambda z: abs(p(z) - x) return min(list(np.arange(0, 75, 0.01)), key=f) plot_fn(birthday_reverse, 0, 1, 0.01)
Birthday distribution based on real data
Earlier we mentioned that birthdays are not perfectly uniform, and people tend to give birth more at specific parts of the year. Because of this, our previous calculations based on the assumption of uniform birthdays will be slightly off.
In order to make this more accurate, we need a dataset of birthdays. Rob Murphy 1 provides just that.
URL = "https://www.panix.com/~murphy/bdata.txt" bdata = requests.get(URL).text.split("\r\n")[1:-1]
days =  for row in bdata: day, count = row.split() count = int(count) for i in range(count): days.append(day)
Now all we need to do is to change
make_room to use this dataset instead of uniform random numbers.
def make_room(N): return random.choices(days, k=N)
Looks like in a room with 23 people, the probability of finding a shared birthday slightly went up. Let’s see how the graph looks.
bdata_probs = [birthday_probability(x, 100_000) for x in range(75)]
_ = plt.plot(uniform_probs) _ = plt.plot(bdata_probs)
Doesn’t look any different. Probably not worth bothering with this, assuming a uniform distribution will get you pretty much identical results.