A good hash function spreads its outputs evenly across its whole output range, no matter what the inputs look like.

To judge how uniform a distribution is, I’ll use the standard deviation of the counts: identical counts give a standard deviation of zero, and the more skewed they are, the higher it climbs.

In [2]:
import statistics

# A near-uniform list of counts vs. a skewed one.
statistics.stdev([5, 5, 6, 5, 5])
statistics.stdev([4, 4, 10, 4, 4])
Out [2]:
0.4472135954999579
Out [2]:
2.6832815729997477
In [3]:
def evaluate_hash(fn, values):
    # Hash every value into one of 256 buckets, then report the
    # standard deviation of the bucket counts: 0 means perfectly
    # uniform, larger means more skewed.
    counts = [0] * 256

    for val in values:
        h = fn(val)
        counts[h] += 1

    return statistics.stdev(counts)
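
The hash functions evaluated below (unkeyed_hash, crc32, and pearson_with_key) are defined earlier in the post; they all map a string to a single byte, which is why evaluate_hash uses 256 buckets. For context, here is a rough sketch of what an unkeyed 8-bit Pearson hash like unkeyed_hash could look like; the permutation table T and its seed are assumptions here, not the actual values used in the results below.

import random

# A fixed pseudo-random permutation of 0..255. Any fixed
# permutation works; this particular seed is arbitrary.
random.seed(0)
T = list(range(256))
random.shuffle(T)

def unkeyed_hash(value):
    h = 0
    for byte in value.encode():
        h = T[h ^ byte]  # walk the permutation table byte by byte
    return h

First, a quick sanity check that it produces different bytes for different messages: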
In [4]:
unkeyed_hash("Test message 1")
unkeyed_hash("Test message 2")
unkeyed_hash("Test message 3")
Out [4]:
168
Out [4]:
184
Out [4]:
236

Let’s see how well crc32 does on “normal” input.

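Note that crc32 here has to return a value in 0..255 to fit evaluate_hash’s buckets, so I’m assuming it was defined earlier as a thin byte-sized wrapper around the standard library’s CRC-32, something like this sketch:

from zlib import crc32 as zlib_crc32

def crc32(value):
    # Truncate zlib's 32-bit checksum to a single byte.
    return zlib_crc32(value.encode()) & 0xFF
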
In [5]:
values = []

for i in range(5000):
    values.append(f"Test {i}")
    
evaluate_hash(crc32, values)
Out [5]:
1.4656190554459485

Now let’s try to skew it by choosing the values adversarially. Since crc32 is public and has no key, anyone can search offline for inputs that all land in the same bucket.

In [6]:
values = []

# Brute-force search: keep only the inputs whose hash is exactly 5,
# so every value collides into the same bucket.
i = 0
while len(values) < 5000:
    val = f"Test {i}"
    if crc32(val) == 5:
        values.append(val)
    i += 1

evaluate_hash(crc32, values)
Out [6]:
312.5
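
That number is exactly what a total collapse looks like: one bucket holding all 5000 values among 255 zeros. You can confirm it directly:

counts = [0] * 256
counts[5] = 5000
statistics.stdev(counts)  # 312.5, matching the result above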

Wow. Every value collides, so any data structure or partitioning scheme that expects uniform output from its hash function would be degraded to worst-case behavior: a chained hash table turns into one long list, and a single partition takes all the load. Let’s run the same attack against a keyed hash.

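pearson_with_key builds a hash function from a key; the idea is that the key determines the permutation table, so every key yields a different, unpredictable mapping. A sketch of one way it might work, under the assumption that the key seeds the table shuffle (the real construction earlier in the post may differ):

import random

def pearson_with_key(key):
    # Derive a per-key permutation of 0..255, so different keys
    # produce unrelated hash functions.
    rng = random.Random(key)
    table = list(range(256))
    rng.shuffle(table)

    def hash_fn(value):
        h = 0
        for byte in value.encode():
            h = table[h ^ byte]
        return h

    return hash_fn
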
In [7]:
hash1 = pearson_with_key("Quentin Coldwater")
hash2 = pearson_with_key("Josh Hoberman")

values = []

# The same brute-force search as before, this time targeting hash1.
i = 0
while len(values) < 5000:
    val = f"Test {i}"
    if hash1(val) == 5:
        values.append(val)
    i += 1

evaluate_hash(hash1, values)
Out [7]:
312.5

A similar result, as expected: we searched for collisions under hash1, so of course it’s maximally skewed. But now let’s feed those same values to hash2, whose key the attacker never saw.

In [8]:
evaluate_hash(hash2, values)
Out [8]:
4.4860349757800115
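
The skew is gone. For reference, here is roughly what assigning 5000 values to 256 buckets uniformly at random gives; hash2’s 4.49 is right in line with it, so without the key, the adversarial values behave no differently than random input. (A quick simulation, not from the original results.)

import random
import statistics

counts = [0] * 256
for _ in range(5000):
    counts[random.randrange(256)] += 1

statistics.stdev(counts)  # typically around 4.4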