To judge the distribution, I’ll use the standard deviation of the bucket counts: a uniform distribution yields a low standard deviation, while a skewed one yields a high standard deviation.
>>> statistics.stdev([5, 5, 6, 5, 5])
0.4472135954999579
>>> statistics.stdev([4, 4, 10, 4, 4])
2.6832815729997477
import statistics

def evaluate_hash(fn, values):
    # Hash every value into one of 256 buckets, then measure how
    # uneven the bucket counts are.
    counts = [0] * 256
    for val in values:
        h = fn(val)
        counts[h] += 1
    return statistics.stdev(counts)

>>> unkeyed_hash("Test message 1")
168
>>> unkeyed_hash("Test message 2")
184
>>> unkeyed_hash("Test message 3")
236
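The definition of unkeyed_hash isn’t shown in this excerpt. Given that the keyed variant later is called pearson_with_key, it is presumably a classic Pearson hash; a minimal sketch, with a hypothetical permutation table (the real table isn’t shown, so these outputs won’t match the 168/184/236 above), might look like:

```python
import random

# Hypothetical 256-entry permutation table; a fixed seed just makes
# this sketch reproducible. The real table is not shown in the text.
_table = list(range(256))
random.Random(0).shuffle(_table)

def unkeyed_hash(message: str) -> int:
    """Classic Pearson hash: folds each input byte through a
    permutation table, producing one byte of output (0-255)."""
    h = 0
    for byte in message.encode("utf-8"):
        h = _table[h ^ byte]
    return h
```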
Let’s see how good the CRC32 hash function is on “normal” input.
>>> values = []
>>> for i in range(5000):
...     values.append(f"Test {i}")
...
>>> evaluate_hash(crc32, values)
1.4656190554459485
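Note that evaluate_hash indexes into 256 buckets, so the crc32 used here must already produce byte-sized output. Its definition isn’t shown in this excerpt; one plausible sketch (an assumption, not necessarily the text’s helper) reduces the standard library’s 32-bit CRC to one byte:

```python
import zlib

def crc32(value: str) -> int:
    # zlib.crc32 returns an unsigned 32-bit integer; take it modulo
    # 256 so the result fits the 256 buckets in evaluate_hash.
    # (Assumption: the text's crc32 helper does something similar.)
    return zlib.crc32(value.encode("utf-8")) % 256
```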
Now let’s try to skew it a little by choosing the values more carefully.
>>> values = []
>>> i = 0
>>> while len(values) < 5000:
...     val = f"Test {i}"
...     if crc32(val) == 5:
...         values.append(val)
...     i += 1
...
>>> evaluate_hash(crc32, values)
312.5
Wow, any data structure or partitioning that expects uniform output from a hash function would be devastated.
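The pearson_with_key constructor used next isn’t defined in this excerpt. A hypothetical sketch of a keyed Pearson hash, which derives a per-key permutation table and returns a hash function closed over it (the names and key-to-table derivation here are assumptions, not the text’s implementation):

```python
import hashlib
import random

def pearson_with_key(key: str):
    """Return a Pearson hash function whose permutation table is
    derived deterministically from the given key."""
    # Derive a seed from the key, then shuffle the table with it.
    seed = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")
    table = list(range(256))
    random.Random(seed).shuffle(table)

    def hash_fn(message: str) -> int:
        h = 0
        for byte in message.encode("utf-8"):
            h = table[h ^ byte]
        return h

    return hash_fn
```

Because each key yields a different permutation table, inputs chosen to collide under one key are spread out again under another, which is exactly the behavior the next experiment demonstrates.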
>>> hash1 = pearson_with_key("Quentin Coldwater")
>>> hash2 = pearson_with_key("Josh Hoberman")
>>> values = []
>>> i = 0
>>> while len(values) < 5000:
...     val = f"Test {i}"
...     if hash1(val) == 5:
...         values.append(val)
...     i += 1
...
>>> evaluate_hash(hash1, values)
312.5
A similar result, but now let’s try a different key with the same values.
>>> evaluate_hash(hash2, values)
4.4860349757800115