To judge the distribution, I’ll use the standard deviation of the bucket counts: a uniform distribution yields a low standard deviation, while a skewed one yields a high standard deviation.
>>> statistics.stdev([5, 5, 6, 5, 5])
0.4472135954999579
>>> statistics.stdev([4, 4, 10, 4, 4])
2.6832815729997477
import statistics

def evaluate_hash(fn, values):
    # Hash every value into one of 256 buckets, then measure how
    # uneven the bucket counts are.
    counts = [0] * 256
    for val in values:
        h = fn(val)
        counts[h] += 1
    return statistics.stdev(counts)

>>> unkeyed_hash("Test message 1")
168
>>> unkeyed_hash("Test message 2")
184
>>> unkeyed_hash("Test message 3")
236
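The definition of unkeyed_hash isn’t shown in this excerpt. Given that the keyed variant later is called pearson_with_key, it is presumably a classic Pearson hash; a minimal sketch, with a hypothetical permutation table (the real table isn’t shown, so these outputs won’t match the 168/184/236 above), might look like:

```python
import random

# Hypothetical 256-entry permutation table; a fixed seed just makes
# this sketch reproducible. The real table is not shown in the text.
_table = list(range(256))
random.Random(0).shuffle(_table)

def unkeyed_hash(message: str) -> int:
    """Classic Pearson hash: folds each input byte through a
    permutation table, producing one byte of output (0-255)."""
    h = 0
    for byte in message.encode("utf-8"):
        h = _table[h ^ byte]
    return h
```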
Let’s see how good the CRC32 hash function is on “normal” input.
>>> values = []
>>> for i in range(5000):
...     values.append(f"Test {i}")
...
>>> evaluate_hash(crc32, values)
1.4656190554459485
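Note that evaluate_hash indexes into 256 buckets, so the crc32 used here must already produce byte-sized output. Its definition isn’t shown in this excerpt; one plausible sketch (an assumption, not necessarily the text’s helper) reduces the standard library’s 32-bit CRC to one byte:

```python
import zlib

def crc32(value: str) -> int:
    # zlib.crc32 returns an unsigned 32-bit integer; take it modulo
    # 256 so the result fits the 256 buckets in evaluate_hash.
    # (Assumption: the text's crc32 helper does something similar.)
    return zlib.crc32(value.encode("utf-8")) % 256
```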
Now let’s try to skew it a little by choosing the values more carefully.
>>> values = []
>>> i = 0
>>> while len(values) < 5000:
...     val = f"Test {i}"
...     if crc32(val) == 5:
...         values.append(val)
...     i += 1
...
>>> evaluate_hash(crc32, values)
312.5
Wow, any data structure or partitioning that expects uniform output from a hash function would be devastated.
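The pearson_with_key constructor used next isn’t defined in this excerpt. A hypothetical sketch of a keyed Pearson hash, which derives a per-key permutation table and returns a hash function closed over it (the names and key-to-table derivation here are assumptions, not the text’s implementation):

```python
import hashlib
import random

def pearson_with_key(key: str):
    """Return a Pearson hash function whose permutation table is
    derived deterministically from the given key."""
    # Derive a seed from the key, then shuffle the table with it.
    seed = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")
    table = list(range(256))
    random.Random(seed).shuffle(table)

    def hash_fn(message: str) -> int:
        h = 0
        for byte in message.encode("utf-8"):
            h = table[h ^ byte]
        return h

    return hash_fn
```

Because each key yields a different permutation table, inputs chosen to collide under one key are spread out again under another, which is exactly the behavior the next experiment demonstrates.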
>>> hash1 = pearson_with_key("Quentin Coldwater")
>>> hash2 = pearson_with_key("Josh Hoberman")
>>> values = []
>>> i = 0
>>> while len(values) < 5000:
...     val = f"Test {i}"
...     if hash1(val) == 5:
...         values.append(val)
...     i += 1
...
>>> evaluate_hash(hash1, values)
312.5
A similar result, but now let’s try a different key with the same values.
>>> evaluate_hash(hash2, values)
4.4860349757800115