Pearson Hashing is a simple hash function that reads an arbitrary number of bytes and outputs a single byte. Instead of complicated algorithms, the Pearson hash uses a lookup table containing a permutation of all possible bytes.
In this article I will walk through how to implement the function, and how to extend its output into N bytes instead of just one.
Since the main part of the algorithm is the lookup table, let’s see how we might generate one.
def make_lookup(rng):
l = []
for i in range(256):
while True:
x = rng()
if x not in l:
l.append(x)
break
return lThis allows us to plug an RNG depending on our use case. For example we can choose Python’s random module.
import random
lookup = make_lookup(lambda: random.randint(0, 256))
# Inspect the first 10 elements
lookup[:10][244, 1, 137, 131, 254, 122, 237, 144, 112, 42]
We can also use /dev/urandom to generate our lookup table like this.
import os
lookup = make_lookup(lambda: os.urandom(1)[0])
# Inspect the first 10 elements
lookup[:10][219, 154, 235, 209, 172, 126, 127, 15, 70, 83]
def hash_single(lookup, buf):
h = 0
for byte in buf:
h = lookup[h ^ byte]
return hhash_single(lookup, b"Hello, world!")
hash_single(lookup, b"Hello, World!")
hash_single(lookup, b"Arbitrary-length inputs")22
219
91
There are two things that affect the output of this function, aside from the buffer we are hashing. The first one is the internal state, which is an 8-bit value. The second one is the lookup table.
If we change either of those, we will be able to get multiple bytes of output.
Let’s start with this method since it is easier.
Info With this method, you the maximum output size you will be able to get is 256 bytes. If more output is needed, initialize an empty state and feed the byte number to each state value before feeding the actual data.
def hash_multiple_starts(lookup, buf, size=1):
state = list(range(size))
for byte in buf:
for i in range(size):
state[i] = lookup[state[i] ^ byte]
return bytes(state)hash_multiple_starts(lookup, b"Hi there!", 1).hex()
hash_multiple_starts(lookup, b"Hi there!", 2).hex()
hash_multiple_starts(lookup, b"Hi there!", 3).hex()'25'
'25f7'
'25f723'
hash_multiple_starts(lookup, b"Hello, world!", 1).hex()
hash_multiple_starts(lookup, b"Hello, world!", 2).hex()
hash_multiple_starts(lookup, b"Hello, world!", 3).hex()'16'
'16aa'
'16aa9e'
BITS = 64
BYTES = int(BITS / 8)
lookups = [make_lookup(lambda: os.urandom(1)[0]) for _ in range(BYTES)]def hash_multiple_lookups(lookups, buf):
state = [0 for _ in lookups]
for byte in buf:
for i, s in enumerate(state):
state[i] = lookups[i][s ^ byte]
return bytes(state)hash_multiple_lookups(lookups, b"Hello world").hex()
hash_multiple_lookups(lookups, b"Hello World").hex()'c7a28c34f3d67d47'
'79a1dd2b2829948a'
def biased_coin_flips(n):
res = ""
for _ in range(n):
while True:
val = os.urandom(1)[0]
if val < 100:
break
if val < 90:
res += 'H'
else:
res += 'T'
return res
biased_coin_flips(40)'THHTHTHHTHHHHTHHHHHHHHHHHHHHHHHHHTHHHHHH'
Here’s a biased coin that results in Heads a lot more often than Tails. To be precise, 90% of all flips result in Heads. Let’s calculate how many flips we’ll need to extract 32 unbiased bits.
bits_per_flip = -math.log(0.9, 2)
32 / bits_per_flip210.5220313267387
for _ in range(10):
data = biased_coin_flips(211).encode('ascii')
hash_multiple_starts(lookup, data, 4).hex()'332b6166'
'caafb949'
'7418e373'
'b46f8495'
'10067eb6'
'fce7e552'
'7bf477e9'
'b18929fc'
'268538cc'
'd183a329'