Pearson Hashing is a simple hash function that reads an arbitrary number of bytes and outputs a single byte. Instead of complicated algorithms, the Pearson hash uses a lookup table containing a permutation of all possible bytes.

In this article I will walk through how to implement the function, and how to extend its output into N bytes instead of just one.

Since the main part of the algorithm is the lookup table, let’s see how we might generate one.

Generating lookup tables

In [2]:
def make_lookup(rng):
    l = []
    for i in range(256):
        while True:
            x = rng()
            if x not in l:
                l.append(x)
                break
    return l

This allows us to plug an RNG depending on our use case. For example we can choose Python’s random module.

In [3]:
import random
lookup = make_lookup(lambda: random.randint(0, 256))

# Inspect the first 10 elements
lookup[:10]
Out [3]:
[244, 1, 137, 131, 254, 122, 237, 144, 112, 42]

We can also use /dev/urandom to generate our lookup table like this.

In [4]:
import os
lookup = make_lookup(lambda: os.urandom(1)[0])

# Inspect the first 10 elements
lookup[:10]
Out [4]:
[219, 154, 235, 209, 172, 126, 127, 15, 70, 83]

Hashing (single-byte)

In [5]:
def hash_single(lookup, buf):
    h = 0
    for byte in buf:
        h = lookup[h ^ byte]
    return h
In [6]:
hash_single(lookup, b"Hello, world!")
hash_single(lookup, b"Hello, World!")
hash_single(lookup, b"Arbitrary-length inputs")
Out [6]:
22
Out [6]:
219
Out [6]:
91

Hashing (multi-byte)

Theory

There are two things that affect the output of this function, aside from the buffer we are hashing. The first one is the internal state, which is an 8-bit value. The second one is the lookup table.

If we change either of those, we will be able to get multiple bytes of output.

Different starting points

Let’s start with this method since it is easier.

Info With this method, you the maximum output size you will be able to get is 256 bytes. If more output is needed, initialize an empty state and feed the byte number to each state value before feeding the actual data.

In [7]:
def hash_multiple_starts(lookup, buf, size=1):
    state = list(range(size))
    
    for byte in buf:
        for i in range(size):
            state[i] = lookup[state[i] ^ byte]
    
    return bytes(state)
In [8]:
hash_multiple_starts(lookup, b"Hi there!", 1).hex()
hash_multiple_starts(lookup, b"Hi there!", 2).hex()
hash_multiple_starts(lookup, b"Hi there!", 3).hex()
Out [8]:
'25'
Out [8]:
'25f7'
Out [8]:
'25f723'
In [9]:
hash_multiple_starts(lookup, b"Hello, world!", 1).hex()
hash_multiple_starts(lookup, b"Hello, world!", 2).hex()
hash_multiple_starts(lookup, b"Hello, world!", 3).hex()
Out [9]:
'16'
Out [9]:
'16aa'
Out [9]:
'16aa9e'

Multiple lookup tables

In [10]:
BITS = 64
BYTES = int(BITS / 8)

lookups = [make_lookup(lambda: os.urandom(1)[0]) for _ in range(BYTES)]
In [11]:
def hash_multiple_lookups(lookups, buf):
    state = [0 for _ in lookups]
    
    for byte in buf:
        for i, s in enumerate(state):
            state[i] = lookups[i][s ^ byte]
    
    return bytes(state)
In [12]:
hash_multiple_lookups(lookups, b"Hello world").hex()
hash_multiple_lookups(lookups, b"Hello World").hex()
Out [12]:
'c7a28c34f3d67d47'
Out [12]:
'79a1dd2b2829948a'

Use Cases

Randomness extractor

In [13]:
def biased_coin_flips(n):
    res = ""
    
    for _ in range(n):
        while True:
            val = os.urandom(1)[0]
            if val < 100:
                break
                
        if val < 90:
            res += 'H'
        else:
            res += 'T'
    return res

biased_coin_flips(40)
Out [13]:
'THHTHTHHTHHHHTHHHHHHHHHHHHHHHHHHHTHHHHHH'

Here’s a biased coin that results in Heads a lot more often than Tails. To be precise, 90% of all flips result in Heads. Let’s calculate how many flips we’ll need to extract 32 unbiased bits.

In [14]:
bits_per_flip = -math.log(0.9, 2)

32 / bits_per_flip
Out [14]:
210.5220313267387
In [15]:
for _ in range(10):
    data = biased_coin_flips(211).encode('ascii')
    hash_multiple_starts(lookup, data, 4).hex()
Out [15]:
'332b6166'
Out [15]:
'caafb949'
Out [15]:
'7418e373'
Out [15]:
'b46f8495'
Out [15]:
'10067eb6'
Out [15]:
'fce7e552'
Out [15]:
'7bf477e9'
Out [15]:
'b18929fc'
Out [15]:
'268538cc'
Out [15]:
'd183a329'