Pearson Hashing

Pearson Hashing is a simple hash function that reads an arbitrary number of bytes and outputs a single byte. Instead of complicated algorithms, the Pearson hash uses a lookup table containing a permutation of all possible bytes.

In this article I will walk through how to implement the function, and how to extend its output into N bytes instead of just one.

Since the main part of the algorithm is the lookup table, let’s see how we might generate one.

Generating lookup tables

In [2]:

def make_lookup(rng):
    l = []
    for i in range(256):
        while True:
            x = rng()
            if x not in l:
                l.append(x)
                break
    return l

This allows us to plug an RNG depending on our use case. For example we can choose Python’s random module.

In [3]:

import random
lookup = make_lookup(lambda: random.randint(0, 256))

# Inspect the first 10 elements
lookup[:10]

Out [3]:

[244, 1, 137, 131, 254, 122, 237, 144, 112, 42]

In [4]:

import os
lookup = make_lookup(lambda: os.urandom(1)[0])

# Inspect the first 10 elements
lookup[:10]

Out [4]:

[219, 154, 235, 209, 172, 126, 127, 15, 70, 83]

Hashing (single-byte)

In [5]:

def hash_single(lookup, buf):
    h = 0
    for byte in buf:
        h = lookup[h ^ byte]
    return h

In [6]:

hash_single(lookup, b"Hello, world!")
hash_single(lookup, b"Hello, World!")
hash_single(lookup, b"Arbitrary-length inputs")

Out [6]:

Hashing (multi-byte)

Theory

There are two things that affect the output of this function, aside from the buffer we are hashing. The first one is the internal state, which is an 8-bit value. The second one is the lookup table.

Different starting points

In [7]:

def hash_multiple_starts(lookup, buf, size=1):
    state = list(range(size))
    
    for byte in buf:
        for i in range(size):
            state[i] = lookup[state[i] ^ byte]
    
    return bytes(state)

In [8]:

hash_multiple_starts(lookup, b"Hi there!", 1).hex()
hash_multiple_starts(lookup, b"Hi there!", 2).hex()
hash_multiple_starts(lookup, b"Hi there!", 3).hex()

Out [8]:

'25'

Out [8]:

'25f7'

Out [8]:

'25f723'

In [9]:

hash_multiple_starts(lookup, b"Hello, world!", 1).hex()
hash_multiple_starts(lookup, b"Hello, world!", 2).hex()
hash_multiple_starts(lookup, b"Hello, world!", 3).hex()

Out [9]:

'16'

Out [9]:

'16aa'

Out [9]:

'16aa9e'

Multiple lookup tables

In [10]:

BITS = 64
BYTES = int(BITS / 8)

lookups = [make_lookup(lambda: os.urandom(1)[0]) for _ in range(BYTES)]

In [11]:

def hash_multiple_lookups(lookups, buf):
    state = [0 for _ in lookups]
    
    for byte in buf:
        for i, s in enumerate(state):
            state[i] = lookups[i][s ^ byte]
    
    return bytes(state)

In [12]:

hash_multiple_lookups(lookups, b"Hello world").hex()
hash_multiple_lookups(lookups, b"Hello World").hex()

Out [12]:

'c7a28c34f3d67d47'

Out [12]:

'79a1dd2b2829948a'

Use Cases

Randomness extractor

In [13]:

def biased_coin_flips(n):
    res = ""
    
    for _ in range(n):
        while True:
            val = os.urandom(1)[0]
            if val < 100:
                break
                
        if val < 90:
            res += 'H'
        else:
            res += 'T'
    return res

biased_coin_flips(40)

Out [13]:

'THHTHTHHTHHHHTHHHHHHHHHHHHHHHHHHHTHHHHHH'

Here’s a biased coin that results in Heads a lot more often than Tails. To be precise, 90% of all flips result in Heads. Let’s calculate how many flips we’ll need to extract 32 unbiased bits.

In [14]:

bits_per_flip = -math.log(0.9, 2)

32 / bits_per_flip

Out [14]:

210.5220313267387

In [15]:

for _ in range(10):
    data = biased_coin_flips(211).encode('ascii')
    hash_multiple_starts(lookup, data, 4).hex()

Out [15]:

'332b6166'

Out [15]:

'caafb949'

Out [15]:

'7418e373'

Out [15]:

'b46f8495'

Out [15]:

'10067eb6'

Out [15]:

'fce7e552'

Out [15]:

'7bf477e9'

Out [15]:

'b18929fc'

Out [15]:

'268538cc'

Out [15]:

'd183a329'

Pearson Hashing

Generating lookup tables

Hashing (single-byte)

Hashing (multi-byte)

Theory

Different starting points

Multiple lookup tables

Use Cases

Randomness extractor

Leave a Comment