I want to go through the process of writing a set in pure Python, and use it to demonstrate the speed advantage of a HashSet when compared to a naive set.

What is a set?

In mathematics, a set is an unordered collection of distinct objects. While sets usually have many useful operations, the one we’ll be implementing today has only one useful property. It is a collection that can only contain each value once.

Let’s define a rough interface that we’ll be implementing. An implementation of set will define 3 methods.

  1. contains(value) will return whether or not the given value has been added to the set
  2. add(value) will add the value to the set
  3. remove(value) will remove the value from the set

Let’s make a naive implementation of a set using a Python list.

ListSet

Aside from ignoring duplicates, a Python list does what we want. The only thing we need to do is to check for existing values.

class ListSet:
    def __init__(self):
        self._list = []

    def contains(self, value):
        return value in self._list

    def add(self, value):
        self._list.append(value)

    def remove(self, value):
        self._list.remove(value)

That’s all the code for a ListSet. It is extremely simple, and the obvious implementation of our requirements.

The only problem with it is the fact that it gets very slow when you add a lot of elements into it. The reason for that is the fact that contains and remove both do a linear scan of the list for every operation. And the way we can mitigate this is with a buckets and hashes.

HashSet

In order to get around the requirement for a linear scan, we need to put our items into buckets. And when we are looking for an item, instead of scanning all of them, we can just scan the correct bucket.

In order to do this, we need a consistent way to turn a value into a bucket number. Python has such a function, called hash(value). This function can turn any Python type into an integer.

Let’s assume we have 15 buckets, and we want to classify the value "Hello" into one of them. We can simply use hash("Hello") % 15 as the bucket number.

Constructor

def __init__(self, N=30):
    self.n = N
    self.buckets = [[] for _ in range(N)]

The list comprehension creates N lists to serve as our buckets. We also store the number of buckets to use in our calculations later.