Basic Math on How Bloom Filter Works



home · about · subscribe

August 27, 2016 · -

I gave an overview on bloom filter in the previous blog post. But if I summarize the properties of bloom filter as a data structure;

We know that at this point that Bloom Filters are space efficient data structures which allow small false positive rate to be able to gain space from the items that are inserted into the Bloom Filter. As we increase the size of the Bloom Filter, we decrease the false positive rates as we reduce the hash collisions that occur for multiple items that are inserted in the Bloom Filter. Actually, as we increase the size of the bloom filter, we could approximate a vanilla hashmap. That means that we are actually not going to have hash collisions so that every item is mapped to a unique bit in the bloom filter.

In this note, I am going to go a little deeper; to detail how the internals of the bloom filter works as well as tradeoff in a problem formulation.

Formulation

Bloom filter is nothing but array of nn bits. We have generally kk hash functions to be able to insert multiple bits in the array for a given item.

We are using multiple hash functions because this makes full hash collisions less frequent and make the lookup more robust so that the false positive rate of the Bloom Filter would be very small. As you could imagine, multiple hash functions allow more than one collision per position in the bloom filter, but it may not report false positive rates as all of the hash value positions should return to 1 when there is going to be a false positive report.

The reason why we define bits per object is because I am going to give the error formula in terms of the number of bits per object.

Insertion

The way we insert an element in the bloom filter is to just set the positions for the values of the hash functions kk. The way we formulate this is the following:

For positions that are returned by the hash functions are going to be set to 1.

Lookup

Lookup is also very similar to inserts, we are going to check the positions for the value of the hash functions kk. If every single bit in each position is set to 1, then we would say the item is “probably” there. - xx: return True if and only if A[hi(x)]=1A[h_i(x)] = 1 for every i=1,2,,ki = 1, 2, \ldots, k

Error Rate

We could compute the false positive rate or error rate with the following formula. Note that as we increase the total hash functions, the error rate goes down as well as the we increase the bloom filter size since that would increase the total number of bits per object in dataset.

For a given bb, we could find the optimal kk by taking the derivative of the error rate with respect to kk.

How to set kk?

By doing so, for a given plausible bb, we could compute the total number of hash functions(kk) very easily.

All Rights Reserved

Copyright, 2020