bloom

An implementation of the code kata at http://codekata.com/kata/kata05-bloom-filters/

To build and run the tests

In the root directory, build a Docker image with the command

docker build -t bloom_filter .

To build and run the C tests, use the command

docker run -v $PWD:/src/app -w=/src/app bloom_filter /bin/bash -c 'make && make mem_test'

To run Python (statistics) tests, run the command

docker run -v $PWD:/src/app -w=/src/app bloom_filter /bin/bash -c 'cd python && python stats_test.py'

On Linux or Mac, you can use the provided convenience script docker_test_script.sh to execute these two commands in order.

Non-docker instructions

Alternatively, you can run everything outside of Docker if you prefer. You will need the following installed:

  • gcc compatible compiler
  • libcheck (I used version 0.11.0. At the time of this writing, apt-get install libcheck installs version 0.10.0, which may or may not work.)
  • Python 2.7
  • numpy
  • scipy
  • Optional: matplotlib for generating plots.

Run the command make && make run_tests to build and test the C library. Run python python/stats_test.py to run the statistics tests.

Random Strings

Random strings provided by www.random.org

Implementation Notes and Thoughts

This kata poses several interesting challenges. Here I log some of the thinking that went into specific implementation decisions.

The Bit Field

The first component I decided to implement was the bit field. We need the ability to set bits in some portion of memory and then check whether those bits were set. We have to choose what we mean by "the 3rd bit" in the field. I initially decided that within a byte, bits would be counted from right to left (i.e. least significant to most significant), and then bytes would be counted in memory order. So a 3-byte bit field would be counted as

Byte 0:   7  6  5  4  3  2  1  0
Byte 1:  15 14 13 12 11 10  9  8
Byte 2:  23 22 21 20 19 18 17 16

In order to force good memory alignment, I later switched to a uint64_t array instead of a char array, but used the same basic idea.
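As an illustration of the idea, here is a minimal sketch of the two operations the bit field needs, assuming a uint64_t backing array as described above (set_bit and test_bit are hypothetical names, not necessarily the library's actual API):

/* Sketch: set and test a bit in a uint64_t-backed bit field. Within each
 * 64-bit word, bits are counted least significant to most significant, and
 * words are counted in memory order, matching the numbering scheme above. */
#include <stdint.h>

void set_bit(uint64_t *field, uint32_t bit) {
    field[bit / 64] |= (uint64_t)1 << (bit % 64);
}

int test_bit(const uint64_t *field, uint32_t bit) {
    return (field[bit / 64] >> (bit % 64)) & 1;
}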

Number of bits

I decided to use 32-bit addresses for the bit field. This allows up to ~4 billion bits in the field. A Bloom filter of this size could contain approximately 270 million entries with a 0.1% false positive rate.
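As a quick sanity check on figures like this, the standard sizing approximation n ≈ -m (ln 2)^2 / ln p can be evaluated directly. A minimal sketch (not code from this repository; the exact figure depends on which approximation and rounding you use):

#include <math.h>
#include <stdio.h>

int main(void) {
    double m = 4294967296.0;  /* 2^32 bits, the most a 32-bit address can reach */
    double p = 0.001;         /* target false positive rate of 0.1% */

    /* Standard Bloom filter capacity approximation: n = -m * (ln 2)^2 / ln p */
    double n = -m * log(2.0) * log(2.0) / log(p);
    printf("approximate capacity: %.0f entries\n", n);

    return 0;
}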

Hash Functions

We need a collection of hash functions which map strings (const char *) to integers (uint32_t). For starters, our basic requirements are:

  • The same hash function gives different results for different strings (hash(i, str1) != hash(i, str2)).
  • Different hash functions give different results for the same string (hash(i, str) != hash(j, str)).

Of course, the hash functions will necessarily have collisions, but we'll ignore this complication for the moment.

The "Dumb" Hash

As a first pass, I used the hash function

uint32_t hash(int function_number, const char *str) {
    uint32_t h1 = 0;  /* running sum of the characters */
    uint32_t h2 = 0;  /* running XOR of the characters */

    while(*str) {
        h1 += *str;
        h2 ^= *str;

        str++;
    }

    /* function_number scales the XOR accumulator, giving a different function per index */
    return h1 + h2*function_number;
}

This collection of functions is awful, in that it does not satisfy the needs of a "good" hash function, but for the purposes of getting the basic Bloom filter up and running this will suffice.
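To show how a hash family like this plugs into the filter, here is a minimal sketch of the two core operations, reusing the hypothetical set_bit/test_bit from the bit field sketch above (bloom_add and bloom_check are illustrative names, not the library's actual API):

/* Sketch: adding a string sets one bit per hash function; checking reports
 * "possibly present" only if every one of those bits is set. */
void bloom_add(uint64_t *field, uint64_t num_bits, int num_hashes, const char *str) {
    for (int i = 0; i < num_hashes; i++)
        set_bit(field, hash(i, str) % num_bits);
}

int bloom_check(const uint64_t *field, uint64_t num_bits, int num_hashes, const char *str) {
    for (int i = 0; i < num_hashes; i++)
        if (!test_bit(field, hash(i, str) % num_bits))
            return 0;  /* at least one bit is clear: definitely not present */
    return 1;          /* all bits set: possibly present (could be a false positive) */
}

A check can return a false positive when unrelated additions happen to set all of a string's bits, which is exactly the behavior the tests described below ran into.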

A need to slightly improve

While writing a test which enforced the behavior that long words (in this case, 30 letters) could be read from the dictionary file, I encountered some bad behavior. The test loads the dictionary and then checks that the word can be found. Surprisingly, without even adding the word to the dictionary file, the test passed right away even though it should have failed. I suspected this was caused by a false positive, so I wrote another test which adds a word and then checks that a very similar word is not in the dictionary. This second test failed, confirming my suspicions. Decreasing the false positive rate of the filter had no effect, leading me to the conclusion that the dumb collection of hash functions was the culprit. I altered the hash function collection slightly to a second iteration:

const int a_big_prime_number = 0x0600BC09;

uint32_t hash(int function_number, const char *str) {
    uint32_t h1 = 0;                   /* running sum of the characters */
    uint32_t h2 = a_big_prime_number;  /* seed the XOR accumulator with a prime */

    while(*str) {
        h1 += *str;
        h2 ^= *str << 5;               /* shifting spreads each character's bits out */

        str++;
    }

    return h1 + h2*function_number;
}

I hadn't anticipated that I would need to alter the hash functions until I had reached the performance testing stage of development. After all, even though the functions were pretty bad, I assumed I would avoid collisions well enough in small early tests. Yet, interestingly enough, a rather simple test forced this change. This illustrates how important having "good" uniform hash functions is to the false positive rate of the Bloom filter. After changing the hash function in this way, my similar word test passed and the long word test failed, which is the behavior I expected in the first place.

Should we write deterministic tests for the existence of false positives?

This seemed like an interesting question, and I have seen implementations of the Bloom filter which do this. I decided not to write this kind of test, however, because

  • It would require engineering collisions, which, while not the most complicated thing in the world, would at least take a little effort.
  • The test would need to be rewritten every time the hash function implementation changed.

These complications could be avoided by mocking the hash functions, but at that point, what are you even testing that isn't already under test?

Profiling and Performance

With the core Bloom filter implemented, we are ready to start profiling. When creating a filter of a particular capacity and false positive rate, we assume that our hash functions are sufficiently uniform. We would like to verify this assumption. We also want to confirm that we are actually attaining the desired false positive rate (or lower).

To do this, I decided to create Python wrappers for the Bloom library. I could then leverage all of Python's excellent math and graphing capabilities. While a more robust Python binding might be desirable, I kept the wrapper to the bare minimum required for the profiling.

First Profile

We want our hash functions to be uniform. So how do they stack up? Below is a histogram of the frequency of the hashes of words in /usr/share/dict/words, a file which on my machine contains 235886 words.

[Figure hashplot1: histogram of hash value frequencies]

Obviously there is room for improvement!
