Skip to content

Latest commit

 

History

History
229 lines (166 loc) · 11 KB

ch12.asciidoc

File metadata and controls

229 lines (166 loc) · 11 KB

Bloom Filters

In the previous chapter we learned how to validate a Merkle Block. A full node can provide a Proof of Inclusion for transactions of interest through the merkleblock command. But how does the full node know which transactions are of interest?

A light client could tell the full node its addresses (or ScriptPubKeys). The full node can check for transactions that are relevant to these addresses, but that would be compromising the light client’s privacy. A light client wouldn’t want to reveal, for example, that it has 1000 BTC to a full node. Privacy leaks are security leaks, and in Bitcoin, it’s generally a good idea to not leak any privacy whenever possible.

One solution is for the light client to tell the full node enough information to create a superset of all transactions of interest. To create this superset, we use what’s called a Bloom Filter.

What is a Bloom Filter?

A Bloom Filter is a filter for all possible transactions. Full nodes run transactions through a Bloom Filter and send merkleblock commands for transactions that make it through.

Suppose there are 50 total transactions. There is one transaction a light client is interested in. The light client wants to "hide" the transaction among a group of 5 transactions. This requires a function that groups the 50 transactions into 10 different buckets and the full node can then send single bucket of transactions, in a manner of speaking. This grouping would have to be deterministic, that is, be the same each time. How can this be accomplished?

The solution is to use a hash function to get a deterministic number and modulo to organize transactions into buckets.

A Bloom Filter is a computer science structure that can be used on any data in a set, so suppose that we have one item "hello world" that we want to create a Bloom Filter for. We need a hash function, so we’ll use one we already have hash256. The process of figuring out what bucket our item goes into looks like this:

link:code-ch12/examples.py[role=include]
  1. Our bit_field is the list of "buckets" and we want there to be 10.

  2. Hash the item with hash256.

  3. Interpret this as a Big-Endian integer and modulo by 10 to determine the bucket this item belongs to.

  4. We indicate the bucket we want in the Bloom Filter.

Conceptually, what we just did looks like Figure 12-1:

Simple Bloom Filter
Figure 1. 10-bit Bloom Filter with 1 element

Our Bloom Filter consists of:

  1. The size of the bit field

  2. The hash function used (and how we converted that to a number)

  3. The bit field, which indicates the bucket we’re interested in.

This works great for a single item, so this would work for a single address/ScriptPubKey/Transaction ID of interest. What do we do when we’re interested in more than 1 item?

We can run a second item through the same filter and set that bit to 1 as well. The full node can then send multiple buckets of transactions instead of a single bucket. Let’s create a Bloom Filter with two items, "hello world" and "goodbye" (Figure 12-2)

Two Item Bloom Filter
Figure 2. 10-bit Bloom Filter with 2 elements
link:code-ch12/examples.py[role=include]
  1. We are creating a filter for two items here, but this can be extended to many more.

If the space of all possible items is 50, 10 items on average will make it through this filter instead of the 5 when we only had 1 item of interest because 2 buckets are returned, not 1.

Going a step further

Supposing that the space of all items is 1 million and we want bucket sizes to still be 5, we would need a Bloom Filter that’s 1,000,000 / 5 = 200,000 bits long. Each bucket would have on average 5 items and we would get 5 times the number of items we’re interested in, 20% of which would be items of interest. 200,000 bits is 25,000 bytes and is a lot to transmit. Can we do better?

A Bloom Filter using multiple hash functions can shorten the bit field considerably. If we use 5 hash functions over a bit field of 32, we have 32!/(27!5!) ~ 200,000 possible 5-bit combinations in that 32-bit field. Of 1 million possible items, 5 items on average should have that 5-bit combination. Instead of transmitting 25k bytes, we can transmit just 32 bits or 4 bytes!

Here’s what that would look like. For simplicity, we stick to the same 10-bit field but still have 2 items of interest:

link:code-ch12/examples.py[role=include]
  1. We iterate over 2 different hash functions (hash256 and hash160), but we could just as easily have more.

Conceptually, Figure 12-3 shows what the code above does:

Multiple Hash Functions
Figure 3. 10-bit Bloom Filter with 2 elements and 2 hash functions

A Bloom Filter can be optimized by changing the number of hash functions and bit field size to get a desirable false-positive rate.

BIP0037 Bloom Filters

BIP0037 specifies Bloom Filters in network communication. The information contained in a Bloom Filter are:

  1. The size of the bit field, or how many buckets there are. The size is specified in bytes (8 bits per byte) and rounded up if necessary.

  2. The number of hash functions.

  3. A "tweak" to be able to change the Bloom Filter slightly if it hits too many items.

  4. The bit field that results from running the Bloom Filter over the items of interest.

While we could define lots of hash functions (sha512, keccak384, ripemd160, blake256, etc), in practice, we use a single hash function with a different seed. This allows the implementation to be simpler.

The hash function we use is called murmur3. Unlike sha256, murmur3 is not cryptographically secure, but is much faster. The task of filtering and getting a deterministic, evenly distributed modulo does not require cryptographic security, but benefits from speed, so murmur3 is the appropriate tool for the job. The seed formula is defined this way:

i*0xfba4c795 + tweak

The fba4c795 number is a constant for Bitcoin Bloom Filters. i is 0 for the first hash function, 1 for the second, 2 for the third and so on. The tweak is a bit of entropy that can be added if the results of of one tweak are not satisfactory. The hash functions and the size of the bit field are used to calculate the bit field which then get transmitted.

link:code-ch12/examples.py[role=include]
  1. murmur3 is implemented in pure Python in helper.py

  2. BIP37_CONSTANT is the fba4c795 number specified in BIP0037

  3. Iterate over some items of interest.

  4. We use 2 hash functions.

  5. Seed formula.

  6. murmur3 returns a number, so we don’t have to do any conversion to an integer.

This 2-byte Bloom Filter has 4 bits set to 1 out of 16, so the probability of any random item passing through this filter is 1/4*1/4=1/16. If space of all items numbers 160, a client will receive 10 items on average, 2 of which will be interesting.

We can start coding a BloomFilter class.

link:code-ch12/bloomfilter.py[role=include]

Loading a Bloom Filter

Once a light client has created a Bloom Filter, it needs to let the full node know the details of the filter so the full node can send Proofs of Inclusion. The first thing a light client must do is set the optional relay flag in the version message (see [chapter_networking]) to 1. This tells the full node not to send transaction messages unless they match a Bloom Filter or has specifically requested the message. After the relay flag, a light client then communicates to the full node the Bloom Filter itself. The command to set the Bloom Filter is called filterload. The payload looks like Figure 12-4:

filterload Command
Figure 4. Parsed filterload

The elements of a Bloom Filter are encoded into bytes. The bit field, hash function count and tweak are encoded in this message. The last field, matched item flag, is a way of asking the full node to add any matched transactions to the Bloom Filter.

Getting Merkle Blocks

There is one more command that a light client needs and that’s Merkle Block information about transactions of interest from the full node. The getdata command is what communicates blocks and transactions. The specific type of data that a light client will want from a full node is something called a filtered block. A filtered block is asking for transactions that pass through the bloom filter in the form of Merkle Blocks. In other words, the light client can ask for Merkle Blocks whose transactions of interest match the Bloom Filter.

Here is what the payload for getdata looks like Figure 12-5:

getdata Command
Figure 5. Parsed getdata

The number of items as a varint specify how many items we want. The each item has a type. A type value of 1 is a Transaction ([chapter_tx_parsing]), 2, a normal Block ([chapter_blocks]), 3, a Merkle Block ([chapter_spv]) and 4, a Compact Block (not covered in this book).

We can create this message in network.py.

link:code-ch12/network.py[role=include]
  1. Store the items we want.

  2. We add items to the message using the add_data method.

Getting Transactions of Interest

A light client which loads a Bloom Filter with a full node will get all the information needed to prove that transactions of interest are included in particular blocks.

link:code-ch12/examples.py[role=include]
  1. We are creating a Bloom Filter that’s 30 bytes, 5 hash functions using a particularly popular 90’s tweak.

  2. We filter for the address above.

  3. We send the filterload command from the Bloom Filter we made.

  4. We get the block headers after last_block_hex.

  5. We create a getdata message for Merkle Blocks that may have transactions of interest.

  6. We request a Merkle Block proving transactions of interest to us are included. Most blocks will probably be complete misses.

  7. The getdata message asks for 2000 Merkle Blocks after the block defined by last_block_hex.

  8. We wait for the merkleblock command, which proves inclusion and the tx command which gives us the transaction of interest.

  9. We check that the Merkle Block proves transaction inclusion.

  10. We’re looking for UTXOs for address, and we print to screen if we find one.

What we’ve done in the above is look at 2000 blocks after a particular block for UTXOs corresponding to a particular address. This is without the use of any block explorer, which preserves, to some degree, our privacy.

Conclusion

In this chapter, we created everything necessary to connecting peer to peer as a light client, ask for and receive the UTXOs necessary to construct a transaction all while preserving some privacy by using a Bloom Filter.

We now turn to Segwit, which is a new type of transaction that was activated in 2017.