
Improving in-memory performance


This document explores ideas for improving the in-memory performance of the fractal tree library. At the time of this writing, we are about to ship TokuMX 1.0.3 and TokuDB 7.0.3.

Currently, our bottlenecks on in-memory performance seem to be the following: the OMTs used to store basement nodes, and the OMTs used to index the message buffers of internal nodes, exhibit poor caching behavior. Although the two issues are very similar, let's look at them individually, starting with the basement nodes.

As of July 19, 2013, on master, the in-memory representation of a basement node is as follows. At a high level, the data is stored in a mempool, and the OMT holds pointers that happen to point into the mempool. When a basement node is read into memory, the data is laid out, in key order, in a mempool. An OMT, whose values are pointers into the mempool, is then built. So, suppose we have a mempool addressed at 0xa0000000 (I do not know or care whether that can be a real address), and suppose each leafentry is exactly 16 bytes. Then the OMT built for the basement node will have the values (0xa0000000, 0xa0000010, 0xa0000020, ...).
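
To make this layout concrete, here is a minimal sketch, with illustrative names only (these are not the actual library structs), of a basement node as described above:

```cpp
// Sketch only: leafentries live contiguously in a mempool, and the OMT
// stores raw pointers into that mempool, in key order.
#include <cstddef>
#include <vector>

struct mempool {
    char   *base;        // start of the allocated region, e.g. 0xa0000000
    size_t  size;        // total bytes allocated
    size_t  free_offset; // next unused byte
};

struct basement_node {
    mempool mp;
    // OMTVALUEs: direct pointers to leafentries inside mp, sorted by key.
    // With 16-byte leafentries and base 0xa0000000, this holds
    // 0xa0000000, 0xa0000010, 0xa0000020, ...
    std::vector<void *> omt_values;
};
```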

OMTs currently have two formats. One is the "array" format, which is exactly what it sounds like: basically just an array of OMTVALUEs. This format is very space-efficient, and it makes it simple to create OMTs from arrays, which we do in multiple places such as the transaction/MVCC code. The other is the "tree" format. In the tree format, the internal representation is still an array, but OMT nodes store pointers to their children. This format is a little less space-efficient.

OMTs start in array format. As soon as an OMTVALUE is inserted or deleted, the OMT converts to the tree format. So, a basement node that is brought into memory and only ever queried should always remain in array format. When the OMT is rebalanced, it is rebalanced into the array format.
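
Here is a rough sketch of the two formats and the transition rule just described; the names are illustrative stand-ins, not the OMT's actual internals:

```cpp
#include <cstdint>
#include <vector>

typedef void *OMTVALUE;

// One node of the "tree" format: a value plus child indexes into the
// backing node array.
struct omt_tree_node {
    OMTVALUE value;
    uint32_t left, right;  // indexes of children in the node array
    uint32_t weight;       // subtree size, used when deciding to rebalance
};

struct omt_sketch {
    bool is_array = true;              // freshly built OMTs are arrays
    std::vector<OMTVALUE> values;      // "array" format: sorted OMTVALUEs
    std::vector<omt_tree_node> nodes;  // "tree" format backing array
    uint32_t root = 0;

    // Any insert or delete forces the tree format, so a basement node that
    // is only ever queried stays in the compact array format.
    void before_mutation() {
        if (is_array) { /* build nodes[] from values[] ... */ is_array = false; }
    }

    // Rebalancing rebuilds the whole OMT back into the array format.
    void rebalance() {
        /* flatten nodes[] back into values[] ... */ is_array = true;
    }
};
```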

Additionally, when a basement node's mempool runs out of space, a new mempool is allocated and the elements of the OMT are packed into it in sequential order. This process is called compacting the OMT. Note that we cannot do a simple realloc, because the OMT stores actual pointers into the mempool, not offsets into it.
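
Roughly, compaction has to do something like the following (reusing the basement_node sketch above; le_size is a hypothetical callback standing in for however the library computes a leafentry's size):

```cpp
// Because the OMT holds raw addresses, growing the mempool means allocating
// a fresh one, copying leafentries over in OMT order, and rewriting every
// OMTVALUE; a plain realloc would leave the OMT full of dangling pointers
// if the region moved.
#include <cstdlib>
#include <cstring>

void compact_basement_node(basement_node *bn, size_t new_size,
                           size_t (*le_size)(const void *)) {
    mempool new_mp;
    new_mp.base = (char *)malloc(new_size);
    new_mp.size = new_size;
    new_mp.free_offset = 0;

    for (void *&v : bn->omt_values) {
        size_t sz = le_size(v);
        char *dst = new_mp.base + new_mp.free_offset;
        memcpy(dst, v, sz);  // pack sequentially into the new pool
        v = dst;             // rewrite the OMT's pointer
        new_mp.free_offset += sz;
    }
    free(bn->mp.base);
    bn->mp = new_mp;
}
```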

Now that we understand how a basement node is manipulated, let's discuss the inefficiencies, in no particular order.

I see three inefficiencies, in order of importance; the first one is much bigger. Right now, to search a basement node, as we walk down the OMT we must dereference keys that are sparsely located within the mempool. I say sparsely because each key is packed inside a leafentry that also contains vals, and the vals may be quite big, hundreds of bytes. So, each key may reside in its own L1 cacheline, practically guaranteeing a cache miss per level of the search. A second inefficiency is that the entire OMT probably does not fit in a single cacheline, so searching the OMT WITHOUT dereferencing any keys will probably still result in a bunch of cache misses. The third inefficiency is that when we want to grow the mempool, we have to iterate over the contents of the OMT and pack them into a new space. This is because inside the OMT, we store the direct address inside the mempool instead of an offset into the mempool.
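
To illustrate the first inefficiency, here is a sketch of a search over an array-format OMT. le_key is a hypothetical accessor (not the library's real leafentry decoder); the point is that every probe chases a pointer into the mempool just to reach the key:

```cpp
// A binary search touches O(log n) entries, and each comparison must
// dereference a pointer into the mempool. Since keys sit inside leafentries
// next to potentially large vals, each probe likely lands on a cold cacheline.
#include <cstddef>
#include <cstring>
#include <vector>

int find_in_array_omt(const std::vector<void *> &omt_values,
                      const char *key, size_t keylen,
                      const char *(*le_key)(const void *le, size_t *lenp)) {
    int lo = 0, hi = (int)omt_values.size() - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        size_t klen;
        const char *k = le_key(omt_values[mid], &klen); // pointer chase: likely cache miss
        int c = memcmp(k, key, klen < keylen ? klen : keylen);
        if (c == 0) c = (klen < keylen) ? -1 : (klen > keylen) ? 1 : 0;
        if (c == 0) return mid;       // found
        if (c < 0) lo = mid + 1; else hi = mid - 1;
    }
    return -1;                        // not found
}
```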

Solving the third inefficiency is simple. Instead of storing an eight-byte address in the OMT, we can store a four-byte offset. On pure insertion workloads, we can then do a simple realloc when the mempool runs out of space, and compact the key space only when we notice the mempool becoming heavily fragmented by deletes. I prototyped this once upon a time in trac ticket #4135, but there was too much other stuff happening for the change to make it in. Solving this will also enable dealing with the first two inefficiencies.
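
A minimal sketch of that idea, again with made-up names: offsets are position-independent, so the pool can be grown with realloc without touching the OMT.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

struct offset_basement_node {
    char   *base;                      // mempool base; may move on realloc
    size_t  size, free_offset;
    std::vector<uint32_t> omt_offsets; // offsets of leafentries, in key order
};

void *get_leafentry(const offset_basement_node &bn, uint32_t i) {
    return bn.base + bn.omt_offsets[i]; // one add turns an offset into an address
}

void grow_mempool(offset_basement_node *bn, size_t new_size) {
    // No compaction pass needed: even if the region moves, every stored
    // offset remains valid relative to the new base.
    bn->base = (char *)realloc(bn->base, new_size);
    bn->size = new_size;
}
```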

High-level challenges and thoughts on whatever we do:

  • getting serialization and deserialization to work right
  • getting rebalancing of leaf nodes on checkpoint clone to work right. We should also be careful of adding work to this path, as it may hurt checkpoint variability.
  • Assuming we want to use a different OMT, I don't want to change the existing OMT, at least not immediately. That is well-tested, working code used everywhere, and there is no need to risk issues in other areas while optimizing this one.

THIS DOCUMENT IS NOT YET COMPLETE!!!


Remaining notes to work through:

  • rebalancing nodes on checkpoint clone: is this an issue? array mode vs. non-array mode
  • not sure I want to get rid of the old OMT; lots of stuff uses it, and I don't want to risk changing that
  • think about the order of doing the work, e.g. first fix the mempools