Improving in memory query performance Design
The purpose of this document is to explain the design and implementation for Improving-in-memory-performance.
See Improving-in-memory-performance#current-behavior
There was some technical debt that prevented us from easily designing and implementing this project. As of Dec 10, 2013, the fixes for this technical debt are on master in Tokutek/ft-index.
Prior to this effort, basement nodes weren't really a data structure (they were implemented by a set of macros) and leafentries included keys.
Single leafentries (or groups) were accessed or modified in many different places.
We created a class `bn_data` to contain all leafentries and keys in a basement node:
- A `mempool` to hold the actual leafentries
- An `OMT` to store pointers into the `mempool` in sorted order

`bn_data` hides most of the complexity of basement nodes (see `ft/bndata.h` and `ft/bndata.cc`).
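To make the shape concrete, here is a minimal, hypothetical sketch of such a container (simplified; the real interface is in `ft/bndata.h`):

```cpp
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <vector>

// Hypothetical, simplified sketch of a basement-node container: a memory
// pool holds the raw leafentry bytes, and a sorted index (standing in for
// the OMT) holds pointers into that pool.
class bn_data_sketch {
public:
    bn_data_sketch() : pool_(nullptr), used_(0), capacity_(0) {}
    ~bn_data_sketch() { free(pool_); }

    // Copy a leafentry's bytes into the mempool and remember where it lives.
    // A real implementation would insert into the OMT at the key's sorted position.
    void append_leafentry(const void *le, uint32_t le_size) {
        reserve(used_ + le_size);
        memcpy(pool_ + used_, le, le_size);
        entries_.push_back(pool_ + used_);   // pointer into the mempool
        used_ += le_size;
    }

    uint32_t num_entries() const { return (uint32_t)entries_.size(); }

private:
    void reserve(size_t needed) {
        if (needed <= capacity_) return;
        size_t new_cap = capacity_ ? capacity_ * 2 : 64;
        while (new_cap < needed) new_cap *= 2;
        char *new_pool = (char *)malloc(new_cap);
        if (used_) memcpy(new_pool, pool_, used_);
        // Pointers into the old pool must be fixed up after the move; this
        // per-entry fixup is exactly what the offset scheme described later avoids.
        for (size_t i = 0; i < entries_.size(); i++)
            entries_[i] = new_pool + (entries_[i] - pool_);
        free(pool_);
        pool_ = new_pool;
        capacity_ = new_cap;
    }

    char *pool_;                   // mempool holding the leafentry bytes
    size_t used_, capacity_;
    std::vector<char *> entries_;  // sorted pointers into the mempool (OMT stand-in)
};
```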
For simplicity, `LEAFENTRY` refers to the old leafentry that included a key, and `leafentry` refers to the new leafentry that does not include a key.
The next step in our refactoring was to create the concept of `KL_PAIR`s (key/leafentry pairs).
All methods for accessing `LEAFENTRY`s were refactored into one of three cases (see the sketch after this list):
- Those that needed access to just the key
- Those that needed access to just the leafentry
- Those that needed both
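The three shapes could be illustrated with declarations like these (hypothetical names and signatures, not the actual bn_data interface):

```cpp
#include <stdint.h>

// Hypothetical declarations only, illustrating the three access patterns.
class bn_data_accessors_sketch {
public:
    // 1. Callers that need just the key.
    void get_key(uint32_t idx, uint32_t *keylen, const void **key) const;

    // 2. Callers that need just the leafentry.
    void get_leafentry(uint32_t idx, uint32_t *le_size, const void **le) const;

    // 3. Callers that need both.
    void get_key_and_leafentry(uint32_t idx,
                               uint32_t *keylen, const void **key,
                               uint32_t *le_size, const void **le) const;
};
```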
We wrote the basement node class such that its API treated keys and leafentries separately. We didn't yet have a dmt, so we made an intermediate structure: in memory, the key and leafentry of a pair were stored next to each other.
[4 bytes key_size] [key_size bytes] [rest of leafentry]
For example, a leafentry with a key of "foo":
[4 byte integer: 4] [f o o \0] [rest of leafentry]
"klpair": keylen followed by key bytes followed by the leafentry. To do that, during deserialization we split up the KEY/Leafentry pair into key and leafentry, and then wrote them into the mempool next to each other.
We stored key leafentry pairs contiguously (4 bytes for the keylen, the key itself, then the leafentry).
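A small sketch of writing one such contiguous pair (the helper and example below are illustrative, not actual ft-index code):

```cpp
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Write one key/leafentry pair into a buffer:
// [4-byte keylen][keylen bytes of key][leafentry bytes]
// Returns the number of bytes written.
static size_t write_klpair(char *dst, const void *key, uint32_t keylen,
                           const void *le, uint32_t le_size) {
    memcpy(dst, &keylen, sizeof keylen);                  // 4-byte key length
    memcpy(dst + sizeof keylen, key, keylen);             // the key itself
    memcpy(dst + sizeof keylen + keylen, le, le_size);    // rest of the leafentry
    return sizeof keylen + keylen + le_size;
}

int main(void) {
    const char key[] = "foo";          // 4 bytes including the trailing '\0', as in the example above
    const char le[] = "leafentry";     // stand-in bytes for the rest of the leafentry
    char buf[64];
    size_t n = write_klpair(buf, key, sizeof key, le, sizeof le);
    printf("wrote %zu bytes: [keylen=4][f o o \\0][rest of leafentry]\n", n);
    return 0;
}
```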
`bn_data` used an `OMT` to hold pointers into the mempool. The `OMT` has since turned into a wrapper around the templated omt (`omt<void*>`). We replaced the `OMT` in `bn_data` with a templated omt (`omt<kl_pair_struct*>`). This allowed the compiler to enforce additional type safety (instead of remembering to cast `void*`s back and forth) and also improved performance by a small amount (~5% on ydb-level tests).
See Improving-in-memory-performance#inefficiencies and Improving-in-memory-performance#proposal. This section details the progress of the development branch (see Tokutek/ft-index#46) as of Dec 10, 2013.
We replaced 8-byte pointers with 4-byte offsets into the mempool. This saves 4 bytes per key/leafentry pair in a basement node, and also opens up future optimizations (see remaining/future work at the end). As long as there isn't wasted space in a mempool (e.g. in the serial insertion case), increasing the size of a mempool can be a simple realloc instead of a bunch of memcpys followed by fixing up pointers in the omt.
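A minimal sketch of why offsets make growing the pool cheap when there is no wasted space (hypothetical code, not the actual mempool implementation):

```cpp
#include <stdint.h>
#include <stdlib.h>
#include <vector>

// With 4-byte offsets instead of pointers, growing the mempool is a single
// realloc: the stored offsets stay valid even if the pool's address changes.
struct offset_pool_sketch {
    char *pool = nullptr;
    size_t used = 0, capacity = 0;
    std::vector<uint32_t> offsets;   // stand-in for the omt of offsets

    ~offset_pool_sketch() { free(pool); }

    void grow_to(size_t new_capacity) {
        // No per-entry memcpy and no fixup pass over the omt, unlike the
        // pointer-based version, which must rewrite every stored pointer.
        char *p = (char *)realloc(pool, new_capacity);
        if (p != nullptr) {
            pool = p;
            capacity = new_capacity;
        }
    }

    char *entry(size_t i) { return pool + offsets[i]; }
};
```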
We inlined the keys (4 bytes for the keylen plus the key itself) into the omt. So the omt now needed to store the 4-byte keylen, the key itself, and a 4-byte offset into the mempool. Keys are not necessarily constant length, so we needed a new data structure.
We created a new data structure, very similar to the omt, that supports dynamically sized elements.
The `dmt` is a weight-balanced search tree with dynamically sized values.
Like the `omt`, the `dmt` has an array form for faster lookups. The `dmt` only supports the array format when all its values are the same size (for example, when all keys in a fractal tree are the same length).
In the array format we store neither tree nodes nor per-value lengths (we store the value length only once) to save memory. Similar to the omt, when a delete or insert happens in the middle of the array, we convert to the dynamic tree form.
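For illustration only, the two representations might be laid out roughly as follows; these structs are hypothetical and only meant to show where the array form's memory savings come from:

```cpp
#include <stdint.h>

// Array form: the value length is stored once and the values are packed
// back to back, with no per-value tree nodes. Usable only while every
// value is the same size.
struct dmt_array_form_sketch {
    uint32_t value_length;   // shared by all values
    uint32_t num_values;
    // followed by num_values * value_length bytes of packed values
};

// Tree form: each node carries weight-balanced-tree metadata plus its own
// value length, so dynamically sized values work.
struct dmt_tree_node_sketch {
    uint32_t left, right;    // child node indices
    uint32_t weight;         // subtree size, used for weight balancing
    uint32_t value_length;   // this value's size
    // followed by value_length bytes for the value
};
```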
The algorithm for the dmt is conceptually virtually identical to the omt's, with several exceptions:
- The dmt does not try to convert a dynamic tree back to array format.
  - This conversion can still happen if a node is serialized to disk and read back (and everything is constant sized).
  - Note: The dmt will remain in tree format in the case of serial insertions.
- Where the omt sometimes moves values between nodes, the dmt instead always moves the nodes' metadata.
  - This ensures that when switching nodes we do not have to allocate additional memory.
  - Note that this is not technically necessary when all keys are constant length, but keys can get much larger than a node's metadata.
The API for the dmt is mostly like the omt's, except:
- Inserts use functors (sketched below)
- Optimized creation of dmts has a different API (it may use a builder object)
The dmt is currently only used in basement nodes.
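A sketch of what a functor-driven insert could look like: the caller supplies an object that knows the value's size and how to write its bytes, so a dmt-like container can place a dynamically sized value directly (names and layout are illustrative, not the real dmt API):

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <vector>

// Hypothetical writer functor: reports how many bytes the value needs and
// writes those bytes into space the container sets aside for it.
struct klpair_writer_sketch {
    const char *key;
    uint32_t keylen;
    uint32_t le_offset;     // 4-byte offset of the leafentry in the mempool

    size_t get_size() const { return sizeof keylen + keylen + sizeof le_offset; }
    void write_to(void *dst) const {
        char *p = (char *)dst;
        memcpy(p, &keylen, sizeof keylen);
        memcpy(p + sizeof keylen, key, keylen);
        memcpy(p + sizeof keylen + keylen, &le_offset, sizeof le_offset);
    }
};

// A dmt-like container accepts any type with get_size()/write_to(), so it
// can set aside exactly the right amount of space before the value exists.
template <typename Writer>
void insert_value_sketch(std::vector<char> &storage, size_t byte_pos, const Writer &w) {
    std::vector<char> value(w.get_size());
    w.write_to(value.data());
    storage.insert(storage.begin() + (ptrdiff_t)byte_pos, value.begin(), value.end());
}
```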
We introduce on-disk format 25, which changes the on-disk format for partitions (basement nodes) only.
(Old) format 24 was:
- 1 byte magic number
- 4 byte `num_entries`
- All `LEAFENTRY`s in sorted order
Format 25 is:
- 1 byte magic number
- 4 byte `num_entries`
- A new basement-node header (a serialization sketch follows this list):
  - 4 byte `key_data_size` (total memory for all keys combined, including an extra 4 bytes for each key)
    - Note: The extra 4 bytes is used either for the keylen (variable-length keys), or for the offset of that key's leafentry (fixed-length keys)
  - 4 byte `val_data_size` (total memory for all `leafentry`s)
  - 4 byte `fixed_key_length` (0 if keys are variable length)
  - 1 byte `all_keys_same_length` (0 or 1)
  - 1 byte `keys_vals_separate` (0 or 1)
    - Note: Currently `keys_vals_separate` must equal `all_keys_same_length`. In the future we may change the on-disk format for variable-sized keys.
- If `keys_vals_separate == 0` (which means variable-sized keys):
  - All `LEAFENTRY`s in sorted order (this part is exactly the same as version 24)
- Else (fixed-sized keys):
  - For each key:
    - 4 byte `le_offset` (offset of the leafentry in the mempool)
    - `fixed_key_length` bytes: the actual key
  - The leafentry mempool (equivalently, all `leafentry`s in sorted order)
    - Note: Before serialization the mempool is compacted (converted to sorted order with no fragmentation)
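As a sketch, the new header fields might be serialized in the order listed above (the struct and function are illustrative; the real code writes fields into ft-index's serialization buffer):

```cpp
#include <stdint.h>
#include <string.h>

// Illustrative grouping of the version-25 basement-node header fields.
struct bn_header_v25_sketch {
    uint32_t key_data_size;        // all keys, plus an extra 4 bytes per key
    uint32_t val_data_size;        // all leafentries
    uint32_t fixed_key_length;     // 0 if keys are variable length
    uint8_t  all_keys_same_length; // 0 or 1
    uint8_t  keys_vals_separate;   // 0 or 1 (currently must equal the flag above)
};

// Serialize the header field by field (14 bytes), following the magic byte
// and num_entries; writing field by field avoids struct padding on disk.
static size_t serialize_bn_header_v25(char *dst, const bn_header_v25_sketch &h) {
    size_t off = 0;
    memcpy(dst + off, &h.key_data_size, 4);    off += 4;
    memcpy(dst + off, &h.val_data_size, 4);    off += 4;
    memcpy(dst + off, &h.fixed_key_length, 4); off += 4;
    dst[off++] = (char)h.all_keys_same_length;
    dst[off++] = (char)h.keys_vals_separate;
    return off;
}
```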
Deserialization for variable-sized keys and upgrading from version 24 are nearly identical. If the version is 24 we don't expect the additional header, and we over-allocate memory (and then compact after loading). If the version is 25 we do the same, but we don't need to over-allocate memory. Thus the upgrade is almost trivial.
Deserialization for fixed-size keys is nearly trivial. The memory for the keys is already in the dmt array format (unless the key length is not a multiple of 4, in which case we allocate extra memory and align each key). Reading the leafentries is a memcpy.
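A sketch of that fixed-key path, assuming the on-disk key block already matches the dmt's array layout; the padding scheme shown for key lengths that aren't a multiple of 4 is an assumption, not the exact ft-index behavior:

```cpp
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical fixed-key deserialization: each on-disk entry is
// [4-byte le_offset][fixed_key_length bytes of key], and the leafentry
// mempool is read with a single memcpy.
static void deserialize_fixed_keys_sketch(const char *disk_keys, uint32_t num_keys,
                                          uint32_t fixed_key_length,
                                          const char *disk_les, uint32_t val_data_size,
                                          char **out_keys, char **out_les) {
    uint32_t stride = 4 + fixed_key_length;   // le_offset + key
    if (fixed_key_length % 4 == 0) {
        // Already usable as the dmt array form: one copy of the whole block.
        *out_keys = (char *)malloc((size_t)num_keys * stride);
        memcpy(*out_keys, disk_keys, (size_t)num_keys * stride);
    } else {
        // Allocate extra space and copy each entry into a 4-byte-aligned slot.
        uint32_t padded = 4 + ((fixed_key_length + 3) / 4) * 4;
        *out_keys = (char *)malloc((size_t)num_keys * padded);
        for (uint32_t i = 0; i < num_keys; i++)
            memcpy(*out_keys + (size_t)i * padded, disk_keys + (size_t)i * stride, stride);
    }
    // Reading the leafentries is a memcpy of the mempool.
    *out_les = (char *)malloc(val_data_size);
    memcpy(*out_les, disk_les, val_data_size);
}
```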
- DONE: Serial insertion optimization for basement nodes.
  - When there is no wasted space in a mempool, we can do a simple realloc to move leafentries from a small mempool to a larger one (instead of looking at each leafentry and copying them one at a time).
- Code review
  - dmt
    - DONE: (Required) Additional tests
    - (Optional) Fixing comments (some still refer to the omt)
    - (Optional) Adding comments (describe in dmt.h what a dmt is)
- (Optional) Remove technical debt by combining the omt and dmt into a single data structure.
  - Possibly make the templated omt be a wrapper around the templated dmt.
  - (Related, but can be done in the future) Tokutek/ft-index#108
    - The `toku_memory_footprint` calculation is wrong. It can be too large in common cases (and too small in pathological cases).
  - Modify memory calculations for the omt and dmt to properly use `toku_memory_footprint`. (This work is wasted effort until/unless 108 is done.)
    - Non-trivial due to using extra space in the mempool for a temporary array.
- (Optional, probably not worth it) New disk format for variable-sized keys.
- (Future performance) Further reduce cache misses:
  - Change the dmt to have a third format (or possibly replace the array format).
    - Use a van Emde Boas-like layout to further reduce cache misses during searches.
    - This requires some additional research/design (probably with Michael).
    - Risks:
      - May end up costing more in computation (to access the layout) for point queries than we save in cache misses, and may not give any enhanced performance.
      - (Low) risk of slightly lower performance for range queries even if point queries are faster.
      - Risk of slower (de)serialization.
    - The potential performance improvement is higher for smaller keys.
- Performance:
  - The dmt has two forms: array and dynamic tree.
    - We can add a third form: fixed-key tree (inlined keys, all the same size). It would automatically turn into a dynamic tree if necessary.
      - This might improve performance slightly and reduce memory usage under certain circumstances.
      - It does not help at all for read-only workloads.
  - DONE: An intermediate third form (LOW EFFORT COMPARED TO OTHERS):
    - Keep the dynamic tree, but just keep a boolean that says "so far, everything is the same size".
    - This allows for the nice serialization: deletes never remove this bit, and inserts CAN if they are a different size.
  - We can try to improve (de)serialization time for fixed-length keys that are not a multiple of 4 bytes.
  - We can try to improve serialization time (maybe not re-sort & compact in all situations).
    - e.g. skip compaction if there is no wasted space in the mempool