SSTable means Sorted String Table, which is a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high-throughput, sequential read/write workloads.
With a sorted string table, random reads and writes are not an option; instead, we want to stream the data in and, once we are done, flush it back to disk as a single streaming operation. This way, we can amortize the disk I/O costs. Nothing revolutionary, moving right along.
A "Sorted String Table" then is exactly what it sounds like, it is a file which contains a set of arbitrary, sorted key-value pairs inside. Duplicate keys are fine, there is no need for "padding" for keys or values, and keys and values are arbitrary blobs. Read in the entire file sequentially and you have a sorted index. Optionally, if the file is very large, we can also prepend, or create a standalone key:offset index for fast access. That's all an SSTable is: very simple, but also a very useful way to exchange large, sorted data segments.
How do we achieve fast reads and writes?
For reads: once an SSTable is on disk, it is effectively immutable and becomes a static index.
What about writes, i.e. updates and deletes?
Writes are always fast regardless of the size of the dataset (append-only), and random reads are either served from memory or require a quick disk seek.
Once the SSTable is on disk, it is immutable, hence updates and deletes can't touch the data. Instead, a more recent value is simply stored in the MemTable in the case of an update, and a "tombstone" record is appended for deletes. Because we check the indexes in sequence, future reads will find the updated value or the tombstone record without ever reaching the older values! Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we run a process to merge the on-disk SSTables, at which time the update and delete records overwrite and remove the older data.
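A minimal sketch of that read path, with std::map standing in for the MemTable and each SSTable, and a magic value standing in for a real tombstone flag (all names here are made up for illustration):

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// A real system would use a flag bit per record, not a magic value.
const std::string kTombstone = "__tombstone__";

using Table = std::map<std::string, std::string>;  // stand-in for one table

// Check the MemTable first, then SSTables from newest to oldest; the first
// hit wins, so newer values and tombstones shadow older values.
std::optional<std::string> Get(const Table& memtable,
                               const std::vector<Table>& sstables,  // newest first
                               const std::string& key) {
  auto it = memtable.find(key);
  if (it != memtable.end())
    return it->second == kTombstone ? std::nullopt
                                    : std::optional<std::string>(it->second);
  for (const Table& t : sstables) {
    auto jt = t.find(key);
    if (jt != t.end())
      return jt->second == kTombstone ? std::nullopt
                                      : std::optional<std::string>(jt->second);
  }
  return std::nullopt;  // not found in any table
}
```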
(SSTable file layout diagram from the [internet](https://www.cnblogs.com/cobbliu/p/6194072.html))
meta block
- encoding
- size
- value of the max key
- value of the min key
data block
- record groups; each record is a <key, value> pair
- keys are prefix-compressed

data index block
- supports binary search for a key
- the index key might be generated by the system
  - e.g. if the last key of block1 is "abcdefg" and the first key of block2 is "abefg", then a key such as "abd" could distinguish those two blocks
Why are there restart points (offsets)?
e.g. records <a,test>, <ab,test2>, <c,test3> in the same record group:
<key, value> | shared key length | non-shared key length | value length | non-shared key | value
---|---|---|---|---|---
<a, test> | 0 | 1 | 4 | a | test
<ab, test2> | 1 | 1 | 5 | b | test2
<c, test3> | 0 | 1 | 5 | c | test3
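Decoding makes the answer clear: each record only stores the delta from the previous key, so keys can only be reconstructed sequentially. A simplified sketch (records already split into fields, unlike leveldb's raw varint format):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Record {
  std::uint32_t shared, non_shared;
  std::string delta, value;
};

int main() {
  // The three records from the table above.
  std::vector<Record> records = {
      {0, 1, "a", "test"}, {1, 1, "b", "test2"}, {0, 1, "c", "test3"}};

  // Reconstructing key[i] needs key[i-1]: "ab" = "a".substr(0, 1) + "b".
  // Binary search therefore cannot land on an arbitrary record; it can only
  // start at a restart point, where shared == 0 and the full key is stored.
  std::string prev_key;
  for (const Record& r : records) {
    std::string key = prev_key.substr(0, r.shared) + r.delta;
    std::cout << key << " -> " << r.value << "\n";
    prev_key = key;
  }
}
```

Restart points trade a little space (a full key every N records, 16 by default in leveldb) for the ability to jump into the middle of a block.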
The structure of every block is the same: a list of <key, value> records. What differs is what the key and value mean in each block type:

 | data block | data index block | meta block
---|---|---|---
key | actual key | divider key: last key of datablock[i] ≤ key < first key of datablock[i+1] | meta block name
value | actual value | block handle of datablock[i] | block handle of the meta block
How a table is built, in TableBuilder::Add (abridged):

```cpp
void TableBuilder::Add(const Slice& key, const Slice& value) {
  Rep* r = rep_;
  // ... (abridged) if the previous data block was just flushed, add its
  // index entry: a shortened separator key -> the encoded block handle
  r->index_block.Add(r->last_key, Slice(handle_encoding));
  r->filter_block->AddKey(key);   // feed the key to the filter (meta) block
  r->data_block.Add(key, value);  // append the record to the current data block
  const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
  if (estimated_block_size >= r->options.block_size) {  // [perry] default is 4KB
    Flush();  // cut the block, remember its handle for the next index entry
  }
}
```
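For context, a hedged sketch of how this builder is driven (leveldb's public API; Status checks are elided and the file name is arbitrary):

```cpp
#include "leveldb/env.h"
#include "leveldb/options.h"
#include "leveldb/table_builder.h"

// Stream sorted keys into a TableBuilder; each Add() may cut a full data
// block (Flush) and emit an index entry for the block just finished.
void BuildTable() {
  leveldb::Options options;
  leveldb::WritableFile* file = nullptr;
  leveldb::Env::Default()->NewWritableFile("/tmp/demo.ldb", &file);

  leveldb::TableBuilder builder(options, file);
  builder.Add("apple", "v1");   // keys must be added in sorted order
  builder.Add("banana", "v2");
  builder.Finish();             // writes index, meta, and footer sections

  file->Close();
  delete file;
}
```

Finish() is also where the filter data accumulated by filter_block->AddKey() gets written out as the meta block.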
[Perry] Here is the code for how a block record is written (inside BlockBuilder::Add):

```cpp
// Add "<shared><non_shared><value_size>" to buffer_
PutVarint32(&buffer_, shared);
PutVarint32(&buffer_, non_shared);
PutVarint32(&buffer_, value.size());
// Add string delta to buffer_ followed by value
buffer_.append(key.data() + shared, non_shared);
buffer_.append(value.data(), value.size());
```
The general format of the IndexBlock is similar to that of the DataBlock; the difference is the content: a data block records <key, value> pairs, while an index block records an index over the keys.

The IndexBlock is used for binary search. Keys inside each block are sorted. Say the last key of the previous block is "helloleveldb" and the first key of the current block is "helloword"; leveldb will calculate a divider between the two blocks, say "hellom", which captures the idea that any key smaller than "hellom" falls in the previous block and any key larger than "hellom" falls in the current block. Thus the information inside an index record is:

- key = divider key
- value = block handle (offset, size) of the previous data block
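A simplified sketch of how such a divider could be computed; the real logic is the comparator's FindShortestSeparator() in leveldb, and this approximation only handles the common case:

```cpp
#include <algorithm>
#include <string>

// Find a short key sep with start <= sep < limit, so sep can serve as the
// index entry dividing the block that ends at `start` from the block that
// begins at `limit`. Falls back to `start`, which is always a valid divider.
std::string ShortSeparator(const std::string& start, const std::string& limit) {
  size_t min_len = std::min(start.size(), limit.size());
  size_t diff = 0;
  while (diff < min_len && start[diff] == limit[diff]) diff++;  // common prefix
  if (diff < min_len &&
      static_cast<unsigned char>(start[diff]) + 1 <
          static_cast<unsigned char>(limit[diff])) {
    std::string sep = start.substr(0, diff + 1);
    sep[diff]++;  // bump the first differing byte: start <= sep < limit
    return sep;
  }
  return start;
}

// e.g. ShortSeparator("helloleveldb", "helloword") == "hellom"
```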
Queries inside a block are handled by Block::Iter, in its Seek() function:

- the restart array records the offset of every restart point in the block, and the full keys stored at those points are in sorted order
- during a seek, it binary-searches the restart points against the query key to find the interval the key belongs to, calls SeekToRestartPoint() to jump there, and then keeps calling ParseNextKey() until it finds an entry whose key is not smaller than the query key (sketched below)
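A hedged sketch of that two-phase seek, using plain parsed entries instead of leveldb's raw block format (the names are made up; only the search structure mirrors Block::Iter::Seek):

```cpp
#include <string>
#include <vector>

struct Entry {
  std::string key, value;
};

// restarts[i] is the index of an entry that stores its full key; assumes
// restarts is non-empty (every block has at least one restart point).
// Phase 1: binary-search restart points for the last one whose key < target.
// Phase 2: scan forward (leveldb's ParseNextKey) until key >= target.
int Seek(const std::vector<Entry>& entries, const std::vector<int>& restarts,
         const std::string& target) {
  int left = 0, right = static_cast<int>(restarts.size()) - 1;
  while (left < right) {
    int mid = (left + right + 1) / 2;
    if (entries[restarts[mid]].key < target)
      left = mid;       // this restart key is still < target; go right
    else
      right = mid - 1;  // restart key >= target: answer lies to the left
  }
  // Linear scan from the chosen restart point.
  for (int i = restarts[left]; i < static_cast<int>(entries.size()); i++)
    if (entries[i].key >= target) return i;
  return -1;  // every entry is smaller than the query key
}
```

The linear scan stays short because restart points occur every block_restart_interval records (16 by default), so a seek costs O(log R) full-key comparisons plus at most a handful of sequential decodes.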