SSTable means Sorted String Table, which is a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high-throughput, sequential read/write workloads.
With a sorted string table, random reads and writes are not an option; instead, we want to stream the data in and, once we are done, flush it back to disk as a single streaming operation. This way, we can amortize the disk I/O costs. Nothing revolutionary, moving right along.
A "Sorted String Table" then is exactly what it sounds like, it is a file which contains a set of arbitrary, sorted key-value pairs inside. Duplicate keys are fine, there is no need for "padding" for keys or values, and keys and values are arbitrary blobs. Read in the entire file sequentially and you have a sorted index. Optionally, if the file is very large, we can also prepend, or create a standalone key:offset index for fast access. That's all an SSTable is: very simple, but also a very useful way to exchange large, sorted data segments.
How do we achieve fast reads and writes?
For reads: once an SSTable is on disk, it is effectively immutable and becomes a static index.
What about writes, i.e. updates and deletes?
Writes are always fast regardless of the size of the dataset (append-only), and random reads are either served from memory or require a quick disk seek.
Once the SSTable is on disk, it is immutable, hence updates and deletes can't touch the data. Instead, a more recent value is simply stored in the MemTable in the case of an update, and a "tombstone" record is appended for deletes. Because we check the indexes in sequence, future reads will find the updated value or the tombstone record without ever reaching the older values! Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we run a process to merge the on-disk SSTables, at which time the update and delete records overwrite and remove the older data.
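A minimal sketch of that read path, with std::map standing in for the MemTable and each SSTable, and a magic value standing in for a real tombstone flag (all names here are made up for illustration):

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// A real system would use a flag bit per record, not a magic value.
const std::string kTombstone = "__tombstone__";

using Table = std::map<std::string, std::string>;  // stand-in for one table

// Check the MemTable first, then SSTables from newest to oldest; the first
// hit wins, so newer values and tombstones shadow older values.
std::optional<std::string> Get(const Table& memtable,
                               const std::vector<Table>& sstables,  // newest first
                               const std::string& key) {
  auto it = memtable.find(key);
  if (it != memtable.end())
    return it->second == kTombstone ? std::nullopt
                                    : std::optional<std::string>(it->second);
  for (const Table& t : sstables) {
    auto jt = t.find(key);
    if (jt != t.end())
      return jt->second == kTombstone ? std::nullopt
                                      : std::optional<std::string>(jt->second);
  }
  return std::nullopt;  // not found in any table
}
```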
(SSTable file layout diagram from the [internet](https://www.cnblogs.com/cobbliu/p/6194072.html))
meta block
- encoding
- size
- value of the max key
- value of the min key
data block
- record groups; each record is a <key, value> pair
- keys are prefix-compressed

data index block
- supports binary search for a key
- the index key might be generated by the system
  - e.g. if the last key of block1 is "abcdefg" and the first key of block2 is "abefg", then a key such as "abd" could distinguish those two blocks
Why are there restart points (offsets)?
e.g. records <a,test>, <ab,test2>, <c,test3> in the same record group:
<key, value> | shared key length | non-shared key length | value length | non-shared key | value
---|---|---|---|---|---
<a, test> | 0 | 1 | 4 | a | test
<ab, test2> | 1 | 1 | 5 | b | test2
<c, test3> | 0 | 1 | 5 | c | test3
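Decoding makes the answer clear: each record only stores the delta from the previous key, so keys can only be reconstructed sequentially. A simplified sketch (records already split into fields, unlike leveldb's raw varint format):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Record {
  std::uint32_t shared, non_shared;
  std::string delta, value;
};

int main() {
  // The three records from the table above.
  std::vector<Record> records = {
      {0, 1, "a", "test"}, {1, 1, "b", "test2"}, {0, 1, "c", "test3"}};

  // Reconstructing key[i] needs key[i-1]: "ab" = "a".substr(0, 1) + "b".
  // Binary search therefore cannot land on an arbitrary record; it can only
  // start at a restart point, where shared == 0 and the full key is stored.
  std::string prev_key;
  for (const Record& r : records) {
    std::string key = prev_key.substr(0, r.shared) + r.delta;
    std::cout << key << " -> " << r.value << "\n";
    prev_key = key;
  }
}
```

Restart points trade a little space (a full key every N records, 16 by default in leveldb) for the ability to jump into the middle of a block.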
The structure of every block is the same: a list of <key, value> records. What differs is what the key and value mean in each block type:

 | data block | data index block | meta block
---|---|---|---
key | actual key | divider key: last key of datablock[i] ≤ key < first key of datablock[i+1] | meta block name
value | actual value | block handle of datablock[i] | block handle of the meta block
How a table is built, in TableBuilder::Add (abridged):

```cpp
void TableBuilder::Add(const Slice& key, const Slice& value) {
  Rep* r = rep_;
  // ... (abridged) if the previous data block was just flushed, add its
  // index entry: a shortened separator key -> the encoded block handle
  r->index_block.Add(r->last_key, Slice(handle_encoding));
  r->filter_block->AddKey(key);   // feed the key to the filter (meta) block
  r->data_block.Add(key, value);  // append the record to the current data block
  const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
  if (estimated_block_size >= r->options.block_size) {  // [perry] default is 4KB
    Flush();  // cut the block, remember its handle for the next index entry
  }
}
```
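For context, a hedged sketch of how this builder is driven (leveldb's public API; Status checks are elided and the file name is arbitrary):

```cpp
#include "leveldb/env.h"
#include "leveldb/options.h"
#include "leveldb/table_builder.h"

// Stream sorted keys into a TableBuilder; each Add() may cut a full data
// block (Flush) and emit an index entry for the block just finished.
void BuildTable() {
  leveldb::Options options;
  leveldb::WritableFile* file = nullptr;
  leveldb::Env::Default()->NewWritableFile("/tmp/demo.ldb", &file);

  leveldb::TableBuilder builder(options, file);
  builder.Add("apple", "v1");   // keys must be added in sorted order
  builder.Add("banana", "v2");
  builder.Finish();             // writes index, meta, and footer sections

  file->Close();
  delete file;
}
```

Finish() is also where the filter data accumulated by filter_block->AddKey() gets written out as the meta block.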
[Perry] Here is the code for how a block record is written (inside BlockBuilder::Add):

```cpp
// Add "<shared><non_shared><value_size>" to buffer_
PutVarint32(&buffer_, shared);
PutVarint32(&buffer_, non_shared);
PutVarint32(&buffer_, value.size());
// Add string delta to buffer_ followed by value
buffer_.append(key.data() + shared, non_shared);
buffer_.append(value.data(), value.size());
```
The general format of the IndexBlock is similar to that of the DataBlock; the difference is the content: a data block records <key, value> pairs, while an index block records an index over the keys.

The IndexBlock is used for binary search. Keys inside each block are sorted. Say the last key of the previous block is "helloleveldb" and the first key of the current block is "helloword"; leveldb will calculate a divider between the two blocks, say "hellom", which captures the idea that any key smaller than "hellom" falls in the previous block and any key larger than "hellom" falls in the current block. Thus the information inside an index record is:

- key = divider key
- value = block handle (offset, size) of the previous data block
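A simplified sketch of how such a divider could be computed; the real logic is the comparator's FindShortestSeparator() in leveldb, and this approximation only handles the common case:

```cpp
#include <algorithm>
#include <string>

// Find a short key sep with start <= sep < limit, so sep can serve as the
// index entry dividing the block that ends at `start` from the block that
// begins at `limit`. Falls back to `start`, which is always a valid divider.
std::string ShortSeparator(const std::string& start, const std::string& limit) {
  size_t min_len = std::min(start.size(), limit.size());
  size_t diff = 0;
  while (diff < min_len && start[diff] == limit[diff]) diff++;  // common prefix
  if (diff < min_len &&
      static_cast<unsigned char>(start[diff]) + 1 <
          static_cast<unsigned char>(limit[diff])) {
    std::string sep = start.substr(0, diff + 1);
    sep[diff]++;  // bump the first differing byte: start <= sep < limit
    return sep;
  }
  return start;
}

// e.g. ShortSeparator("helloleveldb", "helloword") == "hellom"
```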
Queries inside a block are handled by Block::Iter, in its Seek() function:

- the restart array records the offset of every restart point in the block, and the full keys stored at those points are in sorted order
- during a seek, it binary-searches the restart points against the query key to find the interval the key belongs to, calls SeekToRestartPoint() to jump there, and then keeps calling ParseNextKey() until it finds an entry whose key is not smaller than the query key (sketched below)
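A hedged sketch of that two-phase seek, using plain parsed entries instead of leveldb's raw block format (the names are made up; only the search structure mirrors Block::Iter::Seek):

```cpp
#include <string>
#include <vector>

struct Entry {
  std::string key, value;
};

// restarts[i] is the index of an entry that stores its full key; assumes
// restarts is non-empty (every block has at least one restart point).
// Phase 1: binary-search restart points for the last one whose key < target.
// Phase 2: scan forward (leveldb's ParseNextKey) until key >= target.
int Seek(const std::vector<Entry>& entries, const std::vector<int>& restarts,
         const std::string& target) {
  int left = 0, right = static_cast<int>(restarts.size()) - 1;
  while (left < right) {
    int mid = (left + right + 1) / 2;
    if (entries[restarts[mid]].key < target)
      left = mid;       // this restart key is still < target; go right
    else
      right = mid - 1;  // restart key >= target: answer lies to the left
  }
  // Linear scan from the chosen restart point.
  for (int i = restarts[left]; i < static_cast<int>(entries.size()); i++)
    if (entries[i].key >= target) return i;
  return -1;  // every entry is smaller than the query key
}
```

The linear scan stays short because restart points occur every block_restart_interval records (16 by default), so a seek costs O(log R) full-key comparisons plus at most a handful of sequential decodes.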