Chapter 6: B-Tree Variants

Common Themes for B-Trees

Tree Structure
Balancing
- Splits
- Merges
Lookup
Delete
On Disk Representation

Chapter Mindmap

B-Tree Variants

Copy-on-write B-Trees - Nodes are immutable. For updates, copies are made, updated and written to new locations
Lazy B-Tree - Reduces I/O requests for same node by buffering updates to nodes
FD-Trees - buffers updates in a small B-Tree, Full Tree is written into a immutable run
Bw-Trees - separate B-Tree nodes into smaller parts written in an append-only manner, reduces cost of small writes by batching
Cache-oblivious B-tree - treat on-disk data structures very similar to in-memory ones

Copy-on-Write B-Tree

Page is copied, when the content is to be modified. Copied page is modified and parallel tree hierarchy is created
Once Tree is created, top most page is atomically updated
Reads can happen in parallel, while the content is modified
Simple approach, doesn't require a Locks
Requires more space

LMDB

Lightning Memory-Mapped Database (LMDB) implements Copy-on-write approach

LMDB single level data store, directly from memory-map
Doesn't have DB level caching

Update process

Identify the target leaf Node to be updated
Copy all the Nodes in the path
Update the target leaf and propagate changes upwards

LMDB keeps 2 version of root node - latest version and one with updates in progress. All reads go through latest version Once update is completed, latest version is replaced with new version

Lazy B-Tree

Motivation - Reduce the cost of updating B-Tree

Adopts lightweight, concurrency and update friendly approach by buffering and delaying updates

Wired Tiger

MongoDB Storage engine
Uses different format for in-memory and on-disk pages
Updates are saved to in-memory buffer
Page is reconciled with buffer before writing

Update Buffers are accessed during reads Update buffer us merged with page content upon flush Update Buffer are implemented as SkipLists

Lazy-Adaptive Tree

Optimized version to keep buffers at subtree level, instead of Node level

Update Process

Add update to Root node buffer
Buffer becomes full or reach a threshold storage
Propagate changes to lower level
If lower level fill up, continue till we reach leaf level
Batch operations are performed at Leaf level, including Tree Structure changes

Schematic representation, when root buffer is not full

FD Tree

Small mutable BTree and multiple immutable sorted runs.

Buffers updates in B-Tree
Once Tree is full, it is written to a sorted immutable file
The immutable files/runs are merged, if they reach a certain Threshold
Pointers between levels are maintained by Fractional Cascading

Fractional Cascading

What is the problem we are trying to solve?

Since FD Tree write sorted immutable runs/files, how do we located entries in the Sorted Arrays?

FD Trees use Fractional Cascading to quickly find the element. The idea is to build bridges between neighbor levels. Bridges are build by pulling elements from lower level to higher levels, if they don't exist and pointing to the location in pulled array.

Let's look at the example array below

We choose to make bridge with every alternate array (highlighted)

After building fences, this is how final array looks like

What about Run Size?

FD Tree create logarithmic size sorted runs i.e array size increases by a factor of k

Mutable Tree accepts inputs
Once Tree becomes full, contents are merged to Level 1 run
Once Level 1 run reaches a threshold, it's contents are merged with Level 2 run and so-on
FD Trees uses adaptive version of Fractional Cascading, where head element is propagated as bridge element (in our example above we used every alternative element)
To handle multiple entries of same key, a Tombstone is inserted, which leads to entry being discarded at lowest level

Bw-Trees or Buzzword-Tree

Breaks a Tree node into smaller parts, which are written in append-only manner

Motivation is to Address

Write Amplification - updates to Node require updates to disk page
Space Amplification - reserve extra space to make updates possible
Solving concurrency problems

Bw-Tree batches updates to different Nodes by append-only storage, link nodes as chain and use in-memory structure that allows pointers between nodes with atomic swap operation. Atomic swap avoid need of Lock

Update Chains

Create a Base Node
All modifications create delta nodes, linked together as linked list to the base node
Since nodes are of different sizes, they can be stored together as they won't be page aligned

Although this is space efficient, read path has to traverse all the nodes

A simple representation of Node would be something like this

Each Node has a logical identifier. There is an in-memory mapping table that maintains mapping between identifier and disk location.

Simple example of BW-Tree (all nodes not shown)

Update Process

Find target leaf node, start from root and traverse down with help of mapping table
Create a new delta node and attach to the last node (base node or delta node)
Update Mapping Table with delta node created in Step #2

A Compare and Swap operation is performed in in-memory structure, which avoid use of Latches.

Structural Modifications

Like a B-Tree, BW-Tree may also need Nodes to be split or merged.

Split Operation

Merge delta nodes with Base node
A special split delta node is added to the node being split, for readers
Split delta node invalidates records in the splitting node and has link to new logical sibling node
Parent is updated with newly created node

Cache-Oblivious B-Trees

Treats on-disk structures very similar to in-memory structures
Gives asymptotically optimal performance
Doesn't need to know details about block size, disk page or cache lines

Disk is partitioned into blocks and data is transferred between disk and cache as blocks

Van Emde Boas Layout

A Cache-oblivious B-Tree consists of a static B-Tree and a packed array Structure.
Tree is split at the middle of the edge and this process is recursive
N square subtrees are formed
Recursive tree is stored in contiguous memory
Contains gaps for future inserted elements
If gaps are not present, elements have to be moved
If array is too dense or sparse, tree is rebuilt

Merging

A special remove delete node is appended to right sibling, indicating merge
A merge delta node is created on left sibling to point to the content of right sibling
Parent is updated by removing pointer to right sibling

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Chapter 06: B-Tree Variants.md

Chapter 06: B-Tree Variants.md

Chapter 6: B-Tree Variants

Chapter Mindmap

B-Tree Variants

Copy-on-Write B-Tree

LMDB

Lazy B-Tree

Wired Tiger

Lazy-Adaptive Tree

FD Tree

Fractional Cascading

What about Run Size?

Bw-Trees or Buzzword-Tree

Update Chains

Structural Modifications

Split Operation

Cache-Oblivious B-Trees

Van Emde Boas Layout

Merging

Files

Chapter 06: B-Tree Variants.md

Latest commit

History

Chapter 06: B-Tree Variants.md

File metadata and controls

Chapter 6: B-Tree Variants

Chapter Mindmap

B-Tree Variants

Copy-on-Write B-Tree

LMDB

Lazy B-Tree

Wired Tiger

Lazy-Adaptive Tree

FD Tree

Fractional Cascading

What about Run Size?

Bw-Trees or Buzzword-Tree

Update Chains

Structural Modifications

Split Operation

Cache-Oblivious B-Trees

Van Emde Boas Layout

Merging