Skip to content

Commit

Permalink
docs/rfcs: pebble table format versions
Browse files Browse the repository at this point in the history
To safely support both range keys and block property collectors /
filters, which requires updating the SSTable layout, new table format
versions must be introduced.

Add a small RFC that proposes a new Pebble magic number and versioning
scheme and outlines relationship between the SSTable format version and
the existing Pebble format major version.

Informs #1339.
  • Loading branch information
nicktrav committed Jan 20, 2022
1 parent f3d164d commit b58a7bd
Showing 1 changed file with 261 additions and 0 deletions.
261 changes: 261 additions & 0 deletions docs/RFCS/20220112_pebble_sstable_format_versions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,261 @@
- Feature Name: Pebble SSTable Format Versions
- Status: in-progress
- Start Date: 2022-01-12
- Authors: Nick Travers
- RFC PR: https://github.com/cockroachdb/pebble/pull/1450
- Pebble Issues:
https://github.com/cockroachdb/pebble/issues/1409
https://github.com/cockroachdb/pebble/issues/1339
- Cockroach Issues:

# Summary

To safely support changes to the SSTable structure, a new versioning scheme
under a Pebble magic number is proposed.

This RFC also outlines the relationship between the SSTable format version and
the existing Pebble format major version, in addition to how the two are to
be used in Cockroach for safely enabling new table format versions.

# Motivation

Pebble currently uses a "format major version" scheme for the store (or DB)
that indicates which Pebble features should be enabled when the store is first
opened, before any SSTables are opened. The versions indicate points of
backwards incompatibility for a store. For example, the introduction of the
`SetWithDelete` key kind is gated behind a version, as is block property
collection. This format major version scheme was introduced in
[#1227](https://github.com/cockroachdb/pebble/issues/1227).

While Pebble can use the format major version to infer how to load and
interpret data in the LSM, the SSTables that make up the store itself have
their own notion of a "version". This "SSTable version" (also referred to as a
"table format") is written to the footer (or trailing section) of each SSTable
file and determines how the file is to be interpreted by Pebble. Currently,
Pebble supports two table formats - LevelDB's format, and RocksDB's v2 format.
Pebble inherited the latter as the default table format as it was the version
that RocksDB used at the time Pebble was being developed, and remained the
default to allow for a simpler migration path from Cockroach clusters that were
originally using RocksDB as the storage engine. The RocksDBv2 table format adds
various features on top of the LevelDB format, including a two-level index,
configurable checksum algorithms, and an explicit versioning scheme to allow
for the introduction of changes, amongst other features.

While the RocksDBv2 SSTable format has been sufficient for Pebble's needs since
inception, new Pebble features and potential backports from RocksDB itself
require that the SSTable format evolve over time and therefore that the table
format be updated. As the majority of new features added over time will be
specific to Pebble, it does not make sense to repurpose the RocksDB format
versions that exist upstream for use with Pebble features (at the time of
writing, RocksDB had added versions 3 and 4 on top of the version 2 in use by
Pebble). A new Pebble-specific table format scheme is proposed.

In the context of a distributed system such as Cockroach, certain SSTable
features are backwards incompatible (e.g. the block property collection and
filtering feature extends the RocksDBv2 SSTable block index format to encoded
various block properties, which is a breaking change). Participants must
_first_ ensure that their stores have the code-level features available to read
and write these newer SSTables (indicated by Pebble's format major version).
Once all stores agree that they are running the minimum Pebble format major
version and will not roll back (e.g. Cockroach cluster version finalization),
SSTables can be written and read using more recent table formats. The Pebble
"format major version" and "table format version" are therefore no longer
independent - the former implies an upper bound on the latter.

Additionally, certain SSTable generation operations are independent of a
specific Pebble instance. For example, SSTable construction for the purposes of
backup and restore generates SSTables that are stored external to a specific
Pebble store (e.g. in cloud storage) can be used at a later point in time to
restore a store. SSTables constructed for such purposes must be carefully
versioned to ensure compatibility with existing clusters that may run with a
mixture of Pebble versions.

As a real-world example of the need for the above, consider two Cockroach nodes
each with a Pebble store, one at version A, the other at version B (version A
(newer) > B (older)). Store A constructs an SSTable for an external backup
containing a newer block index format (for block property collection). This
SSTable is then imported in to store B. Store B fails to read the SSTable as it
is not running with a format major version recent enough make sense of the
newer index format. The two stores require a method for agreeing on a minimum
supported table format.

The remainder of this document outlines a new table format for Pebble. This new
table format will be used for new table-level features such as block properties
and range keys (see
[#1339](https://github.com/cockroachdb/pebble/issues/1339)), but also for
backporting table-level features from RocksDB that would be useful to Pebble
(e.g. version 3 avoids encoding sequence numbers in the index, and version 4
uses delta encoding for the block offsets in the index, both of which are
useful for Pebble).

# Technical design

## Pebble magic number

The last 8 bytes of an SSTable is referred to as the "magic number".

LevelDB uses the first 8 bytes of the SHA1 hash of the string
`http://code.google.com/p/leveldb/` for the magic number.

RocksDB uses its own magic number, which indicates the use of a slightly
different table layout - the footer (the name for the end of an SSTable) is
slightly larger to accommodate a 32-bit version number and 8 bits for a
checksum type to be used for all blocks in the SSTable.

A new 8-byte magic number will be introduced for Pebble:

```
\xf0\x9f\xaa\xb3\xf0\x9f\xaa\xb3 // 🪳🪳
```

## Pebble version scheme

Tables with a Pebble magic number will use a dedicated versioning scheme,
starting with version `1`. No new versions other than version `2` will be
supported for tables containing the RocksDB magic number.

The choice of switching to a Pebble versioning scheme starting `1` simplifies
the implementation. Essentially all existing Pebble stores are managed via
Cockroach, and were either previously using RocksDB and migrated to Pebble, or
were created as Pebble stores. In both situations the table format used is
RocksDB v2.

Given that Pebble has not needed (and likely will not need) to support other
RocksDB table formats, it is reasonable to introduce a new magic number for
Pebble and reset the version counter to v1.

The following initial versions will correspond to the following new Pebble
features:

- Version 1: block property collectors (block properties are encoded into the
block index)
- Version 2: range key (a new block is present in the table for range keys).

Subsequent alterations to the SSTable format should only increment the _Pebble
version number_. It should be noted that backported RocksDB table format
features (e.g. RocksDB versions 3 and 4) would use a different version number,
within the Pebble version sequence. While possibly confusing, the RocksDB
features are being "adopted" by Pebble, rather than directly ported, so a
Pebble specific version number is appropriate.

An alternative would be to allow RocksDB table format features to be backported
into Pebble under their existing RocksDB magic number, _alongside_
Pebble-specific features. The complexity required to determine the set of
characteristics to read and write to each SSTable would increase with such a
scheme, compared to the simpler "linear history" approach described above,
where new features simply ratchet the Pebble table format version number.

## Footer format

The footer format for SSTables with Pebble magic numbers _will remain the same_
as the RocksDB footer format - specifically, the trailing 53-bytes of the
SSTable consisting of the following fields with the given indices,
little-endian encoded:

- `0`: Checksum type
- `1-20`: Meta-index block handle
- `21-40`: Index block handle
- `41-44`: Version number
- `45-52`: Magic number

## Changes / additions to `sstable.TableFormat`

The `sstable.TableFormat` enum is a `uint32` representation of the tuple
`(magic number, format version). The current values are:

```go
type TableFormat uint32

const (
TableFormatRocksDBv2 TableFormat = iota
TableFormatLevelDB
)
```

It should be noted that this enum is _not_ persisted in the SSTable. It is
purely an internal type that represents the tuple that simplifies a number of
version checks when reading / writing an SSTable. The values are free to
change, provided care is taken with default values and existing usage.

The existing `sstable.TableFormat` will be altered to reflect the "linear"
nature of the version history. New versions will be added with the next value
in the sequence.

```go
const (
TableFormatUnknown TableFormat = iota
TableFormatLevelDB // The original LevelDB table format.
TableFormatRocksDBv2 // The current default table format.
TableFormatPebbleDBv1 // Block properties.
TableFormatPebbleDBv2 // Range keys.
...
TableFormatPebbleDBvN
)
```

The introduction of `TableFormatUknown` can be used to ensure that where a
`sstable.TableFormat` is _not_ specified, Pebble can select a suitable default
for writing the table (most likely based on the format major version in use by
the store; more in the next section).

## Interaction with the format major version

The `FormatMajorVersion` type is used to determine the set of features the
store supports.

A Pebble store may be read-from / written-to by a Pebble binary that supports
newer features, with more recent Pebble format major versions. These newer
features could include the ability to read and write more recent SSTables.
While the store _could_ read and write SSTables at the most recent version the
binary supports, it is not safe to do so, for reasons outlined earlier.

The format major version will have a "maximum table format version" associated
with it that indicates the maximum `sstable.TableFormat` that can be safely
handled by the store.

When introducing a new _table format_ version, it should be gated behind an
associated `FormatMajorVersion` that has the new table format as its "maximum
table format version".

For example:

```go
// Existing verisons.
FormatDefault.TableFormatVersionMax() // sstable.TableFormatRocksDBv2
...
FormatSetWithDelete.TableFormatVersionMax() // sstable.TableFormatRocksDBv2
// Proposed versions with Pebble version scheme.
FormatBlockPropertyCollector.TableFormatVersionMax() // sstable.TableFormatPebbleDBv1
FormatRangeKeys.TableFormatVersionMax() // sstable.TableFormatPebbleDBv2
```

## Usage in Cockroach

The introduction of new SSTable format versions needs to be carefully
coordinated between stores to ensure there are no incompatibilities (i.e. newer
store writes an SSTable that cannot be understood by other stores).

It is only safe to use a new table format when all nodes in a cluster have been
finalized. A newer Cockroach node, with newer Pebble code, should continue to
write SSTables with a table format version equal to or less than the smallest
table format version across all nodes in the cluster. Once the cluster version
has been finalized, and `(*DB).RatchetFormatMajorVersion(FormatMajorVersion)`
has been called, nodes are free to write SSTables at newer table format
versions.

At runtime, Pebble exposes a `(*DB).FormatMajorVersion()` method, which may be
used to determine the current format major version of the store, and hence, the
table format version. Pebble will also expose a `(*DB).TableFormat()` method
that will allow the caller to introspect the current table format version of
the store.

In addition to the above, there are situations where SSTables are created for
consumption at a later point in time, independent of any Pebble store -
specifically backup and restore. Currently, Cockroach uses two functions in
`pkg/sstable` to construct SSTables for both ingestion and backup
([here](https://github.com/cockroachdb/cockroach/blob/20eaf0b415f1df361246804e5d1d80c7a20a8eb6/pkg/storage/sst_writer.go#L57)
and
[here](https://github.com/cockroachdb/cockroach/blob/20eaf0b415f1df361246804e5d1d80c7a20a8eb6/pkg/storage/sst_writer.go#L78)).
Both will need to be updated to take into account the cluster version to ensure
that SSTables with newer versions are only written once the cluster version has
been finalized.

0 comments on commit b58a7bd

Please sign in to comment.