- Feature Name: Pebble SSTable Format Versions
- Status: draft
- Start Date: 2022-01-12
- Authors: Nick Travers
- RFC PR: #1450
- Pebble Issues: #1409 #1339
- Cockroach Issues:
To safely support changes to the SSTable structure, a new versioning scheme under a Pebble magic number is proposed.
This RFC also outlines the relationship between the SSTable format version and the existing Pebble format major version, in addition to how the two are to be used in Cockroach for safely enabling new table format versions.
Pebble currently uses a "format major version" scheme for the store (or DB)
that indicates which Pebble features should be enabled when the store is first
opened, before any SSTables are opened. The versions indicate points of
backwards incompatibility for a store. For example, the introduction of the
`SetWithDelete` key kind is gated behind a version, as is block property
collection. This format major version scheme was introduced in #1227.
While Pebble can use the format major version to infer how to load and interpret data in the LSM, the SSTables that make up the store itself have their own notion of a "version". This "SSTable version" (also referred to as a "table format") is written to the footer (or trailing section) of each SSTable file and determines how the file is to be interpreted by Pebble. Currently, Pebble supports two table formats - LevelDB's format, and RocksDB's v2 format. Pebble inherited the latter as the default table format as it was the version that RocksDB used at the time Pebble was being developed, and remained the default to allow for a simpler migration path from Cockroach clusters that were originally using RocksDB as the storage engine. The RocksDBv2 table format adds various features on top of the LevelDB format, including a two-level index, configurable checksum algorithms, and an explicit versioning scheme to allow for the introduction of changes, amongst other features.
While the RocksDBv2 SSTable format has been sufficient for Pebble's needs since inception, new Pebble features and potential backports from RocksDB itself require that the SSTable format evolve over time and therefore that the table format be updated. As the majority of new features added over time will be specific to Pebble, it does not make sense to repurpose the RocksDB format versions that exist upstream for use with Pebble features (at the time of writing, RocksDB had added versions 3 and 4 on top of the version 2 in use by Pebble). A new Pebble-specific table format scheme is proposed.
In the context of a distributed system such as Cockroach, certain SSTable features are backwards incompatible (e.g. the block property collection and filtering feature extends the RocksDBv2 SSTable block index format to encode various block properties, which is a breaking change). Participants must first ensure that their stores have the code-level features available to read and write these newer SSTables (indicated by Pebble's format major version). Once all stores agree that they are running the minimum Pebble format major version and will not roll back (e.g. Cockroach cluster version finalization), SSTables can be written and read using more recent table formats. The Pebble "format major version" and "table format version" are therefore no longer independent - the former implies an upper bound on the latter.
Additionally, certain SSTable generation operations are independent of a specific Pebble instance. For example, SSTable construction for the purposes of backup and restore generates SSTables that are stored external to a specific Pebble store (e.g. in cloud storage) and can be used at a later point in time to restore a store. SSTables constructed for such purposes must be carefully versioned to ensure compatibility with existing clusters that may run with a mixture of Pebble versions.
As a real-world example of the need for the above, consider two Cockroach nodes, each with a Pebble store: store A at version A and store B at version B, where version A is newer than version B. Store A constructs an SSTable for an external backup containing a newer block index format (for block property collection). This SSTable is then imported into store B. Store B fails to read the SSTable as it is not running with a format major version recent enough to make sense of the newer index format. The two stores require a method for agreeing on a minimum supported table format.
The remainder of this document outlines a new table format for Pebble. This new table format will be used for new table-level features such as block properties and range keys (see #1339), but also for backporting table-level features from RocksDB that would be useful to Pebble (e.g. version 3 avoids encoding sequence numbers in the index, and version 4 uses delta encoding for the block offsets in the index, both of which are useful for Pebble).
The last 8 bytes of an SSTable are referred to as the "magic number".
LevelDB uses the first 8 bytes of the SHA1 hash of the string
`http://code.google.com/p/leveldb/` for the magic number.
RocksDB uses its own magic number, which indicates the use of a slightly different table layout - the footer (the name for the end of an SSTable) is slightly larger to accommodate a 32-bit version number and 8 bits for a checksum type to be used for all blocks in the SSTable.
A new magic number will be introduced for Pebble. Similar to LevelDB, the first 8 bytes of the SHA1 hash of the string "github.com/cockroachdb/pebble" will be used:
SHA1('github.com/cockroachdb/pebble') = \x69\xda\xf0\x0e\x5c\x1d\x47\x82
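The derivation described above (take the SHA1 hash of a string and keep its first 8 bytes) can be sketched in Go as follows. This is a minimal illustration only: `magicFromString` is a hypothetical helper and not part of Pebble, and the sketch makes no claim about how the bytes are ordered when written to the footer.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// magicLen is the size, in bytes, of the magic number at the end of an SSTable.
const magicLen = 8

// magicFromString returns the first 8 bytes of the SHA1 hash of s.
// Hypothetical helper used for illustration; it is not a Pebble API.
func magicFromString(s string) [magicLen]byte {
	sum := sha1.Sum([]byte(s)) // sha1.Sum returns a [20]byte digest
	var magic [magicLen]byte
	copy(magic[:], sum[:magicLen])
	return magic
}

func main() {
	magic := magicFromString("github.com/cockroachdb/pebble")
	// Print the 8-byte prefix as hex digits.
	fmt.Printf("%x\n", magic[:])
}
```

Deriving the value from a project-specific string makes accidental collisions with the LevelDB and RocksDB magic numbers very unlikely, while keeping the value reproducible by anyone.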
Tables with a Pebble magic number will use a dedicated versioning scheme,
starting with version `1`. No new versions other than version `2` will be
supported for tables containing the RocksDB magic number.
The choice of switching to a Pebble versioning scheme starting at `1`
simplifies the implementation. Essentially all existing Pebble stores are
managed via Cockroach, and were either previously using RocksDB and migrated to
Pebble, or were created as Pebble stores. In both situations the table format
used is RocksDB v2.
Given that Pebble has not needed (and likely will not need) to support other RocksDB table formats, it is reasonable to introduce a new magic number for Pebble and reset the version counter to v1.
The initial versions will correspond to the following new Pebble features:
- Version 1: block property collectors (block properties are encoded into the block index)
- Version 2: range keys (a new block is present in the table for range keys)
Subsequent alterations to the SSTable format should only increment the Pebble version number. Note that backported RocksDB table format features (e.g. RocksDB versions 3 and 4) would use a different version number, within the Pebble version sequence. While possibly confusing, the RocksDB features are being "adopted" by Pebble, rather than directly ported, so a Pebble-specific version number is appropriate.
An alternative would be to allow RocksDB table format features to be backported into Pebble under their existing RocksDB magic number, alongside Pebble-specific features. The complexity required to determine the set of characteristics to read and write to each SSTable would increase with such a scheme, compared to the simpler "linear history" approach described above, where new features simply ratchet the Pebble table format version number.
The footer format for SSTables with Pebble magic numbers will remain the same as the RocksDB footer format - specifically, the trailing 53 bytes of the SSTable, consisting of the following fields at the given byte offsets, little-endian encoded:
- 0: Checksum type
- 1-20: Meta-index block handle
- 21-40: Index block handle
- 41-44: Version number
- 45-52: Magic number
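The footer layout above can be read with a few fixed-offset slices. The following Go sketch is illustrative only: the `footer` type and `parseFooter` function are hypothetical names, not Pebble's implementation, and the two block handles are kept as raw bytes because their internal encoding is not covered here.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// footerLen is the size of the RocksDB-style footer described above.
const footerLen = 53

// footer holds the raw fields of the trailing 53 bytes of an SSTable.
// Hypothetical type for illustration.
type footer struct {
	checksumType    byte     // offset 0
	metaindexHandle [20]byte // offsets 1-20, left undecoded
	indexHandle     [20]byte // offsets 21-40, left undecoded
	version         uint32   // offsets 41-44, little-endian
	magic           [8]byte  // offsets 45-52
}

// parseFooter splits the last 53 bytes of file into their fields.
func parseFooter(file []byte) (footer, error) {
	var f footer
	if len(file) < footerLen {
		return f, errors.New("file too small to contain a footer")
	}
	buf := file[len(file)-footerLen:]
	f.checksumType = buf[0]
	copy(f.metaindexHandle[:], buf[1:21])
	copy(f.indexHandle[:], buf[21:41])
	f.version = binary.LittleEndian.Uint32(buf[41:45])
	copy(f.magic[:], buf[45:53])
	return f, nil
}

func main() {
	// Build a fake file whose footer carries checksum type 1, version 2
	// and a placeholder magic number.
	file := make([]byte, 100)
	foot := file[len(file)-footerLen:]
	foot[0] = 1
	binary.LittleEndian.PutUint32(foot[41:45], 2)
	copy(foot[45:], "magicnum")

	f, err := parseFooter(file)
	if err != nil {
		panic(err)
	}
	fmt.Printf("checksum=%d version=%d magic=%q\n", f.checksumType, f.version, f.magic[:])
}
```

Because the layout is unchanged from the RocksDB footer, a reader can parse the footer first and only then branch on the magic number and version.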
The `sstable.TableFormat` enum is a `uint32` representation of the tuple
`(magic number, format version)`. The current values are:
    type TableFormat uint32

    const (
        TableFormatRocksDBv2 TableFormat = iota
        TableFormatLevelDB
    )
It should be noted that this enum is not persisted in the SSTable. It is purely an internal type that represents the tuple that simplifies a number of version checks when reading / writing an SSTable. The values are free to change, provided care is taken with default values and existing usage.
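As a rough sketch of how the internal enum can stand in for the persisted tuple, the lookup below maps a `(magic number, format version)` pair read from a footer to a `TableFormat`. The magic strings are placeholders rather than the real byte sequences, treating LevelDB tables as version 0 is an assumption made for the example, and `footerKey`, `knownFormats` and `resolveTableFormat` are hypothetical names.

```go
package main

import "fmt"

// TableFormat mirrors the current enum values shown above.
type TableFormat uint32

const (
	TableFormatRocksDBv2 TableFormat = iota
	TableFormatLevelDB
)

// Placeholder 8-byte magic values; these are NOT the real magic numbers.
const (
	levelDBMagic = "LEVELDB_"
	rocksDBMagic = "ROCKSDB_"
)

// footerKey is the persisted tuple that a TableFormat value represents.
type footerKey struct {
	magic   string
	version uint32
}

// knownFormats lists every tuple the reader understands.
// Assumption: LevelDB tables are treated as version 0.
var knownFormats = map[footerKey]TableFormat{
	{magic: levelDBMagic, version: 0}: TableFormatLevelDB,
	{magic: rocksDBMagic, version: 2}: TableFormatRocksDBv2,
}

// resolveTableFormat converts the tuple read from a footer into the
// internal enum, rejecting combinations that are not supported.
// Callers must check the error before using the returned value.
func resolveTableFormat(magic string, version uint32) (TableFormat, error) {
	format, ok := knownFormats[footerKey{magic: magic, version: version}]
	if !ok {
		return 0, fmt.Errorf("unsupported table format: magic %q, version %d", magic, version)
	}
	return format, nil
}

func main() {
	format, err := resolveTableFormat(rocksDBMagic, 2)
	fmt.Println(format == TableFormatRocksDBv2, err)

	// RocksDB version 3 is not supported, so this returns an error.
	_, err = resolveTableFormat(rocksDBMagic, 3)
	fmt.Println(err)
}
```

Once the tuple has been resolved, later version checks can compare a single integer instead of inspecting the magic number and version separately.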
The existing `sstable.TableFormat` will be altered to reflect the "linear"
nature of the version history. New versions will be added with the next value
in the sequence.
    const (
        TableFormatLevelDB TableFormat = iota + 1
        TableFormatRocksDBv2  // The current default table format.
        TableFormatPebbleDBv1 // Block properties.
        TableFormatPebbleDBv2 // Range keys.
        ...
        TableFormatPebbleDBvN
    )
Starting the sequence at `iota + 1` ensures that, where a `sstable.TableFormat`
is not specified (i.e. it has the zero value), Pebble can select a suitable
default for writing the table (most likely based on the format major version in
use by the store; more in the next section).
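To illustrate why reserving the zero value is useful, the sketch below treats an unset `TableFormat` as "use the store's default". The `writerOptions` struct and `effectiveFormat` method are hypothetical and exist only for this example; how Pebble actually picks the default is left open by this RFC.

```go
package main

import "fmt"

type TableFormat uint32

// The proposed ordering: the zero value is deliberately left unassigned.
const (
	TableFormatLevelDB TableFormat = iota + 1
	TableFormatRocksDBv2
	TableFormatPebbleDBv1
	TableFormatPebbleDBv2
)

// writerOptions is a hypothetical options struct for writing a table.
type writerOptions struct {
	TableFormat TableFormat // zero means "not specified"
}

// effectiveFormat returns the requested format, or storeDefault when the
// caller did not specify one.
func (o writerOptions) effectiveFormat(storeDefault TableFormat) TableFormat {
	if o.TableFormat == 0 {
		return storeDefault
	}
	return o.TableFormat
}

func main() {
	var unset writerOptions
	fmt.Println(unset.effectiveFormat(TableFormatRocksDBv2) == TableFormatRocksDBv2)

	explicit := writerOptions{TableFormat: TableFormatLevelDB}
	fmt.Println(explicit.effectiveFormat(TableFormatRocksDBv2) == TableFormatLevelDB)
}
```

With the current enum, where `TableFormatRocksDBv2` is the zero value, an unset field cannot be distinguished from an explicit request for RocksDBv2.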
The `FormatMajorVersion` type is used to determine the set of features the
store supports.
A Pebble store may be read from and written to by a Pebble binary that supports newer features, with more recent Pebble format major versions. These newer features could include the ability to read and write more recent SSTables. While the store could read and write SSTables at the most recent version the binary supports, it is not safe to do so, for reasons outlined earlier.
The format major version will have a "maximum table format version" associated
with it that indicates the maximum `sstable.TableFormat` that can be safely
handled by the store.
When introducing a new table format version, it should be gated behind an
associated `FormatMajorVersion` that has the new table format as its "maximum
table format version".
For example:
    // Existing versions.
    FormatDefault.TableFormatVersionMax() // sstable.TableFormatRocksDBv2
    ...
    FormatSetWithDelete.TableFormatVersionMax() // sstable.TableFormatRocksDBv2

    // Proposed versions with Pebble version scheme.
    FormatBlockPropertyCollector.TableFormatVersionMax() // sstable.TableFormatPebbleDBv1
    FormatRangeKeys.TableFormatVersionMax() // sstable.TableFormatPebbleDBv2
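One possible shape for this mapping is a method on the format major version that returns the newest table format the store may safely handle. The sketch below is an assumption-laden illustration: the list of format major versions is abbreviated to those named above, their numeric values are invented for the example, and the body of `TableFormatVersionMax` is not taken from Pebble.

```go
package main

import "fmt"

type TableFormat uint32

const (
	TableFormatLevelDB TableFormat = iota + 1
	TableFormatRocksDBv2
	TableFormatPebbleDBv1
	TableFormatPebbleDBv2
)

// FormatMajorVersion is an abbreviated stand-in for Pebble's type; only
// the versions mentioned in this document are listed, and the numeric
// values are illustrative.
type FormatMajorVersion uint64

const (
	FormatDefault FormatMajorVersion = iota
	FormatSetWithDelete
	FormatBlockPropertyCollector
	FormatRangeKeys
)

// TableFormatVersionMax returns the newest table format that a store at
// format major version v may read and write.
func (v FormatMajorVersion) TableFormatVersionMax() TableFormat {
	switch {
	case v >= FormatRangeKeys:
		return TableFormatPebbleDBv2
	case v >= FormatBlockPropertyCollector:
		return TableFormatPebbleDBv1
	default:
		// Every earlier format major version is limited to RocksDBv2.
		return TableFormatRocksDBv2
	}
}

func main() {
	fmt.Println(FormatSetWithDelete.TableFormatVersionMax() == TableFormatRocksDBv2)
	fmt.Println(FormatRangeKeys.TableFormatVersionMax() == TableFormatPebbleDBv2)
}
```

Because both sequences are ordered, the mapping is monotonic: ratcheting the format major version can only raise the maximum table format, never lower it.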
The introduction of new SSTable format versions needs to be carefully coordinated between stores to ensure there are no incompatibilities (i.e. a newer store writes an SSTable that cannot be understood by other stores).
It is only safe to use a new table format when all nodes in a cluster have been
finalized. A newer Cockroach node, with newer Pebble code, should continue to
write SSTables with a table format version equal to or less than the smallest
table format version across all nodes in the cluster. Once the cluster version
has been finalized, and `(*DB).RatchetFormatMajorVersion(FormatMajorVersion)`
has been called, nodes are free to write SSTables at newer table format
versions.
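The invariant described above, that writers must not exceed the smallest table format supported across the cluster, can be expressed as a simple minimum. The sketch below is purely illustrative: `safeWriteFormat` is a hypothetical function, and in practice Cockroach would derive this bound from the finalized cluster version rather than by polling nodes.

```go
package main

import "fmt"

type TableFormat uint32

const (
	TableFormatLevelDB TableFormat = iota + 1
	TableFormatRocksDBv2
	TableFormatPebbleDBv1
	TableFormatPebbleDBv2
)

// safeWriteFormat returns the newest table format that every node can
// read, given each node's maximum supported table format. It reports
// false when no nodes are supplied.
func safeWriteFormat(nodeMax []TableFormat) (TableFormat, bool) {
	if len(nodeMax) == 0 {
		return 0, false
	}
	lowest := nodeMax[0]
	for _, f := range nodeMax[1:] {
		if f < lowest {
			lowest = f
		}
	}
	return lowest, true
}

func main() {
	// One node has not been upgraded and only understands RocksDBv2, so
	// the whole cluster must keep writing RocksDBv2 tables.
	mixed := []TableFormat{TableFormatPebbleDBv1, TableFormatRocksDBv2, TableFormatPebbleDBv1}
	format, ok := safeWriteFormat(mixed)
	fmt.Println(format == TableFormatRocksDBv2, ok)
}
```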
At runtime, Pebble exposes a `(*DB).FormatMajorVersion()` method, which may be
used to determine the current format major version of the store and, hence, the
maximum table format version.
In addition to the above, there are situations where SSTables are created for
consumption at a later point in time, independent of any Pebble store -
specifically backup and restore. Currently, Cockroach uses two functions in
`pkg/sstable` to construct SSTables for both ingestion and backup. Both will
need to be updated to take into account the cluster version to ensure that
SSTables with newer versions are only written once the cluster version has been
finalized.
- Are there other locations in Cockroach, other than `sstable_writer.go`, that we need to update to gate the table format version we use when writing tables?