-
Notifications
You must be signed in to change notification settings - Fork 471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs/rfcs: pebble table format versions #1450
docs/rfcs: pebble table format versions #1450
Conversation
A new magic number will be introduced for Pebble. Similar to LevelDB, the first | ||
8 bytes of the SHA1 hash of the string "github.com/cockroachdb/pebble" will be | ||
used: | ||
|
||
``` | ||
SHA1('github.com/cockroachdb/pebble') = \x69\xda\xf0\x0e\x5c\x1d\x47\x82 | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also just be pebbledb
if we want to be cute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to be cute, could put a U+1FAB3
in there (I'm still disappointed in myself for not picking a lock emoji for the encrypted backup file magic number)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @nicktrav and @sumeerbhola)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 17 at r1 (raw file):
This RFC also outlines the relationship between the SSTable format version and the existing Pebble format major version, in addition to how the two are are to
s/are are/are/
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 23 at r1 (raw file):
Pebble currently uses a "format major version" scheme for the store (or DB) that indicates which Pebble features should be enabled when store is first
s/store/the store/
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 36 at r1 (raw file):
file and determines how the file is to be interpreted by Pebble. Currently, Pebble supports two table formats - LevelDB's format, and RocksDB's v2 format. The latter is the currently the default in Pebble and was selected to ensure a
s/the currently the default/the default/
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 38 at r1 (raw file):
The latter is the currently the default in Pebble and was selected to ensure a smooth transition to Pebble from Cockroach stores that originally used RocksDB as the engine.
I don't recall the differences between the LevelDB format and RocksDB format, but I think the RocksDB format is the default for Pebble-generated tables because of its features.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 153 at r1 (raw file):
## Changes / additions to `sstable.TableFormat` The `sstable.TableFormat` enum is a `unit32` representation of the tuple
s/unit32
/uint32
/
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 230 at r1 (raw file):
finalized. TODO(travers): Add paragraph here on what we need to update in Cockroach.
This might not be everything, but these two sstable.Writer
constructions will need to be updated to determine the table format version for the current format major version. https://github.com/cockroachdb/cockroach/blob/1db01e81aa3a783a8f2a587821d9068872c2de56/pkg/storage/sst_writer.go#L57-L88
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 236 at r1 (raw file):
runtime as to what table format version to use. TODO(travers): Is the above accurate? Unresolved question.
I think it'd be good to get @cockroachdb/bulk-io to review this. I'm not sure what guarantees we provide today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @dt and @nicktrav)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 52 at r1 (raw file):
features are backwards incompatible (e.g. the block property collection and filtering feature extends the RocksDBv2 SSTable block index format to encoded various block properties, which is a breaking change). Participants must
nit: SetWithDelete is incompatible too. We got away with it since it happens to stay hidden in the sstables owned by a Pebble instance.
I think we should explicitly state that one of the reasons we need sstable format versions is that sstables can exist independent of a Pebble instance. We already have mechanisms in place to ensure that sstables created within a Pebble instance are conformant with the (distributed) cluster version.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 126 at r1 (raw file):
Subsequent alterations to the SSTable format should only increment the _Pebble version number_. It should be noted that backported RocksDB table format features (e.g. RocksDB versions 3 and 4) would use a version number under the
use a different version ...
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 242 at r1 (raw file):
- Is it safe to assume that for the purposes of backup and restore, a table with a newer table format would never need to be read by a store in a cluster running at a later version?
should this say "older version"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @dt, @jbowens, @nicktrav, and @sumeerbhola)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 242 at r1 (raw file):
Previously, sumeerbhola wrote…
should this say "older version"?
Mostly. We'd want backup to explicitly control the format version it uses to write, so that a newer node, with newer pebble code, still writes backup files in the older format version when in a mixed version cluster so that old nodes in that cluster could still restore that backup. However once the cluster version has finalized, we're free to use the latest format version if we want, as we do not support restoring a backup written by a newer major version on an older major version.
3a0db44
to
d1427d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 10 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 17 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
s/are are/are/
Done.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 23 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
s/store/the store/
Done.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 36 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
s/the currently the default/the default/
Done.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 38 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I don't recall the differences between the LevelDB format and RocksDB format, but I think the RocksDB format is the default for Pebble-generated tables because of its features.
Re-worked this section slightly.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 52 at r1 (raw file):
nit: SetWithDelete is incompatible too. We got away with it since it happens to stay hidden in the sstables owned by a Pebble instance.
I was thinking of this case too when writing this. Where I landed was that as its introduction didn't require altering the "shape" of the SSTable (just a new key type), it didn't make sense to bump the table format version. My reading of the Rocks code is that this is the distinction taken there too - changes to the physical layout of the table are gated on table format version bumps, whereas code-level features are gated behind regular versions (RocksDB major/minor combinations).
explicitly state that one of the reasons we need sstable format versions is that sstables can exist independent of a Pebble instance
Agree. Added a paragraph on this.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 99 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Could also just be
pebbledb
if we want to be cute.
Going to just keep this as it is.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 126 at r1 (raw file):
Previously, sumeerbhola wrote…
use a different version ...
Done.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 153 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
s/
unit32
/uint32
/
Done.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 230 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
This might not be everything, but these two
sstable.Writer
constructions will need to be updated to determine the table format version for the current format major version. https://github.com/cockroachdb/cockroach/blob/1db01e81aa3a783a8f2a587821d9068872c2de56/pkg/storage/sst_writer.go#L57-L88
Got it. Noted it below.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 236 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I think it'd be good to get @cockroachdb/bulk-io to review this. I'm not sure what guarantees we provide today.
Added @dt as a reviewer.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 242 at r1 (raw file):
should this say "older version"?
Yep. Typo. Updated.
we do not support restoring a backup written by a newer major version on an older major version.
Cool. This is the confirmation I needed to close out this particular question.
We'd want backup to explicitly control the format version it uses to write
I'll weave this comment into the section above. I assume we'd now need to do some plumbing in sstable_writer.go
(here and here) to pass in the current cluster version so that the writer can make a determination as to what table format to use. @dt - does that sound reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 10 unresolved discussions (waiting on @dt, @jbowens, @nicktrav, and @sumeerbhola)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 52 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
nit: SetWithDelete is incompatible too. We got away with it since it happens to stay hidden in the sstables owned by a Pebble instance.
I was thinking of this case too when writing this. Where I landed was that as its introduction didn't require altering the "shape" of the SSTable (just a new key type), it didn't make sense to bump the table format version. My reading of the Rocks code is that this is the distinction taken there too - changes to the physical layout of the table are gated on table format version bumps, whereas code-level features are gated behind regular versions (RocksDB major/minor combinations).
explicitly state that one of the reasons we need sstable format versions is that sstables can exist independent of a Pebble instance
Agree. Added a paragraph on this.
If old code can't read it because the InternalKeyKind is unknown to it, then I think it is a sstable version change.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 68 at r2 (raw file):
specific Pebble instance. For example, SSTable construction for the purposes of backup and restore generates SSTables that are stored external to a specific Pebble store (e.g. in cloud storage) can be used a later point in time to
... and can be used at a later point ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modulo the note about RatchetFormatMajorVersion
Reviewable status: 0 of 1 files reviewed, 5 unresolved discussions (waiting on @dt, @nicktrav, and @sumeerbhola)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 253 at r2 (raw file):
This implies that a Cockroach node would only start writing with a more recent table format version after it is restarted following cluster version finalization.
Documenting our synchronous discussion: The (*pebble.DB).RatchetFormatMajorVersion
method may be called during cluster version finalization, which upgrades the database's format major version. We can generate tables using db.FormatMajorVersion.TableFormat()
.
d1427d5
to
cb3089a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 6 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 99 at r1 (raw file):
Previously, dt (David Taylor) wrote…
If you want to be cute, could put a
U+1FAB3
in there (I'm still disappointed in myself for not picking a lock emoji for the encrypted backup file magic number)
Given it's four bytes in UTF-8, we could have two :) ... or pick another emoji. Unfortunately there's no emoji for Pebble, only a "rock", though it's all open to interpretation!
@jbowens @sumeerbhola - do you have any preference on the magic number? Emoji, vs SHA, vs "pebbledb"?
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 68 at r2 (raw file):
Previously, sumeerbhola wrote…
... and can be used at a later point ...
Fixed.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 253 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Documenting our synchronous discussion: The
(*pebble.DB).RatchetFormatMajorVersion
method may be called during cluster version finalization, which upgrades the database's format major version. We can generate tables usingdb.FormatMajorVersion.TableFormat()
.
Updated. Removed one of the open questions.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 251 at r3 (raw file):
table format version. In addition to the above, there are situations where SSTables are created for
cc: @dt - any preference here on how to plumb through the version into these functions?
There's also an open question as to whether there are other locations we'd need to update to pass in the version. Are there other locations that stand out?
Once [Table format version RFC][1] is implemented, the `TableFormat` enum will include at least two more values for two new Pebble-specific table features. In anticipation of the change, re-work the existing enum values to introduce an "unknown" format version (for when the value is unspecified and assumes the default value), and order the two existing values in terms of their natural "lineage" (e.g. LevelDB begat RocksDB, which begat ..., etc.). In locations where an unspecified `TableFormat` makes its way all the way down to the sstable, panic. This ensures that a table format version is selected explicitly. [1]: cockroachdb#1450
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @dt, @jbowens, @nicktrav, and @sumeerbhola)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 99 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Given it's four bytes in UTF-8, we could have two :) ... or pick another emoji. Unfortunately there's no emoji for Pebble, only a "rock", though it's all open to interpretation!
@jbowens @sumeerbhola - do you have any preference on the magic number? Emoji, vs SHA, vs "pebbledb"?
no preference
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 249 at r3 (raw file):
At runtime, Pebble exposes a `(*db).FormatMajorVersion()` method, which may be used to determine the current format major version of the store, and hence, the table format version.
nit: we should also expose the table format version directly via a method on DB
since the caller won't know how to map the format major version of the store to the sstable format version.
I suspect that's what @jbowens was also stating in his comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 236 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Added @dt as a reviewer.
During a mixed version state, when writing somewhere external (backups, some future disagg storage, etc), no node should write a file that any other node in cluster might not be able to read, that is, we need to use the preferred format of the current cluster version. as soon as we finalize, we can use the new format, since backups are only forwards compatible (newer nodes can read one version older backup, but newer backups should not be read by older nodes if made after finalization)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 251 at r3 (raw file):
Previously, nicktrav (Nick Travers) wrote…
cc: @dt - any preference here on how to plumb through the version into these functions?
There's also an open question as to whether there are other locations we'd need to update to pass in the version. Are there other locations that stand out?
We'll need to plumb the cluster version to them, since until the upgrade is finalized we need to make formats readable by the prior version. once we finalize we can use whatever that version likes best, provided the next version will be able to read it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 242 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
should this say "older version"?
Yep. Typo. Updated.
we do not support restoring a backup written by a newer major version on an older major version.
Cool. This is the confirmation I needed to close out this particular question.
We'd want backup to explicitly control the format version it uses to write
I'll weave this comment into the section above. I assume we'd now need to do some plumbing in
sstable_writer.go
(here and here) to pass in the current cluster version so that the writer can make a determination as to what table format to use. @dt - does that sound reasonable?
Yeah, I think those are exactly the right spots: in both cases -- backup and ingestion writers -- the produced file needs to be readable by any other node in the cluster so if we're in mixed version, we need to use the format version that corresponds to the current cluster version.
cb3089a
to
33cbb61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
Dismissed @jbowens and @sumeerbhola from 5 discussions.
Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 99 at r1 (raw file):
Previously, sumeerbhola wrote…
no preference
@dt - you win. "🪳🪳" it is.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 236 at r1 (raw file):
Previously, dt (David Taylor) wrote…
During a mixed version state, when writing somewhere external (backups, some future disagg storage, etc), no node should write a file that any other node in cluster might not be able to read, that is, we need to use the preferred format of the current cluster version. as soon as we finalize, we can use the new format, since backups are only forwards compatible (newer nodes can read one version older backup, but newer backups should not be read by older nodes if made after finalization)
Cool. I think we're good here.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 242 at r1 (raw file):
Previously, dt (David Taylor) wrote…
Yeah, I think those are exactly the right spots: in both cases -- backup and ingestion writers -- the produced file needs to be readable by any other node in the cluster so if we're in mixed version, we need to use the format version that corresponds to the current cluster version.
👍 Resolved this.
docs/RFCS/20220112_pebble_sstable_format_versions.md, line 249 at r3 (raw file):
Previously, sumeerbhola wrote…
nit: we should also expose the table format version directly via a method on
DB
since the caller won't know how to map the format major version of the store to the sstable format version.
I suspect that's what @jbowens was also stating in his comment.
Added.
To safely support both range keys and block property collectors / filters, which requires updating the SSTable layout, new table format versions must be introduced. Add a small RFC that proposes a new Pebble magic number and versioning scheme and outlines relationship between the SSTable format version and the existing Pebble format major version. Informs cockroachdb#1339.
33cbb61
to
9a20a85
Compare
Once [Table format version RFC][1] is implemented, the `TableFormat` enum will include at least two more values for two new Pebble-specific table features. In anticipation of the change, re-work the existing enum values to introduce an "unknown" format version (for when the value is unspecified and assumes the default value), and order the two existing values in terms of their natural "lineage" (e.g. LevelDB begat RocksDB, which begat ..., etc.). In locations where an unspecified `TableFormat` makes its way all the way down to the sstable, panic. This ensures that a table format version is selected explicitly. [1]: #1450
To safely support both range keys and block property collectors /
filters, which requires updating the SSTable layout, new table format
versions must be introduced.
Add a small RFC that proposes a new Pebble magic number and versioning
scheme and outlines relationship between the SSTable format version and
the existing Pebble format major version.
Informs #1339.