Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs/rfcs: pebble table format versions #1450

Merged

Conversation

nicktrav
Copy link
Contributor

To safely support both range keys and block property collectors /
filters, which requires updating the SSTable layout, new table format
versions must be introduced.

Add a small RFC that proposes a new Pebble magic number and versioning
scheme and outlines relationship between the SSTable format version and
the existing Pebble format major version.

Informs #1339.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Comment on lines 93 to 109
A new magic number will be introduced for Pebble. Similar to LevelDB, the first
8 bytes of the SHA1 hash of the string "github.com/cockroachdb/pebble" will be
used:

```
SHA1('github.com/cockroachdb/pebble') = \x69\xda\xf0\x0e\x5c\x1d\x47\x82
```
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also just be pebbledb if we want to be cute.

Copy link
Member

@dt dt Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be cute, could put a U+1FAB3 in there (I'm still disappointed in myself for not picking a lock emoji for the encrypted backup file magic number)

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 17 at r1 (raw file):

This RFC also outlines the relationship between the SSTable format version and
the existing Pebble format major version, in addition to how the two are are to

s/are are/are/


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 23 at r1 (raw file):

Pebble currently uses a "format major version" scheme for the store (or DB)
that indicates which Pebble features should be enabled when store is first

s/store/the store/


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 36 at r1 (raw file):

file and determines how the file is to be interpreted by Pebble. Currently,
Pebble supports two table formats - LevelDB's format, and RocksDB's v2 format.
The latter is the currently the default in Pebble and was selected to ensure a

s/the currently the default/the default/


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 38 at r1 (raw file):

The latter is the currently the default in Pebble and was selected to ensure a
smooth transition to Pebble from Cockroach stores that originally used RocksDB
as the engine.

I don't recall the differences between the LevelDB format and RocksDB format, but I think the RocksDB format is the default for Pebble-generated tables because of its features.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 153 at r1 (raw file):

## Changes / additions to `sstable.TableFormat`

The `sstable.TableFormat` enum is a `unit32` representation of the tuple

s/unit32/uint32/


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 230 at r1 (raw file):

finalized.

TODO(travers): Add paragraph here on what we need to update in Cockroach.

This might not be everything, but these two sstable.Writer constructions will need to be updated to determine the table format version for the current format major version. https://github.com/cockroachdb/cockroach/blob/1db01e81aa3a783a8f2a587821d9068872c2de56/pkg/storage/sst_writer.go#L57-L88


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 236 at r1 (raw file):

runtime as to what table format version to use.

TODO(travers): Is the above accurate? Unresolved question.

I think it'd be good to get @cockroachdb/bulk-io to review this. I'm not sure what guarantees we provide today.

@jbowens jbowens requested a review from dt January 13, 2022 19:06
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @dt and @nicktrav)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 52 at r1 (raw file):

features are backwards incompatible (e.g. the block property collection and
filtering feature extends the RocksDBv2 SSTable block index format to encoded
various block properties, which is a breaking change). Participants must

nit: SetWithDelete is incompatible too. We got away with it since it happens to stay hidden in the sstables owned by a Pebble instance.

I think we should explicitly state that one of the reasons we need sstable format versions is that sstables can exist independent of a Pebble instance. We already have mechanisms in place to ensure that sstables created within a Pebble instance are conformant with the (distributed) cluster version.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 126 at r1 (raw file):

Subsequent alterations to the SSTable format should only increment the _Pebble
version number_. It should be noted that backported RocksDB table format
features (e.g. RocksDB versions 3 and 4) would use a version number under the

use a different version ...


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 242 at r1 (raw file):

- Is it safe to assume that for the purposes of backup and restore, a table
  with a newer table format would never need to be read by a store in a cluster
  running at a later version?

should this say "older version"?

Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @dt, @jbowens, @nicktrav, and @sumeerbhola)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 242 at r1 (raw file):

Previously, sumeerbhola wrote…

should this say "older version"?

Mostly. We'd want backup to explicitly control the format version it uses to write, so that a newer node, with newer pebble code, still writes backup files in the older format version when in a mixed version cluster so that old nodes in that cluster could still restore that backup. However once the cluster version has finalized, we're free to use the latest format version if we want, as we do not support restoring a backup written by a newer major version on an older major version.

@nicktrav nicktrav force-pushed the nickt.sstable-format-versions-rfc branch from 3a0db44 to d1427d5 Compare January 14, 2022 18:33
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 10 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 17 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

s/are are/are/

Done.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 23 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

s/store/the store/

Done.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 36 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

s/the currently the default/the default/

Done.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 38 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I don't recall the differences between the LevelDB format and RocksDB format, but I think the RocksDB format is the default for Pebble-generated tables because of its features.

Re-worked this section slightly.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 52 at r1 (raw file):

nit: SetWithDelete is incompatible too. We got away with it since it happens to stay hidden in the sstables owned by a Pebble instance.

I was thinking of this case too when writing this. Where I landed was that as its introduction didn't require altering the "shape" of the SSTable (just a new key type), it didn't make sense to bump the table format version. My reading of the Rocks code is that this is the distinction taken there too - changes to the physical layout of the table are gated on table format version bumps, whereas code-level features are gated behind regular versions (RocksDB major/minor combinations).

explicitly state that one of the reasons we need sstable format versions is that sstables can exist independent of a Pebble instance

Agree. Added a paragraph on this.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 99 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Could also just be pebbledb if we want to be cute.

Going to just keep this as it is.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 126 at r1 (raw file):

Previously, sumeerbhola wrote…

use a different version ...

Done.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 153 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

s/unit32/uint32/

Done.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 230 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

This might not be everything, but these two sstable.Writer constructions will need to be updated to determine the table format version for the current format major version. https://github.com/cockroachdb/cockroach/blob/1db01e81aa3a783a8f2a587821d9068872c2de56/pkg/storage/sst_writer.go#L57-L88

Got it. Noted it below.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 236 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think it'd be good to get @cockroachdb/bulk-io to review this. I'm not sure what guarantees we provide today.

Added @dt as a reviewer.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 242 at r1 (raw file):

should this say "older version"?

Yep. Typo. Updated.

we do not support restoring a backup written by a newer major version on an older major version.

Cool. This is the confirmation I needed to close out this particular question.

We'd want backup to explicitly control the format version it uses to write

I'll weave this comment into the section above. I assume we'd now need to do some plumbing in sstable_writer.go (here and here) to pass in the current cluster version so that the writer can make a determination as to what table format to use. @dt - does that sound reasonable?

@nicktrav nicktrav mentioned this pull request Jan 14, 2022
29 tasks
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 10 unresolved discussions (waiting on @dt, @jbowens, @nicktrav, and @sumeerbhola)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 52 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

nit: SetWithDelete is incompatible too. We got away with it since it happens to stay hidden in the sstables owned by a Pebble instance.

I was thinking of this case too when writing this. Where I landed was that as its introduction didn't require altering the "shape" of the SSTable (just a new key type), it didn't make sense to bump the table format version. My reading of the Rocks code is that this is the distinction taken there too - changes to the physical layout of the table are gated on table format version bumps, whereas code-level features are gated behind regular versions (RocksDB major/minor combinations).

explicitly state that one of the reasons we need sstable format versions is that sstables can exist independent of a Pebble instance

Agree. Added a paragraph on this.

If old code can't read it because the InternalKeyKind is unknown to it, then I think it is a sstable version change.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 68 at r2 (raw file):

specific Pebble instance. For example, SSTable construction for the purposes of
backup and restore generates SSTables that are stored external to a specific
Pebble store (e.g. in cloud storage) can be used a later point in time to

... and can be used at a later point ...

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:LGTM:

modulo the note about RatchetFormatMajorVersion

Reviewable status: 0 of 1 files reviewed, 5 unresolved discussions (waiting on @dt, @nicktrav, and @sumeerbhola)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 253 at r2 (raw file):

This implies that a Cockroach node would only start writing with a more recent
table format version after it is restarted following cluster version
finalization.

Documenting our synchronous discussion: The (*pebble.DB).RatchetFormatMajorVersion method may be called during cluster version finalization, which upgrades the database's format major version. We can generate tables using db.FormatMajorVersion.TableFormat().

@nicktrav nicktrav force-pushed the nickt.sstable-format-versions-rfc branch from d1427d5 to cb3089a Compare January 19, 2022 17:10
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 6 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 99 at r1 (raw file):

Previously, dt (David Taylor) wrote…

If you want to be cute, could put a U+1FAB3 in there (I'm still disappointed in myself for not picking a lock emoji for the encrypted backup file magic number)

Given it's four bytes in UTF-8, we could have two :) ... or pick another emoji. Unfortunately there's no emoji for Pebble, only a "rock", though it's all open to interpretation!

@jbowens @sumeerbhola - do you have any preference on the magic number? Emoji, vs SHA, vs "pebbledb"?


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 68 at r2 (raw file):

Previously, sumeerbhola wrote…

... and can be used at a later point ...

Fixed.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 253 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Documenting our synchronous discussion: The (*pebble.DB).RatchetFormatMajorVersion method may be called during cluster version finalization, which upgrades the database's format major version. We can generate tables using db.FormatMajorVersion.TableFormat().

Updated. Removed one of the open questions.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 251 at r3 (raw file):

table format version.

In addition to the above, there are situations where SSTables are created for

cc: @dt - any preference here on how to plumb through the version into these functions?

There's also an open question as to whether there are other locations we'd need to update to pass in the version. Are there other locations that stand out?

nicktrav added a commit to nicktrav/pebble that referenced this pull request Jan 20, 2022
Once [Table format version RFC][1] is implemented, the `TableFormat`
enum will include at least two more values for two new Pebble-specific
table features.

In anticipation of the change, re-work the existing enum values to
introduce an "unknown" format version (for when the value is unspecified
and assumes the default value), and order the two existing values in
terms of their natural "lineage" (e.g. LevelDB begat RocksDB, which
begat ..., etc.).

In locations where an unspecified `TableFormat` makes its way all the
way down to the sstable, panic. This ensures that a table format version
is selected explicitly.

[1]: cockroachdb#1450
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @dt, @jbowens, @nicktrav, and @sumeerbhola)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 99 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Given it's four bytes in UTF-8, we could have two :) ... or pick another emoji. Unfortunately there's no emoji for Pebble, only a "rock", though it's all open to interpretation!

@jbowens @sumeerbhola - do you have any preference on the magic number? Emoji, vs SHA, vs "pebbledb"?

no preference


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 249 at r3 (raw file):

At runtime, Pebble exposes a `(*db).FormatMajorVersion()` method, which may be
used to determine the current format major version of the store, and hence, the
table format version.

nit: we should also expose the table format version directly via a method on DB since the caller won't know how to map the format major version of the store to the sstable format version.
I suspect that's what @jbowens was also stating in his comment.

Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 236 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Added @dt as a reviewer.

During a mixed version state, when writing somewhere external (backups, some future disagg storage, etc), no node should write a file that any other node in cluster might not be able to read, that is, we need to use the preferred format of the current cluster version. as soon as we finalize, we can use the new format, since backups are only forwards compatible (newer nodes can read one version older backup, but newer backups should not be read by older nodes if made after finalization)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 251 at r3 (raw file):

Previously, nicktrav (Nick Travers) wrote…

cc: @dt - any preference here on how to plumb through the version into these functions?

There's also an open question as to whether there are other locations we'd need to update to pass in the version. Are there other locations that stand out?

We'll need to plumb the cluster version to them, since until the upgrade is finalized we need to make formats readable by the prior version. once we finalize we can use whatever that version likes best, provided the next version will be able to read it too.

Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 242 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

should this say "older version"?

Yep. Typo. Updated.

we do not support restoring a backup written by a newer major version on an older major version.

Cool. This is the confirmation I needed to close out this particular question.

We'd want backup to explicitly control the format version it uses to write

I'll weave this comment into the section above. I assume we'd now need to do some plumbing in sstable_writer.go (here and here) to pass in the current cluster version so that the writer can make a determination as to what table format to use. @dt - does that sound reasonable?

Yeah, I think those are exactly the right spots: in both cases -- backup and ingestion writers -- the produced file needs to be readable by any other node in the cluster so if we're in mixed version, we need to use the format version that corresponds to the current cluster version.

@nicktrav nicktrav force-pushed the nickt.sstable-format-versions-rfc branch from cb3089a to 33cbb61 Compare January 20, 2022 20:41
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Dismissed @jbowens and @sumeerbhola from 5 discussions.
Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @sumeerbhola)


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 99 at r1 (raw file):

Previously, sumeerbhola wrote…

no preference

@dt - you win. "🪳🪳" it is.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 236 at r1 (raw file):

Previously, dt (David Taylor) wrote…

During a mixed version state, when writing somewhere external (backups, some future disagg storage, etc), no node should write a file that any other node in cluster might not be able to read, that is, we need to use the preferred format of the current cluster version. as soon as we finalize, we can use the new format, since backups are only forwards compatible (newer nodes can read one version older backup, but newer backups should not be read by older nodes if made after finalization)

Cool. I think we're good here.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 242 at r1 (raw file):

Previously, dt (David Taylor) wrote…

Yeah, I think those are exactly the right spots: in both cases -- backup and ingestion writers -- the produced file needs to be readable by any other node in the cluster so if we're in mixed version, we need to use the format version that corresponds to the current cluster version.

👍 Resolved this.


docs/RFCS/20220112_pebble_sstable_format_versions.md, line 249 at r3 (raw file):

Previously, sumeerbhola wrote…

nit: we should also expose the table format version directly via a method on DB since the caller won't know how to map the format major version of the store to the sstable format version.
I suspect that's what @jbowens was also stating in his comment.

Added.

To safely support both range keys and block property collectors /
filters, which requires updating the SSTable layout, new table format
versions must be introduced.

Add a small RFC that proposes a new Pebble magic number and versioning
scheme and outlines relationship between the SSTable format version and
the existing Pebble format major version.

Informs cockroachdb#1339.
@nicktrav nicktrav force-pushed the nickt.sstable-format-versions-rfc branch from 33cbb61 to 9a20a85 Compare January 20, 2022 20:43
@nicktrav nicktrav merged commit b58a7bd into cockroachdb:master Jan 20, 2022
@nicktrav nicktrav deleted the nickt.sstable-format-versions-rfc branch January 20, 2022 21:13
nicktrav added a commit that referenced this pull request Jan 21, 2022
Once [Table format version RFC][1] is implemented, the `TableFormat`
enum will include at least two more values for two new Pebble-specific
table features.

In anticipation of the change, re-work the existing enum values to
introduce an "unknown" format version (for when the value is unspecified
and assumes the default value), and order the two existing values in
terms of their natural "lineage" (e.g. LevelDB begat RocksDB, which
begat ..., etc.).

In locations where an unspecified `TableFormat` makes its way all the
way down to the sstable, panic. This ensures that a table format version
is selected explicitly.

[1]: #1450
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants