Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add MVCC range tombstones tech note #83762

Merged

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Jul 3, 2022

Rendered version


This describes the current state, but the details will likely change as
we iterate on the implementation.

Resolves #83406.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-tech-note branch from bcc5198 to aa0c474 Compare July 3, 2022 21:26
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jul 3, 2022

@nicktrav As we discussed in the meeting last week, here's the high-level tech note for MVCC range tombstones. Let me know when we have something for Pebble range keys, and I'll link it from here -- I figure including low-level implementation details here is probably not useful to most readers, but let me know if you feel otherwise.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-tech-note branch 10 times, most recently from ceebbaf to afee3ec Compare July 5, 2022 09:41
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-tech-note branch from afee3ec to 3288118 Compare July 5, 2022 12:58
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @itsbilal, @jbowens, and @nicktrav)


docs/tech-notes/mvcc-range-tombstones.md line 105 at r1 (raw file):

Time
5 a5 b5

worth clarifying that a5 is the value for a@5 and so on.


docs/tech-notes/mvcc-range-tombstones.md line 126 at r1 (raw file):

MVCC iteration later.

In the KV API, however, this distinction doesn't really matter: `Get(c)` would

worth saying what read timestamp this Get is using.


docs/tech-notes/mvcc-range-tombstones.md line 231 at r1 (raw file):

A notable and unfortunate implementation detail is that these KV mutations must
take out a latch that is slightly wider than the written span (e.g.
`EndKey.Next()`) in order to look for any surrounding MVCC range keys that might

how is the start key widened?


docs/tech-notes/mvcc-range-tombstones.md line 385 at r1 (raw file):

it it has no existing point keys: a get of `bar` would return `bar@4` and
`bar@2` even though there is no real point key at `bar`, as this might be
required e.g. for conflict checks. These synthetic tombstones would similarly

why is an MVCCGet returning multiple versions for the same key?
Also, worth saying what timestamp this get is operating at (and similarly for the MVCCScan description earlier).


docs/tech-notes/mvcc-range-tombstones.md line 405 at r1 (raw file):

* `RangeKeyCount`: The number of MVCC range key fragment stacks, i.e. ignoring
  historical versions.

this includes fragmentation caused by historical versions, yes?
That is, if we have [a,z)@10, and [b,c)@5, [d,e)@5, we would have fragmentation into [a,b)@10, [b,c)@10, [c,d)@10, [d,e)@10, [e,z)@10, so it would count as 5? If yes, could you enhance the example below to clarify this.


docs/tech-notes/mvcc-range-tombstones.md line 430 at r1 (raw file):

* `RangeKeyCount`: 3 (`[a-b)`, `[b-c)`, and `[c-d)`).
* `RangeKeyBytes`: 48 (3 * 2 * 2 for encoded key bounds `a`, `b`, `b`, `c`,

where does this last *2 come from. Is it the \x00 sentinel byte?


docs/tech-notes/mvcc-range-tombstones.md line 510 at r1 (raw file):

example, if there is a range tombstone `[b-d)@2`, and iteration stops between
`c@3` and `c@1`, returning a resume span at `c@1`, then the previous SST will
contain `[b-c\0)@2` and the next SST will contain `[c-d)@2`. This ensures that

This is ugly, but I can see why you had to do this.
Were there any alternatives considered?
And I am puzzled by the following comment in the code, regarding "allow this overlap". AddSSTable is adding a single sstable, so the usual non-overlap rule in DB.Ingest is trivially satisfied. Were there any Pebble changes made to allow for overlap when ingesting multiple sstables?

	// NB: If the result contains MVCC range tombstones, this can cause MVCC range
	// tombstones in two subsequent SSTs to overlap. For example, given the range
	// tombstone [a-f)@5, if we return a resume key at c@3 then the response will
	// contain a truncated MVCC range tombstone [a-c\0)@5 which covers the point
	// keys at c, but resuming from c@3 will contain the MVCC range tombstone
	// [c-f)@5 which overlaps with the MVCC range tombstone in the previous
	// response for the interval [c-c\0)@5. AddSSTable will allow this overlap
	// during ingestion.
	StopMidKey bool

docs/tech-notes/mvcc-range-tombstones.md line 538 at r1 (raw file):

There will be no corresponding point key events: it is up to the consumer to
apply these to relevant point keys as appropriate. However, rangefeeds with
diffs enabled will respect the MVCC range tombstone as the previous (empty)

The diff semantics are somewhat unclear to me.
When the point key is the latest and the prev is the range key, will the latter show as a synthesized point key?
If the range key is the latest, is the with-diff ignored -- maybe that's what you meant by "no corresponding point key events".
Do we know how customers use changefeeds with-diff and how will they be affected?


docs/tech-notes/mvcc-range-tombstones.md line 554 at r1 (raw file):

CDC will not need to take these into account yet, because the two initial
use-cases (SQL schema GC and import cancellation) take the tables offline and
prevent CDC changefeeds across them (including later catchup scans). However,

I suppose this answers my previous question.


docs/tech-notes/mvcc-range-tombstones.md line 468 at r2 (raw file):

However, MVCC range tombstones also allow for much more efficient GC using
Pebble range deletes rather than Pebble point deletes when there is no live data
above it. This optimization will is not yet implemented, but will be necessary

something wrong with this sentence: ... will is ...

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-tech-note branch from 3288118 to cb93271 Compare July 5, 2022 15:35
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @10, @5, @aliher1911, @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)


docs/tech-notes/mvcc-range-tombstones.md line 105 at r1 (raw file):

Previously, sumeerbhola wrote…

worth clarifying that a5 is the value for a@5 and so on.

Done.


docs/tech-notes/mvcc-range-tombstones.md line 126 at r1 (raw file):

Previously, sumeerbhola wrote…

worth saying what read timestamp this Get is using.

Done.


docs/tech-notes/mvcc-range-tombstones.md line 231 at r1 (raw file):

Previously, sumeerbhola wrote…

how is the start key widened?

This, unfortunately:

// Prevish returns a previous key in lexicographic sort order. It is impossible
// in general to find the exact immediate predecessor key, because it has an
// infinite number of 0xff bytes at the end, so this returns the nearest
// previous key right-padded with 0xff up to length bytes. An infinite number of
// keys may exist between Key and Key.Prevish(), as keys have unbounded length.
// This also implies that k.Prevish().IsPrev(k) will often be false.
//
// PrevishKeyLength can be used as a reasonable length in most situations.
//
// The method may only take a shallow copy of the Key, so both the receiver and
// the return value should be treated as immutable after.
func (k Key) Prevish(length int) Key {
return Key(encoding.BytesPrevish(k, length))
}


docs/tech-notes/mvcc-range-tombstones.md line 385 at r1 (raw file):

Previously, sumeerbhola wrote…

why is an MVCCGet returning multiple versions for the same key?
Also, worth saying what timestamp this get is operating at (and similarly for the MVCCScan description earlier).

It shouldn't, thanks. I keep thinking of this from the iterator point of view (which surfaces historical versions), rather than the get/scan point of view.


docs/tech-notes/mvcc-range-tombstones.md line 405 at r1 (raw file):

Previously, sumeerbhola wrote…

this includes fragmentation caused by historical versions, yes?
That is, if we have [a,z)@10, and [b,c)@5, [d,e)@5, we would have fragmentation into [a,b)@10, [b,c)@10, [c,d)@10, [d,e)@10, [e,z)@10, so it would count as 5? If yes, could you enhance the example below to clarify this.

Yep, done.


docs/tech-notes/mvcc-range-tombstones.md line 430 at r1 (raw file):

Previously, sumeerbhola wrote…

where does this last *2 come from. Is it the \x00 sentinel byte?

It is, made this explicit.


docs/tech-notes/mvcc-range-tombstones.md line 510 at r1 (raw file):

Were there any alternatives considered?

We originally discussed this back in #76131 (review) and related Slack discussions. We considered e.g. exporting range keys separately from point keys and such, but ended up preferring this approach since we expect there to be 0 range keys in the typical case.

And I am puzzled by the following comment in the code, regarding "allow this overlap". AddSSTable is adding a single sstable, so the usual non-overlap rule in DB.Ingest is trivially satisfied. Were there any Pebble changes made to allow for overlap when ingesting multiple sstables?

Right, this may turn out not to be an issue. I though we might have to add handling in CheckSSTConflicts such that it wouldn't conflict with itself, but this is only relevant when we're backfilling MVCC history (otherwise we're writing at the current timestamp), and in that case we won't even check for conflicts.

Furthermore, for backup specifically, we won't ingest MVCC range tombstones at all; we'll only ingest live data, but we'll have to read the MVCC range tombstones from the SSTs to build the live data view, in which case I believe the Pebble SST iterator will already merge these automatically.

I'll write some tests around this and make sure it works as expected, then update the comment and doc.

Were there any Pebble changes made to allow for overlap when ingesting multiple sstables?

Not that I'm aware of.


docs/tech-notes/mvcc-range-tombstones.md line 538 at r1 (raw file):

When the point key is the latest and the prev is the range key, will the latter show as a synthesized point key?

Yes. The main point is that if we have a real point key at b@1 covered by [a- c)@2, then writing b@3 will emit an event with a previous value that's empty rather than the value of b@1, as one would expect. Update the text with this example.

Do we know how customers use changefeeds with-diff and how will they be affected?

As you found below, changefeeds won't currently be affected.


docs/tech-notes/mvcc-range-tombstones.md line 468 at r2 (raw file):

Previously, sumeerbhola wrote…

something wrong with this sentence: ... will is ...

Thanks, fixed.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-tech-note branch 5 times, most recently from ad29fcd to 78fbdf4 Compare July 5, 2022 16:23
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @10, @5, @aliher1911, @erikgrinaker, @itsbilal, @nicktrav, and @sumeerbhola)


docs/tech-notes/mvcc-range-tombstones.md line 183 at r2 (raw file):

* `PutMVCCRangeKey(MVCCRangeKey, MVCCValue)`:  Writes the given range key
  with the given value. Replaces any existing range key(s) within the keyspan
  and timestamp. Errors if the value is not a tombstone.

nit: maybe "within the keyspan and at an identical timestamp"?[


docs/tech-notes/mvcc-range-tombstones.md line 387 at r3 (raw file):

Additionally, an `MVCCGet` will emit a synthetic point tombstone for the key
even it it has no existing point keys: `Get(bar)` at timestamp >= 4 would return

nit: s/it it/if it/

Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes for great reading. Thank you for putting it together! :lgtm:

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @10, @5, @aliher1911, @erikgrinaker, @itsbilal, and @sumeerbhola)


docs/tech-notes/mvcc-range-tombstones.md line 259 at r3 (raw file):

The properties of point and range keys are accessed via:

* `HasPointAndRange()`: whether the current position has a point and/or

nit: can we specify here that this returns a (bool, bool) tuple? The reader may assume this is a single bool, then be confused to encounter the tuple in the examples.


docs/tech-notes/mvcc-range-tombstones.md line 299 at r3 (raw file):

last in this example.

When using `SeekGE` within range key bounds, the iterator may land on the bare

Might be useful to have a case where there's a point key with no covering range key (i.e. the existing, pre-range key case), and therefore the seek will "move up" to the point key, rather than stopping at the range key lower bound. For example, SeekGE(e@5) if there was an e1 with no covering range key.

Maybe that's obvious, in which case perhaps just mention it in writing, for completeness?


docs/tech-notes/mvcc-range-tombstones.md line 315 at r3 (raw file):

| `c@2`    | `c@2`       | `false`, `true`    |               | `[b-d)@4`, `[b-d)@2` |

The same is not true for `SeekLT`: when using it within range key bounds, it

Worth an additional, tabulated example, to underscore the difference in the SeekLT case? Or even just a subset of the above operations that would demonstrate wherein difference lies.


docs/tech-notes/mvcc-range-tombstones.md line 349 at r3 (raw file):

Additionally, `EngineIterator` has corresponding functionality for processing
raw Pebble range keys, but these are primarily intended for low-level use,
e.g. when building Raft snapshots.

Worth adding a brief, final paragraph with something like "When in doubt, use X iterator implementation with Y settings"?


docs/tech-notes/mvcc-range-tombstones.md line 464 at r3 (raw file):

## MVCC Garbage Collection

This is still under development, but once complete, MVCC range tombstones will

Curious - are you planning on going back and updating this note once GC is done? Or publishing a new note for GC?


docs/tech-notes/mvcc-range-tombstones.md line 487 at r3 (raw file):

`Writer` section above for details.

However, the old `SSTIterator` does not support range keys, as it is built on a

Are we tracking the removal of this "old" iterator impl?


docs/tech-notes/mvcc-range-tombstones.md line 575 at r3 (raw file):

building them in the first place.

## Related Reading

We also have cockroachdb/pebble#1341, which we intend to land as the "official RFC" for Pebble range keys (despite it being finalized after we implemented range keys). Not sure if / how you want to incorporate here. We could also just loop back and add a reference to RFC in its final form once we land it.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-tech-note branch from 78fbdf4 to b06e75a Compare July 6, 2022 10:45
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @10, @5, @aliher1911, @itsbilal, @nicktrav, and @sumeerbhola)


docs/tech-notes/mvcc-range-tombstones.md line 299 at r3 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Might be useful to have a case where there's a point key with no covering range key (i.e. the existing, pre-range key case), and therefore the seek will "move up" to the point key, rather than stopping at the range key lower bound. For example, SeekGE(e@5) if there was an e1 with no covering range key.

Maybe that's obvious, in which case perhaps just mention it in writing, for completeness?

We can cover this with the existing example, with SeekGE(d@5), if I'm understanding you correctly? Added.


docs/tech-notes/mvcc-range-tombstones.md line 315 at r3 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Worth an additional, tabulated example, to underscore the difference in the SeekLT case? Or even just a subset of the above operations that would demonstrate wherein difference lies.

Sure, may as well.


docs/tech-notes/mvcc-range-tombstones.md line 349 at r3 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Worth adding a brief, final paragraph with something like "When in doubt, use X iterator implementation with Y settings"?

I don't think I want to prescribe any settings, because it's very situation-dependant, but clarified that EngineIterator is mostly internal and that almost everyone wants an MVCCIterator.


docs/tech-notes/mvcc-range-tombstones.md line 464 at r3 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Curious - are you planning on going back and updating this note once GC is done? Or publishing a new note for GC?

Yes, I'm going to update this note as the implementation matures.


docs/tech-notes/mvcc-range-tombstones.md line 487 at r3 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Are we tracking the removal of this "old" iterator impl?

Not explicitly. Bulk IO will have to migrate all of their code, so I guess it falls under the MVCC range tombstone backup/restore work.


docs/tech-notes/mvcc-range-tombstones.md line 575 at r3 (raw file):

Previously, nicktrav (Nick Travers) wrote…

We also have cockroachdb/pebble#1341, which we intend to land as the "official RFC" for Pebble range keys (despite it being finalized after we implemented range keys). Not sure if / how you want to incorporate here. We could also just loop back and add a reference to RFC in its final form once we land it.

Sure, thanks, linked it for now and we can update it later.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-tech-note branch from b06e75a to 45f3f0d Compare July 6, 2022 11:52
Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @10, @5, @aliher1911, @itsbilal, and @sumeerbhola)

@erikgrinaker
Copy link
Contributor Author

Merging for now, will be making additional passes later. TFTRs!

bors r=sumeerbhola,jbowens,nicktrav

@erikgrinaker
Copy link
Contributor Author

bors r-

Forgot about the AddSSTable bit that I wanted to test and update.

@craig
Copy link
Contributor

craig bot commented Jul 6, 2022

Canceled.

This describes the current state, but the details will likely change as
we iterate on the implementation.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-tech-note branch from 45f3f0d to 5e1cfb5 Compare July 6, 2022 19:14
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jul 6, 2022

Added some tests and docs updates for the Export SST overlap over in #83927, and added a sentence to the tech note.

bors r=sumeerbhola,jbowens,nicktrav

@craig
Copy link
Contributor

craig bot commented Jul 6, 2022

Build succeeded:

@craig craig bot merged commit ce1b42b into cockroachdb:master Jul 6, 2022
@erikgrinaker erikgrinaker deleted the mvcc-range-tombstones-tech-note branch July 12, 2022 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

docs: add tech note for MVCC range tombstones
5 participants