-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvnemesis: uniquely identify all versions #89477
Conversation
181726d
to
2d2e706
Compare
57cf6f5
to
a03f3f9
Compare
No rush on reviewing this Erik, it's not ready for merging anyway, need to come up with a version of the plumbing that could at least in theory get consensus. But this seems to work and I'll base DeleteRangeUsingTombstone on it so that we can start getting some test coverage. |
9c10c9f
to
2b0afee
Compare
I haven't done the plumbing and there are some test failures to fix up (because I changed some proto sizes under |
@erikgrinaker maybe we should kick this off with a 30 minute live walkthrough? I haven't written enough high-level description yet, though I suspect you're in the loop already. |
Sure, we can do that. After standup? I'm pretty familiar already, but the diff is large enough that there are probably bits that I've missed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass, reviewing everything outside of kvnemesis
itself. I think the plumbing here came out really nice!
Haven't squashed yet, but it's rebased and has a nice PR message. PTALL @erikgrinaker Hope to get this in on Monday 🤞🏽🤞🏽🤞🏽🤞🏽🤞🏽 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable seems to have given up on this PR, so I haven't reviewed everything again in detail, but looked over the comments and the code outside of the kvnemesis package.
Reviewed 234 of 248 files at r17, 20 of 20 files at r18, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvnemesis/validator.go
line 1235 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Figured it out, we're using this iter just to find all points that have MVCC history, and then throwing each of them in
validReadTimes
which does do proper tombstone handling. Added a comment and removed some effectively unused code here.
Got it, thanks for updating the comment. There's a duplicate "MVCC range deletions" in the comment here.
pkg/kv/kvnemesis/validator.go
line 467 at r17 (raw file):
Previously, tbg (Tobias Grieger) wrote…
(I'm allowing this now, just wanted to give you a ping in case you think we should discuss w KV folks)
Since the operation is non-atomic, the caller has to keep retrying until it succeeds across all ranges. The semantics here are effectively the same as for ClearRange, which was used previously.
pkg/storage/mvcc.go
line 1105 at r17 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I removed this TODO btw since there isn't a problem now, but want to make sure it's future proof.
Oof, yeah, this isn't great. I mean, it works, but I wouldn't expect this to be part of the pebbleMVCCScanner
API guarantees, so someone can easily end up breaking this later. On the other hand, we wouldn't want to add in any non-trivial work to track this in results
either, since this is on the hottest read path we have. I think I'm ok with this for now just because we only ever make use of this in test code and I don't have any better ideas, but it deserves a TODO or NB at least.
pkg/storage/mvcc.go
line 6623 at r18 (raw file):
} bounds := it.RangeBounds().Clone()
nit: let's reuse the span.EndKey
slice instead, when we assign at the end of the loop. It doesn't currently matter much, since this is only used in test code, but just in case someone starts using it in production paths.
The fixup commits are all there, so if you really wanted to, you could find the one that looks familiar from last time and review from there. But there is limited utility there too, just pointing it out as an option. TFTR, will address later. |
46e199f
to
c5cd8ea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to get this green now.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
07bd6ab
to
5468c82
Compare
This is essentially a v2 of kvnemesis. While a lot of code has changed, it's not a rewrite, rather we are actually bringing kvnemesis closer to the idea which ultimately led to it being built. That idea is that if values can uniquely identify the operation which wrote them, serializability checking becomes easier as any observation of a value totally orders the reader and the writer with respect to each other. "Easier" meant both simpler code, as well as actually being able to computationally do the verification on complex histories. Prior to this PR, kvnemesis was writing unique values where it could, but it couldn't do it for deletions - after all, a deletion is like writing a "nothing" to MVCC, and there wasn't any way to make two "nothings" distinguishable. Having broken with the basic premise of unique values, there was a lot of bending over backwards going on to avoid, for the most part, devolving into a "normal" serializability checker. However, for contrived edge cases this would not be avoidable, and so there would be histories that kvnemesis just couldn't handle. "v2" (this PR) gets this right. The main contribution is that we now thread kvnemesis' sequence numbers all the way down into MVCC and back up with the RangeFeed. Values are now truly unique: a deletion tombstone is identifiable via its `MVCCValueHeader`, which carries the `kvnemesisutil.Seq` of the `Operation` that originally wrote it. On top of this, everything "just works" as you expect. Plumbing testing-only fields through the KV API protobufs isn't something that was taken lightly and that required a good amount of deliberation. However, we figured out a clever (maybe too clever?) way to have these fields be zero-sized in production, meaning they cannot possibly affect production logic and they also do not influence struct sizes or the wire encoding. As a drawback, kvnemesis requires the `crdb_test` build tag (it will `t.Skip()` otherwise). The remainder of this PR is mostly improvements in code quality. `echodriven` testing was introduced for many of the tests that could benefit from it. The core components of kvnemesis were reworked for clarity (the author spent the first week very confused and wishes for that experience to remain unrepeated by anyone). kvnemesis also tracks the execution timestamps as reported by the KV layer, which a future change could [use](cockroachdb#92898) for additional validation. Of note is the updated doc comment, which is reproduced below in entirety. Fixes cockroachdb#90955. Fixes cockroachdb#88988. Fixes cockroachdb#76435. Fixes cockroachdb#69642. ---- Package kvnemesis exercises the KV API with random concurrent traffic (as well as splits, merges, etc) and then validates that the observed behaviors are serializable. It does so in polynomial time based on the techniques used by [Elle] (see in particular section 4.2.3), using the after-the-fact MVCC history as a record of truth. It ensures that all write operations embed a unique identifier that is stored in MVCC history, and can thus identify which of its operations' mutations are reflected in the database ("recoverability" in Elle parlance). A run of kvnemesis proceeds as follows. First, a Generator is instantiated. It can create, upon request, a sequence in which each element is a random Operation. Operations that are mutations (i.e. not pure reads) are assigned a unique kvnemesisutil.Seq which will be stored alongside the MVCC data written by the Operation. A pool of worker threads concurrently generates Operations and executes them against a CockroachDB cluster. Some of these Operations may succeed, some may fail, and for some of them an ambiguous result may be encountered. Alongside this random traffic, kvnemesis maintains a RangeFeed that ingests the MVCC history. This creates a "carbon copy" of the MVCC history. All of these workers wind down once they have in aggregate completed a configured number of steps. Next, kvnemesis validates that the Operations that were executed and the results they saw match the MVCC history, i.e. checks for Serializability. In general, this is an NP-hard problem[^1], but the use of unique sequence numbers (kvnemesisutil.Seq) makes it tractable, as each mutation in the MVCC keyspace uniquely identifies the Operation that must have written the value. We make use of this property as follows. First, the Validate method iterates through all Operations performed and, for each, inspects - the type of the Operation (i.e. Put, Delete, Get, ...), - the (point or range) key of the operation, and - its results (i.e. whether there was an error or which key-value pairs were returned). Each atomic unit (i.e. individual non-transactional operation, or batch of non-transactional operations, or transaction) results in a slice (in order in which the Operations within executed[^2]) of observations, where each element is either an observed read or an observed write. For example, a Batch that first writes `v1` to `k1`, then scans `[k1-k3)` (reading its own write as well as some other key k2 with value v2) and then deletes `k3` would generate the following slice: [ observedWrite(k1->v1), observedScan(k1->v1, k2->v2), observedWrite(k3->v3), ] Each such slice (i.e. atomic unit) will then be compared to the MVCC history. For all units that succeeded, their writes must match up with versions in the MVCC history, and this matching is trivial thanks to unique values (including for deletions, since we embed the kvnemesisutil.Seq in the value), and in particular a single write will entirely fix the MVCC timestamp at which the atomic unit must have executed. For each read (i.e. get or scan), we compute at which time intervals each read would have been valid. For example, if the MVCC history for a key `k1` is as follows: k1 ----------------- t5 v2 t4 t3 t2 <del> t1 v1 then - observedGet(k1->v1) is valid for [t1,t2), - observedGet(k1->nil) is valid for [0,t1) and [t2,t5), and - observedGet(k1->v2) is valid for [t5,inf). By intersecting the time intervals for each Operation in an atomic unit, we then get the MVCC timestamps at which this Operation would have observed what it ended up observing. If this intersection is empty, serializability must have been violated. In the error case, kvnemesis verifies that no part of the Operation became visible. For ambiguous results, kvnemesis requires that either no Operation become visible, or otherwise treats the atomic unit as committed. The KV API also has the capability to return the actual execution timestamp directly with responses. At the time of writing, kvnemesis does verify that it does do this reliably, but it does not verify that the execution timestamp is contained in the intersection of time intervals obtained from inspecting MVCC history[^3]. [Elle]: https://arxiv.org/pdf/2003.10554.pdf [^1]: https://dl.acm.org/doi/10.1145/322154.322158 [^2]: there is currently concurrency within the atomic unit in kvnemesis. It could in theory carry out multiple reads concurrently while not also writing, such as DistSQL does, but this is not implemented today. See: cockroachdb#64825 [^3]: tracked in cockroachdb#92898. Epic: none Release note: None
See cockroachdb#93154. Release note: None
Fixes a kvnemesis flake: > kvnemesis.go:166: error applying x.AdminChangeReplicas(ctx, tk(9865574548402477322), [{REMOVE_VOTER n2,s2}]) // replica cannot hold lease: replica cannot hold lease Release note: None
This works around cockroachdb#92189 while it's being fixed (and allows me to merge the PR this commit is in, which otherwise fairly reliably fails maye_stress in CI). Release note: None
bors r=erikgrinaker |
Build succeeded: |
Found in cockroachdb#89477. Epic: none Release note: None
92257: kvcoord: return error on failed scan response checksum r=erikgrinaker a=tbg Found in #89477. Epic: none Release note: None Co-authored-by: Tobias Grieger <[email protected]>
This is essentially a v2 of kvnemesis. While a lot of code has changed, it's not a rewrite, rather we are actually bringing kvnemesis closer to the idea which ultimately led to it being built. That idea is that if values can uniquely identify the operation which wrote them, serializability checking becomes easier as any observation of a value totally orders the reader and the writer with respect to each other. "Easier" meant both simpler code, as well as actually being able to computationally do the verification on complex histories.
Prior to this PR, kvnemesis was writing unique values where it could, but it couldn't do it for deletions - after all, a deletion is like writing a "nothing" to MVCC, and there wasn't any way to make two "nothings" distinguishable. Having broken with the basic premise of unique values, there was a lot of bending over backwards going on to avoid, for the most part, devolving into a "normal" serializability checker. However, for contrived edge cases this would not be avoidable, and so there would be histories that kvnemesis just couldn't handle.
"v2" (this PR) gets this right. The main contribution is that we now thread kvnemesis' sequence numbers all the way down into MVCC and back up with the RangeFeed. Values are now truly unique: a deletion tombstone is identifiable via its
MVCCValueHeader
, which carries thekvnemesisutil.Seq
of theOperation
that originally wrote it. On top of this, everything "just works" as you expect.Plumbing testing-only fields through the KV API protobufs isn't something that was taken lightly and that required a good amount of deliberation. However, we figured out a clever (maybe too clever?) way to have these fields be zero-sized in production, meaning they cannot possibly affect production logic and they also do not influence struct sizes or the wire encoding. As a drawback, kvnemesis requires the
crdb_test
build tag (it willt.Skip()
otherwise).The remainder of this PR is mostly improvements in code quality.
echodriven
testing was introduced for many of the tests that could benefit from it. The core components of kvnemesis were reworked for clarity (the author spent the first week very confused and wishes for that experience to remain unrepeated by anyone). kvnemesis also tracks the execution timestamps as reported by the KV layer, which a future change could use for additional validation.Of note is the updated doc comment, which is reproduced below in entirety.
Fixes #90955.
Fixes #88988.
Fixes #76435.
Fixes #69642.
Package kvnemesis exercises the KV API with random concurrent traffic (as
well as splits, merges, etc) and then validates that the observed behaviors
are serializable.
It does so in polynomial time based on the techniques used by Elle (see in
particular section 4.2.3), using the after-the-fact MVCC history as a record
of truth. It ensures that all write operations embed a unique identifier that
is stored in MVCC history, and can thus identify which of its operations'
mutations are reflected in the database ("recoverability" in Elle parlance).
A run of kvnemesis proceeds as follows.
First, a Generator is instantiated. It can create, upon request, a sequence
in which each element is a random Operation. Operations that are mutations
(i.e. not pure reads) are assigned a unique kvnemesisutil.Seq which will be
stored alongside the MVCC data written by the Operation.
A pool of worker threads concurrently generates Operations and executes them
against a CockroachDB cluster. Some of these Operations may
succeed, some may fail, and for some of them an ambiguous result may be
encountered.
Alongside this random traffic, kvnemesis maintains a RangeFeed that ingests
the MVCC history. This creates a "carbon copy" of the MVCC history.
All of these workers wind down once they have in aggregate completed a
configured number of steps.
Next, kvnemesis validates that the Operations that were executed and the
results they saw match the MVCC history, i.e. checks for Serializability. In
general, this is an NP-hard problem1, but the use of unique sequence
numbers (kvnemesisutil.Seq) makes it tractable, as each mutation in the MVCC
keyspace uniquely identifies the Operation that must have written the value.
We make use of this property as follows. First, the Validate method iterates
through all Operations performed and, for each, inspects
Each atomic unit (i.e. individual non-transactional operation, or batch of
non-transactional operations, or transaction) results in a slice (in order
in which the Operations within executed2) of observations, where each
element is either an observed read or an observed write.
For example, a Batch that first writes
v1
tok1
, then scans[k1-k3)
(reading its own write as well as some other key k2 with value v2) and then
deletes
k3
would generate the following slice:Each such slice (i.e. atomic unit) will then be compared to the MVCC history.
For all units that succeeded, their writes must match up with versions in
the MVCC history, and this matching is trivial thanks to unique values
(including for deletions, since we embed the kvnemesisutil.Seq in the value),
and in particular a single write will entirely fix the MVCC timestamp at
which the atomic unit must have executed.
For each read (i.e. get or scan), we compute at which time intervals each
read would have been valid. For example, if the MVCC history for a key
k1
is as follows:
then
By intersecting the time intervals for each Operation in an atomic unit, we
then get the MVCC timestamps at which this Operation would have observed what
it ended up observing. If this intersection is empty, serializability must have
been violated.
In the error case, kvnemesis verifies that no part of the Operation became visible.
For ambiguous results, kvnemesis requires that either no Operation become visible,
or otherwise treats the atomic unit as committed.
The KV API also has the capability to return the actual execution timestamp directly
with responses. At the time of writing, kvnemesis does verify that it does do this
reliably, but it does not verify that the execution timestamp is contained in the
intersection of time intervals obtained from inspecting MVCC history3.
Epic: none
Footnotes
https://dl.acm.org/doi/10.1145/322154.322158 ↩
there is currently concurrency within the atomic unit in kvnemesis. It
could in theory carry out multiple reads concurrently while not also writing,
such as DistSQL does, but this is not implemented today. See:
https://github.com/cockroachdb/cockroach/issues/64825 ↩
tracked in https://github.com/cockroachdb/cockroach/issues/92898. ↩