-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: Migrate MVCCStats.contains_estimates from bool to int64 #37583
storage: Migrate MVCCStats.contains_estimates from bool to int64 #37583
Conversation
0a22a46
to
8def60c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good so far! I think there's still some confusion about these computations but that's understandable, this stuff melts my brain every time I look at it.
Here's what I'm thinking about the migration. Naively (and this won't be enough) commands (in batcheval, i.e. AddSSTable) would do something like:
cArgs.Stats.ContainsEstimates++
Now here's still a problem: consider some nodes are running the old version still and we propose a command that introduces ContainsEstimates for the first time. It looks good, because the new nodes will compute 1 + 0 = 1
and the old nodes will see a protobuf-varint encoded 1 on the wire, which they decode to ContainsEstimates == true
(see here)
But now we run a second command with an estimate, so the new nodes update ContainsEstimates = 1+1 = 2
but the old node still decodes 2 as true
and true
encodes the same as 1
(in protobuf parlance). So we're screwed because now the old nodes persist different bytes for their stats as the new nodes. We can't use anything but
0and
1(with the semantics that
1` never gets removed) as long as old nodes are around.
This poses an awkward problem since we need to uphold this also when we can't check the cluster version (i.e. below Raft).
I think the way out is to never use ContainsEstimates == 1
except as a signal that we're still using the legacy behavior. We can then adapt the logic that adds ContainsEstimates
so that we have
func (ms *MVCCStats) Subtract(oms MVCCStats) { // similar for Add
if ms.ContainsEstimates == 1 || oms.ContainsEstimates == 1 {
ms.ContainsEstimates = 1
} else {
ms.ContainsEstimates -= oms.ContainsEstimates
}
}
and wherever we generate a delta we would use
if !r.ClusterSettings().Version.IsActive(cluster.VersionContainsEstimatesCounter) {
cArgs.Stats.ContainsEstimates = 1
} else {
cArgs.Stats.ContainsEstimates += 2
}
Let's walk through an example to see how this changes things and run three AddSSTable commands in a mixed cluster. All of them will set cArgs.Stats.ContainsEstimates = 1
and thanks to ms.{Add,Subtract}
above and the fact that nobody will ever use any ContainsEstimates
which are not 0
or 1
(since old nodes are around), this will behave (and encode) exactly like the old nodes encoded false
and true
. Of course there's an exception, which is that MVCCStatsDelta
uses sint64
and there, 1
encodes like a regular varint(2)
, so when in MVCCStatsDelta
we need to map 1 -> -1
(i.e. ContainsEstimates == true
corresponds to ContainsEstimates == -1
). That is just to say that the MVCCStatsDelta
and the various other structs of the same goal (which we use for more efficient encodings in some places) need to have extra code in the wrappers:
cockroach/pkg/storage/engine/enginepb/mvcc3.go
Lines 19 to 37 in 39ba88b
// ToStats converts the receiver to an MVCCStats. | |
func (ms *MVCCStatsDelta) ToStats() MVCCStats { | |
return MVCCStats(*ms) | |
} | |
// ToStatsDelta converts the receiver to an MVCCStatsDelta. | |
func (ms *MVCCStats) ToStatsDelta() MVCCStatsDelta { | |
return MVCCStatsDelta(*ms) | |
} | |
// ToStats converts the receiver to an MVCCStats. | |
func (ms *MVCCPersistentStats) ToStats() MVCCStats { | |
return MVCCStats(*ms) | |
} | |
// ToPersistentStats converts the receiver to an MVCCPersistentStats. | |
func (ms *MVCCStats) ToPersistentStats() MVCCPersistentStats { | |
return MVCCPersistentStats(*ms) | |
} |
Now when the cluster version flips, nodes will start to emit even numbers. Due to our semantics (see Subtract
above) as long as the range still has ContainsEstimates == 1
, we stick with 1
. But once the consistency checker adjusts the stats under the new version (i.e. resets to zero), future updates will actually use the counter semantics, as we wanted.
^- this is all pretty out in the weeds (especially the different proto encodings), so it's totally fine by me (and a valid starting point) to forget about the migration for now and to get the code in good order first. We can then give the migration focused attention. The migration is unfortunately pretty critical since replicas will diverge if we get it wrong.
Reviewed 28 of 28 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @giorgosp)
pkg/storage/client_split_test.go, line 934 at r1 (raw file):
} // TODO(giorgosp): Why msLeft.ContainsEstimates is 193 and msRight.ContainsEstimates is 0??
Each writeRandomTimeSeriesDataToRange
write adds an estimate. You split at the midpoint, which recomputes the stats on the left and right, so you should end up with no estimates on either side. I've commented on the code in the split trigger, it's not correct and this test catches the problem.
pkg/storage/replica_consistency.go, line 186 at r1 (raw file):
} if delta.ContainsEstimates == 0 && testingFatalOnStatsMismatch {
Make this <= 0
(and mention in the comment that it's never supposed to go negative, but who knows)
pkg/storage/replica_raft.go, line 2284 at r1 (raw file):
// If we were running the new VersionContainsEstimatesCounter cluster version, // the consistency checker will be able to reset the stats itself. if !r.ClusterSettings().Version.IsActive(cluster.VersionContainsEstimatesCounter) {
It's usually not possible to use IsActive
safely below Raft (i.e. the code that applies Raft commands to the state machine, i.e. here). The reason is that the cluster version is communicated asynchronously, so nodes will receive it at different times, and will just randomly start using it at whatever log position the update comes in. This can lead to one node running this code and another not, for the same command, thus updating the in-memory state in different ways. I think you want to leave this code in as it was before but change the split code so that it generates a ms
that has ContainsEstimates
set to a negative value that cancels that of the left-hand side (but only if the version is active). The result will be that when the new version is active, the addition results in zero but it's going to be overwritten by zero once more. If the version isn't active yet, we overwrite with zero and that's ok too.
pkg/storage/batcheval/cmd_add_sstable.go, line 99 at r1 (raw file):
// remove the ContainsEstimates flag) after they are done ingesting files by // sending an explicit recompute. // TODO(giorgosp): Should I check if it's 1 already or it always comes as 0?
In practice it will be 0 but you can't know that. In theory the AddSSTableRequest could be in a batch with other requests that cause estimates. Just write stats.ContainsEstimates++
(here and everywhere else). Note the need for a version gate as mentioned in the main commit msg.
pkg/storage/batcheval/cmd_end_transaction.go, line 1049 at r1 (raw file):
recStats := rec.GetMVCCStats() leftDeltaMS.Subtract(recStats) // subtract pre-split absolute stats leftDeltaMS.ContainsEstimates = 0 // if there were any, recomputation removed them
If the left hand side of the split has, say, ContainsEstimates = 10
, then leftDeltaMS.Subtract(recStats)
should result in leftDeltaMS.ContainsEstimates = -10
which downstream of Raft would give a sum of zero, as desired. (You need to check the cluster version before you emit a non-1 delta). You shouldn't need to touch ContainsEstimates
manually here (except for migration stuff).
But note that the resulting leftDeltaMS
will mess up rightDeltaMS
below if we're not careful. So what you want to do is rightDeltaMS.ContainsEstimates = 0
after subtracting from rightDeltaMS
.
pkg/storage/engine/enginepb/mvcc.go, line 146 at r1 (raw file):
// If either stats object contains estimates, their difference does too. // TODO(giorgosp): This is problematic, i.e. 1 - 1 = 0, which means the stats // are exact. Should we add them or just keep one of them?
See the comment about the migration on the main thread.
pkg/storage/engine/enginepb/mvcc.proto, line 113 at r1 (raw file):
// where complete accuracy is required, and instead should be recomputed // when necessary. optional int64 contains_estimates = 14 [(gogoproto.nullable) = false];
This needs a big comment about the migration when we've figured it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @giorgosp)
pkg/storage/engine/enginepb/mvcc3.proto, line 111 at r1 (raw file):
option (gogoproto.equal) = true; sint64 contains_estimates = 14;
I emailed you this already, but for posterity: this can be int64
(it's basically always zero, and if it's not, it's basically always positive, and only negative in very rare cases such as splits and stats adjustments).
Using int64
has the big advantage that all of this "mapping 1 to -1" problem goes away, i.e. ContainsEstimates == 1
should correspond to ContainsEstimates == true
without problems.
8def60c
to
9cb57ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the thorough reply Tobias! I have made some small changes
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @giorgosp and @tbg)
pkg/storage/client_split_test.go, line 934 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Each
writeRandomTimeSeriesDataToRange
write adds an estimate. You split at the midpoint, which recomputes the stats on the left and right, so you should end up with no estimates on either side. I've commented on the code in the split trigger, it's not correct and this test catches the problem.
Thanks!
pkg/storage/replica_consistency.go, line 186 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Make this
<= 0
(and mention in the comment that it's never supposed to go negative, but who knows)
Done.
pkg/storage/replica_raft.go, line 2284 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It's usually not possible to use
IsActive
safely below Raft (i.e. the code that applies Raft commands to the state machine, i.e. here). The reason is that the cluster version is communicated asynchronously, so nodes will receive it at different times, and will just randomly start using it at whatever log position the update comes in. This can lead to one node running this code and another not, for the same command, thus updating the in-memory state in different ways. I think you want to leave this code in as it was before but change the split code so that it generates ams
that hasContainsEstimates
set to a negative value that cancels that of the left-hand side (but only if the version is active). The result will be that when the new version is active, the addition results in zero but it's going to be overwritten by zero once more. If the version isn't active yet, we overwrite with zero and that's ok too.
I see, thanks!
pkg/storage/batcheval/cmd_add_sstable.go, line 99 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
In practice it will be 0 but you can't know that. In theory the AddSSTableRequest could be in a batch with other requests that cause estimates. Just write
stats.ContainsEstimates++
(here and everywhere else). Note the need for a version gate as mentioned in the main commit msg.
Is this what you meant?
pkg/storage/engine/enginepb/mvcc.go, line 146 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
See the comment about the migration on the main thread.
Done.
pkg/storage/engine/enginepb/mvcc3.proto, line 111 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I emailed you this already, but for posterity: this can be
int64
(it's basically always zero, and if it's not, it's basically always positive, and only negative in very rare cases such as splits and stats adjustments).Using
int64
has the big advantage that all of this "mapping 1 to -1" problem goes away, i.e.ContainsEstimates == 1
should correspond toContainsEstimates == true
without problems.
Done.
5d0691b
to
6606331
Compare
1f07f87
to
7de5cd2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tbg, made some small changes to have a meaninful base to proceed.
- ContainsEstimates is int64 instead of sint64 as you suggested to not add unnecessary complexity.
- AddSSTable command sets it to 1 if it is running the old version. Otherwise it adds 2 to it. Is this what you meant Should we add some specific test?
TestRaftBelowProtos
fails, my understanding is that it checks the size of the stats for each case? Can I just replace the expected number with the actual?- Should I change
MVCCMerge
in order to ContainsEstimates = 2 when the new version is active instead of= 1
thatupdateStatsOnMerge
does ? - Also cleaned up the change from
RecomputeStats
, it was meaningless. I have to implement returning-ContainsEstimates
and see what breaks but after we emitContainsEstimates
correctly.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @giorgosp and @tbg)
pkg/storage/batcheval/cmd_end_transaction.go, line 1049 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
If the left hand side of the split has, say,
ContainsEstimates = 10
, thenleftDeltaMS.Subtract(recStats)
should result inleftDeltaMS.ContainsEstimates = -10
which downstream of Raft would give a sum of zero, as desired. (You need to check the cluster version before you emit a non-1 delta). You shouldn't need to touchContainsEstimates
manually here (except for migration stuff).But note that the resulting
leftDeltaMS
will mess uprightDeltaMS
below if we're not careful. So what you want to do isrightDeltaMS.ContainsEstimates = 0
after subtracting fromrightDeltaMS
.
I don't really understand what is going on here. I have to read up on this. However, seems like rightDeltaMS.ContainsEstimates
could be true if bothDeltaMS.ContainsEstimates
comes as true. So maybe we shouldn't always return rightDeltaMS.ContainsEstimates
as 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @giorgosp and @tbg)
pkg/cli/debug.go, line 627 at r5 (raw file):
failed = true } else { ms.ContainsEstimates = 1
why set this to one? Looks like you'd just leave it alone.
pkg/storage/replica_raft.go, line 2285 at r5 (raw file):
Add a TODO here that we can remove this at some point in the future
// TODO(tbg): splits reset ContainsEstimates as of 19.2, so this can be removed at some point in the future.
pkg/storage/batcheval/cmd_add_sstable.go, line 102 at r5 (raw file):
// // If we are running the old ContainsEstimates version (boolean), we just set it to 1. // Otherwise we make it bigger than one to indicate that we are running the newer version.
You're going to have this in a few places, so I'd say add an explanation of these tricky semantics on cluster.VersionContainsEstimatesCounter
and then in every other place like here, add something like
_ = cluster.VersionContainsEstimatesCounter // see for info on ContainsEstimates migration
This allows navigating quickly to the explanation without duplicating the explanation in various places.
I think we should also explore centralizing this behavior. Let's say we want to avoid forcing each command that introduces an estimate to do this dance. We should be able to put code into the replica proposal path that does this check only once and massages the output to 1
if the cluster version is not active yet. I think that might be preferable.
pkg/storage/batcheval/cmd_end_transaction.go, line 1049 at r1 (raw file):
Previously, giorgosp (George Papadrosou) wrote…
I don't really understand what is going on here. I have to read up on this. However, seems like
rightDeltaMS.ContainsEstimates
could be true ifbothDeltaMS.ContainsEstimates
comes as true. So maybe we shouldn't always returnrightDeltaMS.ContainsEstimates
as 0?
The high level overview is this: we want to split the replica and so we need to split the stats. We want to recompute as little as possible because for performance reasons. If the range has no estimates, things are "easy" because we can just recompute one half and use math for the other half. We're also writing new data into the ranges in this method, so we have to track their contributions. If the initial range had estimates, we need to recompute both initially.
a few hours later... just a heads up that I have a review open on this but I got my brain melted a bit by these split computations -- I'm going to send a PR in the next couple of days that should clarify these computations a bunch. I'll ping you on the PR. Will review the rest later but generally looks good, I left some comments above this line.
Here's the PR: #38084 |
Pulling out the stats-related computations into a helper taylored exactly to what happens during a split provides a convenient venue for lengthy rumination on the finer points of said stats computations, while decluttering the split trigger's main method. This was motivated by cockroachdb#37583, which needs to adjust some of this code. Now that it's (hopefully) more accessible, it should be more straight- forward to figure out what exactly to do. Release note: None
7de5cd2
to
51728de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I ll rebase after it gets merged
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @giorgosp and @tbg)
pkg/cli/debug.go, line 627 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
why set this to one? Looks like you'd just leave it alone.
It was true on master and I accidentally removed it in a commit. https://github.com/cockroachdb/cockroach/blob/master/pkg/cli/debug.go#L624
pkg/storage/replica_raft.go, line 2285 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Add a TODO here that we can remove this at some point in the future
// TODO(tbg): splits reset ContainsEstimates as of 19.2, so this can be removed at some point in the future.
Done.
Pulling out the stats-related computations into a helper taylored exactly to what happens during a split provides a convenient venue for lengthy rumination on the finer points of said stats computations, while decluttering the split trigger's main method. This was motivated by cockroachdb#37583, which needs to adjust some of this code. Now that it's (hopefully) more accessible, it should be more straight- forward to figure out what exactly to do. Release note: None
9c71899
to
1dcd2a4
Compare
8b8be0c
to
d393a68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tbg, @nvanbenschoten, I have rebased this PR and added some tests for the ContainsEstimates handling during command proposal and application.
I have also added a change to batcheval.RecomputeStats
but I am not sure if just checking for the cluster version is enough. Should we also check the loaded mvcc stats and if ContainsEstimates=1, keep it 1 in the returned delta?
Actually, I am not sure if the RecomputeStats
command goes through Raft so that ContainsEstimates is handled as in the other commands and we won't have to do something special.
I also see that commits need a "Release justification" message, not sure what to write in this (and if this PR will make it to 19.2 anyway).
Thanks!
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @tbg)
pkg/storage/replica_proposal.go, line 728 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// Encode that this command (and any that follow) uses regular arithmetic for ContainsEstimates
// by making sure ContainsEstimates is > 1.
// This will be interpreted during command application.
Done.
pkg/storage/replica_proposal.go, line 732 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// This range may still need to have its commands processed by nodes which treat ContainsEstimates
// as a bool, so clamp it to {0,1}. This enables use of bool semantics in command application.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @giorgosp! You're right that this PR will not make 19.2; we're in the release stabilization period and won't merge anything that can wait until the 20.1 cycle (ie. in a few weeks or so). For the same reason we also won't be able to give this PR more attention right now; there are some fires that need to be put out before we release. That's unfortunate (I'd love for you to be able to get this off your plate) but realistically it won't happen. I'll come back to this PR when the new cycle starts, and will get it in. If you're available to help out then great, otherwise I'm just going to amend your PR - pretty sure not much is left to be done, but I'd need to re-review before the merge.
Reviewed 21 of 21 files at r14.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @tbg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem! Looking forward to contribute to the next cycle
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @tbg)
1af0fe7
to
c2d2fa5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tbg ping, rebased this PR :)
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @tbg)
pkg/storage/batcheval/cmd_recompute_stats.go, line 103 at r15 (raw file):
// means some extra engine churn. cArgs.Stats.Add(delta) if !cArgs.EvalCtx.ClusterSettings().Version.IsActive(cluster.VersionContainsEstimatesCounter) {
Does this change make sense? I could also add a unit test for it.
7d01b41
to
c0da358
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I made a few changes (to your fork), please make sure you don't accidentally push over them; git fetch --all && git reset --hard @{upstream}
should get you to the right place:
- grab a missing lock in the test (CI was red because of this)
- don't truncate ContainsEstimates to {0,1} when it's negative. Instead, when it's negative and the version isn't active, refuse the command (as a safety check: the only caller emitting negative estimates is RecomputeStats, and that does the check already)
- update an existing test to catch the problem fixed in 2 (the problem was that the recomputation would be applied with the factor of 2x).
- rebase (with some fallout, but nothing serious)
- make sure the split recomputation actually works, by making sure the legacy
if split != nil
reset path is only taken when there isn't a delta in the incoming split (and making the split trigger emit a wrong delta manually once)
I'll give this another pass tomorrow, but it's ready to merge and I'll do so when I've looked again.
Reviewed 12 of 21 files at r15, 10 of 10 files at r16.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @giorgosp)
pkg/storage/batcheval/cmd_recompute_stats.go, line 103 at r15 (raw file):
Previously, giorgosp (George Papadrosou) wrote…
Does this change make sense? I could also add a unit test for it.
This preserves the old behavior, so 👍
64038e2
to
b11f067
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok @tbg thanks for jumping in! I went through the changes. I won't manage to have a unit test for RecomputeStats
ready in the next couple of days, but I think the negative check you added for ContainsEstimates
is a safety net, right?
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @tbg)
b11f067
to
9e63ec8
Compare
This migration makes ContainsEstimates a counter so that the consistency checker can reset it (by returning a -ContainsEstimates) delta) without racing with another command that introduces new estimate stats. Resolves cockroachdb#37120 Release note: None
9e63ec8
to
35bdbf0
Compare
bors r=tbg |
Build failed |
uitest failed during lint. Is this a known flake?
|
bors r=tbg |
37583: storage: Migrate MVCCStats.contains_estimates from bool to int64 r=tbg a=giorgosp This migration allows ContainsEstimates to be reset after a stats recomputation (by returning a -ContainsEstimates delta) without worrying about a race condition. Another command may add 1 to it flag and it will still be stored as valid. Resolves #37120 Release note: None 42129: colexec: fix AND and OR projections in some cases r=yuzefovich a=yuzefovich Previously, the original batch length was not respected when the selection vector is present. This resulted in, for example, query 19 of TPCH benchmark to return an error. This is now fixed. I have troubles coming up with a reduced reproduction though. I also checked that on release-19.2 branch the query is executed correctly with vectorized, so it must be the switch to flat bytes that triggers the problem. Release note: None 42172: colexec: fix sorted distinct with nulls behavior r=yuzefovich a=yuzefovich Previously, sorted distinct when the nulls might be present would always get the value at the index without paying attention whether that value is actually now. This is incorrect behavior because it is undefined in some cases (like when getting from flat bytes). Now this is fixed. Fixes: #42055. Release note: None Co-authored-by: George Papadrosou <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
Build succeeded |
This migration allows ContainsEstimates to be reset after a stats
recomputation (by returning a -ContainsEstimates delta) without
worrying about a race condition. Another command may add 1 to it
flag and it will still be stored as valid.
Resolves #37120
Release note: None