-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: conditional puts do not respect the optional nature of value checksums #22636
Comments
We see the same symptom in #22581. |
I just ran into this with local testing as well. I'm guessing one of the liveness CPuts is causing issues. Investigating now. |
Yeah, we're going to need CPut to ignore the checksum when performing its existing value comparison in
Even if we decide not to pursue steps 3 and 4, I think step 2 is still important. The reason for this is that even before 75cbeb8, value checksums were meant to be optional, so comparing rawBytes directly like we do in cockroach/pkg/sql/sqlbase/rowfetcher.go Lines 1060 to 1070 in b0b6553
|
Haven't thought this through entirely, but why can't you reintroduce 75cbeb8 behind a 2.0 isMinVersion check? |
Yeah if we get step 2 in place by the 2.0 release then you're right that we can put it behind a 2.0 isMinVersion check. |
Adriatic is currently suffering the same fate. |
Should be fixed if you pick up #22637. |
@BramGruneir Adriatic is no longer seeing this after #22637, right? |
Yep, all good. Wanted to let it bake a while and Adriatic is looking great right now. |
…-enable See cockroachdb#15851. This change re-enables `TestVersionUpgrade`, which has been broken for at-least the past few weeks. It also verifies that the test would have caught the regression in cockroachdb#22636. In doing so, it improves `TestVersionUpgrade` by splitting the version upgrades into a series of incremental steps. Release note: None
…-enable See cockroachdb#15851. This change re-enables `TestVersionUpgrade`, which has been broken for at-least the past few weeks. It also verifies that the test would have caught the regression in cockroachdb#22636. In doing so, it improves `TestVersionUpgrade` by splitting the version upgrades into a series of incremental steps. Release note: None
heartbeat failed on epoch increment
errorsFixes cockroachdb#23984. Fixes cockroachdb#22636. In cockroachdb#22636 we noticed that conditional puts were not respecting the fact that Value checksums are meant to be optional. This is because they include checksums when performing the equality check between expected and existing values. This caused serious issues and forced us to roll back 75cbeb8. This was unfortunate, but the damage was contained. It demonstrated that we need to be very careful about changing checksums on values. Later, we found in cockroachdb#23984 that a path somewhere in IMPORT was forgetting to set checksums on Values. This was causing SQL operations at a much later time to fail for the same reason that 75cbeb8 caused issues. This bug will be fixed in a different change, but the fact that it exists shows how fragile the current approach is because despite the intention, checksums are absolutely not optional. This change addresses both of these problems by making checksums optional, as they were intended to be. CPut and InitPut no longer compare the checksums of the expected and existing values. Instead, they call into a new `EqualData` method, which ignores the checksum header when comparing byte values. This is a bit higher risk than most of the changes we've been accepting at this stage of the release cycle, but it would be really nice to get this change in to 2.0 so that we can stay on track with the timeline proposed in cockroachdb#22636 (comment) (although I'm not positive that the last step of the timeline is correct because I don't think we can rely on the "baked in" protection with such a low-level change). It will also save anyone who has used IMPORT already and hit cockroachdb#22636. This change should be safe from beneath Raft inconsistency for the same reason that 75cbeb8 was safe. All CPuts beneath Raft are still careful to set checksums, so we should not see any divergence in Raft state because of this error handling change. After this change, the only uses of `bytes.Equal(v1.RawBytes, v2.RawBytes)` are in tests and the `value.Equal` method. There may be an argument for making even `value.Equal` ignore the checksum header, although it isn't used outside of other `proto.Equal` methods. ``` Nathans-MacBook-Pro:cockroach$ grep -lRE 'bytes\.Equal\(.*\.RawBytes,*,.*\.RawBytes,*\)' pkg pkg/roachpb/data.pb.go pkg/storage/engine/mvcc_test.go ``` Release note: None
Fixes cockroachdb#23984. Fixes cockroachdb#22636. In cockroachdb#22636 we noticed that conditional puts were not respecting the fact that Value checksums are meant to be optional. This is because they include checksums when performing the equality check between expected and existing values. This caused serious issues and forced us to roll back 75cbeb8. This was unfortunate, but the damage was contained. It demonstrated that we need to be very careful about changing checksums on values. Later, we found in cockroachdb#23984 that a path somewhere in IMPORT was forgetting to set checksums on Values. This was causing SQL operations at a much later time to fail for the same reason that 75cbeb8 caused issues. This bug will be fixed in a different change, but the fact that it exists shows how fragile the current approach is because despite the intention, checksums are absolutely not optional. This change addresses both of these problems by making checksums optional, as they were intended to be. CPut and InitPut no longer compare the checksums of the expected and existing values. Instead, they call into a new `EqualData` method, which ignores the checksum header when comparing byte values. This is a bit higher risk than most of the changes we've been accepting at this stage of the release cycle, but it would be really nice to get this change in to 2.0 so that we can stay on track with the timeline proposed in cockroachdb#22636 (comment) (although I'm not positive that the last step of the timeline is correct because I don't think we can rely on the "baked in" protection with such a low-level change). It will also save anyone who has used IMPORT already and hit cockroachdb#22636. This change should be safe from beneath Raft inconsistency for the same reason that 75cbeb8 was safe. All CPuts beneath Raft are still careful to set checksums, so we should not see any divergence in Raft state because of this error handling change. After this change, the only uses of `bytes.Equal(v1.RawBytes, v2.RawBytes)` are in tests and the `value.Equal` method. There may be an argument for making even `value.Equal` ignore the checksum header, although it isn't used outside of other `proto.Equal` methods. ``` Nathans-MacBook-Pro:cockroach$ grep -lRE 'bytes\.Equal\(.*\.RawBytes,*,.*\.RawBytes,*\)' pkg pkg/roachpb/data.pb.go pkg/storage/engine/mvcc_test.go ``` Release note: None
For posterity, I'm kind of skeptical that we can follow through with step 4:
The reason is that checksums are used so prevalently, so early into node startup, and at such low levels (for instance, in liveness heartbeats). Because of this, I don't know if we can ever rely on the check that clusters are upgraded through every major release to "bake in" the knowledge that a 2.1 node is talking to at least 2.0 nodes like we do for "baked in" SQL migrations. |
We check versions very early in the RPC pipeline, so we should be confident that a 2.1 node will never connect to a pre-2.0 node. |
Upgrading a cluster across 75cbeb8 causes a never-ending stream of
heartbeat failed on epoch increment
errors, and the cluster is unable to serve requests.Steps to reproduce:
git checkout 75cbeb84848c807437ce84dcb93229d17fbb4a7e~
(note the ~)make
git checkout 75cbeb84848c807437ce84dcb93229d17fbb4a7e
make
The text was updated successfully, but these errors were encountered: