storage: fix and test a bogus source of replica divergence errors #37668
There are "known unknown" problems regarding shutting down gRPC or the stopper, which can leave the process dangling while its sockets are already closed. The UX is poor, and the root cause is likely very annoying to find, fix, and guard against regressing. Luckily, all we want to achieve is that the process is dead soon after the client disconnects, and we can do that.
If I were to rewrite this code, I would probably not even bother with stopping the stopper or gRPC, but just call `os.Exit` straight away. I'm not doing this right now to minimize fallout since this change will be backported to release-19.1.
Release note (bug fix): Fixed a case in which `./cockroach quit` would return success even though the server process was still running in a severely degraded state.
In cockroachdb#35861, I made changes to the consistency checksum computation that were not backwards-compatible. When a 19.1 node asks a 2.1 node for a fast SHA, the 2.1 node would run a full computation and return a corresponding SHA which wouldn't match the leaseholder's.
Bump ReplicaChecksumVersion to make sure that we don't attempt to compare SHAs across these two releases.
Fixes cockroachdb#37425.
Release note (bug fix): Fixed a potential source of (faux) replica inconsistencies that can be reported while running a mixed v19.1 / v2.1 cluster. This error (in that situation only) is benign and can be resolved by upgrading to the latest v19.1 patch release. Every time this error occurs a "checkpoint" is created which will occupy a large amount of disk space and which needs to be removed manually (see <store directory>/auxiliary/checkpoints).
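To illustrate why bumping the version avoids the bogus divergence reports, here is a small sketch of version-gated comparison. It is an assumption-laden model, not the actual storage code: `checksumResult` and `diverged` are hypothetical, and only the idea of tagging each SHA with the version of the algorithm that produced it is taken from the commit message.

```go
package main

import "bytes"

// checksumResult is a hypothetical stand-in for what a replica reports
// back after a consistency check.
type checksumResult struct {
	Version uint32 // version of the algorithm that produced SHA
	SHA     []byte
}

// diverged compares the leaseholder's result against a follower's.
// ok is false when the two results were computed under different
// checksum versions; in that case no conclusion can be drawn and
// mismatch is always false.
func diverged(leaseholder, follower checksumResult) (mismatch, ok bool) {
	if leaseholder.Version != follower.Version {
		return false, false
	}
	return !bytes.Equal(leaseholder.SHA, follower.SHA), true
}
```

With a gate like this, a SHA produced by an older algorithm is skipped rather than reported as an inconsistency.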
The mixed version test was always verifying the first node by accident.
Release note: None
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 1 of 1 files at r5.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/cmd/roachtest/cluster.go, line 1058 at r4 (raw file):
// TODO(tbg): the checks can fail for silly reasons like missing gossiped
// descriptors, etc. -- not worth failing the test for. Ideally this would
// be rock solid.
Is it still worth logging?
pkg/cmd/roachtest/cluster.go, line 1090 at r4 (raw file):
}
var db *gosql.DB
Comment that you're trying to find a live node and that this isn't actually the consistency check.
This regression tests cockroachdb#37425, which exposed an incompatibility between v19.1 and v2.1. `./bin/roachtest run --local version/mixed/nodes=3` ran successfully after these changes.
I took the opportunity to address a TODO in FailOnReplicaDivergence.
Release note: None
Release note: None
Comments addressed, TFTR!
bors r=nvanbenschoten
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
37668: storage: fix and test a bogus source of replica divergence errors r=nvanbenschoten a=tbg
An incompatibility in the consistency checks was introduced between v2.1 and v19.1.
See individual commit messages and #37425 for details.
Release note (bug fix): Fixed a potential source of (faux) replica inconsistencies that can be reported while running a mixed v19.1 / v2.1 cluster. This error (in that situation only) is benign and can be resolved by upgrading to the latest v19.1 patch release. Every time this error occurs a "checkpoint" is created which will occupy a large amount of disk space and which needs to be removed manually (see <store directory>/auxiliary/checkpoints).
Release note (bug fix): Fixed a case in which `./cockroach quit` would return success even though the server process was still running in a severely degraded state.
37701: workloadcccl: fix two regressions in fixtures make/load r=nvanbenschoten a=danhhz
The SQL database for all the tables in the BACKUPs created by `fixtures make` used to be "csv" (an artifact of the way we made them), but as of #37343 it's the name of the generator. This seems better so change `fixtures load` to match.
The same PR also (accidentally) started adding foreign keys in the BACKUPs, but since there's one table per BACKUP (another artifact of the way we used to make fixtures), we can't restore the foreign keys. It'd be nice to switch to one BACKUP with all tables and get the foreign keys, but the UX of the postLoad hook becomes tricky and I don't have time right now to sort it all out. So, revert to the previous behavior (no fks in fixtures) for now.
Release note: None
Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Daniel Harrison <[email protected]>
Build succeeded
An incompatibility in the consistency checks was introduced between v2.1 and v19.1.
See individual commit messages and #37425 for details.
Release note (bug fix): Fixed a potential source of (faux) replica
inconsistencies that can be reported while running a mixed v19.1 / v2.1
cluster. This error (in that situation only) is benign and can be
resolved by upgrading to the latest v19.1 patch release. Every time this
error occurs a "checkpoint" is created which will occupy a large amount
of disk space and which needs to be removed manually (see <store directory>/auxiliary/checkpoints).
Release note (bug fix): Fixed a case in which `./cockroach quit` would
return success even though the server process was still running in a
severely degraded state.