roachtest: version/mixed/nodes=5 failed #37425
Uh-oh. What's going on here? Looks like the test first fails because... unknown? Then the dead node detection finds that n4 is dead (the log suggests it was just down when the test failed). Then we run the consistency checker as part of the test harness and find... stuff on r1. Surprisingly, it prints a diff; I thought it wouldn't do that, but it's good that it did.
PS: this is on 19.1, so the predecessor should be 2.1.
The test workload should have run for 2h30m according to the formula in cockroach/pkg/cmd/roachtest/version.go, lines 51 to 59 (at 6bc4875).
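For orientation, here is a hedged sketch of what a formula of that shape might look like. The function name and the 30-minutes-per-node constant are assumptions, picked only so that the 5-node variant comes out to the 2h30m mentioned above; they are not the actual contents of version.go.

```go
package main

import (
	"fmt"
	"time"
)

// workloadDuration is a hypothetical stand-in for the formula referenced in
// pkg/cmd/roachtest/version.go: scale the workload runtime with cluster size.
func workloadDuration(nodes int) time.Duration {
	const perNode = 30 * time.Minute // illustrative constant, not from the source
	return time.Duration(nodes) * perNode
}

func main() {
	fmt.Println(workloadDuration(5)) // 2h30m0s for the 5-node variant of the test
}
```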
The workload output indicates that it ran for only ~43m. The reason is that this `cockroach stop` command failed:
It finally fails ~2m30s later:
The stop command is actually preceded by this: cockroach/pkg/cmd/roachtest/version.go, lines 128 to 140 (at 6bc4875)
Looking at the logs for n4, we see that the server did not stop cleanly; the log output continues. However, it seems that the network listener was closed, since we're seeing these gRPC errors (the node trying to connect to itself):
This also explains why the dead node detection fired afterwards: it uses ... The deadlock on ...

Ok, on to more interesting things. Looking into why there was a diff in the first place (this is a fast consistency check, which isn't supposed to even iterate over the contents of the range, so how could it produce a diff?), I found this: cockroach/pkg/sql/sem/builtins/generator_builtins.go, lines 960 to 967 (at fb8e85d)
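For context, the generator referenced above presumably backs the `crdb_internal.check_consistency` builtin, which is how a fast (stats-only) check can be requested over SQL. Below is a minimal sketch of driving it from Go; the connection URL and the selected result columns are assumptions for illustration, not taken from the test harness.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks the same protocol
)

func main() {
	// Assumed local insecure cluster; adjust the URL for a real deployment.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The first argument (stats_only) selects the fast, stats-based check that
	// is not supposed to iterate over the range contents.
	rows, err := db.Query(`SELECT range_id, status, detail
	                       FROM crdb_internal.check_consistency(true, '', '')`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var rangeID int64
		var status, detail string
		if err := rows.Scan(&rangeID, &status, &detail); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("r%d: %s %s\n", rangeID, status, detail)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```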
Oooh... I see what's going on here. n5 (the "failing" follower with all of the data) is running v2.1, which doesn't know about the stats-only mode. All it knows is that it's asked to compute a checksum and create a diff, which it happily does. But it's also returning zero stats, because it doesn't have that field in the proto at all. The leaseholder is running v19.1 and sees zero stats, and additionally its own result didn't populate a diff because it wasn't asked to. Ergo, we see exactly the failure that presents itself here: the leaseholder thinks n5 had zero persisted stats but at the same time had lots of data that "we" didn't have. I should've thought harder about mixed versions when I refactored this stuff. Luckily, there's a version I can bump to eliminate this problem:
In cockroachdb#35861, I made changes to the consistency checksum computation that were not backwards-compatible. When a 19.1 node asks a 2.1 node for a fast SHA, the 2.1 node would run a full computation and return a corresponding SHA which wouldn't match with the leaseholder's. Bump ReplicaChecksumVersion to make sure that we don't attempt to compare SHAs across these two releases. Fixes cockroachdb#37425. Release note (bug fix): Fixed a potential source of (faux) replica inconsistencies that can be reported while running a mixed v19.1 / v2.1 cluster. This error (in that situation only) is benign and can be resolved by upgrading to the latest v19.1 patch release. Every time this error occurs a "checkpoint" is created which will occupy a large amount of disk space and which needs to be removed manually (see <store directory>/auxiliary/checkpoints).
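To make the failure mode and the guard concrete, here is a rough illustration using hypothetical Go names; neither the types nor the constant value match the actual CockroachDB code.

```go
package main

import "fmt"

// checksumResponse mimics what the v19.1 leaseholder decodes. A v2.1 replica
// never sets PersistedStats (the field doesn't exist in its proto), so proto
// semantics leave it at the zero value on the receiving side.
type checksumResponse struct {
	Checksum       []byte
	PersistedStats int64 // zero when the remote binary predates the field
}

// A checksum version is exchanged alongside the computation; bumping it means
// replicas on the old binary no longer produce comparable results, so no
// bogus divergence is reported. The constant value here is illustrative only.
const replicaChecksumVersion = 3

func checksumsComparable(local, remote uint32) bool {
	return local == remote
}

func main() {
	fmt.Println(checksumsComparable(replicaChecksumVersion, 2)) // false: comparison is skipped
}
```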
This regression tests cockroachdb#37425, which exposed an incompatibility between v19.1 and v2.1. `./bin/roachtest run --local version/mixed/nodes=3` ran successfully after these changes. I took the opportunity to address a TODO in FailOnReplicaDivergence. Release note: None
37668: storage: fix and test a bogus source of replica divergence errors r=nvanbenschoten a=tbg

An incompatibility in the consistency checks was introduced between v2.1 and v19.1. See individual commit messages and #37425 for details.

Release note (bug fix): Fixed a potential source of (faux) replica inconsistencies that can be reported while running a mixed v19.1 / v2.1 cluster. This error (in that situation only) is benign and can be resolved by upgrading to the latest v19.1 patch release. Every time this error occurs a "checkpoint" is created which will occupy a large amount of disk space and which needs to be removed manually (see <store directory>/auxiliary/checkpoints).

Release note (bug fix): Fixed a case in which `./cockroach quit` would return success even though the server process was still running in a severely degraded state.

37701: workloadccl: fix two regressions in fixtures make/load r=nvanbenschoten a=danhhz

The SQL database for all the tables in the BACKUPs created by `fixtures make` used to be "csv" (an artifact of the way we made them), but as of #37343 it's the name of the generator. This seems better, so change `fixtures load` to match. The same PR also (accidentally) started adding foreign keys in the BACKUPs, but since there's one table per BACKUP (another artifact of the way we used to make fixtures), we can't restore the foreign keys. It'd be nice to switch to one BACKUP with all tables and get the foreign keys, but the UX of the postLoad hook becomes tricky and I don't have time right now to sort it all out. So, revert to the previous behavior (no fks in fixtures) for now.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Daniel Harrison <[email protected]>
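The manual cleanup step that the release notes mention (removing stale consistency-check checkpoints) could look roughly like the sketch below; the store directory path is an assumption, and only the `auxiliary/checkpoints` layout comes from the note above.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	storeDir := "/mnt/data1/cockroach" // assumption: replace with your --store directory
	checkpoints := filepath.Join(storeDir, "auxiliary", "checkpoints")
	if err := os.RemoveAll(checkpoints); err != nil {
		log.Fatalf("removing %s: %v", checkpoints, err)
	}
	log.Printf("removed %s", checkpoints)
}
```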
SHA: https://github.com/cockroachdb/cockroach/commits/1810a4eaa07b412b2d0899d25bb16a28a2746d48
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1300948&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/699f675c73f8420802f92e46f65e6dce52abc12f
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1306268&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/db98d5fb943e0a45b3878bdf042838408e9aee40
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1308281&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/61715f0f96f519d599eec6541bbee7394d63209a
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1312952&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/979b47cb3c6cd55d0d4c142bd97cb569a1813c2a
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1281674&tab=buildLog