roachtest: replicagc-changed-peers/restart=false failed #67910
Comments
@lunevalex this failure had #67319. We're not seeing liveness fail here. The problem is really here:
Note that these are all "small" rangeIDs; they all belong to system tables, and r39 is also the last such range. The system tables are all supposed to be 5x-replicated. The logs show these messages over the course of minutes:
These ranges are generally on n3, n4, and n5 (recall that n1 and n2 are down at this point in the test). We check the replica count for five minutes, which happens to match the time until the store is declared dead. I'm not sure how this ever worked, as it can only work if the adaptive replication factor becomes three (instead of the five mandated by the zone), but investigating the code here

cockroach/pkg/kv/kvserver/allocator.go Lines 526 to 529 in 80bc7f6

shows that in this situation (two dead but not decommission{ed,ing} nodes, four live nodes) we'll ask for the full five replicas, and "always" have.

Earlier in the test, we have the system ranges on n1-n3 (the adaptive replication factor is three, since we haven't started n4-n6 yet); we then stop n3, initiate decommissioning on n1-n3, and start n4-n6. This keeps the adaptive replication factor at three, since decommissioning nodes are excluded, so in principle all replicas would migrate to n4-n6. However, the test never waits for n3 to lose all replicas. It only does so for n1 and n2, and stops both when done (it doesn't check n3 because it uses node metrics to find out, and n3 is intentionally down; we could check the meta ranges instead). So when the test proceeds, it may end up with some system ranges on n3 plus two of n4-n6.

We then recommission n1-n3 (but don't restart n1 and n2) and restart n3. n1 and n2 are now down but not decommissioning, so the adaptive replication factor goes back to five, and the replicas that are on n3 and n4-n6 can't move, as there aren't five live nodes in the system.

Long story short, this failure makes sense, but why are we seeing it now, and likely most of the time? (The restart=true flavor, for which the same analysis holds, just failed as well: #67914.)
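To make the clamping concrete, here is a minimal sketch of the adaptive replication factor computation described above. The function name, the exact odd-number rounding, and the example inputs are illustrative assumptions, not the actual allocator.go code referenced above:

```go
package main

import "fmt"

// adaptiveReplicas sketches how the allocator clamps the zone-configured
// replication factor to the number of "cluster nodes", a count that excludes
// decommissioning/decommissioned nodes but still includes nodes that are
// merely down. This is an illustrative approximation, not the real code.
func adaptiveReplicas(zoneReplicas, clusterNodes int) int {
	need := zoneReplicas
	if clusterNodes < need {
		need = clusterNodes
	}
	if need == zoneReplicas {
		return need
	}
	// Avoid even replication factors: they tolerate no more failures than
	// the next-lower odd factor.
	if need%2 == 0 {
		need--
	}
	if need < 3 {
		need = 3
	}
	if need > zoneReplicas {
		need = zoneReplicas
	}
	return need
}

func main() {
	// Earlier phase: n1-n3 decommissioning (excluded), n4-n6 counted.
	fmt.Println(adaptiveReplicas(5, 3)) // 3: system ranges fit on n4-n6
	// Final phase: n1-n3 recommissioned, so all six nodes are counted even
	// though n1 and n2 are down.
	fmt.Println(adaptiveReplicas(5, 6)) // 5: but only four nodes are live
}
```

With only four live nodes and a target of five, any system range that still has a replica on n3 has nowhere to move it, which is exactly the stuck state described above.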
The test ends up in the following situation:

n1: down, no replicas
n2: down, no replicas
n3: alive, with a constraint that wants all replicas to move off it, and there may be a few ranges still on n3
n4-n6: alive, where the ranges are predominantly 3x-replicated

The test then verifies that the replica count on n3 (as in, replicas present on n3, in contrast to replicas assigned via the meta ranges) drops to zero. However, system ranges cannot move in this configuration. The number of cluster nodes is six (decommission{ing,ed} nodes would be excluded, but no nodes are decommission{ing,ed} here), so the system ranges operate at a replication factor of five. There are only four live nodes, so if n3 is still a member of any system ranges, they will stay there and the test fails.

This commit attempts to rectify that by making sure that, while n3 is down earlier in the test, all replicas are moved off it. That was always the intent of the test, which is concerned with n3 realizing that replicas have moved elsewhere and initiating replicaGC; however, prior to this commit it was left to chance whether n3 would or would not have replicas assigned to it by the time the test moved to the stage above. The reason the test wasn't previously waiting for all replicas to be moved off n3 while it was down is that this requires checking the meta ranges, which wasn't necessary for the other two nodes.

This commit passed all five runs of replicagc-changed-peers/restart=false, so I think it reliably addresses the problem. There is still the lingering question of why this is failing only now (note that both flavors of the test failed on master last night, so I doubt it is rare). We just merged cockroachdb#67319, which is likely somehow related.

Fixes cockroachdb#67910.
Fixes cockroachdb#67914.

Release note: None
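As a rough illustration of the kind of check the fix requires (verifying from a live node, while n3 is down, that no range descriptor still lists a replica on n3), here is a minimal sketch. It assumes the crdb_internal.ranges_no_leases virtual table and its replicas column of store IDs (each node in this test has a single store, so store ID 3 corresponds to n3); the helper name, connection string, and query are illustrative, not the actual roachtest code:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

// countRangesWithReplicaOnStore returns how many ranges still list the given
// store ID among their replicas, according to the range metadata surfaced by
// crdb_internal.ranges_no_leases. Run this against any live node (e.g. n4),
// since n3 itself is down while we wait for it to be emptied.
func countRangesWithReplicaOnStore(db *sql.DB, storeID int) (int, error) {
	var n int
	err := db.QueryRow(
		`SELECT count(*) FROM crdb_internal.ranges_no_leases WHERE $1::INT = ANY(replicas)`,
		storeID,
	).Scan(&n)
	return n, err
}

func main() {
	// Placeholder connection string; point it at one of the live nodes.
	db, err := sql.Open("postgres", "postgresql://root@n4:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The test would poll this in a retry loop until it reaches zero.
	n, err := countRangesWithReplicaOnStore(db, 3)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("ranges still referencing s3: %d\n", n)
}
```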
Feels like this test must've changed behavior as a result of #67319. I'm not exactly sure how.
Fails 5/5 on c3049f4. Passes 5/5 when I revert ab15a0a (#67319), so it's pretty conclusive that that PR changed something. @lunevalex any idea what exactly? I assume that in the earlier phase of the test (n1-n3 decommissioning, n3 down, n4-n6 just started) we were somehow ensuring that no system range ends up on n3 by the time n1 and n2 have shed all their replicas. Maybe we were previously giving priority to moving replicas off n3 (which is down) rather than off n1 and n2, so checking that n1/n2 were empty implied that n3 was empty?
@tbg very interesting. I wonder if #67714 will fix this, because @aayushshah15 saw something very similar at a customer.
roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ 5e46fd88b11007ddaf0b5350ed28d11b0c3bfdaf:
Reproduce
To reproduce, try:

## Simple repro (linux-only):
$ make cockroachshort bin/workload bin/roachprod bin/roachtest
$ PATH=$PWD/bin:$PATH roachtest run replicagc-changed-peers/restart=false --local

## Proper repro probably needs more roachtest flags, or running
## the programs remotely on GCE. For more details, refer to
## pkg/cmd/roachtest/README.md.
roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ b02d22f9b3d30a0288ad1d8464dd6f2d82c08f0d:
roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ 9baaa282b3a09977b96bd3e5ae6e2346adfa2c16:
roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ f7528c59e296ed9acd2a20d590f2a42bbad0dcd0:
67526: roachtest: make timeout obvious in posted issues r=stevendanna a=tbg

When a test times out, roachtest will rip the cluster out from under it to try to force it to terminate. This is essentially guaranteed to produce a posted issue that sweeps the original reason for the failure (the timeout) under the rug. Instead, such issues now plainly state that there was a timeout and refer readers to the artifacts. See here for an example issue without this fix: #67464

cc @dt, who pointed this out [internally]

[internally]: https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1626098863019500

Release note: None

67824: dev: teach `dev` how to do cross builds r=rail a=rickystewart

Closes #67709.

Release note: None

67825: changefeedccl: immediately stop sending webhook sink rows upon error r=spiffyyeng a=spiffyyeng

Previously, the sink waited until flushing to acknowledge HTTP errors, leaving any messages between the initial error and the flush potentially out of order. Now, errors are checked before each message is sent, and the sink is restarted if one is detected, to maintain ordering. Resolves #67772

Release note: None

67894: sql: add support for unique expression indexes r=mgartner a=mgartner

Release note: None

67916: roachtest: fix replicagc-changed-peers r=aliher1911 a=tbg

The test ends up in the following situation:

n1: down, no replicas
n2: down, no replicas
n3: alive, with a constraint that wants all replicas to move off it, and there may be a few ranges still on n3
n4-n6: alive, where the ranges are predominantly 3x-replicated

The test then verifies that the replica count on n3 (as in, replicas present on n3, in contrast to replicas assigned via the meta ranges) drops to zero. However, system ranges cannot move in this configuration. The number of cluster nodes is six (decommission{ing,ed} nodes would be excluded, but no nodes are decommission{ing,ed} here), so the system ranges operate at a replication factor of five. There are only four live nodes, so if n3 is still a member of any system ranges, they will stay there and the test fails.

This commit attempts to rectify that by making sure that, while n3 is down earlier in the test, all replicas are moved off it. That was always the intent of the test, which is concerned with n3 realizing that replicas have moved elsewhere and initiating replicaGC; however, prior to this commit it was left to chance whether n3 would or would not have replicas assigned to it by the time the test moved to the stage above. The reason the test wasn't previously waiting for all replicas to be moved off n3 while it was down is that this requires checking the meta ranges, which wasn't necessary for the other two nodes.

This commit passed all five runs of replicagc-changed-peers/restart=false, so I think it reliably addresses the problem. There is still the lingering question of why this is failing only now (note that both flavors of the test failed on master last night, so I doubt it is rare). We just merged #67319, which is likely somehow related.

Fixes #67910.
Fixes #67914.

Release note: None

67961: bazel: use `action_config`s over `tool_path`s in cross toolchains r=rail a=rickystewart

This doesn't change much in practice, but does allow us to use the actual `g++` compiler for C++ compilation, which wasn't the case before.

The `tool_path` constructor is actually [deprecated](https://github.com/bazelbuild/bazel/blob/203aa773d7109a0bcd9777ba6270bd4fd0edb69f/tools/cpp/cc_toolchain_config_lib.bzl#L419) in favor of `action_config`s, so this is future-proofing.

Release note: None

67962: bazel: start building geos in ci r=rail a=rickystewart

Only the most recent commit applies for this review -- the other is from #67961. Closes #66388.

Release note: None

68065: cli: skip TestRemoveDeadReplicas r=irfansharif a=tbg

Refs: #50977
Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Ryan Min <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ f0e2aa6abbbbf3318ea20e7dbcbe40819a809b83:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh replicagc-changed-peers/restart=false