Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: loqrecovery/workload=movr/rangeSize=2mb failed #91016

Closed
cockroach-teamcity opened this issue Oct 31, 2022 · 6 comments · Fixed by #91185
Closed

roachtest: loqrecovery/workload=movr/rangeSize=2mb failed #91016

cockroach-teamcity opened this issue Oct 31, 2022 · 6 comments · Fixed by #91185
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Oct 31, 2022

roachtest.loqrecovery/workload=movr/rangeSize=2mb failed with artifacts on release-22.2.0 @ 207058d86e0b30ee27866cdf7df923791e1adf55:

test artifacts and logs in: /artifacts/loqrecovery/workload=movr/rangeSize=2mb/run_1
	test_runner.go:1061,test_runner.go:960: test timed out (0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-21074

@cockroach-teamcity cockroach-teamcity added branch-release-22.2.0 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Oct 31, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Oct 31, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Oct 31, 2022
@erikgrinaker erikgrinaker added T-kv-replication and removed T-kv KV Team labels Nov 1, 2022
@blathers-crl
Copy link

blathers-crl bot commented Nov 1, 2022

cc @cockroachdb/replication

@aliher1911
Copy link
Contributor

Digging into this there are 2 problems.

  1. LOQ tries to recover unrecoverable and I think it should be fixed.
    Here
    _, ok := change.Desc.GetReplicaDescriptor(rankedDescriptors.storeID())

    we only report error if preferred replica is being removed.
    What happens though is, that any update will result in descriptor trying to revert local replica id that we bumped to a higher number to suppress any conflicts with other replcias. That causes node to terminate. I think we need to highlight that as overridable plan error instead.
  2. Test should not hang if one node paniced and should report an error.

@aliher1911
Copy link
Contributor

@tbg what do you think, should we disallow plan creation if we see any descriptor updates? If it would be reported as error and require 'force' to create a plan, then one can just go and remove replica from plan if they know it would cause a panic.

@tbg
Copy link
Member

tbg commented Nov 1, 2022

Could you type out a few more details? i.e. we have stores X Y Z and then the follwing update is made and after the update Y will panic with which error etc. Thanks!

@aliher1911
Copy link
Contributor

aliher1911 commented Nov 2, 2022

We have stores 1,2,3,4 with replicas on 1,2,3. There's a pending descriptor change to add a learner to node 4. Nodes 2 and 3 are killed. Recovery proceeds and picks 1 despite it having an unapplied descriptor change. Update removes all other replicas from descriptor and bumps replica ID from 4 to 15. Node restarts and tries to apply committed log with descriptor change which tries to change replica ID back to its previous value. At this point it panics:

I221031 08:35:30.595993 171 1@kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 49  the server is terminating due to a fatal error (see the DEV channel for details)
W221031 08:35:30.603964 222 kv/txn.go:705 ⋮ [intExec=‹protectedts-GetMetadata›] 51  failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] can't determine lease status of (n1,s1):1 due to node liveness error: liveness record not found in cache; r31: replica (n4,s4):3 not lease holder; lease holder unknown›
W221031 08:35:30.604248 221 kv/txn.go:705 ⋮ [-] 52  failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: failed to fetch protectedts metadata: failed to read metadata: ‹protectedts-GetMetadata›: context canceled
I221031 08:35:30.604665 249 kv/kvserver/liveness/liveness.go:767 ⋮ [n5,liveness-hb] 53  unable to get liveness record from KV: unable to get liveness: aborted during DistSender.Send: context canceled
E221031 08:35:30.604644 416 kv/kvserver/queue.go:1127 ⋮ [n5,replicaGC,s5,r38/15:‹/Table/3{7-8}›] 54  ‹failed to send RPC: sending to all replicas failed; last error: rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.1.50:48606->10.142.0.155:26257: use of closed network connection›
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50  attempted to change replica's ID from 15 to 4
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !goroutine 171 [running]:
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x1)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0x89
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc000ce71c0, {{{0xc0004b4e10, 0x24}, {0x527d379, 0x1}, {0x0, 0x0}, {0x0, 0x0}}, 0x17231a6a39d10e7d, ...})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/log/clog.go:262 +0x97
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal({0x6081c08, 0xc0037048d0}, 0x2, 0x4, 0x0, 0x0?, {0x52d1051, 0x2e}, {0xc003633730, 0x2, ...})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/log/channels.go:106 +0x645
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(...)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/log/channels.go:39
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/util/log/log_channels_generated.go:848
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked(0xc002ebec00, {0x6081c08, 0xc0037048d0}, 0xc0035af730)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:329 +0x410
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).setDescRaftMuLocked(0xc002ebec00, {0x6081c08, 0xc0037048d0}, 0x0?)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:297 +0x97
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleDescResult(...)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go:270
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc002ebed00, {0x6081c08, 0xc0037048d0}, 0xc000d69040)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:1381 +0x59e
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects(0xc002ebed00, {0x6081c08, 0xc0037048d0}, {0x60c1ab8?, 0xc000d69008})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:1245 +0x6b9
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter({0x7f55585f9c10?, 0xc002ebf0a8}, 0xc003634948)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:206 +0x158
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc00334afd0, {0x6081c08, 0xc002ec5170}, {0x60b0e58, 0xc002ebf048})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:290 +0x1f8
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc003634fd0, {0x6081c08, 0xc002ec5170})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:246 +0x9a
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1038 +0x1c05
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:657 +0x17b
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc001a79500, 0xc000a5c7e0?)
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:641 +0xb8
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc002b3fa40, {0x6081c08, 0xc002f59d40})
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:308 +0x25d
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:489 +0x146
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
F221031 08:35:30.596073 171 kv/kvserver/replica_init.go:329 ⋮ [n5,s5,r38/15:‹/Table/3{7-8}›,raft] 50 !  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:480 +0x43b

Note the real case in logs has 4 voters and one added learner, but it doesn't make much difference for this case.

@aliher1911
Copy link
Contributor

Underlying failure is tracked in a separate issue #91271

@lunevalex lunevalex added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants