Background

Infrequent but persistent, unexplained failures of tpccbench/nodes=9/cpu=4/multi-region suggest a potential livelock in AdminScatterRequest. Each of the unexplained failures [1], [2], [3] timed out during tpcc.scatterRanges (see run_071903.867880943_n4_cockroach-workload-r.log in the attached artifacts.zip). Each was previously investigated and concluded to be an infrastructure flake. However, upon closer examination of the failure observed in [1], it appears to be a correctness issue.
Observations
From test.log, we observe that the workload timed out after 7h even though it was expected to run for only 45m:
07:19:03 cluster.go:2411: running cmd `./cockroach workload run tp...` on nodes [:4]; details in run_071903.867880943_n4_cockroach-workload-r.log
12:11:07 test_impl.go:414: test failure #1: full stack retained in failure_1.log: (test_runner.go:1161).runTest: test timed out (7h0m0s)
From run_071903.867880943_n4_cockroach-workload-r.log (see artifacts.zip), SIGQUIT results in the following stacktrace (redundant frames elided).

The CPU graph shows very light utilization during the 7h window.

The Queue Processing Failures graph shows persistent replication failures during the 7h window.

Analysis

On n1, we can see distsql for SCATTER awaiting completion for ~4.5h.

On n6, we see the following call sequence: adminScatter -> initializeRaftLearners -> execChangeReplicasTxn -> sendPartialBatch -> (*Retry).Next.
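The stacks suggest the replica change is retried inside a backoff loop for as long as the caller's context allows. Below is a minimal, hypothetical Go sketch of that shape (function and error names are illustrative, not the actual retry API); it only shows why a scatter issued without a tight deadline can sit in such a loop for hours.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errRetryable stands in for "change replicas failed"; illustrative only.
var errRetryable = errors.New("change replicas failed")

// sendPartialBatch is a stand-in for the failing RPC; every attempt fails in this sketch.
func sendPartialBatch(ctx context.Context) error {
	return errRetryable
}

// execChangeReplicas retries with exponential backoff until the context is done,
// analogous to looping on (*Retry).Next in the stacks above.
func execChangeReplicas(ctx context.Context) error {
	backoff := 10 * time.Millisecond
	for {
		if err := sendPartialBatch(ctx); err == nil {
			return nil
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	// With no (or a very long) deadline on ctx, this loop never exits on its own.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	fmt.Println(execChangeReplicas(ctx))
}
```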
Furthermore, we periodically observe the following error message:
E240309 11:55:16.864850 1020345 kv/kvserver/queue.go:1181 ⋮ [T1,Vsystem,n5,replicate,s5,r237/4:‹/Table/108/1/166{6/7/…-7/6/…}›] 8948 operation ‹"replicate queue process replica 237"› timed out after 1m0.002s (given timeout 1m0s): change replicas of r237 failed: aborted in DistSender: result is ambiguous: context deadline exceeded
followed by:
I240309 12:01:54.098442 1036106 kv/kvserver/replica_command.go:1754 ⋮ [T1,Vsystem,n5,replicate,s5,r237/4:‹/Table/108/1/166{6/7/…-7/6/…}›] 9068 could not successfully add and upreplicate LEARNER replica(s) on [n4,s4], rolling back: change replicas of r237 failed: aborted in DistSender: result is ambiguous: context deadline exceeded
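A note on the "result is ambiguous" wording, since it matters for the rollback behavior below: once a replica-change request is in flight, a context deadline tells the sender nothing about whether the change applied, so the error has to be reported as ambiguous rather than as a clean failure. A minimal sketch of that pattern, with purely illustrative names:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errAmbiguous is a hypothetical sentinel mirroring "result is ambiguous".
var errAmbiguous = errors.New("result is ambiguous")

// sendChangeReplicas stands in for the request seen in the log above.
func sendChangeReplicas(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // the request outlives the caller's deadline
		return nil
	case <-ctx.Done():
		// The request was already sent; we cannot know whether it committed.
		return fmt.Errorf("%w: %v", errAmbiguous, ctx.Err())
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := sendChangeReplicas(ctx); err != nil {
		fmt.Println("change replicas failed:", err)
	}
}
```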
This operation timeout recurs roughly every 5 minutes throughout the full 7h window.
Thus, raft application of adminScatter appears to be failing, including, surprisingly, its rollback. Furthermore, the operation seems to be kept in the replication queue indefinitely, effectively resulting in an endless attempt to apply it.
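If that reading is right, the failure mode would be a queue-level livelock rather than a single stuck operation: each processing attempt times out, the replica is requeued, and the next attempt repeats the same doomed change. A hypothetical sketch of that loop (names and the 200ms timeout are illustrative; the timeout in the logs above is 1m):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// processReplica stands in for one replicate-queue processing attempt that
// never completes within its timeout.
func processReplica(ctx context.Context) error {
	<-ctx.Done()
	return fmt.Errorf("change replicas failed: %w", ctx.Err())
}

func main() {
	const processTimeout = 200 * time.Millisecond // 1m0s in the real logs
	for attempt := 1; attempt <= 3; attempt++ {   // in the failure, this never terminates
		ctx, cancel := context.WithTimeout(context.Background(), processTimeout)
		err := processReplica(ctx)
		cancel()
		fmt.Printf("attempt %d: %v; requeueing\n", attempt, err)
	}
}
```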
NOTE: artifacts.zip can be found in gs://cockroach-testeng/artifacts_122267.zip. The Grafana URL will expire in ~2 weeks owing to the 60d retention policy for Prometheus metrics.
NOTE: debug.zip was unfortunately unavailable for this run due to a bug which was later fixed in [5].
Misc
ALTER TABLE %s SCATTER is not even publicly documented [4] at this time. Is it intended to remain experimental?
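For context, the statement is presumably issued by tpcc.scatterRanges once per table over a regular SQL connection; a rough sketch of doing the same by hand (connection string and table name are illustrative):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // any PostgreSQL-wire driver works against CockroachDB
)

func main() {
	// Connection string and table name are illustrative.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The AdminScatterRequest discussed in this issue is what ultimately
	// serves this statement.
	if _, err := db.Exec(fmt.Sprintf("ALTER TABLE %s SCATTER", "warehouse")); err != nil {
		log.Fatal(err)
	}
}
```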
[1] #119900 (comment)
[2] #105970 (comment)
[3] #104629 (comment)
[4] cockroachdb/docs#4975
[5] #119996
Jira issue: CRDB-37783
Epic CRDB-39952