kv: potential livelock in AdminScatterRequest #122267

Open · srosenberg opened this issue Apr 12, 2024 · 0 comments

Labels: C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Background

Infrequent but persistent, unexplained failures of tpccbench/nodes=9/cpu=4/multi-region suggest a potential livelock in AdminScatterRequest. Each of the unexplained failures [1], [2], [3] times out during tpcc.scatterRanges (see run_071903.867880943_n4_cockroach-workload-r.log in the attached artifacts.zip). Each was previously investigated and closed as an infrastructure flake. However, upon closer examination of the failure in [1], it appears to be a correctness issue.
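For context, the workload's scatter step boils down to issuing a blocking ALTER TABLE ... SCATTER statement per TPC-C table (the real implementation lives in pkg/workload/tpcc/ddls.go). Below is a minimal sketch of that shape, assuming a stock database/sql client; the driver, connection string, and table list are illustrative, not the workload's actual configuration:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // illustrative driver choice for this sketch
)

// scatterTables mirrors the shape of tpcc.scatterRanges: one blocking
// ALTER TABLE ... SCATTER statement per table. Each statement returns only
// once the AdminScatterRequest over the table's ranges completes, so a
// wedged scatter stalls workload setup indefinitely.
func scatterTables(db *sql.DB, tables []string) error {
	for _, t := range tables {
		if _, err := db.Exec(fmt.Sprintf("ALTER TABLE %s SCATTER", t)); err != nil {
			return fmt.Errorf("scatter %s: %w", t, err)
		}
	}
	return nil
}

func main() {
	// Hypothetical connection string; adjust for the cluster under test.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := scatterTables(db, []string{"warehouse", "district", "customer"}); err != nil {
		log.Fatal(err)
	}
}
```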

Observations

From test.log, we observe that the workload timed out after 7h even though it was expected to run for only 45m,

07:19:03 cluster.go:2411: running cmd `./cockroach workload run tp...` on nodes [:4]; details in run_071903.867880943_n4_cockroach-workload-r.log
12:11:07 test_impl.go:414: test failure #1: full stack retained in failure_1.log: (test_runner.go:1161).runTest: test timed out (7h0m0s)

From run_071903.867880943_n4_cockroach-workload-r.log (see artifacts.zip),

07:19:03 cluster.go:2414: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=613 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}

SIGQUIT yields the following stack trace (redundant frames elided),

github.com/cockroachdb/cockroach/pkg/workload/tpcc.scatterRanges(0xc008282c30)
        github.com/cockroachdb/cockroach/pkg/workload/tpcc/ddls.go:280
github.com/cockroachdb/cockroach/pkg/workload/tpcc.(*tpcc).partitionAndScatterWithDB(0x640963c?, 0x9?)
        github.com/cockroachdb/cockroach/pkg/workload/tpcc/tpcc.go:952
github.com/cockroachdb/cockroach/pkg/workload/tpcc.(*tpcc).partitionAndScatter(0xc001068c40, {0xc00811b9e0?, 0x0?, 0x0?})
        github.com/cockroachdb/cockroach/pkg/workload/tpcc/tpcc.go:931
github.com/cockroachdb/cockroach/pkg/workload/tpcc.(*tpcc).Ops(0xc001068c40, {0x7c13ed8, 0xc001376f50}, {0xc00811b9e0, 0x9, 0x11}, 0xc00828a4e0)
        github.com/cockroachdb/cockroach/pkg/workload/tpcc/tpcc.go:737

The CPU graph shows very light utilization during the 7h window,

[graph: cpu_utilization]

The Queue Processing Failures graph shows persistent replication failures during the same 7h window,

[graph: replication_queue_failures]

Analysis

On n1, we can see the DistSQL flow for SCATTER awaiting completion for ~4.5h,

chan receive [275 minutes] [Created by pgwire.(*Server).serveImpl in goroutine 576507 @ server.go:996]
    kvcoord                   dist_sender.go:1584                   (*DistSender).divideAndSendBatchToRanges.func1()
    kvcoord                   dist_sender.go:1769                   (*DistSender).divideAndSendBatchToRanges(#346, {#166, #5777}, #5261, {{#5842, 9, 0x10}, {#4704, 2, 2}}, ...)
    kvcoord                   dist_sender.go:1114                   (*DistSender).Send(#346, {#166, #5832}, #5261)
    kv                        db.go:229                             (*CrossRangeTxnWrapperSender).Send(#4128, {#166, #5832}, #5261)
    kv                        sender.go:475                         SendWrappedWithAdmission({_, _}, {_, _}, {{0, 0}, 0, {0, 0}, {0, ...}, ...}, ...)
    kv                        sender.go:460                         SendWrappedWith(...)
    kv                        sender.go:492                         SendWrapped({#166, #5832}, {#148, #4128}, {#194, #6039})
    sql                       scatter.go:143                        (*scatterNode).startExec(#2725, {{#166, #5832}, #4886, #5144})
    sql                       plan.go:520                           startExec.func2({#50, #5277}, {#171, #2725})
    sql                       walk.go:112                           (*planVisitor).visitInternal.func1()
    sql                       walk.go:299                           (*planVisitor).visitInternal(#5278, {#171, #2725}, {#133, 7})
    sql                       walk.go:79                            (*planVisitor).visit(#5278, {#171, #2725})
    sql                       walk.go:43                            walkPlan(...)
    sql                       plan.go:523                           startExec({{#166, #5832}, #4886, #5144}, {#171, #2725})
    sql                       plan_node_to_row_source.go:175        (*planNodeToRowSource).Start(#1142, {#166, #5831})
    colexec                   columnarizer.go:183                   (*Columnarizer).Init(#4768, {#166, #5830})
    colflow                   stats.go:94                           (*batchInfoCollector).init(...)
    colexecerror              error.go:92                           CatchVectorizedRuntimeError(#5903)
    colflow                   stats.go:103                          (*batchInfoCollector).Init(#4546, {#166, #5830})
    colflow                   flow_coordinator.go:245               (*BatchFlowCoordinator).Run.(*BatchFlowCoordinator).init.func2()
    colexecerror              error.go:92                           CatchVectorizedRuntimeError(#165)
    colflow                   flow_coordinator.go:244               (*BatchFlowCoordinator).init(...)
    colflow                   flow_coordinator.go:278               (*BatchFlowCoordinator).Run(#5926, {#165, #6038})
    colflow                   vectorized_flow.go:320                (*vectorizedFlow).Run(#3810, {#165, #6038}, 0)
    sql                       distsql_running.go:911                (*DistSQLPlanner).Run(#268, {#166, #5829}, #5157, #2305, #2611, #3554, #5145, #5279)
    sql                       distsql_running.go:1972               (*DistSQLPlanner).PlanAndRun(#268, {#166, #5829}, #5145, #5157, #2305, {{#171, #2725}, 0}, #3554, ...)
    sql                       distsql_running.go:1687               (*DistSQLPlanner).PlanAndRunAll.func3(#5794, #5280, {#166, #5829}, 0, #6100, 0x10)
    sql                       distsql_running.go:1690               (*DistSQLPlanner).PlanAndRunAll(#268, {#166, #5829}, #5145, #5157, #5144, #3554, 0)
    sql                       conn_executor_exec.go:2439            (*connExecutor).execWithDistSQLEngine(#5143, {#166, #5829}, #5144, 3, {#6106, #4410}, 0, #2819)
    sql                       conn_executor_exec.go:1971            (*connExecutor).dispatchToExecutionEngine(#5143, {#166, #5688}, #5144, {#6106, #4410})
    sql                       conn_executor_exec.go:1147            (*connExecutor).execStmtInOpenState(#5143, {#166, #5688}, {{#172, #5804}, {0, 0, 0}, {#5108, 0x1c}, ...}, ...)
    sql                       conn_executor_exec.go:146             (*connExecutor).execStmt.func1({#166, #5687})
    sql                       conn_executor_exec.go:3447            (*connExecutor).execWithProfiling(#202, {#166, #5687}, {#172, #5804}, #222, 0)
    sql                       conn_executor_exec.go:145             (*connExecutor).execStmt(#5143, {#166, #5687}, {{#172, #5804}, {0, 0, 0}, {#5108, 0x1c}, ...}, ...)
    sql                       conn_executor.go:2245                 (*connExecutor).execCmd.func1(#5008, #5009, #5143, #5007, #5282, #5281, #5283, #5010)
    sql                       conn_executor.go:2250                 (*connExecutor).execCmd(#5143)
    sql                       conn_executor.go:2167                 (*connExecutor).run(#5143, {#166, #5702}, 0x3c, #6055, #4757)
    sql                       conn_executor.go:965                  (*Server).ServeConn(#711, {#166, #5702}, {#5042}, 9, #5701)
    pgwire                    conn.go:247                           (*conn).processCommands(#440
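The chan receive above corresponds to the fan-out/join in divideAndSendBatchToRanges: the batch is split per range and the sender blocks until every partial batch reports back. Here is a minimal sketch of that pattern; the types and the sendOne helper are illustrative stand-ins, not the actual kvcoord API:

```go
package kvsketch

// batchPart and response are illustrative stand-ins; the real kvcoord
// types differ.
type batchPart struct{ rangeID int }
type response struct{ err error }

// sendOne stands in for sendPartialBatch, which may loop in (*Retry).Next
// on retryable errors (see the n6 stack below).
func sendOne(p batchPart) response { return response{} }

// fanOutAndWait splits a batch per range and blocks until every sub-batch
// reports back; if even one sub-batch never terminates, the receive below
// blocks forever -- the multi-hour "chan receive" seen above.
func fanOutAndWait(parts []batchPart) []response {
	respCh := make(chan response, len(parts))
	for _, p := range parts {
		go func(p batchPart) { respCh <- sendOne(p) }(p)
	}
	out := make([]response, 0, len(parts))
	for range parts {
		out = append(out, <-respCh)
	}
	return out
}
```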

On n6, we see the following call sequence: adminScatter -> initializeRaftLearners -> execChangeReplicasTxn -> sendPartialBatch -> (*Retry).Next (a sketch of this retry shape follows the stack trace),

select [Created by grpc.(*Server).serveStreams.func1 in goroutine 477 @ server.go:963]
    retry                     retry.go:128                             (*Retry).Next(#3120)
    kvcoord                   dist_sender.go:1895                      (*DistSender).sendPartialBatch(#221, {#117, #3661}, #737, {{#3752, 9, 0xf}, {#3752, 0xa, 0xf}}, ...)
    kvcoord                   dist_sender.go:1498                      (*DistSender).divideAndSendBatchToRanges(#221, {#117, #3661}, #737, {{#3752, 9, 0xf}, {#3752, 0xa, 0xf}}, ...)
    kvcoord                   dist_sender.go:1114                      (*DistSender).Send(#221, {#117, #3660}, #737)
    kvcoord                   txn_lock_gatekeeper.go:82                (*txnLockGatekeeper).SendLocked(#3072, {#117, #3660}, #737)
    kvcoord                   txn_interceptor_metric_recorder.go:46    (*txnMetricRecorder).SendLocked(#3071, {#117, #3660}, #151)
    kvcoord                   txn_interceptor_span_refresher.go:239    (*txnSpanRefresher).sendLockedWithRefreshAttempts(#3070, {#117, #3660}, #737, 5)
    kvcoord                   txn_interceptor_span_refresher.go:167    (*txnSpanRefresher).SendLocked(#3070, {#117, #3660}, 0)
    kvcoord                   txn_interceptor_committer.go:149         (*txnCommitter).SendLocked(#3069, {#117, #3660}, #737)
    kvcoord                   txn_interceptor_pipeliner.go:295         (*txnPipeliner).SendLocked(#3068, {#117, #3660}, 0x12)
    kvcoord                   txn_interceptor_seq_num_allocator.go:114 (*txnSeqNumAllocator).SendLocked(#3067, {#117, #3660}, #737)
    kvcoord                   txn_interceptor_heartbeater.go:246       (*txnHeartbeater).SendLocked(#3066, {#117, #3660}, #737)
    kvcoord                   txn_coord_sender.go:535                  (*TxnCoordSender).Send(#3065, {#117, #3659}, #737)
    kv                        db.go:1127                               (*DB).sendUsingSender(#2681, {#117, #3659}, #737, {#3935, #3065})
    kv                        txn.go:1282                              (*Txn).Send(#2837, {#117, #3659}, #737)
    kv                        db.go:959                                sendAndFill({#117, #3659}, #3121, #490)
    kv                        txn.go:800                               (*Txn).Run(#2837, {#117, #3659}, #490)
    kvserver                  replica_command.go:2435                  execChangeReplicasTxn.func2({#117, #3764}, #2837)
    kv                        txn.go:1049                              (*Txn).exec(#2837, {#117, #3764}, #3122)
    kv                        db.go:1092                               runTxn({#117, #3764}, #2837, 5)
    kv                        db.go:1055                               (*DB).TxnWithAdmissionControl(#3494, {#117, #3764}, #145, 0, 0, #3122)
    kv                        db.go:1030                               (*DB).Txn(...)
    kvserver                  replica_command.go:2333                  execChangeReplicasTxn({#117, #3764}, #426, #3493, {#88, 9}, {#1110, 0xbb}, {#3123, 1, ...}, ...)
    kvserver                  replica_command.go:1793                  (*Replica).initializeRaftLearners(#1593, {#117, #3764}, #3493, 1, 0, {#88, 9}, {#1110, 0xbb}, ...)
    kvserver                  replica_command.go:1108                  (*Replica).changeReplicasImpl(#1593, {#117, #3764}, #136, 1, 0, {#88, 9}, {#1110, 0xbb}, ...)
    kvserver                  replicate_queue.go:1067                  (*replicateQueue).changeReplicas(...)
    kvserver                  replicate_queue.go:812                   (*replicateQueue).applyChange(0, {#117, #3764}, {0xf, {#135, #1593}, {#107, #2836}, {0, 0, ...}}, ...)
    kvserver                  replicate_queue.go:905                   (*replicateQueue).processOneChange(#207, {#117, #3764}, #1593, #3493, #2835, #767, 1, 0)
    kvserver                  replica_command.go:4029                  (*Replica).adminScatter(_, {_, _}, {{{}, {#3837, 9, 0x10}, {#3838, 9, 0x10}, ...}, ...})
    kvserver                  replica_send.go:1000                     (*Replica).executeAdminBatch(#1593, {#117, #3763}, #747)
    kvserver                  replica_send.go:189                      (*Replica).SendWithWriteBytes(#1593, {#117, #3762}, #747)
    kvserver                  store_send.go:193                        (*Store).SendWithWriteBytes(#174, {#117, #3761}, #747)
    kvserver                  stores.go:202                            (*Stores).SendWithWriteBytes(#256, {#117, #3761}, #747)
    server                    node.go:1388                             (*Node).batchInternal(#2993, {#117, #3760}, {#2998}, #747)
    server                    node.go:1522                             (*Node).Batch(#2993, {#117, #3759}, #747)
    kvpb                      api.pb.go:10481                          _Internal_Batch_Handler.func1({#117, #3759}, {#86, #747})
    rpc                       grpc_interceptor.go:97                   NewServerEx.ServerInterceptor.func12({#117, #3759}, {#86, #747}, #3582, #3278)
    grpc                      server.go:1163                           getChainUnaryHandler.func1({#117, #3759}, {#86, #747})
    rpc                       context.go:169                           NewServerEx.func3({#117, #3759}, {#86, #747}, #3582, #2706)
    grpc                      server.go:1163                           getChainUnaryHandler.func1({#117, #3759}, {#86, #747})
    rpc                       auth.go:95                               kvAuth.unaryInterceptor({#449, {{#16}, {#126, #2701}}}, {#117, #3759}, {#86, #747}, #3582, #2705)
    grpc                      server.go:1163                           getChainUnaryHandler.func1({#117, #3759}, {#86, #747})
    rpc                       context.go:136                           NewServerEx.func1.1({#117, #3759})
    stop                      stopper.go:336                           (*Stopper).RunTaskWithErr(#429, {#117, #3759}, {#2704, #3125}, #3124)
    rpc                       context.go:134                           NewServerEx.func1({#117, #3759}, {#86, #747}, #3582, #3278)
    grpc                      server.go:1154                           NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1({#117, #3759}, {#86, #747}, #3582, 0x70)
    kvpb                      api.pb.go:10483                          _Internal_Batch_Handler({#85, #2993}, {#117, #3759}, #1629, #250)
    grpc                      server.go:1336                           (*Server).processUnaryRPC(#2682, {#134, #435}, #746, #270, #142, 0)
    grpc                      server.go:1704                           (*Server).handleStream(#2682, {#134, #435}, #746, 0)
    grpc                      server.go:965                            (*Server).serveStreams.func1.2()
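The (*Retry).Next frame is the crux: absent a retry cap, a persistently failing sub-request keeps backing off and retrying for as long as its context lives. Below is a minimal sketch of that loop; the backoff constants and the error classification are assumptions for illustration, not the real DistSender configuration:

```go
package kvsketch

import (
	"context"
	"time"
)

// sendWithRetry sketches the shape of the sendPartialBatch retry loop:
// retryable errors are retried with capped exponential backoff until the
// context is cancelled. Under a long-lived context (the SCATTER statement
// on n1 waits indefinitely), a persistently failing sub-request becomes an
// unbounded retry loop instead of a returned error.
func sendWithRetry(ctx context.Context, send func() error, isRetryable func(error) bool) error {
	backoff := 50 * time.Millisecond
	for {
		err := send()
		if err == nil || !isRetryable(err) {
			return err
		}
		select {
		case <-time.After(backoff):
			if backoff < 2*time.Second {
				backoff *= 2 // capped exponential backoff
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```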

Furthermore, we periodically observe the following error message,

E240309 11:55:16.864850 1020345 kv/kvserver/queue.go:1181 ⋮ [T1,Vsystem,n5,replicate,s5,r237/4:‹/Table/108/1/166{6/7/…-7/6/…}›] 8948  operation ‹"replicate queue process replica 237"› timed out after 1m0.002s (given timeout 1m0s): change replicas of r237 failed: aborted in DistSender: result is ambiguous: context deadline exceeded

followed by,

I240309 12:01:54.098442 1036106 kv/kvserver/replica_command.go:1754 ⋮ [T1,Vsystem,n5,replicate,s5,r237/4:‹/Table/108/1/166{6/7/…-7/6/…}›] 9068  could not successfully add and upreplicate LEARNER replica(s) on [n4,s4], rolling back: change replicas of r237 failed: aborted in DistSender: result is ambiguous: context deadline exceeded

This operation timeout recurs roughly every 5 minutes throughout the full 7h window,

grep "change replicas of r237 failed: aborted in DistSender" logs/6.unredacted/cockroach.log |grep "timed out"|awk '{print $2}'|head
07:21:40.172204
07:29:27.029321
07:33:48.037893
07:36:06.904053
07:40:50.796083
07:44:40.926761
07:47:52.041262
07:51:34.179051
07:53:24.248551
07:57:17.388414
grep "change replicas of r237 failed: aborted in DistSender" logs/6.unredacted/cockroach.log |grep "timed out"|awk '{print $2}'|tail
11:40:45.357206
11:43:39.448866
11:47:29.584860
11:51:26.726728
11:55:16.864850
12:01:54.098714
12:03:34.162775
12:08:41.345532
12:09:58.391512
12:14:44.561428

Thus, the replica change driven by adminScatter appears to be failing repeatedly, including its rollback?! Furthermore, the op appears to be kept in the replicate queue indefinitely, effectively resulting in an infinite attempt to apply it.
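If that reading is right, the loop composes as follows: the replicate queue processes the change under a 1m timeout, the attempt fails, and the replica is requeued, repeating without bound. A hedged sketch of that interaction; the queue representation and the process signature are simplified stand-ins for the real replicateQueue, whose purgatory and priority mechanics are omitted:

```go
package kvsketch

import (
	"context"
	"time"
)

// replicaChange is a simplified stand-in for a queued replica; process
// stands in for (*replicateQueue).processOneChange.
type replicaChange struct{ rangeID int }

// requeueLoop sketches the suspected livelock: each attempt runs under a
// 1m timeout ("timed out after 1m0.002s" in the logs above), fails, and is
// put back on the (buffered) queue, so the same change is retried every
// few minutes for the entire 7h window with no terminal-error escape.
func requeueLoop(queue chan replicaChange, process func(context.Context, replicaChange) error) {
	for change := range queue {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		if err := process(ctx, change); err != nil {
			queue <- change // requeue on failure
		}
		cancel()
	}
}
```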

NOTE: artifacts.zip can be found in gs://cockroach-testeng/artifacts_122267.zip. The Grafana URL will expire in ~2 weeks owing to the 60d retention policy for Prometheus metrics.

NOTE: debug.zip was unfortunately unavailable for this run due to a bug that was later fixed in [5].

Misc

ALTER TABLE %s SCATTER is not even publicly documented [4] at this time. Is it intended to remain experimental?

[1] #119900 (comment)
[2] #105970 (comment)
[3] #104629 (comment)
[4] cockroachdb/docs#4975
[5] #119996

Jira issue: CRDB-37783

Epic CRDB-39952
