kvserver: merge queue racing against replicate queue #57129

Closed
tbg opened this issue Nov 25, 2020 · 8 comments · Fixed by #79379
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

tbg commented Nov 25, 2020

On 20.1.1, we observed a range not moving off a node during decommissioning.
Manually running through the replicate queue suggested that "someone" was
rolling back our learner before we could promote it. (This state lasted for
many hours and presumably would never have resolved on its own.)

Logs indicated that the merge queue was repeatedly retrying with

cannot up-replicate to s11; missing gossiped StoreDescriptor

(s11 is the decommissioning node's store).

We momentarily disabled the merge queue and decommissioning promptly finished.
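
(For reference, the merge queue is gated by the kv.range_merge.queue_enabled cluster setting. A minimal sketch of toggling it from Go, assuming a reachable SQL endpoint and the lib/pq driver; the connection string is illustrative:)

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Connection string is illustrative; point it at any node of the cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Turn the merge queue off while the decommission is in progress...
	if _, err := db.Exec(`SET CLUSTER SETTING kv.range_merge.queue_enabled = false`); err != nil {
		log.Fatal(err)
	}
	// ...and remember to flip it back to true once the node has drained away.
}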

I actually don't know why the store descriptor wasn't gossiped; n11 was running and according to my reading of the code the draining status shouldn't have affected that at all:

statusTicker := time.NewTicker(gossipStatusInterval)
storesTicker := time.NewTicker(gossip.StoresInterval)
nodeTicker := time.NewTicker(gossip.NodeDescriptorInterval)
defer storesTicker.Stop()
defer nodeTicker.Stop()
n.gossipStores(ctx) // one-off run before going to sleep
for {
	select {
	case <-statusTicker.C:
		n.storeCfg.Gossip.LogStatus()
	case <-storesTicker.C:
		n.gossipStores(ctx)
	case <-nodeTicker.C:
		if err := n.storeCfg.Gossip.SetNodeDescriptor(&n.Descriptor); err != nil {
			log.Warningf(ctx, "couldn't gossip descriptor for node %d: %s", n.Descriptor.NodeID, err)
		}
	case <-stopper.ShouldStop():
		return
	}
}

Also, the node shouldn't even be draining in the first place; it should be decommissioning, but we did find evidence that it was draining, too, which I don't quite understand.

https://cockroachlabs.slack.com/archives/C01351NFLE9/p1606320505056100 has some internal info here.

My assumption is that an explicit ./cockroach node drain was issued against the node at some point.

So for this issue:

  • make sure the merge queue doesn't try to relocate to decommissioning or draining nodes (see the sketch after this list)
  • figure out why a draining node would stop gossiping its store descriptor
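
A minimal, self-contained sketch of the kind of check the first item asks for; the types and the helper here are hypothetical stand-ins, not the actual kvserver APIs:

// nodeStatus is a hypothetical stand-in for the liveness/membership
// information the allocator already has access to.
type nodeStatus struct {
	Decommissioning bool
	Draining        bool
}

// storeDesc is a hypothetical stand-in for roachpb.StoreDescriptor.
type storeDesc struct {
	StoreID int32
	NodeID  int32
}

// filterRelocationTargets drops candidate stores whose nodes are
// decommissioning or draining (or have no liveness record at all), so that
// queues such as the merge queue never pick them as relocation targets.
func filterRelocationTargets(cands []storeDesc, liveness map[int32]nodeStatus) []storeDesc {
	out := make([]storeDesc, 0, len(cands))
	for _, sd := range cands {
		st, ok := liveness[sd.NodeID]
		if !ok || st.Decommissioning || st.Draining {
			continue
		}
		out = append(out, sd)
	}
	return out
}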

Jira issue: CRDB-2858

tbg added the C-bug label Nov 25, 2020

tbg commented Mar 10, 2021

I hit a variant of this again in an experiment I was running for #59424. I imported tpcc-1000 into a three-node cluster, ran tpcc for a bit, brought the cluster down, and brought it back up as a four-node cluster. Then I immediately ran tpcc again. My invocation was:

bin/roachprod run ${c}:4 './cockroach workload run tpcc --warehouses=1000 --workers=1000 --wait=false --ramp=1m0s --duration=1000m0s --scatter --tolerate-errors {pgurl:1}'

Note the --scatter flag; it seems to have really thrown a wrench into things. Due to the default rebalance rate of 2MB/s, snapshots take a long time to send, and scatter seems to have decided to send tons of snapshots (which makes sense, since a fourth, empty node was just added). At the same time, the merge queue is busy merging ranges together (lots of neighboring ranges fall short of the 256 MB combined target). Since the snapshots wait for ~25 minutes to send, by the time they get through, the merge queue has already removed the learner again.
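
To put the 2MB/s figure into perspective, here is a back-of-the-envelope sketch. The backlog size and the assumption that snapshots into a store are, to a first approximation, handled one at a time are illustrative, not measurements:

package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		rate     = 2 << 20   // bytes/sec: the 2MB/s default rebalance snapshot rate
		snapSize = 256 << 20 // bytes: roughly one range near the merge target size
		backlog  = 12        // snapshots queued ahead of ours (illustrative)
	)
	perSnap := time.Duration(snapSize/rate) * time.Second
	fmt.Println("one snapshot:", perSnap)                                      // ~2m8s of pure transfer
	fmt.Println("waiting behind the backlog:", time.Duration(backlog)*perSnap) // ~25m
}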

This is all concerning enough on its own, but the cluster is also unavailable for some reason. Everything looks green in the UI, there are no "slow requests" as measured by the respective metrics page, and I'm not seeing anything in the logs beyond what I described above. And yet, simple queries such as select * from tpcc.order_line order by ol_o_id desc limit 1; hang forever.


tbg commented Mar 10, 2021

root@localhost:26257/defaultdb> select count(1), operation from crdb_internal.node_inflight_trace_spans group by operation order by count(1) desc;
  count |                       operation
--------+--------------------------------------------------------
    863 | [async] storage.RaftTransport: processing snapshot
    863 | /cockroach.storage.MultiRaft/RaftSnapshot
    674 | [async] storage.Store: gossip on capacity change
    547 | /cockroach.roachpb.Internal/Batch
     10 | materializer
      7 | [async] sql.rowPusher: send rows
      3 | /cockroach.sql.distsqlrun.DistSQL/FlowStream
      2 | flow
      1 | raft application
      1 | [async] lease-triggers
      1 | [async] storage.pendingLeaseRequest: requesting lease
(11 rows)


tbg commented Mar 10, 2021

The other bad thing: the scatter is still running, even though the TPCC invocation that started it is long gone. Clearly, cancellation did not work here.


tbg commented Mar 10, 2021

My query finally came back. I had upped the rebalance rate to basically unlimited (500MiB/s) and the replica count caught up around the time the query returned:

root@localhost:26257/defaultdb> select * from tpcc.order_line order by ol_o_id desc limit 1;
  ol_o_id | ol_d_id | ol_w_id | ol_number | ol_i_id | ol_supply_w_id | ol_delivery_d | ol_quantity | ol_amount |       ol_dist_info
----------+---------+---------+-----------+---------+----------------+---------------+-------------+-----------+---------------------------
     3093 |       8 |       0 |         1 |   49841 |              0 | NULL          |           5 |    334.35 | haRF4E9zNHsJ7ZvyiJ3n2X1f
(1 row)

Time: 203.509s total (execution 203.509s / network 0.000s)
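
(The rebalance rate mentioned above is controlled by the kv.snapshot_rebalance.rate cluster setting. A tiny sketch of bumping it, reusing the database/sql setup from the earlier snippet; the 500 MiB value mirrors what was done here:)

// bumpSnapshotRate raises the rebalance snapshot rate cluster-wide. It assumes
// an already-open *sql.DB against any node, as in the earlier snippet.
func bumpSnapshotRate(db *sql.DB) error {
	// kv.snapshot_rebalance.rate governs rebalance/up-replication snapshots;
	// 500 MiB effectively removes the throttle for this experiment.
	_, err := db.Exec(`SET CLUSTER SETTING kv.snapshot_rebalance.rate = '500 MiB'`)
	return err
}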

Confusingly, the same query is still far from instant; I would've expected it to be by now. It is simply reading one row from the PK, so how can it still be slow? It took 166s this time, and there is no "have been waiting" message either. WTH?


tbg commented Mar 10, 2021

Heh, ok. I guess I'm just a SQL noob:

  distribution: full
  vectorized: true

  • limit
  │ estimated row count: 1
  │ count: 1
  │
  └── • sort
      │ estimated row count: 300,000,000
      │ order: -ol_o_id
      │
      └── • scan
            estimated row count: 300,000,000 (100% of the table)
            table: order_line@primary
            spans: FULL SCAN

I misread the table's schema: the PK starts with ol_w_id, not ol_o_id, so ordering by ol_o_id requires a full scan plus a sort.

jlinder added the T-kv label Jun 16, 2021
erikgrinaker self-assigned this Oct 7, 2021
erikgrinaker added the A-kv-replication label Oct 7, 2021

tbg commented Mar 21, 2022

We discussed this just now in the Replication team. As a quick fix, we were considering using the snapshot log truncation constraints, which already act as a way to pass information from the replicate queue to the raft log and snapshot queues. If the merge queue checked early on (before removing learners) that no such constraint is in place, then the race (the merge queue repeatedly aborting a pending replication attempt) might be avoided, or would at least be more likely to resolve after a few iterations.
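
A minimal sketch of that stop-gap; the names below are hypothetical and not the code that ultimately landed in Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners (see the commits referenced further down):

package kvserversketch

import "sync"

// snapTruncConstraints stands in for the per-replica snapshot log truncation
// constraints that the replicate queue installs while it streams an initial
// snapshot to a learner.
type snapTruncConstraints struct {
	mu       sync.Mutex
	inflight map[int32]struct{} // keyed by the learner's replica ID (hypothetical)
}

func (c *snapTruncConstraints) hasInflightSnapshot() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.inflight) > 0
}

// maybeRemoveLearners is what a queue (merge queue, replicate queue, ...) would
// call before rolling back learners: if a snapshot is still in flight to one of
// them, back off and let the pending replication attempt finish; a later pass
// retries once the constraint has been released.
func maybeRemoveLearners(c *snapTruncConstraints, removeLearners func() error) error {
	if c.hasInflightSnapshot() {
		return nil
	}
	return removeLearners()
}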

Long-term, a move to more centralized planning of range actions (something we in KV have been mulling over for a while now) would likely avoid this pathological interaction in the first place.


tbg commented Mar 29, 2022

We saw something similar in https://github.com/cockroachlabs/support/issues/1504, albeit with the replicate queue interrupting the store rebalancer. The stop-gap above would've avoided that as well, if we extended it so that both the replicate queue and the store rebalancer also check for in-flight snapshots (originating from the local replica, which is all they can easily learn about).

aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 6, 2022
This commit adds a safeguard inside
`Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing
learner replicas _when we know_ that that learner replica is in the process of
receiving its initial snapshot (as indicated by an in-memory lock on log
truncations that we place while the snapshot is in-flight).

This change should considerably reduce the instances where `AdminRelocateRange`
calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes cockroachdb#57129
Relates to cockroachdb#79118

Release note: none
craig bot closed this as completed in d6240ac Apr 27, 2022