kvserver: merge queue racing against replicate queue #57129

Closed
tbg opened this issue Nov 25, 2020 · 8 comments · Fixed by #79379
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

tbg commented Nov 25, 2020

On 20.1.1, we observed a range not moving off a node during decommissioning.
Manually running through the replicate queue suggested that "someone" was
rolling back our learner before we could promote it. (This state lasted for
many hours and presumably would never have resolved on its own.)

Logs indicated that the merge queue was repeatedly retrying with

cannot up-replicate to s11; missing gossiped StoreDescriptor

(s11 is the decommissioning node's store).

We momentarily disabled the merge queue and decommissioning promptly finished.
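
(For reference, the merge queue is gated by the kv.range_merge.queue_enabled cluster setting. A minimal sketch of toggling it from Go, assuming a reachable SQL endpoint and the lib/pq driver; the connection string is illustrative:)

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Connection string is illustrative; point it at any node of the cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Turn the merge queue off while the decommission is in progress...
	if _, err := db.Exec(`SET CLUSTER SETTING kv.range_merge.queue_enabled = false`); err != nil {
		log.Fatal(err)
	}
	// ...and remember to flip it back to true once the node has drained away.
}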

I actually don't know why the store descriptor wasn't gossiped; n11 was running and according to my reading of the code the draining status shouldn't have affected that at all:

statusTicker := time.NewTicker(gossipStatusInterval)
storesTicker := time.NewTicker(gossip.StoresInterval)
nodeTicker := time.NewTicker(gossip.NodeDescriptorInterval)
defer storesTicker.Stop()
defer nodeTicker.Stop()
n.gossipStores(ctx) // one-off run before going to sleep
for {
	select {
	case <-statusTicker.C:
		n.storeCfg.Gossip.LogStatus()
	case <-storesTicker.C:
		n.gossipStores(ctx)
	case <-nodeTicker.C:
		if err := n.storeCfg.Gossip.SetNodeDescriptor(&n.Descriptor); err != nil {
			log.Warningf(ctx, "couldn't gossip descriptor for node %d: %s", n.Descriptor.NodeID, err)
		}
	case <-stopper.ShouldStop():
		return
	}
}

Also, the node shouldn't even be draining in the first place; it should be decommissioning, but we did find evidence that it was draining, too, which I don't quite understand.

https://cockroachlabs.slack.com/archives/C01351NFLE9/p1606320505056100 has some internal info here.

My assumption is that an explicit ./cockroach node drain was issued against the node at some point.

So for this issue:

  • make sure the merge queue doesn't try to relocate to decommissioning or draining nodes (see the sketch after this list)
  • figure out why a draining node would stop gossiping its store descriptor
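
A minimal, self-contained sketch of the kind of check the first item asks for; the types and the helper here are hypothetical stand-ins, not the actual kvserver APIs:

// nodeStatus is a hypothetical stand-in for the liveness/membership
// information the allocator already has access to.
type nodeStatus struct {
	Decommissioning bool
	Draining        bool
}

// storeDesc is a hypothetical stand-in for roachpb.StoreDescriptor.
type storeDesc struct {
	StoreID int32
	NodeID  int32
}

// filterRelocationTargets drops candidate stores whose nodes are
// decommissioning or draining (or have no liveness record at all), so that
// queues such as the merge queue never pick them as relocation targets.
func filterRelocationTargets(cands []storeDesc, liveness map[int32]nodeStatus) []storeDesc {
	out := make([]storeDesc, 0, len(cands))
	for _, sd := range cands {
		st, ok := liveness[sd.NodeID]
		if !ok || st.Decommissioning || st.Draining {
			continue
		}
		out = append(out, sd)
	}
	return out
}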

Jira issue: CRDB-2858

tbg added the C-bug label Nov 25, 2020

tbg commented Mar 10, 2021

I hit a variant of this again in an experiment I was running for #59424. I imported tpcc-1000 into a three-node cluster, ran tpcc for a bit, brought the cluster down, and brought it back up as a four-node cluster. Then I immediately ran tpcc again. My invocation was:

bin/roachprod run ${c}:4 './cockroach workload run tpcc --warehouses=1000 --workers=1000 --wait=false --ramp=1m0s --duration=1000m0s --scatter --tolerate-errors {pgurl:1}'

Note the --scatter flag; it seems to have really thrown a wrench into things. Due to the default rebalance rate of 2MB/s, snapshots take a long time to send, and scatter seems to have decided to send tons of snapshots (which makes sense, since a fourth, empty node was just added). At the same time, the merge queue is busy merging ranges together (lots of neighboring ranges fall short of the 256 MB combined target). Since the snapshots wait for ~25 minutes to send, by the time they get through, the merge queue has already removed the learner again.
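
To put the 2MB/s figure into perspective, here is a back-of-the-envelope sketch. The backlog size and the assumption that snapshots into a store are, to a first approximation, handled one at a time are illustrative, not measurements:

package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		rate     = 2 << 20   // bytes/sec: the 2MB/s default rebalance snapshot rate
		snapSize = 256 << 20 // bytes: roughly one range near the merge target size
		backlog  = 12        // snapshots queued ahead of ours (illustrative)
	)
	perSnap := time.Duration(snapSize/rate) * time.Second
	fmt.Println("one snapshot:", perSnap)                                      // ~2m8s of pure transfer
	fmt.Println("waiting behind the backlog:", time.Duration(backlog)*perSnap) // ~25m
}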

This is all concerning enough on its own, but the cluster is also unavailable for some reason. Everything looks green in the UI, there are no "slow requests" as measured by the respective metrics page, and I'm not seeing anything in the logs beyond what I described above. And yet, simple queries such as select * from tpcc.order_line order by ol_o_id desc limit 1; hang forever.


tbg commented Mar 10, 2021

root@localhost:26257/defaultdb> select count(1), operation from crdb_internal.node_inflight_trace_spans group by operation order by count(1) desc;
  count |                       operation
--------+--------------------------------------------------------
    863 | [async] storage.RaftTransport: processing snapshot
    863 | /cockroach.storage.MultiRaft/RaftSnapshot
    674 | [async] storage.Store: gossip on capacity change
    547 | /cockroach.roachpb.Internal/Batch
     10 | materializer
      7 | [async] sql.rowPusher: send rows
      3 | /cockroach.sql.distsqlrun.DistSQL/FlowStream
      2 | flow
      1 | raft application
      1 | [async] lease-triggers
      1 | [async] storage.pendingLeaseRequest: requesting lease
(11 rows)


tbg commented Mar 10, 2021

The other bad thing: the scatter is still running, even though the TPCC invocation that started it is long gone. Clearly, cancellation did not work here.


tbg commented Mar 10, 2021

My query finally came back. I had upped the rebalance rate to basically unlimited (500MiB/s) and the replica count caught up around the time the query returned:

root@localhost:26257/defaultdb> select * from tpcc.order_line order by ol_o_id desc limit 1;
  ol_o_id | ol_d_id | ol_w_id | ol_number | ol_i_id | ol_supply_w_id | ol_delivery_d | ol_quantity | ol_amount |       ol_dist_info
----------+---------+---------+-----------+---------+----------------+---------------+-------------+-----------+---------------------------
     3093 |       8 |       0 |         1 |   49841 |              0 | NULL          |           5 |    334.35 | haRF4E9zNHsJ7ZvyiJ3n2X1f
(1 row)

Time: 203.509s total (execution 203.509s / network 0.000s)
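
(The rebalance rate mentioned above is controlled by the kv.snapshot_rebalance.rate cluster setting. A tiny sketch of bumping it, reusing the database/sql setup from the earlier snippet; the 500 MiB value mirrors what was done here:)

// bumpSnapshotRate raises the rebalance snapshot rate cluster-wide. It assumes
// an already-open *sql.DB against any node, as in the earlier snippet.
func bumpSnapshotRate(db *sql.DB) error {
	// kv.snapshot_rebalance.rate governs rebalance/up-replication snapshots;
	// 500 MiB effectively removes the throttle for this experiment.
	_, err := db.Exec(`SET CLUSTER SETTING kv.snapshot_rebalance.rate = '500 MiB'`)
	return err
}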

Confusingly, the same query is still far from instant; I would've expected it to be by now. It is simply reading one row from the PK, so how can it still be slow? It took 166s this time, and there is no "have been waiting" message either. WTH?


tbg commented Mar 10, 2021

Heh, ok. I guess I'm just a SQL noob:

  distribution: full
  vectorized: true

  • limit
  │ estimated row count: 1
  │ count: 1
  │
  └── • sort
      │ estimated row count: 300,000,000
      │ order: -ol_o_id
      │
      └── • scan
            estimated row count: 300,000,000 (100% of the table)
            table: order_line@primary
            spans: FULL SCAN

I misread the table's schema: the PK starts with ol_w_id, not ol_o_id, so ordering by ol_o_id requires a full scan plus a sort.

jlinder added the T-kv label Jun 16, 2021
erikgrinaker self-assigned this Oct 7, 2021
erikgrinaker added the A-kv-replication label Oct 7, 2021

tbg commented Mar 21, 2022

We discussed this just now in the Replication team. As a quick fix, we were considering using the snapshot log truncation constraints, which already act as a way to pass information from the replicate queue to the raft log and snapshot queues. If the merge queue checked early on (before removing learners) that no such constraint is in place, then the race (the merge queue repeatedly aborting a pending replication attempt) might be avoided, or would at least be more likely to resolve after a few iterations.
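
A minimal sketch of that stop-gap; the names below are hypothetical and not the code that ultimately landed in Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners (see the commits referenced further down):

package kvserversketch

import "sync"

// snapTruncConstraints stands in for the per-replica snapshot log truncation
// constraints that the replicate queue installs while it streams an initial
// snapshot to a learner.
type snapTruncConstraints struct {
	mu       sync.Mutex
	inflight map[int32]struct{} // keyed by the learner's replica ID (hypothetical)
}

func (c *snapTruncConstraints) hasInflightSnapshot() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.inflight) > 0
}

// maybeRemoveLearners is what a queue (merge queue, replicate queue, ...) would
// call before rolling back learners: if a snapshot is still in flight to one of
// them, back off and let the pending replication attempt finish; a later pass
// retries once the constraint has been released.
func maybeRemoveLearners(c *snapTruncConstraints, removeLearners func() error) error {
	if c.hasInflightSnapshot() {
		return nil
	}
	return removeLearners()
}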

Long-term, a move to more centralized planning of range actions (something we in KV have been mulling over for a while now) would likely avoid this pathological interaction in the first place.


tbg commented Mar 29, 2022

We saw something similar in https://github.com/cockroachlabs/support/issues/1504, albeit with the replicate queue interrupting the store rebalancer. The stop-gap above would've avoided that as well, if we extended it so that both the replicate queue and the store rebalancer also check for in-flight snapshots (originating from the local replica, which is all they can easily learn about).

aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 6, 2022
This commit adds a safeguard inside
`Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing
learner replicas _when we know_ that that learner replica is in the process of
receiving its initial snapshot (as indicated by an in-memory lock on log
truncations that we place while the snapshot is in-flight).

This change should considerably reduce the instances where `AdminRelocateRange`
calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes cockroachdb#57129
Relates to cockroachdb#79118

Release note: none
craig bot closed this as completed in d6240ac Apr 27, 2022