kvserver: merge queue racing against replicate queue #57129
I hit a variant of this again in an experiment I was running for #59424. I imported tpcc-1000 into a three-node cluster, ran tpcc for a bit, brought the cluster down, and brought it back up as a four-node cluster. Then I immediately ran:
This is all already sort of concerning, but the cluster is also unavailable for some reason. Everything looks green in the UI, there are no "slow requests" as measured by the respective metrics page, and I'm not seeing anything in the logs other than what I described above. And yet, simple queries such as the single-row read I was running just hang.
The other thing that is bad is that the scatter is still running, but the TPCC invocation that started it is long gone. Clearly the cancellation here did not work.
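For reference, the expectation here is that a long-running admin operation like a scatter keeps checking its caller's context and stops once that caller goes away. A minimal, self-contained sketch of that pattern (hypothetical function names, not the actual scatter code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// relocateRanges stands in for a long-running admin operation (e.g. a
// scatter). It is hypothetical, not the real implementation: the point is
// only that each step checks ctx so the work stops when the caller is gone.
func relocateRanges(ctx context.Context, ranges int) error {
	for i := 0; i < ranges; i++ {
		select {
		case <-ctx.Done():
			// The initiating request (here, the TPCC invocation) went away;
			// abandon the remaining work instead of running on.
			return ctx.Err()
		default:
		}
		time.Sleep(10 * time.Millisecond) // simulate moving one range
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(50 * time.Millisecond)
		cancel() // the caller disappears mid-operation
	}()
	fmt.Println(relocateRanges(ctx, 1000)) // prints "context canceled"
}
```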
My query finally came back. I had upped the rebalance rate to basically unlimited (500MiB/s) and the replica count caught up around the time the query returned:
Confusingly, the same query is still far from instant, which I would not have expected. It is simply reading one row from the PK, so how can it still be slow? It took 166s this time, and there is no "have been waiting" message either. WTH?
Heh, ok. I guess I'm just a SQL noob:
I misread the table's schema: the PK starts with a different column than I assumed, so my "one row from the PK" read was not actually a point lookup.
We discussed this just now in the Replication team. As a quick fix, we were considering using the snapshot log truncation constraints, which already act as a way to pass information from the replicate queue to the raft log and snapshot queues. If the merge queue checked early on (before removing learners) that no constraint is in place, then the race (the merge queue always aborting the pending replication attempt) might be avoided, or at least would have a higher likelihood of disappearing after a few iterations. Long-term, a move to more centralized planning of range actions (something we in KV have been mulling over for a while now) would likely avoid this pathological interaction in the first place.
We saw something similar in https://github.com/cockroachlabs/support/issues/1504, albeit with the replicate queue interrupting the store rebalancer. The stop-gap above would've avoided that as well, if we extended it to let both the replicate queue and the store rebalancer also check for in-flight snapshots (originating from the local replica, which is all they can easily learn about).
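To make the stop-gap concrete, here is a minimal sketch of the kind of in-memory bookkeeping being described, with hypothetical names (the real mechanism in the codebase is the snapshot log truncation constraint, not this type): the sender registers the learner while its initial snapshot is in flight, and the merge queue, replicate queue, or store rebalancer can consult it before tearing the learner down.

```go
package main

import (
	"fmt"
	"sync"
)

// inflightSnapshotTracker is a hypothetical stand-in for the snapshot log
// truncation constraints: it records which learner replicas are currently
// receiving their initial snapshot from this store.
type inflightSnapshotTracker struct {
	mu       sync.Mutex
	inflight map[int]struct{} // keyed by replica ID
}

func newTracker() *inflightSnapshotTracker {
	return &inflightSnapshotTracker{inflight: map[int]struct{}{}}
}

// begin is called by the snapshot sender before streaming the snapshot.
// The returned func must be called once the snapshot has been applied
// (or has failed), releasing the "lock".
func (t *inflightSnapshotTracker) begin(replicaID int) (done func()) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inflight[replicaID] = struct{}{}
	return func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		delete(t.inflight, replicaID)
	}
}

// hasInflight is what the merge queue (and, per the comment above, the
// replicate queue and store rebalancer) would check before removing a
// learner: if a snapshot is still in flight, back off and retry later.
func (t *inflightSnapshotTracker) hasInflight(replicaID int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.inflight[replicaID]
	return ok
}

func main() {
	tr := newTracker()
	done := tr.begin(3)            // start sending a snapshot to learner r3
	fmt.Println(tr.hasInflight(3)) // true: merge queue should not roll back r3
	done()
	fmt.Println(tr.hasInflight(3)) // false: safe to remove the learner now
}
```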
This commit adds a safeguard inside `Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing a learner replica _when we know_ that it is in the process of receiving its initial snapshot (as indicated by an in-memory lock on log truncations that we place while the snapshot is in-flight). This change should considerably reduce the instances where `AdminRelocateRange` calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes cockroachdb#57129
Relates to cockroachdb#79118

Release note: none
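A sketch of how the safeguard in the learner-removal path might look, again simplified and with hypothetical names rather than the real `Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` signature: learners with an in-flight initial snapshot are skipped and left for a later pass.

```go
package main

import "fmt"

// snapshotTracker abstracts "is there an initial snapshot in flight to this
// learner?"; in the real code this information would come from the snapshot
// log truncation constraints. The name and interface are hypothetical.
type snapshotTracker interface {
	hasInflight(replicaID int) bool
}

// staticTracker is a trivial implementation for the example.
type staticTracker map[int]bool

func (s staticTracker) hasInflight(id int) bool { return s[id] }

// maybeRemoveLearners is a hypothetical, simplified stand-in for the learner
// removal path: it removes learners from a range, except those that are still
// receiving their initial snapshot.
func maybeRemoveLearners(tr snapshotTracker, learnerIDs []int) (removed, skipped []int) {
	for _, id := range learnerIDs {
		if tr.hasInflight(id) {
			// Rolling this learner back now would abort the pending
			// replication attempt (the race in this issue), so leave it
			// alone and let the caller come back later.
			skipped = append(skipped, id)
			continue
		}
		removed = append(removed, id)
	}
	return removed, skipped
}

func main() {
	tr := staticTracker{3: true} // learner r3 is still receiving its snapshot
	removed, skipped := maybeRemoveLearners(tr, []int{2, 3})
	fmt.Println(removed, skipped) // [2] [3]
}
```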
On 20.1.1, we observed a range not moving off a node during decommissioning.
Manually running through the replicate queue suggested that "someone" was
rolling back our learner before we could promote it. (This state lasted for
many hours, and presumably wouldn't ever have resolved).
Logs indicated that the merge queue was repeatedly retrying with
(s11 is the decommissioning node's store).
We momentarily disabled the merge queue and decommissioning promptly finished.
I actually don't know why the store descriptor wasn't gossiped; n11 was running and according to my reading of the code the draining status shouldn't have affected that at all:
(cockroach/pkg/server/node.go, lines 689 to 709 at b9ed5df)
Also, the node shouldn't even be draining in the first place; it should be decommissioning, but we did find evidence that it was draining, too, which I don't quite understand.
https://cockroachlabs.slack.com/archives/C01351NFLE9/p1606320505056100 has some internal info here.
My assumption is that an explicit
./cockroach node drain
was issued against the node at some point. So, for this issue:
Jira issue: CRDB-2858