storage: 17 replicas on a range configured for 3 #17612
Comments
Perhaps related, the Leases metrics are reporting 2.95m epoch leases. How can there be more epoch leases than ranges? |
Wow, it was actually at 10 million leases cluster-wide. That's bonkers, the metric has to be broken somehow |
And good find on the |
Previously if a no-op was returned from Allocator.ComputeAction, we took that as a signal to try rebalancing the range, which can cause a new replica to be added. However, we sometimes return a no-op when it really is the case that no action should be taken, and rebalancing might do more harm than good. Addresses (but might not fix) cockroachdb#17612
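For illustration, a minimal, hypothetical sketch of the behavior change this commit describes; AllocatorAction, its constants, and decide are simplified stand-ins, not the real allocator or replicate-queue code:

```go
package main

import "fmt"

// AllocatorAction loosely mirrors the allocator's recommendation for a range.
// This is a simplified, hypothetical sketch of the change described in the
// commit message above, not the actual CockroachDB code.
type AllocatorAction int

const (
	AllocatorNoop AllocatorAction = iota
	AllocatorAdd
	AllocatorRemove
)

// decide shows how a replication queue might react to the allocator's answer.
func decide(action AllocatorAction) string {
	switch action {
	case AllocatorAdd:
		return "add a replica"
	case AllocatorRemove:
		return "remove a replica"
	default:
		// Old behavior: a no-op was taken as a cue to attempt a rebalance,
		// which can add a replica to a range that is already correctly
		// replicated.
		// New behavior: a no-op means "leave the range alone"; rebalancing
		// only happens when explicitly requested.
		return "do nothing"
	}
}

func main() {
	fmt.Println(decide(AllocatorNoop)) // "do nothing"
}
```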
There's also the fact that we compute everything based on the replica's local copy of the range descriptor. If that's out of date, we could also find ourselves adding too many replicas. Is it possible for the leaseholder of a range to have an out of date range descriptor? |
Not that I know of. I think the leaseholder has to have an up to date copy of the range descriptor or lots of things would break. |
In order to get faster splits and balancing I have a local edit to @benesch Is this behavior of |
Ah, I think I see what is happening. The reason I added the @benesch, @a-robinson I recall there being discussion during the initial scatter implementation about removing unneeded replicas, but not the details. Do you recall why that wasn't implemented? |
Yeah, waiting for downreplication was implemented and then removed because it proved too slow. This was before I discovered (and fixed) the general performance badness caused by range tombstones on empty ranges, so it's possible we should restore downreplication. Anyway, SCATTER is currently designed for an all-splits, all-scatters pipeline. If you just want a balanced set of 1m ranges, @petermattis, I'd recommend issuing one massive SCATTER command after splits finish. It's true that RESTORE just switched to a split-scatter-at-a-time model, but the SCATTER implementation wasn't changed to deal with this. Waiting for downreplication is far more feasible when you're not waiting for 16k ranges to downreplicate in one retry loop with a 60s timeout. |
All that said, the demands on SCATTER are varied enough that it deserves a dedicated queue (i.e., a scatter queue, which might be implemented simply as a range-local flag that influences the replicate queue's behavior) or separate implementations for the different use cases. |
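For reference, a minimal sketch of the "all splits first, then one table-wide scatter" approach recommended above, assuming a hypothetical kv table, local cluster, and split keys; the ALTER TABLE ... SPLIT AT / SCATTER syntax shown may differ across versions:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Hypothetical local cluster and table name; adjust for your setup.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Phase 1: issue all of the splits up front.
	for i := 1; i < 1000; i++ {
		stmt := fmt.Sprintf("ALTER TABLE kv SPLIT AT VALUES (%d)", i*1000)
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}

	// Phase 2: a single table-wide scatter once splitting has finished,
	// instead of scattering after every individual split.
	if _, err := db.Exec("ALTER TABLE kv SCATTER"); err != nil {
		log.Fatal(err)
	}
}
```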
This is what I started with (perform all splits, then all scatters) and it's problematic because the splitting happens faster than the replicate queue can keep up, and we end up with a massive imbalance that actually slows down the splitting. I think the root problem here is that the replicate queue is throttled by being a single goroutine per node while |
Calling scatter after every split is definitely a bad idea given the current implementation. You could try calling it after every N splits, where N is somewhere in the tens or hundreds, but you'll still probably hit issues due to the concurrency differences you noted. The replicate queue is not currently tuned to work quite so quickly. Is this something that's blocking effective use of sky? Otherwise I'm not sure it merits much effort before 1.1. |
It is blocking trying to create 1m ranges. I'm on to testing other things on |
@a-robinson what about the current implementation makes issuing a scatter command after every split problematic? We no longer add jitter server-side. |
@benesch Did we ever commit the version of scatter that removed replicas after adding new ones? I couldn't find it earlier today. |
I guess if you're limiting the span of the scatter to just the range that was split, it's effectively the same. I was assuming he was scattering the entire table every time, but that's probably not the case. |
@petermattis, never committed. Here's the implementation from Reviewable:

```go
// Wait for the new leaseholder's replicate queue to downreplicate the range
// to the appropriate replication factor. Issuing REMOVE_REPLICA requests here
// would be dangerous, as we'd race with the replicate queue and potentially
// cause underreplication.
if err := retry.WithMaxAttempts(ctx, retryOptions, maxAttempts, func() error {
	// Get an updated range descriptor, as we might sleep for several seconds
	// between retries.
	desc = repl.Desc()
	action, _ := repl.store.allocator.ComputeAction(ctx, zone, desc)
	if action != AllocatorNoop {
		if log.V(2) {
			log.Infof(ctx, "scatter: allocator requests action %s, sleeping and retrying", action)
		}
		// Return an error so the retry loop actually sleeps and retries
		// until the allocator reports a no-op.
		return errors.Errorf("allocator requests action %s", action)
	}
	return nil
}); err != nil {
	return err
}
```

Point being it never actually issued remove replica commands; it just waited for the replicate queue to sort it out. @RaduBerinde's original implementation, which was just a wrapper around TestingRelocateRanges, did issue remove replica commands, but could cause massive underreplication due to racing with the replicate queue. |
I am limiting the scatter to the range being split. It actually isn't the same. I'm trying to parallelize splitting a table into N pieces. I do a little divide-and-conquer algorithm where I split at the middle split point, scatter, then recurse into the left and right parts. Consider what happens after the first split+scatter. We now have 2 ranges where each range has 6 replicas. The replicate queue will be trying to down replicate, but my load generator is sending splits for these 2 ranges and will then scatter again which adds 3 new replicas. If a single scatter is performed at the end, the max replicas a range will see is 6. With the split+scatter approach, the max replicas for a range is bounded only by the number of nodes in the cluster. |
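A rough sketch of the divide-and-conquer pattern described above, with hypothetical splitAt/scatterRange helpers standing in for real AdminSplit/AdminScatter requests; the point is that every recursion level scatters again before the previous level's extra replicas have been removed:

```go
package main

import "fmt"

// splitAt and scatterRange are hypothetical stand-ins for AdminSplit and
// AdminScatter requests; here they just print what a real client would issue.
func splitAt(key int)         { fmt.Printf("split at %d\n", key) }
func scatterRange(lo, hi int) { fmt.Printf("scatter [%d, %d)\n", lo, hi) }

// splitAndScatter sketches the divide-and-conquer pattern described above.
// Each level of the recursion scatters the freshly split span, which can add
// up to 3 new replicas while the previous level's extra replicas are still
// waiting on the replicate queue to be removed, so replica counts compound.
func splitAndScatter(lo, hi, minWidth int) {
	if hi-lo <= minWidth {
		return
	}
	mid := lo + (hi-lo)/2
	splitAt(mid)
	scatterRange(lo, hi)
	splitAndScatter(lo, mid, minWidth)
	splitAndScatter(mid, hi, minWidth)
}

func main() {
	// Split a keyspace of 16 units down to single-unit ranges.
	splitAndScatter(0, 16, 1)
}
```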
Thanks, @benesch. Waiting for the replicate queue is going to be slow. Perhaps we can mark the replica to be ignored by the replicate queue (temporarily) while we perform the down-replication. |
|
@bdarnell Interesting point. We don't use the ChangeReplicas RPC in Replica.adminScatter. I'm not sure why this doesn't just work. |
Perhaps it does. I never tried. It's possible the old scatter caused underreplication for some different reason, like the range tombstone nonsense. |
Ok, I'm taking a look at this. |
Rather than using custom logic in Replica.adminScatter, call into replicateQueue.processOneChange until no further action needs to be taken. This fixes a major caveat with the prior implementation: replicas were not removed synchronously, leading to conditions where an arbitrary number of replicas could be added to a range via scattering. Fixes cockroachdb#17612
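A simplified sketch of the shape of this fix, using stand-in types since the real replicateQueue and processOneChange have different signatures; the key idea is to loop over single replication changes until the allocator has nothing left to do, so removals happen synchronously:

```go
package main

import (
	"context"
	"fmt"
)

// Minimal stand-ins so the sketch compiles; the real replicateQueue and
// Replica types, and processOneChange's signature, differ in CockroachDB.
type replica struct{ pendingChanges int }

type replicateQueue struct{}

// processOneChange applies one allocator-recommended change (add or remove a
// replica) and reports whether further changes are still needed.
func (rq *replicateQueue) processOneChange(ctx context.Context, r *replica) (requeue bool, err error) {
	if r.pendingChanges == 0 {
		return false, nil
	}
	r.pendingChanges--
	fmt.Println("applied one replication change")
	return r.pendingChanges > 0, nil
}

// adminScatter sketches the shape of the fix: drive the same logic the
// replicate queue uses, one change at a time, until nothing is left to do.
// Replicas added by scattering are therefore removed synchronously rather
// than piling up behind the queue.
func adminScatter(ctx context.Context, rq *replicateQueue, r *replica) error {
	for {
		requeue, err := rq.processOneChange(ctx, r)
		if err != nil {
			return err
		}
		if !requeue {
			return nil
		}
	}
}

func main() {
	if err := adminScatter(context.Background(), &replicateQueue{}, &replica{pendingChanges: 3}); err != nil {
		fmt.Println(err)
	}
}
```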
Noticed on sky that after creating 1m ranges we have 9m replicas. Chose a random range (1000) and found 17 replicas. Looks like the replicate queue is churning away removing replicas, but how did we get into this situation? Seems like the AllocatorRemove action is preferred over AllocatorNoop, though there is this one oddity: