storage: 17 replicas on a range configured for 3 #17612

Closed
petermattis opened this issue Aug 12, 2017 · 23 comments · Fixed by #17644

Comments

@petermattis
Collaborator

Noticed on sky that after creating 1m ranges we have 9m replicas. Chose a random range (1000) and found 17 replicas. Looks like the replicate queue is churning away removing replicas, but how did we get into this situation? Seems like the AllocatorRemove action is preferred over AllocatorNoop, though there is this one oddity:

	liveReplicas, deadReplicas := a.storePool.liveAndDeadReplicas(rangeInfo.Desc.RangeID, rangeInfo.Desc.Replicas)
	if len(liveReplicas) < quorum {
		// Do not take any removal action if we do not have a quorum of live
		// replicas.
		return AllocatorNoop, 0
	}
@petermattis added this to the 1.1 milestone Aug 12, 2017
@petermattis
Collaborator Author

Perhaps related, the Leases metrics are reporting 2.95m epoch leases. How can there be more epoch leases than ranges?

@a-robinson
Contributor

Perhaps related, the Leases metrics are reporting 2.95m epoch leases. How can there be more epoch leases than ranges?

Wow, it was actually at 10 million leases cluster-wide. That's bonkers; the metric has to be broken somehow.

@a-robinson
Contributor

And good find on the AllocatorNoop - it seems as though even if we choose AllocatorNoop as our action for this reason, we need to avoid doing any rebalancing if the number of replicas is greater than the target number. I don't know whether that's the only issue involved here, though.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 12, 2017
Previously if a no-op was returned from Allocator.ComputeAction, we
took that as a signal to try rebalancing the range, which can cause a
new replica to be added. However, we sometimes return a no-op when it
really is the case that no action should be taken, and rebalancing might
do more harm than good.

Addresses (but might not fix) cockroachdb#17612
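
For illustration, here is a minimal, self-contained sketch of the guard described in that commit, using stand-in types rather than the actual allocator, range descriptor, and zone config types; the real change lives in the replicate queue's processing logic.

package main

import "fmt"

// Stand-in types for the sketch; the real code uses the allocator's action
// enum, roachpb.RangeDescriptor, and the zone config.
type allocatorAction int

const (
	allocatorNoop allocatorAction = iota
	allocatorAdd
	allocatorRemove
)

type rangeDesc struct {
	replicaStoreIDs []int
}

type zoneConfig struct {
	numReplicas int
}

// shouldConsiderRebalance returns true only when the allocator has nothing
// else to do AND the range is not already carrying more replicas than the
// zone config asks for. Otherwise, attempting a rebalance would add yet
// another replica to an already over-replicated range.
func shouldConsiderRebalance(action allocatorAction, desc rangeDesc, zone zoneConfig) bool {
	if action != allocatorNoop {
		return false // an add/remove/repair action takes priority over rebalancing
	}
	if len(desc.replicaStoreIDs) > zone.numReplicas {
		return false // over-replicated: let the replicate queue shed replicas first
	}
	return true
}

func main() {
	over := rangeDesc{replicaStoreIDs: []int{1, 2, 3, 4, 5, 6}}
	fmt.Println(shouldConsiderRebalance(allocatorNoop, over, zoneConfig{numReplicas: 3})) // false
}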
@a-robinson
Contributor

There's also the fact that we compute everything based on the replica's local copy of the range descriptor. If that's out of date, we could also find ourselves adding too many replicas. Is it possible for the leaseholder of a range to have an out-of-date range descriptor?

@a-robinson
Contributor

If it makes you feel any better, it's just the epoch-based lease metric that's broken (or confusingly designed, at best), not the leaseholders metric:

[screenshot: lease metrics graph, 2017-08-12 12:53 AM]

I'll send a fix.

@petermattis
Collaborator Author

There's also the fact that we compute everything based on the replica's local copy of the range descriptor. If that's out of date, we could also find ourselves adding too many replicas. Is it possible for the leaseholder of a range to have an out-of-date range descriptor?

Not that I know of. I think the leaseholder has to have an up-to-date copy of the range descriptor or lots of things would break.

@petermattis
Collaborator Author

In order to get faster splits and balancing, I have a local edit to kv that issues the splits in parallel and, after each split, issues a SCATTER. Apparently the SCATTER is the source of the problem, because if I turn it off the replicas/range ratio stays sane.

@benesch Is this behavior of SCATTER expected? I thought restore does a similar split-then-scatter loop.

@petermattis
Collaborator Author

Ah, I think I see what is happening. SCATTER currently adds N more replicas to a range, where N is the target replication factor. It then relies on the replicate queue to remove the extra replicas in the background. But in my setup, I have 512 workers splitting and scattering, while there are only 64 replicate queues (1 per node). After one split+scatter we'll have a range with ~6 replicas; then perhaps we remove a replica, and then another split+scatter arrives and increases this to 5-12 replicas. Rinse. Repeat. In a cluster with fewer nodes there is a natural limiter in the number of rebalance targets. Not so on sky.

The reason I added the SCATTER in the first place is that without scattering after splitting, the replicate queue can't keep up with the splitting and we end up with a massive imbalance of replicas and leaseholders on a few nodes.

@benesch, @a-robinson I recall there being discussion during the initial scatter implementation about removing unneeded replicas, but not the details. Do you recall why that wasn't implemented?

@benesch
Contributor

benesch commented Aug 12, 2017

Yeah, waiting for downreplication was implemented then removed because it proved too slow. This was before I discovered (and fixed) the general performance badness caused by range tombstones on empty ranges, so it's possible we should restore downreplication.

Anyway, SCATTER is currently designed for an all-splits, all-scatters pipeline. If you just want a balanced set of 1m ranges, @petermattis, I'd recommend issuing one massive SCATTER command after splits finish.

It's true that RESTORE just switched to a split-scatter-at-a-time model, but the SCATTER implementation wasn't changed to deal with this. Waiting for downreplication is far more feasible when you're not waiting for 16k ranges to downreplicate in one retry loop with a 60s timeout.

@benesch
Contributor

benesch commented Aug 12, 2017

All that said, the demands on SCATTER are varied enough that it deserves a dedicated queue (i.e., a scatter queue, which might be implemented simply as a range-local flag that influences the replicate queue's behavior) or separate implementations for the different use cases.

@petermattis
Collaborator Author

Anyway, SCATTER is currently designed for an all-splits, all-scatters pipeline. If you just want a balanced set of 1m ranges, @petermattis, I'd recommend issuing one massive SCATTER command after splits finish.

This is what I started with (perform all splits, then all scatters) and it's problematic because the splitting happens faster than the replicate queue can keep up with, and we end up with a massive imbalance that actually slows down the splitting.

I think the root problem here is that the replicate queue is throttled by being a single goroutine per node, while SPLIT AT and SCATTER are limited only by how much concurrency the client wishes to use. In my case, I have 512 goroutines performing split+scatter and only 64 goroutines in replicate queues working to keep up with the mess I'm creating.

@a-robinson
Contributor

Calling scatter after every split is definitely a bad idea given the current implementation. You could try calling it after every N splits, where N is somewhere in the tens or hundreds, but you'll still probably hit issues due to the concurrency differences you noted. The replicate queue is not currently tuned to work quite so quickly.

Is this something that's blocking effective use of sky? Otherwise I'm not sure it merits much effort before 1.1.

@petermattis
Collaborator Author

Is this something that's blocking effective use of sky? Otherwise I'm not sure it merits much effort before 1.1.

It is blocking trying to create 1m ranges. I'm on to testing other things on sky (like figuring out why performance is lower than expected).

@benesch
Contributor

benesch commented Aug 14, 2017

@a-robinson what about the current implementation makes issuing a scatter command after every split problematic? We no longer add jitter server-side.

@petermattis
Collaborator Author

@benesch Did we ever commit the version of scatter that removed replicas after adding new ones? I couldn't find it earlier today.

@a-robinson
Contributor

@a-robinson what about the current implementation makes issuing a scatter command after every split problematic? We no longer add jitter server-side.

I guess if you're limiting the span of the scatter to just the range that was split, it's effectively the same. I was assuming he was scattering the entire table every time, but that's probably not the case.

@benesch
Contributor

benesch commented Aug 14, 2017

@petermattis, never committed. Here's the implementation from Reviewable:

// Wait for the new leaseholder's replicate queue to downreplicate the range
// to the appropriate replication factor. Issuing REMOVE_REPLICA requests here
// would be dangerous, as we'd race with the replicate queue and potentially
// cause underreplication.
if err := retry.WithMaxAttempts(ctx, retryOptions, maxAttempts, func() error {
	// Get an updated range descriptor, as we might sleep for several seconds
	// between retries.
	desc = repl.Desc()

	action, _ := repl.store.allocator.ComputeAction(ctx, zone, desc)
	if action != AllocatorNoop {
		if log.V(2) {
			log.Infof(ctx, "scatter: allocator requests action %s, sleeping and retrying", action)
		}
		// Returning an error here makes the retry loop wait and try again
		// until the allocator no longer wants to change this range.
		return errors.Errorf("allocator requests action %s", action)
	}
	return nil
}); err != nil {
	return err
}

Point being it never actually issued remove replica commands; it just waited for the replicate queue to sort it out. @RaduBerinde's original implementation, which was just a wrapper around TestingRelocateRanges, did issue remove replica commands, but could cause massive underreplication due to racing with the replicate queue.

@petermattis
Collaborator Author

I guess if you're limiting the span of the scatter to just the range that was split, it's effectively the same. I was assuming he was scattering the entire table every time, but that's probably not the case.

I am limiting the scatter to the range being split. It actually isn't the same. I'm trying to parallelize splitting a table into N pieces. I do a little divide-and-conquer algorithm where I split at the middle split point, scatter, then recurse into the left and right parts. Consider what happens after the first split+scatter. We now have 2 ranges where each range has 6 replicas. The replicate queue will be trying to down replicate, but my load generator is sending splits for these 2 ranges and will then scatter again which adds 3 new replicas.

If a single scatter is performed at the end, the max replicas a range will see is 6. With the split+scatter approach, the max replicas for a range is bounded only by the number of nodes in the cluster.
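
For concreteness, here is a rough, single-threaded sketch of that divide-and-conquer loop (the real load generator fans the recursion out across hundreds of workers). The table name, connection string, and key range are made up, and the SPLIT AT / SCATTER statement forms follow the CockroachDB SQL docs rather than the exact 2017-era syntax:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

// splitScatter splits the key span [lo, hi) at its midpoint, scatters the
// newly created range, and recurses into both halves.
func splitScatter(db *sql.DB, lo, hi int64) error {
	if hi-lo <= 1 {
		return nil
	}
	mid := lo + (hi-lo)/2
	if _, err := db.Exec(fmt.Sprintf("ALTER TABLE test.kv SPLIT AT VALUES (%d)", mid)); err != nil {
		return err
	}
	// Scatter only the range that was just created, not the whole table.
	if _, err := db.Exec(fmt.Sprintf("ALTER TABLE test.kv SCATTER FROM (%d) TO (%d)", mid, hi)); err != nil {
		return err
	}
	if err := splitScatter(db, lo, mid); err != nil {
		return err
	}
	return splitScatter(db, mid, hi)
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Split the keyspace into ~1m pieces.
	if err := splitScatter(db, 0, 1<<20); err != nil {
		log.Fatal(err)
	}
}

With this shape, each scatter can leave the just-split range over-replicated before the replicate queue catches up, which is exactly the churn described above.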

@petermattis
Collaborator Author

Thanks, @benesch. Waiting for the replicate queue is going to be slow. Perhaps we can mark the replica to be ignored by the replicate queue (temporarily) while we perform the down-replication.

@bdarnell
Contributor

Replica.ChangeReplicas() takes a RangeDescriptor to guard against issues like this (the replica change is aborted if the current range descriptor no longer matches the one that was passed in). It's not exposed through the ChangeReplicas RPC, but if it were then I think we could safely remove replicas without risking double-removals due to the replication queue.
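
To illustrate the idea, here is a simplified sketch of that expected-descriptor guard using stand-in types (the real check happens inside Replica.ChangeReplicas against the descriptor stored with the range):

package main

import (
	"errors"
	"fmt"
	"reflect"
)

// rangeDescriptor is a stand-in for roachpb.RangeDescriptor.
type rangeDescriptor struct {
	replicaStoreIDs []int
}

// removeReplicaGuarded removes a replica only if the range descriptor still
// matches the one the caller observed. If another actor (e.g. the replicate
// queue) has already changed the descriptor, the change is aborted instead
// of risking a double removal.
func removeReplicaGuarded(current *rangeDescriptor, expected rangeDescriptor, storeID int) error {
	if !reflect.DeepEqual(*current, expected) {
		return errors.New("range descriptor changed; aborting replica change")
	}
	var filtered []int
	for _, id := range current.replicaStoreIDs {
		if id != storeID {
			filtered = append(filtered, id)
		}
	}
	current.replicaStoreIDs = filtered
	return nil
}

func main() {
	desc := &rangeDescriptor{replicaStoreIDs: []int{1, 2, 3, 4}}
	observed := *desc
	fmt.Println(removeReplicaGuarded(desc, observed, 4)) // <nil>
	// A second removal based on the same stale observation is rejected.
	fmt.Println(removeReplicaGuarded(desc, observed, 3)) // descriptor changed
}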

@petermattis
Collaborator Author

@bdarnell Interesting point. We don't use the ChangeReplicas RPC in Replica.adminScatter. I'm not sure why this doesn't just work.

@benesch
Contributor

benesch commented Aug 14, 2017 via email

@petermattis
Collaborator Author

Ok, I'm taking a look at this.

petermattis added a commit to petermattis/cockroach that referenced this issue Aug 15, 2017
Rather than using custom logic in Replica.adminScatter, call into
replicateQueue.processOneChange until no further action needs to be
taken. This fixes a major caveat with the prior implementation: replicas
were not removed synchronously, leading to conditions where an arbitrary
number of replicas could be added to a range via scattering.

Fixes cockroachdb#17612
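
Sketched in isolation, the shape of that fix looks roughly like the following; processOneChange here is a stand-in closure, not the actual replicateQueue method signature:

package main

import (
	"context"
	"fmt"
)

// scatterUntilStable drives a single "process one replication change" step in
// a loop until it reports that no further action is needed, so any extra
// replicas introduced by scattering are removed before the scatter returns.
func scatterUntilStable(ctx context.Context, processOneChange func(context.Context) (requeue bool, err error)) error {
	for {
		requeue, err := processOneChange(ctx)
		if err != nil {
			return err
		}
		if !requeue {
			// No more up- or down-replication is needed for this range; it
			// is back at its configured replication factor.
			return nil
		}
	}
}

func main() {
	// Toy stand-in: pretend three changes (e.g. removing surplus replicas)
	// are needed before the range is stable.
	remaining := 3
	err := scatterUntilStable(context.Background(), func(context.Context) (bool, error) {
		remaining--
		return remaining > 0, nil
	})
	fmt.Println(err) // <nil>
}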