storage: 17 replicas on a range configured for 3 #17612

Closed
petermattis opened this issue Aug 12, 2017 · 23 comments · Fixed by #17644

Comments

@petermattis
Collaborator

Noticed on sky that after creating 1m ranges we have 9m replicas. Chose a random range (1000) and found 17 replicas. Looks like the replicate queue is churning away removing replicas, but how did we get into this situation? Seems like the AllocatorRemove action is preferred over AllocatorNoop, though there is this one oddity:

	liveReplicas, deadReplicas := a.storePool.liveAndDeadReplicas(rangeInfo.Desc.RangeID, rangeInfo.Desc.Replicas)
	if len(liveReplicas) < quorum {
		// Do not take any removal action if we do not have a quorum of live
		// replicas.
		return AllocatorNoop, 0
	}
@petermattis added this to the 1.1 milestone Aug 12, 2017
@petermattis
Collaborator Author

Perhaps related, the Leases metrics are reporting 2.95m epoch leases. How can there be more epoch leases than ranges?

@a-robinson
Contributor

Perhaps related, the Leases metrics are reporting 2.95m epoch leases. How can there be more epoch leases than ranges?

Wow, it was actually at 10 million leases cluster-wide. That's bonkers; the metric has to be broken somehow.

@a-robinson
Contributor

And good find on the AllocatorNoop - it seems as though even if we choose AllocatorNoop as our action for this reason, we need to avoid doing any rebalancing if the number of replicas is greater than the target number. I don't know whether that's the only issue involved here, though.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 12, 2017
Previously if a no-op was returned from Allocator.ComputeAction, we
took that as a signal to try rebalancing the range, which can cause a
new replica to be added. However, we sometimes return a no-op when it
really is the case that no action should be taken, and rebalancing might
do more harm than good.

Addresses (but might not fix) cockroachdb#17612
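
For illustration, here is a minimal, self-contained sketch of the guard described in that commit, using stand-in types rather than the actual allocator, range descriptor, and zone config types; the real change lives in the replicate queue's processing logic.

package main

import "fmt"

// Stand-in types for the sketch; the real code uses the allocator's action
// enum, roachpb.RangeDescriptor, and the zone config.
type allocatorAction int

const (
	allocatorNoop allocatorAction = iota
	allocatorAdd
	allocatorRemove
)

type rangeDesc struct {
	replicaStoreIDs []int
}

type zoneConfig struct {
	numReplicas int
}

// shouldConsiderRebalance returns true only when the allocator has nothing
// else to do AND the range is not already carrying more replicas than the
// zone config asks for. Otherwise, attempting a rebalance would add yet
// another replica to an already over-replicated range.
func shouldConsiderRebalance(action allocatorAction, desc rangeDesc, zone zoneConfig) bool {
	if action != allocatorNoop {
		return false // an add/remove/repair action takes priority over rebalancing
	}
	if len(desc.replicaStoreIDs) > zone.numReplicas {
		return false // over-replicated: let the replicate queue shed replicas first
	}
	return true
}

func main() {
	over := rangeDesc{replicaStoreIDs: []int{1, 2, 3, 4, 5, 6}}
	fmt.Println(shouldConsiderRebalance(allocatorNoop, over, zoneConfig{numReplicas: 3})) // false
}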
@a-robinson
Contributor

There's also the fact that we compute everything based on the replica's local copy of the range descriptor. If that's out of date, we could also find ourselves adding too many replicas. Is it possible for the leaseholder of a range to have an out-of-date range descriptor?

@a-robinson
Contributor

If it makes you feel any better, it's just the epoch-based lease metric that's broken (or confusingly designed, at best), not the leaseholders metric:

[screenshot: lease metrics graph, 2017-08-12 12:53 AM]

I'll send a fix.

@petermattis
Collaborator Author

There's also the fact that we compute everything based on the replica's local copy of the range descriptor. If that's out of date, we could also find ourselves adding too many replicas. Is it possible for the leaseholder of a range to have an out-of-date range descriptor?

Not that I know of. I think the leaseholder has to have an up-to-date copy of the range descriptor or lots of things would break.

@petermattis
Collaborator Author

In order to get faster splits and balancing, I have a local edit to kv that issues the splits in parallel and, after each split, issues a SCATTER. Apparently the SCATTER is the source of the problem, because if I turn it off the replicas/range ratio stays sane.

@benesch Is this behavior of SCATTER expected? I thought restore does a similar split-then-scatter loop.

@petermattis
Collaborator Author

Ah, I think I see what is happening. SCATTER currently adds N more replicas to a range, where N is the target replication factor. It then relies on the replicate queue to remove the extra replicas in the background. But in my setup, I have 512 workers splitting and scattering, while there are only 64 replicate queues (1 per node). After one split+scatter we'll have a range with ~6 replicas; then perhaps we remove a replica, and then another split+scatter arrives and increases this to 5-12 replicas. Rinse. Repeat. In a cluster with fewer nodes there is a natural limiter in the number of rebalance targets. Not so on sky.

The reason I added the SCATTER in the first place is that without scattering after splitting, the replicate queue can't keep up with the splitting and we end up with a massive imbalance of replicas and leaseholders on a few nodes.

@benesch, @a-robinson I recall there being discussion during the initial scatter implementation about removing unneeded replicas, but not the details. Do you recall why that wasn't implemented?

@benesch
Contributor

benesch commented Aug 12, 2017

Yeah, waiting for downreplication was implemented then removed because it proved too slow. This was before I discovered (and fixed) the general performance badness caused by range tombstones on empty ranges, so it's possible we should restore downreplication.

Anyway, SCATTER is currently designed for an all-splits, all-scatters pipeline. If you just want a balanced set of 1m ranges, @petermattis, I'd recommend issuing one massive SCATTER command after splits finish.

It's true that RESTORE just switched to a split-scatter-at-a-time model, but the SCATTER implementation wasn't changed to deal with this. Waiting for downreplication is far more feasible when you're not waiting for 16k ranges to downreplicate in one retry loop with a 60s timeout.

@benesch
Contributor

benesch commented Aug 12, 2017

All that said, the demands on SCATTER are varied enough that it deserves a dedicated queue (i.e., a scatter queue, which might be implemented simply as a range-local flag that influences the replicate queue's behavior) or separate implementations for the different use cases.

@petermattis
Collaborator Author

Anyway, SCATTER is currently designed for an all-splits, all-scatters pipeline. If you just want a balanced set of 1m ranges, @petermattis, I'd recommend issuing one massive SCATTER command after splits finish.

This is what I started with (perform all splits, then all scatters) and it's problematic because the splitting happens faster than the replicate queue can keep up with, and we end up with a massive imbalance that actually slows down the splitting.

I think the root problem here is that the replicate queue is throttled by being a single goroutine per node, while SPLIT AT and SCATTER are limited only by how much concurrency the client wishes to use. In my case, I have 512 goroutines performing split+scatter and only 64 goroutines in replicate queues working to keep up with the mess I'm creating.

@a-robinson
Contributor

Calling scatter after every split is definitely a bad idea given the current implementation. You could try calling it after every N splits, where N is somewhere in the tens or hundreds, but you'll still probably hit issues due to the concurrency differences you noted. The replicate queue is not currently tuned to work quite so quickly.

Is this something that's blocking effective use of sky? Otherwise I'm not sure it merits much effort before 1.1.

@petermattis
Collaborator Author

Is this something that's blocking effective use of sky? Otherwise I'm not sure it merits much effort before 1.1.

It is blocking trying to create 1m ranges. I'm on to testing other things on sky (like figuring out why performance is lower than expected).

@benesch
Contributor

benesch commented Aug 14, 2017

@a-robinson what about the current implementation makes issuing a scatter command after every split problematic? We no longer add jitter server-side.

@petermattis
Collaborator Author

@benesch Did we ever commit the version of scatter that removed replicas after adding new ones? I couldn't find it earlier today.

@a-robinson
Contributor

@a-robinson what about the current implementation makes issuing a scatter command after every split problematic? We no longer add jitter server-side.

I guess if you're limiting the span of the scatter to just the range that was split, it's effectively the same. I was assuming he was scattering the entire table every time, but that's probably not the case.

@benesch
Contributor

benesch commented Aug 14, 2017

@petermattis, never committed. Here's the implementation from Reviewable:

// Wait for the new leaseholder's replicate queue to downreplicate the range
// to the appropriate replication factor. Issuing REMOVE_REPLICA requests here
// would be dangerous, as we'd race with the replicate queue and potentially
// cause underreplication.
if err := retry.WithMaxAttempts(ctx, retryOptions, maxAttempts, func() error {
	// Get an updated range descriptor, as we might sleep for several seconds
	// between retries.
	desc = repl.Desc()

	action, _ := repl.store.allocator.ComputeAction(ctx, zone, desc)
	if action != AllocatorNoop {
		if log.V(2) {
			log.Infof(ctx, "scatter: allocator requests action %s, sleeping and retrying", action)
		}
		// Returning an error here makes the retry loop wait and try again
		// until the allocator no longer wants to change this range.
		return errors.Errorf("allocator requests action %s", action)
	}
	return nil
}); err != nil {
	return err
}

Point being it never actually issued remove replica commands; it just waited for the replicate queue to sort it out. @RaduBerinde's original implementation, which was just a wrapper around TestingRelocateRanges, did issue remove replica commands, but could cause massive underreplication due to racing with the replicate queue.

@petermattis
Collaborator Author

I guess if you're limiting the span of the scatter to just the range that was split, it's effectively the same. I was assuming he was scattering the entire table every time, but that's probably not the case.

I am limiting the scatter to the range being split. It actually isn't the same. I'm trying to parallelize splitting a table into N pieces. I do a little divide-and-conquer algorithm where I split at the middle split point, scatter, then recurse into the left and right parts. Consider what happens after the first split+scatter. We now have 2 ranges where each range has 6 replicas. The replicate queue will be trying to down replicate, but my load generator is sending splits for these 2 ranges and will then scatter again which adds 3 new replicas.

If a single scatter is performed at the end, the max replicas a range will see is 6. With the split+scatter approach, the max replicas for a range is bounded only by the number of nodes in the cluster.
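
For concreteness, here is a rough, single-threaded sketch of that divide-and-conquer loop (the real load generator fans the recursion out across hundreds of workers). The table name, connection string, and key range are made up, and the SPLIT AT / SCATTER statement forms follow the CockroachDB SQL docs rather than the exact 2017-era syntax:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

// splitScatter splits the key span [lo, hi) at its midpoint, scatters the
// newly created range, and recurses into both halves.
func splitScatter(db *sql.DB, lo, hi int64) error {
	if hi-lo <= 1 {
		return nil
	}
	mid := lo + (hi-lo)/2
	if _, err := db.Exec(fmt.Sprintf("ALTER TABLE test.kv SPLIT AT VALUES (%d)", mid)); err != nil {
		return err
	}
	// Scatter only the range that was just created, not the whole table.
	if _, err := db.Exec(fmt.Sprintf("ALTER TABLE test.kv SCATTER FROM (%d) TO (%d)", mid, hi)); err != nil {
		return err
	}
	if err := splitScatter(db, lo, mid); err != nil {
		return err
	}
	return splitScatter(db, mid, hi)
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Split the keyspace into ~1m pieces.
	if err := splitScatter(db, 0, 1<<20); err != nil {
		log.Fatal(err)
	}
}

With this shape, each scatter can leave the just-split range over-replicated before the replicate queue catches up, which is exactly the churn described above.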

@petermattis
Collaborator Author

Thanks, @benesch. Waiting for the replicate queue is going to be slow. Perhaps we can mark the replica to be ignored by the replicate queue (temporarily) while we perform the down-replication.

@bdarnell
Contributor

Replica.ChangeReplicas() takes a RangeDescriptor to guard against issues like this (the replica change is aborted if the current range descriptor no longer matches the one that was passed in). It's not exposed through the ChangeReplicas RPC, but if it were then I think we could safely remove replicas without risking double-removals due to the replication queue.
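
To illustrate the idea, here is a simplified sketch of that expected-descriptor guard using stand-in types (the real check happens inside Replica.ChangeReplicas against the descriptor stored with the range):

package main

import (
	"errors"
	"fmt"
	"reflect"
)

// rangeDescriptor is a stand-in for roachpb.RangeDescriptor.
type rangeDescriptor struct {
	replicaStoreIDs []int
}

// removeReplicaGuarded removes a replica only if the range descriptor still
// matches the one the caller observed. If another actor (e.g. the replicate
// queue) has already changed the descriptor, the change is aborted instead
// of risking a double removal.
func removeReplicaGuarded(current *rangeDescriptor, expected rangeDescriptor, storeID int) error {
	if !reflect.DeepEqual(*current, expected) {
		return errors.New("range descriptor changed; aborting replica change")
	}
	var filtered []int
	for _, id := range current.replicaStoreIDs {
		if id != storeID {
			filtered = append(filtered, id)
		}
	}
	current.replicaStoreIDs = filtered
	return nil
}

func main() {
	desc := &rangeDescriptor{replicaStoreIDs: []int{1, 2, 3, 4}}
	observed := *desc
	fmt.Println(removeReplicaGuarded(desc, observed, 4)) // <nil>
	// A second removal based on the same stale observation is rejected.
	fmt.Println(removeReplicaGuarded(desc, observed, 3)) // descriptor changed
}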

@petermattis
Collaborator Author

@bdarnell Interesting point. We don't use the ChangeReplicas RPC in Replica.adminScatter. I'm not sure why this doesn't just work.

@benesch
Contributor

benesch commented Aug 14, 2017 via email

@petermattis
Collaborator Author

Ok, I'm taking a look at this.

petermattis added a commit to petermattis/cockroach that referenced this issue Aug 15, 2017
Rather than using custom logic in Replica.adminScatter, call into
replicateQueue.processOneChange until no further action needs to be
taken. This fixes a major caveat with the prior implementation: replicas
were not removed synchronously, leading to conditions where an arbitrary
number of replicas could be added to a range via scattering.

Fixes cockroachdb#17612
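
Sketched in isolation, the shape of that fix looks roughly like the following; processOneChange here is a stand-in closure, not the actual replicateQueue method signature:

package main

import (
	"context"
	"fmt"
)

// scatterUntilStable drives a single "process one replication change" step in
// a loop until it reports that no further action is needed, so any extra
// replicas introduced by scattering are removed before the scatter returns.
func scatterUntilStable(ctx context.Context, processOneChange func(context.Context) (requeue bool, err error)) error {
	for {
		requeue, err := processOneChange(ctx)
		if err != nil {
			return err
		}
		if !requeue {
			// No more up- or down-replication is needed for this range; it
			// is back at its configured replication factor.
			return nil
		}
	}
}

func main() {
	// Toy stand-in: pretend three changes (e.g. removing surplus replicas)
	// are needed before the range is stable.
	remaining := 3
	err := scatterUntilStable(context.Background(), func(context.Context) (bool, error) {
		remaining--
		return remaining > 0, nil
	})
	fmt.Println(err) // <nil>
}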