
storage: avoid splitting ranges that cannot be split #9555

Closed
knz opened this issue Sep 26, 2016 · 16 comments

@knz (Contributor) commented Sep 26, 2016

When a single KV value becomes larger than the nominal range size, the range grows with it and a split is initiated, but the split cannot succeed because the entire range is occupied by a single value. This causes spurious errors and puts unnecessary pressure on the split queue.

Such "large" ranges should be marked unsplittable and never enter the split queue after they've been marked.

(Suggested by @paperstreet)

knz added this to the 1.0 milestone Sep 26, 2016
@knz (Contributor, Author) commented Sep 26, 2016

Remark by @dt: we can't work around this by simply setting a maximum value size and starting to enforce it, because for all we know there are already clusters with column families that exceed whatever limit we would pick, and there would then be no way to migrate away from that situation.

@knz (Contributor, Author) commented Sep 26, 2016

@BramGruneir we thought you've been looking at splits recently; is this perhaps something for you? Otherwise feel free to reassign.

@BramGruneir (Member) commented

This seems like a fairly important thing to have, but I don't have the time to work on it right now (and clearly didn't back in September). Unassigning for now.

BramGruneir removed their assignment Feb 22, 2017
@spencerkimball (Member) commented

At the very least, for a range that has this problem we should only retry the split after a backoff period instead of busily retrying. This is a real problem we should probably fix for 1.0. I'm assigning it to @petermattis; feel free to re-assign or remove the 1.0 milestone.

@petermattis (Collaborator) commented

Reproduction instructions:

echo -e "range_min_bytes: 1\nrange_max_bytes: $[64 * 1024]" | ./cockroach zone set -f=- .default
./cockroach sql -e "create database test"
./cockroach sql -e "create table test.kv (k string primary key, v string)"
./cockroach sql -e "insert into test.kv values ('a', repeat('x', 65536))"

The split queue then loops, repeatedly trying to split the range:

I170405 15:06:52.945067 137 storage/split_queue.go:118  [split,n1,s1,r10/1:/{Table/51/1/"…-Max},@c420afac00] splitting size=65565 max=65536
I170405 15:06:52.946272 137 storage/split_queue.go:118  [split,n1,s1,r10/1:/{Table/51/1/"…-Max},@c420afac00] splitting size=65565 max=65536
I170405 15:06:54.373220 137 storage/split_queue.go:118  [split,n1,s1,r10/1:/{Table/51/1/"…-Max},@c420afac00] splitting size=65565 max=65536
I170405 15:06:56.805338 137 storage/split_queue.go:118  [split,n1,s1,r10/1:/{Table/51/1/"…-Max},@c420afac00] splitting size=65565 max=65536
I170405 15:07:00.237504 137 storage/split_queue.go:118  [split,n1,s1,r10/1:/{Table/51/1/"…-Max},@c420afac00] splitting size=65565 max=65536

@petermattis (Collaborator) commented

The specific cause of the loop is that replica.adminSplitWithDescriptor chooses the start key as the split key and returns nil, with an event log entry indicating that the range is already split at the specified key. The start key is chosen as the split key because EnsureSafeSplitKey "truncates" the key returned by MVCCFindSplitKey to a row boundary.
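
To make the failure concrete, here is a minimal sketch of the truncation behavior; the key layout and truncateToRowBoundary below are simplified stand-ins for the real MVCCFindSplitKey/EnsureSafeSplitKey logic, not the actual CockroachDB code:

```go
package main

import (
	"bytes"
	"fmt"
)

// truncateToRowBoundary is a simplified stand-in for EnsureSafeSplitKey: it
// drops the column-family suffix so a split never lands in the middle of a row.
func truncateToRowBoundary(key []byte) []byte {
	if i := bytes.LastIndexByte(key, '/'); i > 0 {
		return key[:i]
	}
	return key
}

func main() {
	// A range whose only row is the oversized one: every key shares the same
	// row prefix, and the range itself starts at that prefix.
	rangeStart := []byte("/Table/51/1/a")
	keys := [][]byte{
		[]byte("/Table/51/1/a/0"), // column family holding the 64KB value
		[]byte("/Table/51/1/a/1"),
	}

	// The size-based heuristic (MVCCFindSplitKey) picks a key near the size
	// midpoint; truncating it to the row boundary collapses it back to the
	// range's start key, so the split is a no-op and the queue retries forever.
	candidate := keys[len(keys)/2]
	safe := truncateToRowBoundary(candidate)
	if bytes.Equal(safe, rangeStart) {
		fmt.Printf("split key %q equals range start; range cannot be split\n", safe)
	}
}
```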

@petermattis (Collaborator) commented

We could adjust Replica.mu.maxBytes higher to avoid repeatedly trying to split a range that can't be split. Alternatively, we could add a Replica.mu.skipSplitBySize flag, but when would we clear it? When we can't split a range because it is too large, something might eventually change to allow the split to succeed. For example, a single large row might be overwritten with a smaller row, and GC would eventually remove the large value.

@spencerkimball (Member) commented

How about adding a replica.mu.maxBytesFactor variable which starts at 1, doubles any time a range split fails, and resets to 1 on a successful split?

@petermattis (Collaborator) commented

Hmm, that might work, though I'm not sure we need another field. If we can't find a split key for a size-based split, we can call Replica.SetMaxBytes(Replica.GetMVCCStats().Total()). This will prevent another split from being attempted until the range grows in size. Upon a successful split, we call Replica.SetMaxBytes(zone.RangeMaxBytes).
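
A rough sketch of what that would look like, using simplified, hypothetical types; the real Replica and split-queue plumbing is more involved:

```go
package storage

import "sync"

// replica is a pared-down stand-in for the real Replica type; only the
// fields needed to illustrate the maxBytes adjustment are shown.
type replica struct {
	mu struct {
		sync.Mutex
		maxBytes int64 // size threshold that triggers a size-based split
	}
	totalBytes func() int64 // stand-in for GetMVCCStats().Total()
}

func (r *replica) setMaxBytes(b int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.mu.maxBytes = b
}

// maybeSplitBySize sketches the proposed behavior: if no valid split key can
// be found, raise maxBytes to the range's current size so the split queue
// stops retrying until the range grows; on a successful split, restore the
// zone's configured maximum.
func (r *replica) maybeSplitBySize(zoneRangeMaxBytes int64, findSplitKey func() (ok bool)) {
	if !findSplitKey() {
		r.setMaxBytes(r.totalBytes()) // back off until the range grows again
		return
	}
	// ... perform the split ...
	r.setMaxBytes(zoneRangeMaxBytes) // reset to the zone's configured limit
}
```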

@spencerkimball (Member) commented

Sounds like we'd still get plenty of complaints. I'd suggest multiplying the current size by two. But the reason I suggested another variable is that we call SetMaxBytes whenever zone configs are gossiped.

@petermattis (Collaborator) commented

> Sounds like we'd still get plenty of complaints.

I'd rather err on the side of resetting, though. What additional complaints are you worried about? I have my suggestion implemented and it works for my test case. If you add another row to the range, the range gets split.

> But the reason I suggested another variable is that we call SetMaxBytes whenever zone configs are gossiped.

True, but attempting another split and failing isn't terrible. It's the repeated retries we're trying to avoid, and I'd rather retry once than risk a bug where the target range size grows huge and we never try to split the range again.

petermattis added a commit to petermattis/cockroach that referenced this issue Apr 6, 2017
If a range cannot be split because a valid split key cannot be found,
set the max size for the range to the range's current size. When a range
is successfully split, reset the max range size to the zone's max range
size. This avoids situations where a range with a single large row
cannot be split.

Fixes cockroachdb#9555
@nvanbenschoten (Member) commented Apr 24, 2018

I'm trying to understand the decision we made here, especially in the context of #24215. Since we're doubling the size when a split point can't be found, we're not actually doing anything to help us find a split point or reduce the range size in the future. Instead, all we're accomplishing is avoiding the tight loop of retrying split attempts of an unsplittable range over and over again. That's fine, and this change was a step in the right direction, but if that's all we were trying to accomplish, why didn't we just put the replica in the splitQueue's purgatory queue to reduce the frequency of split attempts to a more manageable rate?

@petermattis (Collaborator) commented

The purgatory queues are only checked when there are cluster changes, while a range with a single large value should be checked for splitting whenever additional keys are written to it. Am I misunderstanding how you'd want to use the purgatory queue?

@nvanbenschoten (Member) commented

> The purgatory queues are only checked when there are cluster changes

Are you sure? That looks to me like a detail of the replicateQueue specifically. We should be able to return a timer channel from the splitQueue's purgatoryChan() method so that unsplittable ranges are re-assessed periodically. Or we could ping the channel whenever a range is GCed. Either way, I think the purgatory queue is the right place for these ranges, and its interface is flexible enough to fit our needs.
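
For illustration, a minimal sketch of a periodic purgatory channel; the splitQueuePurgatory type and the purgatoryChan signature shown here are assumptions, not the actual queueImpl interface in storage/queue.go:

```go
package storage

import "time"

// splitQueuePurgatory sketches the idea: replicas for which no split key
// could be found sit in purgatory and are re-assessed whenever the
// purgatory channel fires, instead of being retried in a tight loop.
type splitQueuePurgatory struct {
	retryInterval time.Duration
}

// purgatoryChan returns a timer channel so that unsplittable ranges are
// re-assessed periodically. It could additionally be signaled whenever a
// range is GCed, since GC may have removed the oversized value.
func (q *splitQueuePurgatory) purgatoryChan() <-chan time.Time {
	return time.NewTicker(q.retryInterval).C
}
```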

@petermattis (Collaborator) commented

> Are you sure?

No, I'm not sure. I haven't looked at the purgatory queue in a while. My statement was made from memory and doesn't necessarily reflect reality.

@bdarnell (Contributor) commented

Yeah, purgatory makes sense for this. The only reason that it wasn't used is that it's not widely understood, as seen here.
