
storage: avoid splitting ranges that cannot be split #9555

Closed
knz opened this issue Sep 26, 2016 · 16 comments

@knz (Contributor) commented Sep 26, 2016

When a single KV value becomes larger than the nominal range size, the range grows with it and a split is initiated, but the split cannot succeed because the entire range is occupied by a single value. This causes spurious errors and puts unnecessary pressure on the split queue.

Such "large" ranges should be marked unsplittable and never enter the split queue after they've been marked.

(Suggested by @paperstreet)

knz added this to the 1.0 milestone Sep 26, 2016
@knz (Contributor, Author) commented Sep 26, 2016

Remark by @dt: we can't work around this by simply setting a maximum value size and starting to enforce it, because for all we know there are already clusters with column families that exceed whatever limit we would pick, and there would then be no way to migrate away from that situation.

@knz (Contributor, Author) commented Sep 26, 2016

@BramGruneir we thought you've been looking at splits recently; is this perhaps something for you? Otherwise feel free to reassign.

@BramGruneir (Member) commented

This seems like a fairly important thing to have, but I don't have the time to work on it right now (and clearly didn't back in September). Unassigning for now.

BramGruneir removed their assignment Feb 22, 2017
@spencerkimball (Member) commented

At the very least, for a range that has this problem we should only retry the split after a backoff period instead of busily retrying. This is a real problem we should probably fix for 1.0. I'm assigning it to @petermattis; feel free to re-assign or remove the 1.0 milestone.

@petermattis (Collaborator) commented

Reproduction instructions:

echo -e "range_min_bytes: 1\nrange_max_bytes: $[64 * 1024]" | ./cockroach zone set -f=- .default
./cockroach sql -e "create database test"
./cockroach sql -e "create table test.kv (k string primary key, v string)"
./cockroach sql -e "insert into test.kv values ('a', repeat('x', 65536))"

The split queue then loops, repeatedly trying to split the range:

I170405 15:06:52.945067 137 storage/split_queue.go:118  [split,n1,s1,r10/1:/{Table/51/1/"…-Max},@c420afac00] splitting size=65565 max=65536
I170405 15:06:52.946272 137 storage/split_queue.go:118  [split,n1,s1,r10/1:/{Table/51/1/"…-Max},@c420afac00] splitting size=65565 max=65536
I170405 15:06:54.373220 137 storage/split_queue.go:118  [split,n1,s1,r10/1:/{Table/51/1/"…-Max},@c420afac00] splitting size=65565 max=65536
I170405 15:06:56.805338 137 storage/split_queue.go:118  [split,n1,s1,r10/1:/{Table/51/1/"…-Max},@c420afac00] splitting size=65565 max=65536
I170405 15:07:00.237504 137 storage/split_queue.go:118  [split,n1,s1,r10/1:/{Table/51/1/"…-Max},@c420afac00] splitting size=65565 max=65536

@petermattis (Collaborator) commented

The specific cause of the loop is that replica.adminSplitWithDescriptor chooses the start key as the split key and returns nil, with an event log entry indicating that the range is already split at the specified key. The start key is chosen as the split key because EnsureSafeSplitKey "truncates" the key returned by MVCCFindSplitKey to a row boundary.
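
To make the failure concrete, here is a minimal sketch of the truncation behavior; the key layout and truncateToRowBoundary below are simplified stand-ins for the real MVCCFindSplitKey/EnsureSafeSplitKey logic, not the actual CockroachDB code:

```go
package main

import (
	"bytes"
	"fmt"
)

// truncateToRowBoundary is a simplified stand-in for EnsureSafeSplitKey: it
// drops the column-family suffix so a split never lands in the middle of a row.
func truncateToRowBoundary(key []byte) []byte {
	if i := bytes.LastIndexByte(key, '/'); i > 0 {
		return key[:i]
	}
	return key
}

func main() {
	// A range whose only row is the oversized one: every key shares the same
	// row prefix, and the range itself starts at that prefix.
	rangeStart := []byte("/Table/51/1/a")
	keys := [][]byte{
		[]byte("/Table/51/1/a/0"), // column family holding the 64KB value
		[]byte("/Table/51/1/a/1"),
	}

	// The size-based heuristic (MVCCFindSplitKey) picks a key near the size
	// midpoint; truncating it to the row boundary collapses it back to the
	// range's start key, so the split is a no-op and the queue retries forever.
	candidate := keys[len(keys)/2]
	safe := truncateToRowBoundary(candidate)
	if bytes.Equal(safe, rangeStart) {
		fmt.Printf("split key %q equals range start; range cannot be split\n", safe)
	}
}
```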

@petermattis (Collaborator) commented

We could adjust Replica.mu.maxBytes higher to avoid repeatedly trying to split a range that can't be split. Alternatively, we could add a Replica.mu.skipSplitBySize flag, but when would we clear it? When we can't split a range because it is too large, something might eventually change to allow the split to succeed. For example, a single large row might be overwritten with a smaller row, and GC would eventually remove the large value.

@spencerkimball (Member) commented

How about adding a replica.mu.maxBytesFactor variable which starts at 1, doubles any time a range split fails, and resets to 1 on a successful split?

@petermattis (Collaborator) commented

Hmm, that might work, though I'm not sure we need another field. If we can't find a split key for a size-based split, we can call Replica.SetMaxBytes(Replica.GetMVCCStats().Total()). This will prevent another split from being attempted until the range grows in size. Upon a successful split, we call Replica.SetMaxBytes(zone.RangeMaxBytes).
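
A rough sketch of what that would look like, using simplified, hypothetical types; the real Replica and split-queue plumbing is more involved:

```go
package storage

import "sync"

// replica is a pared-down stand-in for the real Replica type; only the
// fields needed to illustrate the maxBytes adjustment are shown.
type replica struct {
	mu struct {
		sync.Mutex
		maxBytes int64 // size threshold that triggers a size-based split
	}
	totalBytes func() int64 // stand-in for GetMVCCStats().Total()
}

func (r *replica) setMaxBytes(b int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.mu.maxBytes = b
}

// maybeSplitBySize sketches the proposed behavior: if no valid split key can
// be found, raise maxBytes to the range's current size so the split queue
// stops retrying until the range grows; on a successful split, restore the
// zone's configured maximum.
func (r *replica) maybeSplitBySize(zoneRangeMaxBytes int64, findSplitKey func() (ok bool)) {
	if !findSplitKey() {
		r.setMaxBytes(r.totalBytes()) // back off until the range grows again
		return
	}
	// ... perform the split ...
	r.setMaxBytes(zoneRangeMaxBytes) // reset to the zone's configured limit
}
```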

@spencerkimball (Member) commented

Sounds like we'd still get plenty of complaints. I'd suggest multiplying the current size by two. But the reason I suggested another variable is that we call SetMaxBytes whenever zone configs are gossiped.

@petermattis (Collaborator) commented

> Sounds like we'd still get plenty of complaints.

I'd rather err on the side of resetting, though. What additional complaints are you worried about? I have my suggestion implemented and it works for my test case. If you add another row to the range, the range gets split.

> But the reason I suggested another variable is that we call SetMaxBytes whenever zone configs are gossiped.

True, but attempting another split and failing isn't terrible. It's the repeated retries we're trying to avoid, and I'd rather retry once than risk a bug where the target range size grows huge and we never try to split the range again.

petermattis added a commit to petermattis/cockroach that referenced this issue Apr 6, 2017
If a range cannot be split because a valid split key cannot be found,
set the max size for the range to the range's current size. When a range
is successfully split, reset the max range size to the zone's max range
size. This avoids situations where a range with a single large row
cannot be split.

Fixes cockroachdb#9555
@nvanbenschoten (Member) commented Apr 24, 2018

I'm trying to understand the decision we made here, especially in the context of #24215. Since we're doubling the size when a split point can't be found, we're not actually doing anything to help us find a split point or reduce the range size in the future. Instead, all we're accomplishing is avoiding the tight loop of retrying split attempts of an unsplittable range over and over again. That's fine, and this change was a step in the right direction, but if that's all we were trying to accomplish, why didn't we just put the replica in the splitQueue's purgatory queue to reduce the frequency of split attempts to a more manageable rate?

@petermattis (Collaborator) commented

The purgatory queues are only checked when there are cluster changes, while a range with a single large value should be checked for splitting whenever additional keys are written to it. Am I misunderstanding how you'd want to use the purgatory queue?

@nvanbenschoten (Member) commented

> The purgatory queues are only checked when there are cluster changes

Are you sure? That looks to me like a detail of the replicateQueue specifically. We should be able to return a timer channel from the splitQueue's purgatoryChan() method so that unsplittable ranges are re-assessed periodically. Or we could ping the channel whenever a range is GCed. Either way, I think the purgatory queue is the right place for these ranges, and its interface is flexible enough to fit our needs.
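
For illustration, a minimal sketch of a periodic purgatory channel; the splitQueuePurgatory type and the purgatoryChan signature shown here are assumptions, not the actual queueImpl interface in storage/queue.go:

```go
package storage

import "time"

// splitQueuePurgatory sketches the idea: replicas for which no split key
// could be found sit in purgatory and are re-assessed whenever the
// purgatory channel fires, instead of being retried in a tight loop.
type splitQueuePurgatory struct {
	retryInterval time.Duration
}

// purgatoryChan returns a timer channel so that unsplittable ranges are
// re-assessed periodically. It could additionally be signaled whenever a
// range is GCed, since GC may have removed the oversized value.
func (q *splitQueuePurgatory) purgatoryChan() <-chan time.Time {
	return time.NewTicker(q.retryInterval).C
}
```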

@petermattis (Collaborator) commented

> Are you sure?

No, I'm not sure. I haven't looked at the purgatory queue in a while. My statement was made from memory and doesn't necessarily reflect reality.

@bdarnell (Contributor) commented

Yeah, purgatory makes sense for this. The only reason that it wasn't used is that it's not widely understood, as seen here.
