-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: avoid load based splits in middle of SQL row #103690
Conversation
8aa9339
to
5f3661f
Compare
f3c1f70
to
788e5b3
Compare
788e5b3
to
b341eb0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @nvanbenschoten, @renatolabs, and @smg260)
pkg/kv/kvserver/replica_split_load.go
line 263 at r3 (raw file):
// NOTE: The returned split key CAN BE BETWEEN A SQL ROW, The split key // returned should only be used to engage a split via adminSplitWithDescriptor // where findNextSafeKey is set to true.
Is it possible/easy to reproduce this bug in a unit test?
Feel free to ignore if it's not relevant; I'm just lurking and learning more about splits.
Code quote:
// NOTE: The returned split key CAN BE BETWEEN A SQL ROW, The split key
// returned should only be used to engage a split via adminSplitWithDescriptor
// where findNextSafeKey is set to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miraradeva, @nvanbenschoten, @renatolabs, and @smg260)
pkg/kv/kvserver/replica_split_load.go
line 263 at r3 (raw file):
Previously, miraradeva (Mira Radeva) wrote…
Is it possible/easy to reproduce this bug in a unit test?
Feel free to ignore if it's not relevant; I'm just lurking and learning more about splits.
Certainly possible. I'm writing up a test like this now.
b341eb0
to
f8848f9
Compare
f8848f9
to
36e00b5
Compare
96e8f6c
to
357fb7c
Compare
It was possible that no load-based split key was ever found when using the weighted finder. This was because we discard potentially unsafe samples, which could have been safe split points. This reverts commit e4f003b. Resolves: cockroachdb#103672
cfc8b3d
to
dbfe79a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r11, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @kvoli, @miraradeva, and @sumeerbhola)
pkg/kv/kvserver/replica_split_load.go
line 286 at r6 (raw file):
Previously, kvoli (Austen) wrote…
Why would we want to start searching from the row prefix? [...] the request key may not exist in the range but other column families in the same row do exist, and that searching from the request key may lead to splitting after it, instead of before it?
It is solely for the reason you gave, where all the keys for the row are <= the sampled split key.
Could we update the comment to mention that?
pkg/storage/mvcc.go
line 5963 at r10 (raw file):
Split at
f
,RHS
[f,k]
will have 100% of the load.
After this split, we will have carved off a cold range [a,f) from a hot range [f,k] and failed to bisect the load. However, we do expect that the range will attempt to load-based split again. When it does, it will fall into the "L is empty" case, and will split at g
. Is that correct?
pkg/storage/mvcc.go
line 5934 at r11 (raw file):
// the range which spans [startKey,endKey). The returned split key is safe, // meaning it is guaranteed to (1) not be a column family key within a table // row and be after the first table row in the range. For non-table ranges, the
I think you wanted "and (2) be" here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @kvoli, @miraradeva, and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go
line 355 at r10 (raw file):
Previously, kvoli (Austen) wrote…
On second thought, it may be better to not include the change in this PR. I already overloaded it with a few other things.
Ack
pkg/storage/mvcc.go
line 5963 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Split at
f
,RHS
[f,k]
will have 100% of the load.After this split, we will have carved off a cold range [a,f) from a hot range [f,k] and failed to bisect the load. However, we do expect that the range will attempt to load-based split again. When it does, it will fall into the "L is empty" case, and will split at
g
. Is that correct?
btw, I was not clear in my earlier example. I meant the initial f<i>
to be keys that are not just unsafe, but also get an error when passed to EnsureSafeSplitKey
. So replica.loadSplitKey
would have to pass it through to the code here.
I don't think this fact changes the general problem, which you elaborated on with more examples. And it also seems we are in agreement that if we do end up splitting off a cold range, the next split will likely correct the behavior. So we'll end up with 3 ranges instead of 2.
dbfe79a
to
a145013
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TYFTR, Sumeer mentioned over slack that his review should not block.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miraradeva, @nvanbenschoten, and @sumeerbhola)
pkg/kv/kvserver/replica_split_load.go
line 286 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could we update the comment to mention that?
Done.
pkg/storage/mvcc.go
line 5963 at r10 (raw file):
When it does, it will fall into the "L is empty" case, and will split at
g
. Is that correct?
Yes. There will be 2 splits (3 ranges) r0[a,z) => r0[a,f) r1[f, z) => r0[a,f) r1[f,g) r2[g,z)
instead of 2. Assuming no changes in requests, this should occur after 10 seconds and the first split [a,f)
. It is a bit wasteful though, because we split the range's load stats evenly over the LHS and RHS - so it is probable the LHS [a,f)
gets picked up for rebalancing and becomes a pain to merge (if [a,f) + [f,g) < 0.5 * split threshold
), with wasted snapshots realigning the replicas.
I don't think this fact changes the general problem, which you elaborated on with more examples. And it also seems we are in agreement that if we do end up splitting off a cold range, the next split will likely correct the behavior. So we'll end up with 3 ranges instead of 2.
Agreed.
pkg/storage/mvcc.go
line 5934 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think you wanted "and (2) be" here.
Fixed and reworded.
It was possible for a SQL row to be torn across two ranges due to the load-based splitter not rejecting potentially unsafe split keys. It is impossible to determine with keys sampled from response spans, whether a key is certainly unsafe or safe. This commit side steps this problem by re-using the `adminSplitWithDescriptor` command to find the first real key, after or at the provided `args.SplitKey`. This ensures that the split key will always be a real key whilst not requiring any checks in the splitter itself. The updated `adminSplitWithDescriptor` is local only and requires opting into finding the first safe key by setting `findFirstSafeKey` to `true`. As such, all safe split key checks are also removed from the `split` pkg, with a warning added that the any split key returned is unsafe. Resolves: cockroachdb#103483 Release note (bug fix): It was possible for a SQL row to be split across two ranges. When this occurred, SQL queries could return unexpected errors. This bug is resolved by these changes, as we now inspect the real keys, rather than just request keys to determine load-based split points.
a145013
to
bf2aa42
Compare
Non-blocking review, discussed over slack. Necessary to merge.
bors r=nvanbenschoten |
Build succeeded: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/mvcc.go
line 5963 at r10 (raw file):
I think I missed something. In the case you mentioned
If the LB desired key is
f<i>, i > 11
andR
is empty:
f will end up being the last key in the LHS, and we will fail to split f. IIUC, we don't have any logic to step back from the last key the way we step forward from the fist key (in mvccMinSplitKey
), so this failure may not be fixed by another split in the future.
I don't know if this is mostly a theoretical concern. #103672 mentions
Table key with one or more .Next() 0x00 appended
which could satisfy the case where it is past all the valid f keys, and can't be successfully passed through EnsureSafeSplitKey
. Does this kind of thing occur due to reverse scans (which call SeekLT
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/mvcc.go
line 5963 at r10 (raw file):
Previously, sumeerbhola wrote…
I think I missed something. In the case you mentioned
If the LB desired key is
f<i>, i > 11
andR
is empty:f will end up being the last key in the LHS, and we will fail to split f. IIUC, we don't have any logic to step back from the last key the way we step forward from the fist key (in
mvccMinSplitKey
), so this failure may not be fixed by another split in the future.I don't know if this is mostly a theoretical concern. #103672 mentions
Table key with one or more .Next() 0x00 appended
which could satisfy the case where it is past all the valid f keys, and can't be successfully passed through
EnsureSafeSplitKey
. Does this kind of thing occur due to reverse scans (which callSeekLT
)?
I'm only aware of one support related issue where we spit inbetween a row, prompting #103672.
If the LB split key was indeed unsafe due to one or more .Next() 0x00 appended, then the split would be acted upon as is prior to this PR (ignoring the safe sample only retention).
This leads me to believe that type of sample has been rare in the past, given the original issue is from 2019.
On a related note, this class of issues where the R is empty can still exist without needing unsafe keys.
Consider the kv
workload schema, which has 10 keys all belonging to a single table range [/Table/106/1, /Max]
.
I added additional logging for span sampling events, even if we didn't sample the key.
set tracing = on; select count(v) from [select v from kv order by k]; set tracing = off;
select message from [show trace for session] where message ~ 'executing (Get|Scan)' or message ~ 'sampling';
message
---------------------------------------------------------------------------------
executing Scan [/Table/106/1,/Table/106/2), [txn: 0c8c2532], [can-forward-ts]
sampling /Table/106/{1-2}
(2 rows)
Time: 1ms total (execution 1ms / network 0ms)
If the workload contains only scans for /Table/106/1
and /Table/106/2
, we will suggest a spliit at /Table/106/2
in the weighted finder and never suggest a split in the unweighted finder (it would amplify QPS).
The problem is, R
is empty by definition, since /Table/106/2
must sort after all keys in the table (only 1 index). Note that we use the union of spans the request iterated over (ca76c28) - so if the batch hits a limit and has a resumeSpan, we are unlikely to run into this problem
Passing percentiles and/or sampling more than the union of spans iterated over seems necessary.
In any case, this was desired behavior pre 23.1. There are load based rebalancing tests, such as allocbench
which use some mix of spanning requests and split reasonably well. For a workload with only full table scans on a range bytes < resume threshold, we will never split - evidenced by the small number of splits in #103872 for splits/load/spanning/nodes=3/obj=cpu
- which had the above pattern when investigating.
The load based splitting behavior was changed in cockroachdb#103690 to always find a real key to split at, rather than use sampled response spans as split points. This behavior change affected the number of splits in `split/load` roachtests. Update the min and max boundaries for various `split/load` roachtests to conform to the new expected behavior. Fixes: cockroachdb#103691 Fixes: cockroachdb#103935 Resolves: cockroachdb#103872 Release note: None
103985: roachtest: update activerecord adapter version r=rafiss a=andyyang890 This patch updates the version of the CockroachDB activerecord adapter used for testing to v7.0.2. Informs #97283 Informs #99620 Release note: None 104084: roachtest: bump split boundaries to prevent flakes r=tbg a=kvoli The load based splitting behavior was changed in #103690 to always find a real key to split at, rather than use sampled response spans as split points. This behavior change affected the number of splits in `split/load` roachtests. Update the min and max boundaries for various `split/load` roachtests to conform to the new expected behavior. See #103872 for updated split distribution over 190 test runs. Fixes: #103691 Fixes: #103935 Resolves: #103872 Release note: None Co-authored-by: Andy Yang <[email protected]> Co-authored-by: Austen McClernon <[email protected]>
The load based splitting behavior was changed in #103690 to always find a real key to split at, rather than use sampled response spans as split points. This behavior change affected the number of splits in `split/load` roachtests. Update the min and max boundaries for various `split/load` roachtests to conform to the new expected behavior. Fixes: #103691 Fixes: #103935 Resolves: #103872 Release note: None
It was possible for a SQL row to be torn across two ranges due to the
load-based splitter not rejecting potentially unsafe split keys. It is
impossible to determine with just the sampled request keys, whether a
key is certainly unsafe or safe, so a split key is returned regardless of error.
This PR side steps this problem by re-using the
adminSplitWithDescriptor
command to find the first real key, after orat the provided
args.SplitKey
. This ensures that the split key willalways be a real key whilst not requiring any checks in the splitter
itself.
The updated
adminSplitWithDescriptor
is local only and requires optinginto finding the first safe key by setting
findFirstSafeKey
totrue
.As such, all safe split key checks are also removed from the
split
pkg, with a warning added that the any split key returned is unsafe.
Note that the weighted load based split finder, used for CPU splits
did not suffer from returning potentially unsafe splits due to e4f003b.
However, it was possible that no load-based split key was ever found
when using the weighted finder. This was because we discard potentially
unsafe samples, which could have been safe split points.
This PR reverts commit e4f003b, as the
safe split key is enforced elsewhere, mentioned above.
Resolves: #103483
Resolves: #103672
Release note (bug fix): It was possible for a SQL row to be split across
two ranges. When this occurred, SQL queries could return unexpected
errors. This bug is resolved by these changes, as we now sample the real
keys, rather than just request keys to determine load-based split points.