sql: TRUNCATE preserves split points of indexes #63043
Conversation
Looks like @nvanbenschoten and I get to have a PR bakeoff :) #63041
Force-pushed from 15cee20 to 4471755.
Preserving more splits than some constant multiple of the number of nodes seems a bit crazy. I am not convinced that if there are 10k ranges on a table we want to make 10k splits. Can we do something to find some uniform accounting of data? I feel like we have some library tools to combine spans in various ways. My guess is that something like 100 splits will get you most of the benefit at a fraction of the cost most of the time.
Reviewable status: complete! 0 of 0 LGTMs obtained
Yes, I agree - this makes a lot of sense. @nvanbenschoten did something more intelligent in his version in #63041.
Force-pushed from 4471755 to 3e0be78.
This PR now permits customizing the number of preserved splits. We could do a customizable multiple of the number of nodes in the cluster too, I suppose. I'm not really sure what people will want to do from a customizability perspective. Another question that came up: what do we do with stickybit range splits? I think it's either we preserve all of them along with their stickybits, which has the downside of keeping way too many if people have way too many in the first place, or we preserve the range splits that happen to be sampled and throw away their stickybits. The latter is what this PR does.
Force-pushed from 3e0be78 to bf4bb5d.
Thanks for putting this together!
Another question that came up: what do we do with stickybit range splits? I think it's either we preserve all of them along with their stickybits, which has the downside of keeping way too many if people have way too many in the first place, or we preserve the range splits that happen to be sampled and throw away their stickybits. The latter is what this PR does.
This is an interesting question. Right now, we just throw away sticky bits in unsplitRangesForTable, regardless of whether they have an expiration or not. But an argument could be made that we should be translating them to the new set of indexes. I don't think this is that important to do, but moving them over does seem like the "right" thing to do so that users don't have to recreate manual range splits after a TRUNCATE. I'm not sure if you would want to do that in this PR.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/truncate.go, line 304 at r1 (raw file):
}
if nSplits := int(enableSplitsAfterTruncate.Get(p.execCfg.SV())); nSplits > 0 {
I think it would make sense to move this up right after the call to unsplitRangesForTable and then pull it into a helper method.
pkg/sql/truncate.go, line 304 at r1 (raw file):
}
if nSplits := int(enableSplitsAfterTruncate.Get(p.execCfg.SV())); nSplits > 0 {
Do we want a cluster setting for this, or should this be enabled unconditionally?
pkg/sql/truncate.go, line 304 at r1 (raw file):
}
if nSplits := int(enableSplitsAfterTruncate.Get(p.execCfg.SV())); nSplits > 0 {
Right now, we're splitting each index into a fixed number of ranges, instead of the entire table into a fixed number of ranges. So if nSplits = 10 and a table has only a PK, it will be split 10 times. But if the table has 500 indexes, it will be split 5000 times. That feels a little strange to me, as I would expect the number of indexes to be independent of the number of splits. What do you think?
pkg/sql/truncate.go, line 304 at r1 (raw file):
}
if nSplits := int(enableSplitsAfterTruncate.Get(p.execCfg.SV())); nSplits > 0 {
It probably makes the most sense to make nSplits a function of the number of nodes in the cluster. We can get this count with a query like:
SELECT count(*) FROM crdb_internal.kv_node_status
We can then scale this by some constant, which can be a cluster setting if we'd like. I'd default it to something like 4.
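A minimal sketch of what that could look like, reusing the internal-executor call that appears later in this thread (the setting name matches the one this PR ends up introducing; the decoding of the count is illustrative, not the PR's final code):

```go
// Count the nodes via an internal query, then scale by the cluster setting.
row, err := p.EvalContext().InternalExecutor.QueryRow(
	ctx, "count-active-nodes", p.txn,
	"SELECT count(*) FROM crdb_internal.kv_node_status",
)
if err != nil {
	return err
}
nodeCount := int(tree.MustBeDInt(row[0])) // illustrative decoding of the count
nSplits := nodeCount * int(PreservedSplitCountMultiple.Get(p.execCfg.SV()))
```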
pkg/sql/truncate.go, line 337 at r1 (raw file):
// from the ranges that are returned.
//
// Does anyone know a better way of evenly spacing out n items from a
Does https://github.com/cockroachdb/cockroach/pull/63041/files#diff-5e1177063f978b0ad4aa2a220185ec66302fe8e63f3f7270509a6d54950926e8R441 do what we want any better?
pkg/sql/truncate.go, line 359 at r1 (raw file):
newStartKey := append(newIndexPrefix, startKey[len(indexPrefix):]...)
b.AddRawRequest(&roachpb.AdminSplitRequest{
I think you can use p.ExecCfg().DB.SplitAndScatter(ctx, splitPoint, expTime) to avoid the proto manipulation and to merge the split and scatter into a single operation.
pkg/sql/truncate.go, line 364 at r1 (raw file):
},
SplitKey: newStartKey,
ExpirationTime: p.execCfg.Clock.Now().Add(time.Minute.Nanoseconds(), 0),
@ajwerner had a good idea of jittering this expiration time by something like 20% of the duration.
Also, 1 minute seems a little short. I think 10 minutes would be a little more reasonable.
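A rough sketch of the jitter idea (the helper name and the exact ±20% window are my illustration, not this PR's code):

```go
import (
	"math/rand"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// jitteredExpiration returns now + base, perturbed by up to ±20% of base, so
// that the split expirations for a table don't all fire at the same instant.
func jitteredExpiration(now hlc.Timestamp, base time.Duration) hlc.Timestamp {
	jitter := time.Duration((rand.Float64()*0.4 - 0.2) * float64(base))
	return now.Add((base + jitter).Nanoseconds(), 0)
}
```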
pkg/sql/truncate.go, line 388 at r1 (raw file):
OldPrimaryIndexId: oldIndexes[0].ID,
OldIndexes: oldIndexIDs[1:],
NewPrimaryIndexId: tableDesc.GetPrimaryIndexID(),
newIndexIDs[0]?
Force-pushed from bf4bb5d to 2cdad08.
Makes sense, we'll avoid caring about keeping sticky bit range splits for now.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/sql/truncate.go, line 304 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we want a cluster setting for this, or should this be enabled unconditionally?
That's a good question. I don't really know. I was thinking that for now, it would probably be nice to let people opt out of the new behavior, so I figured we'd make a cluster setting with a reasonable default but let people move the value to 0 if they didn't want the behavior at all.
pkg/sql/truncate.go, line 304 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Right now, we're splitting each index into a fixed number of ranges, instead of the entire table into a fixed number of ranges. So if nSplits = 10 and a table has only a PK, it will be split 10 times. But if the table has 500 indexes, it will be split 5000 times. That feels a little strange to me, as I would expect the number of indexes to be independent of the number of splits. What do you think?
Yeah, you're right. When I was making this PR I kind of thought that by default we split each table into 1 range per index, but it's actually not the case.
pkg/sql/truncate.go, line 304 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It probably makes the most sense to make nSplits a function of the number of nodes in the cluster. We can get this count with a query like SELECT count(*) FROM crdb_internal.kv_node_status. We can then scale this by some constant, which can be a cluster setting if we'd like. I'd default it to something like 4.
Okay, makes sense. I almost did this, but wasn't actually able to find a way to get the number of nodes in the cluster easily... this SQL command makes sense though, I should have thought of that.
pkg/sql/truncate.go, line 304 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think it would make sense to move this up right after the call to unsplitRangesForTable and then pull it into a helper method.
Done.
pkg/sql/truncate.go, line 337 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Does https://github.com/cockroachdb/cockroach/pull/63041/files#diff-5e1177063f978b0ad4aa2a220185ec66302fe8e63f3f7270509a6d54950926e8R441 do what we want any better?
I don't really know... if you round a float using an int cast, it always rounds down. This means that, given a list like this:
0 1 2 3 4 5 6 7 8 9
and asking for 8 items out of the list, we get:
0, 1, 2, 3, 5, 6, 7, 8
With the algorithm in this PR, we get:
0 2 4 5 6 7 8 9
I think your version is probably better, because it "spaces out the skips" a little bit better - this version puts all of the skips up front.
pkg/sql/truncate.go, line 359 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think you can use p.ExecCfg().DB.SplitAndScatter(ctx, splitPoint, expTime) to avoid the proto manipulation and to merge the split and scatter into a single operation.
Hmm, but doesn't this remove the batching that we do here? I'd like to be able to send all of the split points and scatter requests in a single round trip.
pkg/sql/truncate.go, line 364 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
@ajwerner had a good idea of jittering this expiration time by something like 20% of the duration.
Also, 1 minute seems a little short. I think 10 minutes would be a little more reasonable.
Good idea, done.
pkg/sql/truncate.go, line 388 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
newIndexIDs[0]?
They're equivalent, but I think you're right that it's more symmetrical to do what you're saying.
Force-pushed from 2cdad08 to 51eb536.
I redid this a little bit to be more similar to your approach, where we divide the range splits evenly across the whole table, not on an index by index basis. The reason for this is that some indexes may be larger than others, and they should consume more of the available presplit points.
PTAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
Reviewed 1 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @nvanbenschoten)
pkg/sql/truncate.go, line 304 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
That's a good question. I don't really know. I was thinking that for now, it would probably be nice to let people opt out of the new behavior, so I figured we'd make a cluster setting with a reasonable default but let people move the value to 0 if they didn't want the behavior at all.
👍 sounds reasonable to me.
pkg/sql/truncate.go, line 337 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I don't really know... if you round a float using an int cast, it always rounds down. This means that, given a list like this:
0 1 2 3 4 5 6 7 8 9
and asking for 8 items out of the list, we get:
0, 1, 2, 3, 5, 6, 7, 8
With the algorithm in this PR, we get:
0 2 4 5 6 7 8 9
I think your version is probably better, because it "spaces out the skips" a little bit better - this version puts all of the skips up front.
I think this only works if we perform the step*i multiplication as a float, then cast. So ranges[int(float64(i)*step)] instead of ranges[int(step)*i].
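To make the off-by-one concrete, here is a tiny self-contained demo (illustrative only) of the two index computations for the 10-items/8-samples example above:

```go
package main

import "fmt"

func main() {
	ranges := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
	n := 8
	step := float64(len(ranges)) / float64(n) // 1.25

	castFirst, multiplyFirst := []int{}, []int{}
	for i := 0; i < n; i++ {
		castFirst = append(castFirst, ranges[int(step)*i]) // int(1.25) == 1
		multiplyFirst = append(multiplyFirst, ranges[int(float64(i)*step)])
	}
	fmt.Println(castFirst)     // [0 1 2 3 4 5 6 7]: all skips pile up at the end
	fmt.Println(multiplyFirst) // [0 1 2 3 5 6 7 8]: skips are spread out evenly
}
```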
pkg/sql/truncate.go, line 359 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Hmm, but doesn't this remove the batching that we do here? I'd like to be able to send all of the split points and scatter requests in a single round trip.
Ah, great point. Carry on.
pkg/sql/truncate.go, line 298 at r2 (raw file):
}
if err := p.copySplitPointsToNewIndexes(ctx, id, oldIndexIDs, newIndexIDs); err != nil {
nit: consider adding a comment here explaining what this means.
pkg/sql/truncate.go, line 344 at r2 (raw file):
}
// copySplitPointsToNewIndexes copies any range split points from the indexes
nit: consider moving this down right above reassignInterleaveIndexReferences, so that the helper methods are in the order that we call them.
pkg/sql/truncate.go, line 355 at r2 (raw file):
) error {
if preservedSplitsMultiple := int(PreservedSplitCountMultiple.Get(p.execCfg.SV())); preservedSplitsMultiple > 0 {
Can we avoid this indentation with an early return?
pkg/sql/truncate.go, line 357 at r2 (raw file):
if preservedSplitsMultiple := int(PreservedSplitCountMultiple.Get(p.execCfg.SV())); preservedSplitsMultiple > 0 {
row, err := p.EvalContext().InternalExecutor.QueryRow(
ctx, "count-active-nodes", p.txn,
"SELECT COUNT(*) FROM crdb_internal.kv_node_status")
I wonder whether we want to pretend that any of this is transactional, or whether we should 1) pass nil here, and 2) use p.txn.DB() to issue the split and scatter batch.
pkg/sql/truncate.go, line 423 at r2 (raw file):
// and there are old split points sitting around in the ranges. In this
// case, we can't translate the range into the new keyspace, so we don't
// bother with this range.
Should we continue here?
pkg/sql/truncate.go, line 423 at r2 (raw file):
// and there are old split points sitting around in the ranges. In this
// case, we can't translate the range into the new keyspace, so we don't
// bother with this range.
I wonder whether we should first filter out all split points that we plan to ignore (the continue cases here) and then enter this downsampling loop. That way, 10k ranges from a previously truncated table doesn't drown out 100 ranges in the current version of the table.
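A self-contained sketch of that ordering (the function name, the byte-slice key type, and the keep predicate are my illustration, not the PR's code):

```go
// downsampleSplitPoints first drops the keys the caller would skip anyway
// (the inverse of the loop's "continue" cases), then picks n evenly spaced
// survivors, so stale split points can't crowd out the live ones.
func downsampleSplitPoints(points [][]byte, keep func([]byte) bool, n int) [][]byte {
	filtered := make([][]byte, 0, len(points))
	for _, p := range points {
		if keep(p) {
			filtered = append(filtered, p)
		}
	}
	if len(filtered) <= n {
		return filtered
	}
	step := float64(len(filtered)) / float64(n)
	out := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, filtered[int(float64(i)*step)])
	}
	return out
}
```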
pkg/sql/truncate.go, line 433 at r2 (raw file):
expirationTime += jitter
log.Errorf(ctx, "sending split request for key %s with expiration time %s", newStartKey,
Infof, this is a good thing!
Force-pushed from 51eb536 to 2206d90.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/sql/truncate.go, line 337 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think this only works if we perform the step*i multiplication as a float, then cast. So ranges[int(float64(i)*step)] instead of ranges[int(step)*i].
Oops, good point. Done.
pkg/sql/truncate.go, line 298 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: consider adding a comment here explaining what this means.
Done.
pkg/sql/truncate.go, line 344 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: consider moving this down right above reassignInterleaveIndexReferences, so that the helper methods are in the order that we call them.
Done.
pkg/sql/truncate.go, line 355 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can we avoid this indentation with an early return?
Done.
pkg/sql/truncate.go, line 357 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I wonder whether we want to pretend that any of this is transactional, or whether we should 1) pass nil here, and 2) use p.txn.DB() to issue the split and scatter batch.
No idea. I took your suggestion.
pkg/sql/truncate.go, line 423 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we continue here?
Yes.
pkg/sql/truncate.go, line 423 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I wonder whether we should first filter out all split points that we plan to ignore (the continue cases here) and then enter this downsampling loop. That way, 10k ranges from a previously truncated table doesn't drown out 100 ranges in the current version of the table.
Done.
pkg/sql/truncate.go, line 433 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Infof, this is a good thing!
Done.
Force-pushed from 2206d90 to af0a4b7.
Reviewed 3 of 3 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/truncate.go, line 574 at r3 (raw file):
// Finally, downsample the split points - choose just nSplits of them to keep.
step := float64(len(splitPoints)) / float64(nSplits)
Does this correctly handle the case where len(splitPoints) < nSplits? I think we need:
if step < 1 {
step = 1
}
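To see why, a quick self-contained illustration (the loop's bounds check is also my assumption): with 3 split points and nSplits = 10, step is 0.3, and int(float64(i)*0.3) evaluates to 0 for the first four iterations, so the same split key would be emitted repeatedly without the clamp.

```go
package main

import "fmt"

func main() {
	splitPoints := []string{"a", "b", "c"}
	nSplits := 10
	step := float64(len(splitPoints)) / float64(nSplits) // 0.3
	if step < 1 {
		step = 1 // the suggested guard
	}
	for i := 0; i < nSplits; i++ {
		idx := int(float64(i) * step)
		if idx >= len(splitPoints) {
			break // with the guard, we stop after emitting each point once
		}
		fmt.Println(splitPoints[idx]) // a, b, c - no duplicates
	}
}
```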
Oh no...
I guess we'll... just assume there's just 1 node when we're in multi-tenancy mode?
Force-pushed from af0a4b7 to 64ddbf2.
We don't let tenants manipulate range splits directly. So we'll want to make
It's probably easier to just return early in
TFW you don’t press publish, sorry got sidetracked mid review 😞
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @nvanbenschoten)
pkg/sql/truncate.go, line 348 at r2 (raw file):
// on the table given by the tableID.
// oldIndexIDs and newIndexIDs must be in the same order and be the same length.
func (p *planner) copySplitPointsToNewIndexes(
Does this whole thing need to be gated on multi-tenant?
pkg/sql/truncate.go, line 357 at r2 (raw file):
if preservedSplitsMultiple := int(PreservedSplitCountMultiple.Get(p.execCfg.SV())); preservedSplitsMultiple > 0 {
row, err := p.EvalContext().InternalExecutor.QueryRow(
ctx, "count-active-nodes", p.txn,
"SELECT COUNT(*) FROM crdb_internal.kv_node_status")
does this work in multi-tenant? Want to just default it to 1? row being 0 doesn't feel like a good reason to fail.
pkg/sql/truncate.go, line 372 at r2 (raw file):
// Fetch all of the range descriptors for this index.
ranges, err := kvclient.ScanMetaKVs(ctx, p.txn, roachpb.Span{
Is this allowed in multi-tenant?
Done, thanks
Force-pushed from 64ddbf2 to db0714e.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @nvanbenschoten)
pkg/sql/truncate.go, line 608 at r4 (raw file):
}
b.AddRawRequest(&roachpb.AdminScatterRequest{
Are we smart enough to split the scatter from the batch before the splitting? Should this be in a subsequent batch?
Grr, these tests are extraordinarily flaky. I need to figure out how to improve the situation.
pkg/sql/truncate.go, line 592 at r5 (raw file):
I was chatting with @nvanbenschoten and based on the findings here #62700 (comment), we should set the expiration point for these to be past the GC TTL to avoid the problem that @nvanbenschoten found.
Reviewed 2 of 3 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @lunevalex, and @nvanbenschoten)
pkg/sql/truncate.go, line 592 at r5 (raw file):
Previously, lunevalex wrote…
I was chatting with @nvanbenschoten and based on the findings here #62700 (comment), we should set the expiration point for these to be past the GC TTL to avoid the problem that @nvanbenschoten found.
Specifically, we were thinking about setting this to 2x the GC TTL for the table being TRUNCATEd (again, with the 20% jitter). We'll need to do more in #62700 to avoid needing to rely on this expiration for cluster stability, but this seems like a helpful short-term mitigation.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @lunevalex, and @nvanbenschoten)
pkg/sql/truncate.go, line 592 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Specifically, we were thinking about setting this to 2x the GC TTL for the table being TRUNCATEd (again, with the 20% jitter). We'll need to do more in #62700 to avoid needing to rely on this expiration for cluster stability, but this seems like a helpful short-term mitigation.
On second thought, what we're doing here is really a form of load-based splitting. So I think what would be best is if we pulled this expiration time from the kv.range_split.by_load_merge_delay cluster setting. This way, we can use that cluster setting to tune the delay for load-based splits generated here and by the splitQueue.
Force-pushed from db0714e to 30dd423.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @lunevalex, and @nvanbenschoten)
pkg/sql/truncate.go, line 348 at r2 (raw file):
Previously, ajwerner wrote…
Does this whole thing need to be gated on multi-tenant?
Done.
pkg/sql/truncate.go, line 357 at r2 (raw file):
Previously, ajwerner wrote…
does this work in multi-tenant? Want to just default it to 1? row being 0 doesn't feel like a good reason to fail.
Done.
pkg/sql/truncate.go, line 372 at r2 (raw file):
Previously, ajwerner wrote…
Is this allowed in multi-tenant?
Done.
pkg/sql/truncate.go, line 608 at r4 (raw file):
Previously, ajwerner wrote…
Are we smart enough to split the scatter from the batch before the splitting? Should this be in a subsequent batch?
Good point. Done.
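For illustration, a hedged sketch of that two-phase sequencing (the scatter request's span fields and the splitKeys/expirationTime locals are my assumptions, not necessarily the PR's final code):

```go
// Phase 1: send every AdminSplit in one batch.
splitBatch := &kv.Batch{}
for _, key := range splitKeys {
	splitBatch.AddRawRequest(&roachpb.AdminSplitRequest{
		RequestHeader:  roachpb.RequestHeader{Key: key},
		SplitKey:       key,
		ExpirationTime: expirationTime,
	})
}
if err := p.txn.DB().Run(ctx, splitBatch); err != nil {
	return err
}
// Phase 2: scatter in a second batch, so the scatters see the post-split ranges.
scatterBatch := &kv.Batch{}
for _, key := range splitKeys {
	scatterBatch.AddRawRequest(&roachpb.AdminScatterRequest{
		RequestHeader: roachpb.RequestHeader{Key: key, EndKey: key.Next()},
	})
}
return p.txn.DB().Run(ctx, scatterBatch)
```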
pkg/sql/truncate.go, line 592 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
On second thought, what we're doing here is really a form of load-based splitting. So I think what would be best is if we pulled this expiration time from the kv.range_split.by_load_merge_delay cluster setting. This way, we can use that cluster setting to tune the delay for load-based splits generated here and by the splitQueue.
Done.
Reviewed 2 of 2 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @lunevalex)
Force-pushed from efb61e8 to 118147d.
Force-pushed from 118147d to e2665f5.
bors r+
Build succeeded.
@jordanlewis @nvanbenschoten can this be backported to release-21.1?
Yes, I will be backporting it along with #65371. Thank you for the label and reminder.
Touches #62672
Previously, TRUNCATE unconditionally blew away all index split points
when creating the replacement empty indexes. This manifested in some bad
behavior when performing TRUNCATE on tables or indexes that had heavy
load and were heavily split already, because all of a sudden all of the
traffic that was nicely dispersed across all of the ranges would
redirect to the single new range that TRUNCATE created.
The bad performance would remediate over time as the database re-split
the new ranges, but preserving the splits across the index swap
boundaries is a faster way to get there.
Release note (sql change): TRUNCATE is now less disruptive on tables
with a lot of concurrent traffic.