roachpb: add gc hint replica state field and let delete range set it #86565
Conversation
Force-pushed 8dac6af to 78f7e14
Force-pushed c4cfd98 to 7f11876
Force-pushed 2c49629 to 1e60dad
Initial draft pass. There are also a few nits here and there, but will wait for the final PR.
pkg/roachpb/metadata.go
Outdated
if leftEmpty || rightEmpty {
	// If one of sides doesn't have hint we probably can't do anything sensible.
Would it be better to just use the non-empty one?
My initial thinking was that if we merge with a range that has no hint, that range likely still has data in it, so the hint would be wasted. But maybe you are right: we can still run the range through the expedited checks and probe whether it is still eligible if we merged "hinted" ranges with empty ones on the range boundaries.
Force-pushed 1e60dad to 0ac298e
I think we need a migration here, or we risk divergence in mixed-version clusters where some nodes write the hint during merge/split and others don't.
That is, we need to encode, into the split/merge trigger, whether a GCHint can be written. We need a cluster version to gate this during evaluation, and only if that version is active do we tell the commit trigger to act on GCHints. In particular we need to be careful about WriteInitialReplicaState, which is called from splits.
Basic approach looks good!
Reviewed 1 of 14 files at r1, 17 of 17 files at r3, 6 of 6 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @erikgrinaker)
pkg/keys/keys.go
line 283 at r3 (raw file):
// RangeGCHintKey returns a system-local key for last used GC threshold on the
// user keyspace. Reads and writes <= this timestamp will not be served.
Copy pasta error
pkg/kv/kvserver/batcheval/cmd_delete_range.go
line 62 at r4 (raw file):
})
if args.UpdateDeleteRangeGcHint {
GCHint (use gogoproto.customname)
pkg/kv/kvserver/batcheval/cmd_delete_range.go
line 93 at r4 (raw file):
// usage of hint.
return result.Result{}, errors.AssertionFailedf(
	"GcRangeHint must only be used together with UseRangeTombstone")
GC
pkg/kv/kvserver/batcheval/cmd_delete_range.go
line 112 at r4 (raw file):
maybeUpdateGCHint := func(res *result.Result) error {
	if args.UpdateDeleteRangeGcHint {
You can return early to make this less indented.
if !args.UpdateDeleteRangeGCHint {
return nil
}
if !args.Key.Equal(...) {
return nil
}
...
pkg/kv/kvserver/batcheval/cmd_delete_range.go
line 132 at r4 (raw file):
	}
} else {
	log.Warningf(ctx, "attempt to set GC hint for the incompletely covered range")
This can happen legitimately, so I don't think it should log as a warning. It's generally not a good idea to log from batcheval. I would just remove this; instead, we could simply not write the hint unless the entire range is affected.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go
line 1234 at r3 (raw file):
{
	// If we have GC hints populated that means we are trying to perform
	// optimized garbage removal in future. For load based splits we want to
I don't understand what load based splits have to do with this.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go
line 1237 at r3 (raw file):
	// preserve this behaviour as both ranges likely belong to the same table.
	// We will try to merge both hints if possible and set new hint on LHS.
	// If resulting hint makes no sense, merge will reset it so that it is reset
What does "makes no sense" mean? I think we'd just track the larger of the two hints. If it isn't as simple as that, explain in more words.
pkg/kv/kvserver/batcheval/result/result.go
line 254 at r3 (raw file):
if q.Replicated.State.GCHint != nil {
	p.Replicated.State.GCHint = q.Replicated.State.GCHint
Detect conflicting GCHints here to catch potential programming errors.
pkg/kv/kvserver/stateloader/initial.go
line 97 at r3 (raw file):
	return enginepb.MVCCStats{}, errors.Wrap(err, "error reading GCHint")
} else if !existingGCHint.LatestRangeDeleteTimestamp.IsEmpty() {
	log.Fatalf(ctx, "expected trivial GCHint, but found %+v", existingGCHint)
return an errors.AssertionFailedf
pkg/kv/kvserver/stateloader/stateloader.go
line 283 at r3 (raw file):
}

// LoadGCHint loads GC hint
.
pkg/kv/kvserver/stateloader/stateloader.go
line 293 at r3 (raw file):
}

// SetGCHint writes the GC hint
.
pkg/roachpb/metadata.go
line 786 at r3 (raw file):
// Returns true if receiver state was changed.
func (h *GCRangeHint) Merge(rhs *GCRangeHint) bool {
	if h.LatestRangeDeleteTimestamp.Less(rhs.LatestRangeDeleteTimestamp) {
Isn't there a method .Forward for this?
pkg/roachpb/metadata.proto
line 461 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Does this mean that we'd calculate the GC TTL and set the timestamp when the range would fall below the GC TTL? What if the user changes the GC TTL after the drop, to GC it faster?
I think we'll want to set this to the DeleteRange evaluation timestamp, and then the GC queue can compare that to the GC TTL.
+1 let's set this to the write timestamp during eval
Force-pushed cef37a7 to 93bbb2a
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/kv/kvserver/batcheval/result/result.go
line 254 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Detect conflicting GCHints here to catch potential programming errors.
Should we always reject if it is set on both sides of merge? What would be the use case, sending multiple identical delete requests in a batch?
pkg/kv/kvserver/stateloader/initial.go
line 97 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
return an errors.AssertionFailedf
Why do other checks here resort to log.Fatalf?
pkg/roachpb/api.proto
line 390 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I wouldn't send the entire GC hint explicitly here, since we may not want callers to get access to everything in that struct. I'd add a boolean field, something like AggressiveGC or IsSchemaDrop or something, and then set the GC hint to the current timestamp based on that.
Done.
pkg/roachpb/metadata.go
line 786 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't there a method .Forward for this?
Forward is weird because it can change the synthetic flag in the timestamp but return false, indicating that it didn't forward it. I would be scared to have equal objects with different content in the state.
pkg/roachpb/metadata.proto
line 461 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
+1 let's set this to the write timestamp during eval
Done.
Force-pushed cd27363 to 809a3c7
I think we need a migration here
A below-Raft migration, or just a version gate? Was really hoping we'd avoid a migration this cycle, given the issues with the below-Raft migration infrastructure.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)
pkg/kv/kvserver/batcheval/result/result.go
line 254 at r3 (raw file):
Previously, aliher1911 (Oleg) wrote…
Should we always reject if it is set on both sides of merge? What would be the use case, sending multiple identical delete requests in a batch?
I think we need to do something similar to GCThreshold, and forward it to the highest timestamp. Consider e.g. serverless, where we can have multiple dropped tables in the same range, and we may merge ranges with different dropped tables in the LHS and RHS. Forwarding to the highest timestamp seems like the most reasonable action here, even though it may end up being a futile GC attempt.
I don't think this particular batcheval is about merging ranges; it is merging multiple results for a single range. We handle merging ranges in the merge trigger, and there we use the highest timestamp out of both.
Force-pushed 809a3c7 to d93db60
Force-pushed 7c39907 to 00a9ff4
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/kv/kvserver/batcheval/cmd_delete_range.go
line 128 at r20 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Range tombstones are non-transactional, so this can only be h.Timestamp.
This way it is also consistent with what we pass to the actual delete operation.
Force-pushed 1f4c08f to 6d51aed
Seems good overall. I'll let @tbg have a closer look at the migration concerns, since I don't have a full picture of the pitfalls here.
It might be worth spinning up a mixed-version cluster, running a workload that does a bunch of splits and merges, and then doing a full consistency check of all ranges, just to make sure we're not getting any divergence.
Oh, and add on a commit to set this in gcjob.deleteAllSpanData().
instead of manual testing, could you add a step around here

    backupStep,

that performs a few (random) splits and immediately merges some of them back (like doing three splits, then merging the last two)? Can probably just add something into versionUpgradeTestFeatures

    var versionUpgradeTestFeatures = versionFeatureStep{
stmtFeatureTest("Split-then-merge", v201, `
CREATE DATABASE IF NOT EXISTS splitmerge;
CREATE TABLE splitmerge.t (id INT PRIMARY KEY);
ALTER TABLE splitmerge.t SPLIT AT VALUES(1,2,3); -- probably doing this somewhat wrong
ALTER TABLE splitmerge.t UNSPLIT AT VALUES(2,3); -- probably doing this somewhat wrong
`),
That will give us some confidence that things don't go sideways. We may also want this diff:
diff --git a/pkg/cmd/roachtest/tests/versionupgrade.go b/pkg/cmd/roachtest/tests/versionupgrade.go
index ccae65f35f..8c414fba43 100644
--- a/pkg/cmd/roachtest/tests/versionupgrade.go
+++ b/pkg/cmd/roachtest/tests/versionupgrade.go
@@ -138,7 +138,9 @@ func runVersionUpgrade(ctx context.Context, t test.Test, c cluster.Cluster) {
// and finalizing on the auto-upgrade path.
preventAutoUpgradeStep(1),
// Roll nodes forward.
- binaryUpgradeStep(c.All(), ""),
+ binaryUpgradeStep(c.Node(1), ""),
+ testFeaturesStep,
+ binaryUpgradeStep(c.Range(2, c.Spec().NodeCount), ""),
testFeaturesStep,
// Run a quick schemachange workload in between each upgrade.
// The maxOps is 10 to keep the test runtime under 1-2 minutes.
so that we have a chance to truly exercise this in the case where some nodes are new and some are old, but I think if this causes unrelated trouble we can kick it down the road.
Reviewed 17 of 17 files at r25, 8 of 8 files at r26, 9 of 9 files at r27, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911)
pkg/kv/kvserver/stateloader/stateloader.go
line 308 at r26 (raw file):
hint *roachpb.GCHint, gcHintEnabled bool,
) (bool, error) {
(updated bool, _ error)
NB: not at all against manual testing. Just if we can spend the same time on augmenting the automatic test, it will likely give us more confidence. But if this is too expensive, definitely a manual test + then merge.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911)
Force-pushed 1fe3375 to 88e1624
Force-pushed 88e1624 to ba86aaa
🛳️ 🇮🇹
Reviewed 19 of 19 files at r28, 8 of 8 files at r29, 10 of 10 files at r30, 2 of 2 files at r31, all commit messages.
Dismissed @erikgrinaker from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911)
Force-pushed ba86aaa to b87e290
For posterity: I tried testing manually with splits and merges initiated by "new" and "old" versions during an upgrade in a mixed-version cluster, then let it sit for the consistency checks to run, and it didn't cause any trouble. So assuming all is good on this front.
bors r=tbg,erikgrinaker
Build succeeded:
Touches #86550
This commit adds a test that performs range splits and merges during
a cluster upgrade. This covers test cases where raft state gets
additional information in new versions, and we must ensure that
replicas don't diverge in the process of the upgrade.
Release justification: Commit adds extra tests only.
Release note: None
When cmd_delete_range deletes the whole range, it can set a GC hint so
that MVCC GC can optimize the large number of ranges that would be
collected using GC clear-range requests.
Processing those requests together in quick succession reduces
compaction load on Pebble.
Release justification: This change is safe as it adds handling for the new
type of request that doesn't interact with any existing functionality.
Release note: None
This commit adds a version gate GCHintInReplicaState for the GC hint.
This is needed to avoid divergence of replicas when older followers
don't yet update their state with the new field value.
Release justification: This commit adds a version gate for a new feature.
Release note: None
This commit adds a GC hint field to replica state that is backed by a
replicated range-local key. This field can be set explicitly by the
client to indicate that all data in the range is ripe for deletion at
the hint timestamp. MVCC GC can use this information to optimize
clear-range requests.
Release justification: this commit is safe because it adds a new request
type and a new status field that doesn't interfere with existing
functionality.
Release note: None