storage/sql: change sticky bit to represent time of expiration #38004
Conversation
Force-pushed from b6d20b4 to dab6353
Reviewed 3 of 3 files at r1, 40 of 40 files at r2, 10 of 10 files at r3, 5 of 5 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @jeffrey-xiao)
pkg/roachpb/api.proto, line 669 at r2 (raw file):
RequestHeader header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true];
bytes split_key = 2 [(gogoproto.casttype) = "Key"];
util.hlc.Timestamp expiration_time = 3 [(gogoproto.nullable) = false];
I think we're going to need to `reserve 3;` and use `4` here; otherwise, this change will be incompatible with the previous alpha. We probably don't need a full migration here, but let's at least ensure that the migration doesn't crash a node.
pkg/sql/crdb_internal.go, line 1795 at r2 (raw file):
expirationTime := tree.DNull
if !desc.StickyBit.Equal(hlc.MaxTimestamp) {
I'm surprised that we have NULL representing the maximum split time (never merge) instead of the minimum split time (always merge). I would have thought you would go the other way with this. Keep in mind that most ranges will not be manually split, so the value here should make sense for automatic splits.
pkg/sql/split.go, line 117 at r3 (raw file):
func (n *splitNode) startExec(params runParams) error {
	stickyBitEnabled := params.EvalContext().Settings.Version.IsActive(cluster.VersionStickyBit)
	// TODO(jeffreyxiao): Remove this error in v20.1.
Could you add to this comment that we can also remove `splitNode.Force` and `experimental_force_split_at` in v20.1?
pkg/sql/split.go, line 191 at r3 (raw file):
// the expiration time of the split.
func parseExpirationTime(
	expireExpr tree.Expr, semaCtx *tree.SemaContext, evalCtx *tree.EvalContext,
nit: put `expireExpr` as the last arg.
pkg/sql/split.go, line 208 at r3 (raw file):
var convErr error
switch d := d.(type) {
case *tree.DString:
A lot of this is copied from somewhere, right? Should we try to unify the code?
pkg/sql/split_test.go, line 150 at r4 (raw file):
var key roachpb.Key
var pretty string
var expirationTimestamp *string
Does this need to be a `*string`?
pkg/sql/unsplit_test.go, line 60 at r4 (raw file):
var key roachpb.Key
var pretty string
var expirationTimestamp *string
Does this need to be a `*string`?
pkg/storage/merge_queue.go, line 295 at r2 (raw file):
// Range was manually split and not expired, so skip merging.
if rhsDesc.StickyBit.GoTime().After(mq.store.Clock().PhysicalTime()) {
Why strip the logical component before performing the comparison? We should be able to use the `hlc.Timestamp.Less` method.
pkg/storage/replica_command.go, line 227 at r2 (raw file):
log.Event(ctx, "range already split")
// Even if the range is already split, we should still update the sticky
// bit if it is different.
I think we only want to do this if the expiration timestamp is less, not if it's different.
pkg/storage/replica_command.go, line 397 at r2 (raw file):
// unsplit command as a no-op and return success instead of throwing an
// error.
if desc.StickyBit.Equal(hlc.Timestamp{}) {
nit: No need for `Equal` here; `hlc.Timestamp` is comparable, so `==` will work.
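For illustration, a minimal, self-contained sketch of this point, using a local stand-in type with the same comparable shape as `hlc.Timestamp` (all fields comparable, so `==` against the zero value works without an `Equal` method):

```go
package main

import "fmt"

// timestamp is a local stand-in for hlc.Timestamp: every field is comparable,
// so values can be checked against the zero value with == directly.
type timestamp struct {
	WallTime int64
	Logical  int32
}

func main() {
	var stickyBit timestamp               // zero value means "sticky bit not set"
	fmt.Println(stickyBit == timestamp{}) // true
}
```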
pkg/storage/replica_command.go, line 420 at r2 (raw file):
InternalCommitTrigger: &roachpb.InternalCommitTrigger{
	StickyBitTrigger: &roachpb.StickyBitTrigger{
		// Setting StickyBit to nil unsets the sticky bit.
Leave a similar comment here.
Reverts cockroachdb#37728. A cluster setting is too coarse to be useful: there could be instances where multiple people are working in the same cluster, and there would be conflicts over when they want their sticky bits to expire.

Release note (general change): Remove the `kv.range_merge.manual_split_ttl` setting
Force-pushed from dab6353 to fbfc6ce
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/sql/crdb_internal.go, line 1795 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm surprised that we have NULL representing the maximum split time (never merge) instead of the minimum split time (always merge). I would have thought you would go the other way with this. Keep in mind that most ranges will not be manually split, so the value here should make sense for automatic splits.
I thought it made more sense for the maximum split time to be represented as `NULL` since it's the expiration time. If there is no expiration time, I expect that the range would never expire.
pkg/sql/split.go, line 208 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
A lot of this is copied from somewhere, right? Should we try to unify the code?
Yes, this is mostly from `AS OF SYSTEM TIME`. I think it does different error checking though (i.e., `AS OF SYSTEM TIME` should only take in timestamps in the past). I'm okay with extracting this logic into a common function. Where should it go?
pkg/sql/split_test.go, line 150 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Does this need to be a `*string`?
`expiration_timestamp` could be NULL.
pkg/sql/unsplit_test.go, line 60 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Does this need to be a `*string`?
`expiration_timestamp` could be NULL.
pkg/storage/merge_queue.go, line 295 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why strip the logical component before performing the comparison? We should be able to use the `hlc.Timestamp.Less` method.
I don't see an option to compare an `hlc.Timestamp` with a `time.Time`. If you mean to use `Clock().Now()` instead of `Clock().PhysicalTime()`, I thought it would be better to avoid the mutex lock by using `Clock().PhysicalTime()`, since the logical component isn't significant in this case.
pkg/storage/replica_command.go, line 227 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think we only want to do this if the expiration timestamp is less, not if it's different.
Shouldn't we give users the ability to push expiration timestamps later? E.g., a manual load operation is taking longer than expected, so push the sticky bits further into the future.
Reviewed 50 of 50 files at r5, 40 of 40 files at r6, 10 of 10 files at r7, 5 of 5 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @jeffrey-xiao)
pkg/sql/crdb_internal.go, line 1795 at r2 (raw file):
Previously, jeffrey-xiao (Jeffrey Xiao) wrote…
I thought it made more sense for the maximum split time to be represented as `NULL` since it's the expiration time. If there is no expiration time, I expect that the range would never expire.
We'll definitely want the value of an automatic split to be `NULL` because auto-splits are so much more common than manual splits, but you might be right that "expiration time" doesn't express exactly the right semantics to allow this. What name for this column would naturally fit with a `NULL` value for an automatic split?
pkg/sql/split.go, line 208 at r3 (raw file):
Previously, jeffrey-xiao (Jeffrey Xiao) wrote…
Yes, this is mostly from `AS OF SYSTEM TIME`. I think it does different error checking though (i.e., `AS OF SYSTEM TIME` should only take in timestamps in the past). I'm okay with extracting this logic into a common function. Where should it go?
I think we could pull out just the `Datum -> hlc.Timestamp` conversion, and even that would be a nice win. Just remove the error prefix and use `errors.Wrap` at both callers. I'm not sure where a good place for this logic to live is. For now, maybe right below `EvalAsOfTimestamp`.
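A small sketch of the refactor pattern being suggested, with a toy parser standing in for the extracted `Datum -> hlc.Timestamp` helper (names and error text are illustrative, not the actual code): the helper returns a bare error, and each caller adds its own statement-specific prefix with `errors.Wrap` from github.com/pkg/errors.

```go
package main

import (
	"fmt"
	"strconv"

	"github.com/pkg/errors"
)

// parseNanos stands in for the shared conversion helper: it returns a plain
// error with no statement-specific prefix baked in.
func parseNanos(s string) (int64, error) {
	return strconv.ParseInt(s, 10, 64)
}

func main() {
	// Each caller wraps the error with its own context.
	if _, err := parseNanos("not-a-timestamp"); err != nil {
		fmt.Println(errors.Wrap(err, "SPLIT AT ... WITH EXPIRATION"))
	}
	if _, err := parseNanos("not-a-timestamp"); err != nil {
		fmt.Println(errors.Wrap(err, "AS OF SYSTEM TIME"))
	}
}
```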
pkg/sql/unsplit_test.go, line 60 at r4 (raw file):
Previously, jeffrey-xiao (Jeffrey Xiao) wrote…
`expiration_timestamp` could be NULL.
Then I think we want `sql.NullString`.
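For reference, a hedged sketch of the scan pattern `sql.NullString` enables; the connection string is a placeholder, and the column name used below is the one the review eventually settles on (`split_enforced_until`):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // any pgwire driver works against CockroachDB
)

func main() {
	// Placeholder connection string for a local insecure cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The expiration column can come back NULL (e.g. for automatic splits),
	// so scan into sql.NullString rather than a plain string or *string.
	var expiration sql.NullString
	err = db.QueryRow(
		`SELECT split_enforced_until FROM crdb_internal.ranges LIMIT 1`,
	).Scan(&expiration)
	if err != nil {
		log.Fatal(err)
	}
	if expiration.Valid {
		fmt.Println("split enforced until", expiration.String)
	} else {
		fmt.Println("no expiration (NULL)")
	}
}
```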
pkg/storage/merge_queue.go, line 295 at r2 (raw file):
If you mean to use `Clock().Now()` instead of `Clock().PhysicalTime()`
Yes, this is what I meant. I don't think we need to be worried about mutex contention for a single call to the clock in `mergeQueue.Process`. If you want though, we can pull out a `now := mq.store.Clock().Now()` variable and use it here and below so we're not adding any new clock accesses.
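A self-contained sketch of the comparison being discussed, with a local stand-in for `hlc.Timestamp` and its `Less` ordering (wall time first, then the logical component), reading the clock once into a `now` variable as suggested:

```go
package main

import (
	"fmt"
	"time"
)

// timestamp is a local stand-in for hlc.Timestamp: a wall-clock component plus
// a logical counter, ordered lexicographically by Less.
type timestamp struct {
	WallTime int64
	Logical  int32
}

func (t timestamp) Less(s timestamp) bool {
	return t.WallTime < s.WallTime ||
		(t.WallTime == s.WallTime && t.Logical < s.Logical)
}

func main() {
	// Read the clock once and reuse it for every comparison in this pass,
	// instead of stripping the sticky bit down to physical time.
	now := timestamp{WallTime: time.Now().UnixNano()}
	stickyBit := timestamp{WallTime: time.Now().Add(time.Hour).UnixNano()}

	if now.Less(stickyBit) {
		fmt.Println("manual split has not expired; skip merging this range")
	}
}
```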
pkg/storage/replica_command.go, line 227 at r2 (raw file):
Previously, jeffrey-xiao (Jeffrey Xiao) wrote…
Shouldn't we give users the ability to push expiration timestamps later? E.g., a manual load operation is taking longer than expected, so push the sticky bits further into the future.
My wording was ambiguous, but that's exactly what I was trying to allow. What I don't think we should allow is for a user to move an expiration timestamp earlier, because then the two `ADMIN SPLIT WITH EXPIRATION` statements aren't commutative.
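To make the commutativity argument concrete, here is a tiny sketch (plain nanosecond integers stand in for `hlc.Timestamp`): if an update only ever moves the expiration forward, applying two `ADMIN SPLIT ... WITH EXPIRATION` requests in either order leaves the same sticky bit behind.

```go
package main

import "fmt"

// mergeExpiration keeps the later of the existing and requested expirations,
// i.e. an update can push the expiration forward but never pull it earlier.
func mergeExpiration(existing, requested int64) int64 {
	if requested > existing {
		return requested
	}
	return existing
}

func main() {
	a, b := int64(100), int64(200)
	ab := mergeExpiration(mergeExpiration(0, a), b)
	ba := mergeExpiration(mergeExpiration(0, b), a)
	fmt.Println(ab == ba) // true: the two requests commute
}
```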
Force-pushed from e7f1efd to 812cac2
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/sql/crdb_internal.go, line 1795 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We'll definitely want the value of an automatic split to be `NULL` because auto-splits are so much more common than manual splits, but you might be right that "expiration time" doesn't express exactly the right semantics to allow this. What name for this column would naturally fit with a `NULL` value for an automatic split?
How about something like `time_to_live` or `lifetime`?
pkg/storage/replica_command.go, line 227 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
My wording was ambiguous, but that's exactly what I was trying to allow. What I don't think we should allow is for a user to move an expiration timestamp earlier, because then the two `ADMIN SPLIT WITH EXPIRATION` statements aren't commutative.
Okay, that makes sense.
Force-pushed from e5981c6 to 9071a9a
Reviewed 4 of 16 files at r9.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)
pkg/sql/crdb_internal.go, line 1795 at r2 (raw file):
Previously, jeffrey-xiao (Jeffrey Xiao) wrote…
How about something like `time_to_live` or `lifetime`?
We decided on `split_enforced_until` offline.
Release note: None
Force-pushed from d685667 to 2185bbc
bors r+
38004: storage/sql: change sticky bit to represent time of expiration r=jeffrey-xiao a=jeffrey-xiao

This PR does the following:
- Revert the cluster setting for the manual split TTL. It was agreed that this is probably too coarse to be useful.
- Change the sticky bit to represent the time of expiration instead of the time of split.
- The expiration time displayed in `crdb_internal.ranges` is `NULL` if the range is permanent (i.e. has a sticky bit of `hlc.MaxTimestamp`); otherwise it is the expiration time. This means that for static splits and splits performed by the split queue, it is displayed as `1970-01-01 00:00:00+00:00` (`hlc.Timestamp{}`).
- Add a `WITH EXPIRATION` option to `SPLIT AT` that accepts the same values as `AS OF SYSTEM TIME`.

Co-authored-by: Jeffrey Xiao <[email protected]>
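As a usage sketch of the new syntax described above (the table name, split value, timestamp literal, and connection string are placeholders; any pgwire driver works):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder connection string for a local insecure cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Split table t at key 100 and keep the split enforced until the given
	// time; WITH EXPIRATION accepts the same values as AS OF SYSTEM TIME.
	_, err = db.Exec(
		`ALTER TABLE t SPLIT AT VALUES (100) WITH EXPIRATION '2030-01-01 00:00:00'`)
	if err != nil {
		log.Fatal(err)
	}
}
```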
Build succeeded
38302: storage: CPut bytes instead of a proto when updating RangeDescriptors r=tbg a=danhhz

In #38147, a new non-nullable field was added to the ReplicaDescriptor proto that is serialized inside RangeDescriptors. RangeDescriptors are updated using CPut to detect races. This means that if a RangeDescriptor had been written by an old version of cockroach and we then attempt to update it, the CPut will fail because the encoded version of it is different (non-nullable proto2 fields are always included). A similar issue was introduced in #38004, which made the StickyBit field on RangeDescriptor non-nullable.

We could keep the fields as nullable, but this has allocation costs (and is also more annoying to work with). Worse, the proto spec is pretty explicit about not relying on serialization of a given message to always produce exactly the same bytes. From https://developers.google.com/protocol-buffers/docs/encoding#implications: "Do not assume the byte output of a serialized message is stable."

So instead, we've decided to stop CPut-ing protos and to change the interface of CPut to take the expected value as bytes. To CPut a proto, the encoded value will be read from kv, passed around with the decoded struct, and used in the eventual CPut. This work is upcoming, but the above two PRs are real breakages, so this commit fixes those first.

Neither of these PRs made the last alpha, so no release note.

Touches #38308
Closes #38183

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
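A toy, self-contained sketch of the approach described above: a bytes-based conditional put where the caller hands back exactly the bytes it read, so no re-encoding of a proto is involved in the comparison (the store, keys, and values here are all made up for illustration).

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

var errConditionFailed = errors.New("condition failed")

// kv is a toy store with a bytes-based conditional put: the expected value is
// compared as raw bytes, never re-derived by re-encoding a proto.
type kv struct{ data map[string][]byte }

func (s *kv) get(key string) []byte { return s.data[key] }

func (s *kv) cput(key string, value, expected []byte) error {
	if !bytes.Equal(s.data[key], expected) {
		return errConditionFailed
	}
	s.data[key] = value
	return nil
}

func main() {
	store := &kv{data: map[string][]byte{
		// Bytes written by an "old" node; a newer node re-encoding the same
		// logical descriptor might not reproduce them exactly.
		"range-desc": []byte("old-encoding"),
	}}

	// Read the encoded value, carry the exact bytes alongside the decoded
	// struct, and hand those bytes back as the CPut expectation.
	expected := store.get("range-desc")
	if err := store.cput("range-desc", []byte("new-encoding"), expected); err != nil {
		fmt.Println("unexpected:", err)
		return
	}
	fmt.Println("descriptor updated without a spurious condition failure")
}
```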
Friendly reminder to stick the docs-todo label on PRs that need docs.
39714: client: force CPut callers to use a *roachpb.Value expected value r=tbg a=danhhz

As seen in #38004 and #38147, this first came up in the context of CPut with a proto expected value. If the serialization didn't exactly match, the CPut would get a condition failed error, even if the two protos were logically equivalent (missing field vs field with default value). We were hitting this in a mixed-version cluster when trying to add a non-nullable field to RangeDescriptor, which is updated via CPut to detect update races. The new fields ended up having to be nullable, which has allocation consequences as well as ergonomic ones.

In #38302, we changed the RangeDescriptor CPut to use a *roachpb.Value exactly as it was read from KV for the expected value, but no other callsites were updated. To prevent future issues, this commit changes the CPut signature to accept a *roachpb.Value for the expected value instead of an interface{}. The old method signature is left as CPutDeprecated for now. The CPut usages that are trivial to switch are switched and ones that require any thought are left for a followup PR.

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>