Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage/sql: change sticky bit to represent time of expiration #38004

Merged
merged 4 commits into from
Jun 6, 2019

Conversation

jeffrey-xiao
Copy link
Contributor

This PR does the following:

  • Revert the cluster setting for manual split ttl. It was agreed that this is probably too coarse to be useful.
  • Change sticky bit to represent time of expiration instead of time of split.
  • The expiration time displayed in crdb_internal.ranges is NULL if the range is permanent (I.E. has sticky bit of hlc.MaxTimestamp), otherwise it is the expiration time. This means that for static splits and splits performed by the split queue, it is displayed as 1970-01-01 00:00:00+00:00 (hlc.Timestamp{}).
  • Add WITH EXPIRATION option to SPLIT AT that accepts the same values as AS OF SYSTEM TIME.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jeffrey-xiao jeffrey-xiao force-pushed the sticky-bit-expiration branch 4 times, most recently from b6d20b4 to dab6353 Compare June 4, 2019 17:14
@jeffrey-xiao jeffrey-xiao marked this pull request as ready for review June 4, 2019 17:56
@jeffrey-xiao jeffrey-xiao requested review from a team June 4, 2019 17:56
@jeffrey-xiao jeffrey-xiao requested a review from a team as a code owner June 4, 2019 17:56
@jeffrey-xiao jeffrey-xiao requested review from a team June 4, 2019 17:56
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, 40 of 40 files at r2, 10 of 10 files at r3, 5 of 5 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @jeffrey-xiao)


pkg/roachpb/api.proto, line 669 at r2 (raw file):

  RequestHeader header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true];
  bytes split_key = 2 [(gogoproto.casttype) = "Key"];
  util.hlc.Timestamp expiration_time = 3 [(gogoproto.nullable) = false];

I think we're going to need to reserve 3; and use 4 here, otherwise, this change will be incompatible with the previous alpha. We probably don't need a full migration here, but lets at least ensure that the migration doesn't crash a node.


pkg/sql/crdb_internal.go, line 1795 at r2 (raw file):

			expirationTime := tree.DNull
			if !desc.StickyBit.Equal(hlc.MaxTimestamp) {

I'm surprised that we have NULL representing the maximum split time (never merge) instead of the minimum split time (always merge). I would have thought you would go the other way with this. Keep in mind that most ranges will not be manually split, so the value here should make sense for automatic splits.


pkg/sql/split.go, line 117 at r3 (raw file):

func (n *splitNode) startExec(params runParams) error {
	stickyBitEnabled := params.EvalContext().Settings.Version.IsActive(cluster.VersionStickyBit)
	// TODO(jeffreyxiao): Remove this error in v20.1.

Could you add to this comment that we can also remove splitNode.Force and experimental_force_split_at in v20.1?


pkg/sql/split.go, line 191 at r3 (raw file):

// the expiration time of the split.
func parseExpirationTime(
	expireExpr tree.Expr, semaCtx *tree.SemaContext, evalCtx *tree.EvalContext,

nit: put expireExpr as the last arg.


pkg/sql/split.go, line 208 at r3 (raw file):

	var convErr error
	switch d := d.(type) {
	case *tree.DString:

A lot of this is copied from somewhere, right? Should we try to unify the code?


pkg/sql/split_test.go, line 150 at r4 (raw file):

		var key roachpb.Key
		var pretty string
		var expirationTimestamp *string

Does this need to be a *string?


pkg/sql/unsplit_test.go, line 60 at r4 (raw file):

		var key roachpb.Key
		var pretty string
		var expirationTimestamp *string

Does this need to be a *string?


pkg/storage/merge_queue.go, line 295 at r2 (raw file):

	// Range was manually split and not expired, so skip merging.
	if rhsDesc.StickyBit.GoTime().After(mq.store.Clock().PhysicalTime()) {

Why strip the logical component before performing the comparison? We should be able to use the hlc.Timestamp.Less method.


pkg/storage/replica_command.go, line 227 at r2 (raw file):

		log.Event(ctx, "range already split")
		// Even if the range is already split, we should still update the sticky
		// bit if it is different.

I think we only want to do this if the expiration timestamp is less, not if it's different.


pkg/storage/replica_command.go, line 397 at r2 (raw file):

	// unsplit command as a no-op and return success instead of throwing an
	// error.
	if desc.StickyBit.Equal(hlc.Timestamp{}) {

nit: No need for Equal here, hlc.Timestamp is comparable so == will work.


pkg/storage/replica_command.go, line 420 at r2 (raw file):

			InternalCommitTrigger: &roachpb.InternalCommitTrigger{
				StickyBitTrigger: &roachpb.StickyBitTrigger{
					// Setting StickyBit to nil unsets the sticky bit.

Leave a similar comment here.

Reverts cockroachdb#37728. A cluster setting is coarse to be useful. There could be
instances where multiple people are working in the same cluster and
there would be conflicts on when they want their sticky bits to expire.

Release note(general change): Remove kv.range_merge.manual_split_ttl
                              setting
@jeffrey-xiao jeffrey-xiao force-pushed the sticky-bit-expiration branch from dab6353 to fbfc6ce Compare June 4, 2019 20:28
Copy link
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/sql/crdb_internal.go, line 1795 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm surprised that we have NULL representing the maximum split time (never merge) instead of the minimum split time (always merge). I would have thought you would go the other way with this. Keep in mind that most ranges will not be manually split, so the value here should make sense for automatic splits.

I thought it made more sense for the maximum split time to be represented as NULL since it's the expiration time. If there is no expiration time, I expect that the range would never expire.


pkg/sql/split.go, line 208 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

A lot of this is copied from somewhere, right? Should we try to unify the code?

Yes, this is mostly from AS OF SYSTEM TIME. I think it does different error checking though (I.E. AS OF SYSTEM TIME should only take in timestamps in the past). I'm okay with extracting this logic into a common function. Where should it go?


pkg/sql/split_test.go, line 150 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does this need to be a *string?

expiration_timestamp could be NULL.


pkg/sql/unsplit_test.go, line 60 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does this need to be a *string?

expiration_timestamp could be NULL.


pkg/storage/merge_queue.go, line 295 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why strip the logical component before performing the comparison? We should be able to use the hlc.Timestamp.Less method.

I don't see an option to compare a hlc.Timestamp with a time.Time. If you mean to use Clock().Now() instead of Clock().PhysicalTime(), I thought it would better to avoid the mutex lock in Clock().PhysicalTime(), since the logical component isn't significant in this case.


pkg/storage/replica_command.go, line 227 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think we only want to do this if the expiration timestamp is less, not if it's different.

Shouldn't we give users the ability to push expiration timestamps later? E.G. manual load operation is taking longer than expected, so push the sticky bits in the future.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 50 of 50 files at r5, 40 of 40 files at r6, 10 of 10 files at r7, 5 of 5 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @jeffrey-xiao)


pkg/sql/crdb_internal.go, line 1795 at r2 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

I thought it made more sense for the maximum split time to be represented as NULL since it's the expiration time. If there is no expiration time, I expect that the range would never expire.

We'll definitely want the value of an automatic split to be NULL because auto-splits are so much more common than manual splits, but you might be right that "expiration time" doesn't express exactly the right semantics to allow this. What name for this column would naturally fit with a NULL value for an automatic split?


pkg/sql/split.go, line 208 at r3 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

Yes, this is mostly from AS OF SYSTEM TIME. I think it does different error checking though (I.E. AS OF SYSTEM TIME should only take in timestamps in the past). I'm okay with extracting this logic into a common function. Where should it go?

I think we could pull out just the Datum -> hlc.Timestamp conversion and even that would be a nice win. Just remove the error prefix and use errors.Wrap at both callers. I'm not sure where a good place for this logic to live is. For now, maybe right below EvalAsOfTimestamp.


pkg/sql/unsplit_test.go, line 60 at r4 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

expiration_timestamp could be NULL.

Then I think we want sql.NullString.


pkg/storage/merge_queue.go, line 295 at r2 (raw file):

If you mean to use Clock().Now() instead of Clock().PhysicalTime()

Yes, this is what I meant. I don't think we need to be worried about mutex contention for a single call to the clock in mergeQueue.Process. If you want though, we can pull out a now := mq.store.Clock().Now() variable and use it here and below so we're not adding any new clock accesses.


pkg/storage/replica_command.go, line 227 at r2 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

Shouldn't we give users the ability to push expiration timestamps later? E.G. manual load operation is taking longer than expected, so push the sticky bits in the future.

My wording was ambiguous, but that's exactly what I was trying to allow. What I don't think we should allow if for a user to move an expiration timestamp earlier, because then the two ADMIN SPLIT WITH EXPIRATION statements aren't commutative.

@jeffrey-xiao jeffrey-xiao force-pushed the sticky-bit-expiration branch 2 times, most recently from e7f1efd to 812cac2 Compare June 5, 2019 16:56
Copy link
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/sql/crdb_internal.go, line 1795 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We'll definitely want the value of an automatic split to be NULL because auto-splits are so much more common than manual splits, but you might be right that "expiration time" doesn't express exactly the right semantics to allow this. What name for this column would naturally fit with a NULL value for an automatic split?

How about something like time_to_live or lifetime?


pkg/storage/replica_command.go, line 227 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

My wording was ambiguous, but that's exactly what I was trying to allow. What I don't think we should allow if for a user to move an expiration timestamp earlier, because then the two ADMIN SPLIT WITH EXPIRATION statements aren't commutative.

Okay that makes sense.

@jeffrey-xiao jeffrey-xiao force-pushed the sticky-bit-expiration branch 5 times, most recently from e5981c6 to 9071a9a Compare June 5, 2019 20:20
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 16 files at r9.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)


pkg/sql/crdb_internal.go, line 1795 at r2 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

How about something like time_to_live or lifetime?

We decided on split_enforced_until offline.

@jeffrey-xiao jeffrey-xiao force-pushed the sticky-bit-expiration branch from d685667 to 2185bbc Compare June 6, 2019 15:36
@jeffrey-xiao
Copy link
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request Jun 6, 2019
38004: storage/sql: change sticky bit to represent time of expiration r=jeffrey-xiao a=jeffrey-xiao

This PR does the following:

- Revert the cluster setting for manual split ttl. It was agreed that this is probably too coarse to be useful.
- Change sticky bit to represent time of expiration instead of time of split.
- The expiration time displayed in `crdb_internal.ranges` is `NULL` if the range is permanent (I.E. has sticky bit of `hlc.MaxTimestamp`), otherwise it is the expiration time. This means that for static splits and splits performed by the split queue, it is displayed as `1970-01-01 00:00:00+00:00` (`hlc.Timestamp{}`).
- Add `WITH EXPIRATION` option to `SPLIT AT` that accepts the same values as `AS OF SYSTEM TIME`.

Co-authored-by: Jeffrey Xiao <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jun 6, 2019

Build succeeded

@craig craig bot merged commit 2185bbc into cockroachdb:master Jun 6, 2019
@jeffrey-xiao jeffrey-xiao deleted the sticky-bit-expiration branch June 6, 2019 17:34
danhhz added a commit to danhhz/cockroach that referenced this pull request Jun 19, 2019
In cockroachdb#38147, a new non-nullable field was added to the ReplicaDescriptor
proto that is serialized inside RangeDescriptors. RangeDescriptors are
updated using CPut to detect races. This means that if a RangeDescriptor
had been written by an old version of cockroach and we then attempt to
update it, the CPut will fail because the encoded version of it is
different (non-nullable proto2 fields are always included).

A similar issue was introduced in cockroachdb#38004 which made the StickyBit field
on RangeDescriptor non-nullable.

We could keep the fields as nullable, but this has allocation costs (and
is also more annoying to work with).

Worse, the proto spec is pretty explicit about not relying on
serialization of a given message to always produce exactly the same
bytes:

    From https://developers.google.com/protocol-buffers/docs/encoding#implications
    Do not assume the byte output of a serialized message is stable.

So instead, we've decided to stop CPut-ing protos and to change the
interface of CPut to take the expected value as bytes. To CPut a proto,
the encoded value will be read from kv, passed around with the decoded
struct, and used in the eventual CPut. This work is upcoming, but the
above two PRs are real breakages, so this commit fixes those first.

Neither of these PRs made the last alpha so no release note.

Release note: None
danhhz added a commit to danhhz/cockroach that referenced this pull request Jun 19, 2019
In cockroachdb#38147, a new non-nullable field was added to the ReplicaDescriptor
proto that is serialized inside RangeDescriptors. RangeDescriptors are
updated using CPut to detect races. This means that if a RangeDescriptor
had been written by an old version of cockroach and we then attempt to
update it, the CPut will fail because the encoded version of it is
different (non-nullable proto2 fields are always included).

A similar issue was introduced in cockroachdb#38004 which made the StickyBit field
on RangeDescriptor non-nullable.

We could keep the fields as nullable, but this has allocation costs (and
is also more annoying to work with).

Worse, the proto spec is pretty explicit about not relying on
serialization of a given message to always produce exactly the same
bytes:

    From https://developers.google.com/protocol-buffers/docs/encoding#implications
    Do not assume the byte output of a serialized message is stable.

So instead, we've decided to stop CPut-ing protos and to change the
interface of CPut to take the expected value as bytes. To CPut a proto,
the encoded value will be read from kv, passed around with the decoded
struct, and used in the eventual CPut. This work is upcoming, but the
above two PRs are real breakages, so this commit fixes those first.

Neither of these PRs made the last alpha so no release note.

Touches cockroachdb#38308

Release note: None
danhhz added a commit to danhhz/cockroach that referenced this pull request Jun 20, 2019
In cockroachdb#38147, a new non-nullable field was added to the ReplicaDescriptor
proto that is serialized inside RangeDescriptors. RangeDescriptors are
updated using CPut to detect races. This means that if a RangeDescriptor
had been written by an old version of cockroach and we then attempt to
update it, the CPut will fail because the encoded version of it is
different (non-nullable proto2 fields are always included).

A similar issue was introduced in cockroachdb#38004 which made the StickyBit field
on RangeDescriptor non-nullable.

We could keep the fields as nullable, but this has allocation costs (and
is also more annoying to work with).

Worse, the proto spec is pretty explicit about not relying on
serialization of a given message to always produce exactly the same
bytes:

    From https://developers.google.com/protocol-buffers/docs/encoding#implications
    Do not assume the byte output of a serialized message is stable.

So instead, we've decided to stop CPut-ing protos and to change the
interface of CPut to take the expected value as bytes. To CPut a proto,
the encoded value will be read from kv, passed around with the decoded
struct, and used in the eventual CPut. This work is upcoming, but the
above two PRs are real breakages, so this commit fixes those first.

Neither of these PRs made the last alpha so no release note.

Touches cockroachdb#38308

Release note: None
craig bot pushed a commit that referenced this pull request Jun 20, 2019
38302: storage: CPut bytes instead of a proto when updating RangeDescriptors r=tbg a=danhhz

In #38147, a new non-nullable field was added to the ReplicaDescriptor
proto that is serialized inside RangeDescriptors. RangeDescriptors are
updated using CPut to detect races. This means that if a RangeDescriptor
had been written by an old version of cockroach and we then attempt to
update it, the CPut will fail because the encoded version of it is
different (non-nullable proto2 fields are always included).

A similar issue was introduced in #38004 which made the StickyBit field
on RangeDescriptor non-nullable.

We could keep the fields as nullable, but this has allocation costs (and
is also more annoying to work with).

Worse, the proto spec is pretty explicit about not relying on
serialization of a given message to always produce exactly the same
bytes:

    From https://developers.google.com/protocol-buffers/docs/encoding#implications
    Do not assume the byte output of a serialized message is stable.

So instead, we've decided to stop CPut-ing protos and to change the
interface of CPut to take the expected value as bytes. To CPut a proto,
the encoded value will be read from kv, passed around with the decoded
struct, and used in the eventual CPut. This work is upcoming, but the
above two PRs are real breakages, so this commit fixes those first.

Neither of these PRs made the last alpha so no release note.

Touches #38308
Closes #38183

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
@andreimatei
Copy link
Contributor

friendly reminder to stick the docs-todo label on PRs that need docs.

danhhz added a commit to danhhz/cockroach that referenced this pull request Aug 16, 2019
As seen in cockroachdb#38004 and cockroachdb#38147, this first came up in the context of CPut
with a proto expected value. If the serialization didn't exactly match,
the CPut would get a condition failed error, even if the two protos were
logically equivalent (missing field vs field with default value). We
were hitting this in a mixed version cluster when trying to add a
non-nullable field to RangeDescriptor, which is updated via CPut to
detect update races. The new fields ended up having to be nullable,
which has allocation consequences as well as ergonomic ones.

In cockroachdb#38302, we changed the RangeDescriptor CPut to use a *roachpb.Value
exactly as it was read from KV for the expected value, but no other
callsites were updated. To prevent future issues, this commit changes
the CPut signature to accept a *roachpb.Value for the expected value
instead of an interface{}.

The old method signature is left as CPutDeprecated for now. The CPut
usages that are trivial to switch are switched and ones that require any
thought are left for a followup PR.

Release note: None
craig bot pushed a commit that referenced this pull request Aug 20, 2019
39714: client: force CPut callers to use a *roachpb.Value expected value r=tbg a=danhhz

As seen in #38004 and #38147, this first came up in the context of CPut
with a proto expected value. If the serialization didn't exactly match,
the CPut would get a condition failed error, even if the two protos were
logically equivalent (missing field vs field with default value). We
were hitting this in a mixed version cluster when trying to add a
non-nullable field to RangeDescriptor, which is updated via CPut to
detect update races. The new fields ended up having to be nullable,
which has allocation consequences as well as ergonomic ones.

In #38302, we changed the RangeDescriptor CPut to use a *roachpb.Value
exactly as it was read from KV for the expected value, but no other
callsites were updated. To prevent future issues, this commit changes
the CPut signature to accept a *roachpb.Value for the expected value
instead of an interface{}.

The old method signature is left as CPutDeprecated for now. The CPut
usages that are trivial to switch are switched and ones that require any
thought are left for a followup PR.

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants