-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: add support for Lease.MinExpiration #125235
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
A-leader-leases
Related to the introduction of leader leases
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
Comments
nvanbenschoten
added
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
A-kv-replication
Relating to Raft, consensus, and coordination.
T-kv
KV Team
A-leader-leases
Related to the introduction of leader leases
labels
Jun 6, 2024
cc @cockroachdb/replication |
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jun 10, 2024
Fixes cockroachdb#124693. Fixes cockroachdb#125287. This commit adds logic to retry the synchronous liveness heartbeat which ensures that the liveness record has a later expiration than the prior lease by the time the lease promotion goes into effect, which was added in cockroachdb#123442. This heartbeat may need to be retried because it may have raced with another liveness heartbeat which did not extend the liveness expiration far enough. We opt to allow this race and retry across it instead of detecting it and returning an error from `NodeLiveness.Heartbeat` because: 1. returning an error would have a larger blast radius and could cause other issues. 2. returning an error wouldn't actually fix the tests that are failing, because they would still get an error, just a different one. Before this commit, `TestLeaseQueueProactiveEnqueueOnPreferences` would hit this case fail under deadlock and stress every ~150 iterations. After this commit, the test passes continuously under deadlock stress for over 2000 runs. This makes cockroachdb#123442 even uglier. The nicer solution is cockroachdb#125235, but that is not backportable. Still, we're planning to address that issue as part of the leader leases work, so this is a temporary fix. This also removes a TODO added in 1dc18df. As mentioned above, we don't address it, but instead document the behavior. Release note (bug fix): resolves a concerning log message that says "expiration of liveness record ... is not greater than expiration of the previous lease ... after liveness heartbeat". This message is no longer possible.
craig bot
pushed a commit
that referenced
this issue
Jun 10, 2024
124885: kv: retry liveness heartbeat on race with insufficient expiration r=nvanbenschoten a=nvanbenschoten Fixes #124693. Fixes #125287. This commit adds logic to retry the synchronous liveness heartbeat which ensures that the liveness record has a later expiration than the prior lease by the time the lease promotion goes into effect, which was added in #123442. This heartbeat may need to be retried because it may have raced with another liveness heartbeat which did not extend the liveness expiration far enough. We opt to allow this race and retry across it instead of detecting it and returning an error from `NodeLiveness.Heartbeat` because: 1. returning an error would have a larger blast radius and could cause other issues. 2. returning an error wouldn't actually fix the tests that are failing, because they would still get an error, just a different one. Before this commit, `TestLeaseQueueProactiveEnqueueOnPreferences` would hit this case fail under deadlock and stress every ~150 iterations. After this commit, the test passes continuously under deadlock stress for over 2000 runs. This makes #123442 even uglier. The nicer solution is #125235, but that is not backportable. Still, we're planning to address that issue as part of the leader leases work, so this is a temporary fix. This also removes a TODO added in 1dc18df. As mentioned above, we don't address it, but instead document the behavior. Release note (bug fix): resolves a concerning log message that says "expiration of liveness record ... is not greater than expiration of the previous lease ... after liveness heartbeat". This message is no longer possible. 125309: streamingccl: fix metric name r=msbutler a=stevendanna This metric was re-using the name of another metric. Epic: none Release note: None 125369: streamproducer: add emit_wait and produce_wait to vtable r=rickystewart a=dt Release note: none. Epic: none. 125411: streamingccl: cleanup unit test terminology r=stevendanna a=stevendanna Mixing source/target with serverA/serverB was a bit confusing. Epic: none Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Steven Danna <[email protected]> Co-authored-by: David Taylor <[email protected]>
blathers-crl bot
pushed a commit
that referenced
this issue
Jun 10, 2024
Fixes #124693. Fixes #125287. This commit adds logic to retry the synchronous liveness heartbeat which ensures that the liveness record has a later expiration than the prior lease by the time the lease promotion goes into effect, which was added in #123442. This heartbeat may need to be retried because it may have raced with another liveness heartbeat which did not extend the liveness expiration far enough. We opt to allow this race and retry across it instead of detecting it and returning an error from `NodeLiveness.Heartbeat` because: 1. returning an error would have a larger blast radius and could cause other issues. 2. returning an error wouldn't actually fix the tests that are failing, because they would still get an error, just a different one. Before this commit, `TestLeaseQueueProactiveEnqueueOnPreferences` would hit this case fail under deadlock and stress every ~150 iterations. After this commit, the test passes continuously under deadlock stress for over 2000 runs. This makes #123442 even uglier. The nicer solution is #125235, but that is not backportable. Still, we're planning to address that issue as part of the leader leases work, so this is a temporary fix. This also removes a TODO added in 1dc18df. As mentioned above, we don't address it, but instead document the behavior. Release note (bug fix): resolves a concerning log message that says "expiration of liveness record ... is not greater than expiration of the previous lease ... after liveness heartbeat". This message is no longer possible.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jun 26, 2024
Informs cockroachdb#125235. This commit adds a new MinExpiration field to the Lease struct. This field defines the minimum expiration at which the lease expires, independent of any other expiry condition. This field can be used to place a floor on the expiration for epoch-based leases and leader leases (not yet implemented) to prevent expiration regressions when upgrading from an expiration-based lease. The field is not yet used. Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jun 26, 2024
Informs cockroachdb#125235. This commit adds the new MinExpiration field to the range debug page. Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jul 1, 2024
Informs cockroachdb#125235. This commit adds a new MinExpiration field to the Lease struct. This field defines the minimum expiration at which the lease expires, independent of any other expiry condition. This field can be used to place a floor on the expiration for epoch-based leases and leader leases (not yet implemented) to prevent expiration regressions when upgrading from an expiration-based lease. The field is not yet used. Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jul 1, 2024
Informs cockroachdb#125235. This commit adds the new MinExpiration field to the range debug page. Release note: None
craig bot
pushed a commit
that referenced
this issue
Jul 1, 2024
126202: roachpb: add MinExpiration field to Lease r=arulajmani a=nvanbenschoten First half of #125235. This commit adds a new `MinExpiration` field to the `Lease` struct. This field defines the minimum expiration at which the lease expires, independent of any other expiry condition. This field can be used to place a floor on the expiration for epoch-based leases and leader leases (not yet implemented) to prevent expiration regressions when upgrading from an expiration-based lease. The field is not yet used. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jul 2, 2024
Closes cockroachdb#125235. This commit begins using the new `Lease.MinExpiration` field, which was introduced in 7781cfa. `Lease.MinExpiration` is assigned to epoch-based leases (and soon, leader leases) during upgrades from expiration-based lease. It is a simple, direct way to ensure that the expiration of the lease does not regress during such promotions. We previously solved this issue in 6dd54b4 by manually heartbeating the node liveness record when we detected an expiration regression. This was backwards compatible, so it was a better backport target, but it was also complex, as demonstrated by the need for b2ac52a. In the future, we will be able to remove the manual heartbeat logic, as it is no longer necessary once MinTimestamp is supported by all clusters. Epic: None
craig bot
pushed a commit
that referenced
this issue
Jul 8, 2024
126336: sql: override StmtTimeout to 0 for background jobs r=annrpom a=annrpom For background jobs and schema changes, our Internal Executor inherits from cluster settings. If a statement timeout is set cluster-wide, then background jobs will respect this -- which does not make much sense; thus, this patch overrides the StmtTimeout to 0s for background jobs. Fixes: #126261 Release note (bug fix): Background jobs no longer respect a statement timeout. 126419: ui: fix query fetching store ids for db/tables r=xinhaoz a=xinhaoz These queries were updated in #118904 to make them more performant. In that change we mistakenly removed the null check before reading the `rows` field, causing pages using this request to crash if this query fails. Epic: none Fixes: #126348 Release note (bug fix): Database overview and db details pages should not crash if the range information is not available. 126575: kv: add support for Lease.MinExpiration r=nvanbenschoten a=nvanbenschoten Closes #125235. This commit begins using the new `Lease.MinExpiration` field, which was introduced in 7781cfa. `Lease.MinExpiration` is assigned to epoch-based leases (and soon, leader leases) during upgrades from expiration-based lease. It is a simple, direct way to ensure that the expiration of the lease does not regress during such promotions. We previously solved this issue in 6dd54b4 by manually heartbeating the node liveness record when we detected an expiration regression. This was backwards compatible, so it was a better backport target, but it was also complex, as demonstrated by the need for b2ac52a. In the future, we will be able to remove the manual heartbeat logic, as it is no longer necessary once MinTimestamp is supported by all clusters. Epic: None Co-authored-by: Annie Pompa <[email protected]> Co-authored-by: Xin Hao Zhang <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
asg0451
pushed a commit
to asg0451/cockroach
that referenced
this issue
Jul 10, 2024
Closes cockroachdb#125235. This commit begins using the new `Lease.MinExpiration` field, which was introduced in 7781cfa. `Lease.MinExpiration` is assigned to epoch-based leases (and soon, leader leases) during upgrades from expiration-based lease. It is a simple, direct way to ensure that the expiration of the lease does not regress during such promotions. We previously solved this issue in 6dd54b4 by manually heartbeating the node liveness record when we detected an expiration regression. This was backwards compatible, so it was a better backport target, but it was also complex, as demonstrated by the need for b2ac52a. In the future, we will be able to remove the manual heartbeat logic, as it is no longer necessary once MinTimestamp is supported by all clusters. Epic: None
nvanbenschoten
added a commit
that referenced
this issue
Aug 29, 2024
Fixes #124693. Fixes #125287. This commit adds logic to retry the synchronous liveness heartbeat which ensures that the liveness record has a later expiration than the prior lease by the time the lease promotion goes into effect, which was added in #123442. This heartbeat may need to be retried because it may have raced with another liveness heartbeat which did not extend the liveness expiration far enough. We opt to allow this race and retry across it instead of detecting it and returning an error from `NodeLiveness.Heartbeat` because: 1. returning an error would have a larger blast radius and could cause other issues. 2. returning an error wouldn't actually fix the tests that are failing, because they would still get an error, just a different one. Before this commit, `TestLeaseQueueProactiveEnqueueOnPreferences` would hit this case fail under deadlock and stress every ~150 iterations. After this commit, the test passes continuously under deadlock stress for over 2000 runs. This makes #123442 even uglier. The nicer solution is #125235, but that is not backportable. Still, we're planning to address that issue as part of the leader leases work, so this is a temporary fix. This also removes a TODO added in 1dc18df. As mentioned above, we don't address it, but instead document the behavior. Release note (bug fix): resolves a concerning log message that says "expiration of liveness record ... is not greater than expiration of the previous lease ... after liveness heartbeat". This message is no longer possible.
celiala
pushed a commit
that referenced
this issue
Sep 12, 2024
Fixes #124693. Fixes #125287. This commit adds logic to retry the synchronous liveness heartbeat which ensures that the liveness record has a later expiration than the prior lease by the time the lease promotion goes into effect, which was added in #123442. This heartbeat may need to be retried because it may have raced with another liveness heartbeat which did not extend the liveness expiration far enough. We opt to allow this race and retry across it instead of detecting it and returning an error from `NodeLiveness.Heartbeat` because: 1. returning an error would have a larger blast radius and could cause other issues. 2. returning an error wouldn't actually fix the tests that are failing, because they would still get an error, just a different one. Before this commit, `TestLeaseQueueProactiveEnqueueOnPreferences` would hit this case fail under deadlock and stress every ~150 iterations. After this commit, the test passes continuously under deadlock stress for over 2000 runs. This makes #123442 even uglier. The nicer solution is #125235, but that is not backportable. Still, we're planning to address that issue as part of the leader leases work, so this is a temporary fix. This also removes a TODO added in 1dc18df. As mentioned above, we don't address it, but instead document the behavior. Release note (bug fix): resolves a concerning log message that says "expiration of liveness record ... is not greater than expiration of the previous lease ... after liveness heartbeat". This message is no longer possible.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Sep 12, 2024
Fixes cockroachdb#124693. Fixes cockroachdb#125287. This commit adds logic to retry the synchronous liveness heartbeat which ensures that the liveness record has a later expiration than the prior lease by the time the lease promotion goes into effect, which was added in cockroachdb#123442. This heartbeat may need to be retried because it may have raced with another liveness heartbeat which did not extend the liveness expiration far enough. We opt to allow this race and retry across it instead of detecting it and returning an error from `NodeLiveness.Heartbeat` because: 1. returning an error would have a larger blast radius and could cause other issues. 2. returning an error wouldn't actually fix the tests that are failing, because they would still get an error, just a different one. Before this commit, `TestLeaseQueueProactiveEnqueueOnPreferences` would hit this case fail under deadlock and stress every ~150 iterations. After this commit, the test passes continuously under deadlock stress for over 2000 runs. This makes cockroachdb#123442 even uglier. The nicer solution is cockroachdb#125235, but that is not backportable. Still, we're planning to address that issue as part of the leader leases work, so this is a temporary fix. This also removes a TODO added in 1dc18df. As mentioned above, we don't address it, but instead document the behavior. Release note (bug fix): resolves a concerning log message that says "expiration of liveness record ... is not greater than expiration of the previous lease ... after liveness heartbeat". This message is no longer possible.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
A-leader-leases
Related to the introduction of leader leases
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
Part of #123847.
This new field can be used to epoch-based leases as well, as long as it is put behind a cluster version gate.
Jira issue: CRDB-39328
Epic CRDB-37522
The text was updated successfully, but these errors were encountered: