kvnemesis: toggle global_reads attribute in zone configs #63747
Merged
craig merged 1 commit into cockroachdb:master from nvanbenschoten:nvanbenschoten/kvnemesisGlobal on Apr 27, 2021
Conversation
nvanbenschoten force-pushed the nvanbenschoten/kvnemesisGlobal branch from ec29f36 to 1759a48 on April 15, 2021 19:43
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request on Apr 16, 2021
This commit adds an exponential backoff to the transaction retry loop when it detects that a transaction has been aborted. This was observed to prevent thrashing under heavy read-write contention on `global_reads` ranges, which are added to kvnemesis in cockroachdb#63747. These ranges have an added propensity to cause thrashing because every write to them gets bumped to a higher timestamp, so it is currently imperative that a transaction be able to refresh its reads after writing to a `global_reads` range. If other transactions continue to invalidate a read-write transaction's reads, it may never complete and will repeatedly abort conflicting txns after detecting deadlocks. This commit prevents this from stalling kvnemesis indefinitely.

I see two ways that we can improve this situation in the future.

1. We could introduce some form of pessimistic read-locking for long-running read-write transactions, so that they can eventually prevent other transactions from invalidating their reads as they proceed to write to a `global_reads` range and get their write timestamp bumped. This ensures that when the long-running transaction returns to refresh its reads (if it even needs to, depending on the durability of the read locks), the refresh will have a high likelihood of succeeding. This is discussed in cockroachdb#52768.
2. We could allow a transaction to re-write its existing intents in new epochs without being bumped by the closed timestamp. If a transaction only got bumped by the closed timestamp when writing new intents, then after it was forced to retry, it would have a high likelihood of succeeding on its second epoch as long as it didn't write to a new set of keys. This is discussed in cockroachdb#63796.
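As an aside for readers unfamiliar with the pattern, here is a minimal, self-contained sketch of an abort-triggered exponential backoff loop of the kind described above. It is not the actual kvnemesis code; the names (`runWithBackoff`, `errTxnAborted`) and the backoff constants are invented for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errTxnAborted stands in for whatever error the workload sees when its
// transaction is aborted by a conflicting writer.
var errTxnAborted = errors.New("txn aborted")

// runWithBackoff retries fn until it succeeds, returns a non-retryable error,
// or the context is canceled. After each abort it sleeps for an exponentially
// growing, jittered duration so heavily contended workloads stop thrashing.
func runWithBackoff(ctx context.Context, fn func(context.Context) error) error {
	const (
		initialBackoff = 1 * time.Millisecond
		maxBackoff     = 250 * time.Millisecond
	)
	backoff := initialBackoff
	for {
		err := fn(ctx)
		if err == nil || !errors.Is(err, errTxnAborted) {
			return err // success, or an error we do not retry
		}
		// Jitter the sleep to avoid synchronized retries across goroutines.
		sleep := time.Duration(rand.Int63n(int64(backoff)) + 1)
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	attempts := 0
	err := runWithBackoff(context.Background(), func(context.Context) error {
		attempts++
		if attempts < 3 {
			return errTxnAborted // simulate contention-induced aborts
		}
		return nil
	})
	fmt.Println(attempts, err)
}
```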
aayushshah15 approved these changes on Apr 19, 2021
Reviewable status: complete! 1 of 0 LGTMs obtained
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request on Apr 21, 2021
Fixes a serious bug revealed by cockroachdb#63747.

This commit fixes a bug revealed by kvnemesis where a range-merge watcher on the right-hand side of a range merge could incorrectly determine that a range merge txn had succeeded when, in reality, it had failed. The watcher would then put the RHS leaseholder replica into a stalled state by setting `r.mu.destroyStatus` to `destroyReasonMergePending`, effectively stalling any operation on the range indefinitely.

The setup for this bug was that a range was operating with a `global_reads` zone configuration attribute, so it was pushing all writers into the future. The range was split and then rapidly merged back together. During the merge txn, a range-merge watcher (see `maybeWatchForMergeLocked`) began monitoring the state of the range merge txn. The problem was that at the time the range merge txn committed, neither the meta descriptor version written by the merge nor even the meta descriptor version written by the split was visible to the watcher's follow-up query. Because the watcher read below the split txn's descriptor, it came to the wrong conclusion about the merge.

It is interesting to think about what is going wrong here, because it's not immediately obvious who is at fault. If a transaction has a commit timestamp in the future of present time, it will need to commit-wait before acknowledging the client. Typically, this is performed in the TxnCoordSender after the transaction has committed and resolved its intents (see TxnCoordSender.maybeCommitWait). It is safe to wait after a future-time transaction has committed and resolved intents without compromising linearizability because the uncertainty interval of concurrent and later readers ensures atomic visibility of the effects of the transaction. In other words, all of the transaction's intents will become visible and will remain visible at once, which is sometimes called "monotonic reads". This is true even if the resolved intents are at a high enough timestamp such that they are not visible to concurrent readers immediately after they are resolved, but only become visible sometime during the writer's commit-wait sleep. This property is central to the correctness of non-blocking transactions. See: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20200811_non_blocking_txns.md

However, if a transaction has a commit trigger, the side-effects of the trigger will go into effect immediately upon applying the corresponding Raft log entry. This poses a problem, because we do not want part of a transaction's effects (e.g. its commit trigger) to become visible to onlookers before the rest of its effects do (e.g. its intent writes).

To avoid this problem, this commit adds special server-side logic to perform the commit-wait stage of a transaction with a commit trigger early, before its commit trigger fires or its intents are resolved. This results in the transaction waiting longer to commit and resolve its intents, but is otherwise safe and effective.

Interestingly, this is quite similar to how Spanner handles its commit-wait rule:

> Before allowing any coordinator replica to apply the commit record, the coordinator leader waits until TT.after(s), so as to obey the commit-wait rule described in Section 4.1.2. Because the coordinator leader chose s based on TT.now().latest, and now waits until that timestamp is guaranteed to be in the past, the expected wait is at least 2 ∗ ε̄. This wait is typically overlapped with Paxos communication. After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction's outcome through Paxos. All participants apply at the same timestamp and then release locks.

Of course, the whole point of non-blocking transactions is that we release locks early and use clocks (through uncertainty intervals + a reader-side commit-wait rule) to enforce consistency, so we don't want to make this change for standard transactions.

Before this change, I could hit the bug in about 5 minutes of stressing kvnemesis on a roachprod cluster. After this change, I've been able to run kvnemesis for a few hours without issue.

Release note (bug fix): Fixed a rare bug present in betas where rapid range splits and merges on a GLOBAL table could lead to a stuck leaseholder replica. The situation is no longer possible.
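The key point of the fix is ordering: the commit-wait sleep must complete before the commit trigger's side effects or the intent resolution become visible. The toy sketch below illustrates only that ordering; the actual CockroachDB logic lives in EndTxn evaluation and uses hybrid logical clocks with bounded offsets, and every name below (`commitWait`, `runCommitTrigger`, `resolveIntents`) is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// commitWait blocks until the local clock exceeds commitTS. In the real
// system the clock would be a hybrid logical clock and the wait would be
// bounded by the maximum clock offset; here we use wall time only.
func commitWait(commitTS time.Time, now func() time.Time) {
	if d := commitTS.Sub(now()); d > 0 {
		time.Sleep(d)
	}
}

func main() {
	// Pretend the transaction was pushed 50ms into the future by a
	// global_reads range, and that it carries a commit trigger.
	commitTS := time.Now().Add(50 * time.Millisecond)

	commitWait(commitTS, time.Now) // wait *before* any side effects apply
	runCommitTrigger()             // e.g. install the merged range descriptor
	resolveIntents()               // only now do the intents become visible
}

func runCommitTrigger() { fmt.Println("commit trigger fired at", time.Now()) }
func resolveIntents()   { fmt.Println("intents resolved at", time.Now()) }
```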
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request on Apr 23, 2021
craig bot pushed a commit that referenced this pull request on Apr 23, 2021
63799: kvnemesis: add backoff to retry loop on txn aborts r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request on Apr 27, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request on Apr 27, 2021
Closes cockroachdb#59062.

This commit adds a new `ChangeZoneOperation` to kvnemesis, with the option to toggle the `global_reads` zone configuration attribute. This testing ensures that global-read ranges are stressed under a variety of different conditions, including range splits, range merges, rebalancing, lease transfers, rangefeeds, and more.

So far, there has been one bug revealed by this testing, where a range-merge watcher incorrectly considered a merge to have succeeded due to a "stale read" when the merge had actually failed. This will be fixed in a separate PR that I'll merge before this one. In addition, I reverted a few key changes to demonstrate that this testing would have revealed other issues had they not been addressed proactively.

```go
// 1. if in TransferLease, reads above new lease start time are ignored
priorReadSum = rspb.FromTimestamp(newLease.Start.ToTimestamp())

// then after 507 runs over 1m11s
committed txn non-atomic timestamps:
[w]/Table/50/"0d0e019b":1617159805.036363000,2?->v-11
[r]/Table/50/"22ebd28f":[<min>, <max>)-><nil>
[s]/Table/50/"{3059eb81"-b379f625"}:{0:[1617159804.712970000,1, <max>), 1:[1617159804.712970000,1, <max>), 2:[1617159804.712970000,1, <max>), gap:[<min>, 1617159804.748590000,0)}->[/Table/50/"380e4125":v-10, /Table/50/"3812daf0":v-9, /Table/50/"3c11a3d5":v-7]
[r]/Table/50/"68767751":[<min>, <max>)-><nil>
[w]/Table/50/"d05c9e5f":1617159805.036363000,2?->v-12

// 2. if in Subsume, reads above freeze time are ignored
priorReadSum = rspb.FromTimestamp(reply.FreezeStart.ToTimestamp())

// then after many runs it would have hit something similar

// 3. if in (*txnwait.Queue).forcePushAbort we did not forward the PushRequest
//    timestamp to req.PushTo (i.e. if we reverted 8ba492c)
// b.Header.Timestamp.Forward(req.PushTo)

// then after 29 runs over 8s
error applying x.ScanForUpdate(ctx, /Table/50/"9e80b889", /Table/50/"e2985e21", 0) // (nil, request timestamp 1618333326.405544000,0 less than PushTo timestamp 1618333326.631766000,3?): request timestamp 1618333326.405544000,0 less than PushTo timestamp 1618333326.631766000,3?
```
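For illustration, the sketch below shows how the `global_reads` attribute can be flipped from the SQL layer with `ALTER ... CONFIGURE ZONE USING global_reads = ...`. The `ChangeZoneOperation` struct, its method, and the table name here are hypothetical stand-ins; the real kvnemesis operation is generated and applied inside the test harness rather than through SQL.

```go
package kvnemesisexample

import (
	"context"
	"database/sql"
	"fmt"
)

// ChangeZoneType enumerates the zone-config mutations the nemesis can apply.
type ChangeZoneType int

const (
	// ToggleGlobalReads flips the global_reads zone attribute on the test table.
	ToggleGlobalReads ChangeZoneType = iota
)

// ChangeZoneOperation is a hypothetical stand-in for the operation added in
// this PR; the real one lives inside kvnemesis itself.
type ChangeZoneOperation struct {
	Type ChangeZoneType
}

// Apply issues the zone-config change through SQL. CockroachDB exposes the
// attribute as `global_reads` in CONFIGURE ZONE statements; the kv.kv table
// name is only an example target.
func (op ChangeZoneOperation) Apply(ctx context.Context, db *sql.DB, enable bool) error {
	switch op.Type {
	case ToggleGlobalReads:
		stmt := fmt.Sprintf(
			"ALTER TABLE kv.kv CONFIGURE ZONE USING global_reads = %t", enable)
		_, err := db.ExecContext(ctx, stmt)
		return err
	default:
		return fmt.Errorf("unknown ChangeZoneType %d", op.Type)
	}
}
```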
nvanbenschoten force-pushed the nvanbenschoten/kvnemesisGlobal branch from 1759a48 to b025216 on April 27, 2021 18:28
craig bot pushed a commit that referenced this pull request on Apr 27, 2021
62435: cli: make --global flag for demo public r=knz a=otan

See individual commits for details. Refs: #62025

63971: kv: commit-wait before running commit triggers and resolving intents r=nvanbenschoten a=nvanbenschoten

Fixes a serious bug revealed by #63747.

cc. @cockroachdb/kv

64012: roachtest: update expected failures for pgjdbc r=RichardJCai a=rafiss

When it was easy, I also added the issue number that is causing the failure.

fixes #63890

Release note: None

Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
bors r+
Build succeeded
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request on Apr 27, 2021