-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rangefeed: cancel async tasks on processor stop #117859
Merged
craig
merged 2 commits into
cockroachdb:master
from
erikgrinaker:rangefeed-processor-cancellation
Jan 20, 2024
Merged
rangefeed: cancel async tasks on processor stop #117859
craig
merged 2 commits into
cockroachdb:master
from
erikgrinaker:rangefeed-processor-cancellation
Jan 20, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
erikgrinaker
changed the title
rangefeed: cancel async task on processor stop
rangefeed: cancel async tasks on processor stop
Jan 17, 2024
erikgrinaker
force-pushed
the
rangefeed-processor-cancellation
branch
2 times, most recently
from
January 17, 2024 11:57
491a48d
to
9dbe072
Compare
erikgrinaker
force-pushed
the
rangefeed-processor-cancellation
branch
2 times, most recently
from
January 17, 2024 13:39
c8589bd
to
e8876bd
Compare
nvanbenschoten
approved these changes
Jan 19, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)
Epic: none Release note: None
Previously, async tasks spawned by the rangefeed processor (typically txn pushes and resolved timestamp scans) were not cancelled when the processor was stopped or the stopper quiesced. If these operations stalled, this could lead to goroutine leaks and node shutdown stalls. However, this was mitigated to some extent by the intent resolver itself detecting stopper quiescence. This patch uses a separate task context for async tasks, which is cancelled either when the processor is stopped or the stopper quiesces. In general, the rangefeed scheduler shutdown logic could use some improvement, but this patch does not attempt a broader cleanup in the interest of backportability. Epic: none Release note: None
erikgrinaker
force-pushed
the
rangefeed-processor-cancellation
branch
from
January 20, 2024 18:25
e8876bd
to
2520ea7
Compare
bors r+ |
Build succeeded: |
craig bot
pushed a commit
that referenced
this pull request
Jan 23, 2024
117612: rangefeed: fix premature checkpoint due to intent resolution race r=erikgrinaker a=erikgrinaker This PR depends on the following changes, which should be backported together: * #117787 * #117859 * #117967 * #117968 * #117969 --- It was possible for rangefeeds to emit a premature checkpoint, before all writes below its timestamp had been emitted. This in turn would cause changefeeds to not emit these write events at all. This could happen because `PushTxn` may return a false `ABORTED` status for a transaction that has in fact been committed, if the transaction record has been GCed (after resolving all intents). The timestamp cache does not retain sufficient information to disambiguate a committed transaction from an aborted one in this case, so it pessimistically assumes an abort (see `Replica.CanCreateTxnRecord` and `batcheval.SynthesizeTxnFromMeta`). However, the rangefeed txn pusher trusted this `ABORTED` status, ignoring the pending txn intents and allowing the resolved timestamp to advance past them before emitting the committed intents. This can lead to the following scenario: - A rangefeed is running on a lagging follower. - A txn writes an intent, which is replicated to the follower. - The closed timestamp advances past the intent. - The txn commits and resolves the intent at the original write timestamp, then GCs its txn record. This is not yet applied on the follower. - The rangefeed pushes the txn to advance its resolved timestamp. - The txn is GCed, so the push returns ABORTED (it can't know whether the txn was committed or aborted after its record is GCed). - The rangefeed disregards the "aborted" txn and advances the resolved timestamp, emitting a checkpoint. - The follower applies the resolved intent and emits an event below the checkpoint, violating the checkpoint guarantee. - The changefeed sees an event below its frontier and drops it, never emitting these events at all. This patch fixes the bug by submitting a barrier command to the leaseholder which waits for all past and ongoing writes (including intent resolution) to complete and apply, and then waits for the local replica to apply the barrier as well. This ensures any committed intent resolution will be applied and emitted before the transaction is removed from resolved timestamp tracking. Resolves #104309. Epic: none Release note (bug fix): fixed a bug where a changefeed could omit events in rare cases, logging the error "cdc ux violation: detected timestamp ... that is less or equal to the local frontier". This can happen if a rangefeed runs on a follower replica that lags significantly behind the leaseholder, a transaction commits and removes its transaction record before its intent resolution is applied on the follower, the follower's closed timestamp has advanced past the transaction commit timestamp, and the rangefeed attempts to push the transaction to a new timestamp (at least 10 seconds after the transaction began). This may cause the rangefeed to prematurely emit a checkpoint before emitting writes at lower timestamps, which in turn may cause the changefeed to drop these events entirely, never emitting them. Co-authored-by: Erik Grinaker <[email protected]>
This was referenced Jan 29, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Previously, async tasks spawned by the rangefeed processor (typically txn pushes and resolved timestamp scans) were not cancelled when the processor was stopped or the stopper quiesced. If these operations stalled, this could lead to goroutine leaks and node shutdown stalls. However, this was mitigated to some extent by the intent resolver itself detecting stopper quiescence.
This patch uses a separate task context for async tasks, which is cancelled either when the processor is stopped or the stopper quiesces.
In general, the rangefeed scheduler shutdown logic could use some improvement, but this patch does not attempt a broader cleanup in the interest of backportability.
Epic: none
Release note: None