-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changefeedccl: Rework error handling #90810
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this, I agree it's a good idea. I do think if we're defaulting to retryable, we need to make that conditional on the feed having already checkpointed at least once. Otherwise, too many problems that require manual intervention will manifest as a changefeed that is always running but never makes progress. A changefeed that's never made progress isn't worth saving--the rare case that you get a transient error right away isn't worth it.
Reviewed 4 of 4 files at r1, 5 of 23 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/ccl/changefeedccl/avro.go
line 412 at r2 (raw file):
date := *d.(*tree.DDate) if !date.IsFinite() { return nil, changefeedbase.WithTerminalError(errors.Errorf(
Might be better to have all errors raised out of an encoder be marked terminal unless marked retryable, the only retryable error I can think of that can come out of an encoder is a schema registry connectivity failure.
pkg/ccl/changefeedccl/changefeed_processors.go
line 1042 at r2 (raw file):
if cf.frontier.boundaryType == jobspb.ResolvedSpan_EXIT { err = changefeedbase.WithTerminalError(errors.Wrapf(err, "shut down due to schema change and %s=%q",
Is this always a schema change or can it be end time/initial scan only?
pkg/ccl/changefeedccl/retry.go
line 51 at r2 (raw file):
const resetRetryAfter = 10 * time.Minute // Retry is a think wrapper around retry.Retry which
thin?
pkg/sql/catalog/lease/lease.go
line 1019 at r2 (raw file):
} // IsDraining returns true if this nodes lease manager is draining.
node's
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I'm not sure I agree with this. I'm all for detecting errors, doing sanity checking during
changefeed creation. The reality is that changefeed may run for a while w/out checkpoint -- i.e. during initial scan.
Having some arbitrary logic that effectively says that "we will retry, but only if changefeed ran for 10 or more minutes" seems
to be rather arbitrary and not that great either. Lots of things can happen in the cluster during that time.
None of those heuristics would replace the need to proper monitoring and observability.
So.. I'd rather improve those areas if needed, vs above.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB)
pkg/ccl/changefeedccl/avro.go
line 412 at r2 (raw file):
Previously, HonoreDB (Aaron Zinger) wrote…
Might be better to have all errors raised out of an encoder be marked terminal unless marked retryable, the only retryable error I can think of that can come out of an encoder is a schema registry connectivity failure.
These are the only two cases that exist today where this happpens. If we get into situation where we are returning majority of those errors -- then perhaps we can reconsider.
Do you feel strongly?
pkg/ccl/changefeedccl/changefeed_processors.go
line 1042 at r2 (raw file):
Previously, HonoreDB (Aaron Zinger) wrote…
Is this always a schema change or can it be end time/initial scan only?
I believe so; we check for endTime.Empty above or if we haven't reached end time
pkg/ccl/changefeedccl/retry.go
line 51 at r2 (raw file):
Previously, HonoreDB (Aaron Zinger) wrote…
thin?
Done.
pkg/sql/catalog/lease/lease.go
line 1019 at r2 (raw file):
Previously, HonoreDB (Aaron Zinger) wrote…
node's
Done.
f743aea
to
1a93051
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we really go for 10 minutes without an initial scan checkpoint? It feels like there's some kind of missing communication, anyway, if it's ever possible for a healthy changefeed to make progress for 10 minutes without reporting it. Maybe we should, say, have special-case logic that if you're a changefeed aggregator on the same node as the changefeed frontier/coordinator, you immediately flush and checkpoint (or initial-scan-checkpoint) after your first successful emit.
Reviewed 1 of 16 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/ccl/changefeedccl/avro.go
line 412 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
These are the only two cases that exist today where this happpens. If we get into situation where we are returning majority of those errors -- then perhaps we can reconsider.
Do you feel strongly?
I guess if it's just these two and we have the possibility of retryable schema registry errors, this seems fine as is.
Yeah -- absolutely; initial scan checkpoints are expensive, and, by default are set to happen every 10 minutes. |
5156fa4
to
1853e41
Compare
bors r+ |
Build failed: |
bors r+ |
Build failed (retrying...): |
Build failed: |
bors r+ |
Build failed (retrying...): |
🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set. |
Build succeeded: |
@miretskiy do we plan on backporting this change to 22.1? |
22.1? Probably not. 22.2.x -- maybe? |
Let me know if we change our minds on backporting this to 22.2.1 -> going to include it in the docs as a breaking change |
blathers backport 22.2 |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 86fffa9 to blathers/backport-release-22.2-90810: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Prior PR cockroachdb#90810 changed error handling approach used by CDC to be "retry by default". The above PR failed to account for some of the situations that may arrise in mixed version state. In particular, when upgrading, if the aggregator node gets restarted with the new version, while the coordinator still runs old binary, *and* the new aggregator encounters a retryable error, that error would not be explicitly marked as retryable, causing the old coordinator binary to treat the error as a terminal error. Other combination (new coordinator, old aggregator) is not succeptible to this situation. This PR partially reverts changes in the cockroachdb#90810 so previously retryable errors continue to be explicitly marked as retryable. There is no need to introduce version gates since the error handling should be backward compatible, and with this PR, operate correctly in the mixed version state. Epic: None Release note: none
93072: backupccl: speed up listing in some backup callsites r=msbutler a=adityamaru There were a couple of calls to `List` that would list all the files in the backup destination bucket. In the presence of a large number of files, usually SST files in the `data/` folder, this listing could be very slow. This could manifest as a RESTORE command hanging during the planning phase when we go to resolve encryption information from the base backup bucket. To fix this, we add a `delimiter` to the `List` call that will make skipping over all files in the `data/` directory much cheaper. Informs: https://github.com/cockroachlabs/support/issues/1937 Release note (bug fix): speedup slow listing calls that could manifest as restore queries hanging during execution, in the presence of several backup files 93155: changefeedccl: Fix error handling in mixed version state r=miretskiy a=miretskiy Prior PR #90810 changed error handling approach used by CDC to be "retry by default". The above PR failed to account for some of the situations that may arrise in mixed version state. In particular, when upgrading, if the aggregator node gets restarted with the new version, while the coordinator still runs old binary, *and* the new aggregator encounters a retryable error, that error would not be explicitly marked as retryable, causing the old coordinator binary to treat the error as a terminal error. Other combination (new coordinator, old aggregator) is not succeptible to this situation. This PR partially reverts changes in the #90810 so previously retryable errors continue to be explicitly marked as retryable. There is no need to introduce version gates since the error handling should be backward compatible, and with this PR, operate correctly in the mixed version state. Epic: None Release note: none Co-authored-by: adityamaru <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Prior PR cockroachdb#90810 changed error handling approach used by CDC to be "retry by default". The above PR failed to account for some of the situations that may arrise in mixed version state. In particular, when upgrading, if the aggregator node gets restarted with the new version, while the coordinator still runs old binary, *and* the new aggregator encounters a retryable error, that error would not be explicitly marked as retryable, causing the old coordinator binary to treat the error as a terminal error. Other combination (new coordinator, old aggregator) is not succeptible to this situation. This PR partially reverts changes in the cockroachdb#90810 so previously retryable errors continue to be explicitly marked as retryable. There is no need to introduce version gates since the error handling should be backward compatible, and with this PR, operate correctly in the mixed version state. Epic: None Release note: none
Changefeeds utilize protected timestamp (PTS) in order to ensure that the data is not garbage collected (GC) prematurely. This subsystem underwent through many rounds of changes, resulting in an unintuitive, and potentially dangerous behavior. This PR updates and improves PTS handling as follows. PR cockroachdb#97148 introduce capability to cancel jobs that hold on to stale PTS records. This PR expands this functionality to apply to all jobs -- not just paused jobs. This is necessary because due to cockroachdb#90810, changefeeds will retry almost every error -- and that means that if a running changefeed jobs fails to make progress for very long time, it is possible that a PTS record will protect GC collection for many days, weeks, or even months. To guard against this case, introduce a new cluster setting `changefeed.protect_timestamp.max_age`, which defaults to generous 4 days, to make sure that even if the explicit changefeed option `gc_protect_expires_after` was not specified, the changefeed will fail after `changefeed.protect_timestamp.max_age` if no progress is made. The fail safe can be disabled by setting `changefeed.protect_timestamp.max_age` to 0; Note, however, that doing so could result in stability issues once stale PTS record released. In addition, this PR deprecates `protect_data_from_gc_on_pause` option. This option is not needed since we now employ "active protected timestamp" management (meaning: there is always a PTS record when running changefeed jobs), and the handling of this record is consistent for both running and paused jobs. Fixes cockroachdb#103464 Release note (enterprise change): Introduce a new `changefeed.protect_timestamp.max_age` setting (default 4 days), which will cancel running changefeed jobs if they fail to make forward progress for much time. This setting is used if the explicit `gc_protect_expires_after` option was not set. In addition, deprecate `protect_data_from_gc_on_pause` option. This option is no longer needed since changefeed jobs always protect data.
Changefeeds utilize protected timestamp (PTS) in order to ensure that the data is not garbage collected (GC) prematurely. This subsystem underwent through many rounds of changes, resulting in an unintuitive, and potentially dangerous behavior. This PR updates and improves PTS handling as follows. PR cockroachdb#97148 introduce capability to cancel jobs that hold on to stale PTS records. This PR expands this functionality to apply to all jobs -- not just paused jobs. This is necessary because due to cockroachdb#90810, changefeeds will retry almost every error -- and that means that if a running changefeed jobs fails to make progress for very long time, it is possible that a PTS record will protect GC collection for many days, weeks, or even months. To guard against this case, introduce a new cluster setting `changefeed.protect_timestamp.max_age`, which defaults to generous 4 days, to make sure that even if the explicit changefeed option `gc_protect_expires_after` was not specified, the changefeed will fail after `changefeed.protect_timestamp.max_age` if no progress is made. This setting only applies to newly created changefeeds. Use `ALTER CHANGEFEED` statement to set `gc_protect_expires_after` option for existing changefeeds to enable PTS expiration. The fail safe can be disabled by setting `changefeed.protect_timestamp.max_age` to 0; Note, however, that doing so could result in stability issues once stale PTS record released. In addition, this PR deprecates `protect_data_from_gc_on_pause` option. This option is not needed since we now employ "active protected timestamp" management (meaning: there is always a PTS record when running changefeed jobs), and the handling of this record is consistent for both running and paused jobs. Fixes cockroachdb#103464 Release note (enterprise change): Introduce a new `changefeed.protect_timestamp.max_age` setting (default 4 days), which will cancel running changefeed jobs if they fail to make forward progress for much time. This setting is used if the explicit `gc_protect_expires_after` option was not set. In addition, deprecate `protect_data_from_gc_on_pause` option. This option is no longer needed since changefeed jobs always protect data.
103528: server: Revert server.shutdown.jobs_wait to 0 r=miretskiy a=miretskiy Revert default setting for `server.shutdown.jobs_wait` to 0 to ensure that shutdown dows not wait for active jobs. Issues: none Epic: None Release note: None 103539: changefeedccl: Improve protected timestamp handling r=miretskiy a=miretskiy Changefeeds utilize protected timestamp (PTS) in order to ensure that the data is not garbage collected (GC) prematurely. This subsystem underwent through many rounds of changes, resulting in an unintuitive, and potentially dangerous behavior. This PR updates and improves PTS handling as follows. PR #97148 introduce capability to cancel jobs that hold on to stale PTS records. This PR expands this functionality to apply to all jobs -- not just paused jobs. This is necessary because due to #90810, changefeeds will retry almost every error -- and that means that if a running changefeed jobs fails to make progress for very long time, it is possible that a PTS record will protect GC collection for many days, weeks, or even months. To guard against this case, introduce a new cluster setting `changefeed.protect_timestamp.max_age`, which defaults to generous 4 days, to make sure that even if the explicit changefeed option `gc_protect_expires_after` was not specified, the changefeed will fail after `changefeed.protect_timestamp.max_age` if no progress is made. The fail safe can be disabled by setting `changefeed.protect_timestamp.max_age` to 0; Note, however, that doing so could result in stability issues once stale PTS record released. In addition, this PR deprecates `protect_data_from_gc_on_pause` option. This option is not needed since we now employ "active protected timestamp" management (meaning: there is always a PTS record when running changefeed jobs), and the handling of this record is consistent for both running and paused jobs. Fixes #103464 Release note (enterprise change): Introduce a new `changefeed.protect_timestamp.max_age` setting (default 4 days), which will cancel running changefeed jobs if they fail to make forward progress for much time. This setting is used if the explicit `gc_protect_expires_after` option was not set. In addition, deprecate `protect_data_from_gc_on_pause` option. This option is no longer needed since changefeed jobs always protect data. Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Prior to this PR, changefeeds would rely on a white list
approach in order to determine which errors were retryable.
All other errors would be deemed terminal, causing changefeed
to fail.
The above approach is brittle, and causes unwanted
changefeed termination.
This PR changes this approach to treat all errors as retryable,
unless otherwise indicated. Errors that are known by changefeed
to be fatal are handled explicitly, by marking such errors
as terminal. For example, changefeeds would exit
if the targeted table is dropped. On the other hand, inability
to read this table for any reason would not be treated as
terminal.
Fixes #90320
Fixes #77549
Fixes #63317
Fixes #71341
Fixes #73016
Informs CRDB-6788
Informs CRDB-7581
Release Note (enterprise change): Changefeed will now treat
all errors, unless otherwise indicated, as retryable errors.