cdc: Fail job that does not make forward progress for days #102341

Closed
miretskiy opened this issue Apr 26, 2023 · 4 comments
Labels
A-cdc: Change Data Capture
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
T-cdc

Comments

@miretskiy
Contributor

miretskiy commented Apr 26, 2023

Recent changes ensure that paused changefeed jobs eventually
expire their PTS (protected timestamp) records and fail.

It appears we should extend this functionality to running changefeed jobs.
If the PTS record is not advanced for a long time and falls too far behind,
fail the changefeed.

On the one hand, stuck changefeeds are a condition that should be monitored
for. However, if nobody is monitoring and we allow a substantial amount of garbage
to accumulate, releasing that much data at once could destabilize the cluster.

It is better to set a conservative threshold (e.g. 2-3 days) and, if no
progress is made within it, remove the PTS record and fail the changefeed.
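
The proposed rule can be illustrated with a minimal sketch. This is not CockroachDB code: the `Job` type, the `check_progress` function, and the 3-day `MAX_PTS_AGE` value are all hypothetical names chosen for illustration, and the only thing taken from the proposal is the idea of comparing the age of the PTS record against a conservative threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold; the proposal only suggests "2-3 days".
MAX_PTS_AGE = timedelta(days=3)


@dataclass
class Job:
    """Hypothetical stand-in for a changefeed job."""
    job_id: int
    # Timestamp protected by the job's PTS record, or None if no record exists.
    pts_timestamp: Optional[datetime]
    status: str = "running"


def check_progress(job: Job, now: datetime, max_age: timedelta = MAX_PTS_AGE) -> Job:
    """Fail a running job whose PTS record has fallen more than max_age behind."""
    if job.status != "running" or job.pts_timestamp is None:
        return job
    if now - job.pts_timestamp > max_age:
        job.pts_timestamp = None  # drop the record so old data can be garbage collected
        job.status = "failed"
    return job


now = datetime(2023, 4, 26, tzinfo=timezone.utc)
stuck = check_progress(Job(1, now - timedelta(days=4)), now)
healthy = check_progress(Job(2, now - timedelta(hours=1)), now)
print(stuck.status, healthy.status)
```

In this sketch the check would run periodically; a job exactly at the threshold is left running, and only a record strictly older than the threshold causes a failure.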

Jira issue: CRDB-27402

Epic CRDB-28844

@miretskiy miretskiy added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs A-cdc Change Data Capture T-cdc labels Apr 26, 2023
@blathers-crl

blathers-crl bot commented Apr 26, 2023

cc @cockroachdb/cdc

@miretskiy
Contributor Author

@amruss amruss changed the title cdc: Remove stale PTS records cdc: Fail changefeed that does not make forward progress for days May 10, 2023
@amruss amruss changed the title cdc: Fail changefeed that does not make forward progress for days cdc: Fail job that does not make forward progress for days May 10, 2023
@amruss
Contributor

amruss commented May 10, 2023

Gonna put this on the jobs board to consider as a general rule for all jobs

@amruss
Contributor

amruss commented May 24, 2023

Addressed: #103539

@amruss amruss closed this as completed May 24, 2023