cdc: Fail job that does not make forward progress for days #102341
Labels
A-cdc
Change Data Capture
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
T-cdc
Recent changes ensure that paused changefeed jobs eventually expire their
protected timestamp (PTS) records and fail.
We should extend this behavior to running changefeed jobs.
If the PTS record has not advanced for a long time and has fallen too far behind,
fail the changefeed.
Stuck changefeeds should already be a condition that operators monitor for.
However, if nobody is monitoring and we allow a substantial amount of garbage
to accumulate behind the PTS record, releasing that much data for GC at once
could destabilize the cluster.
It is better to set a conservative threshold (e.g. 2-3 days), then remove the
PTS record and fail the changefeed if no progress is made within that window.
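
A minimal sketch of the check proposed above, written in Go as an illustration only. The names `stallPolicy`, `MaxStall`, `decide`, and the two `action` values are hypothetical and do not correspond to existing CockroachDB code. The 3-day default is just the upper end of the "2-3 days" suggestion, and the dates in `main` are arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// action is what a hypothetical monitor decides to do with a running changefeed.
type action int

const (
	actionKeepRunning action = iota
	actionFailAndReleasePTS
)

// stallPolicy holds the (assumed) threshold from the proposal.
type stallPolicy struct {
	// MaxStall is how long the PTS record may go without advancing.
	MaxStall time.Duration
}

// defaultStallPolicy uses 3 days, an assumption taken from the "2-3 days" range.
func defaultStallPolicy() stallPolicy {
	return stallPolicy{MaxStall: 3 * 24 * time.Hour}
}

// decide returns actionFailAndReleasePTS once the time since the PTS record
// last advanced strictly exceeds MaxStall; otherwise the job keeps running.
func (p stallPolicy) decide(lag time.Duration) action {
	if lag > p.MaxStall {
		return actionFailAndReleasePTS
	}
	return actionKeepRunning
}

func main() {
	p := defaultStallPolicy()

	// Arbitrary example times: the PTS record last advanced 4 days before "now".
	now := time.Date(2023, 5, 10, 0, 0, 0, 0, time.UTC)
	lastAdvance := now.Add(-4 * 24 * time.Hour)

	if p.decide(now.Sub(lastAdvance)) == actionFailAndReleasePTS {
		fmt.Println("no progress within threshold: release PTS record and fail changefeed")
	}
}
```

Keeping the decision a pure function of the lag makes the threshold easy to test and to expose as a setting. How the lag is measured for a running job, and how the PTS record is actually released, is left open here.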
Jira issue: CRDB-27402
Epic CRDB-28844