✅ During rolling maintenance, changefeed jobs restart as nodes restart.
Operators can mute the alerts described below during routine maintenance procedures to avoid unnecessary distraction.
Changefeeds can suffer permanent failures, which the jobs system will not attempt to restart. Any increase in this counter should prompt investigation.
changefeed.failures
Tier | Definition |
---|---|
CRITICAL | The number of failures is greater than 0 |
- If the alert fires during cluster maintenance, mute it. Otherwise, start investigating with the following query (a follow-up query for permanently failed jobs is sketched after this list):
```sql
SELECT job_id,
       status,
       ((high_water_timestamp / 1000000000)::INT::TIMESTAMP) - now() AS "changefeed latency",
       created,
       left(description, 60),
       high_water_timestamp
FROM crdb_internal.jobs
WHERE job_type = 'CHANGEFEED'
  AND status IN ('running', 'paused', 'pause-requested')
ORDER BY created DESC;
```
- If the cluster is not undergoing maintenance, check the health of the sink endpoints. In the case of Kafka, check for sink connection errors such as:
```
ERROR: connecting to kafka: path.to.cluster:port: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
```
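It can also help to pull the stored error message for any permanently failed changefeeds. A minimal sketch, assuming crdb_internal.jobs exposes the status and error columns for changefeed jobs (as it does in recent CockroachDB versions):

```sql
-- List permanently failed changefeeds together with the stored error message.
-- Assumes the `error` column is populated for failed changefeed jobs.
SELECT job_id,
       status,
       error,
       left(description, 60) AS description,
       created
FROM crdb_internal.jobs
WHERE job_type = 'CHANGEFEED'
  AND status = 'failed'
ORDER BY created DESC;
```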
Changefeeds automatically restart after transient errors. However, "too many" restarts (outside of a routine maintenance procedure) may indicate a systemic condition and should be investigated.
changefeed.error_retries
Tier | Definition |
---|---|
WARNING | The number of restarts is greater than 50 for more than 15 minutes |
Same as when responding to Changefeed Failures.
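To correlate the alert with what the database itself reports, the retry and failure counters can be spot-checked from SQL. A minimal sketch, assuming the crdb_internal.node_metrics virtual table is available; note that it only reflects the node the SQL session is connected to, so use the metrics dashboard for a cluster-wide view.

```sql
-- Spot-check changefeed error counters on the gateway node.
-- crdb_internal.node_metrics reports values only for the node this session is connected to.
SELECT name, value
FROM crdb_internal.node_metrics
WHERE name IN ('changefeed.error_retries', 'changefeed.failures');
```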
A changefeed has fallen behind, as determined by the end-to-end lag between a change being committed and that change being applied at the destination. This can be caused by insufficient cluster capacity or by changefeed sink availability issues.
changefeed.commit_latency
Tier | Definition |
---|---|
WARNING | The maximum end-to-end lag for any changefeed is greater than 10 minutes |
CRITICAL | The maximum end-to-end lag for any changefeed is greater than 15 minutes |
- Open the changefeed metrics dashboard for the cluster (e.g. https://url/#/metrics/changefeeds/cluster) and check the maximum latency. Alternatively, individual changefeed latency can be checked with the SQL CLI (a lighter-weight alternative is sketched after this list):
```sql
SELECT job_id,
       status,
       ((high_water_timestamp / 1000000000)::INT::TIMESTAMP) - now() AS "changefeed latency",
       created,
       left(description, 60),
       high_water_timestamp
FROM crdb_internal.jobs
WHERE job_type = 'CHANGEFEED'
  AND status IN ('running', 'paused', 'pause-requested')
ORDER BY created DESC;
```
- Copy the job ID of the changefeed job with the highest latency and pause it:
```sql
PAUSE JOB 681491311976841286;
```
- Check the status of the pause request by running the same query from step 1. If the job status is pause-requested, check again in a few minutes.
- After the job is paused, resume it:
```sql
RESUME JOB 681491311976841286;
```
- The changefeed latency may not improve after the above steps due to a lack of cluster resources, changefeed sink availability issues, etc. In that case, escalate to L2 Support.
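As a lighter-weight alternative to querying crdb_internal.jobs directly, recent CockroachDB versions also provide a dedicated statement for listing changefeeds; the sketch below assumes the deployed version supports it.

```sql
-- Summarize all changefeeds, including sink URI and high-water timestamp, in one statement.
-- Requires a CockroachDB version that supports SHOW CHANGEFEED JOBS.
SHOW CHANGEFEED JOBS;
```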
A hedge against operational error. Changefeed jobs should not be paused for a long time because the protected timestamp prevents garbage collection. This is a safety catch to guard against an inadvertently "forgotten" pause.
jobs.changefeed.currently_paused
Tier | Definition |
---|---|
WARNING | The number of paused changefeeds is greater than 0 for more than 15 minutes |
CRITICAL | The number of paused changefeeds is greater than 0 for more than 60 minutes |
- Open a SQL CLI and check the status of each changefeed:
```sql
SELECT job_id,
       status,
       ((high_water_timestamp / 1000000000)::INT::TIMESTAMP) - now() AS "changefeed latency",
       created,
       left(description, 60),
       high_water_timestamp
FROM crdb_internal.jobs
WHERE job_type = 'CHANGEFEED'
  AND status IN ('running', 'paused', 'pause-requested')
ORDER BY created DESC;
```
- If all the changefeeds are in the running state, one or more feeds may have run into an error and recovered. Check the UI (e.g. https://<cluster_url>/#/metrics/changefeeds/cluster) for the number of changefeed restarts.
- Resume any paused changefeed(s) by job ID (e.g. RESUME JOB 681491311976841286;).
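Because a long-paused changefeed holds its protected timestamp and blocks garbage collection, a narrower query that lists only paused jobs (oldest first) can make a forgotten pause stand out. A minimal sketch using the same crdb_internal.jobs columns as the query above:

```sql
-- List paused or pause-requested changefeeds, oldest first; a long-forgotten pause
-- (which holds back garbage collection via its protected timestamp) surfaces at the top.
SELECT job_id,
       status,
       created,
       left(description, 60) AS description
FROM crdb_internal.jobs
WHERE job_type = 'CHANGEFEED'
  AND status IN ('paused', 'pause-requested')
ORDER BY created ASC;
```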