backupccl: modify checkpointing heuristic to avoid frequent checkpointing #83184

adityamaru · 2022-06-22T13:54:14Z

Today, the checkpoint loop in the backup resumer naively writes a checkpoint file every minute. In a recent escalation, we saw that these checkpoint files can grow as large as 500 MB post-compression. We also observed a pathological case: if writing checkpoint files slows down enough, we could fail to process progress updates fast enough, thereby clogging up the entire distsql flow and bringing the backup to a crawl. This bug was fixed in #83151, such that we now drain the progCh for at least a minute before we attempt to write another checkpoint file. Still, it feels naive to use time as the heuristic for writing a checkpoint. We could instead use one or more of the following as the trigger for writing the checkpoint file:

CompletedSpans
number of backed-up files
data size of backed-up files

We could also continue to use time, but as a configurable ceiling after which we must checkpoint even if none of the other heuristics have forced a write.

Jira issue: CRDB-16909


blathers-crl · 2022-06-22T13:54:18Z

cc @cockroachdb/bulk-io

adityamaru added the C-bug and A-disaster-recovery labels Jun 22, 2022

blathers-crl bot added the T-disaster-recovery label Jun 22, 2022

adityamaru mentioned this issue Jun 27, 2022

backupccl: slow checkpointing could bring BACKUP to a crawl #83456

Closed

exalate-issue-sync bot assigned rhu713 and unassigned rhu713 Nov 30, 2022
