backupccl: modify checkpointing heuristic to avoid frequent checkpointing #83184
Labels
A-disaster-recovery
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-disaster-recovery
Today, the checkpoint loop in the backup resumer naively writes a checkpoint file every minute. In a recent escalation, we saw that these checkpoint files can grow as large as 500 MB post-compression. We also observed a pathological case: if writing checkpoint files slows down enough, we can no longer process progress updates fast enough, which clogs up the entire DistSQL flow and brings the backup to a crawl. This bug was fixed in #83151, such that we now drain the progCh for at least a minute before we attempt to write another checkpoint file. Still, it feels naive to use time as the only heuristic for writing a checkpoint. We could instead use one or more of the following as a heuristic to trigger a write of the checkpoint file:
CompletedSpans
We could also probably continue to use time, but as a configurable ceiling after which we definitely must checkpoint even if none of the other heuristics have forced a write.
Jira issue: CRDB-16909