backupccl: modify checkpointing heuristic to avoid frequent checkpointing #83184

Open
adityamaru opened this issue Jun 22, 2022 · 1 comment
Labels
A-disaster-recovery, C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), T-disaster-recovery

Comments

adityamaru commented Jun 22, 2022

Today, the checkpoint loop in the backup resumer naively writes a checkpoint file every minute. In a recent escalation, we saw that these checkpoint files can grow as large as 500 MB post-compression. We also observed a pathological case: if writing checkpoint files slows down enough, we can no longer process progress updates fast enough, which clogs up the entire distsql flow and brings the backup to a crawl. This bug was fixed in #83151, such that we now drain the progCh for at least a minute before we attempt to write another checkpoint file. Still, it feels naive to use time as the heuristic for writing a checkpoint. We could instead use one or more of the following as a heuristic to trigger a write of the checkpoint file:

  • CompletedSpans
  • number of backed-up files
  • data size of backed-up files

We could probably also continue to use time, but only as a configurable ceiling after which we definitely must checkpoint even if none of the other heuristics have forced a write.

Jira issue: CRDB-16909

adityamaru added the C-bug and A-disaster-recovery labels on Jun 22, 2022

blathers-crl bot commented Jun 22, 2022

cc @cockroachdb/bulk-io
