
backupccl: reduce chances of a backup locking itself out #81593

Closed

Conversation

@adityamaru (Contributor) commented May 20, 2022

Backup manifest resolution and persistence of the `BACKUP-CHECKPOINT`
file for future job resumptions now happen in the backup resumer. As
part of this resolution block we do two things:

  1. Check the bucket for existing `BACKUP-CHECKPOINT`/`BACKUP_MANIFEST` files
     to prevent concurrent backups from writing to the same backup.

  2. Write a `BACKUP-CHECKPOINT` file after we have resolved the destinations,
     etc. for the backup.

After these steps we persist the updated job details, which prevent the resumer
from running 1) and 2) on subsequent resumptions. If the job were paused after
2) but before the details were updated, a subsequent resumption would fail
at 1), essentially locking the backup out of the bucket it was writing to.

This change makes 2) the last step before we persist the job details, reducing
the window in which such a scenario can occur (a rough sketch of the new ordering
follows below).

Fixes: #79270

Release note: None
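
To make the ordering concrete, here is a minimal sketch of the idea. The helper
names and the in-memory destination map are illustrative stand-ins, not the
actual resumer code:

```go
// Minimal sketch of the reordering; helper names and the in-memory
// "dest" map are illustrative stand-ins, not backupccl code.
package main

import (
	"errors"
	"fmt"
)

var errDestinationInUse = errors.New("BACKUP-CHECKPOINT or BACKUP_MANIFEST already present in destination")

// checkForPriorBackup stands in for step 1): fail if another backup has
// already claimed this destination.
func checkForPriorBackup(dest map[string]bool) error {
	if dest["BACKUP-CHECKPOINT"] || dest["BACKUP_MANIFEST"] {
		return errDestinationInUse
	}
	return nil
}

// resolveAndClaim shows the new ordering: the checkpoint write (step 2) is
// the very last action before the job details are persisted, shrinking the
// window in which a pause can strand the job behind its own checkpoint.
func resolveAndClaim(dest map[string]bool, persistDetails func() error) error {
	if err := checkForPriorBackup(dest); err != nil { // step 1)
		return err
	}
	// ... resolve destinations, build the manifest, etc. ...
	dest["BACKUP-CHECKPOINT"] = true // step 2), now last
	return persistDetails()          // job details persisted immediately after
}

func main() {
	dest := map[string]bool{}
	if err := resolveAndClaim(dest, func() error { return nil }); err != nil {
		fmt.Println("backup failed:", err)
		return
	}
	fmt.Println("destination claimed and job details persisted")
}
```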

@adityamaru requested review from dt, stevendanna and a team May 20, 2022 20:34
@cockroach-teamcity (Member) commented: This change is Reviewable

@adityamaru (Contributor, Author) commented May 20, 2022

I don't love this, but thought I'd put it out to see what y'all think. The error is easily reproducible if you pause the job at `backup.resolved_job_details_update` and resume it.

Another option is to change the file we write so it is suffixed with the job ID; then, when checking for other backups, we would list files matching the `BACKUP-CHECKPOINT` glob and only flag files with a different job ID. This would also involve changes to the other places we read and write `BACKUP-CHECKPOINT`, so I was hesitant to do it as a first solution.
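
A minimal sketch of that job-ID-suffix idea (the file naming scheme and the helper below are hypothetical, not the existing `BACKUP-CHECKPOINT` read/write paths):

```go
// Sketch of the job-ID-suffix idea; the file naming and the helper are
// hypothetical, not the existing BACKUP-CHECKPOINT reader/writer code.
package main

import (
	"fmt"
	"strings"
)

const checkpointPrefix = "BACKUP-CHECKPOINT"

// conflictingCheckpoint reports whether any listed file looks like a
// BACKUP-CHECKPOINT written by a different job.
func conflictingCheckpoint(files []string, ownJobID string) (string, bool) {
	for _, f := range files {
		if !strings.HasPrefix(f, checkpointPrefix) {
			continue
		}
		suffix := strings.TrimPrefix(strings.TrimPrefix(f, checkpointPrefix), "-")
		if suffix != ownJobID {
			// A bare BACKUP-CHECKPOINT, or one carrying another job's ID,
			// counts as a conflict.
			return f, true
		}
	}
	return "", false
}

func main() {
	listing := []string{"BACKUP-CHECKPOINT-123456", "data/001.sst"}
	if f, ok := conflictingCheckpoint(listing, "123456"); ok {
		fmt.Println("destination already claimed by another backup:", f)
	} else {
		fmt.Println("only our own checkpoint present; safe to proceed")
	}
}
```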

@stevendanna (Collaborator) left a comment

Seems reasonable to reduce the size of the critical region here, but I wonder how much this helps in practice, since I imagine the job update itself is a likely place that we find out that we no longer have a job lease.

Putting the job ID in the naming scheme (or in the checkpoint) might be nice too since then we could potentially point the user to the job that is using the same location. I suppose that doesn't help with 2 nodes running the same job concurrently (one of which has lost the lease and just doesn't know it yet).

@adityamaru (Contributor, Author) commented
Closing in favour of #81994.

@adityamaru closed this May 29, 2022
Successfully merging this pull request may close these issues.

roachtest: acceptance/version-upgrade failed