
backupccl: reduce chances of a backup locking itself out #81593

Closed

Conversation

@adityamaru (Contributor) commented May 20, 2022

Backup manifest resolution and persistence of the `BACKUP-CHECKPOINT`
file for future job resumptions now happen in the backup resumer. As
part of this resolution block we do two things:

  1. Check the bucket for existing `BACKUP-CHECKPOINT`/`BACKUP_MANIFEST` files
     to prevent concurrent backups from writing to the same backup.

  2. Write a `BACKUP-CHECKPOINT` file after we have resolved the destinations,
     etc. for the backup.

After these steps we persist the updated job details, which prevent the resumer
from running 1) and 2) on subsequent resumptions. If the job were paused after
2) but before the details were updated, a subsequent resumption would fail
at 1), essentially locking the backup out of the bucket it was writing to.

This change makes 2) the last step before we persist the job details, reducing
the window in which such a scenario can occur (a rough sketch of the new ordering
follows below).

Fixes: #79270

Release note: None
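
To make the ordering concrete, here is a minimal sketch of the idea. The helper
names and the in-memory destination map are illustrative stand-ins, not the
actual resumer code:

```go
// Minimal sketch of the reordering; helper names and the in-memory
// "dest" map are illustrative stand-ins, not backupccl code.
package main

import (
	"errors"
	"fmt"
)

var errDestinationInUse = errors.New("BACKUP-CHECKPOINT or BACKUP_MANIFEST already present in destination")

// checkForPriorBackup stands in for step 1): fail if another backup has
// already claimed this destination.
func checkForPriorBackup(dest map[string]bool) error {
	if dest["BACKUP-CHECKPOINT"] || dest["BACKUP_MANIFEST"] {
		return errDestinationInUse
	}
	return nil
}

// resolveAndClaim shows the new ordering: the checkpoint write (step 2) is
// the very last action before the job details are persisted, shrinking the
// window in which a pause can strand the job behind its own checkpoint.
func resolveAndClaim(dest map[string]bool, persistDetails func() error) error {
	if err := checkForPriorBackup(dest); err != nil { // step 1)
		return err
	}
	// ... resolve destinations, build the manifest, etc. ...
	dest["BACKUP-CHECKPOINT"] = true // step 2), now last
	return persistDetails()          // job details persisted immediately after
}

func main() {
	dest := map[string]bool{}
	if err := resolveAndClaim(dest, func() error { return nil }); err != nil {
		fmt.Println("backup failed:", err)
		return
	}
	fmt.Println("destination claimed and job details persisted")
}
```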

@adityamaru requested review from dt, stevendanna and a team May 20, 2022 20:34
@cockroach-teamcity (Member) commented: This change is Reviewable

@adityamaru (Contributor, Author) commented May 20, 2022

I don't love this, but thought I'd put it out to see what y'all think. The error is easily reproducible if you pause the job at `backup.resolved_job_details_update` and resume it.

Another option is to change the file we write so it is suffixed with the job ID; then, when checking for other backups, we would list files matching the `BACKUP-CHECKPOINT` glob and only flag files with a different job ID. This would also involve changes to the other places we read and write `BACKUP-CHECKPOINT`, so I was hesitant to do it as a first solution.
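
A minimal sketch of that job-ID-suffix idea (the file naming scheme and the helper below are hypothetical, not the existing `BACKUP-CHECKPOINT` read/write paths):

```go
// Sketch of the job-ID-suffix idea; the file naming and the helper are
// hypothetical, not the existing BACKUP-CHECKPOINT reader/writer code.
package main

import (
	"fmt"
	"strings"
)

const checkpointPrefix = "BACKUP-CHECKPOINT"

// conflictingCheckpoint reports whether any listed file looks like a
// BACKUP-CHECKPOINT written by a different job.
func conflictingCheckpoint(files []string, ownJobID string) (string, bool) {
	for _, f := range files {
		if !strings.HasPrefix(f, checkpointPrefix) {
			continue
		}
		suffix := strings.TrimPrefix(strings.TrimPrefix(f, checkpointPrefix), "-")
		if suffix != ownJobID {
			// A bare BACKUP-CHECKPOINT, or one carrying another job's ID,
			// counts as a conflict.
			return f, true
		}
	}
	return "", false
}

func main() {
	listing := []string{"BACKUP-CHECKPOINT-123456", "data/001.sst"}
	if f, ok := conflictingCheckpoint(listing, "123456"); ok {
		fmt.Println("destination already claimed by another backup:", f)
	} else {
		fmt.Println("only our own checkpoint present; safe to proceed")
	}
}
```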

@stevendanna (Collaborator) left a comment

Seems reasonable to reduce the size of the critical region here, but I wonder how much this helps in practice, since I imagine the job update itself is a likely place that we find out that we no longer have a job lease.

Putting the job ID in the naming scheme (or in the checkpoint) might be nice too since then we could potentially point the user to the job that is using the same location. I suppose that doesn't help with 2 nodes running the same job concurrently (one of which has lost the lease and just doesn't know it yet).

@adityamaru (Contributor, Author) commented
Closing in favour of #81994.

@adityamaru closed this May 29, 2022
Successfully merging this pull request may close these issues.

roachtest: acceptance/version-upgrade failed