roachtest: acceptance/version-upgrade failed #79270
Comments
cc @cockroachdb/bulk-io
It looks like we didn't get any artifacts from this failure. @DarrylWong I wonder if you have any thoughts on this one.
I ran the roachtest on a GCE worker several times (6 or 7) but can't reproduce the error. I can dig into it more, but I'm not sure where to start looking.
roachtest.acceptance/version-upgrade failed with artifacts on master @ 01572daaf94f80f81f843723a8b58d80fe128990:
@stevendanna can you add color to this?
roachtest.acceptance/version-upgrade failed with artifacts on master @ 7f3c06f5f2c26bc84705430a3622f92ec1444e9d:
Looking into this now since I saw this flake on CI as well.
I think we have a bug around job resumption semantics in backup now that most of the backup manifest resolution logic has moved inside the resumer. We have a check (cockroach/pkg/ccl/backupccl/backup_planning.go, line 1457 in 9bf4dff) that runs in Resume if we have not already resolved details.URI: https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/backup_job.go#L410.
We then resolve all the details about the manifest, including
Backup manifest resolution and persistence of the `BACKUP-CHECKPOINT` file for future job resumptions now happen in the backup resumer. As part of this resolution block we do two things:

1. Check the bucket for existing BACKUP-CHECKPOINT/BACKUP_MANIFEST files to prevent concurrent backups from writing to the same backup.
2. Write a BACKUP-CHECKPOINT file after we have resolved all the destinations etc. for the backup.

After these steps we persist the updated job details, which prevents the resumer from running 1) and 2) on subsequent resumptions. If the job were to be resumed after 2) but before we update the details, a subsequent resumption would cause the job to fail at 1), essentially locking itself out of the bucket it was backing up to.

This change makes 2) the last step before we persist the job details, reducing the chances of such a scenario.

Fixes: cockroachdb#79270

Release note: None
@adityamaru should this be closed?
cc @cockroachdb/cdc
I am seeing a different failure on this test in a PR, and I don't think it's the PR:
Tagging CDC as the owner for jobs, but it could also be ZoneConfig.
Oh, maybe you did. I may not have been completely up to date; this happened on top of commit c897707
Huh, that should have the patch, I'll do something.
Oh! This must be on the old instance. We haven't released a
roachtest.acceptance/version-upgrade failed with artifacts on master @ 8fd5b3500796fae41c07fffd4246648b349b6460:
Parameters:
@adityamaru removing from our backlog - let me know if there is some reason the CDC team should be involved
roachtest.acceptance/version-upgrade failed with artifacts on master @ 9a2be9708393081498f54cb393ac6ee982ff000e:
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-14666