Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: acceptance/version-upgrade failed #79270

Closed
cockroach-teamcity opened this issue Apr 1, 2022 · 18 comments
Closed

roachtest: acceptance/version-upgrade failed #79270

cockroach-teamcity opened this issue Apr 1, 2022 · 18 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-cdc T-disaster-recovery T-kv KV Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Apr 1, 2022

roachtest.acceptance/version-upgrade failed with artifacts on master @ 9a2be9708393081498f54cb393ac6ee982ff000e:

The test failed on branch=master, cloud=local:
test artifacts and logs in: /artifacts/acceptance/version-upgrade/run_1
	assertions.go:262,assertions.go:1332,require.go:1231,versionupgrade.go:128,versionupgrade.go:206,versionupgrade.go:194,acceptance.go:56,acceptance.go:107,test_runner.go:875: 
			Error Trace:	versionupgrade.go:128
			            				versionupgrade.go:206
			            				versionupgrade.go:194
			            				acceptance.go:56
			            				acceptance.go:107
			            				test_runner.go:875
			            				asm_amd64.s:1581
			Error:      	Received unexpected error:
			            	pq: nodelocal://0/1648853850536825842 already contains a BACKUP-CHECKPOINT file (is another operation already in progress?)
			Test:       	acceptance/version-upgrade

	test_impl.go:259,require.go:1234,versionupgrade.go:128,versionupgrade.go:206,versionupgrade.go:194,acceptance.go:56,acceptance.go:107,test_runner.go:875: 
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-14666

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 1, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Apr 1, 2022
@blathers-crl
Copy link

blathers-crl bot commented Apr 3, 2022

cc @cockroachdb/bulk-io

@stevendanna
Copy link
Collaborator

It looks like we didn't get any artifacts from this failure. @DarrylWong I wonder if you have any thoughts on this one.

@DarrylWong
Copy link
Contributor

I ran the roachtest on a gce worker a bunch of times (like 6 or 7 times) but can't seem to reproduce the error. I can dig into it more but not really sure where to start looking.

@stevendanna stevendanna removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. blocks-22.1.0-beta.2 labels Apr 5, 2022
@cockroach-teamcity
Copy link
Member Author

roachtest.acceptance/version-upgrade failed with artifacts on master @ 01572daaf94f80f81f843723a8b58d80fe128990:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/acceptance/version-upgrade/run_1
	assertions.go:262,assertions.go:1332,require.go:1231,versionupgrade.go:128,versionupgrade.go:206,versionupgrade.go:194,acceptance.go:56,acceptance.go:107,test_runner.go:876: 
			Error Trace:	versionupgrade.go:128
			            				versionupgrade.go:206
			            				versionupgrade.go:194
			            				acceptance.go:56
			            				acceptance.go:107
			            				test_runner.go:876
			            				asm_amd64.s:1581
			Error:      	Received unexpected error:
			            	pq: nodelocal://0/1651555424055405752 already contains a BACKUP-CHECKPOINT file (is another operation already in progress?)
			Test:       	acceptance/version-upgrade

	test_impl.go:265,require.go:1234,versionupgrade.go:128,versionupgrade.go:206,versionupgrade.go:194,acceptance.go:56,acceptance.go:107,test_runner.go:876: 
Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@ajwerner
Copy link
Contributor

ajwerner commented May 3, 2022

@stevendanna can you add color to this?

@cockroach-teamcity
Copy link
Member Author

roachtest.acceptance/version-upgrade failed with artifacts on master @ 7f3c06f5f2c26bc84705430a3622f92ec1444e9d:

The test failed on branch=master, cloud=local:
test artifacts and logs in: /artifacts/acceptance/version-upgrade/run_1
	assertions.go:262,assertions.go:1332,require.go:1231,versionupgrade.go:128,versionupgrade.go:206,versionupgrade.go:194,acceptance.go:56,acceptance.go:107,test_runner.go:876: 
			Error Trace:	versionupgrade.go:128
			            				versionupgrade.go:206
			            				versionupgrade.go:194
			            				acceptance.go:56
			            				acceptance.go:107
			            				test_runner.go:876
			            				asm_amd64.s:1581
			Error:      	Received unexpected error:
			            	pq: nodelocal://0/1652250008425133226 already contains a BACKUP-CHECKPOINT file (is another operation already in progress?)
			Test:       	acceptance/version-upgrade

	test_impl.go:265,require.go:1234,versionupgrade.go:128,versionupgrade.go:206,versionupgrade.go:194,acceptance.go:56,acceptance.go:107,test_runner.go:876: 
Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@adityamaru
Copy link
Contributor

looing into this now since i saw this flake on CI as well.

@adityamaru
Copy link
Contributor

I think we have a bug around job resumption semantics in backup now that most of the backup manifest resolution logic has moved inside the resumer:

We have this check that reads a BACKUP-CHECKPOINT file to check for concurrent/previous backups written to the same location -

if err := checkForPreviousBackup(ctx, defaultStore, defaultURI); err != nil {
. This is called at the beginning of Resume if we have not already resolved the details.URI - https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/backup_job.go#L410.

We then resolve all the details about the manifest, including details.URI, and write this resolved manifest to BACKUP-CHECKPOINT. After this step we update the job details to reflect the resolved fields. If the job were to be interrupted and resumed before we updated the job details, the second time around when it checked for concurrent/previous backups it would the see the file it wrote the first time around, and claim that someone else is using this bucket.

adityamaru added a commit to adityamaru/cockroach that referenced this issue May 20, 2022
Backup manifest resolution and persistence of the `BACKUP-CHECKPOINT`
file for future job resumptions now happens in the backup resumer. As
part of this resolution block we do two things:

1. Check the bucket for existing BACKUP-CHECKPOINT/BACKUP_MANIFEST files
to prevent concurrent backups from writing to the same backup.

2. Write a BACKUP-CHECKPOINT file after we have resolved all the destinations etc.
for the backup.

After these steps we persist the updated job details that would prevent the resumer
from running 1) and 2) on subsequent resumptions. If the job were to be resumed after
2) but before we update the details, a subsequent resumption would cause the job to fail
at 1), essentially locking itself out of the bucket it was backing up to.

This change makes 2) the last step before we persist the job details reducing
the chances of such a scenario.

Fixes: cockroachdb#79270

Release note: None
@jlinder jlinder removed the sync-me-3 label May 27, 2022
@ajwerner
Copy link
Contributor

@adityamaru should this be closed?

@adityamaru
Copy link
Contributor

This has been fixed on 22.1 but still needs to be forward ported to master, which is serialized behind #82623. I'll close it out after I've ported #81994 to master.

@blathers-crl blathers-crl bot added the T-cdc label Jul 6, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jul 6, 2022

cc @cockroachdb/cdc

@tbg
Copy link
Member

tbg commented Jul 6, 2022

I am seeing a different failure on this test in a PR, and I don't think it's the PR:

* ERROR: a panic has occurred!
* runtime error: invalid memory address or nil pointer dereference
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:1038
*   | runtime.panicmem
*   | 	GOROOT/src/runtime/panic.go:221
*   | runtime.sigpanic
*   | 	GOROOT/src/runtime/signal_unix.go:735
*   | github.com/cockroachdb/cockroach/pkg/config.(*SystemConfig).getZoneEntry
*   | 	github.com/cockroachdb/cockroach/pkg/config/pkg/config/system.go:441
*   | github.com/cockroachdb/cockroach/pkg/config.(*SystemConfig).GetZoneConfigForObject
*   | 	github.com/cockroachdb/cockroach/pkg/config/pkg/config/system.go:428
*   | github.com/cockroachdb/cockroach/pkg/sql/gcjob.updateStatusForGCElements.func1
*   | 	github.com/cockroachdb/cockroach/pkg/sql/gcjob/refresh_statuses.go:105
*   | github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*CollectionFactory).Txn.func2
*   | 	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/txn.go:94
*   | github.com/cockroachdb/cockroach/pkg/kv.runTxn.func1
*   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:949
*   | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
*   | 	github.com/cockroachdb/cockroach/pkg/kv/txn.go:993
*   | github.com/cockroachdb/cockroach/pkg/kv.runTxn
*   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:948
*   | github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnWithAdmissionControl
*   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:911
*   | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
*   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:890
*   | github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*CollectionFactory).Txn
*   | 	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/txn.go:80
*   | github.com/cockroachdb/cockroach/pkg/sql.DescsTxn
*   | 	github.com/cockroachdb/cockroach/pkg/sql/exec_util.go:3332
*   | github.com/cockroachdb/cockroach/pkg/sql/gcjob.updateStatusForGCElements
*   | 	github.com/cockroachdb/cockroach/pkg/sql/gcjob/refresh_statuses.go:99
*   | github.com/cockroachdb/cockroach/pkg/sql/gcjob.refreshTables
*   | 	github.com/cockroachdb/cockroach/pkg/sql/gcjob/refresh_statuses.go:55
*   | github.com/cockroachdb/cockroach/pkg/sql/gcjob.schemaChangeGCResumer.Resume
*   | 	github.com/cockroachdb/cockroach/pkg/sql/gcjob/gc_job.go:223
*   | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func2
*   | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1168
*   | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
*   | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1169
*   | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
*   | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:408
*   | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func1
*   | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:329
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494
*   | runtime.goexit
*   | 	GOROOT/src/runtime/asm_amd64.s:1581
* Wraps: (2) runtime error: invalid memory address or nil pointer dereference
* Error types: (1) *withstack.withStack (2) runtime.errorString

Tagging CDC as the owner for jobs but it could also be ZoneConfig.

@ajwerner
Copy link
Contributor

ajwerner commented Jul 6, 2022

I thought I fixed this in #83474 / #83280

@tbg
Copy link
Member

tbg commented Jul 6, 2022

Oh maybe you did, I may not have been completely up to date, this happened on top of

commit c897707
Merge: 933b684 4488445
Author: craig[bot] [email protected]
Date: Tue Jul 5 12:52:22 2022 +0000

@ajwerner
Copy link
Contributor

ajwerner commented Jul 6, 2022

Huh, that should have the patch, I'll do something.

@ajwerner
Copy link
Contributor

ajwerner commented Jul 6, 2022

Oh! this must be on the old instance. We haven't released a 22.1 with the fix yet

@cockroach-teamcity
Copy link
Member Author

roachtest.acceptance/version-upgrade failed with artifacts on master @ 8fd5b3500796fae41c07fffd4246648b349b6460:

test artifacts and logs in: /artifacts/acceptance/version-upgrade/run_1
	assertions.go:262,assertions.go:1332,require.go:1231,versionupgrade.go:100,versionupgrade.go:188,versionupgrade.go:176,acceptance.go:71,acceptance.go:114,test_runner.go:896: 
			Error Trace:	versionupgrade.go:100
			            				versionupgrade.go:188
			            				versionupgrade.go:176
			            				acceptance.go:71
			            				acceptance.go:114
			            				test_runner.go:896
			            				asm_amd64.s:1581
			Error:      	Received unexpected error:
			            	pq: failed to run backup: exporting 114 ranges: unable to dial n4: breaker open
			Test:       	acceptance/version-upgrade

	test_impl.go:265,require.go:1234,versionupgrade.go:100,versionupgrade.go:188,versionupgrade.go:176,acceptance.go:71,acceptance.go:114,test_runner.go:896: 

Parameters: ROACHTEST_cloud=local , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@amruss
Copy link
Contributor

amruss commented Jul 27, 2022

@adityamaru removing from our backlog - let me know if there is some reason the CDC team should be involved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-cdc T-disaster-recovery T-kv KV Team
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

9 participants