
kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed [timed out while starting tenant] #119340

Closed
cockroach-teamcity opened this issue Feb 17, 2024 · 22 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 17, 2024

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ a58e89a2e54e3a5ad73edcce48a669e516cceedd:

Fatal error:

panic: test timed out after 4m57s
running tests:
	TestRangeFeedIntentResolutionRace (4m33s)

Stack:

goroutine 73770 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2259 +0x320
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace448879069
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.


See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/replication


Jira issue: CRDB-36159

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv-replication labels Feb 17, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Feb 17, 2024
@pav-kv
Collaborator

pav-kv commented Feb 19, 2024

The test cluster start-up hangs?

goroutine 34902 [select, 4 minutes]:
github.com/cockroachdb/cockroach/pkg/server.(*serverController).startAndWaitForRunningServer(0x4007f2e1c0, {0x5f8b298, 0x92782e0}, {0x4b7297d?, 0x1bf08eb000?})
	github.com/cockroachdb/cockroach/pkg/server/server_controller_orchestration.go:232 +0xcc
github.com/cockroachdb/cockroach/pkg/server.(*testServer).StartSharedProcessTenant(_, {_, _}, {{0x4b7297d, 0xb}, {0xa}, {{0x5f48680, 0x4006419c00}, {0x0, 0x0}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/server/testserver.go:1381 +0x474
github.com/cockroachdb/cockroach/pkg/server.(*testServer).startSharedProcessDefaultTestTenant(0x4003930000?, {0x5f8b298, 0x92782e0})
	github.com/cockroachdb/cockroach/pkg/server/testserver.go:660 +0x88
github.com/cockroachdb/cockroach/pkg/server.(*testServer).maybeStartDefaultTestTenant(0x4007be6e00, {0x5f8b298?, 0x92782e0?})
	github.com/cockroachdb/cockroach/pkg/server/testserver.go:710 +0x24c
github.com/cockroachdb/cockroach/pkg/server.(*testServer).Activate(0x4007be6e00, {0x5f8b298, 0x92782e0})
	github.com/cockroachdb/cockroach/pkg/server/testserver.go:836 +0xbc
github.com/cockroachdb/cockroach/pkg/testutils/serverutils.(*wrap).Activate(0x5f68d08?, {0x5f8b298, 0x92782e0})
	github.com/cockroachdb/cockroach/bazel-out/aarch64-fastbuild/bin/pkg/testutils/serverutils/ts_control_forwarder_generated.go:25 +0x54
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start(0x4007747800, {0x5fc1d80, 0x4008ceb380?})
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:484 +0x634
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster({_, _}, _, {{{{0x5f48680, 0x4006419880}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:238 +0x60
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test.TestRangeFeedIntentResolutionRace(0x4008ceb380)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1321 +0x338
testing.tRunner(0x4008ceb380, 0x4ee2c10)
	GOROOT/src/testing/testing.go:1595 +0xe8
created by testing.(*T).Run in goroutine 1
	GOROOT/src/testing/testing.go:1648 +0x33c

@pav-kv pav-kv added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Feb 19, 2024
@pav-kv pav-kv added the P-3 Issues/test failures with no fix SLA label Mar 4, 2024
@nvanbenschoten
Member

We've seen the same failure in #115345, #115080, and #114210.

Since these stuck startups are in StartSharedProcessTenant, I'm going to move this over to T-multitenant and allow them to make the call on whether this should remain a release-blocker and what to do with it.

@nvanbenschoten nvanbenschoten added T-multitenant Issues owned by the multi-tenant virtual team and removed T-kv-replication labels Mar 6, 2024
@nvanbenschoten nvanbenschoten changed the title kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed [timed out while starting tenant] Mar 6, 2024
@stevendanna stevendanna self-assigned this Mar 15, 2024
@stevendanna
Collaborator

I am not sure if I've reproduced the exact same thing, but I'm able to reliably get this test to time out under stress with a 1 minute timeout.

What is happening here is interesting. I don't have the whole story yet.

This is a 3-node test. As the 3 tenant SQL servers start up, they all see that the permanent migrations need to run:

I240320 10:55:12.666645 4455 upgrade/upgrademanager/manager.go:237  [T10,Vtest-tenant,n3] 312  the last permanent upgrade (v0.0-upgrading-step-004) does not appear to have completed; attempting to run all upgrades
I240320 10:55:12.691847 4636 upgrade/upgrademanager/manager.go:237  [T10,Vtest-tenant,n1] 328  the last permanent upgrade (v0.0-upgrading-step-004) does not appear to have completed; attempting to run all upgrades
I240320 10:55:13.099549 4485 upgrade/upgrademanager/manager.go:237  [T10,Vtest-tenant,n2] 382  the last permanent upgrade (v0.0-upgrading-step-004) does not appear to have completed; attempting to run all upgrades

It appears n1 won the race to create the migration job; the other two nodes simply wait for n1's job:

I240320 10:55:13.046709 4455 upgrade/upgrademanager/manager.go:774  [T10,Vtest-tenant,n3] 379  found existing migration job 953095660148752385 for version 0.0-upgrading-step-004 in status running, waiting
I240320 10:55:13.147829 4485 upgrade/upgrademanager/manager.go:774  [T10,Vtest-tenant,n2] 386  found existing migration job 953095660148752385 for version 0.0-upgrading-step-004 in status running, waiting

Over on n1 we make it all the way to:

I240320 10:55:13.034107 5259 upgrade/upgrades/permanent_upgrades.go:80  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 378  executing bootstrap step "add default SQL schema telemetry schedule"

It appears that the transaction for that migration is stuck:

W240320 10:55:14.298830 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 397  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 10 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=9 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.315082 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 398  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 20 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=19 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.366451 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 402  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 30 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=29 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.379330 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 403  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 40 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=39 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.473752 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 407  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 50 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=49 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.514074 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 409  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 60 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=59 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.590555 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 411  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 70 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=69 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.609663 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 413  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 80 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08378192 epo=79 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.832972 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 422  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 90 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08378192 epo=89 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.866958 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 425  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 100 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08378192 epo=99 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.868632 5259 kv/txn.go:1123  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 426  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 101 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08378192 epo=100 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Terminating retry loop and returning error due to cluster setting kv.transaction.internal.max_auto_retries (100). Rollback error: <nil>.

This bubbles up to the resumer:

E240320 10:55:14.869355 5259 jobs/adopt.go:461  [T10,Vtest-tenant,n1] 428  job 953095660148752385: adoption completed with error non-cancelable: running migration for 0.0-upgrading-step-004: bootstrap step "add default SQL schema telemetry schedule" failed: have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 101 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08378192 epo=100 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Terminating retry loop and returning error due to cluster setting kv.transaction.internal.max_auto_retries (100). Rollback error: <nil>.

But I believe that because this is a non-cancellable job, it will still be in a "running" state, waiting to be re-adopted. It should get adopted again, but in this 1m test run we may not be running long enough for that to happen.

I've been unable to reproduce this with a longer timeout, because I almost always hit:

    rangefeed_external_test.go:1428: 
                Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1416
                                                        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1428
                Error:          Expected value not to be nil.
                Test:           TestRangeFeedIntentResolutionRace
                Messages:       nil checkpoint

before that happens.

Focusing for a moment on why we might have contention: I can see why there would be contention in the short term during startup. Namely, before the permanent migrations that create these new schedules run, we also start goroutines on every node that poll the schedules table and try to create the schedule. So it makes some sense that all of these queries trying to create the same schedule may contend. Before 79219f7, the backoff in the loops that try to create these schedules didn't matter, because the retries happened in a very tight db.Txn retry loop. After that commit, I think the system eventually unsticks itself.
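
To make the backoff point concrete, here is a minimal, self-contained sketch. It is illustrative only: runTxn and errRetryable are stand-ins, not the real db.Txn helper or its error types, and the 100-attempt cap mirrors the kv.transaction.internal.max_auto_retries value seen in the logs above.

package main

import (
	"errors"
	"fmt"
)

// errRetryable stands in for a retryable transaction error, e.g. the
// TransactionRetryWithProtoRefreshError seen in the logs above.
var errRetryable = errors.New("retryable txn error")

// runTxn mimics a transaction helper that retries retryable errors
// internally, back to back, the way the tight retry loop described above
// behaves: there is no delay between attempts.
func runTxn(fn func() error, maxAutoRetries int) error {
	for i := 0; i <= maxAutoRetries; i++ {
		if err := fn(); !errors.Is(err, errRetryable) {
			return err // success, or a non-retryable error
		}
	}
	return fmt.Errorf("terminating retry loop after %d automatic retries", maxAutoRetries)
}

func main() {
	attempts := 0
	// Any backoff the caller applies *between* runTxn calls never runs here,
	// because all 101 attempts happen inside this single call.
	err := runTxn(func() error {
		attempts++
		return errRetryable // simulate persistent contention on the schedules table
	}, 100)
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}

Under this model, the caller's own backoff can only take effect once the internal cap is hit and the error finally surfaces.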

@stevendanna stevendanna removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 27, 2024
@cockroach-teamcity
Member Author

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 666fdb445868b6f27862313ee75d032687f1b3db:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace2243622718
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1444: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1467: started rangefeed on [n3,s3,r71/3:/Tenant/1{0/Table/…-1}]
    rangefeed_external_test.go:1521: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1509
        	            				github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1521
        	Error:      	Expected value not to be nil.
        	Test:       	TestRangeFeedIntentResolutionRace
        	Messages:   	nil checkpoint
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace2243622718
--- FAIL: TestRangeFeedIntentResolutionRace (60.38s)

@cockroach-teamcity
Member Author

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 3c643b003890560b16c4fee1d1c18bea1871803b:

Fatal error:

panic: test timed out after 4m57s
running tests:
	TestRangeFeedIntentResolutionRace (4m5s)

Stack:

goroutine 89414 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
	GOROOT/src/time/sleep.go:177 +0x2d
Log preceding fatal error

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestRangeFeedIntentResolutionRace717568773
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.

Parameters:

  • attempt=1
  • run=9
  • shard=1

@cockroach-teamcity
Member Author

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 934684d8134fd8bb34eae1a37f3aa83a4ac066b7:

Fatal error:

panic: test timed out after 4m57s
running tests:
	TestRangeFeedIntentResolutionRace (4m0s)

Stack:

goroutine 89234 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
	GOROOT/src/time/sleep.go:177 +0x2d
Log preceding fatal error

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestRangeFeedIntentResolutionRace1421379124
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.

Parameters:

  • attempt=1
  • run=2
  • shard=1

@cockroach-teamcity
Member Author

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 714dc4da48a3d2a07b5d097542808982f848f704:

Fatal error:

panic: test timed out after 4m57s
running tests:
	TestRangeFeedIntentResolutionRace (4m13s)

Stack:

goroutine 95968 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
	GOROOT/src/time/sleep.go:177 +0x2d
Log preceding fatal error

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestRangeFeedIntentResolutionRace1687467905
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.

Parameters:

  • attempt=1
  • run=7
  • shard=1
Same failure on other branches


@cockroach-teamcity
Member Author

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ f2e7709ee3912568de9e214560292844bf4e9f23:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4171001780
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1445: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1468: started rangefeed on [n3,s3,r71/3:/Tenant/1{0/Table/…-1}]
    rangefeed_external_test.go:1522: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1510
        	            				github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1522
        	Error:      	Expected value not to be nil.
        	Test:       	TestRangeFeedIntentResolutionRace
        	Messages:   	nil checkpoint
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4171001780
--- FAIL: TestRangeFeedIntentResolutionRace (60.23s)
Same failure on other branches


@cockroach-teamcity
Member Author

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 1ca84f37a5c230267fe4c9b209983db62becce6a:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace2771884319
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1445: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1467: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1467
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestRangeFeedIntentResolutionRace
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace2771884319
--- FAIL: TestRangeFeedIntentResolutionRace (57.49s)
Same failure on other branches


@stevendanna stevendanna added T-db-server and removed T-multitenant Issues owned by the multi-tenant virtual team labels Jul 9, 2024
@stevendanna
Collaborator

I think there are probably a few things going on here, but we really should try to dig into these slow tenant startup issues.

@stevendanna stevendanna removed their assignment Jul 9, 2024
@cockroach-teamcity
Member Author

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 7512f81be1591cde8e3b5d34bb0b9b1101f22d24:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4152603468
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1443: split off range r72:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1466: started rangefeed on [n3,s3,r72/3:/Tenant/1{0/Table/…-1}]
    rangefeed_external_test.go:1520: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1508
        	            				github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1520
        	Error:      	Expected value not to be nil.
        	Test:       	TestRangeFeedIntentResolutionRace
        	Messages:   	nil checkpoint
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4152603468
--- FAIL: TestRangeFeedIntentResolutionRace (61.71s)

@shubhamdhama
Contributor

TL;DR: Mostly just reproducing the issue; no new observations beyond what is already mentioned in #119340 (comment).

Progress so far

I am able to reproduce all three failures mentioned here under stress.
If I put a timeout of 1 min, it panics during startup:

❯ dev test pkg/kv/kvclient/rangefeed:rangefeed_test -f TestRangeFeedIntentResolutionRace --stress --count 20000 --timeout 1m --show-logs --verbose -- --test_output=errors

panic: test timed out after 57s
running tests:
	TestRangeFeedIntentResolutionRace (57s)
...
github.com/cockroachdb/cockroach/pkg/server.(*serverController).startAndWaitForRunningServer(0xc003fc8ee0, {0x6ac9c28, 0xa327c00}, {0x555f1ce?, 0x1bf08eb000?})
	github.com/cockroachdb/cockroach/pkg/server/server_controller.go:374 +0xf0

and the analysis in #119340 (comment) seems correct:

Focusing for a moment on why we might have contention, I can see clearly why there might be contention in the short term during startup. Namely, before the permanent migrations run that create these new schedules, we also start goroutines on every node that poll the schedules table and tries to create the schedule. So, it makes some sense that all these queries trying to create the same schedule may contend.

I verified this by putting a stack trace in:

// periodically. This is done during the cluster startup upgrade.
func CreateSchemaTelemetrySchedule(
	ctx context.Context, txn isql.Txn, st *cluster.Settings, clusterID uuid.UUID,
) (*jobs.ScheduledJob, error) {
	id, err := GetSchemaTelemetryScheduleID(ctx, txn)
	if err != nil {
		return nil, err

When --timeout is set to a larger value (2m), it fails inconsistently between

        	Error:      	Expected value not to be nil.

and

        	Error:      	Received unexpected error:
        	            	context deadline exceeded

and these failures occur after successful adoption and completion of the migration:

I240903 14:58:07.550793 15874 1@jobs/structured_log.go:60  [T10,Vtest-tenant,n3] 505 ={"Timestamp":1725375487550789995,"EventType":"status_change","JobID":1000423535935651842,"JobType":"MIGRATION","Description":"status changed to: succeeded","PreviousStatus":"running","NewStatus":"succeeded","RunNum":2}

Just out of curiosity, with a --timeout of 2 minutes, I tried increasing the context timeout in the test to a much larger value,

	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Second)

but it failed unexpectedly, even before giving up on transaction retries:

# this line just to show how soon it panicked
I240904 15:17:40.782599 16 1@server/server.go:2005  [T1,Vsystem,n1] 46  starting postgres server at 127.0.0.1:34155 (use: 127.0.0.1:34155)
...
I240904 15:17:43.906064 4795 1@sql/catalog/schematelemetry/schematelemetrycontroller/controller.go:217  [T10,Vtest-tenant,n3] 429 +	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x3fe
panic: error verifying key /Tenant/10/Table/37/1/1000710681101893635/0; lock table num=14
 lock: /Tenant/10/Table/37/1/1000710677910618115/0
  holder: txn: 08c2bf26-0a5a-44a6-8e31-809b8a2bf9f3 epoch: 2, iso: Serializable, ts: 1725463063.883275891,1, info: repl [Intent]
   waiting readers:
    req: 448, txn: e5c74f28-60ce-40ed-b2dd-15b0dab6f563
    req: 447, txn: 0eba4bd7-25f6-46a8-93fd-0483e1300e54
 lock: /Tenant/10/Table/37/1/1000710677910618115/1/1
  holder: txn: 08c2bf26-0a5a-44a6-8e31-809b8a2bf9f3 epoch: 2, iso: Serializable, ts: 1725463063.883275891,1, info: repl [Intent]

Anyway, my first goal here is to understand why startup is slow, so I'm sidelining the issues that show up when --timeout is 2 minutes.

As far as I know, only this test case is hitting the slow startup, so my suspicion is that the test setup done here before starting the test server is what causes the txn contention.

@tbg
Member

tbg commented Sep 19, 2024

I just investigated the same failure over in #130931 (comment). I'll now mark that one as a duplicate of this one and assign it to T-db-server as well. In case it helps, the short-form analysis seems to be:

  • the migration job sometimes hits contention and fails
  • the contention is ultimately resolved via breaking a deadlock (at least in the linked issue)
  • it's then only re-adopted after 30s or a minute or something like that
  • that's too long for the test, which has a 30s timeout on its context.
  • it then fails with the "nil checkpoint", which is really the result of the checkpoint channel being closed due to context cancellation (a minimal sketch of this follows below).
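
On that last point, here is a minimal, self-contained illustration of where a "nil checkpoint" comes from; the channel and checkpoint type are illustrative stand-ins, not the test's actual types. Once the channel delivering checkpoints is closed (e.g. because the context was canceled and the rangefeed shut down), a receive yields the element type's zero value — a nil pointer — and a not-nil assertion fails with exactly this kind of message.

package main

import "fmt"

// checkpoint is a stand-in for whatever the test receives on its channel.
type checkpoint struct{ ts int64 }

func main() {
	checkpointC := make(chan *checkpoint, 1)

	// Stands in for the rangefeed tearing down on context cancellation.
	close(checkpointC)

	// A receive from a closed channel returns the zero value, which for a
	// pointer type is nil -- hence "Expected value not to be nil".
	cp := <-checkpointC
	fmt.Printf("checkpoint=%v isNil=%t\n", cp, cp == nil)
}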

We can:

  • understand why there is a contention deadlock while starting up a tenant; we really shouldn't have those in our internal txns. @stevendanna seems to have had some ideas about who is involved in the deadlock at the bottom of this comment. Once understood, we should be able to make some change that at least prevents a deadlock. (Perhaps taking out an extra lock, or using read committed, who knows).
  • give the test a larger timeout, i.e. change this to something like 2 minutes.

The former seems like a real solution and would solve a problem that seems likely to also affect production. The latter is a pure band-aid. My suggestion is to do both, in the order in which they appear here.

@tbg tbg removed the A-testing Testing tools and infrastructure label Sep 19, 2024
@stevendanna
Collaborator

Uff. Apologies, I dropped the ball on this and so many people have spent time re-investigating this issue.

The first bit of contention I originally found was around this telemetry schedule creation. A few possible approaches:

  1. Move this call out to a different method that we only call after the migrations are done (a sketch of this ordering follows after this list). Other parts of server startup have a similar multi-part startup.

s.schemaTelemetryController.Start(ctx, stopper)

  2. Remove the permanent migration to add the schedule completely. If we have this anti-entropy mechanism that is going to check every startup, why not just rely on that?

  3. Understand why the schedule creation contends for so long. Sure, they are both trying to create the same schedule, but it kinda seems to me like one of these writers should win eventually.
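
A rough sketch of the ordering in approach 1, with hypothetical stand-ins (migrationsDone for whatever signal the upgrade manager would expose once the permanent upgrades finish, and startController for the s.schemaTelemetryController.Start call quoted above); this is about ordering only, not the real server wiring:

package main

import (
	"context"
	"fmt"
	"time"
)

// startAfterMigrations delays the (stand-in) controller start until the
// migrations-done signal fires, instead of racing the permanent upgrade.
func startAfterMigrations(ctx context.Context, migrationsDone <-chan struct{}, startController func(context.Context)) {
	go func() {
		select {
		case <-migrationsDone:
			startController(ctx)
		case <-ctx.Done():
			// Server is shutting down; never start the controller.
		}
	}()
}

func main() {
	migrationsDone := make(chan struct{})
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	startAfterMigrations(ctx, migrationsDone, func(context.Context) {
		fmt.Println("schema telemetry controller started after permanent upgrades")
	})

	time.Sleep(10 * time.Millisecond) // nothing started yet
	close(migrationsDone)             // permanent upgrades finished
	time.Sleep(10 * time.Millisecond) // give the goroutine a moment to run
}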

@spilchen
Contributor

The second approach seems more logical to me. As you pointed out, we already have startup code that ensures the schedule is in place (see below), so it's unclear why a permanent upgrade task would also be needed for this.

updateSchedule(stopCtx, c.db, c.st, c.clusterID())
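
For reference, a sketch of the shape of that anti-entropy mechanism; ensureScheduleExists and the interval are hypothetical stand-ins, not the controller's actual API. The point is that it is a periodic, idempotent "create the schedule if it is missing" pass, which is why a one-shot permanent upgrade doing the same insert looks redundant.

package main

import (
	"fmt"
	"time"
)

// runAntiEntropy periodically re-checks that the schedule exists, creating it
// if it is missing; errors are simply retried on the next tick.
func runAntiEntropy(stopCh <-chan struct{}, interval time.Duration, ensureScheduleExists func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := ensureScheduleExists(); err != nil {
				fmt.Printf("ensuring schema telemetry schedule: %v (will retry)\n", err)
			}
		case <-stopCh:
			return
		}
	}
}

func main() {
	stopCh := make(chan struct{})
	go runAntiEntropy(stopCh, 20*time.Millisecond, func() error {
		fmt.Println("schedule present (created if it was missing)")
		return nil
	})
	time.Sleep(70 * time.Millisecond)
	close(stopCh)
}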

@cockroach-teamcity
Member Author

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 393c4f973b662a31263490b936bbbdda23b8875a:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:165: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4120758641
    test_log_scope.go:76: use -show-logs to present logs inline
    test_server_shim.go:152: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1433: split off range r77:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1457: started rangefeed on [n3,s3,r77/3:/Tenant/1{0/Table/…-1}]
    rangefeed_external_test.go:1511: 
        	Error Trace:	pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1499
        	            				pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1511
        	Error:      	Expected value not to be nil.
        	Test:       	TestRangeFeedIntentResolutionRace
        	Messages:   	nil checkpoint
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4120758641
--- FAIL: TestRangeFeedIntentResolutionRace (36.01s)
Same failure on other branches


@cockroach-teamcity
Member Author

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ f94a89b5231ddaffcb4f8705e19b5504af56bc47:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:165: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4214466927
    test_log_scope.go:76: use -show-logs to present logs inline
    test_server_shim.go:154: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1433: split off range r77:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1457: started rangefeed on [n3,s3,r77/3:/Tenant/1{0/Table/…-1}]
    rangefeed_external_test.go:1511: rangefeed checkpoint at 1737144262.766417449,0 >= 1737144262.666877932,0
    rangefeed_external_test.go:1512: initial checkpoint at 1737144262.766417449,0
    rangefeed_external_test.go:1523: wrote /Tenant/10"\xfa/foo"
    rangefeed_external_test.go:1542: replica [n1,s1,r77/1:/Tenant/1{0/Table/…-1}] closed timestamp at 1737144262.967536359,0 >= 1737144262.954275413,0
    rangefeed_external_test.go:1543: replica [n3,s3,r77/3:/Tenant/1{0/Table/…-1}] closed timestamp at 1737144262.967536359,0 >= 1737144262.954275413,0
    rangefeed_external_test.go:1544: closed timestamp 1737144262.967536359,0 is above txn write timestamp 1737144262.954275413,0
    rangefeed_external_test.go:1383: blocked push for txn id=7e11cb77 key=/Tenant/10"\xfa/foo" iso=Serializable pri=0.00000000 epo=0 ts=1737144262.954275413,0 min=1737144262.954275413,0 seq=0
    rangefeed_external_test.go:1551: rangefeed checkpoint at 1737144262.954275412,0 >= 1737144262.954275412,0
    rangefeed_external_test.go:1552: rangefeed caught up to txn write timestamp at 1737144262.954275412,0
    rangefeed_external_test.go:1562: txn committed at 1737144262.954275413,0
    rangefeed_external_test.go:1386: unblocked push for txn id=7e11cb77 key=/Tenant/10"\xfa/foo" iso=Serializable pri=0.00000000 epo=0 ts=1737144262.954275413,0 min=1737144262.954275413,0 seq=0
    rangefeed_external_test.go:1360: blocked r77 on s3
    rangefeed_external_test.go:1588: 
        	Error Trace:	pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1588
        	Error:      	repl3 emitted premature value %s
        	Test:       	TestRangeFeedIntentResolutionRace
        	Messages:   	<nil>
    rangefeed_external_test.go:1360: blocked r77 on s3
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4214466927
--- FAIL: TestRangeFeedIntentResolutionRace (30.07s)
Same failure on other branches


@stevendanna stevendanna self-assigned this Jan 20, 2025
@stevendanna stevendanna added A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team labels Jan 20, 2025

blathers-crl bot commented Jan 20, 2025

This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@stevendanna
Collaborator

The last error on this makes me worried that this persistent failure is hiding other problems, so I am going to try to solve it this week.

stevendanna added a commit to stevendanna/cockroach that referenced this issue Jan 21, 2025
The logging didn't actually print the value as it seemed to intend.

Informs cockroachdb#119340

Release note: None
craig bot pushed a commit that referenced this issue Jan 21, 2025
139150: kvserver: remove StoreBenignError r=tbg a=tbg

Before commit 3f0b37a, StoreBenignError was used to handle pebble.ErrSnapshotExcised. As the latter has been removed from pebble, we don't need StoreBenignError anymore.

This commit does the following:

- Remove type "StoreBenignError".
- Remove the related increment of the "storeFailures" counter.
- Update related tests "TestBaseQueueRequeue".

Fixes: #129941

Closes: #130308

Release note: None

139280: roachtest: adding backup/restore tests for minio r=sravotto a=sravotto

Introducing a test to verify that we can back up and restore into a Minio Object Store cluster using the S3 API.

Fixes: #139272

Release note: None

139333: roachtest: only run 30 node backfill tests in full ac mode r=andrewbaptist a=andrewbaptist

In the non-full AC modes, a node can OOM during the fill period and the test will fail. This impacts the perturbation/metamorphic/backfill test.

Fixes: #139302
Informs: #139319

Release note: None

139475: rangefeed: fix test logging r=tbg a=stevendanna

The logging didn't actually print the value as it seemed to intend.

Informs #119340

Release note: None

Co-authored-by: XiaochenCui <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Silvano Ravotto <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
shubhamdhama added a commit to shubhamdhama/cockroach that referenced this issue Jan 23, 2025
…nt upgrade

During startup, `CreateSchemaTelemetrySchedule`, which creates the
scheduled job to collect schema telemetry, is redundantly invoked twice
concurrently: once via `schematelemetrycontroller.Controller.Start` and
again through the `ensureSQLSchemaTelemetrySchedule` permanent upgrade.

This was identified in cockroachdb#119340, where the permanent upgrade encounters
contention and is resolved nearly 30 seconds later causing slow startup of
the tenant. Although the exact reason for the prolonged transaction
deadlock is unclear, we can still benefit by removing this redundant
upgrade.

Informs: cockroachdb#119340
Closes: cockroachdb#130931

Release note: None

Epic: none
shubhamdhama added a commit to shubhamdhama/cockroach that referenced this issue Jan 23, 2025
craig bot pushed a commit that referenced this issue Jan 24, 2025
139632: upgrades: remove redundant `ensureSQLSchemaTelemetrySchedule` permanent upgrade r=rafiss a=shubhamdhama

During startup, `CreateSchemaTelemetrySchedule`, which creates the scheduled job to collect schema telemetry, is redundantly invoked twice concurrently: once via `schematelemetrycontroller.Controller.Start` and also through the `ensureSQLSchemaTelemetrySchedule` permanent upgrade.

This was identified in #119340, where the permanent upgrade encounters contention and is resolved nearly 30 seconds later causing slow startup of the tenant. Although the exact reason for the prolonged transaction deadlock is unclear, we can still benefit by removing this redundant upgrade.

Informs: #119340
Closes: #130931

Release note: None

Epic: none

Co-authored-by: Shubham Dhama <[email protected]>
andy-kimball pushed a commit to andy-kimball/cockroach that referenced this issue Jan 27, 2025
blathers-crl bot pushed a commit that referenced this issue Feb 5, 2025