kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed [timed out while starting tenant] #119340
Comments
The test cluster start-up hangs?
I am not sure if I've reproduced the exact same thing, but I'm able to reliably get this test to time out under stress with a 1-minute timeout. What is happening here is interesting, and I don't have the whole story yet. This is a 3-node test. As the 3 tenant SQL servers start up, they all see that the permanent migrations need to run:
It appears n1 won the race to create the migration job; the other two nodes simply wait for n1's job:
Over on
It appears that the transaction for that migration is stuck:
This bubbles up to the resumer:
But I believe that because this is a non-cancellable job, the job will still be in a "running" state, waiting for it to be re-adopted. Now, it should get adopted again, but in this 1m test run we may not be running long enough for that to happen. I've been unable to reproduce this with a longer timeout, because I almost always hit:
Before that happens. Focusing for a moment on why we might have contention: I can see clearly why there might be contention in the short term during startup. Namely, before the permanent migrations run that create these new schedules, we also start goroutines on every node that poll the schedules table and try to create the schedule. So it makes some sense that all these queries trying to create the same schedule may contend. Before 79219f7, the backoff in the loops that try to create these schedules doesn't matter because they will be in a very tight db.Txn retry loop. After that commit I think the system eventually unsticks itself.
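To make the failure mode concrete, here is a minimal Go sketch of the shape of that fix: each node's check-then-create attempt backs off between transaction attempts instead of spinning in a tight db.Txn retry loop. The function names, the simulated contention error, and the backoff parameters are all hypothetical; this is not the actual CockroachDB code from 79219f7.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errContention stands in for a retryable transaction error hit while
// several nodes race to create the same schedule row at startup.
var errContention = errors.New("restart transaction: contention on schedule row")

// createScheduleIfNotExists is a hypothetical stand-in for the
// check-then-create transaction each node runs at startup.
func createScheduleIfNotExists(ctx context.Context) error {
	// Simulate contention most of the time during startup.
	if rand.Intn(4) != 0 {
		return errContention
	}
	return nil
}

// ensureScheduleWithBackoff retries schedule creation with exponential
// backoff *between* transaction attempts, rather than letting a tight
// in-transaction retry loop hammer the contended rows.
func ensureScheduleWithBackoff(ctx context.Context) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 5 * time.Second
	for {
		err := createScheduleIfNotExists(ctx)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errContention) {
			return err // non-retryable
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := ensureScheduleWithBackoff(ctx); err != nil {
		fmt.Println("failed to ensure schedule:", err)
		return
	}
	fmt.Println("schedule ensured")
}
```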
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 666fdb445868b6f27862313ee75d032687f1b3db:
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 3c643b003890560b16c4fee1d1c18bea1871803b: Fatal error:
Stack:
Log preceding fatal error
Parameters:
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 934684d8134fd8bb34eae1a37f3aa83a4ac066b7: Fatal error:
Stack:
Log preceding fatal error
Parameters:
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 714dc4da48a3d2a07b5d097542808982f848f704: Fatal error:
Stack:
Log preceding fatal error
Parameters:
Same failure on other branches
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ f2e7709ee3912568de9e214560292844bf4e9f23:
Same failure on other branches
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 1ca84f37a5c230267fe4c9b209983db62becce6a:
Same failure on other branches
I think there are probably a few things going on here, but we really should try to dig into these slow tenant startup issues.
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 7512f81be1591cde8e3b5d34bb0b9b1101f22d24:
TL;DR: Mostly just reproducing the issue; no new observations beyond #119340 (comment). Progress so far: I am able to reproduce all three failures mentioned here under stress.
and this #119340 (comment) seems correct
I verified this by putting a stack trace print in cockroach/pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go Lines 212 to 218 in aeec299
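For reference, that kind of verification can be as simple as dumping the goroutine stack at the call site. A minimal Go sketch, with a hypothetical function name standing in for the controller's schedule-creation path (this is not the actual change):

```go
package main

import (
	"log"
	"runtime/debug"
)

// ensureSchedule is a hypothetical stand-in for the schedule-creation path
// in controller.go; dumping the goroutine stack here shows which startup
// path reached it (Controller.Start vs. the permanent upgrade).
func ensureSchedule() {
	log.Printf("ensureSchedule called from:\n%s", debug.Stack())
	// ... the real schedule-creation logic would follow here ...
}

func main() {
	ensureSchedule()
}
```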
When the
and
and they fail after successful adoption and completion of the migration,
Just out of curiosity, with
but it failed unexpectedly even before giving up on transaction retries,
Anyway, my first goal here is to understand why startup is slow, so I'm sidelining the other issues for now. As far as I know, only this test case is facing slow startup, so my suspicion is that the test setup done here before starting the test server is causing the transaction contention.
I just investigated the same failure over in #130931 (comment). I'll now mark that one as a duplicate of this one and assign it to
We can:
The former seems like a real solution and would solve a problem that seems likely to also affect production. The latter is a pure band-aid. My suggestion is to do both, in the order in which they appear here.
Uff. Apologies, I dropped the ball on this and so many people have spent time re-investigating this issue. The first bit of contention I originally found was around this telemetry schedule creation:
cockroach/pkg/sql/conn_executor.go Line 620 in 285460a
The second approach seems more logical to me. As you pointed out, we already have startup code that ensures the schedule is in place (see below), so it's unclear why a permanent upgrade task would also be needed for this.
cockroach/pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go Line 113 in aeec299
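To illustrate why a single idempotent startup path is sufficient, here is a rough database/sql sketch of a check-then-create that can safely run once at startup; the table name, columns, and connection string are invented, and this is not the controller's actual implementation:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // hypothetical choice of Postgres-wire driver
)

// ensureSchemaTelemetrySchedule creates the schedule only if it does not
// already exist, inside one transaction. Running this from a single
// startup path is enough; a second concurrent caller only adds contention.
func ensureSchemaTelemetrySchedule(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	var count int
	// "schedules" and the label value are illustrative, not the real schema.
	if err := tx.QueryRowContext(ctx,
		`SELECT count(*) FROM schedules WHERE label = $1`,
		"sql-schema-telemetry").Scan(&count); err != nil {
		return err
	}
	if count == 0 {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO schedules (label, recurrence) VALUES ($1, $2)`,
			"sql-schema-telemetry", "@weekly"); err != nil {
			return err
		}
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if err := ensureSchemaTelemetrySchedule(context.Background(), db); err != nil {
		fmt.Println("ensure schedule failed:", err)
	}
}
```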
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 393c4f973b662a31263490b936bbbdda23b8875a:
Same failure on other branches
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ f94a89b5231ddaffcb4f8705e19b5504af56bc47:
Same failure on other branches
This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
The last error on this makes me worried that this persistent failure is hiding other problems, so I am going to try to solve this this week.
The logging didn't actually print the value as it seemed to intend.

Informs cockroachdb#119340

Release note: None
139150: kvserver: remove StoreBenignError r=tbg a=tbg

Before commit 3f0b37a, StoreBenignError was used to handle pebble.ErrSnapshotExcised. As the latter has been removed from pebble, we don't need StoreBenignError anymore. This commit does the following:
- Remove the type "StoreBenignError".
- Remove the related increase action on the counter "storeFailures".
- Update the related test "TestBaseQueueRequeue".

Fixes: #129941
Closes: #130308
Release note: None

139280: roachtest: adding backup/restore tests for minio r=sravotto a=sravotto

Introducing a test to verify that we can back up and restore into a Minio object store cluster using the S3 API.

Fixes: #139272
Release note: None

139333: roachtest: only run 30 node backfill tests in full ac mode r=andrewbaptist a=andrewbaptist

In the non-full AC modes, a node can OOM during the fill period and the test will fail. This impacts the perturbation/metamorphic/backfill test.

Fixes: #139302
Informs: #139319
Release note: None

139475: rangefeed: fix test logging r=tbg a=stevendanna

The logging didn't actually print the value as it seemed to intend.

Informs #119340
Release note: None

Co-authored-by: XiaochenCui <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Silvano Ravotto <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
139632: upgrades: remove redundant `ensureSQLSchemaTelemetrySchedule` permanent upgrade r=rafiss a=shubhamdhama

During startup, `CreateSchemaTelemetrySchedule`, which creates the scheduled job to collect schema telemetry, is redundantly invoked twice concurrently: once via `schematelemetrycontroller.Controller.Start` and again through the `ensureSQLSchemaTelemetrySchedule` permanent upgrade. This was identified in #119340, where the permanent upgrade encounters contention and is resolved nearly 30 seconds later, causing slow startup of the tenant. Although the exact reason for the prolonged transaction deadlock is unclear, we can still benefit from removing this redundant upgrade.

Informs: #119340
Closes: #130931
Release note: None
Epic: none

Co-authored-by: Shubham Dhama <[email protected]>
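The PR simply deletes the redundant call site. For illustration only, another way to keep two in-process startup paths from racing each other is to funnel them through a single guarded call; the sketch below uses the Go standard library and invented names, and is not taken from the PR:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// scheduleCreator funnels every in-process caller through a single
// creation attempt, so two startup paths cannot race each other in
// separate transactions. The names here are hypothetical.
type scheduleCreator struct {
	once sync.Once
	err  error
}

func (c *scheduleCreator) ensure(ctx context.Context, create func(context.Context) error) error {
	c.once.Do(func() {
		c.err = create(ctx)
	})
	return c.err
}

func main() {
	var c scheduleCreator
	create := func(ctx context.Context) error {
		fmt.Println("creating schema telemetry schedule (runs once)")
		return nil
	}

	var wg sync.WaitGroup
	// Simulate Controller.Start and the permanent upgrade both calling in.
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.ensure(context.Background(), create)
		}()
	}
	wg.Wait()
}
```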
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ a58e89a2e54e3a5ad73edcce48a669e516cceedd:
Fatal error:
Stack:
Log preceding fatal error
Help
See also: How To Investigate a Go Test Failure (internal)
This test on roachdash
Jira issue: CRDB-36159