kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed [timed out while starting tenant] #119340
Comments
The test cluster start-up hangs?
I am not sure if I've reproduced the exact same thing, but I'm able to reliably get this test to time out under stress with a 1-minute timeout. What is happening here is interesting, and I don't have the whole story yet. This is a 3-node test. As the 3 tenant SQL servers start up, they all see that the permanent migrations need to run:
It appears n1 won the race to create the migration job; the other two nodes simply wait for n1's job:
Over on
It appears that the transaction for that migration is stuck:
This bubbles up to the resumer:
But I believe that because this is a non-cancellable job, the job will still be in a "running" state, waiting for it to be re-adopted. Now, it should get adopted again, but in this 1m test run we may not be running long enough for that to happen. I've been unable to reproduce this with a longer timeout, because I almost always hit:
Before that happens. Focusing for a moment on why we might have contention: I can see clearly why there might be contention in the short term during startup. Namely, before the permanent migrations run that create these new schedules, we also start goroutines on every node that poll the schedules table and try to create the schedule. So it makes some sense that all these queries trying to create the same schedule may contend. Before 79219f7, the backoff in the loops that try to create these schedules doesn't matter because they will be in a very tight db.Txn retry loop. After that commit I think the system eventually unsticks itself.
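To make the failure mode concrete, here is a minimal Go sketch of the shape of that fix: each node's check-then-create attempt backs off between transaction attempts instead of spinning in a tight db.Txn retry loop. The function names, the simulated contention error, and the backoff parameters are all hypothetical; this is not the actual CockroachDB code from 79219f7.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errContention stands in for a retryable transaction error hit while
// several nodes race to create the same schedule row at startup.
var errContention = errors.New("restart transaction: contention on schedule row")

// createScheduleIfNotExists is a hypothetical stand-in for the
// check-then-create transaction each node runs at startup.
func createScheduleIfNotExists(ctx context.Context) error {
	// Simulate contention most of the time during startup.
	if rand.Intn(4) != 0 {
		return errContention
	}
	return nil
}

// ensureScheduleWithBackoff retries schedule creation with exponential
// backoff *between* transaction attempts, rather than letting a tight
// in-transaction retry loop hammer the contended rows.
func ensureScheduleWithBackoff(ctx context.Context) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 5 * time.Second
	for {
		err := createScheduleIfNotExists(ctx)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errContention) {
			return err // non-retryable
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := ensureScheduleWithBackoff(ctx); err != nil {
		fmt.Println("failed to ensure schedule:", err)
		return
	}
	fmt.Println("schedule ensured")
}
```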
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 666fdb445868b6f27862313ee75d032687f1b3db:
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 3c643b003890560b16c4fee1d1c18bea1871803b: Fatal error:
Stack:
Log preceding fatal error
Parameters:
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 934684d8134fd8bb34eae1a37f3aa83a4ac066b7: Fatal error:
Stack:
Log preceding fatal error
Parameters:
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 714dc4da48a3d2a07b5d097542808982f848f704: Fatal error:
Stack:
Log preceding fatal error
Parameters:
Same failure on other branches
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ f2e7709ee3912568de9e214560292844bf4e9f23:
Same failure on other branches
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 1ca84f37a5c230267fe4c9b209983db62becce6a:
Same failure on other branches
I think there are probably a few things going on here, but we really should try to dig into these slow tenant startup issues.
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 7512f81be1591cde8e3b5d34bb0b9b1101f22d24:
TL;DR: Mostly just reproducing the issue; no new observations beyond #119340 (comment). Progress so far: I am able to reproduce all three failures mentioned here under stress.
and this #119340 (comment) seems correct
I verified this by putting a stack trace print in cockroach/pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go Lines 212 to 218 in aeec299
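For reference, that kind of verification can be as simple as dumping the goroutine stack at the call site. A minimal Go sketch, with a hypothetical function name standing in for the controller's schedule-creation path (this is not the actual change):

```go
package main

import (
	"log"
	"runtime/debug"
)

// ensureSchedule is a hypothetical stand-in for the schedule-creation path
// in controller.go; dumping the goroutine stack here shows which startup
// path reached it (Controller.Start vs. the permanent upgrade).
func ensureSchedule() {
	log.Printf("ensureSchedule called from:\n%s", debug.Stack())
	// ... the real schedule-creation logic would follow here ...
}

func main() {
	ensureSchedule()
}
```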
When the
and
and they fail after successful adoption and completion of the migration,
Just out of curiosity, with
but it failed unexpectedly even before giving up on transaction retries,
Anyway, my first goal here is to understand why startup is slow, so I'm sidelining the other issues for now. As far as I know, only this test case is facing slow startup, so my suspicion is that the test setup done here before starting the test server is causing the transaction contention.
I just investigated the same failure over in #130931 (comment). I'll now mark that one as a duplicate of this one and assign it to
We can:
The former seems like a real solution and would solve a problem that seems likely to also affect production. The latter is a pure band-aid. My suggestion is to do both, in the order in which they appear here.
Uff. Apologies, I dropped the ball on this and so many people have spent time re-investigating this issue. The first bit of contention I originally found was around this telemetry schedule creation:
cockroach/pkg/sql/conn_executor.go Line 620 in 285460a
The second approach seems more logical to me. As you pointed out, we already have startup code that ensures the schedule is in place (see below), so it's unclear why a permanent upgrade task would also be needed for this.
cockroach/pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go Line 113 in aeec299
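To illustrate why a single idempotent startup path is sufficient, here is a rough database/sql sketch of a check-then-create that can safely run once at startup; the table name, columns, and connection string are invented, and this is not the controller's actual implementation:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // hypothetical choice of Postgres-wire driver
)

// ensureSchemaTelemetrySchedule creates the schedule only if it does not
// already exist, inside one transaction. Running this from a single
// startup path is enough; a second concurrent caller only adds contention.
func ensureSchemaTelemetrySchedule(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	var count int
	// "schedules" and the label value are illustrative, not the real schema.
	if err := tx.QueryRowContext(ctx,
		`SELECT count(*) FROM schedules WHERE label = $1`,
		"sql-schema-telemetry").Scan(&count); err != nil {
		return err
	}
	if count == 0 {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO schedules (label, recurrence) VALUES ($1, $2)`,
			"sql-schema-telemetry", "@weekly"); err != nil {
			return err
		}
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if err := ensureSchemaTelemetrySchedule(context.Background(), db); err != nil {
		fmt.Println("ensure schedule failed:", err)
	}
}
```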
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 393c4f973b662a31263490b936bbbdda23b8875a:
Same failure on other branches
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ f94a89b5231ddaffcb4f8705e19b5504af56bc47:
Same failure on other branches
This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
The last error on this makes me worried that this persistent failure is hiding other problems, so I am going to try to solve this this week.
The logging didn't actually print the value as it seemed to intend.

Informs cockroachdb#119340

Release note: None
139150: kvserver: remove StoreBenignError r=tbg a=tbg

Before commit 3f0b37a, StoreBenignError was used to handle pebble.ErrSnapshotExcised. As the latter has been removed from pebble, we don't need StoreBenignError anymore. This commit does the following:
- Remove the type "StoreBenignError".
- Remove the related increase action on the counter "storeFailures".
- Update the related test "TestBaseQueueRequeue".

Fixes: #129941
Closes: #130308
Release note: None

139280: roachtest: adding backup/restore tests for minio r=sravotto a=sravotto

Introducing a test to verify that we can back up and restore into a Minio object store cluster using the S3 API.

Fixes: #139272
Release note: None

139333: roachtest: only run 30 node backfill tests in full ac mode r=andrewbaptist a=andrewbaptist

In the non-full AC modes, a node can OOM during the fill period and the test will fail. This impacts the perturbation/metamorphic/backfill test.

Fixes: #139302
Informs: #139319
Release note: None

139475: rangefeed: fix test logging r=tbg a=stevendanna

The logging didn't actually print the value as it seemed to intend.

Informs #119340
Release note: None

Co-authored-by: XiaochenCui <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Silvano Ravotto <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
139632: upgrades: remove redundant `ensureSQLSchemaTelemetrySchedule` permanent upgrade r=rafiss a=shubhamdhama

During startup, `CreateSchemaTelemetrySchedule`, which creates the scheduled job to collect schema telemetry, is redundantly invoked twice concurrently: once via `schematelemetrycontroller.Controller.Start` and again through the `ensureSQLSchemaTelemetrySchedule` permanent upgrade. This was identified in #119340, where the permanent upgrade encounters contention and is resolved nearly 30 seconds later, causing slow startup of the tenant. Although the exact reason for the prolonged transaction deadlock is unclear, we can still benefit from removing this redundant upgrade.

Informs: #119340
Closes: #130931
Release note: None
Epic: none

Co-authored-by: Shubham Dhama <[email protected]>
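The PR simply deletes the redundant call site. For illustration only, another way to keep two in-process startup paths from racing each other is to funnel them through a single guarded call; the sketch below uses the Go standard library and invented names, and is not taken from the PR:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// scheduleCreator funnels every in-process caller through a single
// creation attempt, so two startup paths cannot race each other in
// separate transactions. The names here are hypothetical.
type scheduleCreator struct {
	once sync.Once
	err  error
}

func (c *scheduleCreator) ensure(ctx context.Context, create func(context.Context) error) error {
	c.once.Do(func() {
		c.err = create(ctx)
	})
	return c.err
}

func main() {
	var c scheduleCreator
	create := func(ctx context.Context) error {
		fmt.Println("creating schema telemetry schedule (runs once)")
		return nil
	}

	var wg sync.WaitGroup
	// Simulate Controller.Start and the permanent upgrade both calling in.
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.ensure(context.Background(), create)
		}()
	}
	wg.Wait()
}
```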
kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ a58e89a2e54e3a5ad73edcce48a669e516cceedd:
Fatal error:
Stack:
Log preceding fatal error
Help
See also: How To Investigate a Go Test Failure (internal)
This test on roachdash
Jira issue: CRDB-36159