roachtest: cdc/mixed-versions failed #106878

cockroach-teamcity · 2023-07-15T09:02:37Z

roachtest.cdc/mixed-versions failed with artifacts on release-22.2 @ 100a4aa3f590ac3989b3cc5a172996afaa9de862:

test artifacts and logs in: /artifacts/cdc/mixed-versions/run_1
(test_runner.go:985).runTest: test timed out (30m0s)

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng _{This test on roachdash | Improve this report!

Jira issue: CRDB-29749
Epic CRDB-11732}

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-07-17T13:35:18Z

cc @cockroachdb/cdc

renatolabs · 2023-07-17T13:35:48Z

@cockroachdb/cdc Could you take a look at this failure, please?

jayshrivastava · 2023-07-17T15:00:19Z

At ~8:40 we get stuck waiting for resolved timestamps here after rolling back the nodes.

cockroach/pkg/cmd/roachtest/tests/mixed_version_cdc.go

Line 402 in 1fa4fac

tester.waitForResolvedTimestamps(),

08:40:47 versionupgrade.go:200: test status: versionUpgradeTest: starting step 12
08:40:47 mixed_version_cdc.go:163: test status: waiting for 5 resolved timestamps
08:41:04 mixed_version_cdc.go:274: 11 resolved timestamps validated, latest is 1m52.957808509s behind realtime
08:41:04 mixed_version_cdc.go:177: 1 of 5 timestamps resolved

The teardown is initiated at ~9:02

teardown: 09:02:16 test_runner.go:1035: [w12] dumped stacks to __stacks

Looks like the changefeed job failed at around ~8:37.

882572078533705731	CHANGEFEED	CREATE CHANGEFEED FOR TABLE bank.bank INTO 'kafka://10.150.0.142:9092' WITH resolved = '10s', updated		root	{171}	failed	NULL	2023-07-15 08:33:43.575977	2023-07-15 08:33:43.597976	2023-07-15 08:40:44.57881	2023-07-15 08:40:44.546961	NULL	1689410351823428620.0000000000	unable to dial n1: breaker open	NULL	2675257580918454187	2023-07-15 08:40:44.560528	2023-07-15 08:41:14.560528	1	"{""running execution from '2023-07-15 08:33:43.597976' to '2023-07-15 08:37:14.997527' on 3 failed: could not register flowID {[38 79 173 244 38 254 74 58 184 112 115 40 60 254 156 37]} because the registry is draining""}"	"[{""executionEndMicros"": ""1689410234997527"", ""executionStartMicros"": ""1689410023597976"", ""instanceId"": 3, ""status"": ""running"", ""truncatedError"": ""could not register flowID {[38 79 173 244 38 254 74 58 184 112 115 40 60 254 156 37]} because the registry is draining""}]"

We log a permanent changefeed shutdown at ~8:40 as well

./logs/2.unredacted/cockroach.teamcity-10921274-1689399618-34-n5cpu4-0002.ubuntu.2023-07-15T08_36_35Z.013474.log:I230715 08:40:04.181666 3482 ccl/changefeedccl/changefeed_stmt.go:1030 ⋮ [n2,job=‹CHANGEFEED id=882572078533705731›] 308  CHANGEFEED 882572078533705731 shutting down (cause: cannot acquire lease when draining)

This code path may be relevant here. Maybe we tried to mark this error as retryable on the job level.

cockroach/pkg/ccl/changefeedccl/changefeedbase/errors.go

Lines 119 to 125 in 9f8281c

    
           if lm.IsDraining() { 
        
           	// This node is being drained. It's safe to propagate this error (to the 
        
           	// job registry) since job registry should not be able to commit this error 
        
           	// to the jobs table; but to be safe, make sure this error is marked as jobs 
        
           	// retryable error to ensure that some other node retries this changefeed. 
        
           	return jobs.MarkAsRetryJobError(cause) 
        
           }

Draining errors should be marked as retryable in one way or another, but it seems that retries did not happen in this case.

miretskiy · 2023-07-17T16:03:13Z

The shutting down error is correct -- and the log messages seem to indicate that nothing can be persisted to the jobs table (i.e. mark changefeed failed, etc)... So, the changefeed ought to become unclaimed, and eventually, get claimed by another node.

To be clear: i don't think this error is permanent

miretskiy · 2023-07-17T16:13:16Z

It is strange though...

jayshrivastava · 2023-07-17T16:52:24Z

the log messages seem to indicate that nothing can be persisted to the jobs table (i.e. mark changefeed failed, etc)

Which log messages are you referring to? I can look into it. It looks like the the job status was set to failed in the middle of the test. I'm looking for the reason why. Nothing should put it in that state during the test, and nothing can bring it out of that state.

jayshrivastava · 2023-07-20T17:17:48Z

Adding some notes from the discussion yesterday:

The node draining errors were common in 22.1 and are considered to be fixed in 22.2 onwards by treating all errors as retryable (I believe by this PR: #90810).

This test may flake due to the upgrade from 22.1->22.2. The test asserts a changefeed remains running by checking for resolved timestamps being emitted on a regular basis. The problem with this is that, during the rolling upgrade, the changefeed may fail with a "draining" error. This issue is fixed in 22.2 onwards by treating all errors as retryable. Rather than skipping this test because 22.1 is EOLed, it is preferable to still run this test regularly because it tests 22.2 functionality. This change adds a fix where the test will poll the changefeed every 1s and recreate it if it fails. Closes: cockroachdb#106878 Release note: None Epic: None

cockroach-teamcity added this to the 22.2 milestone Jul 15, 2023

blathers-crl bot added the T-testeng TestEng Team label Jul 15, 2023

renatolabs added T-cdc and removed T-testeng TestEng Team labels Jul 17, 2023

blathers-crl bot added the A-cdc Change Data Capture label Jul 17, 2023

cockroach-teamcity mentioned this issue Jul 18, 2023

roachtest: cdc/mixed-versions failed #107007

Closed

miretskiy assigned jayshrivastava Jul 18, 2023

jayshrivastava mentioned this issue Jul 20, 2023

release-22.2: roachtest/cdc/mixed-versions: allow for changefeed failures #107293

Merged

jayshrivastava closed this as completed Jul 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: cdc/mixed-versions failed #106878

roachtest: cdc/mixed-versions failed #106878

cockroach-teamcity commented Jul 15, 2023 •

edited by exalate-issue-sync bot

Loading

blathers-crl bot commented Jul 17, 2023

renatolabs commented Jul 17, 2023

jayshrivastava commented Jul 17, 2023

miretskiy commented Jul 17, 2023 •

edited

Loading

miretskiy commented Jul 17, 2023

jayshrivastava commented Jul 17, 2023

jayshrivastava commented Jul 20, 2023

roachtest: cdc/mixed-versions failed #106878

roachtest: cdc/mixed-versions failed #106878

Comments

cockroach-teamcity commented Jul 15, 2023 • edited by exalate-issue-sync bot Loading

blathers-crl bot commented Jul 17, 2023

renatolabs commented Jul 17, 2023

jayshrivastava commented Jul 17, 2023

miretskiy commented Jul 17, 2023 • edited Loading

miretskiy commented Jul 17, 2023

jayshrivastava commented Jul 17, 2023

jayshrivastava commented Jul 20, 2023

cockroach-teamcity commented Jul 15, 2023 •

edited by exalate-issue-sync bot

Loading

miretskiy commented Jul 17, 2023 •

edited

Loading