roachtest: cdc/mixed-versions failed #106878
cc @cockroachdb/cdc
@cockroachdb/cdc Could you take a look at this failure, please?
At ~8:40 we get stuck waiting for resolved timestamps here after rolling back the nodes.
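For context, the assertion that hangs here is essentially a bounded wait for the next resolved timestamp. A minimal sketch of that shape, assuming a `resolved` channel fed by the test's sink reader and an illustrative timeout (not the actual roachtest code):

```go
package cdctest

import (
	"fmt"
	"log"
	"time"
)

// waitForResolved blocks until the next resolved timestamp arrives on
// `resolved` (assumed to be fed by the test's sink reader), or fails after
// `timeout`. A hung changefeed, as in this failure, hits the timeout branch.
func waitForResolved(resolved <-chan time.Time, timeout time.Duration) error {
	select {
	case ts := <-resolved:
		log.Printf("changefeed resolved up to %s", ts)
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("no resolved timestamp emitted within %s", timeout)
	}
}
```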
The teardown is initiated at ~9:02.
Looks like the changefeed job failed at ~8:37.
We log a permanent changefeed shutdown at ~8:40 as well.
This code path may be relevant here. Maybe we tried to mark this error as retryable at the job level: cockroach/pkg/ccl/changefeedccl/changefeedbase/errors.go, lines 119 to 125 at 9f8281c.
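For readers without the snippet inline, the general shape being discussed is an allowlist-style classifier: everything retries unless explicitly marked terminal. A hypothetical sketch, with `MarkTerminal` and `IsRetryable` as made-up names standing in for the real helpers in errors.go:

```go
package cdctest

import (
	"context"
	"errors"
	"strings"
)

// terminalError marks an error as permanent. This is a sketch of the idea;
// the real classification in changefeedbase/errors.go differs in detail.
type terminalError struct{ err error }

func (t *terminalError) Error() string { return t.err.Error() }
func (t *terminalError) Unwrap() error { return t.err }

// MarkTerminal and IsRetryable are illustrative names, not the real API.
func MarkTerminal(err error) error { return &terminalError{err: err} }

func IsRetryable(err error) bool {
	if err == nil {
		return false
	}
	var t *terminalError
	if errors.As(err, &t) {
		return false // explicitly marked permanent
	}
	// A drain shows up as a context cancellation or a "draining" message;
	// both should be retried rather than treated as job failure.
	if errors.Is(err, context.Canceled) ||
		strings.Contains(err.Error(), "draining") {
		return true
	}
	return true // default in 22.2 onwards: everything else retries too
}
```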
Draining errors should be marked as retryable in one way or another, but it seems that retries did not happen in this case.
The shutting down error is correct, and the log messages seem to indicate that nothing can be persisted to the jobs table (e.g. marking the changefeed failed)... So the changefeed ought to become unclaimed and, eventually, get claimed by another node. To be clear: I don't think this error is permanent.
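One way to check the "unclaimed, then re-adopted" theory would be to watch the job's coordinator from SQL. A sketch under the assumption that `crdb_internal.jobs` exposes `status` and `coordinator_id` columns (exact schema varies by version):

```go
package cdctest

import (
	"database/sql"
	"fmt"
	"time"
)

// watchCoordinator polls the job row and reports when the coordinator
// changes, i.e. when another node adopts the job after a drain.
func watchCoordinator(db *sql.DB, jobID int64, d time.Duration) error {
	deadline := time.Now().Add(d)
	var lastCoord sql.NullInt64
	for time.Now().Before(deadline) {
		var status string
		var coord sql.NullInt64
		err := db.QueryRow(
			`SELECT status, coordinator_id FROM crdb_internal.jobs WHERE job_id = $1`,
			jobID,
		).Scan(&status, &coord)
		if err != nil {
			return err
		}
		if lastCoord.Valid && coord.Valid && coord.Int64 != lastCoord.Int64 {
			fmt.Printf("job %d re-adopted: node %d -> node %d (status %s)\n",
				jobID, lastCoord.Int64, coord.Int64, status)
			return nil
		}
		lastCoord = coord
		time.Sleep(time.Second)
	}
	return fmt.Errorf("job %d was not re-adopted within %s", jobID, d)
}
```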
It is strange, though...
Which log messages are you referring to? I can look into it. It looks like the job status was set to
Adding some notes from the discussion yesterday: the node draining errors were common in 22.1 and are considered fixed from 22.2 onwards by treating all errors as retryable (I believe by this PR: #90810).
This test may flake due to the upgrade from 22.1 to 22.2. The test asserts that a changefeed remains running by checking that resolved timestamps are emitted on a regular basis. The problem is that, during the rolling upgrade, the changefeed may fail with a "draining" error. This issue is fixed in 22.2 onwards by treating all errors as retryable. Rather than skipping this test because 22.1 is EOLed, it is preferable to keep running it regularly because it tests 22.2 functionality. This change adds a fix where the test polls the changefeed every 1s and recreates it if it fails.

Closes: cockroachdb#106878
Release note: None
Epic: None
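The poll-and-recreate loop described in the commit message could look roughly like this. This is a sketch of the idea, not the actual patch; the `bank.bank` table and the `resolved` option are illustrative, and the job-status query assumes the `crdb_internal.jobs` schema used above:

```go
package cdctest

import (
	"database/sql"
	"fmt"
	"time"
)

// keepChangefeedAlive polls the changefeed job every second and, if the job
// has failed (e.g. with a 22.1 "draining" error), creates a replacement
// changefeed and tracks the new job ID.
func keepChangefeedAlive(db *sql.DB, jobID int64, stop <-chan struct{}) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return nil
		case <-ticker.C:
			var status string
			if err := db.QueryRow(
				`SELECT status FROM crdb_internal.jobs WHERE job_id = $1`, jobID,
			).Scan(&status); err != nil {
				return err
			}
			if status == "failed" {
				// CREATE CHANGEFEED returns the new job ID as its result row.
				if err := db.QueryRow(
					`CREATE CHANGEFEED FOR bank.bank WITH resolved = '10s'`,
				).Scan(&jobID); err != nil {
					return fmt.Errorf("recreating changefeed: %w", err)
				}
			}
		}
	}
}
```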
roachtest.cdc/mixed-versions failed with artifacts on release-22.2 @ 100a4aa3f590ac3989b3cc5a172996afaa9de862:
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_cpu=4
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-29749
Epic: CRDB-11732