
release-22.2: roachtest/cdc/mixed-versions: allow for changefeed failures #107293

Conversation

@jayshrivastava (Contributor) commented Jul 20, 2023

This test may flake due to the upgrade from 22.1 to 22.2. The test
asserts that a changefeed remains running by checking that resolved
timestamps are emitted on a regular basis. The problem is that, during
the rolling upgrade, the changefeed may fail with a "draining" error.
This issue is fixed from 22.2 onwards by treating all errors as
retryable.

Rather than skipping this test because 22.1 is EOL'd, it is preferable
to keep running it regularly because it exercises 22.2 functionality.
This change makes the test poll the changefeed every 1s and recreate it
if it fails.

Closes: #106878
Release note: None
Epic: None
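
For illustration, a minimal sketch of the poll-and-recreate approach described above, written against database/sql. The helper names, target table, sink URI, and resolved interval are placeholders, not the actual test code.

package main

import (
	"context"
	"database/sql"
	"time"
)

// createChangefeed starts a changefeed and returns its job ID.
// CREATE CHANGEFEED into a sink returns a single row with the job ID.
func createChangefeed(ctx context.Context, db *sql.DB) (int64, error) {
	var jobID int64
	err := db.QueryRowContext(ctx,
		`CREATE CHANGEFEED FOR TABLE t INTO 'kafka://sink' WITH resolved = '10s'`,
	).Scan(&jobID)
	return jobID, err
}

// watchChangefeed polls the job status every second and recreates the
// changefeed if it has failed, so resolved timestamps keep flowing
// while nodes are drained and restarted during the rolling upgrade.
func watchChangefeed(ctx context.Context, db *sql.DB, jobID int64) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			var status string
			if err := db.QueryRowContext(ctx,
				`SELECT status FROM [SHOW JOBS] WHERE job_id = $1`, jobID,
			).Scan(&status); err != nil {
				return err
			}
			if status == "failed" {
				newID, err := createChangefeed(ctx, db)
				if err != nil {
					return err
				}
				jobID = newID
			}
		}
	}
}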


@jayshrivastava jayshrivastava marked this pull request as ready for review July 20, 2023 18:53
@jayshrivastava jayshrivastava force-pushed the release-22.2-skip-mixed-versions branch from 8289f47 to 646a592 Compare July 20, 2023 18:55
@renatolabs (Contributor) commented

but the changefeed may fail with a "draining" error

In a real deployment, what is an operator expected to do when this happens? Restart the changefeed? If we do that (or whatever the operator is expected to do), would the test pass?

I'm wondering if we can detect this class of error and "recover" automatically. Or, if that's not possible, t.Skip the run.

While 22.1 is EOL'd, 22.2 is still supported and receives the occasional backport, so I'm wondering whether there's a way to avoid disabling the test altogether.
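
As a sketch of the detection idea, the recovery logic could classify the failed job's error text before deciding whether to recover. The substring check below is a crude illustrative heuristic, not how CockroachDB actually classifies these errors.

import "strings"

// isLikelyDrainError reports whether a failed changefeed's error text
// looks like the transient node-drain error discussed above.
func isLikelyDrainError(jobErr string) bool {
	return strings.Contains(jobErr, "draining")
}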

@jayshrivastava jayshrivastava force-pushed the release-22.2-skip-mixed-versions branch from d164eb2 to 3ab305d Compare July 21, 2023 18:35
@jayshrivastava (Contributor, Author) commented

@renatolabs I agree. I added some logic to re-create the changefeed. This should be fine since the test only checks for resolved timestamps and doesn't care about the individual changefeed job.

@jayshrivastava jayshrivastava changed the title roachtest/cdc/mixed-versions: skip test roachtest/cdc/mixed-versions: allow for changefeed failures Jul 24, 2023
@jayshrivastava jayshrivastava force-pushed the release-22.2-skip-mixed-versions branch from 3ab305d to 0ab05d1 Compare July 24, 2023 19:26
@jayshrivastava (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs)


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 359 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Let's replace this with cmvt.monitor.Go() instead. go is discouraged in roachtests as a panic in the goroutine would lead to the entire roachtest process crashing (unless you add a recover to all of them, which is easy to forget).

Done.

I also added this terminateWatcherCh, which is closed by the tester near the end of the test. The watcher normally terminates only once it observes that the context has been cancelled. In the window after the context is cancelled but before the watcher notices, it may hit an error querying the DB; it would then return that error and fail the test. Using the channel avoids this problem.
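
A minimal sketch of that shutdown pattern, assuming the struct and helper below (cdcMixedVersionTester, pollOnce) stand in for the real test code:

package main

import (
	"context"
	"time"
)

type cdcMixedVersionTester struct {
	terminateWatcherCh chan struct{}
}

// pollOnce stands in for the DB query that checks on the changefeed.
func (cmvt *cdcMixedVersionTester) pollOnce(ctx context.Context) error {
	return nil
}

// watchChangefeed polls every second. Checking terminateWatcherCh both
// before and after a failed poll means an error that races with test
// shutdown is ignored instead of failing the test.
func (cmvt *cdcMixedVersionTester) watchChangefeed(ctx context.Context) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-cmvt.terminateWatcherCh:
			return nil // the test is shutting down; exit cleanly
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := cmvt.pollOnce(ctx); err != nil {
				select {
				case <-cmvt.terminateWatcherCh:
					return nil // error raced with shutdown; not a failure
				default:
					return err
				}
			}
		}
	}
}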


pkg/cmd/roachtest/tests/mixed_version_cdc.go line 365 at r2 (raw file):

			cmvt.terminateWatcherCh = make(chan struct{})
		}
		cmvt.monitor.Go(func(ctx2 context.Context) error {

One goroutine per changefeed is not ideal, but this test only creates one, so it's OK. This is only in 22.2, and I doubt we will update this test to create a large number of changefeeds anyway.
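
To illustrate why monitor.Go is preferred over a bare go statement here, the sketch below shows the general shape of a panic-recovering task wrapper; it is an illustration of the pattern, not the actual roachtest monitor implementation.

package main

import (
	"context"
	"fmt"
)

type taskMonitor struct {
	errs chan error
}

// Go runs f on a goroutine, converting a panic into an error on the
// channel instead of letting it crash the whole process, which is
// what an unrecovered panic in a bare goroutine would do.
func (m *taskMonitor) Go(ctx context.Context, f func(context.Context) error) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				m.errs <- fmt.Errorf("task panicked: %v", r)
			}
		}()
		m.errs <- f(ctx)
	}()
}

func main() {
	m := &taskMonitor{errs: make(chan error, 1)}
	m.Go(context.Background(), func(ctx context.Context) error {
		panic("boom") // with a bare `go`, this would kill the process
	})
	fmt.Println(<-m.errs) // task panicked: boom
}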

@renatolabs (Contributor) left a comment


Approving as the changes LGTM. Thanks for working on this!

If the test does flake as I suspect it might, I'll let you decide how you want to deal with it (manually as it happens, in this PR, or as a follow-up).

@jayshrivastava jayshrivastava force-pushed the release-22.2-skip-mixed-versions branch from 0ab05d1 to 2bf93a4 Compare July 25, 2023 15:48
@jayshrivastava jayshrivastava force-pushed the release-22.2-skip-mixed-versions branch from 2bf93a4 to 09de229 Compare July 25, 2023 17:07
@jayshrivastava jayshrivastava merged commit bd2fcf2 into cockroachdb:release-22.2 Jul 25, 2023
@jayshrivastava jayshrivastava deleted the release-22.2-skip-mixed-versions branch July 25, 2023 18:07
@jayshrivastava jayshrivastava changed the title roachtest/cdc/mixed-versions: allow for changefeed failures release-22.2: roachtest/cdc/mixed-versions: allow for changefeed failures Aug 1, 2023
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this pull request Aug 2, 2023
This change adds a new roachtest "cdc/mixed-versions-multiple-upgrades"
which starts a changefeed and ensures that it keeps running through
multiple cluster upgrades and rollbacks. This test is an extension
of a previous test "cdc/mixed-versions" which only tests upgrading
from one previous version to the present version. "cdc/mixed-versions"
is now renamed to "cdc/mixed-versions-single-upgrade".

"cdc/mixed-versions-multiple-upgrades" performs 4 upgrades starting
from 4 versions in the past. As seen in cockroachdb#107293,
changefeeds running on versions prior to 22.2 can be very brittle,
mainly due to them not retrying transient errors. To address this
problem, this new roachtest will ensure that 22.2 is the minimum
binary version. So if this test is running on 23.1, it will only test
the upgrade path from 22.2-23.1. Once 23.2 releases, it will test the
upgrade path from 22.2 -> 23.1 -> 23.2 and so on, keeping a running
window of size 4.

More notes:

Can we remove the restriction of starting at version 22.2?
In cockroachdb#107293, we added code to restart the changefeed if it
fails, to reduce flakes when running 22.1. Doing so partially defeats
the purpose of the test, so it would not make sense to do something
similar in this change. It also does not make sense to patch 22.1
because it is EOL'd.

Why perform the upgrade through multiple versions?
As seen in cockroachdb#107451, there are cases where, for example, a
22.1 changefeed survives an upgrade to 22.2 and a 22.2 changefeed
survives an upgrade to 23.1, but a 22.1 changefeed does NOT survive
the path 22.1 -> 22.2 -> 23.1. This non-transitive relationship is the
reason for adding this new test. We (and our users) expect that
arbitrarily old jobs can be resumed on new versions without error
(unless we explicitly add code to fail these jobs in certain cases).
Because of this expectation, we should have a test which asserts that
a changefeed stays running through multiple version upgrades.

Why limit the test to 4 versions?
We do not want the test to run for an unbounded number of upgrades, as
there will be many more releases in the future.

Closes: cockroachdb#107451
Epic: None
Release note: None
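
For illustration, a minimal sketch of the bounded upgrade-window logic described above. The release list, helper names, and exact window counting are assumptions, not the actual roachtest code.

package main

import "fmt"

// releases is an ordered list of release series; in the real test
// this comes from cockroach version metadata, so the list here is
// illustrative only.
var releases = []string{"22.1", "22.2", "23.1", "23.2"}

// Changefeeds before 22.2 do not retry transient errors, so the
// window never starts earlier than this.
const minVersion = "22.2"

func indexOf(v string) int {
	for i, r := range releases {
		if r == v {
			return i
		}
	}
	return -1
}

// upgradePath returns the versions the test upgrades through, ending
// at current, going back at most windowSize releases and never
// earlier than minVersion.
func upgradePath(current string, windowSize int) []string {
	end := indexOf(current)
	if end < 0 {
		return nil
	}
	start := end - windowSize + 1
	if start < indexOf(minVersion) {
		start = indexOf(minVersion)
	}
	return releases[start : end+1]
}

func main() {
	fmt.Println(upgradePath("23.1", 4)) // [22.2 23.1]
	fmt.Println(upgradePath("23.2", 4)) // [22.2 23.1 23.2]
}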