roachtest: improve testing of mixed-version clusters #15851

Closed
andreimatei opened this issue May 10, 2017 · 18 comments

Labels
A-testing Testing tools and infrastructure
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@andreimatei
Contributor

andreimatei commented May 10, 2017

As became apparent in #15819, we don't have tests for upgrading a cluster from one version to another. I believe we have a "mixed version" acceptance test, but I think all it does is write data using an old version and read it with a new one.

We should have a test checking the interoperability of versions in a running cluster. We should test that as many operations as possible work when the operation involves nodes on different versions. Examples of things to test: DistSQL reads, replication, lease transfers.

In the specific case of #15819, what would have helped is a test with a (lease-related?) Raft command being proposed by the new version and applied by the old one.

Jira issue: CRDB-6080

@tbg
Member

tbg commented May 10, 2017

> In the specific case of #15819, what would have helped is a test with a (lease-related?) Raft command being proposed by the new version and applied by the old one.

Actually, I think a non-lease but still state-updating command would be a better fit. For example, any replica change.

@tbg
Member

tbg commented May 10, 2017

In particular, it seems that this should've fired by having an old node join a new (under-replicated) cluster.

@cuongdo
Contributor

cuongdo commented Aug 22, 2017

@tschottdorf I believe you're already working on some form of this for migrations?

@tbg
Member

tbg commented Aug 22, 2017

Yeah, this seems like it'll naturally fall out of the cluster version migration tests.

@tbg
Member

tbg commented Aug 31, 2017

Moving to 1.2.

@tbg tbg modified the milestones: 1.2, 1.1 Aug 31, 2017
@tbg
Member

tbg commented Nov 16, 2017

> Yeah, this seems like it'll naturally fall out of the cluster version migration tests.

And so it did, though we could do more.

I'll unassign myself and assign @cuongdo to hold on to this issue with the aim that some real manpower is allotted to it in the future. Running comprehensive stability tests requires more effort than I can provide ad-hoc.

@tbg tbg assigned cuongdo and unassigned tbg Nov 16, 2017
@tbg
Member

tbg commented Jan 11, 2018

See also #18811.

@tbg
Member

tbg commented Jan 16, 2018

  • augment the TestVersionUpgrade (acceptance) test so that, while running in mixed-version mode at the beginning, it exercises various migrations (such as ClearRange, AdjustStats, ...) to make sure they don't have unintended side effects
  • add commentary in cockroach_versions.go to stipulate this for future migrations
  • update the test so that it starts the "old" node with 1.1.0 instead of 1.0.x
  • update the test so that it always bumps to the latest available cluster version (available via SELECT crdb_internal.node_executable_version() against a node running the latest code); see the sketch below.
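
For the last item, a minimal sketch of the version bump, assuming a `database/sql` connection to a node running the new binary; the `crdb_internal.node_executable_version()` call and the `SET CLUSTER SETTING version` statement are real, the surrounding helper is hypothetical:

```go
package mixedversion

import (
	"context"
	"database/sql"
	"fmt"
)

// bumpToLatestClusterVersion asks a node running the new binary for the
// highest cluster version it supports, then bumps the cluster setting to it.
func bumpToLatestClusterVersion(ctx context.Context, db *sql.DB) error {
	var latest string
	if err := db.QueryRowContext(ctx,
		`SELECT crdb_internal.node_executable_version()`).Scan(&latest); err != nil {
		return err
	}
	// The version string is interpolated directly to keep the sketch short.
	_, err := db.ExecContext(ctx,
		fmt.Sprintf(`SET CLUSTER SETTING version = '%s'`, latest))
	return err
}
```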

@bdarnell
Contributor

We broke version upgrades with #22487, but TestVersionUpgrade only became flaky instead of failing consistently. We need to figure out how to make this test detect broken upgrades more reliably. (I think the issue in this case was whether the processes live long enough to send multiple liveness heartbeats).

@benesch
Contributor

benesch commented Feb 13, 2018

Yeah, I think the real solution is to take one of these roachtest nightly tests that's serving kv traffic and bump versions every, say, 30m instead of as quickly as possible.
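
A rough sketch of that shape, with hypothetical helpers (`startKVWorkload`, `restartNodeWithBinary`) standing in for whatever roachtest facility would actually drive the workload and restart nodes:

```go
package mixedversion

import (
	"context"
	"time"
)

// Hypothetical helpers; the real roachtest/workload plumbing would go here.
func startKVWorkload(ctx context.Context) error                                { return nil }
func restartNodeWithBinary(ctx context.Context, node int, binary string) error { return nil }

// rollingUpgrade restarts one node at a time on the new binary, pausing
// between restarts so the cluster spends real time (e.g. 30m in a nightly
// roachtest) in each mixed-version state while kv traffic keeps flowing.
func rollingUpgrade(ctx context.Context, nodes int, newBinary string, pause time.Duration) error {
	go func() { _ = startKVWorkload(ctx) }() // foreground kv traffic for the whole run

	for n := 1; n <= nodes; n++ {
		if err := restartNodeWithBinary(ctx, n, newBinary); err != nil {
			return err
		}
		select {
		case <-time.After(pause):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```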

@petermattis
Collaborator

@benesch We'd need to add a facility to roachtest for downloading additional binaries, probably via util/binfetcher. Shouldn't be difficult.
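
A rough sketch of the download step, fetching a published release tarball from binaries.cockroachdb.com; this is only an illustration of what such a facility needs to do, not the util/binfetcher API:

```go
package mixedversion

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// downloadRelease fetches a published CockroachDB release tarball into dest,
// e.g. https://binaries.cockroachdb.com/cockroach-v1.1.5.linux-amd64.tgz.
func downloadRelease(version, dest string) error {
	url := fmt.Sprintf(
		"https://binaries.cockroachdb.com/cockroach-%s.linux-amd64.tgz", version)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("fetching %s: %s", url, resp.Status)
	}
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}
```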

@bdarnell
Contributor

I'd like to get something in CI for every PR, not just a nightly. CI should have blocked the merge of the checksum PR instead of leaving us with a debugging puzzle with a day's worth of possible PRs to blame.

@benesch
Contributor

benesch commented Feb 13, 2018

For the per-PR acceptance tests, a time.Sleep(10*time.Second) in between version bumps would go a long way, but we still need something more serious to catch incompatibilities in e.g. the DistSQL layer.
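
A sketch of what that could look like in the acceptance test, with the list of intermediate versions left as a placeholder:

```go
package mixedversion

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// upgradeInSteps walks the cluster through each intermediate cluster version
// instead of jumping straight to the newest one, pausing after every bump so
// nodes exchange several liveness heartbeats (and other background traffic)
// while in that mixed state.
func upgradeInSteps(ctx context.Context, db *sql.DB, versions []string) error {
	for _, v := range versions {
		if _, err := db.ExecContext(ctx,
			fmt.Sprintf(`SET CLUSTER SETTING version = '%s'`, v)); err != nil {
			return err
		}
		// Without a pause the test races through the mixed-version states
		// too quickly to reliably catch regressions like #22487.
		time.Sleep(10 * time.Second)
	}
	return nil
}
```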

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 14, 2018
…-enable

See cockroachdb#15851.

This change re-enables `TestVersionUpgrade`, which has been broken for at least
the past few weeks. It also verifies that the test would have caught the regression
in cockroachdb#22636.

In doing so, it improves `TestVersionUpgrade` by splitting the version upgrades
into a series of incremental steps.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 15, 2018
…-enable

See cockroachdb#15851.

This change re-enables `TestVersionUpgrade`, which has been broken for at least
the past few weeks. It also verifies that the test would have caught the regression
in cockroachdb#22636.

In doing so, it improves `TestVersionUpgrade` by splitting the version upgrades
into a series of incremental steps.

Release note: None
@tbg
Member

tbg commented Mar 8, 2018

We have rudimentary coverage here. Both roachtest and CI have tests that migrate a cluster through multiple versions, with some foreground activity thrown in. What we need to achieve over time is to exercise more interesting code paths in these tests: decommissioning nodes, running DistSQL queries, SCRUB, the compaction queue, all workload generators, ... and fail the test if anything goes wrong or becomes unexpectedly slow.

This is a bit of a herculean task, so I think it makes sense to split it into SQL and core components. For the core components, we can audit the introduced cluster versions and retrofit test coverage where appropriate, and make sure that any new cluster version comes with an addition to the roachtest and, if possible, CI.
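
As a sketch of the core-component side, the kind of operations to run while the cluster is mixed-version might look like the following; the SQL statements and the `cockroach node decommission` invocation are real, while the table name, node ID, and surrounding harness are placeholders:

```go
package mixedversion

import (
	"context"
	"database/sql"
	"os/exec"
)

// mixedVersionChecks returns operations to run while the cluster is in a
// mixed-version state; a test would fail if any of them errors or becomes
// unexpectedly slow.
func mixedVersionChecks(ctx context.Context, db *sql.DB) []func() error {
	return []func() error{
		// DistSQL read spanning replicas on both versions.
		func() error {
			_, err := db.ExecContext(ctx, `EXPLAIN (DISTSQL) SELECT count(*) FROM kv.kv`)
			return err
		},
		// SCRUB over a table with replicas on both versions.
		func() error {
			_, err := db.ExecContext(ctx, `EXPERIMENTAL SCRUB TABLE kv.kv`)
			return err
		},
		// Decommission a node while the cluster is mixed-version.
		func() error {
			return exec.CommandContext(ctx,
				"./cockroach", "node", "decommission", "4", "--insecure").Run()
		},
	}
}
```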

@petermattis petermattis changed the title stability: improve testing of mixed-version clusters roachtest: improve testing of mixed-version clusters Mar 30, 2018
@petermattis petermattis modified the milestones: 2.0, 2.1 Mar 30, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 15, 2018
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@petermattis petermattis removed this from the 2.1 milestone Oct 5, 2018
@tbg tbg assigned petermattis and unassigned tbg and nvanbenschoten Oct 11, 2018
@tbg tbg added A-testing Testing tools and infrastructure and removed A-kv-client Relating to the KV client and the KV interface. labels Oct 11, 2018
@tbg
Member

tbg commented Oct 11, 2018

@petermattis this is now a meta-issue and more of a request for process/culture/infrastructure. Could you decide what to do with it? At the very least, we should be running a weekly mixed-version roachtest (#31223), but the remaining problem is the lack of workload diversity.

@petermattis
Collaborator

Ack. Let me think on this.

@github-actions

github-actions bot commented Jun 9, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@tbg tbg assigned tbg and unassigned petermattis Jun 9, 2021
@tbg tbg removed their assignment May 18, 2022
@andreimatei
Contributor Author

Testing has improved over the years. Closing.
