roachtest: improve testing of mixed-version clusters #15851

Closed
andreimatei opened this issue May 10, 2017 · 18 comments

Labels
A-testing Testing tools and infrastructure
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@andreimatei
Contributor

andreimatei commented May 10, 2017

As became apparent in #15819, we don't have tests for upgrading a cluster from one version to another. I believe we have a "mixed version" acceptance test, but I think all it does is write data using an old version and read it with a new one.

We should have a test checking the interoperability of versions in a running cluster. We should test that as many operations as possible work when the operation involves nodes on different versions. Examples of things to test: DistSQL reads, replication, lease transfers.

In the specific case of #15819, what would have helped is a test with a (lease-related?) Raft command being proposed by the new version and applied by the old one.

Jira issue: CRDB-6080

@tbg
Member

tbg commented May 10, 2017

> In the specific case of #15819, what would have helped is a test with a (lease-related?) Raft command being proposed by the new version and applied by the old one.

Actually, I think a non-lease but still state-updating command would be a better fit. For example, any replica change.

@tbg
Member

tbg commented May 10, 2017

In particular, it seems that this should've fired by having an old node join a new (under-replicated) cluster.

@cuongdo
Contributor

cuongdo commented Aug 22, 2017

@tschottdorf I believe you're already working on some form of this for migrations?

@tbg
Member

tbg commented Aug 22, 2017

Yeah, this seems like it'll naturally fall out of the cluster version migration tests.

@tbg
Member

tbg commented Aug 31, 2017

Moving to 1.2.

@tbg tbg modified the milestones: 1.2, 1.1 Aug 31, 2017
@tbg
Member

tbg commented Nov 16, 2017

> Yeah, this seems like it'll naturally fall out of the cluster version migration tests.

And so it did, though we could do more.

I'll unassign myself and assign @cuongdo to hold on to this issue with the aim that some real manpower is allotted to it in the future. Running comprehensive stability tests requires more effort than I can provide ad-hoc.

@tbg tbg assigned cuongdo and unassigned tbg Nov 16, 2017
@tbg
Member

tbg commented Jan 11, 2018

See also #18811.

@tbg
Member

tbg commented Jan 16, 2018

  • augment the TestVersionUpgrade (acceptance) test so that, while running in mixed-version mode at the beginning, it exercises various migrations (such as ClearRange, AdjustStats, ...) to make sure they don't have unintended side effects
  • add commentary in cockroach_versions.go to stipulate this for future migrations
  • update the test so that it starts the "old" node with 1.1.0 instead of 1.0.x
  • update the test so that it always bumps to the latest available cluster version (available via SELECT crdb_internal.node_executable_version() against a node running the latest code); see the sketch below.
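
For the last item, a minimal sketch of the version bump, assuming a `database/sql` connection to a node running the new binary; the `crdb_internal.node_executable_version()` call and the `SET CLUSTER SETTING version` statement are real, the surrounding helper is hypothetical:

```go
package mixedversion

import (
	"context"
	"database/sql"
	"fmt"
)

// bumpToLatestClusterVersion asks a node running the new binary for the
// highest cluster version it supports, then bumps the cluster setting to it.
func bumpToLatestClusterVersion(ctx context.Context, db *sql.DB) error {
	var latest string
	if err := db.QueryRowContext(ctx,
		`SELECT crdb_internal.node_executable_version()`).Scan(&latest); err != nil {
		return err
	}
	// The version string is interpolated directly to keep the sketch short.
	_, err := db.ExecContext(ctx,
		fmt.Sprintf(`SET CLUSTER SETTING version = '%s'`, latest))
	return err
}
```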

@bdarnell
Contributor

We broke version upgrades with #22487, but TestVersionUpgrade only became flaky instead of failing consistently. We need to figure out how to make this test detect broken upgrades more reliably. (I think the issue in this case was whether the processes live long enough to send multiple liveness heartbeats).

@benesch
Contributor

benesch commented Feb 13, 2018

Yeah, I think the real solution is to take one of these roachtest nightly tests that's serving kv traffic and bump versions every, say, 30m instead of as quickly as possible.
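
A rough sketch of that shape, with hypothetical helpers (`startKVWorkload`, `restartNodeWithBinary`) standing in for whatever roachtest facility would actually drive the workload and restart nodes:

```go
package mixedversion

import (
	"context"
	"time"
)

// Hypothetical helpers; the real roachtest/workload plumbing would go here.
func startKVWorkload(ctx context.Context) error                                { return nil }
func restartNodeWithBinary(ctx context.Context, node int, binary string) error { return nil }

// rollingUpgrade restarts one node at a time on the new binary, pausing
// between restarts so the cluster spends real time (e.g. 30m in a nightly
// roachtest) in each mixed-version state while kv traffic keeps flowing.
func rollingUpgrade(ctx context.Context, nodes int, newBinary string, pause time.Duration) error {
	go func() { _ = startKVWorkload(ctx) }() // foreground kv traffic for the whole run

	for n := 1; n <= nodes; n++ {
		if err := restartNodeWithBinary(ctx, n, newBinary); err != nil {
			return err
		}
		select {
		case <-time.After(pause):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```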

@petermattis
Collaborator

@benesch We'd need to add a facility to roachtest for downloading additional binaries, probably via util/binfetcher. Shouldn't be difficult.
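
A rough sketch of the download step, fetching a published release tarball from binaries.cockroachdb.com; this is only an illustration of what such a facility needs to do, not the util/binfetcher API:

```go
package mixedversion

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// downloadRelease fetches a published CockroachDB release tarball into dest,
// e.g. https://binaries.cockroachdb.com/cockroach-v1.1.5.linux-amd64.tgz.
func downloadRelease(version, dest string) error {
	url := fmt.Sprintf(
		"https://binaries.cockroachdb.com/cockroach-%s.linux-amd64.tgz", version)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("fetching %s: %s", url, resp.Status)
	}
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}
```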

@bdarnell
Contributor

I'd like to get something in CI for every PR, not just a nightly. CI should have blocked the merge of the checksum PR instead of leaving us with a debugging puzzle with a day's worth of possible PRs to blame.

@benesch
Contributor

benesch commented Feb 13, 2018

For the per-PR acceptance tests, a time.Sleep(10*time.Second) in between version bumps would go a long way, but we still need something more serious to catch incompatibilities in e.g. the DistSQL layer.
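
A sketch of what that could look like in the acceptance test, with the list of intermediate versions left as a placeholder:

```go
package mixedversion

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// upgradeInSteps walks the cluster through each intermediate cluster version
// instead of jumping straight to the newest one, pausing after every bump so
// nodes exchange several liveness heartbeats (and other background traffic)
// while in that mixed state.
func upgradeInSteps(ctx context.Context, db *sql.DB, versions []string) error {
	for _, v := range versions {
		if _, err := db.ExecContext(ctx,
			fmt.Sprintf(`SET CLUSTER SETTING version = '%s'`, v)); err != nil {
			return err
		}
		// Without a pause the test races through the mixed-version states
		// too quickly to reliably catch regressions like #22487.
		time.Sleep(10 * time.Second)
	}
	return nil
}
```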

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 14, 2018
…-enable

See cockroachdb#15851.

This change re-enables `TestVersionUpgrade`, which has been broken for at least
the past few weeks. It also verifies that the test would have caught the regression
in cockroachdb#22636.

In doing so, it improves `TestVersionUpgrade` by splitting the version upgrades
into a series of incremental steps.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 15, 2018
…-enable

See cockroachdb#15851.

This change re-enables `TestVersionUpgrade`, which has been broken for at least
the past few weeks. It also verifies that the test would have caught the regression
in cockroachdb#22636.

In doing so, it improves `TestVersionUpgrade` by splitting the version upgrades
into a series of incremental steps.

Release note: None
@tbg
Member

tbg commented Mar 8, 2018

We have rudimentary coverage here. Both roachtest and CI have tests that migrate a cluster through multiple versions, with some foreground activity thrown in. What we need to achieve over time is to exercise more interesting code paths in these tests: decommissioning nodes, running DistSQL queries, SCRUB, the compaction queue, all workload generators, ... and fail the test if anything goes wrong or becomes unexpectedly slow.

This is a bit of a herculean task, so I think it makes sense to split it into SQL and core components. For the core components, we can audit the introduced cluster versions and retrofit test coverage where appropriate, and make sure that any new cluster version comes with an addition to the roachtest and, if possible, CI.
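
As a sketch of the core-component side, the kind of operations to run while the cluster is mixed-version might look like the following; the SQL statements and the `cockroach node decommission` invocation are real, while the table name, node ID, and surrounding harness are placeholders:

```go
package mixedversion

import (
	"context"
	"database/sql"
	"os/exec"
)

// mixedVersionChecks returns operations to run while the cluster is in a
// mixed-version state; a test would fail if any of them errors or becomes
// unexpectedly slow.
func mixedVersionChecks(ctx context.Context, db *sql.DB) []func() error {
	return []func() error{
		// DistSQL read spanning replicas on both versions.
		func() error {
			_, err := db.ExecContext(ctx, `EXPLAIN (DISTSQL) SELECT count(*) FROM kv.kv`)
			return err
		},
		// SCRUB over a table with replicas on both versions.
		func() error {
			_, err := db.ExecContext(ctx, `EXPERIMENTAL SCRUB TABLE kv.kv`)
			return err
		},
		// Decommission a node while the cluster is mixed-version.
		func() error {
			return exec.CommandContext(ctx,
				"./cockroach", "node", "decommission", "4", "--insecure").Run()
		},
	}
}
```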

@petermattis petermattis changed the title stability: improve testing of mixed-version clusters roachtest: improve testing of mixed-version clusters Mar 30, 2018
@petermattis petermattis modified the milestones: 2.0, 2.1 Mar 30, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 15, 2018
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@petermattis petermattis removed this from the 2.1 milestone Oct 5, 2018
@tbg tbg assigned petermattis and unassigned tbg and nvanbenschoten Oct 11, 2018
@tbg tbg added A-testing Testing tools and infrastructure and removed A-kv-client Relating to the KV client and the KV interface. labels Oct 11, 2018
@tbg
Member

tbg commented Oct 11, 2018

@petermattis this is now a meta-issue and more of a request for process/culture/infrastructure. Could you decide what to do with it? At the very least, we should be running a weekly mixed-version roachtest (#31223), but the remaining problem is the lack of workload diversity.

@petermattis
Collaborator

Ack. Let me think on this.

@github-actions

github-actions bot commented Jun 9, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@tbg tbg assigned tbg and unassigned petermattis Jun 9, 2021
@tbg tbg removed their assignment May 18, 2022
@andreimatei
Contributor Author

Testing has improved over the years. Closing.
