roachtest: improve testing of mixed-version clusters #15851
Comments
Actually I think a non-lease-but-still-state-updating one. For example, any replica change.
In particular, it seems that this should've fired by having an old node join a new (under-replicated) cluster.
@tschottdorf I believe you're already working on some form of this for migrations?
Yeah, this seems like it'll naturally fall out of the cluster version migration tests.
Moving to 1.2.
And so it did, though we could do more. I'll unassign myself and assign @cuongdo to hold on to this issue with the aim that some real manpower is allotted to it in the future. Running comprehensive stability tests requires more effort than I can provide ad hoc.
See also #18811.
We broke version upgrades with #22487, but TestVersionUpgrade only became flaky instead of failing consistently. We need to figure out how to make this test detect broken upgrades more reliably. (I think the issue in this case was whether the processes live long enough to send multiple liveness heartbeats.)
Yeah, I think the real solution is to take one of these
@benesh We'd need to add a facility to
I'd like to get something in CI for every PR, not just a nightly. CI should have blocked the merge of the checksum PR instead of leaving us with a debugging puzzle with a day's worth of possible PRs to blame.
For the per-PR acceptance tests, a
…-enable See cockroachdb#15851. This change re-enables `TestVersionUpgrade`, which has been broken for at least the past few weeks. It also verifies that the test would have caught the regression in cockroachdb#22636. In doing so, it improves `TestVersionUpgrade` by splitting the version upgrades into a series of incremental steps. Release note: None
We have rudimentary coverage here. Both roachtest and CI have tests that migrate a cluster through multiple versions, with some foreground activity thrown in. What we need to achieve over time is to exercise more interesting code paths in these tests. That is, we want to be decommissioning nodes, running DistSQL queries, SCRUB, the compaction queue, and all of the workload generators, and fail the test if anything goes wrong or becomes unexpectedly slow. This is a bit of a herculean task, so I think it makes sense to split it into SQL and core components. For the core components, we can audit the introduced cluster versions and retrofit test coverage where appropriate, and make sure that any new cluster version comes with an addition to the roachtest and, if possible, to CI.
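To make the "incremental steps" idea above concrete, here is a minimal, hedged sketch in Go of the overall shape such a harness could take: walk the cluster through a sequence of versions one node at a time, and exercise foreground activity while the cluster sits in a mixed-version state. The helpers (`upgradeNode`, `runWorkloadStep`), the version list, and the workload names are placeholders for illustration only, not the real roachtest/roachprod API.

```go
// Hypothetical sketch of an incremental mixed-version upgrade loop.
// None of the helper names below are the actual roachtest API.
package main

import (
	"fmt"
	"log"
)

// upgradeNode stands in for restarting node i with the binary for the
// given version (in a real harness, via roachprod or os/exec).
func upgradeNode(node int, version string) error {
	log.Printf("restarting n%d with v%s", node, version)
	return nil
}

// runWorkloadStep stands in for foreground activity we want while the
// cluster is mixed-version: DistSQL reads, lease transfers, rebalancing, etc.
func runWorkloadStep(name string) error {
	log.Printf("running workload step: %s", name)
	return nil
}

func main() {
	versions := []string{"1.0.6", "1.1.5", "2.0.0"} // illustrative upgrade path
	nodes := []int{1, 2, 3}

	for _, v := range versions {
		// Upgrade one node at a time so the cluster spends real time in a
		// mixed-version state, and exercise it before moving on.
		for _, n := range nodes {
			if err := upgradeNode(n, v); err != nil {
				log.Fatal(err)
			}
			for _, step := range []string{"distsql-read", "lease-transfer", "replica-rebalance"} {
				if err := runWorkloadStep(fmt.Sprintf("%s after n%d -> v%s", step, n, v)); err != nil {
					log.Fatal(err)
				}
			}
		}
	}
}
```

The point of the structure is that failures surface while a specific pair of versions coexists, rather than only after the whole cluster has finished upgrading.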
@petermattis this is now a meta-issue and more of a request for process/culture/infrastructure. Could you decide what to do with it? At the very least, we should be running a weekly mixed-version roachtest (#31223), but the problem is the missing diversity of workloads.
Ack. Let me think on this.
We have marked this issue as stale because it has been inactive for
Testing has improved over the years. Closing.
As became apparent in #15819, we don't have tests for the upgrade of a cluster from one version to another. I believe we have a "mixed version" acceptance test, but I think all it does is write data using an old version and read it with a new one.
We should have a test checking the interoperability of versions in a running cluster. We should test that as many operations as possible work when the operation involves nodes on different versions. Examples of things to test: DistSQL reads, replication, lease transfers.
In the specific case of #15819, what would have helped is a test in which a (lease-related?) Raft command is proposed by the new version and applied by the old one.
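For illustration, here is a small, hedged sketch (not the existing acceptance test) of the kind of cross-version check described above: write through one node of a mixed-version cluster and read the value back through every node, so replication and reads cross version boundaries. The node addresses and connection details are assumptions for a local, insecure three-node cluster.

```go
// Hedged sketch: exercise writes and reads across nodes of a
// mixed-version cluster. Connection URLs are assumptions.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

var nodeURLs = []string{
	"postgresql://root@localhost:26257/defaultdb?sslmode=disable",
	"postgresql://root@localhost:26258/defaultdb?sslmode=disable",
	"postgresql://root@localhost:26259/defaultdb?sslmode=disable",
}

func main() {
	// Write through node 1.
	db0, err := sql.Open("postgres", nodeURLs[0])
	if err != nil {
		log.Fatal(err)
	}
	defer db0.Close()
	if _, err := db0.Exec(`CREATE TABLE IF NOT EXISTS kv (k INT PRIMARY KEY, v STRING)`); err != nil {
		log.Fatal(err)
	}
	if _, err := db0.Exec(`UPSERT INTO kv VALUES (1, 'written-by-node-1')`); err != nil {
		log.Fatal(err)
	}

	// Read back through every node; in a mixed-version cluster this forces
	// replication and reads to cross version boundaries.
	for i, url := range nodeURLs {
		db, err := sql.Open("postgres", url)
		if err != nil {
			log.Fatal(err)
		}
		var v string
		if err := db.QueryRow(`SELECT v FROM kv WHERE k = 1`).Scan(&v); err != nil {
			log.Fatalf("read via node %d failed: %v", i+1, err)
		}
		fmt.Printf("node %d read: %s\n", i+1, v)
		db.Close()
	}
}
```

A real test would additionally drive lease transfers, replica changes, and DistSQL traffic so that Raft commands proposed by one version are applied by the other, which is the failure mode #15819 exposed.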
Jira issue: CRDB-6080