Add support for bwc for testclusters and convert full cluster restart #45374
Conversation
Pinging @elastic/es-core-infra
A few comments. Also, we need to address input snapshotting here for cacheability. Since a single node can use multiple distributions, we need to include all of those distributions when snapshotting inputs. Right now `ElasticsearchNode#getDistributionClasspath` only takes into account the first distribution version. We need to tweak this so it includes all distribution jar files. Same goes for `getDistributionFiles()`.
This is going to be complicated by the fact that we'll have duplicate files in the same input property. We might have to do something like use the runtime API (i.e. `inputs.files()`) here so we can be more dynamic. That would allow us to register separate input properties per-distribution. Perhaps register these inputs when we `freeze()` the node?
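A rough sketch of what registering a separate input property per distribution via Gradle's runtime API could look like. The `cluster.nodes`, `node.distributions`, and `distro.extracted` names are hypothetical stand-ins for whatever the plugin exposes; `inputs.files`, `withPropertyName`, and `withPathSensitivity` are standard Gradle API:

```groovy
// Hypothetical sketch: register each distribution's files as its own
// input property when the node is frozen, so the build-cache key
// covers every version the node will run, not just the first one.
tasks.named("integTest").configure { task ->
    cluster.nodes.each { node ->
        node.distributions.eachWithIndex { distro, i ->
            task.inputs.files(distro.extracted)
                .withPropertyName("distribution-${node.name}-${i}")
                .withPathSensitivity(PathSensitivity.RELATIVE)
        }
    }
}
```

Using `PathSensitivity.RELATIVE` also sidesteps part of the duplicate-file concern, since files are keyed by their path relative to each distribution directory.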
```diff
@@ -38,6 +38,8 @@
     void setVersion(String version);
+
+    void setVersion(List<String> version);
```
Should we call this `setVersions()`?
```diff
-    public ElasticsearchCluster(String path, String clusterName, Project project, ReaperService reaper,
-                                Function<Integer, ElasticsearchDistribution> distributionFactory, File workingDirBase) {
+    public ElasticsearchCluster(String path, String clusterName, Project project,
+                                ReaperService reaper, File workingDirBase) {
```
I think we should try and keep the distribution factory pattern here rather than have the `ElasticsearchNode` directly call into the `DistributionContainer`.
Could you expand on the advantages you see here? To me it seemed like indirection that makes things harder to understand. We have preferred explicit coupling elsewhere.
We should avoid this kind of coupling when at all possible. That's the reason for this factory pattern here initially. I think it's actually easier to understand with that factory logic living in the plugin. It removes some noise from the `ElasticsearchNode` class, which is already getting quite complex and should realistically be broken up at some point.
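A minimal, simplified sketch of the factory pattern being discussed: the plugin owns the distribution lookup and hands the node only a `Function`, so the node never touches the container directly. The `Distribution`/`Node` types here are illustrative stand-ins, not the real build-tools classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class Distribution {
    final String version;
    Distribution(String version) { this.version = version; }
}

class Node {
    private final Function<String, Distribution> distributionFactory;
    private final List<Distribution> distributions = new ArrayList<>();

    Node(Function<String, Distribution> distributionFactory) {
        this.distributionFactory = distributionFactory;
    }

    // The node requests each version it needs without knowing
    // where distributions come from.
    void setVersions(List<String> versions) {
        for (String v : versions) {
            distributions.add(distributionFactory.apply(v));
        }
    }

    List<Distribution> getDistributions() { return distributions; }
}

public class FactoryDemo {
    public static void main(String[] args) {
        // In the real plugin the factory would resolve from the
        // DistributionContainer; here it just constructs directly.
        Node node = new Node(Distribution::new);
        node.setVersions(List.of("6.8.0", "7.4.0"));
        node.getDistributions().forEach(d -> System.out.println(d.version));
    }
}
```

The coupling to the container stays in the plugin; swapping how distributions are resolved only requires passing a different function.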
```java
);
Path jvmOptions = configFile.getParent().resolve("jvm.options");
```
I'm confused as to why this is necessary. Why are we cherry-picking files here?
We could recursively copy the contents of the directory, but these are the relevant files.
I guess I don't understand why we need to explicitly copy these files from the distribution dir to the working directory?
We are setting up a new config dir, outside of the distro dir.
These only get copied if they don't exist, so for bwc tests they will use the old versions even after upgrade.
I'm replacing this with walking the config dir so we don't hard-code files.
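A self-contained sketch of the approach described above: walk the distribution's config dir and copy each file into the node's working config dir only if it doesn't already exist, so bwc tests keep the old node's config across an in-place upgrade. The directory layout here is made up for the demo; the actual paths in the plugin differ:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ConfigSync {
    // Copy every regular file under distroConfig into workingConfig,
    // skipping files that already exist in workingConfig.
    static void syncConfig(Path distroConfig, Path workingConfig) throws IOException {
        List<Path> sources;
        try (Stream<Path> files = Files.walk(distroConfig)) {
            sources = files.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path source : sources) {
            Path target = workingConfig.resolve(distroConfig.relativize(source).toString());
            Files.createDirectories(target.getParent());
            if (Files.notExists(target)) {
                Files.copy(source, target);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path distro = Files.createTempDirectory("distro-config");
        Path working = Files.createTempDirectory("working-config");
        Files.writeString(distro.resolve("jvm.options"), "-Xmx1g");
        // Pre-existing file in the working dir must NOT be overwritten.
        Files.writeString(working.resolve("jvm.options"), "-Xmx512m");
        Files.writeString(distro.resolve("log4j2.properties"), "rootLogger.level = info");

        syncConfig(distro, working);

        System.out.println(Files.readString(working.resolve("jvm.options")));
        System.out.println(Files.exists(working.resolve("log4j2.properties")));
    }
}
```

Nothing is hard-coded: any file added to the distribution's config dir is picked up automatically, while files the old node already wrote are preserved.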
```diff
@@ -86,6 +88,8 @@
     void restart();
+
+    void goToNextVersion();
```
Maybe `upgradeToNextVersion()`?
It could technically be a downgrade too.
Got it.
```diff
     }

+    @Override
+    public void setVersion(List<String> versions) {
```
We really don't have a need to go through multiple versions. The upgrade tests start with one version, and go to another. This seems far more complicated to me than we had before, where each step of the upgrade tests was a different cluster. IIRC the tricky part there was hooking up finalizers to shut down certain nodes between phases of the tests, but given that testclusters has far better control over when nodes start and stop, why couldn't we stick with that pattern?
I'm completely open to suggestions here; that's why I stopped at converting one project. Here's my reasoning for going down this path.
The reason we can't just call `setVersion` from a `doFirst` is that all versions need to be known at configuration time.
I wanted the framework to support the upgrade itself, so there is less repetition in the projects that implement bwc testing.
The alternative would be to have multiple cluster definitions and set the data paths (which we would have to allow; we don't currently).
For rolling restart we would have to come up with a way to tell testclusters not to start the cluster for the task, as the task would want fine-grained control and start the upgraded cluster node-by-node; I really dislike having to add that.
I would also prefer we don't give node-level control to build scripts, as it would make implementing the tests more imperative, and descriptive tests are easier to understand.
Another use-case that was brought up and plays into this is testing with evolving configurations in a mixed cluster, e.g. starting with some nodes that have ML disabled, running some tests, enabling it, and running some more.
In addition, we already have some tests that alter the configuration of the cluster (e.g. enabling security) that currently just call the same configuration DSL and restart the cluster, but I'm not happy with them.
The initial thinking with the `freeze()` call was that all configuration would happen at Gradle's configuration time; thus the versions and all the config should be set at configuration time and frozen during execution time.
`setVersion(List)` is probably not the best way to express this, but I really think we should keep the paradigm of configuring both the initial and all future states of the cluster, and then have a DSL that moves between the states.
I see `setVersion(List)` as a shortcut to get there, start running bwc with `--parallel`, and get rid of all the timeout-related test failures in CI before we refactor to separate configuration from the implementation of the actions. Then we could have something like named states to move to, so we wouldn't need `setVersions`; instead it would look more like configuring multiple clusters with configuration inheritance, except it wouldn't be multiple clusters but different states of the same cluster.
E.g.:
```groovy
testClusters {
    "bwcTestv${bwcVersion}" {
        version = bwcVersion
        alternativeConfigurations {
            upgraded {
                version = project.version
            }
        }
    }
}

upgradeTest.doFirst {
    testClusters."bwcTestv${bwcVersion}".alternativeConfiguration("upgraded")
}
```
We can then also add a way to optionally configure a subset of nodes differently within this DSL. Something like:
```groovy
testClusters {
    numberOfNodes = 3
    setting "foo", "false"
    nodes(1/3) {
        setting "foo", "true"
    }
}
```
We discussed that we will merge this PR as is and discuss DSL in a follow up.
@mark-vieira I don't fully understand your comment with regards to duplicate files. Is it about file names? How do those hurt?
@atorok Yeah, now I think the duplicate files stuff will be ok because their relative paths will be different, so they aren't actually the "same file", since the distribution directories differ per-version. FYI, testing that cacheability still works is difficult here because your branch has diverged from
@elasticmachine run elasticsearch-ci/1
Looks like the failure from the PR check reproduces.
It's not immediately obvious how this is influenced by these changes.
* Add support for bwc for testclusters and convert full cluster restart (#45374)
* Testclusters fix bwc (#46740) — additions to make testclusters work with later versions of ES
* Do common node config on bwc tests. Before this PR we only ever ran `ElasticsearchCluster.start` once, and the common node config was never done. This becomes apparent when upgrading from `6.x` to `7.x`, as the new config is missing, preventing the cluster from starting.
* Fix logic to pick up snapshot from 6.x
* Make sure ports are cleared
* Fix test
* Don't clear all the config as we rely on it
* Fix removal of keys
This PR adds the DSL that allows implementing bwc tests, and converts a sample project.
The way it works is that all versions are specified up-front in a list, and one can call `goToNextVersion` or `nextNodeToNextVersion` to upgrade the whole cluster or individual nodes within it. The upgrades happen in-place. For this reason we now sync the distro into a dedicated sub-folder.