
Add support for bwc for testclusters and convert full cluster restart #45374

Merged (12 commits, Aug 16, 2019)

Conversation

@alpar-t (Contributor) commented Aug 9, 2019:

This PR adds the DSL that allows implementing bwc tests, and converts one project (the full cluster restart tests) as a sample.

The way it works is that all versions are specified up-front in a list, and one can call goToNextVersion or nextNodeToNextVersion to upgrade the whole cluster or individual nodes within it.
The upgrades happen in place. For this reason we now sync the distro into a dedicated sub-folder.
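
For illustration, here is a rough sketch of how the resulting build script usage might look. This is a hypothetical example: the task names, the cluster name, and the exact spelling of the versions property are assumptions, not code from this PR (see the setVersions() naming discussion below).

// Hypothetical usage sketch, not the merged code.
testClusters {
    integTest {
        // All versions the cluster moves through must be declared
        // up-front, at Gradle's configuration time.
        versions = ['6.8.2', project.version.toString()]
        numberOfNodes = 2
    }
}

task oldClusterTest(type: Test) {
    useCluster testClusters.integTest
    systemProperty 'tests.phase', 'old'
}

task upgradedClusterTest(type: Test) {
    dependsOn oldClusterTest
    useCluster testClusters.integTest
    doFirst {
        // Full cluster restart: every node is upgraded in place.
        testClusters.integTest.goToNextVersion()
    }
    systemProperty 'tests.phase', 'upgraded'
}

For rolling-upgrade style tests, nextNodeToNextVersion would replace the single goToNextVersion call, upgrading one node at a time.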

@elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-infra

@mark-vieira (Contributor) left a comment:

A few comments. Also, we need to address input snapshotting here for cacheability. Since a single node can use multiple distributions, we need to include all of those distributions when snapshotting inputs. Right now ElasticsearchNode#getDistributionClasspath only takes into account the first distribution version. We need to tweak this so it includes all distribution jar files. Same goes for getDistributionFiles().

This is going to be complicated by the fact that we'll have duplicate files in the same input property. We might have to do something like use the runtime API (i.e. inputs.files()) here so we can be more dynamic. That would allow us to register separate input properties per-distribution. Perhaps register these inputs when we freeze() the node?
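
As a rough illustration of the runtime-API idea (a sketch only: node.distributions, distribution.extracted, and the task name are assumed names for this example, not the PR's actual API):

// Hypothetical sketch: register one input property per distribution a
// node will move through, using Gradle's runtime API.
tasks.named('integTest').configure { task ->
    testClusters.integTest.nodes.each { node ->
        node.distributions.each { distribution ->
            task.inputs.files(distribution.extracted)
                .withPropertyName("distribution-${node.name}-${distribution.version}")
                .withPathSensitivity(PathSensitivity.RELATIVE)
        }
    }
}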

@@ -38,6 +38,8 @@
     void setVersion(String version);
+
+    void setVersion(List<String> version);
Contributor:

Should we call this setVersions()?

-    public ElasticsearchCluster(String path, String clusterName, Project project, ReaperService reaper,
-                                Function<Integer, ElasticsearchDistribution> distributionFactory, File workingDirBase) {
+    public ElasticsearchCluster(String path, String clusterName, Project project,
+                                ReaperService reaper, File workingDirBase) {
Contributor:

I think we should try and keep the distribution factory pattern here rather than have the ElasticsearchNode directly call into the DistributionContainer.

Contributor Author:

Could you expand on the advantages you see here? To me it seemed like indirection that makes things harder to understand. We have preferred explicit coupling elsewhere.

Contributor:

We should avoid this kind of coupling when at all possible. That's the reason for this factory pattern here initially. I think it's actually easier to understand with that factory logic living in the plugin. It removes some noise from the ElasticsearchNode class which is already getting quite complex and should realistically be broken up at some point.
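
For context, the factory shape in question is the one visible in the removed constructor above. Roughly, on the plugin side (a hypothetical Groovy sketch, with distributionsContainer as an assumed name, not the actual plugin code):

import java.util.function.Function

// The plugin owns distribution creation and hands the cluster a factory,
// so ElasticsearchNode never reaches into the distribution container.
Function<Integer, ElasticsearchDistribution> distributionFactory = { Integer nodeIndex ->
    distributionsContainer.create("${clusterName}-node-${nodeIndex}")
}
def cluster = new ElasticsearchCluster(
    path, clusterName, project, reaper, distributionFactory, workingDirBase
)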

);
Path jvmOptions = configFile.getParent().resolve("jvm.options");
Contributor:

I'm confused as to why this is necessary. Why are we cherry-picking files here?

Contributor Author:

We could recursively copy the contents of the directory, but these are the relevant files.

Contributor:

I guess I don't understand why we need to explicitly copy these files from the distribution dir to the working directory.

Contributor Author:

We are setting up a new config dir, outside of the distro dir.
These files only get copied if they don't already exist, so bwc tests will keep using the old versions even after an upgrade.

Contributor Author:

I'm replacing this with walking the config dir, so we don't hard-code file names.
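
The resulting behavior is roughly the following (a plain-Groovy sketch under the assumptions above; the real logic lives in ElasticsearchNode and may differ):

import java.nio.file.Files
import java.nio.file.Path

// Walk the distro's config dir and copy each file into the node's own
// config dir only if it isn't already there, so a node upgraded in
// place keeps the config files it started with. Names are illustrative.
static void syncConfigIfMissing(Path distroConfigDir, Path nodeConfigDir) {
    Files.createDirectories(nodeConfigDir)
    def stream = Files.walk(distroConfigDir)
    try {
        stream.filter { Files.isRegularFile(it) }.forEach { source ->
            Path target = nodeConfigDir.resolve(distroConfigDir.relativize(source))
            Files.createDirectories(target.parent)
            if (Files.exists(target) == false) {
                Files.copy(source, target)
            }
        }
    } finally {
        stream.close()
    }
}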

@@ -86,6 +88,8 @@
     void restart();
+
+    void goToNextVersion();
Contributor:

Maybe upgradeToNextVersion()?

Contributor Author:

It could technically be a downgrade too.

Contributor:

Got it.

}

@Override
public void setVersion(List<String> versions) {
Member:

We really don't have a need to go through multiple versions. The upgrade tests start with one version, and go to another. This seems far more complicated to me than we had before, where each step of the upgrade tests was a different cluster. IIRC the tricky part there was hooking up finalizers to shut down certain nodes between phases of the tests, but given that testclusters has far better control over when nodes start and stop, why couldn't we stick with that pattern?
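
For reference, the one-cluster-per-phase alternative being described would look roughly like this (a hypothetical sketch; the shared dataDir property is assumed and, as the reply below notes, was not something testclusters allowed at the time):

testClusters {
    oldCluster {
        version = bwcVersion
        dataDir = file("${buildDir}/bwc-shared-data")
    }
    upgradedCluster {
        version = project.version
        dataDir = file("${buildDir}/bwc-shared-data")
    }
}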

@alpar-t (Contributor Author) commented Aug 12, 2019:

I'm completely open to suggestions here; that's why I stopped after converting one project. Here's my reasoning for going down this path.

The reason we can't just call setVersion from a doFirst is that all versions need to be known at configuration time.

I wanted the framework itself to support the upgrade, so there is less repetition in the projects that implement bwc testing.
The alternative would be to have multiple cluster definitions and set the data paths (which we would have to allow; we don't currently).
For rolling restarts we would have to come up with a way to tell testclusters not to start the cluster for the task, as the task would want fine-grained control and would start the upgraded cluster node by node. I really dislike having to add that.

I would also prefer we don't give node-level control to build scripts, as it would make implementing the tests more imperative, and declarative tests are easier to understand.

Another use-case that was brought up, and which plays into this, is testing with evolving configurations in a mixed cluster: e.g. starting with some nodes that have ML disabled, running some tests, enabling ML, and running some more.
In addition, we already have some tests that alter the configuration of the cluster (e.g. enabling security) by calling the same configuration DSL again and restarting the cluster, but I'm not happy with them.

The initial thinking behind the freeze() call was that all configuration happens at Gradle's configuration time, so the versions and all other config should be set then and frozen during execution.

setVersion(List) is probably not the best way to express this, but I really think we should keep the paradigm of configuring both the initial and all future states of the cluster up-front, and then having a DSL that moves between those states.
I see setVersion(List) as a shortcut to get there, start running bwc tests with --parallel, and get rid of all the timeout-related test failures in CI. We can then refactor to separate the configuration from the implementation of the actions, so we could have something like named states to move between; we wouldn't need setVersions, and it would look more like configuring multiple clusters with configuration inheritance, except they wouldn't be multiple clusters but different states of the same cluster.
e.g.:

testClusters {
    "bwcTestv${bwcVersion}" {
        version = bwcVersion
        alternativeConfigurations {
            upgraded {
                version = project.version
            }
        }
    }
}

upgradeTest.doFirst {
    testClusters."bwcTestv${bwcVersion}".alternativeConfiguration("upgraded")
}

We can then also add a way to optionally configure a subset of nodes differently within this DSL. Something like:

testClusters {
    numberOfNodes = 3
    setting "foo", "false"
    nodes(1/3) {
        setting "foo", "true"
    }
}

Contributor Author:

We discussed that we will merge this PR as-is and discuss the DSL in a follow-up.

@alpar-t (Contributor Author) commented Aug 14, 2019:

@mark-vieira I don't fully understand your comment with regard to duplicate files. Is it about file names? How do those hurt?
I took a stab at implementing it.

@mark-vieira (Contributor) commented:

@atorok Yeah, I now think the duplicate file stuff will be OK, because their relative paths will be different: they aren't actually the "same file", since the distribution directories differ per version.

FYI, testing that cacheability still works is difficult here because your branch has diverged from master which contains some fixes in that area.

@alpar-t (Contributor Author) commented Aug 15, 2019:

@elasticmachine run elasticsearch-ci/1

1 similar comment

@alpar-t (Contributor Author) commented Aug 15, 2019:

Looks like the failure from the PR check reproduces reliably:

0, xpack.installed=true} with JoinRequest{sourceNode={external_0}{LTe-kf__T_-NHQkdAAAAAA}{uTOeARHbTh-FWsWgz9_iNw}{127.0.0.1}{127.0.0.1:12800}, optionalJoin=Optional[Join{term=1, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={external_0}{LTe-kf__T_-NHQkdAAAAAA}{uTOeARHbTh-FWsWgz9_iNw}{127.0.0.1}{127.0.0.1:12800}, targetNode={integTest-0}{_2jVJVXGRWmRlzgjO1vePw}{9q6f_V5USS6rVjT3esuvZw}{127.0.0.1}{127.0.0.1:33949}{dilm}{testattr=test, ml.machine_memory=33642762240, ml.max_open_jobs=20, xpack.installed=true}}]}
  1> org.elasticsearch.transport.RemoteTransportException: [integTest-0][127.0.0.1:33949][internal:cluster/coordination/join]
  1> Caused by: java.lang.IllegalStateException: failure when sending a validation request to node
  1>    at org.elasticsearch.cluster.coordination.Coordinator$2.onFailure(Coordinator.java:489) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1110) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1110) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:244) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1>    at java.lang.Thread.run(Thread.java:834) [?:?]
  1> Caused by: org.elasticsearch.transport.RemoteTransportException: [external_0][127.0.0.1:12800][internal:cluster/coordination/join/validate]
  1> Caused by: java.lang.IllegalArgumentException: Unknown NamedWriteable [org.elasticsearch.cluster.metadata.MetaData$Custom][index_lifecycle]
  1>    at org.elasticsearch.common.io.stream.NamedWriteableRegistry.getReader(NamedWriteableRegistry.java:112) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:45) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:39) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.cluster.metadata.MetaData.readFrom(MetaData.java:890) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.cluster.ClusterState.readFrom(ClusterState.java:719) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.cluster.coordination.ValidateJoinRequest.<init>(ValidateJoinRequest.java:33) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:56) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:177) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:120) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:104) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:661) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.consumeNetworkReads(TcpTransport.java:685) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:279) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:228) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.BytesChannelContext.read(BytesChannelContext.java:40) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.EventHandler.handleRead(EventHandler.java:139) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:151) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:420) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:174) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:131) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at java.lang.Thread.run(Thread.java:834) ~[?:?]

It's not immediately obvious how this failure is influenced by these changes.

alpar-t added a commit to alpar-t/elasticsearch that referenced this pull request Oct 1, 2019
alpar-t added a commit that referenced this pull request Oct 2, 2019
* Add support for bwc for testclusters and convert full cluster restart (#45374)

* Testclusters fix bwc (#46740)

Additions to make testclusters work with later versions of ES

* Do common node config on bwc tests

Before this PR we only ever ran `ElasticsearchCluster.start` once, and
the common node config was never done again.
This becomes apparent when upgrading from `6.x` to `7.x`, as the new config
is missing, preventing the cluster from starting.


* Fix logic to pick up snapshot from 6.x

* Make sure ports are cleared

* Fix test

* Don't clear all the config as we rely on it

* Fix removal of keys
@alpar-t deleted the bwc-testclusters branch on November 11, 2019 09:41
@mark-vieira added the Team:Delivery (Meta label for Delivery team) label on Nov 11, 2020
Labels
:Delivery/Build (Build or test infrastructure), >non-issue, Team:Delivery (Meta label for Delivery team), v7.4.1, v8.0.0-alpha1

6 participants