
Add support for bwc for testclusters and convert full cluster restart #45374

Merged (12 commits, Aug 16, 2019)

Conversation

@alpar-t (Contributor) commented Aug 9, 2019:

This PR adds the DSL that allows implementing bwc tests, and converts one project (the full cluster restart tests) as a sample.

The way it works is that all versions are specified up-front in a list, and one can call goToNextVersion or nextNodeToNextVersion to upgrade the whole cluster or individual nodes within it.
The upgrades happen in place. For this reason we now sync the distro into a dedicated sub-folder.
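
For illustration, here is a rough sketch of how the resulting build script usage might look. This is a hypothetical example: the task names, the cluster name, and the exact spelling of the versions property are assumptions, not code from this PR (see the setVersions() naming discussion below).

// Hypothetical usage sketch, not the merged code.
testClusters {
    integTest {
        // All versions the cluster moves through must be declared
        // up-front, at Gradle's configuration time.
        versions = ['6.8.2', project.version.toString()]
        numberOfNodes = 2
    }
}

task oldClusterTest(type: Test) {
    useCluster testClusters.integTest
    systemProperty 'tests.phase', 'old'
}

task upgradedClusterTest(type: Test) {
    dependsOn oldClusterTest
    useCluster testClusters.integTest
    doFirst {
        // Full cluster restart: every node is upgraded in place.
        testClusters.integTest.goToNextVersion()
    }
    systemProperty 'tests.phase', 'upgraded'
}

For rolling-upgrade style tests, nextNodeToNextVersion would replace the single goToNextVersion call, upgrading one node at a time.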

@elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-infra

@mark-vieira (Contributor) left a comment:

A few comments. Also, we need to address input snapshotting here for cacheability. Since a single node can use multiple distributions, we need to include all of those distributions when snapshotting inputs. Right now ElasticsearchNode#getDistributionClasspath only takes into account the first distribution version. We need to tweak this so it includes all distribution jar files. Same goes for getDistributionFiles().

This is going to be complicated by the fact that we'll have duplicate files in the same input property. We might have to do something like use the runtime API (i.e. inputs.files()) here so we can be more dynamic. That would allow us to register separate input properties per-distribution. Perhaps register these inputs when we freeze() the node?
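
As a rough illustration of the runtime-API idea (a sketch only: node.distributions, distribution.extracted, and the task name are assumed names for this example, not the PR's actual API):

// Hypothetical sketch: register one input property per distribution a
// node will move through, using Gradle's runtime API.
tasks.named('integTest').configure { task ->
    testClusters.integTest.nodes.each { node ->
        node.distributions.each { distribution ->
            task.inputs.files(distribution.extracted)
                .withPropertyName("distribution-${node.name}-${distribution.version}")
                .withPathSensitivity(PathSensitivity.RELATIVE)
        }
    }
}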

@@ -38,6 +38,8 @@
     void setVersion(String version);
+
+    void setVersion(List<String> version);
Contributor:

Should we call this setVersions()?

-    public ElasticsearchCluster(String path, String clusterName, Project project, ReaperService reaper,
-                                Function<Integer, ElasticsearchDistribution> distributionFactory, File workingDirBase) {
+    public ElasticsearchCluster(String path, String clusterName, Project project,
+                                ReaperService reaper, File workingDirBase) {
Contributor:

I think we should try and keep the distribution factory pattern here rather than have the ElasticsearchNode directly call into the DistributionContainer.

Contributor Author:

Could you expand on the advantages you see here? To me it seemed like indirection that makes things harder to understand. We have preferred explicit coupling elsewhere.

Contributor:

We should avoid this kind of coupling when at all possible. That's the reason for this factory pattern here initially. I think it's actually easier to understand with that factory logic living in the plugin. It removes some noise from the ElasticsearchNode class which is already getting quite complex and should realistically be broken up at some point.
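
For context, the factory shape in question is the one visible in the removed constructor above. Roughly, on the plugin side (a hypothetical Groovy sketch, with distributionsContainer as an assumed name, not the actual plugin code):

import java.util.function.Function

// The plugin owns distribution creation and hands the cluster a factory,
// so ElasticsearchNode never reaches into the distribution container.
Function<Integer, ElasticsearchDistribution> distributionFactory = { Integer nodeIndex ->
    distributionsContainer.create("${clusterName}-node-${nodeIndex}")
}
def cluster = new ElasticsearchCluster(
    path, clusterName, project, reaper, distributionFactory, workingDirBase
)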

);
Path jvmOptions = configFile.getParent().resolve("jvm.options");
Contributor:

I'm confused as to why this is necessary. Why are we cherry-picking files here?

Contributor Author:

We could recursively copy the contents of the directory, but these are the relevant files.

Contributor:

I guess I don't understand why we need to explicitly copy these files from the distribution dir to the working directory.

Contributor Author:

We are setting up a new config dir, outside of the distro dir.
These files only get copied if they don't already exist, so bwc tests will keep using the old versions even after an upgrade.

Contributor Author:

I'm replacing this with walking the config dir, so we don't hard-code file names.
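
The resulting behavior is roughly the following (a plain-Groovy sketch under the assumptions above; the real logic lives in ElasticsearchNode and may differ):

import java.nio.file.Files
import java.nio.file.Path

// Walk the distro's config dir and copy each file into the node's own
// config dir only if it isn't already there, so a node upgraded in
// place keeps the config files it started with. Names are illustrative.
static void syncConfigIfMissing(Path distroConfigDir, Path nodeConfigDir) {
    Files.createDirectories(nodeConfigDir)
    def stream = Files.walk(distroConfigDir)
    try {
        stream.filter { Files.isRegularFile(it) }.forEach { source ->
            Path target = nodeConfigDir.resolve(distroConfigDir.relativize(source))
            Files.createDirectories(target.parent)
            if (Files.exists(target) == false) {
                Files.copy(source, target)
            }
        }
    } finally {
        stream.close()
    }
}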

@@ -86,6 +88,8 @@
     void restart();
+
+    void goToNextVersion();
Contributor:

Maybe upgradeToNextVersion()?

Contributor Author:

It could technically be a downgrade too.

Contributor:

Got it.

}

@Override
public void setVersion(List<String> versions) {
Member:

We really don't have a need to go through multiple versions. The upgrade tests start with one version, and go to another. This seems far more complicated to me than we had before, where each step of the upgrade tests was a different cluster. IIRC the tricky part there was hooking up finalizers to shut down certain nodes between phases of the tests, but given that testclusters has far better control over when nodes start and stop, why couldn't we stick with that pattern?
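
For reference, the one-cluster-per-phase alternative being described would look roughly like this (a hypothetical sketch; the shared dataDir property is assumed and, as the reply below notes, was not something testclusters allowed at the time):

testClusters {
    oldCluster {
        version = bwcVersion
        dataDir = file("${buildDir}/bwc-shared-data")
    }
    upgradedCluster {
        version = project.version
        dataDir = file("${buildDir}/bwc-shared-data")
    }
}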

@alpar-t (Contributor Author) commented Aug 12, 2019:

I'm completely open to suggestions here; that's why I stopped after converting one project. Here's my reasoning for going down this path.

The reason we can't just call setVersion from a doFirst is that all versions need to be known at configuration time.

I wanted the framework itself to support the upgrade, so there is less repetition in the projects that implement bwc testing.
The alternative would be to have multiple cluster definitions and set the data paths (which we would have to allow; we don't currently).
For rolling restarts we would have to come up with a way to tell testclusters not to start the cluster for the task, as the task would want fine-grained control and would start the upgraded cluster node by node. I really dislike having to add that.

I would also prefer we don't give node-level control to build scripts, as it would make implementing the tests more imperative, and declarative tests are easier to understand.

Another use-case that was brought up, and which plays into this, is testing with evolving configurations in a mixed cluster: e.g. starting with some nodes that have ML disabled, running some tests, enabling ML, and running some more.
In addition, we already have some tests that alter the configuration of the cluster (e.g. enabling security) by calling the same configuration DSL again and restarting the cluster, but I'm not happy with them.

The initial thinking behind the freeze() call was that all configuration happens at Gradle's configuration time, so the versions and all other config should be set then and frozen during execution.

setVersion(List) is probably not the best way to express this, but I really think we should keep the paradigm of configuring both the initial and all future states of the cluster up-front, and then having a DSL that moves between those states.
I see setVersion(List) as a shortcut to get there, start running bwc tests with --parallel, and get rid of all the timeout-related test failures in CI. We can then refactor to separate the configuration from the implementation of the actions, so we could have something like named states to move between; we wouldn't need setVersions, and it would look more like configuring multiple clusters with configuration inheritance, except they wouldn't be multiple clusters but different states of the same cluster.
e.g.:

testClusters {
    "bwcTestv${bwcVersion}" {
        version = bwcVersion
        alternativeConfigurations {
            upgraded {
                version = project.version
            }
        }
    }
}

upgradeTest.doFirst {
    testClusters."bwcTestv${bwcVersion}".alternativeConfiguration("upgraded")
}

We can then also add a way to optionally configure a subset of nodes differently within this DSL. Something like:

testClusters {
    numberOfNodes = 3
    setting "foo", "false"
    nodes(1/3) {
        setting "foo", "true"
    }
}

Contributor Author:

We discussed that we will merge this PR as-is and discuss the DSL in a follow-up.

@alpar-t (Contributor Author) commented Aug 14, 2019:

@mark-vieira I don't fully understand your comment with regard to duplicate files. Is it about file names? How do those hurt?
I took a stab at implementing it.

@mark-vieira (Contributor) commented:

@atorok Yeah, I now think the duplicate file stuff will be OK, because their relative paths will be different: they aren't actually the "same file", since the distribution directories differ per version.

FYI, testing that cacheability still works is difficult here because your branch has diverged from master which contains some fixes in that area.

@alpar-t (Contributor Author) commented Aug 15, 2019:

@elasticmachine run elasticsearch-ci/1

1 similar comment

@alpar-t (Contributor Author) commented Aug 15, 2019:

Looks like the failure from the PR check reproduces reliably:

0, xpack.installed=true} with JoinRequest{sourceNode={external_0}{LTe-kf__T_-NHQkdAAAAAA}{uTOeARHbTh-FWsWgz9_iNw}{127.0.0.1}{127.0.0.1:12800}, optionalJoin=Optional[Join{term=1, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={external_0}{LTe-kf__T_-NHQkdAAAAAA}{uTOeARHbTh-FWsWgz9_iNw}{127.0.0.1}{127.0.0.1:12800}, targetNode={integTest-0}{_2jVJVXGRWmRlzgjO1vePw}{9q6f_V5USS6rVjT3esuvZw}{127.0.0.1}{127.0.0.1:33949}{dilm}{testattr=test, ml.machine_memory=33642762240, ml.max_open_jobs=20, xpack.installed=true}}]}
  1> org.elasticsearch.transport.RemoteTransportException: [integTest-0][127.0.0.1:33949][internal:cluster/coordination/join]
  1> Caused by: java.lang.IllegalStateException: failure when sending a validation request to node
  1>    at org.elasticsearch.cluster.coordination.Coordinator$2.onFailure(Coordinator.java:489) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1110) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1110) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:244) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1>    at java.lang.Thread.run(Thread.java:834) [?:?]
  1> Caused by: org.elasticsearch.transport.RemoteTransportException: [external_0][127.0.0.1:12800][internal:cluster/coordination/join/validate]
  1> Caused by: java.lang.IllegalArgumentException: Unknown NamedWriteable [org.elasticsearch.cluster.metadata.MetaData$Custom][index_lifecycle]
  1>    at org.elasticsearch.common.io.stream.NamedWriteableRegistry.getReader(NamedWriteableRegistry.java:112) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:45) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:39) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.cluster.metadata.MetaData.readFrom(MetaData.java:890) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.cluster.ClusterState.readFrom(ClusterState.java:719) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.cluster.coordination.ValidateJoinRequest.<init>(ValidateJoinRequest.java:33) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:56) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:177) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:120) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:104) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:661) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.consumeNetworkReads(TcpTransport.java:685) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:279) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:228) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.BytesChannelContext.read(BytesChannelContext.java:40) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.EventHandler.handleRead(EventHandler.java:139) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:151) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:420) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:246) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:174) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:131) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at java.lang.Thread.run(Thread.java:834) ~[?:?]

It's not immediately obvious how this failure is influenced by these changes.

alpar-t added a commit to alpar-t/elasticsearch that referenced this pull request Oct 1, 2019
alpar-t added a commit that referenced this pull request Oct 2, 2019
* Add support for bwc for testclusters and convert full cluster restart (#45374)

* Testclusters fix bwc (#46740)

Additions to make testclusters work with later versions of ES

* Do common node config on bwc tests

Before this PR we only ever ran `ElasticsearchCluster.start` once, and
the common node config was never done again.
This becomes apparent when upgrading from `6.x` to `7.x`, as the new config
is missing, preventing the cluster from starting.


* Fix logic to pick up snapshot from 6.x

* Make sure ports are cleared

* Fix test

* Don't clear all the config as we rely on it

* Fix removal of keys
@alpar-t deleted the bwc-testclusters branch on November 11, 2019 09:41
@mark-vieira added the Team:Delivery (Meta label for Delivery team) label on Nov 11, 2020
Labels
:Delivery/Build (Build or test infrastructure), >non-issue, Team:Delivery (Meta label for Delivery team), v7.4.1, v8.0.0-alpha1

6 participants