
[CI] SmokeTestMultiNodeClientYamlTestSuiteIT classMethod failing #77025

Closed
benwtrent opened this issue Aug 30, 2021 · 17 comments · Fixed by #79946
Assignees: @masseyke
Labels
:Data Management/Indices APIs APIs to create and manage indices and templates :Delivery/Build Build or test infrastructure Team:Data Management Meta label for data/management team Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@benwtrent
Member

This is a weird timeout against cat/aliases; nothing in the logs indicates why it timed out.

This is the amd64 + corretto11 testing pair, which probably means nothing, but I'm opening a tracking issue so we can see if this testing matrix becomes problematic.

Build scan:
https://gradle-enterprise.elastic.co/s/pblxdzxp54tzy/tests/:qa:smoke-test-multinode:integTest/org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT/classMethod

Reproduction line:
null

Applicable branches:
7.15

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT&tests.test=classMethod

Failure excerpt:

java.lang.Exception: Suite timeout exceeded (>= 2400000 msec).

  at __randomizedtesting.SeedInfo.seed([327D9E0656CFF6AF]:0)

@benwtrent benwtrent added :Delivery/Build Build or test infrastructure :Data Management/Indices APIs APIs to create and manage indices and templates >test-failure Triaged test failures from CI labels Aug 30, 2021
@elasticmachine elasticmachine added Team:Delivery Meta label for Delivery team Team:Data Management Meta label for data/management team labels Aug 30, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@martijnvg
Member

This suite has failed 29 times in the last 7 days with these timeouts. It seems to fail on a different test each time, and I can't pin it down to a specific OS or Java version. The failure seems to occur only on the 7.x branch.

Not sure what to do here. Has the number of YAML tests slowly grown over time, so that the default suite timeout is now insufficient? Or has the time it takes to run the tests increased over time?

@mark-vieira
Contributor

Not sure what to do here. Has the number of YAML tests slowly grown over time, so that the default suite timeout is now insufficient? Or has the time it takes to run the tests increased over time?

I believe the latter. Here's the average execution time for this test suite over the past month on Windows:

[graph omitted: average execution time of this suite on Windows over the past month]

Comparing the number of tests before and after the big increase, it's 1503 tests vs. 1511, so it's not the total test count driving the increase. There's a similar, though less extreme, increase on Linux as well, so I'm apt to say it's unlikely to be environmental and more likely something changed with our tests here. Very often when we've seen this in the past it's been a regression where we added some costly operation to every single test case. It might be worth bisecting to see if we can find a problem commit.

@mark-vieira
Contributor

FWIW, the :rest-api:spec project runs essentially the same suite of tests, albeit against a single node, and we've not seen a similar increase there. It seems to be somewhat specific to these multi-node tests.

[graph omitted: :rest-api:spec suite execution time over the same period]

@mark-vieira
Contributor

I take that back; looking at CoreWithSecurityClientYamlTestSuiteIT, I see a bump around the same timeframe as well:

[graph omitted: CoreWithSecurityClientYamlTestSuiteIT execution time]

It's possible that our YAML test suites have regressed overall in some way, and it's just the multi-node scenario that suffers particularly badly.

@martijnvg
Member

@mark-vieira thanks for sharing these graphs. Yes, I think it makes sense to check which changes went in, roughly between Sep 21 and Sep 25, that could have contributed to the increased execution time of this multi-node qa module.

@mark-vieira
Contributor

@martijnvg ownership is always unclear with these types of test suites, since we're just running the entire set of core tests. Can you ensure someone follows up on this?

@martijnvg
Member

I will take a look at the commits in the mentioned time range to see whether any of them contributed to the increased runtime of this qa module.

@masseyke
Member

masseyke commented Oct 25, 2021

It's odd that in the stack dump of the timed-out test there are 400 threads like this:

  2>    at java.base@…/sun.nio.ch.EPoll.wait(Native Method)
  2>    at java.base@…/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
  2>    at java.base@…/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
  2>    - locked sun.nio.ch.Util$2@286add67
  2>    - locked sun.nio.ch.EPollSelectorImpl@f55a422
  2>    at java.base@…/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
  2>    at app//org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:343)
  2>    at app//org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
  2>    at app//org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
  2>    at java.base@…/java.lang.Thread.run(Thread.java:834)

I'm not sure what org.apache.http.impl.nio is used for, or why there would be so many threads. When I run that locally and do kill -3 several times I don't see that behavior.

@mark-vieira
Contributor

I'm not sure what org.apache.http.impl.nio is used for, or why there would be so many threads. When I run that locally and do kill -3 several times I don't see that behavior.

This is simply the REST client the tests use to communicate with the Elasticsearch cluster. I'm not sure exactly how its thread pools are tuned, but the CI workers are 32-CPU machines, so I suspect it creates a lot of threads based on that.
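The thread-count reasoning above can be sketched in plain Java. This is an illustrative guess, not the verified client configuration: Apache's async HTTP client is commonly sized from the available processor count by default, so per-client selector threads scale with CPUs, and hundreds of threads across many live test clients on a 32-CPU worker would be plausible. The client count below is hypothetical.

```java
// Sketch of why a 32-CPU CI worker could plausibly show ~400 selector threads.
// Assumption (not verified against the client version in use): the async HTTP
// client defaults its IO reactor thread count to
// Runtime.getRuntime().availableProcessors().
public class IoThreadSketch {

    // Hypothetical default sizing: one IO reactor thread per CPU.
    static int ioThreadsPerClient(int cpus) {
        return cpus;
    }

    public static void main(String[] args) {
        int ciWorkerCpus = 32; // the CI workers mentioned above
        int liveClients = 13;  // hypothetical number of REST clients alive at once
        int total = liveClients * ioThreadsPerClient(ciWorkerCpus);
        System.out.println("Selector threads: " + total); // prints "Selector threads: 416"
    }
}
```

If this sizing assumption holds, the thread count in the dump says more about the worker's CPU count and how many clients stay open than about any test regression.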

@masseyke masseyke self-assigned this Oct 26, 2021
@bpintea
Contributor

bpintea commented Oct 27, 2021

Seems to be a 7.16 recurrence: https://gradle-enterprise.elastic.co/s/ysgdchiv6764q

@masseyke
Member

I was able to reproduce this locally, and I've (surprisingly) tracked it down to this commit: #76791. @dakrone suspected that it is because ESRestTestCase deletes all ILM policies one by one at https://github.com/elastic/elasticsearch/blob/master/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java#L781. Each deletion of an ILM policy triggers a cluster state update, which has to be pushed out to the other node in the multi-node cluster. This lines up with what I had noticed in the profiler: a lot of time spent updating cluster state, which actually takes up about 30% of the runtime. At Lee's recommendation, I've added the new ILM policies to preserveILMPolicyIds (https://github.com/elastic/elasticsearch/blob/master/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java#L567) and the test seems to have sped up again. I'll have a PR up for that in a few minutes.
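As a rough illustration, the cleanup pattern being fixed looks like the standalone sketch below. This is not the actual ESRestTestCase code (the real logic lives in org.elasticsearch.test.rest.ESRestTestCase, and the policy names and helper here are hypothetical): every ILM policy not on a preserve list gets its own DELETE call, and each delete implies a cluster state update propagated to every node, so growing the preserve list directly removes per-suite cluster state churn.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Standalone sketch of the cleanup logic described above; names illustrative.
public class IlmCleanupSketch {

    // Mirrors the idea of preserveILMPolicyIds(): built-in policies the test
    // framework should leave alone instead of deleting after every suite.
    static Set<String> preservedPolicyIds() {
        return Set.of("ilm-history-ilm-policy", "slm-history-ilm-policy");
    }

    // Every policy returned here would cost one DELETE request and one cluster
    // state update pushed to every node in the multi-node cluster.
    static List<String> policiesToDelete(List<String> allPolicies) {
        Set<String> preserved = preservedPolicyIds();
        return allPolicies.stream()
            .filter(policy -> !preserved.contains(policy))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> all =
            List.of("ilm-history-ilm-policy", "my-test-policy", "slm-history-ilm-policy");
        System.out.println(policiesToDelete(all)); // prints "[my-test-policy]"
    }
}
```

Under this model, the fix in #79946 amounts to moving the new built-in policies from #76791 into the preserved set, so their per-policy deletes (and the accompanying cluster state updates) disappear from each suite run.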

masseyke added a commit that referenced this issue Oct 28, 2021
Preventing unnecessary ILM policy deletions that drastically slow down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946)

In #76791 several new default ILM policies were added. ESRestTestCase deletes all ILM policies that it does not know about one at a time. Each of these deletions causes a cluster state change that needs to be propagated to all nodes. In a large test on a multi-node cluster (like SmokeTestMultiNodeClientYamlTestSuiteIT) this eats up a significant amount of time, about 30% of the runtime of the test. This was causing SmokeTestMultiNodeClientYamlTestSuiteIT to fail with timeouts. This commit adds the new standard ILM policies to the list of known policies not to delete.
Closes #77025
Relates #76791
masseyke added a commit to masseyke/elasticsearch that referenced this issue Oct 28, 2021 (backport of elastic#79946; same commit message)
masseyke added a commit to masseyke/elasticsearch that referenced this issue Oct 28, 2021 (backport of elastic#79946; same commit message)
elasticsearchmachine pushed a commit that referenced this issue Oct 28, 2021: Preventing unnecessary ILM policy deletions that drastically slow down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) (#80052); same commit message
elasticsearchmachine pushed a commit that referenced this issue Oct 29, 2021: Preventing unnecessary ILM policy deletions that drastically slow down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) (#80053); same commit message, plus:

* fixing backported code for 7.16

* allowing type removal warnings
@mark-vieira
Contributor

Thanks @masseyke! It looks like that definitely helped. Here are the average execution times on Windows for the 7.16 branch:

[graph omitted: 7.16 branch average execution times on Windows]

It looks like we might even be slightly faster than before the regression as well. Here's CoreWithSecurityClientYamlTestSuiteIT on master over the last three months:

[graph omitted: CoreWithSecurityClientYamlTestSuiteIT on master, last three months]

@masseyke
Member

masseyke commented Nov 3, 2021

Great! I was actually hoping it would be a little faster than before, because I found a couple of additional ILM policies that had been added before September 23 and didn't need to be deleted, and added them to the exclude list as well.

@mark-vieira
Contributor

Yep, the mixed-cluster tests saw a similar improvement:

[graph omitted: mixed-cluster test execution times]
