
[CI] SmokeTestMultiNodeClientYamlTestSuiteIT classMethod failing #77025

Closed
benwtrent opened this issue Aug 30, 2021 · 17 comments · Fixed by #79946
Assignees: @masseyke
Labels
:Data Management/Indices APIs APIs to create and manage indices and templates :Delivery/Build Build or test infrastructure Team:Data Management Meta label for data/management team Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@benwtrent
Member

This is a weird timeout against cat/aliases; nothing in the logs indicates why it timed out.

This is the amd64 + corretto11 testing pair, which probably means nothing, but I'm opening a tracking issue so we can see if this testing matrix becomes problematic.

Build scan:
https://gradle-enterprise.elastic.co/s/pblxdzxp54tzy/tests/:qa:smoke-test-multinode:integTest/org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT/classMethod

Reproduction line:
null

Applicable branches:
7.15

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT&tests.test=classMethod

Failure excerpt:

java.lang.Exception: Suite timeout exceeded (>= 2400000 msec).

  at __randomizedtesting.SeedInfo.seed([327D9E0656CFF6AF]:0)

@benwtrent benwtrent added :Delivery/Build Build or test infrastructure :Data Management/Indices APIs APIs to create and manage indices and templates >test-failure Triaged test failures from CI labels Aug 30, 2021
@elasticmachine elasticmachine added Team:Delivery Meta label for Delivery team Team:Data Management Meta label for data/management team labels Aug 30, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@martijnvg
Member

This suite has failed 29 times in the last 7 days with these timeouts. It seems to fail on a different test each time, and I can't pin it down to a specific OS or Java version. The failure seems to occur only on the 7.x branch.

Not sure what to do here. Has the number of YAML tests slowly grown over time, so that the default suite timeout is now insufficient? Or has the time it takes to run the tests increased over time?

@mark-vieira
Contributor

Not sure what to do here. Has the number of YAML tests slowly grown over time, so that the default suite timeout is now insufficient? Or has the time it takes to run the tests increased over time?

I believe the latter. Here's the average execution time for this test suite over the past month on Windows:

[graph omitted: average execution time of this suite on Windows over the past month]

Comparing the number of tests before and after the big increase, it's 1503 tests vs. 1511, so it's not the total test count driving the increase. There's a similar, though less extreme, increase on Linux as well, so I'm apt to say it's unlikely to be environmental and more likely something changed with our tests here. Very often when we've seen this in the past it's been a regression where we added some costly operation to every single test case. It might be worth bisecting to see if we can find a problem commit.

@mark-vieira
Contributor

FWIW, the :rest-api:spec project runs essentially the same suite of tests, albeit against a single node, and we've not seen a similar increase there. It seems to be somewhat specific to these multi-node tests.

[graph omitted: :rest-api:spec suite execution time over the same period]

@mark-vieira
Contributor

I take that back; looking at CoreWithSecurityClientYamlTestSuiteIT, I see a bump around the same timeframe as well:

[graph omitted: CoreWithSecurityClientYamlTestSuiteIT execution time]

It's possible that our YAML test suites have regressed overall in some way, and it's just the multi-node scenario that suffers particularly badly.

@martijnvg
Member

@mark-vieira thanks for sharing these graphs. Yes, I think it makes sense to check which changes went in, roughly between Sep 21 and Sep 25, that could have contributed to the increased execution time of this multi-node qa module.

@mark-vieira
Contributor

@martijnvg ownership is always unclear with these types of test suites, since we're just running the entire set of core tests. Can you ensure someone follows up on this?

@martijnvg
Member

I will take a look at the commits in the mentioned time range to see whether any of them contributed to the increased runtime of this qa module.

@masseyke
Member

masseyke commented Oct 25, 2021

It's odd that in the stack dump of the timed-out test there are 400 threads like this:

  2>    at java.base@…/sun.nio.ch.EPoll.wait(Native Method)
  2>    at java.base@…/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
  2>    at java.base@…/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
  2>    - locked sun.nio.ch.Util$2@286add67
  2>    - locked sun.nio.ch.EPollSelectorImpl@f55a422
  2>    at java.base@…/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
  2>    at app//org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:343)
  2>    at app//org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
  2>    at app//org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
  2>    at java.base@…/java.lang.Thread.run(Thread.java:834)

I'm not sure what org.apache.http.impl.nio is used for, or why there would be so many threads. When I run that locally and do kill -3 several times I don't see that behavior.

@mark-vieira
Contributor

I'm not sure what org.apache.http.impl.nio is used for, or why there would be so many threads. When I run that locally and do kill -3 several times I don't see that behavior.

This is simply the REST client the tests use to communicate with the Elasticsearch cluster. I'm not sure exactly how its thread pools are tuned, but the CI workers are 32-CPU machines, so I suspect it creates a lot of threads based on that.
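The thread-count reasoning above can be sketched in plain Java. This is an illustrative guess, not the verified client configuration: Apache's async HTTP client is commonly sized from the available processor count by default, so per-client selector threads scale with CPUs, and hundreds of threads across many live test clients on a 32-CPU worker would be plausible. The client count below is hypothetical.

```java
// Sketch of why a 32-CPU CI worker could plausibly show ~400 selector threads.
// Assumption (not verified against the client version in use): the async HTTP
// client defaults its IO reactor thread count to
// Runtime.getRuntime().availableProcessors().
public class IoThreadSketch {

    // Hypothetical default sizing: one IO reactor thread per CPU.
    static int ioThreadsPerClient(int cpus) {
        return cpus;
    }

    public static void main(String[] args) {
        int ciWorkerCpus = 32; // the CI workers mentioned above
        int liveClients = 13;  // hypothetical number of REST clients alive at once
        int total = liveClients * ioThreadsPerClient(ciWorkerCpus);
        System.out.println("Selector threads: " + total); // prints "Selector threads: 416"
    }
}
```

If this sizing assumption holds, the thread count in the dump says more about the worker's CPU count and how many clients stay open than about any test regression.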

@masseyke masseyke self-assigned this Oct 26, 2021
@bpintea
Contributor

bpintea commented Oct 27, 2021

Seems to be a 7.16 recurrence: https://gradle-enterprise.elastic.co/s/ysgdchiv6764q

@masseyke
Member

I was able to reproduce this locally, and I've (surprisingly) tracked it down to this commit: #76791. @dakrone suspected that it is because ESRestTestCase deletes all ILM policies one by one at https://github.com/elastic/elasticsearch/blob/master/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java#L781. Each deletion of an ILM policy triggers a cluster state update, which has to be pushed out to the other node in the multi-node cluster. This lines up with what I had noticed in the profiler: a lot of time spent updating cluster state, which actually takes up about 30% of the runtime. At Lee's recommendation, I've added the new ILM policies to preserveILMPolicyIds (https://github.com/elastic/elasticsearch/blob/master/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java#L567) and the test seems to have sped up again. I'll have a PR up for that in a few minutes.
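As a rough illustration, the cleanup pattern being fixed looks like the standalone sketch below. This is not the actual ESRestTestCase code (the real logic lives in org.elasticsearch.test.rest.ESRestTestCase, and the policy names and helper here are hypothetical): every ILM policy not on a preserve list gets its own DELETE call, and each delete implies a cluster state update propagated to every node, so growing the preserve list directly removes per-suite cluster state churn.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Standalone sketch of the cleanup logic described above; names illustrative.
public class IlmCleanupSketch {

    // Mirrors the idea of preserveILMPolicyIds(): built-in policies the test
    // framework should leave alone instead of deleting after every suite.
    static Set<String> preservedPolicyIds() {
        return Set.of("ilm-history-ilm-policy", "slm-history-ilm-policy");
    }

    // Every policy returned here would cost one DELETE request and one cluster
    // state update pushed to every node in the multi-node cluster.
    static List<String> policiesToDelete(List<String> allPolicies) {
        Set<String> preserved = preservedPolicyIds();
        return allPolicies.stream()
            .filter(policy -> !preserved.contains(policy))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> all =
            List.of("ilm-history-ilm-policy", "my-test-policy", "slm-history-ilm-policy");
        System.out.println(policiesToDelete(all)); // prints "[my-test-policy]"
    }
}
```

Under this model, the fix in #79946 amounts to moving the new built-in policies from #76791 into the preserved set, so their per-policy deletes (and the accompanying cluster state updates) disappear from each suite run.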

masseyke added a commit that referenced this issue Oct 28, 2021
Preventing unnecessary ILM policy deletions that drastically slow down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946)

In #76791 several new default ILM policies were added. ESRestTestCase deletes all ILM policies that it does not know about one at a time. Each of these deletions causes a cluster state change that needs to be propagated to all nodes. In a large test on a multi-node cluster (like SmokeTestMultiNodeClientYamlTestSuiteIT) this eats up a significant amount of time, about 30% of the runtime of the test. This was causing SmokeTestMultiNodeClientYamlTestSuiteIT to fail with timeouts. This commit adds the new standard ILM policies to the list of known policies not to delete.
Closes #77025
Relates #76791
masseyke added a commit to masseyke/elasticsearch that referenced this issue Oct 28, 2021 (backport of elastic#79946; same commit message)
masseyke added a commit to masseyke/elasticsearch that referenced this issue Oct 28, 2021 (backport of elastic#79946; same commit message)
elasticsearchmachine pushed a commit that referenced this issue Oct 28, 2021: Preventing unnecessary ILM policy deletions that drastically slow down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) (#80052); same commit message
elasticsearchmachine pushed a commit that referenced this issue Oct 29, 2021: Preventing unnecessary ILM policy deletions that drastically slow down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) (#80053); same commit message, plus:

* fixing backported code for 7.16

* allowing type removal warnings
@mark-vieira
Contributor

Thanks @masseyke! It looks like that definitely helped. Here are the average execution times on Windows for the 7.16 branch:

[graph omitted: 7.16 branch average execution times on Windows]

It looks like we might even be slightly faster than before the regression as well. Here's CoreWithSecurityClientYamlTestSuiteIT on master over the last three months:

[graph omitted: CoreWithSecurityClientYamlTestSuiteIT on master, last three months]

@masseyke
Member

masseyke commented Nov 3, 2021

Great! I was actually hoping it would be a little faster than before, because I found a couple of additional ILM policies that had been added before September 23 and didn't need to be deleted, and added them to the exclude list as well.

@mark-vieira
Contributor

Yep, the mixed-cluster tests saw a similar improvement:

[graph omitted: mixed-cluster test execution times]
