[CI] SmokeTestMultiNodeClientYamlTestSuiteIT classMethod failing #77025
Pinging @elastic/es-delivery (Team:Delivery) |
Pinging @elastic/es-data-management (Team:Data Management) |
This suite has failed 29 times in the last 7 days with these timeouts. It seems to fail on a different test each time, and I can't pinpoint a specific OS or Java version. The failure seems to occur only on the 7.x branch. Not sure what to do here. Has the number of YAML tests slowly grown over time, so that the default suite timeout is insufficient? Or has the time it takes to run the tests increased over time? |
I believe the latter. Here's the average execution time for this test suite over the past month on Windows: Comparing the number of tests before and after the big increase, it's 1503 tests vs 1511, so it's not the total number of tests that accounts for the increase. There's a similar but less extreme increase on Linux as well, so I'm apt to say that it's unlikely to be something environmental and likely that something changed with our tests. Very often when we've seen this in the past, it's been some regression where we add some costly operation to every single test case. It might be worth bisecting to see if we can find a problem commit. |
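For readers unfamiliar with bisecting by runtime, here is a minimal sketch of the idea. It assumes the commits are ordered oldest to newest, that the first one is known to be fast and the last one known to be slow, and that the suite stays slow once the regression lands. The commit names and the `is_slow` check are hypothetical stand-ins for actually timing the suite at each commit.

```python
from typing import Callable, Sequence


def first_slow_commit(commits: Sequence[str], is_slow: Callable[[str], bool]) -> str:
    """Binary-search for the first commit at which the suite becomes slow.

    Assumes commits[0] is fast, commits[-1] is slow, and slowness is
    monotonic (once slow, every later commit is slow too).
    """
    lo, hi = 0, len(commits) - 1  # lo: known fast, hi: known slow
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_slow(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Hypothetical history: eight commits, the suite is slow from "c5" onwards.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
culprit = first_slow_commit(history, lambda c: int(c[1:]) >= 5)
```

In practice `is_slow` would check out the commit, run the suite, and compare the wall-clock time against a threshold, so each probe is expensive; the binary search keeps the number of probes logarithmic in the number of commits.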
@mark-vieira thanks for sharing these graphs. Yes, I think it makes sense to check which changes got in, roughly between Sep 21 and Sep 25, that could contribute to the increased execution time of this multi-node QA module. |
@martijnvg ownership is always unclear with these types of test suites since we're just running the entire set of core tests. Can you ensure someone follows up on this? |
I will take a look at the commits in the mentioned time range to see if any of them contributed to the increased time to run this QA module. |
It's odd that the stack dump of the timed-out test contains 400 threads like this:
|
This is simply the REST client the tests use to communicate with the Elasticsearch cluster. I'm not sure how the thread pools are tuned, but the CI workers are 32-CPU machines, so I suspect it creates a lot of threads based on that. |
Seems to be a 7.16 recurrence: https://gradle-enterprise.elastic.co/s/ysgdchiv6764q |
I was able to reproduce this locally. I've (surprisingly) tracked this down to this commit -- #76791. @dakrone suspected that it is because ESRestTestCase deletes all ILM policies one-by-one at https://github.com/elastic/elasticsearch/blob/master/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java#L781. Each delete of each ILM policy triggers a cluster state update, which has to be pushed out to the other node in the multi-node cluster. This lines up with what I had noticed in the profiler -- a lot of time spent updating cluster state. It actually takes up about 30% of the runtime. At Lee's recommendation, I've added the new ILM policies to preserveILMPolicyIds (https://github.com/elastic/elasticsearch/blob/master/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java#L567) and the test seems to have sped up again. I'll have a PR up for that in a few minutes. |
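To illustrate why preserving the new policies helps, here is a small model of the cleanup cost. It is a sketch, not the `ESRestTestCase` code: the policy names are hypothetical, and it assumes one cluster state update per deleted policy and that the default policies are reinstalled before every test, so the same deletions repeat for each test. The test count of 1511 is the figure quoted earlier in this thread.

```python
from typing import Iterable, List, Set


def policies_to_delete(existing: Iterable[str], preserved: Set[str]) -> List[str]:
    """Return the policy IDs a cleanup pass would delete, sorted for stable output."""
    return sorted(p for p in existing if p not in preserved)


def cleanup_updates(existing: Iterable[str], preserved: Set[str], num_tests: int) -> int:
    """Cluster state updates caused by cleanup over a whole suite run.

    Assumption: one update per deleted policy, repeated after every test.
    """
    return len(policies_to_delete(existing, preserved)) * num_tests


# Hypothetical policy IDs; the real ones are listed in ESRestTestCase.
existing = {"known-1", "known-2", "new-default-1", "new-default-2",
            "new-default-3", "test-policy"}
preserved_before = {"known-1", "known-2"}
preserved_after = preserved_before | {"new-default-1", "new-default-2", "new-default-3"}

updates_before = cleanup_updates(existing, preserved_before, 1511)
updates_after = cleanup_updates(existing, preserved_after, 1511)
```

Under these assumptions, only the policy a test created itself is still deleted after the change, and on a multi-node cluster each avoided deletion also saves a round of cluster state publication to the other node.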
…n SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) In #76791 several new default ILM policies were added. EsRestTestCase deletes all ILM policies that it does not know about one-at-a-time. Each of these deletions causes a cluster state change that needs to be propagated to all nodes. In a large test on a multi-node cluster (like SmokeTestMultiNodeClientYamlTestSuiteIT) this eats up a significant amount of time -- about 30% of the runtime of the test. This was causing SmokeTestMultiNodeClientYamlTestSuiteIT to fail with timeouts. This commit adds the new standard ILM policies to the list of known policies not to delete. Closes #77025 Relates #76791
…n SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) (#80052) In #76791 several new default ILM policies were added. EsRestTestCase deletes all ILM policies that it does not know about one-at-a-time. Each of these deletions causes a cluster state change that needs to be propagated to all nodes. In a large test on a multi-node cluster (like SmokeTestMultiNodeClientYamlTestSuiteIT) this eats up a significant amount of time -- about 30% of the runtime of the test. This was causing SmokeTestMultiNodeClientYamlTestSuiteIT to fail with timeouts. This commit adds the new standard ILM policies to the list of known policies not to delete. Closes #77025 Relates #76791
…low down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) (#80053) * Preventing unnecessary ILM policy deletions that drastically slow down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) In #76791 several new default ILM policies were added. EsRestTestCase deletes all ILM policies that it does not know about one-at-a-time. Each of these deletions causes a cluster state change that needs to be propagated to all nodes. In a large test on a multi-node cluster (like SmokeTestMultiNodeClientYamlTestSuiteIT) this eats up a significant amount of time -- about 30% of the runtime of the test. This was causing SmokeTestMultiNodeClientYamlTestSuiteIT to fail with timeouts. This commit adds the new standard ILM policies to the list of known policies not to delete. Closes #77025 Relates #76791 * fixing backported code for 7.16 * allowing type removal warnings
Thanks @masseyke! It looks like that definitely helped. Here's average execution times on Windows for the It looks like we might even be slightly better than before the regression in execution time as well. Here's the |
Great! I was actually hoping it would be a little bit faster than before, because I found a couple of additional ILM policies, added before September 23, that we didn't need to be deleting, and added them to the exclude list as well. |
This is a weird timeout against cat/aliases; nothing in the logs seems to indicate why it timed out.
This is the amd64 + corretto11 testing pair, which probably means nothing, but I'm opening a tracking issue so that we can see whether this testing matrix becomes problematic.
Build scan:
https://gradle-enterprise.elastic.co/s/pblxdzxp54tzy/tests/:qa:smoke-test-multinode:integTest/org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT/classMethod
Reproduction line:
null
Applicable branches:
7.15
Reproduces locally?:
Didn't try
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT&tests.test=classMethod
Failure excerpt: