Darwin build times out #58286
Comments
Pinging @elastic/es-core-infra (:Core/Infra/Build) |
Some more examples are:
The same thing happened last year - see #48148. That was found to be because the macOS worker couldn't cope with running multiple integration test suites in parallel. The suite timeouts were stopped by reducing the parallelism by setting That change is still in effect (in a file in the private repo that contains the Jenkins config), but something caused the parallelism to increase in macOS CI around 11th June. You can see the effect in the "Build Time Trend" in Jenkins. For example, in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-darwin-compatibility/buildTimeTrend the time taken drops significantly between build 30 and 31, but at the cost of a large proportion of the subsequent builds failing due to suite timeouts. You can see the parallelism increase in the build scan timelines:
Those two builds both succeeded, but based on the investigation in #48148 I'm sure it's that increase in parallelism that's led to the macOS builds being flakey ever since. Do the macOS CI workers have spinning disks rather than SSDs? They seem to be staggeringly slow under load. In one of the Gradle scans the log shows it took 6 seconds to install 1 index template:
Or maybe it's because the macOS worker has less RAM than the Linux workers and is having to use swap to run four test suites at the same time. |
Can someone from @elastic/es-core-infra pick this up please? |
Some more recent failures:
Why did the number of suites run in parallel increase from 1 to 4 on 11th June? Is there an easy way to cut it to 3 and see if that helps? |
@mark-vieira I know we've been talking about the Darwin builds - do we have any concrete plans yet? |
I think realistically our best option is just to bump the test suite timeouts until we have better macos workers. |
The Darwin CI hosts continue to struggle with timeouts. This commit increases the timeouts for docs and client rest tests. relates elastic#58286
Another timeout... https://gradle-enterprise.elastic.co/s/jr4i67y5woiz6 |
The current Mac Minis used for ES CI have 32GB RAM. This explains why they struggle to run 16 test suites in parallel compared to the Linux CI workers, which have 128GB RAM. It looks like it is possible to hire Mac Minis with 64GB RAM from Mac Stadium. These are the biggest Apple has ever made, so they'd still be half the size of the Linux workers and would probably still suffer some spurious timeouts if the CI setup is tuned for the Linux workers. However, doubling the size of the macOS workers would probably mean we suffer far fewer timeouts than we have on macOS today. Might be worth a conversation with Infra (if this isn't happening already)? I don't know what our constraints are around switching machines hired from Mac Stadium. |
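To put rough numbers on the RAM comparison above, here is a minimal sketch of the memory available per test suite at 16-way parallelism. It assumes RAM is split evenly across suites and ignores OS and Gradle daemon overhead, so the real headroom is lower. The worker labels and the helper name `ram_per_suite_gb` are made up for illustration; only the 32GB, 64GB, 128GB and 16-suite figures come from the comment.

```python
def ram_per_suite_gb(total_ram_gb: float, parallel_suites: int) -> float:
    """Even split of a worker's RAM across concurrently running suites.

    Simplifying assumption: no memory is reserved for the OS or the build
    daemon, so this is an upper bound on what each suite can use.
    """
    if parallel_suites <= 0:
        raise ValueError("parallel_suites must be positive")
    return total_ram_gb / parallel_suites


# RAM figures quoted in the comment above; the labels are hypothetical.
WORKER_RAM_GB = {
    "mac_mini_current": 32,
    "mac_mini_64gb": 64,
    "linux_worker": 128,
}
PARALLEL_SUITES = 16

shares = {
    name: ram_per_suite_gb(ram, PARALLEL_SUITES)
    for name, ram in WORKER_RAM_GB.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.1f} GB per suite")
```

Under these assumptions a 64GB Mac Mini gets half the per-suite share of a Linux worker, and the current 32GB machines get a quarter of it.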
Failed again at oldEs1Fixture (7.12) https://gradle-enterprise.elastic.co/s/axym2bhwyrljg |
It looks like the last few darwin builds have timed out cloning: Only one of the last fourteen builds succeeded. |
I wonder if the reference repo is empty or something. We do log:
When fetching, but if that doesn't have useful commits in it then it can still take a long time to clone. |
It may be worth it to create separate issues for all these Mac timeouts, but I am not sure. Another timeout occurred today: https://gradle-enterprise.elastic.co/s/c4wfdnj2l4oxs/ This one timed out on three separate tests. All REST tests. This is NOT timing out due to cloning the repo (there is another issue for that) |
BWC timeout in 7.x against 6.8.17: https://gradle-enterprise.elastic.co/s/7d4ihrpf4dkcg |
Here's another one from today that looks related. |
Here's another from today: https://gradle-enterprise.elastic.co/s/ws2mxuhxdotee/console-log/raw?task=:server:integTest |
This one seems to be related as well: https://gradle-enterprise.elastic.co/s/em6byt5yt2dlq |
I've reached out to infra to get our Mac build agents rebuilt. |
Another timeout: https://gradle-enterprise.elastic.co/s/adgded3jzzxvs/ This one was for the snapshot/restore builds against Azure.
|
Another Darwin build failure on master: https://gradle-enterprise.elastic.co/s/hdgccgk5fx6di |
https://gradle-enterprise.elastic.co/s/darif25jbecym and https://gradle-enterprise.elastic.co/s/vlpeh4oo6avcs today. ClientYamlTestSuiteIT and DocsClientYamlTestSuiteIT |
Well well well if it isn't ClientYamlTestSuiteIT and DocsClientYamlTestSuiteIT again |
Yet another darwin timeout: https://gradle-enterprise.elastic.co/s/7rhvm6ivqz3bm
|
And just like any test triage day, the failures continue here... None of the darwin builds seem to ever pass (see https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+8.1+multijob+platform-support-darwin/ or https://elasticsearch-ci.elastic.co/view/All/job/elastic+elasticsearch+main+multijob+platform-support-darwin/), so what's the point of running them if we're not working on a timely fix? |
We've removed these jobs. |
The darwin build seems to fail due to timeouts a lot. I'm not sure if anything happened recently, but I figure the build folk will have more of the right tools to track it down so I'm tagging them.
Build scan:
scan
Repro line:
Reproduces locally?:
No.
Applicable branches: mostly master
Failure history:
link
Failure excerpt: