Subset of openjdk tests cannot complete on certain machines within 10hrs #2893
@smlambert Are we getting test case failures on those systems (timeouts etc. on some tests) which would explain this, as opposed to the machines being slow?
Yes, that most certainly could be part of it for the s390x nodes, from a quick look (so not necessarily conclusive): there are 300+ failures in a run that aborts at 10hrs and only 2 failures in a run that completes in under 4hrs. Among the 300+ failures, many are jdk_net and jdk_management failures. I will also add a comment to this issue about the performance degradation of the solaris runs, which is different from the s390x examples.
See adoptium/aqa-tests#4258 (comment) for a sparcv9 solaris example. Also noting the performance degradation between runs from September 10th and the present day; what changed needs investigation (new machine config? tests starting to fail, so hitting timeouts?).
I'd treat the solaris/sparcv9 systems separately from everything else, since they're a bit less under our control as to how they might be virtualised at the provider, and that is consistent with what we've seen elsewhere where those systems are used. We're a little limited in options there, unfortunately.
The Marist machines were all replaced a few months ago as part of a migration project, and the two SLES ones were brought online after our older SLES ones were not being regularly used.
Probably related: #2923
Seen by Andrew this week too (slack thread)
Hadn't realised that @andrew-m-leonard had disconnected the machine when he created the above issue. I've brought it back online. The real problem here is that the machines are not running […]. Also @Haroon-Khel, bearing in mind what we're seeing with the extended.openjdk suite and some differences in runtimes on different machines, can you also check/verify whether there is a difference in CPU limits etc. across the Linux/aarch64 test machines that might account for some of the runtime differences we've seen even on different containers? There also still seems to be a bit of variation on Linux/s390x which it would be good to check into, to see if there are some machine-specifics happening, since on some of those machines it's taking a LOT longer than on others (extended20_testlist1, extended20_testlist2).
Re: the linux aarch64 containers, they're spread over 2 dockerhost machines.
So there's a slight difference in max clock speed (but not significant). In addition, if you look at https://github.com/adoptium/infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/DockerStatic/tasks/main.yml, we limit the CPU and memory of each of the containers: 2 CPUs and 6g memory for most containers, with the exception of some of the ubuntu ones which have a 4-CPU limit (not sure why there's a difference).
Can you determine whether any of the slow ones mentioned in here are locked to 2 CPUs? Also verify whether the test logs from the runs on those machines are correctly detecting 2 CPUs, or trying to run on more (I think searching the logs for …).
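As a rough way to do that verification (a sketch only; the class name is invented and this is not part of the existing test suites), a tiny program run inside each container shows what the JVM itself believes it has been given, which can then be compared against the 2-CPU/6g limits mentioned above:

```java
// Hypothetical helper, not part of aqa-tests: print the resources the JVM detects.
public class ContainerLimitsCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // On cgroup-limited containers, container-aware JDKs report the container's
        // CPU quota here rather than the host's full core count.
        System.out.println("availableProcessors = " + rt.availableProcessors());
        System.out.println("maxMemory (MB)      = " + rt.maxMemory() / (1024 * 1024));
    }
}
```

On recent Linux JDKs, `java -XshowSettings:system -version` also prints the detected operating-system/container metrics, which may be quicker than grepping the existing test logs.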
Another platform for which the extended.openjdk job aborts due to timeout is Windows x64. These aborts appeared during the March release: https://ci.adoptium.net/job/Test_openjdk20_hs_extended.openjdk_x86-64_windows_testList_2/29/ on test-azure-win2019-x64-1
Skimming through https://ci.adoptium.net/job/Test_openjdk20_hs_extended.openjdk_x86-64_windows_testList_0/28/consoleFull to see how long it takes for the test suites to pass, I notice that the jvm_compiler tests take the most time:
- 3h38m
- under 4hrs
- jvm_compiler_2 is skipped

It would be interesting to see the run time of these tests on other platforms.
A similar observation in https://ci.adoptium.net/job/Test_openjdk20_hs_extended.openjdk_x86-64_windows_testList_0/27/:
- under 4hrs
- over 4hrs
- 1hr
- over 2hrs
- over 4hrs
https://ci.adoptium.net/job/Test_openjdk17_hs_extended.openjdk_s390x_linux_testList_1/80/consoleFull on test-marist-sles12-s390x-2 is interesting. There isn't a test suite that hangs.
Sparcv9 extended.openjdk: https://ci.adoptium.net/job/Test_openjdk8_hs_extended.openjdk_sparcv9_solaris_testList_1/72/consoleFull on build-siteox-solaris10u11-sparcv9-1
- 4hrs
- over 2hrs
- over 1hr

https://ci.adoptium.net/job/Test_openjdk8_hs_extended.openjdk_sparcv9_solaris_testList_1/71/consoleFull on test-siteox-solaris10u11-sparcv9-1
- 4hrs again
- over 2hrs again
- jdk_security4_0, jdk_nio_0 took 1hr

https://ci.adoptium.net/job/Test_openjdk8_hs_extended.openjdk_sparcv9_solaris_testList_1/68/consoleFull on test-siteox-solaris10u11-sparcv9-1
- close to 4hrs
- over 2hrs
- jdk_security4_0, jdk_nio_0 about 1hr

A clear pattern: hotspot_jre_0 and jdk_other_0 seem to be hanging.
https://ci.adoptium.net/job/Grinder/6986/console was a grinder from the March release. I increased the timeout from its default of 10 to 20 because the original extended.openjdk job aborted due to timeout. The grinder ran from 13:22:06 to 00:35:47, so around 11hrs. I propose that until we can figure out why tests are hanging, the default timeout should be set to 15 (if 15 does in fact mean 15hrs), so at least these test jobs can complete. @smlambert What do you think?
It is measured in hours and we can certainly adjust the TIME_LIMIT (across the board or for specific platforms).
Changing it in those places only matters if the setting in the jobs is left blank, which we typically never do. So I suggest we regenerate the extended.openjdk test jobs with the desired value across all platforms and versions. Regen of JDK8 extended.openjdk: Test_Job_Auto_Gen/986
I'm going to copy the comment from #2923 (comment) into here as another data point, and close the other issue as a duplicate: ubuntu2110 took 9h30
JDK8 extended.openjdk testing on Solaris SparcV9 appears to be timing out regularly, especially the target jdk_security3_0. At one point it was suggested that AsyncSSLSocketClose.java may be responsible, but I ran it on its own here and it seems to run and pass without a problem. So to identify the slow runner(s) in security3, I've run another job here with an extended timeout. If it hasn't finished by this time tomorrow, I'll explore other options. Notes for future testing:
|
Ok, update on SparcV9: jdk_security3 passes if given enough time to run. This job (mentioned above) took almost exactly 10 hours to run, by itself. Looking at the unit test runtimes, here are the biggest offenders:
So removing SignatureTest would knock this target down to an 8-hour runtime, and removing all 9 of these tests would reduce the duration by 5hrs 33mins. I'll exclude the test in a branch and run the full extended.openjdk suite to see if this fixes things. I'll also extend the timeout on these jobs, so we can identify any other long-running targets.

Update: Exclude file appears to be ready. Tests running here.

Update 2023/09/06 AdamFarley: Turns out exclusions work better when you include the test name file extensions. However, we were able to identify some long runners in the other test targets:
Rerunning without these tests here. Note: I noticed some of these were already excluded for zLinux, which doesn't have a JIT and is known for being slow as a result. It's more correlation than causation at this stage. I see the sparc Temurin build uses the Server VM instead of zLinux's Zero VM, which implies that it isn't interpreter-only on sparc. However, further investigation (if warranted) could begin by analysing JIT activity during the longer-running tests (like ZipFSOutputStreamTest) and comparing it to the (JIT-on) throughput on another OS/architecture that has similar throughput when run with the JIT off. Or you could just run the test with -Xint and see whether the default run is any faster than the interpreter-only one on sparc; if not, this could be an "inactive JIT" situation.
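For that -Xint comparison, something like the following would do as a first pass (a sketch only: `HotLoop` is an invented stand-in workload, not the actual jtreg test, and the timings are only meaningful relative to each other on the same machine):

```java
// Invented CPU-bound workload for comparing JIT-on vs interpreter-only runtimes.
// Run it twice and compare:
//   java HotLoop          (default: JIT enabled)
//   java -Xint HotLoop    (interpreter only)
// If the -Xint run is not noticeably slower on sparc, the JIT is probably not
// contributing much there; adding -XX:+PrintCompilation would show whether any
// JIT compilation happens at all.
public class HotLoop {
    public static void main(String[] args) {
        long start = System.nanoTime();
        long acc = 0;
        for (int i = 0; i < 200_000_000; i++) {
            acc += (i ^ (i >>> 3)) % 7;
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("acc=" + acc + ", elapsed=" + elapsedMs + "ms");
    }
}
```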
I reran with a new exclusion file here, and the two testList subjobs ran for 10 hours (ish) each. @Haroon-Khel, I propose we merge this PR and increase the SparcV9 time limit to prevent the "ish" from pushing us a hair over the default 10 hours. What do you think?
I'm for increasing the timeout to 15hrs, as it worked here: #2893 (comment). But which PR are you referring to?
This changeset as a PR, I meant to say. The changeset excludes a bunch of the longer-running tests, which I tested in the links above. I've created the PR now. Here's a link.
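For reference, exclusions for the openjdk test suites are typically expressed as jtreg ProblemList-style entries; the entry below is illustrative only (the test path, issue link and platform token are placeholders, not copied from the PR), but it shows the point from the earlier update: the test file name needs its `.java` extension for the exclusion to take effect.

```
# Format: <test file> <issue link> <platform(s)>
# Illustrative entry only - note the .java extension on the test file name.
jdk/nio/zipfs/ZipFSOutputStreamTest.java  https://github.com/adoptium/aqa-tests/issues/XXXX  solaris-sparcv9
```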
Without that PR, we'll need to increase the timeout to more than 15 hours (for each of the two testlists), or we'll need to run these tests in more than 2 testlists, which would require:
It seemed a simpler idea to just exclude the biggest offenders for the overrun and lengthen the timeout.
Noting that queue time is not part of the overall TIME_LIMIT value, so you would not hit the 10hr default time limit if you waited 10hrs for a machine to become free. The time limit starts the countdown at the setup stage.
I didn't know that. Thanks, Shelley.
Please set the title to indicate the test name and machine name where known.
3/7 of the last runs of a subset of extended.openjdk tests time out and are aborted when run on certain machines (test-marist-sles12-s390x-2, test-marist-rhel7-s390x-2, test-marist-sles15-s390x-2)
To make it easy for the infrastructure team to repeat and diagnose, please answer the following questions:
- The Test_ job on https://ci.adoptopenjdk.net which showed the failure