
Subset of openjdk tests can not complete on certain machines within 10hrs #2893

Closed
smlambert opened this issue Jan 23, 2023 · 30 comments


smlambert commented Jan 23, 2023

Please set the title to indicate the test name and machine name where known.
[Screenshot: Jenkins test results grid, 2023-01-23]

3 of the last 7 runs of a subset of extended.openjdk tests time out and are aborted when run on certain machines (test-marist-sles12-s390x-2, test-marist-rhel7-s390x-2, test-marist-sles15-s390x-2).


@smlambert smlambert changed the title Subset of extended.openjdk tests can not complete on certain machines within 10hrs Subset of openjdk tests can not complete on certain machines within 10hrs Jan 23, 2023

sxa commented Jan 24, 2023

@smlambert Are we getting test case failures on those systems (timeouts etc. on some tests) which would explain this, as opposed to the machines being slow?

smlambert commented:

Yes, that could certainly be part of it for the s390x nodes. From a quick look (so not necessarily conclusive), there are 300+ failures in a run that aborts at 10 hrs and only 2 failures in a run that completes in under 4 hrs.

Among the 300+ failures, many are jdk_net and jdk_management failures (NoRouteToHostException amongst them), so possibly a side-effect of #2807.

I will also add a comment to this issue about the performance degradation of the solaris runs, which is different from the s390x examples.

smlambert commented:

See adoptium/aqa-tests#4258 (comment) for sparcv9 solaris example.

Also noting the performance degradation between the September 10th run and the present day; what changed (new machine config? tests starting to fail and hitting timeouts?) needs investigation:

[Screenshot: job run history showing the degradation since September 10th]


sxa commented Jan 24, 2023

I'd treat the solaris/sparcv9 systems separately from everything else, since we have less control over how they might be virtualised at the provider, and the slowness is consistent with what we've seen elsewhere those systems are used. We're a little limited in options there, unfortunately.


sxa commented Jan 30, 2023

The Marist machines were all replaced a few months ago as part of a migration project and the two SLES ones were brought online after our older SLES ones were not being regularly used.

FYI @steelhead31 @Haroon-Khel

@sxa sxa added this to the 2023-02 (February) milestone Jan 30, 2023

sxa commented Mar 8, 2023

Probably related: #2923


sxa commented Mar 17, 2023

Seen by Andrew this week too (slack thread)


sxa commented Mar 20, 2023

Hadn't realised that @andrew-m-leonard had disconnected the machine when he created the above issue. I've brought it back online.

The real problem here is that the machines are not running extended.openjdk with the Dynamic parallel option, so they are taking longer than they do on other platforms. @sophia-guo will do the work to enable that for Linux/aarch64, which should resolve it on most of the machines.

Also, @Haroon-Khel, bearing in mind what we're seeing with the extended.openjdk suite and the differences in runtimes on different machines, can you check/verify whether there is a difference in CPU limits etc. across the Linux/aarch64 test machines that might account for some of the runtime differences we've seen, even on different containers? There also still seems to be a bit of variation on Linux/s390x which it would be good to look into, to see if there is something machine-specific happening, since on some of those machines runs are taking a LOT longer than on others (extended20_testlist1, extended20_testlist2).
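As a rough way of comparing those runs, a small script along these lines could pull the overall duration out of the Jenkins JSON API; this is only a sketch, and the job names and build numbers below are placeholders rather than the actual runs. Note that `builtOn` may be missing or empty for Pipeline builds, in which case the node name has to be read from the console log instead.

```python
# Sketch: compare overall durations of a few test builds via the Jenkins JSON API.
# Job names and build numbers are placeholders.
import requests

JENKINS = "https://ci.adoptium.net"
BUILDS = [
    ("Test_openjdk20_hs_extended.openjdk_s390x_linux_testList_1", 28),  # placeholder
    ("Test_openjdk20_hs_extended.openjdk_s390x_linux_testList_2", 28),  # placeholder
]

for job, num in BUILDS:
    url = f"{JENKINS}/job/{job}/{num}/api/json?tree=duration,result,builtOn"
    info = requests.get(url, timeout=30).json()
    hours = info["duration"] / 3_600_000  # Jenkins reports duration in milliseconds
    node = info.get("builtOn") or "n/a"   # often empty for Pipeline builds
    print(f"{job} #{num}: {hours:.1f}h, result={info.get('result')}, node={node}")
```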


Haroon-Khel commented Mar 21, 2023

Re the Linux aarch64 containers, they're spread over 2 dockerhost machines:

dockerhost-equinix-ubuntu2004-armv8-1, max cpu clock speed 3000.0000 MHz, MemTotal: 525926796 kB (500G)
dockerhost-equinix-ubuntu2204-armv8-1, max cpu clock speed 2800.0000 MHz, MemTotal: 525918844 kB (500G)

So there's a slight difference in max clock speed (but not a significant one).

In addition, if you look at https://github.com/adoptium/infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/DockerStatic/tasks/main.yml, we limit the CPU and memory of each of the containers: 2 CPUs and 6g memory for most containers, with the exception of some of the Ubuntu ones which have a 4 CPU limit (not sure why there's a difference).


sxa commented Mar 21, 2023

Can you determine whether any of the slow ones mentioned in here are locked to 2 CPUs? Also verify whether the test logs from the runs on those machines are correctly detecting 2 CPUs, or trying to run on more (I think searching the logs for "cores" should let you determine that).
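For the "locked to 2 CPUs" part, a quick in-container check of the cgroup CPU quota, alongside what a naive process would report, could look roughly like this; it is only a sketch, and the cgroup paths below are the common v1/v2 mount points, which may differ on a given host.

```python
# Sketch: report the CPU limit a container has actually been given, from
# cgroup v1 or v2, alongside the naive CPU count. Adjust paths if the host
# mounts cgroups elsewhere.
import os

def cgroup_cpu_limit():
    # cgroup v2: cpu.max contains "max <period>" or "<quota> <period>"
    try:
        quota, period = open("/sys/fs/cgroup/cpu.max").read().split()
        return None if quota == "max" else int(quota) / int(period)
    except FileNotFoundError:
        pass
    # cgroup v1: cfs_quota_us is -1 when unlimited
    try:
        quota = int(open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").read())
        period = int(open("/sys/fs/cgroup/cpu/cpu.cfs_period_us").read())
        return None if quota < 0 else quota / period
    except FileNotFoundError:
        return None

print("cgroup CPU limit:", cgroup_cpu_limit() or "none/unknown")
print("os.cpu_count() reports:", os.cpu_count())  # host CPUs, not the cgroup quota
```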

Haroon-Khel commented:

Skimming through https://ci.adoptium.net/job/Test_openjdk20_hs_extended.openjdk_x86-64_windows_testList_0/28/consoleFull to see how long the test suites take to pass, I notice that the jvm_compiler tests take the most time:

18:59:26  Running test jvm_compiler_0 ...
22:38:54  jvm_compiler_0_PASSED

3h38m

22:38:54  Running test jvm_compiler_1 ...
02:49:19  jvm_compiler_1_PASSED

just over 4hrs

jvm_compiler_2 is skipped

It would be interesting to see the run time of these tests on other platforms
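The durations above were worked out by hand from the console timestamps; for comparisons across platforms, something like the following could pull them out of a saved consoleFull log. A sketch only: it assumes the "Running test X ..." / "X_PASSED" line format shown above, and since the timestamps are wall-clock only, runs crossing midnight get a +24h adjustment.

```python
# Sketch: extract per-target durations from a saved consoleFull log.
import re
import sys
from datetime import datetime, timedelta

start_re = re.compile(r"(\d\d:\d\d:\d\d)\s+Running test (\S+) \.\.\.")
end_re = re.compile(r"(\d\d:\d\d:\d\d)\s+(\S+?)_(PASSED|FAILED)")

starts = {}
with open(sys.argv[1], errors="replace") as log:
    for line in log:
        if m := start_re.search(line):
            starts[m.group(2)] = datetime.strptime(m.group(1), "%H:%M:%S")
        elif (m := end_re.search(line)) and m.group(2) in starts:
            delta = datetime.strptime(m.group(1), "%H:%M:%S") - starts.pop(m.group(2))
            if delta < timedelta(0):  # run crossed midnight
                delta += timedelta(hours=24)
            print(f"{m.group(2)}: {delta}")
```

Run against a log saved from the consoleFull URL, e.g. `python target_durations.py console.log`.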

Haroon-Khel commented:

A similar observation in https://ci.adoptium.net/job/Test_openjdk20_hs_extended.openjdk_x86-64_windows_testList_0/27/

12:48:18  Running test jvm_compiler_0 ...
16:35:45  jvm_compiler_0_PASSED

under 4hrs

16:35:45  Running test jvm_compiler_1 ...
20:47:03  jvm_compiler_1_PASSED

over 4hrs

Haroon-Khel commented:

On https://ci.adoptium.net/job/Test_openjdk20_hs_extended.openjdk_x86-64_windows_testList_2/29/consoleFull

06:19:29  Running test jdk_beans_1 ...
07:19:13  jdk_beans_1_PASSED

1hr

07:30:19  Running test jdk_net_0 ...
09:47:47  jdk_net_0_PASSED

over 2hrs

11:02:46  Running test jdk_security3_0 ...
15:12:19  jdk_security3_0_PASSED

over 4hrs


Haroon-Khel commented Mar 30, 2023

Sparcv9 extended openjdk

https://ci.adoptium.net/job/Test_openjdk8_hs_extended.openjdk_sparcv9_solaris_testList_1/72/consoleFull on build-siteox-solaris10u11-sparcv9-1

01:40:16  Running test hotspot_jre_0 ...
05:47:15  hotspot_jre_0_FAILED

4hrs

05:47:15  Running test jdk_other_0 ...
08:06:12  jdk_other_0_PASSED

over 2hrs

09:43:39  Running test jdk_security4_0 ...
11:06:14  Cancelling nested steps due to timeout

over 1hr

https://ci.adoptium.net/job/Test_openjdk8_hs_extended.openjdk_sparcv9_solaris_testList_1/71/consoleFull on test-siteox-solaris10u11-sparcv9-1

01:16:05  Running test hotspot_jre_0 ...
05:18:45  hotspot_jre_0_FAILED

4hrs again

05:18:46  Running test jdk_other_0 ...
07:55:45  jdk_other_0_PASSED

over 2hrs again

jdk_security4_0, jdk_nio_0 took 1hr

https://ci.adoptium.net/job/Test_openjdk8_hs_extended.openjdk_sparcv9_solaris_testList_1/68/consoleFull on test-siteox-solaris10u11-sparcv9-1

05:03:35  Running test hotspot_jre_0 ..
08:58:16  hotspot_jre_0_FAILED

close to 4hrs

08:58:16  Running test jdk_other_0 ...
11:15:58  jdk_other_0_PASSED

over 2hrs

jdk_security4_0, jdk_nio_0 about 1hr

A clear pattern: hotspot_jre_0 and jdk_other_0 seem to be hanging.


Haroon-Khel commented Mar 30, 2023

https://ci.adoptium.net/job/Grinder/6986/console was a grinder from the March release. I increased the timeout from its default 10 to 20 because the original extended openjdk job aborted due to timeout. The grinder ran from 13:22:06 to 00:35:47, so around 11 hrs.

I propose that until we can figure out why tests are hanging, the default timeout should be set to 15 (if 15 does in fact mean 15 hrs), so at least these test jobs can complete.

@smlambert What do you think?

smlambert commented:

It is measured in hours and we can certainly adjust the TIME_LIMIT (across the board or for specific platforms).


Haroon-Khel commented Mar 31, 2023

I would prefer across the board. Aside from changing the job config of every extended.openjdk job, is there a quicker way to adjust the default TIME_LIMIT parameter?

If I change it here, here and here, do you know if it updates it for the existing jenkins jobs?

smlambert commented:

Changing it in those places only matters if the setting in the jobs is left blank, which we typically never do. So, I suggest we regenerate the extended.openjdk test jobs with the desired value across all platforms and versions.

regen JDK8 extended.openjdk - Test_Job_Auto_Gen/986
regen JDK11,17 extended.openjdk - Test_Job_Auto_Gen/987
regen JDK20 extended.openjdk - Test_Job_Auto_Gen/988


sxa commented Apr 3, 2023

I'm going to copy the comment from #2923 (comment) into here as another data point, and close the other issue as a duplicate:

ubuntu2110 took 9h30
ubuntu2204 took 6h30
Both passed all 13,931 tests.
Both were started at pretty much the same time (to within a minute), but one of them just seemed to get through everything a bit more slowly:
times.2583.gz
times.6583.gz


adamfarley commented Jun 7, 2023

JDK8 extended.openjdk testing on Solaris SparcV9 appears to be timing out regularly, especially the target jdk_security3_0.

At one point it was suggested that AsyncSSLSocketClose.java may be responsible, but I ran it on its own here and it seems to run and pass without a problem.

So to identify the slow runner/s in security3, I've run another job here with an extended timeout. If it hasn't finished by this time tomorrow, I'll explore other options.

Notes for future testing:

  • Regular javacore generation could be useful (once every few minutes); a rough sketch follows below.
  • Perhaps something that announces each time a unit test starts and finishes, complete with test name and timestamp.
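A minimal sketch of the first idea, assuming a HotSpot JVM where `jcmd <pid> Thread.print` is available (on OpenJ9 the equivalent would be triggering a javacore); the interval, file naming, and the way the PID is supplied are all arbitrary choices here, not part of any existing harness.

```python
# Sketch: capture a thread dump from the JVM under test every few minutes so a
# hang can be diagnosed afterwards. Assumes jcmd is on the PATH and the PID of
# the JVM running the tests is passed on the command line.
import subprocess
import sys
import time

pid = sys.argv[1]      # PID of the JVM running the tests
interval = 300         # seconds between dumps

while True:
    stamp = time.strftime("%Y%m%d-%H%M%S")
    try:
        dump = subprocess.run(["jcmd", pid, "Thread.print"],
                              capture_output=True, text=True, timeout=60)
        with open(f"threaddump-{pid}-{stamp}.txt", "w") as out:
            out.write(dump.stdout or dump.stderr)
    except subprocess.TimeoutExpired:
        print(f"{stamp}: jcmd timed out", file=sys.stderr)
    time.sleep(interval)
```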


adamfarley commented Jun 8, 2023

Ok, update on SparcV9:

jdk_security3 passes if given enough time to run. This job (mentioned above) took almost exactly 10 hours to run, by itself.

Looking at the unit test runtimes, here are the biggest offenders:

| Test | Duration |
| --- | --- |
| sun/security/rsa/SignatureTest | 2 hr 4 min |
| sun/security/tools/keytool/WeakAlg | 48 min |
| sun/security/tools/keytool/standard | 38 min |
| sun/security/tools/keytool/PSS | 37 min |
| sun/security/util/math/TestIntegerModuloP | 26 min |
| javax/net/ssl/TLSCommon/TLSTest | 21 min |
| sun/security/rsa/pss/SignatureTestPSS | 14 min |
| sun/security/provider/DSA/SupportedDSAParamGen | 14 min |
| sun/security/provider/DSA/TestAlgParameterGenerator | 11 min |

So removing SignatureTest would knock this target down to an 8 hour runtime, and removing all 9 of these tests would reduce the duration by 5 hrs 33 mins.
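For reference, the 5 hrs 33 mins figure is just the sum of the per-test durations in the table above:

```python
# Quick check: total of the durations listed in the table, in minutes.
minutes = [2 * 60 + 4, 48, 38, 37, 26, 21, 14, 14, 11]
total = sum(minutes)                          # 333 minutes
print(f"{total // 60} hr {total % 60} min")   # -> "5 hr 33 min"
```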

I'll exclude the test in a branch and run the full extended.openjdk suite to see if this fixes things. I'll also extend the timeout on these jobs as well, so we can identify any other long-running targets.

Update: Exclude file appears to be ready. Tests running here.

Update 2023/09/06 AdamFarley: Turns out exclusions work better when you include the test name file extensions. However, we were able to identify some long runners in the other test targets:

| Test | Duration |
| --- | --- |
| demo/zipfs/ZipFSOutputStreamTest | 2 hr 17 min |
| compiler/7184394/TestAESMain | 1 hr 23 min |
| com/sun/crypto/provider/KeyAgreement/SupportedDHParamGens | 29 min |
| tools/pack200/CommandLineTests | 18 min |

Rerunning without these tests here.
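On the file-extension note above: the openjdk exclude files appear to follow the jtreg ProblemList layout of test path, issue reference and platform list, so a hypothetical entry for one of these long runners might look something like the following (the issue link and platform string here are placeholders, not the actual values used):

```
# Illustrative only -- format assumed from the jtreg ProblemList convention:
# <test, including its .java/.sh extension>   <issue reference>   <platform(s)>
demo/zipfs/ZipFSOutputStreamTest.java   <issue-link>   solaris-sparcv9
```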

Note: I noticed some of these were already excluded for zLinux, which doesn't have a JIT and is known for being slow as a result. It's more correlation than causation at this stage. I see the sparc Temurin build uses Server VM instead of zLinux's Zero VM, which implies that it isn't interpreter-only on sparc. However, further investigation (if warranted) could begin by analysing JIT activity during the longer-running tests (like ZipFSOutputStreamTest) and comparing it to the (JIT-on) throughput on another OS/architecture that has similar throughput when run with the JIT off.

Or you could just run the test with -Xint and see whether it gets any slower on sparc. If not, then this could be an "inactive JIT" situation.

adamfarley commented:

I reran with a new exclusion file here, and the two testList subjobs ran for 10 hours (ish) each.

@Haroon-Khel - I propose we merge this PR and increase the SparcV9 time limit to prevent the "ish" from pushing us a hair over the default 10 hours. What do you think?

Haroon-Khel commented:

I'm for increasing the timeout to 15 hrs, as it worked here: #2893 (comment). But which PR are you referring to?


adamfarley commented Jun 19, 2023

This changeset as a PR, I meant to say. The changeset excludes a bunch of the longer running tests, which I tested in the links above. I've created the PR now. Here's a link.

adamfarley commented:

Without that PR, we'll need to increase the timeout to more than 15 hours (per each of the two testlists), or we'll need to run these tests in more than 2 testlists, which would require:

  • Making a testlist job able to wait for over 10 hours for a machine to free up.
  • Making sure we can run with concurrency able to be set to more than the number of machines.
  • Coming to terms with the idea that individual unit tests can run for over 2 hours.

It seemed simpler to just exclude the biggest offenders for the overrun and lengthen the timeout.

smlambert commented:

Noting that queue time is not part of the overall TIME_LIMIT value, so you would not hit the 10 hr default time limit if you waited for 10 hrs for a machine to become free. The time limit starts the countdown at the setup stage.

adamfarley commented:

I didn't know that. Thanks Shelley.
