
Subset of openjdk tests can not complete on certain machines within 10hrs #2893

Closed
smlambert opened this issue Jan 23, 2023 · 30 comments


smlambert commented Jan 23, 2023

Please set the title to indicate the test name and machine name where known.
[Screenshot: Jenkins test results grid, 2023-01-23]

3 of the last 7 runs of a subset of extended.openjdk tests time out and are aborted when run on certain machines (test-marist-sles12-s390x-2, test-marist-rhel7-s390x-2, test-marist-sles15-s390x-2).


@smlambert smlambert changed the title Subset of extended.openjdk tests can not complete on certain machines within 10hrs Subset of openjdk tests can not complete on certain machines within 10hrs Jan 23, 2023

sxa commented Jan 24, 2023

@smlambert Are we getting test case failures on those systems (timeouts etc. on some tests) which would explain this, as opposed to the machines being slow?

smlambert commented:

Yes, that could certainly be part of it for the s390x nodes. From a quick look (so not necessarily conclusive), there are 300+ failures in a run that aborts at 10 hrs and only 2 failures in a run that completes in under 4 hrs.

Among the 300+ failures, many are jdk_net and jdk_management failures (NoRouteToHostException amongst them), so possibly a side-effect of #2807.

I will also add a comment to this issue about the performance degradation of the solaris runs, which is different from the s390x examples.

smlambert commented:

See adoptium/aqa-tests#4258 (comment) for sparcv9 solaris example.

Also noting the performance degradation between the September 10th run and the present day; what changed (new machine config? tests starting to fail and hitting timeouts?) needs investigation:

[Screenshot: job run history showing the degradation since September 10th]


sxa commented Jan 24, 2023

I'd treat the solaris/sparcv9 systems separately from everything else, since we have less control over how they might be virtualised at the provider, and the slowness is consistent with what we've seen elsewhere those systems are used. We're a little limited in options there, unfortunately.


sxa commented Jan 30, 2023

The Marist machines were all replaced a few months ago as part of a migration project and the two SLES ones were brought online after our older SLES ones were not being regularly used.

FYI @steelhead31 @Haroon-Khel

@sxa sxa added this to the 2023-02 (February) milestone Jan 30, 2023

sxa commented Mar 8, 2023

Probably related: #2923


sxa commented Mar 17, 2023

Seen by Andrew this week too (slack thread)


sxa commented Mar 20, 2023

Hadn't realised that @andrew-m-leonard had disconnected the machine when he created the above issue. I've brought it back online.

The real problem here is that the machines are not running extended.openjdk with the Dynamic parallel option, so they are taking longer than they do on other platforms. @sophia-guo will do the work to enable that for Linux/aarch64, which should resolve it on most of the machines.

Also, @Haroon-Khel, bearing in mind what we're seeing with the extended.openjdk suite and the differences in runtimes on different machines, can you check/verify whether there is a difference in CPU limits etc. across the Linux/aarch64 test machines that might account for some of the runtime differences we've seen, even on different containers? There also still seems to be a bit of variation on Linux/s390x which it would be good to look into, to see if there is something machine-specific happening, since on some of those machines runs are taking a LOT longer than on others (extended20_testlist1, extended20_testlist2).
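As a rough way of comparing those runs, a small script along these lines could pull the overall duration out of the Jenkins JSON API; this is only a sketch, and the job names and build numbers below are placeholders rather than the actual runs. Note that `builtOn` may be missing or empty for Pipeline builds, in which case the node name has to be read from the console log instead.

```python
# Sketch: compare overall durations of a few test builds via the Jenkins JSON API.
# Job names and build numbers are placeholders.
import requests

JENKINS = "https://ci.adoptium.net"
BUILDS = [
    ("Test_openjdk20_hs_extended.openjdk_s390x_linux_testList_1", 28),  # placeholder
    ("Test_openjdk20_hs_extended.openjdk_s390x_linux_testList_2", 28),  # placeholder
]

for job, num in BUILDS:
    url = f"{JENKINS}/job/{job}/{num}/api/json?tree=duration,result,builtOn"
    info = requests.get(url, timeout=30).json()
    hours = info["duration"] / 3_600_000  # Jenkins reports duration in milliseconds
    node = info.get("builtOn") or "n/a"   # often empty for Pipeline builds
    print(f"{job} #{num}: {hours:.1f}h, result={info.get('result')}, node={node}")
```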


Haroon-Khel commented Mar 21, 2023

Re the Linux aarch64 containers, they're spread over 2 dockerhost machines:

dockerhost-equinix-ubuntu2004-armv8-1, max cpu clock speed 3000.0000 MHz, MemTotal: 525926796 kB (500G)
dockerhost-equinix-ubuntu2204-armv8-1, max cpu clock speed 2800.0000 MHz, MemTotal: 525918844 kB (500G)

So there's a slight difference in max clock speed (but not a significant one).

In addition, if you look at https://github.com/adoptium/infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/DockerStatic/tasks/main.yml, we limit the CPU and memory of each of the containers: 2 CPUs and 6g memory for most containers, with the exception of some of the Ubuntu ones which have a 4 CPU limit (not sure why there's a difference).


sxa commented Mar 21, 2023

Can you determine whether any of the slow ones mentioned in here are locked to 2 CPUs? Also verify whether the test logs from the runs on those machines are correctly detecting 2 CPUs, or trying to run on more (I think searching the logs for "cores" should let you determine that).
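For the "locked to 2 CPUs" part, a quick in-container check of the cgroup CPU quota, alongside what a naive process would report, could look roughly like this; it is only a sketch, and the cgroup paths below are the common v1/v2 mount points, which may differ on a given host.

```python
# Sketch: report the CPU limit a container has actually been given, from
# cgroup v1 or v2, alongside the naive CPU count. Adjust paths if the host
# mounts cgroups elsewhere.
import os

def cgroup_cpu_limit():
    # cgroup v2: cpu.max contains "max <period>" or "<quota> <period>"
    try:
        quota, period = open("/sys/fs/cgroup/cpu.max").read().split()
        return None if quota == "max" else int(quota) / int(period)
    except FileNotFoundError:
        pass
    # cgroup v1: cfs_quota_us is -1 when unlimited
    try:
        quota = int(open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").read())
        period = int(open("/sys/fs/cgroup/cpu/cpu.cfs_period_us").read())
        return None if quota < 0 else quota / period
    except FileNotFoundError:
        return None

print("cgroup CPU limit:", cgroup_cpu_limit() or "none/unknown")
print("os.cpu_count() reports:", os.cpu_count())  # host CPUs, not the cgroup quota
```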

Haroon-Khel commented:

Skimming through https://ci.adoptium.net/job/Test_openjdk20_hs_extended.openjdk_x86-64_windows_testList_0/28/consoleFull to see how long the test suites take to pass, I notice that the jvm_compiler tests take the most time:

18:59:26  Running test jvm_compiler_0 ...
22:38:54  jvm_compiler_0_PASSED

3h38m

22:38:54  Running test jvm_compiler_1 ...
02:49:19  jvm_compiler_1_PASSED

just over 4hrs

jvm_compiler_2 is skipped

It would be interesting to see the run time of these tests on other platforms
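The durations above were worked out by hand from the console timestamps; for comparisons across platforms, something like the following could pull them out of a saved consoleFull log. A sketch only: it assumes the "Running test X ..." / "X_PASSED" line format shown above, and since the timestamps are wall-clock only, runs crossing midnight get a +24h adjustment.

```python
# Sketch: extract per-target durations from a saved consoleFull log.
import re
import sys
from datetime import datetime, timedelta

start_re = re.compile(r"(\d\d:\d\d:\d\d)\s+Running test (\S+) \.\.\.")
end_re = re.compile(r"(\d\d:\d\d:\d\d)\s+(\S+?)_(PASSED|FAILED)")

starts = {}
with open(sys.argv[1], errors="replace") as log:
    for line in log:
        if m := start_re.search(line):
            starts[m.group(2)] = datetime.strptime(m.group(1), "%H:%M:%S")
        elif (m := end_re.search(line)) and m.group(2) in starts:
            delta = datetime.strptime(m.group(1), "%H:%M:%S") - starts.pop(m.group(2))
            if delta < timedelta(0):  # run crossed midnight
                delta += timedelta(hours=24)
            print(f"{m.group(2)}: {delta}")
```

Run against a log saved from the consoleFull URL, e.g. `python target_durations.py console.log`.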

Haroon-Khel commented:

A similar observation in https://ci.adoptium.net/job/Test_openjdk20_hs_extended.openjdk_x86-64_windows_testList_0/27/

12:48:18  Running test jvm_compiler_0 ...
16:35:45  jvm_compiler_0_PASSED

under 4hrs

16:35:45  Running test jvm_compiler_1 ...
20:47:03  jvm_compiler_1_PASSED

over 4hrs

Haroon-Khel commented:

On https://ci.adoptium.net/job/Test_openjdk20_hs_extended.openjdk_x86-64_windows_testList_2/29/consoleFull

06:19:29  Running test jdk_beans_1 ...
07:19:13  jdk_beans_1_PASSED

1hr

07:30:19  Running test jdk_net_0 ...
09:47:47  jdk_net_0_PASSED

over 2hrs

11:02:46  Running test jdk_security3_0 ...
15:12:19  jdk_security3_0_PASSED

over 4hrs


Haroon-Khel commented Mar 30, 2023

Sparcv9 extended openjdk

https://ci.adoptium.net/job/Test_openjdk8_hs_extended.openjdk_sparcv9_solaris_testList_1/72/consoleFull on build-siteox-solaris10u11-sparcv9-1

01:40:16  Running test hotspot_jre_0 ...
05:47:15  hotspot_jre_0_FAILED

4hrs

05:47:15  Running test jdk_other_0 ...
08:06:12  jdk_other_0_PASSED

over 2hrs

09:43:39  Running test jdk_security4_0 ...
11:06:14  Cancelling nested steps due to timeout

over 1hr

https://ci.adoptium.net/job/Test_openjdk8_hs_extended.openjdk_sparcv9_solaris_testList_1/71/consoleFull on test-siteox-solaris10u11-sparcv9-1

01:16:05  Running test hotspot_jre_0 ...
05:18:45  hotspot_jre_0_FAILED

4hrs again

05:18:46  Running test jdk_other_0 ...
07:55:45  jdk_other_0_PASSED

over 2hrs again

jdk_security4_0, jdk_nio_0 took 1hr

https://ci.adoptium.net/job/Test_openjdk8_hs_extended.openjdk_sparcv9_solaris_testList_1/68/consoleFull on test-siteox-solaris10u11-sparcv9-1

05:03:35  Running test hotspot_jre_0 ..
08:58:16  hotspot_jre_0_FAILED

close to 4hrs

08:58:16  Running test jdk_other_0 ...
11:15:58  jdk_other_0_PASSED

over 2hrs

jdk_security4_0, jdk_nio_0 about 1hr

A clear pattern: hotspot_jre_0 and jdk_other_0 seem to be hanging.


Haroon-Khel commented Mar 30, 2023

https://ci.adoptium.net/job/Grinder/6986/console was a grinder from the March release. I increased the timeout from its default 10 to 20 because the original extended openjdk job aborted due to timeout. The grinder ran from 13:22:06 to 00:35:47, so around 11 hrs.

I propose that until we can figure out why tests are hanging, the default timeout should be set to 15 (if 15 does in fact mean 15 hrs), so at least these test jobs can complete.

@smlambert What do you think?

smlambert commented:

It is measured in hours and we can certainly adjust the TIME_LIMIT (across the board or for specific platforms).


Haroon-Khel commented Mar 31, 2023

I would prefer across the board. Aside from changing the job config of every extended.openjdk job, is there a quicker way to adjust the default TIME_LIMIT parameter?

If I change it here, here and here, do you know if it updates it for the existing jenkins jobs?

smlambert commented:

Changing it in those places only matters if the setting in the jobs is left blank, which we typically never do. So, I suggest we regenerate the extended.openjdk test jobs with the desired value across all platforms and versions.

regen JDK8 extended.openjdk - Test_Job_Auto_Gen/986
regen JDK11,17 extended.openjdk - Test_Job_Auto_Gen/987
regen JDK20 extended.openjdk - Test_Job_Auto_Gen/988


sxa commented Apr 3, 2023

I'm going to copy the comment from #2923 (comment) into here as another data point, and close the other issue as a duplicate:

ubuntu2110 took 9h30
ubuntu2204 took 6h30
Both passed all 13,931 tests.
Both were started at pretty much the same time (to within a minute), but one of them just seemed to get through everything a bit more slowly:
times.2583.gz
times.6583.gz


adamfarley commented Jun 7, 2023

JDK8 extended.openjdk testing on Solaris SparcV9 appears to be timing out regularly, especially the target jdk_security3_0.

At one point it was suggested that AsyncSSLSocketClose.java may be responsible, but I ran it on its own here and it seems to run and pass without a problem.

So to identify the slow runner/s in security3, I've run another job here with an extended timeout. If it hasn't finished by this time tomorrow, I'll explore other options.

Notes for future testing:

  • Regular javacore generation could be useful (once every few minutes); a rough sketch follows below.
  • Perhaps something that announces each time a unit test starts and finishes, complete with test name and timestamp.
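A minimal sketch of the first idea, assuming a HotSpot JVM where `jcmd <pid> Thread.print` is available (on OpenJ9 the equivalent would be triggering a javacore); the interval, file naming, and the way the PID is supplied are all arbitrary choices here, not part of any existing harness.

```python
# Sketch: capture a thread dump from the JVM under test every few minutes so a
# hang can be diagnosed afterwards. Assumes jcmd is on the PATH and the PID of
# the JVM running the tests is passed on the command line.
import subprocess
import sys
import time

pid = sys.argv[1]      # PID of the JVM running the tests
interval = 300         # seconds between dumps

while True:
    stamp = time.strftime("%Y%m%d-%H%M%S")
    try:
        dump = subprocess.run(["jcmd", pid, "Thread.print"],
                              capture_output=True, text=True, timeout=60)
        with open(f"threaddump-{pid}-{stamp}.txt", "w") as out:
            out.write(dump.stdout or dump.stderr)
    except subprocess.TimeoutExpired:
        print(f"{stamp}: jcmd timed out", file=sys.stderr)
    time.sleep(interval)
```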


adamfarley commented Jun 8, 2023

Ok, update on SparcV9:

jdk_security3 passes if given enough time to run. This job (mentioned above) took almost exactly 10 hours to run, by itself.

Looking at the unit test runtimes, here are the biggest offenders:

| Test | Duration |
| --- | --- |
| sun/security/rsa/SignatureTest | 2 hr 4 min |
| sun/security/tools/keytool/WeakAlg | 48 min |
| sun/security/tools/keytool/standard | 38 min |
| sun/security/tools/keytool/PSS | 37 min |
| sun/security/util/math/TestIntegerModuloP | 26 min |
| javax/net/ssl/TLSCommon/TLSTest | 21 min |
| sun/security/rsa/pss/SignatureTestPSS | 14 min |
| sun/security/provider/DSA/SupportedDSAParamGen | 14 min |
| sun/security/provider/DSA/TestAlgParameterGenerator | 11 min |

So removing SignatureTest would knock this target down to an 8 hour runtime, and removing all 9 of these tests would reduce the duration by 5 hrs 33 mins.
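For reference, the 5 hrs 33 mins figure is just the sum of the per-test durations in the table above:

```python
# Quick check: total of the durations listed in the table, in minutes.
minutes = [2 * 60 + 4, 48, 38, 37, 26, 21, 14, 14, 11]
total = sum(minutes)                          # 333 minutes
print(f"{total // 60} hr {total % 60} min")   # -> "5 hr 33 min"
```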

I'll exclude the test in a branch and run the full extended.openjdk suite to see if this fixes things. I'll also extend the timeout on these jobs as well, so we can identify any other long-running targets.

Update: Exclude file appears to be ready. Tests running here.

Update 2023/09/06 AdamFarley: Turns out exclusions work better when you include the test name file extensions. However, we were able to identify some long runners in the other test targets:

| Test | Duration |
| --- | --- |
| demo/zipfs/ZipFSOutputStreamTest | 2 hr 17 min |
| compiler/7184394/TestAESMain | 1 hr 23 min |
| com/sun/crypto/provider/KeyAgreement/SupportedDHParamGens | 29 min |
| tools/pack200/CommandLineTests | 18 min |

Rerunning without these tests here.
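On the file-extension note above: the openjdk exclude files appear to follow the jtreg ProblemList layout of test path, issue reference and platform list, so a hypothetical entry for one of these long runners might look something like the following (the issue link and platform string here are placeholders, not the actual values used):

```
# Illustrative only -- format assumed from the jtreg ProblemList convention:
# <test, including its .java/.sh extension>   <issue reference>   <platform(s)>
demo/zipfs/ZipFSOutputStreamTest.java   <issue-link>   solaris-sparcv9
```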

Note: I noticed some of these were already excluded for zLinux, which doesn't have a JIT and is known for being slow as a result. It's more correlation than causation at this stage. I see the sparc Temurin build uses Server VM instead of zLinux's Zero VM, which implies that it isn't interpreter-only on sparc. However, further investigation (if warranted) could begin by analysing JIT activity during the longer-running tests (like ZipFSOutputStreamTest) and comparing it to the (JIT-on) throughput on another OS/architecture that has similar throughput when run with the JIT off.

Or you could just run the test with -Xint and see whether it gets any slower on sparc. If not, then this could be an "inactive JIT" situation.

adamfarley commented:

I reran with a new exclusion file here, and the two testList subjobs ran for 10 hours (ish) each.

@Haroon-Khel - I propose we merge this PR and increase the SparcV9 time limit to prevent the "ish" from pushing us a hair over the default 10 hours. What do you think?

Haroon-Khel commented:

I'm for increasing the timeout to 15 hrs, as it worked here: #2893 (comment). But which PR are you referring to?


adamfarley commented Jun 19, 2023

This changeset as a PR, I meant to say. The changeset excludes a bunch of the longer running tests, which I tested in the links above. I've created the PR now. Here's a link.

adamfarley commented:

Without that PR, we'll need to increase the timeout to more than 15 hours (per each of the two testlists), or we'll need to run these tests in more than 2 testlists, which would require:

  • Making a testlist job able to wait for over 10 hours for a machine to free up.
  • Making sure we can run with concurrency able to be set to more than the number of machines.
  • Coming to terms with the idea that individual unit tests can run for over 2 hours.

It seemed simpler to just exclude the biggest offenders for the overrun and lengthen the timeout.

smlambert commented:

Noting that queue time is not part of the overall TIME_LIMIT value, so you would not hit the 10 hr default time limit if you waited for 10 hrs for a machine to become free. The time limit starts the countdown at the setup stage.

adamfarley commented:

I didn't know that. Thanks Shelley.
