
Previously passing configuration times out on GH actions #304

Closed · zakkak opened this issue Oct 20, 2021 · 13 comments

Labels: affects/21.3, bug, testing



zakkak commented Oct 20, 2021

Description

Mandrel 21.3-dev appears to have been failing for the last few days.

The last passing run was https://github.com/graalvm/mandrel/runs/3901801751?check_suite_focus=true with the following configuration:

| component         | version                                  |
| ----------------- | ---------------------------------------- |
| mandrel           | 2d827b9                                  |
| mandrel-packaging | 9bbf83a0a813c778762279faa0460661810ec554 |
| quarkus           | a2c81efd6607d270dc7eaa9b7c38acb849c85030 |
| jdk               | 11.0.13-ea+7                             |
| mx                | 5.309.2                                  |

Re-running the exact same configuration on GH runners results in timeouts or failures (see https://github.com/graalvm/mandrel/runs/3947878999?check_suite_focus=true)

How To Reproduce

Steps to reproduce the behavior:

gh workflow run base.yml \
  -f quarkus-version=a2c81efd6607d270dc7eaa9b7c38acb849c85030 \
  -f name="rerun last working config for 21.3" \
  -f jdk=ea \
  -f version=2d827b9970ac5e45dc1bb3e997aea4e3333d6939 \
  -f mandrel-packaging-version=9bbf83a0a813c778762279faa0460661810ec554  
zakkak added the bug label Oct 20, 2021
zakkak self-assigned this Oct 20, 2021
zakkak added the testing label Oct 20, 2021

zakkak commented Oct 20, 2021

GraalVM CE builds with the same configuration seem to be working fine (see https://github.com/graalvm/mandrel/runs/3948735963?check_suite_focus=true), so this issue shouldn't be related to changes in the GH runners.

Compiling Quarkus' integration test main (with -Dquarkus.native.native-image-xmx=5g) also results in the following:

|                        | GraalVM CE 21.3 | Mandrel 21.3-dev |
| ---------------------- | --------------- | ---------------- |
| Comp time              | ~16.5m          | ~24.5m           |
| # of GCs               | 275             | 301              |
| # of Full GCs          | 125             | 153              |
| Time spent in Full GCs | ~11.5m          | ~19.5m           |
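For reference, a rough sketch of how such a run can be launched with GC logging enabled for the image-builder JVM (the Quarkus properties and the -J prefix are real; the module path, Maven invocation, and log file name are assumptions):

# Hypothetical invocation from a Quarkus checkout; -J forwards the option to
# the JVM that runs the native-image builder.
cd quarkus/integration-tests/main
mvn package -Dnative \
  -Dquarkus.native.native-image-xmx=5g \
  -Dquarkus.native.additional-build-args="-J-Xlog:gc*:file=native-image-gc.log"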

Relates to: #288


zakkak commented Oct 20, 2021

As the above numbers indicate, building main results in both Mandrel and GraalVM CE allocating enough data for heap usage to get close to the heap capacity, and thus in an excessive number of GCs.

Removing -Dquarkus.native.native-image-xmx=5g, the performance is much better and the number of Full GCs drops to only 11 (for both Mandrel and GraalVM CE).

That said, it appears that Mandrel is performing slightly more allocations than GraalVM CE, resulting in higher heap usage and thus stressing the GC even more.

What's even more interesting is that GraalVM CE built from source without libgraal [1] appears to perform better than both GraalVM CE and Mandrel.

|                        | GraalVM CE 21.3 | Mandrel 21.3-dev | GraalVM CE nolibgraal |
| ---------------------- | --------------- | ---------------- | --------------------- |
| Comp time              | ~16.5m          | ~24.5m           | ~15m                  |
| # of GCs               | 275             | 301              | 245                   |
| # of Full GCs          | 125             | 153              | 82                    |
| Time spent in Full GCs | ~11.5m          | ~19.5m           | ~9.8m                 |

[1]

mx --primary-suite-path substratevm --components='Native Image' --native-images=native-image build


zakkak commented Oct 20, 2021

FWIW: Building mandrel with labsjdk (to eliminate the chance of this being an artifact of using different base JDKs) and compiling with -Dquarkus.native.native-image-xmx=5g results in:

Fatal error:java.lang.OutOfMemoryError: GC overhead limit exceeded


jerboaa commented Oct 21, 2021

> FWIW: Building mandrel with labsjdk (to eliminate the chance of this being an artifact of using different base JDKs) and compiling with -Dquarkus.native.native-image-xmx=5g results in:
>
> Fatal error:java.lang.OutOfMemoryError: GC overhead limit exceeded

It could be a symptom of metaspace issues.

@zakkak Could you do a run with this patch added to show RSS? That would give us some clue about non-heap memory in addition to the heap.
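As a rough external alternative (an assumption on my part, not the patch referenced above), the builder's resident set size can also be sampled from /proc while the build runs:

# Sample the image builder's RSS once per second; the pgrep pattern is a guess
# and may need adjusting to match the builder process.
PID=$(pgrep -f native-image | head -n1)
while kill -0 "$PID" 2>/dev/null; do
  grep VmRSS "/proc/$PID/status"
  sleep 1
done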


zakkak commented Oct 21, 2021

> (to eliminate the chance of this being an artifact of using different base JDKs)

It turns out there is indeed at least one difference causing more heap allocations to happen depending on the underlying JDK.

One of them is due to the implementation of HotSpotResolvedJavaFieldImpl.getAnnotation, which in upstream OpenJDK 11 results in the field being copied, while in labsjdk no heap allocations appear to happen (probably due to caching).

HotSpotResolvedJavaFieldImpl.getAnnotation is invoked from AnalysisConstantReflectionProvider.filterInjectedAccessor

Experimenting further, I tried building GraalVM CE without libgraal and without jlinking [1]; the results are the following:

|                        | GraalVM CE 21.3 | Mandrel 21.3-dev | GraalVM CE nolibgraal | GraalVM CE nolibgraal nojlinking |
| ---------------------- | --------------- | ---------------- | --------------------- | -------------------------------- |
| Comp time              | ~16.5m          | ~24.5m           | ~15m                  | ~18.5m                           |
| # of GCs               | 275             | 301              | 245                   | 285                              |
| # of Full GCs          | 125             | 153              | 82                    | 113                              |
| Time spent in Full GCs | ~11.5m          | ~19.5m           | ~9.8m                 | ~13.2m                           |

Judging by the results, it looks like not jlinking has a big impact on the number of allocations. Looking into why... From my understanding so far, this is due to AbstractNativeImageClassLoaderSupport.handleClassFileName being invoked more times and resulting in more allocations.

[1]

mx --primary-suite-path substratevm --no-jlinking --components='Native Image' build


jerboaa commented Oct 21, 2021

> Experimenting further, I tried building GraalVM CE without libgraal and without jlinking [1]; the results are the following:
> [...]
>
> Judging by the results, it looks like not jlinking has a big impact on the number of allocations.

This doesn't explain why we are seeing this now, though. Does it?


zakkak commented Oct 22, 2021

> This doesn't explain why we are seeing this now, though. Does it?

No, that's true.

Since you got me back on track, I performed some more tests and eventually re-ran the exact same run that was failing (which led me to open this issue), and now it passed (see https://github.com/graalvm/mandrel/runs/3968009738?check_suite_focus=true).

Additionally, I found out that the same test fails (on GH actions) with GraalVM CE 21.3 as well, so it's not a Mandrel thing after all... (see https://github.com/quarkusio/quarkus/runs/3966293297?check_suite_focus=true)

graal/master (i.e. latest graal) seems to work fine for both Mandrel and GraalVM CE builds (see https://github.com/graalvm/mandrel/runs/3962138022?check_suite_focus=true and https://github.com/graalvm/mandrel/runs/3959205365?check_suite_focus=true)

Unfortunately, further testing indicates that this might be a transient error; I have seen the same Mandrel version (a3ad15c) build the native image in ~11m, ~17m, and ~25m.

I'll try to run more experiments on an idling laptop to see if they are more stable.


zakkak commented Oct 26, 2021

> I'll try to run more experiments on an idling laptop to see if they are more stable.

The lack of stability appears to be the same on my idling laptop, making this really hard to pin down.

logs/mandrel-21.2.0.2-Final-main-1:[INFO] Total time:  18:14 min
logs/mandrel-21.2.0.2-Final-main-2:[INFO] Total time:  17:48 min
logs/mandrel-21.2.0.2-Final-main-3:[INFO] Total time:  18:31 min
logs/mandrel-21.2.0.2-Final-main-4:[INFO] Total time:  17:58 min
logs/mandrel-21.2.0.2-Final-main-5:[INFO] Total time:  18:10 min

logs/mandrel-21.3.0.0-Final-main-1.txt:[INFO] Total time:  18:22 min
logs/mandrel-21.3.0.0-Final-main-2.txt:[INFO] Total time:  17:39 min
logs/mandrel-21.3.0.0-Final-main-3.txt:[INFO] Total time:  29:56 min
logs/mandrel-21.3.0.0-Final-main-4.txt:[INFO] Total time:  19:35 min
logs/mandrel-21.3.0.0-Final-main-5.txt:[INFO] Total time:  33:06 min
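(For reference, the per-run times above can be pulled out of the build logs roughly like this; the directory layout is an assumption:)

# Print the Maven "Total time" line of each run, prefixed with the log file name.
grep -H 'Total time' logs/mandrel-*-main-*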

Passing -H:-ParseOnce seems to:

  1. Improve compilation time by ~3x (in contrast to what I wrongfully observed when testing without constraining the memory usage on a9a2f40)
  2. Increase stability
  3. Decrease the maximum live data set (as observed in the GC logs) from 3.9G to 2.7G. The live data increases in the "universe" phase.

(A sketch of how the flag can be passed through Quarkus follows the timings below.)

logs/mandrel-21.2.0.2-Final-main-noParseOnce-1.txt:[INFO] Total time:  09:59 min
logs/mandrel-21.2.0.2-Final-main-noParseOnce-2.txt:[INFO] Total time:  04:45 min
logs/mandrel-21.2.0.2-Final-main-noParseOnce-3.txt:[INFO] Total time:  04:45 min
logs/mandrel-21.2.0.2-Final-main-noParseOnce-4.txt:[INFO] Total time:  04:48 min
logs/mandrel-21.2.0.2-Final-main-noParseOnce-5.txt:[INFO] Total time:  04:45 min

logs/mandrel-21.3.0.0-Final-main-noParseOnce-1.txt:[INFO] Total time:  05:54 min
logs/mandrel-21.3.0.0-Final-main-noParseOnce-2.txt:[INFO] Total time:  05:42 min
logs/mandrel-21.3.0.0-Final-main-noParseOnce-3.txt:[INFO] Total time:  05:36 min
logs/mandrel-21.3.0.0-Final-main-noParseOnce-4.txt:[INFO] Total time:  05:38 min
logs/mandrel-21.3.0.0-Final-main-noParseOnce-5.txt:[INFO] Total time:  05:28 min
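As mentioned above, a minimal sketch of passing the flag through Quarkus (the properties are real Quarkus/native-image options; the Maven invocation is an assumption):

# Hypothetical invocation; -H:-ParseOnce is forwarded verbatim to native-image.
mvn package -Dnative \
  -Dquarkus.native.native-image-xmx=5g \
  -Dquarkus.native.additional-build-args="-H:-ParseOnce"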

Some open questions:

  1. How did this not pop up earlier in our CI? Quarkus main has not been using -H:-ParseOnce since 0be0406

  2. What causes the large variance in compilation times?

  3. Why does this only affect integration-tests/main to that degree?


Karm commented Oct 26, 2021

@zakkak I might be bringing owls to Athens with this "hacker news" style pro tip... but have you disabled turbo boost to get more consistent perf? I guess fiddling with /sys/devices/system/cpu/intel_pstate/no_turbo on my workstation would do. Not sure about your system...
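For reference, toggling it on a system that uses the intel_pstate driver looks roughly like this (requires root; the sysfs path is the one mentioned above and differs with other drivers):

# Writing 1 disables turbo boost, writing 0 re-enables it.
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo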


zakkak commented Oct 26, 2021

> @zakkak I might be bringing owls to Athens with this "hacker news" style pro tip... but have you disabled turbo boost to get more consistent perf? I guess fiddling with /sys/devices/system/cpu/intel_pstate/no_turbo on my workstation would do. Not sure about your system...

That's probably worth a try, although the fact that the big variance is consistent on both my laptop and GH actions reduces my expectations (and is the main reason I didn't bother trying it yet). Note also that even in 21.2 we see excessive GC (which would result in throttling as well), but the variance only appears in 21.3...

PS: That's definitely not a "hacker news" style pro tip. I assure you there are plenty of academic papers (not to mention blog posts) out there that report X% of improvement without factoring in DVFS. Thanks for bringing it up :)

zakkak added a commit to zakkak/quarkus that referenced this issue Oct 26, 2021
Disable single parsing of compiler graphs till the impact of it on heap
usage decreases see oracle/graal#3435 and
graalvm/mandrel#304 (comment)
zakkak added a commit to zakkak/quarkus that referenced this issue Oct 26, 2021
Disable single parsing of compiler graphs till the impact of it on heap
usage decreases see oracle/graal#3435 and
graalvm/mandrel#304 (comment)

quarkusio#19511 didn't take into account
runs with constraint memory.  Bringing back `-H:-ParseOnce` reduces the
heap usage significantly, from 3.9G down to 2.7G in
integration-tests/main.
zakkak added a commit to zakkak/quarkus that referenced this issue Oct 26, 2021
Disable single parsing of compiler graphs till the impact of it on heap
usage decreases see oracle/graal#3435 and
graalvm/mandrel#304 (comment)

quarkusio#19511 didn't take into account
runs with constraint memory.  Bringing back `-H:-ParseOnce` reduces the
heap usage significantly, from 3.9G down to 2.7G in
integration-tests/main.

zakkak commented Oct 27, 2021

Looking at the GC logs, I observed that we spend a lot of time in Full GCs, which means that the JVM fails to find free space in the young generation and has to perform expensive Full GCs to reclaim memory, e.g.:

--
[949.815s][info][gc,heap        ] GC(167) PSYoungGen: 840192K->797743K(1293824K)
[949.815s][info][gc,heap        ] GC(167) ParOldGen: 3495027K->3495275K(3495424K)
[949.815s][info][gc,metaspace   ] GC(167) Metaspace: 495297K->495297K(1468416K)
[949.815s][info][gc             ] GC(167) Pause Full (Ergonomics) 4233M->4192M(4677M) 11731.041ms
--
[955.809s][info][gc,heap        ] GC(168) PSYoungGen: 840192K->736113K(1293824K)
[955.809s][info][gc,heap        ] GC(168) ParOldGen: 3495275K->3495150K(3495424K)
[955.809s][info][gc,metaspace   ] GC(168) Metaspace: 495297K->495297K(1468416K)
[955.809s][info][gc             ] GC(168) Pause Full (Ergonomics) 4233M->4132M(4677M) 5785.816ms
--
[968.554s][info][gc,heap        ] GC(169) PSYoungGen: 840192K->740403K(1293824K)
[968.554s][info][gc,heap        ] GC(169) ParOldGen: 3495150K->3495217K(3495424K)
[968.554s][info][gc,metaspace   ] GC(169) Metaspace: 495300K->495300K(1468416K)
[968.554s][info][gc             ] GC(169) Pause Full (Ergonomics) 4233M->4136M(4677M) 12270.639ms
[968.554s][info][gc,cpu         ] GC(169) User=92.73s Sys=0.07s Real=12.27s
[quarkus-integration-test-main-999-SNAPSHOT-runner:660320]     universe: 304,192.26 ms,  4.57 GB

Notice how we spend ~30 seconds to reclaim ~250M... As we can see, in all 3 GCs we are unable to reclaim memory from the young generation (it has a lot of survivors), so a Full GC is triggered (due to GC ergonomics) in an effort to reclaim memory from the old generation and promote the survivors.

At this stage it's clear that we are operating at the limits of the heap capacity and that we are spending unnecessary CPU cycles trying to optimize heap usage due to GC ergonomics. Initially I tried finding a way to disable the GC ergonomics (or at least tune the heuristics) but I was not able to do so. Thanks to @shipilev, who pointed me to UseAdaptiveGCBoundary as an alternative to try out (a sketch of how it can be passed to the image builder appears below), I was able to get the following results:

logs/mandrel-21.3.0.0-Final-main-UseAdaptiveGCBoundary-1.txt:[INFO] Total time:  07:05 min
logs/mandrel-21.3.0.0-Final-main-UseAdaptiveGCBoundary-2.txt:[INFO] Total time:  07:59 min
logs/mandrel-21.3.0.0-Final-main-UseAdaptiveGCBoundary-3.txt:[INFO] Total time:  07:38 min
logs/mandrel-21.3.0.0-Final-main-UseAdaptiveGCBoundary-4.txt:[INFO] Total time:  08:07 min

Notice how they are still worse than:

logs/mandrel-21.3.0.0-Final-main-noParseOnce-5.txt:[INFO] Total time:  05:28 min

but the overhead is far less significant.
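As mentioned above, a minimal sketch of forwarding the flag to the JVM that runs the image builder (the -J prefix and the Quarkus property are real; the Maven invocation is an assumption):

# Hypothetical invocation; -J passes the option to the image builder's JVM.
mvn package -Dnative \
  -Dquarkus.native.native-image-xmx=5g \
  -Dquarkus.native.additional-build-args="-J-XX:+UseAdaptiveGCBoundary"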

Using UseAdaptiveGCBoundary we see the following:

[182.670s][info][gc,heap        ] GC(71) PSYoungGen: 44523K->0K(1014272K)
[182.670s][info][gc,heap        ] GC(71) ParOldGen: 3760404K->3774730K(4096000K)
[182.670s][info][gc,metaspace   ] GC(71) Metaspace: 490450K->490450K(1464320K)
[182.670s][info][gc             ] GC(71) Pause Full (Ergonomics) 3715M->3686M(4990M) 5603.479ms
--
[193.125s][info][gc,heap        ] GC(75) PSYoungGen: 44513K->0K(891392K)
[193.125s][info][gc,heap        ] GC(75) ParOldGen: 3914472K->3936429K(4218880K)
[193.125s][info][gc,metaspace   ] GC(75) Metaspace: 492746K->492746K(1466368K)
[193.125s][info][gc             ] GC(75) Pause Full (Ergonomics) 3866M->3844M(4990M) 4521.975ms
--
[212.629s][info][gc,heap        ] GC(81) PSYoungGen: 44539K->0K(807936K)
[212.629s][info][gc,heap        ] GC(81) ParOldGen: 4054251K->4057183K(4302336K)
[212.629s][info][gc,metaspace   ] GC(81) Metaspace: 493543K->493543K(1466368K)
[212.629s][info][gc             ] GC(81) Pause Full (Ergonomics) 4002M->3962M(4990M) 5725.366ms
--
[229.383s][info][gc,metaspace   ] GC(88) Metaspace: 493956K->493956K(1466368K)
[229.383s][info][gc             ] GC(88) Pause Young (Allocation Failure) 4877M->4045M(4990M) 90.973ms
[229.383s][info][gc,cpu         ] GC(88) User=0.59s Sys=0.00s Real=0.09s
[quarkus-integration-test-main-999-SNAPSHOT-runner:316334]     (clinit):   1,779.54 ms,  4.87 GB

Notice how the live data set constantly increases up to ~4.9G and how we still spend a lot of time due to GC ergonomics trying to reclaim memory, but this time at a far lower frequency than before.
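For reference, a quick way to summarize the Full GC pauses from such a unified GC log (the log file name is an assumption; the line format matches the snippets above):

# Count the "Pause Full" entries and total up their durations.
grep 'Pause Full' native-image-gc.log \
  | sed -E 's/.* ([0-9.]+)ms$/\1/' \
  | awk '{ sum += $1; n += 1 } END { printf "%d Full GCs, %.1f s total\n", n, sum / 1000 }'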

FTR the different compilation phases have different memory footprints as follows:

with +ParseOnce

  • classlist phase has a live data set of <300MB
  • clinit phase has a live data set of 3.9G-4G, which is kept alive until the universe phase is complete
  • parse phase has a live data set of 3.3G
  • inline phase has a live data set of 2.5G
  • compile phase has a live data set of 2.9G
  • image phase has a live data set of 1.5G

with -ParseOnce

  • classlist phase has a live data set of <300MB
  • clinit phase has a live data set of ~2.1G, which is kept alive until the universe phase is complete
  • parse phase has a live data set of 2.2G
  • inline phase has a live data set of 3G
  • compile phase has a live data set of 3.2G
  • image phase has a live data set of 1.7G

zakkak added a commit to zakkak/quarkus that referenced this issue Oct 27, 2021
Disable single parsing of compiler graphs till the impact of it on heap
usage decreases see oracle/graal#3435 and
graalvm/mandrel#304 (comment)

quarkusio#19511 didn't take into account
runs with constraint memory.  Bringing back `-H:-ParseOnce` reduces the
heap usage significantly, from 3.9G down to 2.7G in
integration-tests/main.
zakkak added a commit to zakkak/quarkus that referenced this issue Oct 28, 2021
Disable single parsing of compiler graphs till the impact of it on heap
usage decreases see oracle/graal#3435 and
graalvm/mandrel#304 (comment)

quarkusio#19511 didn't take into account
runs with constraint memory.  Bringing back `-H:-ParseOnce` reduces the
heap usage significantly, from 3.9G down to 2.7G in
integration-tests/main.

zakkak commented Nov 2, 2021

FTR the commit that increased the memory usage and triggered the CI failures appears to be ce05101.

It's also worth noting that trying -H:-IncludeMethodData (an option added in ce05101) locally didn't have a big impact.


zakkak commented Nov 4, 2021

I am closing this issue as my current understanding is that we were already stressing the JVM and ce05101 resulted in a small increase in memory usage that ended up consistently causing the CI runs to fail.

For the time being we have reverted Quarkus to using -H:-ParseOnce (see quarkusio/quarkus#21018) to avoid the issue.

I plan to further investigate why -H:+ParseOnce has such a big impact in the case of Quarkus' integration-tests/main, see #316.

zakkak closed this as completed Nov 4, 2021
gsmet pushed a commit to gsmet/quarkus that referenced this issue Dec 1, 2021
Disable single parsing of compiler graphs till the impact of it on heap
usage decreases see oracle/graal#3435 and
graalvm/mandrel#304 (comment)

quarkusio#19511 didn't take into account
runs with constraint memory.  Bringing back `-H:-ParseOnce` reduces the
heap usage significantly, from 3.9G down to 2.7G in
integration-tests/main.

(cherry picked from commit 201b9a6)
gsmet pushed a commit to gsmet/quarkus that referenced this issue Dec 1, 2021
Disable single parsing of compiler graphs till the impact of it on heap
usage decreases see oracle/graal#3435 and
graalvm/mandrel#304 (comment)

quarkusio#19511 didn't take into account
runs with constraint memory.  Bringing back `-H:-ParseOnce` reduces the
heap usage significantly, from 3.9G down to 2.7G in
integration-tests/main.

(cherry picked from commit 201b9a6)