Would making `diff_` thresholds a percentage instead of absolute values make more sense? #206

zakkak · 2023-09-21T15:57:08Z

Right now, diff_* thresholds for performance regression testing are defined as absolute numbers, e.g.:

mandrel-integration-tests/apps/jfr-native-image-performance/threshold.properties

linux.diff_native.time.to.first.ok.request.threshold.ms=80

This sometimes results in test failures when running on different machines than the one used to tune the thresholds.

However, I think that checking whether the increase is within an acceptable range, e.g. 5%, would probably make more sense. After all, a 50 ms increase on a 10 ms run is huge, while on a 5 s run it's negligible.
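To make the difference concrete, here is a minimal sketch of the two kinds of check. It is not code from mandrel-integration-tests; the class and method names (`PercentageThreshold`, `percentIncrease`, `withinAbsolute`, `withinPercentage`) are hypothetical, and the 80 and 5% limits are simply the numbers mentioned above.

```java
// Hypothetical helper, not part of the test suite: contrasts an absolute
// threshold with a percentage threshold on the same measurements.
final class PercentageThreshold {

    /** Increase of current over baseline, in percent (negative if current is lower). */
    public static double percentIncrease(double baseline, double current) {
        if (baseline <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (current - baseline) * 100.0 / baseline;
    }

    /** Absolute check: passes while the increase stays within a fixed amount. */
    public static boolean withinAbsolute(double baseline, double current, double maxIncrease) {
        return current - baseline <= maxIncrease;
    }

    /** Percentage check: passes while the increase stays within a share of the baseline. */
    public static boolean withinPercentage(double baseline, double current, double maxPercent) {
        return percentIncrease(baseline, current) <= maxPercent;
    }

    public static void main(String[] args) {
        // A 50 ms increase on a 10 ms run: the absolute limit of 80 lets it through,
        // the 5% limit does not (the increase is 500%).
        System.out.println(withinAbsolute(10, 60, 80));      // true
        System.out.println(withinPercentage(10, 60, 5));     // false
        // The same 50 ms increase on a 5 s run is only 1%, so the 5% limit accepts it.
        System.out.println(withinPercentage(5000, 5050, 5)); // true
    }
}
```

This sketch only flags increases; a check that should also flag unexpected improvements would compare the absolute value of the change instead.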

I wonder if switching to percentages would also allow us to perform the regression testing (only for diffs between runs) on various machines (including GitHub runners) without losing accuracy.

cc @Karm @jerboaa


roberttoyonaga · 2023-09-26T20:17:10Z

Hi @zakkak, just chiming in here: the JFR perf test thresholds are specified as a relative change ( |new - old| / old ). Maybe something similar could make sense elsewhere too.

Karm · 2023-10-03T20:52:19Z

Definitely makes sense. It requires recording a JVM run as a baseline, but that already happens; see the notion of diff_jvm and diff_native suffixes in threshold.properties.

zakkak · 2023-10-04T10:55:55Z

@Karm what percentage would you consider acceptable?

Karm · 2023-10-04T12:58:01Z

> @Karm what percentage would you consider acceptable?

There are 2 things:

1) The % difference between JVM and Native (time-to-first-ok-request, time-to-complete, RSS), i.e. is it acceptable that Native's time-to-complete is 10% worse, etc.
2) And then there is a deviation from some hardcoded value.

I'd focus on 2) and I'd hardcode values from a Q 2.13.8.Final, M 22.3.3.1-Final run on a reference system.
I'd run again with Q 2.16.9.Final, M 22.3.3.1-Final on a reference system and record the percentage difference.
That is what I'd use as the acceptable percentage to judge the success or failure of Quarkus 3.x and M 23.x.
By reference system I mean one of the stock 8-core, 16 GB RAM, RHEL 8, contemporary Xeon-backed VMs I use, as they have a pretty stable profile.
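
The calibration described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the numbers in `main` are placeholders rather than measurements, and treating only increases over the hardcoded value as failures is an assumption about how "deviation" would be judged.

```java
// Hypothetical sketch of deriving an acceptable percentage from two
// reference-system runs and applying it to a later candidate run.
final class ThresholdCalibration {

    /** Percentage difference between two reference runs, relative to the older one. */
    public static double percentDifference(double olderRun, double newerRun) {
        if (olderRun <= 0) {
            throw new IllegalArgumentException("olderRun must be positive");
        }
        return Math.abs(newerRun - olderRun) * 100.0 / olderRun;
    }

    /** Passes if the candidate exceeds the hardcoded baseline by at most tolerancePercent. */
    public static boolean passes(double hardcodedBaseline, double candidate, double tolerancePercent) {
        double increasePercent = (candidate - hardcodedBaseline) * 100.0 / hardcodedBaseline;
        return increasePercent <= tolerancePercent;
    }

    public static void main(String[] args) {
        // Placeholder values only, NOT real results from any Quarkus/Mandrel run.
        double olderReferenceRun = 1000; // e.g. the hardcoded value from the first reference run
        double newerReferenceRun = 1040; // e.g. the second reference run on the same system

        // The recorded difference between the two reference runs becomes the tolerance.
        double tolerance = percentDifference(olderReferenceRun, newerReferenceRun);

        // Later candidates are judged against the hardcoded value with that tolerance.
        System.out.println(passes(olderReferenceRun, 1030, tolerance)); // true
        System.out.println(passes(olderReferenceRun, 1100, tolerance)); // false
    }
}
```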

zakkak mentioned this issue Nov 24, 2023

[CI] Mandrel integration tests fail with Java 21 mandrel build of mandrel/23.1 on Linux #198

Open
