[Perf] Regressions in math based array benchmarks #52316

DrewScoggins · 2021-05-05T18:15:09Z

Run Information

Architecture	x64
OS	ubuntu 18.04
Baseline	ac82799250fe42c8ba2941aef487aca94ecf04cc
Compare	1c9e2009d9942a6df31daad8c912c68446856fa2
Diff	Diff

Regressions in ByteMark

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
BenchLUDecomp - Duration of single invocation	967.28 ms	1.34 secs	1.38	0.00

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'ByteMark*'

Payloads

Baseline
Compare

Histogram

ByteMark.BenchLUDecomp

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture	x64
OS	ubuntu 18.04
Baseline	ac82799250fe42c8ba2941aef487aca94ecf04cc
Compare	1c9e2009d9942a6df31daad8c912c68446856fa2
Diff	Diff

Regressions in Benchstone.BenchI.Array2

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
Test - Duration of single invocation	357.67 ms	566.68 ms	1.58	0.01

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'Benchstone.BenchI.Array2*'

Payloads

Baseline
Compare

Histogram

Benchstone.BenchI.Array2.Test

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture	x64
OS	ubuntu 18.04
Baseline	ac82799250fe42c8ba2941aef487aca94ecf04cc
Compare	1c9e2009d9942a6df31daad8c912c68446856fa2
Diff	Diff

Regressions in BenchmarksGame.FannkuchRedux_2

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
RunBench - Duration of single invocation	126.48 ms	143.40 ms	1.13	0.00

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'BenchmarksGame.FannkuchRedux_2*'

Payloads

Baseline
Compare

Histogram

BenchmarksGame.FannkuchRedux_2.RunBench(n: 10, expectedSum: 73196)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture	x64
OS	ubuntu 18.04
Baseline	ac82799250fe42c8ba2941aef487aca94ecf04cc
Compare	1c9e2009d9942a6df31daad8c912c68446856fa2
Diff	Diff

Regressions in SciMark2.kernel

Benchmark	Baseline	Test	Test/Base	Test Quality	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
benchSparseMult - Duration of single invocation	512.50 ms	610.62 ms	1.19	0.00

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'SciMark2.kernel*'

Payloads

Baseline
Compare

Histogram

SciMark2.kernel.benchSparseMult

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2021-05-05T18:15:12Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

DrewScoggins · 2021-05-05T18:15:49Z

This change seems to be the cause, #51901

AndyAyersMS · 2021-05-10T16:37:58Z

@BruceForstall PTAL
cc @dotnet/jit-contrib

BruceForstall · 2021-06-10T23:48:30Z

The ludcmp issue is an LSRA spill cost issue, detailed in #53703.

The Array2 regression is due to unfortunate loop alignment of the innermost nested loop (of a 4 deep loop nest): in the new, slow case, this innermost loop spans 3 32-byte chunks, whereas before, it only spanned 2. The JIT aligns the loop up to 16 bytes, but that's not enough here. Oddly, I can easily repro this regression on my Linux x64 box, but not my Windows x64 box (where the test didn't show any regression in the perf lab, and in fact has been incredibly consistent, compared to the Linux one, which has moved around a bit). If I set COMPlus_JitAlignLoopAdaptive=0, the performance swaps: the previously "slow" version is across 2 chunks and the previously fast version is across 3 chunks... and, their times swap.

kunalspathak · 2021-06-11T01:23:05Z

So, I just generated disassembly of before (your loop weight change) vs. after, and I see things other way round.

In below screenshot, left is before and right is after. Earier, we were aligning the loop with 4 bytes and making the loop span in 2 32-byte chunks, however after your change, we no longer align the loop because of ;; Skip alignment: 'Loop at G_M30565_IG06 PaddingNeeded= 11, MaxPadding= 8, LoopSize= 56, AlignmentBoundary= 16B.'. With that, the loop spans 3 32-byte chunks and that could be the explaination of why we see regression.

If I set COMPlus_JitAlignLoopAdaptive=0, we align the loop by adding 7 bytes Before your change and non-adaptive alignment (on left below), we would align the loop by 4 bytes. In both cases, the loop fits in 2 32-byte chunks.

Finally, without loop alignment, COMPlus_JitAlignLoops=0, before and after changes, the loop takes 3 32-byte chunks to fit.

To summarize, before your change, due to the loop alignment, we got better performance but after your change, we got the similar performance, had we didn't had loop alignment (modulo the performance impact of align instructions that is inside the 3rd nested loop).

BruceForstall · 2021-06-11T18:03:51Z

I don't understand your "To summarize" sentence.

I think we might be saying the same thing. In the "before my change" case, the inner loop ended up in 2 32-byte chunks. After, it ended up in 3 32-byte chunks. In the "after" case, setting COMPlus_JitAlignLoopAdaptive=0 causes the inner loop to be in 2 32-byte chunks, giving the same performance as the "before" case. (Also, setting COMPlus_JitAlignLoopAdaptive=0 in the "before" case gives worse performance, equivalent to my "after" case, due to the inner loop being in 3 32-byte chunks.)

BruceForstall · 2021-06-11T21:51:08Z

The BenchmarksGame.FannkuchRedux_2 case is mysterious: the slow case includes 2 additional instructions in an un-executed path (loop cloning slow path), but all hot paths have the same instructions (some slightly different register allocation), and the hot inner loops are all properly aligned.

BruceForstall · 2021-06-11T23:25:30Z

The SciMark2 regression (both Windows and Linux) appears due to an additional spill / reload in the hot loop of matmult:

mov      r15d, dword ptr [rsp+24H] // here
cmp      r12d, r15d
mov      dword ptr [rsp+24H], r15d // here

This was fixed by #53853.

BruceForstall · 2021-06-11T23:26:06Z

Since these have all been investigated, and follow-up issues opened, I'm closing this.

DrewScoggins added os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark arch-x64 labels May 5, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label May 5, 2021

jeffschwMSFT added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 6, 2021

JulieLeeMSFT assigned BruceForstall May 10, 2021

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label May 10, 2021

JulieLeeMSFT added this to the 6.0.0 milestone May 10, 2021

BruceForstall mentioned this issue Jun 3, 2021

LSRA spill cost issue #53703

Closed

kunalspathak mentioned this issue Jun 9, 2021

Consider reloadWeight while evaluating spill cost #53853

Merged

BruceForstall mentioned this issue Jun 11, 2021

Loop alignment issue in Array2 benchmark #54072

Closed

BruceForstall closed this as completed Jun 11, 2021

ghost locked as resolved and limited conversation to collaborators Jul 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Perf] Regressions in math based array benchmarks #52316

[Perf] Regressions in math based array benchmarks #52316

DrewScoggins commented May 5, 2021

Payloads

Histogram

ByteMark.BenchLUDecomp

Docs

Payloads

Histogram

Benchstone.BenchI.Array2.Test

Docs

Payloads

Histogram

BenchmarksGame.FannkuchRedux_2.RunBench(n: 10, expectedSum: 73196)

Docs

Payloads

Histogram

SciMark2.kernel.benchSparseMult

Docs

dotnet-issue-labeler bot commented May 5, 2021

DrewScoggins commented May 5, 2021

AndyAyersMS commented May 10, 2021

BruceForstall commented Jun 10, 2021

kunalspathak commented Jun 11, 2021

BruceForstall commented Jun 11, 2021

BruceForstall commented Jun 11, 2021

BruceForstall commented Jun 11, 2021

BruceForstall commented Jun 11, 2021

[Perf] Regressions in math based array benchmarks #52316

[Perf] Regressions in math based array benchmarks #52316

Comments

DrewScoggins commented May 5, 2021

Run Information

Regressions in ByteMark

Repro

Payloads

Histogram

ByteMark.BenchLUDecomp

Docs

Run Information

Regressions in Benchstone.BenchI.Array2

Repro

Payloads

Histogram

Benchstone.BenchI.Array2.Test

Docs

Run Information

Regressions in BenchmarksGame.FannkuchRedux_2

Repro

Payloads

Histogram

BenchmarksGame.FannkuchRedux_2.RunBench(n: 10, expectedSum: 73196)

Docs

Run Information

Regressions in SciMark2.kernel

Repro

Payloads

Histogram

SciMark2.kernel.benchSparseMult

Docs

dotnet-issue-labeler bot commented May 5, 2021

DrewScoggins commented May 5, 2021

AndyAyersMS commented May 10, 2021

BruceForstall commented Jun 10, 2021

kunalspathak commented Jun 11, 2021

BruceForstall commented Jun 11, 2021

BruceForstall commented Jun 11, 2021

BruceForstall commented Jun 11, 2021

BruceForstall commented Jun 11, 2021