Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Perf] Regressions in math based array benchmarks #52316

Closed
DrewScoggins opened this issue May 5, 2021 · 9 comments
Closed

[Perf] Regressions in math based array benchmarks #52316

DrewScoggins opened this issue May 5, 2021 · 9 comments
Assignees
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark
Milestone

Comments

@DrewScoggins
Copy link
Member

Run Information

Architecture x64
OS ubuntu 18.04
Baseline ac82799250fe42c8ba2941aef487aca94ecf04cc
Compare 1c9e2009d9942a6df31daad8c912c68446856fa2
Diff Diff

Regressions in ByteMark

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
BenchLUDecomp - Duration of single invocation 967.28 ms 1.34 secs 1.38 0.00

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'ByteMark*'

Payloads

Baseline
Compare

Histogram

ByteMark.BenchLUDecomp


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture x64
OS ubuntu 18.04
Baseline ac82799250fe42c8ba2941aef487aca94ecf04cc
Compare 1c9e2009d9942a6df31daad8c912c68446856fa2
Diff Diff

Regressions in Benchstone.BenchI.Array2

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
Test - Duration of single invocation 357.67 ms 566.68 ms 1.58 0.01

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'Benchstone.BenchI.Array2*'

Payloads

Baseline
Compare

Histogram

Benchstone.BenchI.Array2.Test


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture x64
OS ubuntu 18.04
Baseline ac82799250fe42c8ba2941aef487aca94ecf04cc
Compare 1c9e2009d9942a6df31daad8c912c68446856fa2
Diff Diff

Regressions in BenchmarksGame.FannkuchRedux_2

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
RunBench - Duration of single invocation 126.48 ms 143.40 ms 1.13 0.00

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'BenchmarksGame.FannkuchRedux_2*'

Payloads

Baseline
Compare

Histogram

BenchmarksGame.FannkuchRedux_2.RunBench(n: 10, expectedSum: 73196)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Run Information

Architecture x64
OS ubuntu 18.04
Baseline ac82799250fe42c8ba2941aef487aca94ecf04cc
Compare 1c9e2009d9942a6df31daad8c912c68446856fa2
Diff Diff

Regressions in SciMark2.kernel

Benchmark Baseline Test Test/Base Test Quality Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
benchSparseMult - Duration of single invocation 512.50 ms 610.62 ms 1.19 0.00

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'SciMark2.kernel*'

Payloads

Baseline
Compare

Histogram

SciMark2.kernel.benchSparseMult


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins DrewScoggins added os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark arch-x64 labels May 5, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label May 5, 2021
@DrewScoggins
Copy link
Member Author

This change seems to be the cause, #51901

@jeffschwMSFT jeffschwMSFT added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 6, 2021
@AndyAyersMS
Copy link
Member

@BruceForstall PTAL
cc @dotnet/jit-contrib

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label May 10, 2021
@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone May 10, 2021
@BruceForstall
Copy link
Member

The ludcmp issue is an LSRA spill cost issue, detailed in #53703.

The Array2 regression is due to unfortunate loop alignment of the innermost nested loop (of a 4 deep loop nest): in the new, slow case, this innermost loop spans 3 32-byte chunks, whereas before, it only spanned 2. The JIT aligns the loop up to 16 bytes, but that's not enough here. Oddly, I can easily repro this regression on my Linux x64 box, but not my Windows x64 box (where the test didn't show any regression in the perf lab, and in fact has been incredibly consistent, compared to the Linux one, which has moved around a bit). If I set COMPlus_JitAlignLoopAdaptive=0, the performance swaps: the previously "slow" version is across 2 chunks and the previously fast version is across 3 chunks... and, their times swap.

@kunalspathak
Copy link
Member

So, I just generated disassembly of before (your loop weight change) vs. after, and I see things other way round.

In below screenshot, left is before and right is after. Earier, we were aligning the loop with 4 bytes and making the loop span in 2 32-byte chunks, however after your change, we no longer align the loop because of ;; Skip alignment: 'Loop at G_M30565_IG06 PaddingNeeded= 11, MaxPadding= 8, LoopSize= 56, AlignmentBoundary= 16B.'. With that, the loop spans 3 32-byte chunks and that could be the explaination of why we see regression.

image

If I set COMPlus_JitAlignLoopAdaptive=0, we align the loop by adding 7 bytes Before your change and non-adaptive alignment (on left below), we would align the loop by 4 bytes. In both cases, the loop fits in 2 32-byte chunks.

image

Finally, without loop alignment, COMPlus_JitAlignLoops=0, before and after changes, the loop takes 3 32-byte chunks to fit.

image

To summarize, before your change, due to the loop alignment, we got better performance but after your change, we got the similar performance, had we didn't had loop alignment (modulo the performance impact of align instructions that is inside the 3rd nested loop).

@BruceForstall
Copy link
Member

I don't understand your "To summarize" sentence.

I think we might be saying the same thing. In the "before my change" case, the inner loop ended up in 2 32-byte chunks. After, it ended up in 3 32-byte chunks. In the "after" case, setting COMPlus_JitAlignLoopAdaptive=0 causes the inner loop to be in 2 32-byte chunks, giving the same performance as the "before" case. (Also, setting COMPlus_JitAlignLoopAdaptive=0 in the "before" case gives worse performance, equivalent to my "after" case, due to the inner loop being in 3 32-byte chunks.)

@BruceForstall
Copy link
Member

The BenchmarksGame.FannkuchRedux_2 case is mysterious: the slow case includes 2 additional instructions in an un-executed path (loop cloning slow path), but all hot paths have the same instructions (some slightly different register allocation), and the hot inner loops are all properly aligned.

@BruceForstall
Copy link
Member

The SciMark2 regression (both Windows and Linux) appears due to an additional spill / reload in the hot loop of matmult:

mov      r15d, dword ptr [rsp+24H] // here
cmp      r12d, r15d
mov      dword ptr [rsp+24H], r15d // here

This was fixed by #53853.

@BruceForstall
Copy link
Member

Since these have all been investigated, and follow-up issues opened, I'm closing this.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 12, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark
Projects
None yet
Development

No branches or pull requests

6 participants