Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loop alignment issue in Array2 benchmark #54072

Closed
BruceForstall opened this issue Jun 11, 2021 · 2 comments
Closed

Loop alignment issue in Array2 benchmark #54072

BruceForstall opened this issue Jun 11, 2021 · 2 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner

Comments

@BruceForstall
Copy link
Member

A recent change #51901 leading to a regression in the Benchstone.BenchI.Array2 benchmark on Ubuntu (but not Windows): #52316.

The core of the benchmark is the Bench function inner loop:

for (; loop != 0; loop--) {
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++) {
            for (int k = 0; k < 10; k++) {
                d[i][j][k] = s[i][j][k];
            }
        }
    }
}

The code of this loop is almost equivalent, modulo register allocation, before and after #51901. The difference is loop alignment: before #51901, the loop fits in 2 32-byte chunks; after, it is in 3 32-byte chunks. On Ubuntu, this leads to about a 50% performance regression. Simply setting COMPlus_JitAlignLoopAdaptive=0 changes the alignment such that the inner loop fits in 2 32-byte chunks, recovering the performance.

This is a high weight basic block; perhaps the alignment heuristics should "try harder" and be willing to insert more alignment padding in case it might be profitable?

@BruceForstall BruceForstall added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 11, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jun 11, 2021
@kunalspathak
Copy link
Member

I have shared my investigation in #52316 (comment). In before case, we align the loop (with 6 bytes), but after your change we don't align it because of other heuristics (loop size vs. no. of padding needed).

This is a high weight basic block; perhaps the alignment heuristics should "try harder"

We already have COMPlus_JitAlignLoopMinBlockWeight that tells if the alignment should be done, and in this case, we do try to align, but we want to restrict ourselves in adding too much padding because of code size reason, hence we have conservative heuristics. I don't think we can do much for this case; to put in other way - Before your change, the benchmark was benefitting from loop alignment, but after your change, it is not.

@kunalspathak
Copy link
Member

No action needed here.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

No branches or pull requests

2 participants