Arm64: In mod operation happening inside the loop, if divisor is an invariant, hoist the divisor checks #64795

kunalspathak · 2022-02-04T06:49:30Z

Arm64 doesn't have modulo operator, so we convert the operation to div/mul/sub format. However, we don't do much to the tree node until codegen where we generate code for it. During codegen, it is too late to know that divisor was invariant otherwise we can hoist some extra checks like divisor == 0 and divisor == -1.

In below example, when both dividend (x) and divisor (y) are invariant, we hoist the checks properly, but when dividend is not invariant (result) but divisor is invariant (y), we can hoist the checks out of the loop.

public static int issue2(int x, int y, int z)
{
    int result = 0;
    for (int i = 0; i < z; i++)
    {
        //result = x % y; <-- this hoist things properly because both dividend and divisor are invariant.
        result = result % y;
    }
    return result;
}

...
G_M61875_IG03:
            cmp     w0, #0   ; Check# 1 divisor == 0, can be hoisted 
            beq     G_M61875_IG08
            cmn     w0, #1  ; Check# 2 divisor == -1, can be hoisted 
            bne     G_M61875_IG04
            adds    wzr, w1, w1
            bne     G_M61875_IG04
            bvs     G_M61875_IG07
                                                ;; bbWeight=4    PerfScore 22.00
G_M61875_IG04:
            sdiv    w4, w1, w0
            mul     w4, w4, w0
            sub     w1, w1, w4
            add     w3, w3, #1
            cmp     w3, w2
            blt     G_M61875_IG03
                                                ;; bbWeight=4    PerfScore 62.00
G_M61875_IG05:
            mov     w0, w1
                                                ;; bbWeight=1    PerfScore 0.50
G_M61875_IG06:
            ldp     fp, lr, [sp],#16
            ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
G_M61875_IG07:
            bl      CORINFO_HELP_OVERFLOW
                                                ;; bbWeight=0    PerfScore 0.00
G_M61875_IG08:
            bl      CORINFO_HELP_THROWDIVZERO
            brk_windows #0

category:cq
theme:div-mod-rem
skill-level:expert
cost:medium
impact:medium

The text was updated successfully, but these errors were encountered:

ghost · 2022-02-04T06:49:35Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Arm64 doesn't have modulo operator, so we convert the operation to div/mul/sub format. However, we don't do much to the tree node until codegen where we generate code for it. During codegen, it is too late to know that divisor was invariant otherwise we can hoist some extra checks like divisor == 0 and divisor == -1.

In below example, when both dividend (x) and divisor (y) are invariant, we hoist the checks properly, but when dividend is not invariant (result) but divisor is invariant (y), we can hoist the checks out of the loop.

public static int issue2(int x, int y, int z)
{
    int result = 0;
    for (int i = 0; i < z; i++)
    {
        //result = x % y; <-- this hoist things properly because both dividend and divisor are invariant.
        result = result % y;
    }
    return result;
}

...
G_M61875_IG03:
            cmp     w0, #0   ; Check# 1 divisor == 0, can be hoisted 
            beq     G_M61875_IG08
            cmn     w0, #1  ; Check# 2 divisor == -1, can be hoisted 
            bne     G_M61875_IG04
            adds    wzr, w1, w1
            bne     G_M61875_IG04
            bvs     G_M61875_IG07
                                                ;; bbWeight=4    PerfScore 22.00
G_M61875_IG04:
            sdiv    w4, w1, w0
            mul     w4, w4, w0
            sub     w1, w1, w4
            add     w3, w3, #1
            cmp     w3, w2
            blt     G_M61875_IG03
                                                ;; bbWeight=4    PerfScore 62.00
G_M61875_IG05:
            mov     w0, w1
                                                ;; bbWeight=1    PerfScore 0.50
G_M61875_IG06:
            ldp     fp, lr, [sp],#16
            ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
G_M61875_IG07:
            bl      CORINFO_HELP_OVERFLOW
                                                ;; bbWeight=0    PerfScore 0.00
G_M61875_IG08:
            bl      CORINFO_HELP_THROWDIVZERO
            brk_windows #0

Author:	kunalspathak
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `untriaged`
Milestone:	-

kunalspathak · 2022-02-04T06:49:51Z

@dotnet/jit-contrib

EgorBo · 2022-02-04T09:54:42Z

Not sure I understand this, so early in morph we transform it to

loop
{
-     loopVariant = loopVariant % loopInvariant;
+     loopVariant = loopVariant  - (loopVariant  / loopInvariant) * loopInvariant;
}

I don't see what can be hoisted here.

In theory, for cases like this for loops (on both x64 and Arm64) we can use a super smart trick by @lemire that we already use for Dictionaries, e.g.:

public static uint issue2(uint x, uint y, uint z)
{
    uint result = 0;

    ulong cachedMul = GetFastModMultiplier(y);
    for (int i = 0; i < z; i++)
    {
        result = FastMod(x, y, cachedMul); // no 'div' in the loop body!
    }
    return result;
}

public static ulong GetFastModMultiplier(uint divisor) =>
    ulong.MaxValue / divisor + 1;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static uint FastMod(uint value, uint divisor, ulong multiplier) => 
    (uint)(((((multiplier * value) >> 32) + 1) * divisor) >> 32);

EgorBo · 2022-02-04T09:59:13Z

so e.g. you can try to morph

loopVariant % loopInvariant

to

(uint)((((((ulong.MaxValue / loopInvariant + 1) * loopVariant) >> 32) + 1) * loopInvariant) >> 32)

and hope LICM will manage to hoist the multiplier part. This can only be done for loops and only when you're sure the invariant part will be hoisted. Also, only unsigned types and on 64bit

NOTE: if loopInvariant is not a leaf node we'll have to spill it to a temp and GT_ASG nodes are unlikely to be hoisted

SingleAccretion · 2022-02-04T11:41:05Z

The way to fix this is to introduce a new node like CKFINITE that represents "the check" and wraps the actual divisor. Then add support for it in VN, and it should work itself out in hoisting/CSE (well -- it'll need some special handling not unlike COMMA has).

(One will also need to add support for "non-faulting" DIVs as part of this)

Obviously, a new node for this narrow optimization is not a small price to pay in complexity, so one will have to make sure it is worth it.

kunalspathak · 2022-02-04T15:46:11Z

I don't see what can be hoisted here.

The loopInvariant == 0 and loopInvariant == -1 checks.

The way to fix this is to introduce a new node like CKFINITE that represents "the check"

I agree. There will be an entire new pipeline to be added to support this. Quick search on internet reveals the pattern being used in popular code.

runtime/src/libraries/System.Private.Xml/src/System/Xml/Xsl/XPathConvert.cs

Line 905 in 9055c8b

uHH %= uScale;

https://github.com/mono/mono/blob/e3c2771b7004661afb04d568b6e8ae1d2f397a3c/mcs/class/Mono.Security/Mono.Math/BigInteger.cs#L1881

kunalspathak · 2022-02-23T17:42:14Z

Related: #46010

TIHan · 2022-05-10T02:34:00Z

I spent some time looking at this.

A few comments:

If we start using this GT_CHK_DIV_BY_ZERO node, we do not want to use it for all cases of div-by-zero otherwise it would mess up the tree. So realistically, this node should only be used when the div-by-zero check is to be done elsewhere rather than at the point of division. Would probably mean that we have to add a flag(maybe GTF_NO_CHK_DIV_BY_ZERO?) on GT_DIV to tell it not emit it's own div-by-zero check if a GT_CHK_DIV_BY_ZERO was already used to check its divisor.
About the hoisting part, we need to be careful because we need to know with certainty that a division will occur and that there are no side-effects between hoisted check and the division itself.

I'm trying to get familiar with this area, so forgive me if it turns out the JIT can handle a lot of this already.

kunalspathak · 2022-05-10T06:23:37Z

About the hoisting part, we need to be careful because we need to know with certainty that a division will occur and that there are no side-effects between hoisted check and the division itself.

We already take care of that. Refer:

runtime/src/coreclr/jit/optimizer.cpp

Lines 6646 to 6649 in 21f1a2b

    
           // Only unconditionally executed blocks in the loop are visited (see optHoistThisLoop) 
        
           // so after we're done visiting the first block we need to assume the worst, that the 
        
           // blocks that are not visisted have side effects. 
        
           m_beforeSideEffect = false;

and

runtime/src/coreclr/jit/optimizer.cpp

Lines 6826 to 6837 in 21f1a2b

    
           if (!m_beforeSideEffect) 
        
           { 
        
               // For now, we give up on an expression that might raise an exception if it is after the 
        
               // first possible global side effect (and we assume we're after that if we're not in the first 
        
               // block). 
        
               // TODO-CQ: this is when we might do loop cloning. 
        
               // 
        
               if ((tree->gtFlags & GTF_EXCEPT) != 0) 
        
               { 
        
                   INDEBUG(failReason = "side effect ordering constraint";) 
        
                   treeIsHoistable = false; 
        
               }

TIHan · 2022-05-10T19:13:09Z

Great! Thank you for linking to that, that's really helpful.

TIHan · 2023-04-18T23:56:28Z

If we start using this GT_CHK_DIV_BY_ZERO node, we do not want to use it for all cases of div-by-zero otherwise it would mess up the tree. So realistically, this node should only be used when the div-by-zero check is to be done elsewhere rather than at the point of division. Would probably mean that we have to add a flag(maybe GTF_NO_CHK_DIV_BY_ZERO?) on GT_DIV to tell it not emit it's own div-by-zero check if a GT_CHK_DIV_BY_ZERO was already used to check its divisor.

Looking back at my old comment, I did introduce similar flags from #82924:

    GTF_DIV_MOD_NO_BY_ZERO      = 0x20000000, // GT_DIV, GT_MOD -- Div or mod definitely does not divide-by-zero.

    GTF_DIV_MOD_NO_OVERFLOW     = 0x40000000, // GT_DIV, GT_MOD -- Div or mod definitely does not overflow.

I agree that introducing new nodes and hoisting them out would be the proper way to resolve this.

dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Feb 4, 2022

kunalspathak assigned TIHan and kunalspathak Feb 4, 2022

kunalspathak mentioned this issue Feb 4, 2022

Improving ARM64 Performance in .NET 7.0 #64820

Closed

32 tasks

BruceForstall added arch-arm64 and removed untriaged New issue has not been triaged by the area owner labels Feb 8, 2022

BruceForstall added this to the 7.0.0 milestone Feb 8, 2022

kunalspathak removed their assignment May 31, 2022

JulieLeeMSFT added Priority:1 Work that is critical for the release, but we could probably ship without Pri2 and removed Priority:1 Work that is critical for the release, but we could probably ship without labels Jun 6, 2022

TIHan modified the milestones: 7.0.0, 8.0.0 Jun 16, 2022

TIHan modified the milestones: 8.0.0, Future Jul 5, 2022

marek-safar removed the Pri2 label Sep 9, 2022

a74nh mentioned this issue Sep 23, 2022

[RyuJIT][ARM] Redundant overflow check in "C / X" #46010

Closed

kunalspathak mentioned this issue Oct 13, 2022

Improving Arm64 Performance in .NET 8.0 #77010

Closed

28 tasks

TIHan mentioned this issue Apr 19, 2023

[JIT] (experiment) ARM64 - Hoist DivByZero and Overflow check out of loops #85027

Closed

kunalspathak assigned a74nh and unassigned TIHan Nov 8, 2024

kunalspathak added the Priority:3 Work that is nice to have label Nov 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Arm64: In mod operation happening inside the loop, if divisor is an invariant, hoist the divisor checks #64795

Arm64: In mod operation happening inside the loop, if divisor is an invariant, hoist the divisor checks #64795

kunalspathak commented Feb 4, 2022 •

edited by BruceForstall

Loading

ghost commented Feb 4, 2022

kunalspathak commented Feb 4, 2022

EgorBo commented Feb 4, 2022

EgorBo commented Feb 4, 2022 •

edited

Loading

SingleAccretion commented Feb 4, 2022 •

edited

Loading

kunalspathak commented Feb 4, 2022

kunalspathak commented Feb 23, 2022

TIHan commented May 10, 2022 •

edited

Loading

kunalspathak commented May 10, 2022

TIHan commented May 10, 2022

TIHan commented Apr 18, 2023

Arm64: In mod operation happening inside the loop, if divisor is an invariant, hoist the divisor checks #64795

Arm64: In mod operation happening inside the loop, if divisor is an invariant, hoist the divisor checks #64795

Comments

kunalspathak commented Feb 4, 2022 • edited by BruceForstall Loading

ghost commented Feb 4, 2022

kunalspathak commented Feb 4, 2022

EgorBo commented Feb 4, 2022

EgorBo commented Feb 4, 2022 • edited Loading

SingleAccretion commented Feb 4, 2022 • edited Loading

kunalspathak commented Feb 4, 2022

kunalspathak commented Feb 23, 2022

TIHan commented May 10, 2022 • edited Loading

kunalspathak commented May 10, 2022

TIHan commented May 10, 2022

TIHan commented Apr 18, 2023

kunalspathak commented Feb 4, 2022 •

edited by BruceForstall

Loading

EgorBo commented Feb 4, 2022 •

edited

Loading

SingleAccretion commented Feb 4, 2022 •

edited

Loading

TIHan commented May 10, 2022 •

edited

Loading