Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm64: In mod operation happening inside the loop, if divisor is an invariant, hoist the divisor checks #64795

Open
Tracked by #77010
kunalspathak opened this issue Feb 4, 2022 · 11 comments
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:3 Work that is nice to have
Milestone

Comments

@kunalspathak
Copy link
Member

kunalspathak commented Feb 4, 2022

Arm64 doesn't have modulo operator, so we convert the operation to div/mul/sub format. However, we don't do much to the tree node until codegen where we generate code for it. During codegen, it is too late to know that divisor was invariant otherwise we can hoist some extra checks like divisor == 0 and divisor == -1.

In below example, when both dividend (x) and divisor (y) are invariant, we hoist the checks properly, but when dividend is not invariant (result) but divisor is invariant (y), we can hoist the checks out of the loop.

public static int issue2(int x, int y, int z)
{
    int result = 0;
    for (int i = 0; i < z; i++)
    {
        //result = x % y; <-- this hoist things properly because both dividend and divisor are invariant.
        result = result % y;
    }
    return result;
}
...
G_M61875_IG03:
            cmp     w0, #0   ; Check# 1 divisor == 0, can be hoisted 
            beq     G_M61875_IG08
            cmn     w0, #1  ; Check# 2 divisor == -1, can be hoisted 
            bne     G_M61875_IG04
            adds    wzr, w1, w1
            bne     G_M61875_IG04
            bvs     G_M61875_IG07
                                                ;; bbWeight=4    PerfScore 22.00
G_M61875_IG04:
            sdiv    w4, w1, w0
            mul     w4, w4, w0
            sub     w1, w1, w4
            add     w3, w3, #1
            cmp     w3, w2
            blt     G_M61875_IG03
                                                ;; bbWeight=4    PerfScore 62.00
G_M61875_IG05:
            mov     w0, w1
                                                ;; bbWeight=1    PerfScore 0.50
G_M61875_IG06:
            ldp     fp, lr, [sp],#16
            ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
G_M61875_IG07:
            bl      CORINFO_HELP_OVERFLOW
                                                ;; bbWeight=0    PerfScore 0.00
G_M61875_IG08:
            bl      CORINFO_HELP_THROWDIVZERO
            brk_windows #0

category:cq
theme:div-mod-rem
skill-level:expert
cost:medium
impact:medium

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Feb 4, 2022
@ghost
Copy link

ghost commented Feb 4, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Arm64 doesn't have modulo operator, so we convert the operation to div/mul/sub format. However, we don't do much to the tree node until codegen where we generate code for it. During codegen, it is too late to know that divisor was invariant otherwise we can hoist some extra checks like divisor == 0 and divisor == -1.

In below example, when both dividend (x) and divisor (y) are invariant, we hoist the checks properly, but when dividend is not invariant (result) but divisor is invariant (y), we can hoist the checks out of the loop.

public static int issue2(int x, int y, int z)
{
    int result = 0;
    for (int i = 0; i < z; i++)
    {
        //result = x % y; <-- this hoist things properly because both dividend and divisor are invariant.
        result = result % y;
    }
    return result;
}
...
G_M61875_IG03:
            cmp     w0, #0   ; Check# 1 divisor == 0, can be hoisted 
            beq     G_M61875_IG08
            cmn     w0, #1  ; Check# 2 divisor == -1, can be hoisted 
            bne     G_M61875_IG04
            adds    wzr, w1, w1
            bne     G_M61875_IG04
            bvs     G_M61875_IG07
                                                ;; bbWeight=4    PerfScore 22.00
G_M61875_IG04:
            sdiv    w4, w1, w0
            mul     w4, w4, w0
            sub     w1, w1, w4
            add     w3, w3, #1
            cmp     w3, w2
            blt     G_M61875_IG03
                                                ;; bbWeight=4    PerfScore 62.00
G_M61875_IG05:
            mov     w0, w1
                                                ;; bbWeight=1    PerfScore 0.50
G_M61875_IG06:
            ldp     fp, lr, [sp],#16
            ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
G_M61875_IG07:
            bl      CORINFO_HELP_OVERFLOW
                                                ;; bbWeight=0    PerfScore 0.00
G_M61875_IG08:
            bl      CORINFO_HELP_THROWDIVZERO
            brk_windows #0
Author: kunalspathak
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib

@EgorBo
Copy link
Member

EgorBo commented Feb 4, 2022

Not sure I understand this, so early in morph we transform it to

loop
{
-     loopVariant = loopVariant % loopInvariant;
+     loopVariant = loopVariant  - (loopVariant  / loopInvariant) * loopInvariant;
}

I don't see what can be hoisted here.

In theory, for cases like this for loops (on both x64 and Arm64) we can use a super smart trick by @lemire that we already use for Dictionaries, e.g.:

public static uint issue2(uint x, uint y, uint z)
{
    uint result = 0;

    ulong cachedMul = GetFastModMultiplier(y);
    for (int i = 0; i < z; i++)
    {
        result = FastMod(x, y, cachedMul); // no 'div' in the loop body!
    }
    return result;
}

public static ulong GetFastModMultiplier(uint divisor) =>
    ulong.MaxValue / divisor + 1;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static uint FastMod(uint value, uint divisor, ulong multiplier) => 
    (uint)(((((multiplier * value) >> 32) + 1) * divisor) >> 32);

@EgorBo
Copy link
Member

EgorBo commented Feb 4, 2022

so e.g. you can try to morph

loopVariant % loopInvariant

to

(uint)((((((ulong.MaxValue / loopInvariant + 1) * loopVariant) >> 32) + 1) * loopInvariant) >> 32)

and hope LICM will manage to hoist the multiplier part. This can only be done for loops and only when you're sure the invariant part will be hoisted. Also, only unsigned types and on 64bit

NOTE: if loopInvariant is not a leaf node we'll have to spill it to a temp and GT_ASG nodes are unlikely to be hoisted

@SingleAccretion
Copy link
Contributor

SingleAccretion commented Feb 4, 2022

The way to fix this is to introduce a new node like CKFINITE that represents "the check" and wraps the actual divisor. Then add support for it in VN, and it should work itself out in hoisting/CSE (well -- it'll need some special handling not unlike COMMA has).

(One will also need to add support for "non-faulting" DIVs as part of this)

Obviously, a new node for this narrow optimization is not a small price to pay in complexity, so one will have to make sure it is worth it.

@kunalspathak
Copy link
Member Author

I don't see what can be hoisted here.

The loopInvariant == 0 and loopInvariant == -1 checks.

The way to fix this is to introduce a new node like CKFINITE that represents "the check"

I agree. There will be an entire new pipeline to be added to support this. Quick search on internet reveals the pattern being used in popular code.

https://github.com/mono/mono/blob/e3c2771b7004661afb04d568b6e8ae1d2f397a3c/mcs/class/Mono.Security/Mono.Math/BigInteger.cs#L1881

@BruceForstall BruceForstall added arch-arm64 and removed untriaged New issue has not been triaged by the area owner labels Feb 8, 2022
@BruceForstall BruceForstall added this to the 7.0.0 milestone Feb 8, 2022
@kunalspathak
Copy link
Member Author

Related: #46010

@TIHan
Copy link
Contributor

TIHan commented May 10, 2022

I spent some time looking at this.

A few comments:

  • If we start using this GT_CHK_DIV_BY_ZERO node, we do not want to use it for all cases of div-by-zero otherwise it would mess up the tree. So realistically, this node should only be used when the div-by-zero check is to be done elsewhere rather than at the point of division. Would probably mean that we have to add a flag(maybe GTF_NO_CHK_DIV_BY_ZERO?) on GT_DIV to tell it not emit it's own div-by-zero check if a GT_CHK_DIV_BY_ZERO was already used to check its divisor.
  • About the hoisting part, we need to be careful because we need to know with certainty that a division will occur and that there are no side-effects between hoisted check and the division itself.

I'm trying to get familiar with this area, so forgive me if it turns out the JIT can handle a lot of this already.

@kunalspathak
Copy link
Member Author

About the hoisting part, we need to be careful because we need to know with certainty that a division will occur and that there are no side-effects between hoisted check and the division itself.

We already take care of that. Refer:

// Only unconditionally executed blocks in the loop are visited (see optHoistThisLoop)
// so after we're done visiting the first block we need to assume the worst, that the
// blocks that are not visisted have side effects.
m_beforeSideEffect = false;

and

if (!m_beforeSideEffect)
{
// For now, we give up on an expression that might raise an exception if it is after the
// first possible global side effect (and we assume we're after that if we're not in the first
// block).
// TODO-CQ: this is when we might do loop cloning.
//
if ((tree->gtFlags & GTF_EXCEPT) != 0)
{
INDEBUG(failReason = "side effect ordering constraint";)
treeIsHoistable = false;
}

@TIHan
Copy link
Contributor

TIHan commented May 10, 2022

Great! Thank you for linking to that, that's really helpful.

@kunalspathak kunalspathak removed their assignment May 31, 2022
@JulieLeeMSFT JulieLeeMSFT added Priority:1 Work that is critical for the release, but we could probably ship without Pri2 and removed Priority:1 Work that is critical for the release, but we could probably ship without labels Jun 6, 2022
@TIHan TIHan modified the milestones: 7.0.0, 8.0.0 Jun 16, 2022
@TIHan TIHan modified the milestones: 8.0.0, Future Jul 5, 2022
@marek-safar marek-safar removed the Pri2 label Sep 9, 2022
@TIHan
Copy link
Contributor

TIHan commented Apr 18, 2023

If we start using this GT_CHK_DIV_BY_ZERO node, we do not want to use it for all cases of div-by-zero otherwise it would mess up the tree. So realistically, this node should only be used when the div-by-zero check is to be done elsewhere rather than at the point of division. Would probably mean that we have to add a flag(maybe GTF_NO_CHK_DIV_BY_ZERO?) on GT_DIV to tell it not emit it's own div-by-zero check if a GT_CHK_DIV_BY_ZERO was already used to check its divisor.

Looking back at my old comment, I did introduce similar flags from #82924:

    GTF_DIV_MOD_NO_BY_ZERO      = 0x20000000, // GT_DIV, GT_MOD -- Div or mod definitely does not divide-by-zero.

    GTF_DIV_MOD_NO_OVERFLOW     = 0x40000000, // GT_DIV, GT_MOD -- Div or mod definitely does not overflow.

I agree that introducing new nodes and hoisting them out would be the proper way to resolve this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:3 Work that is nice to have
Projects
None yet
Development

No branches or pull requests

8 participants