Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PGO: Static profile affects loop alignment where it's not supposed to do #71649

Closed
EgorBo opened this issue Jul 5, 2022 · 3 comments · Fixed by #71659
Closed

PGO: Static profile affects loop alignment where it's not supposed to do #71649

EgorBo opened this issue Jul 5, 2022 · 3 comments · Fixed by #71659
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented Jul 5, 2022

Extracted from dotnet/performance:

    private static uint[] input_uint = Enumerable.Range(0, 1000).Select(i => (uint)i).ToArray();

    [MethodImpl(MethodImplOptions.NoInlining)]
    public int TrailingZeroCount_uint()
    {
        int sum = 0;
        uint[] input = input_uint;
        for (int i = 0; i < input.Length; i++)
            sum += BitOperations.TrailingZeroCount(input[i]);
        return sum;
    }

Codegen diff: https://www.diffchecker.com/dIVV1Iwy
Left: ReadyToRun=1, Right: ReadyToRun=0

My theory that when we collect the profile for TrailingZeroCount we don't have BMI on that machine so it messes weights up which then inmported into this loop and mark it as non-profitable to align.

mibc dump:

"Method": "[S.P.CoreLib]System.Numerics.BitOperations.TrailingZeroCount(uint32)",
"InstrumentationData": [
  {
    "ILOffset": 0,
    "InstrumentationKind": "NumRuns",
    "Other": 7
  },
  {
    "ILOffset": 7,
    "InstrumentationKind": "EdgeIntCount",
    "Other": 0,
    "Data": 162837
  },
  {
    "ILOffset": 20,
    "InstrumentationKind": "EdgeIntCount",
    "Other": 0,
    "Data": 0
  },
  {
    "ILOffset": 30,
    "InstrumentationKind": "EdgeIntCount",
    "Other": 0,
    "Data": 0
  },
  {
    "ILOffset": 37,
    "InstrumentationKind": "EdgeIntCount",
    "Other": 0,
    "Data": 0
  }
]

Hm.. however, looks like MIBC is correct for BitOperations.TrailingZeroCount, "ILOffset": 7 stands for

call         unsigned int32 System.Runtime.Intrinsics.X86.Bmi1::TrailingZeroCount(unsigned int32)
@EgorBo EgorBo added the tenet-performance Performance related issue label Jul 5, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 5, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 5, 2022
@ghost
Copy link

ghost commented Jul 5, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Extracted from dotnet/performance:

    private static uint[] input_uint = Enumerable.Range(0, 1000).Select(i => (uint)i).ToArray();

    [MethodImpl(MethodImplOptions.NoInlining)]
    public int TrailingZeroCount_uint()
    {
        int sum = 0;
        uint[] input = input_uint;
        for (int i = 0; i < input.Length; i++)
            sum += BitOperations.TrailingZeroCount(input[i]);
        return sum;
    }

Codegen diff: https://www.diffchecker.com/dIVV1Iwy
Left: ReadyToRun=1, Right: ReadyToRun=0

My theory that when we collect the profile for TrailingZeroCount we don't have BMI on that machine so it messes weights up which then inmported into this loop and mark it as non-profitable to align.

Author: EgorBo
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr

Milestone: -

@EgorBo EgorBo added this to the 8.0.0 milestone Jul 5, 2022
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Jul 5, 2022
@EgorBo EgorBo self-assigned this Jul 5, 2022
@EgorBo
Copy link
Member Author

EgorBo commented Jul 5, 2022

cc @AndyAyersMS I think you might find this issue curious too, somehow static pgo marks loop body "cold" or rather not hot enough.

image

Weights with ReadyToRun=1 (using StandardOptimization.mibc):

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0014]  1                             1           [???..???)-> BB09 ( cond )                     internal idxlen 
BB02 [0015]  1       BB01                  1           [???..???)-> BB07 ( cond )                     internal 
BB10 [0021]  1       BB02                  1           [00C..???)                                     internal idxlen LoopPH 
BB05 [0001]  2       BB05,BB10             0.49  50  0 [00C..021)-> BB05 ( cond )                     i Loop Loop0 idxlen bwd bwd-target IBC 
BB06 [0016]  1       BB05                  1           [???..???)-> BB09 (always)                     internal 
BB07 [0018]  2       BB02,BB07             0.01   1    [00C..021)-> BB07 ( cond )                     i idxlen bwd IBC 
BB09 [0003]  3       BB01,BB06,BB07        1           [021..023)        (return)                     i 
-----------------------------------------------------------------------------------------------------------------------------------------

Weights for ReadyToRun=0:

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0014]  1                             1       [???..???)-> BB09 ( cond )                     internal idxlen 
BB02 [0015]  1       BB01                  1       [???..???)-> BB07 ( cond )                     internal 
BB10 [0021]  1       BB02                  1       [00C..???)                                     internal idxlen LoopPH 
BB05 [0001]  2       BB05,BB10             3.96  0 [00C..021)-> BB05 ( cond )                     i Loop Loop0 idxlen bwd bwd-target align 
BB06 [0016]  1       BB05                  1       [???..???)-> BB09 (always)                     internal 
BB07 [0018]  2       BB02,BB07             0.04    [00C..021)-> BB07 ( cond )                     i idxlen bwd 
BB09 [0003]  3       BB01,BB06,BB07        1       [021..023)        (return)                     i 
-----------------------------------------------------------------------------------------------------------------------------------------

@EgorBo
Copy link
Member Author

EgorBo commented Jul 5, 2022

I assume it's that general problem of propagating weights from profiled inlinees to non-profiled callee.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 5, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant