Replace successive "ldr" and "str" instructions with "ldp" and "stp" #77540

Merged

merged 30 commits on Jan 27, 2023

Conversation

AndyJGraham
Contributor

AndyJGraham commented Oct 27, 2022

This change serves to address the following four GitHub tickets:

  1. ARM64: Optimize pair of "ldr reg, [fp]" to ldp #35130
  2. ARM64: Optimize pair of "ldr reg, [reg]" to ldp #35132
  3. ARM64: Optimize pair of "str reg, [reg]" to stp #35133
  4. ARM64: Optimize pair of "str reg, [fp]" to stp #35134

The optimisation detects a pairing opportunity as instruction sequences are being generated. The optimised instruction is then generated on top of the previous instruction, with no second instruction generated. Thus, there are no changes to instruction group size at “emission time” and no changes to jump instructions.
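For illustration only, the pairing condition amounts to something like the following self-contained sketch (hypothetical types and names, not the actual emitter code):

    // Minimal sketch of the ldr/ldr -> ldp (and str/str -> stp) pairing check.
    // All names here are hypothetical; the real logic lives in the arm64 emitter.
    enum Ins { LDR, STR };

    struct MemOp {
        Ins ins;      // LDR or STR
        int dataReg;  // register loaded or stored
        int baseReg;  // address base register (e.g. fp, x20)
        int offset;   // immediate byte offset
        int size;     // access size in bytes (4 for w-regs, 8 for x-regs)
    };

    // True if `second` can fuse with `first` into a single
    // "ldp/stp first.dataReg, second.dataReg, [baseReg, #first.offset]".
    bool canFuseToPair(const MemOp& first, const MemOp& second)
    {
        if (first.ins != second.ins || first.size != second.size)
            return false;                  // both ldr or both str, same width
        if (first.baseReg != second.baseReg)
            return false;                  // same base register
        if (second.offset != first.offset + first.size)
            return false;                  // offsets must name consecutive slots
        if (first.ins == LDR && first.dataReg == second.baseReg)
            return false;                  // first load would clobber the base
        if (first.ins == LDR && first.dataReg == second.dataReg)
            return false;                  // ldp cannot write one register twice
        return true;
    }

In the real emitter the same check also has to respect instruction-group boundaries, as the later discussion of emitForceNewIG notes.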

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 27, 2022
ghost added the community-contribution Indicates that the PR has been added by a community member label Oct 27, 2022
@dnfadmin

dnfadmin commented Oct 27, 2022

CLA assistant check
All CLA requirements met.

@ghost

ghost commented Oct 27, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@a74nh
Contributor

a74nh commented Oct 27, 2022

@kunalspathak

@kunalspathak
Member

@dotnet/jit-contrib @BruceForstall

@kunalspathak
Member

While this is definitely a good change, it feels to me that we need some common method to do the replacement (reuse != nullptr) in a general fashion, instead of doing it in one-off methods (emit_R_R_R_I()).

@AndyJGraham
Contributor Author

While this is definitely a good change, it feels to me that we need some common method to do the replacement (reuse != nullptr) in a general fashion, instead of doing it in one-off methods (emit_R_R_R_I()).

Hi, Kunal. I am not sure what you mean here. It seems to me that any instruction can either be emitted and added to the instruction group or used to overwrite the last emitted instruction.

I cannot see any way that this can be achieved without altering each emitting function. Can you please advise?

Thanks, Andy

@kunalspathak
Member

kunalspathak commented Oct 28, 2022

I think there is some GC tracking missing for the 2nd register. In the diff below, we need to report that both x0 and x2 hold GC values (as seen on the left), but we only report that x0 has a GC value.

[screenshot: asm diff]

(windows-arm64 benchmark diff 3861.dasm)

Same here:

[screenshot: asm diff]

(windows-arm64 benchmark diff 26954.dasm)

We see that towards the end of IG87, we mark x4 as no longer holding a GC value, but we failed to add it in the first place. Also, I am a little confused by V112 and V107 being replaced with V00 in the comments.
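For context, the fix amounts to updating liveness for the second destination register as well; a minimal, self-contained sketch of the idea (hypothetical names, not the real emitter's GC-info API):

    // Sketch only: when two ldr's merge into one ldp, liveness must be
    // recorded for BOTH destination registers; recording only the first
    // is exactly the gap described above.
    enum GcKind { NONE, GCREF, BYREF };

    struct GcRegState {
        unsigned gcrefRegs = 0;  // bitmask of registers holding object refs
        unsigned byrefRegs = 0;  // bitmask of registers holding byrefs

        void markLive(GcKind kind, int reg) {
            if (kind == GCREF) gcrefRegs |= (1u << reg);
            if (kind == BYREF) byrefRegs |= (1u << reg);
        }
    };

    void onPairLoadEmitted(GcRegState& state, GcKind kind1, int dstReg1,
                           GcKind kind2, int dstReg2)
    {
        state.markLive(kind1, dstReg1);
        state.markLive(kind2, dstReg2);  // the second-register update
    }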

Member

kunalspathak left a comment

Looking closely at the PR, I think it should be fine to have the logic in emit_R_R_R_I(). I added some comments to point out the missing GC tracking information.

(nine review threads on src/coreclr/jit/emitarm64.cpp, now resolved)
ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 28, 2022
Member

BruceForstall left a comment

I think you should consider a different model where we support "back up" in the emitter.
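A minimal sketch of what such a model might look like (hypothetical interface, reusing MemOp and canFuseToPair from the sketch in the description above):

    // Sketch only: instead of each emitIns_* method overwriting its
    // predecessor in place, the emitter exposes a generic "back up"
    // operation, and the peephole re-emits the fused pair normally.
    struct EmitterSketch {
        void removeLastInstruction();                   // back up one ins
        void emitPair(const MemOp& a, const MemOp& b);  // emit ldp/stp
        void emitSingle(const MemOp& m);                // emit ldr/str
    };

    void emitWithPairPeephole(EmitterSketch& emit, const MemOp* prev,
                              const MemOp& cur)
    {
        if (prev != nullptr && canFuseToPair(*prev, cur)) {
            emit.removeLastInstruction();  // remove the first ldr/str
            emit.emitPair(*prev, cur);     // emit the combined instruction
        } else {
            emit.emitSingle(cur);
        }
    }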

(six review threads on src/coreclr/jit/emitarm64.cpp, now resolved)
@BruceForstall
Member

It would be useful to include here in the comments a few examples of the improvement asm diffs.

Also useful would be insight into what could be done to improve codegen related to this in the future (i.e., what cases are we missing after this change?).

ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 31, 2022
@BruceForstall
Member

@AndyJGraham Given my comment #77540 (comment), I was curious if it would work. I implemented it with https://github.com/BruceForstall/runtime/tree/LdpStp_RemoveLastInstruction on top of this. I think it's the right way to go. It isn't completely bug free yet, though.

However, while doing this, I noticed a couple of things about the implementation:

  1. I think there's a GC hole when merging two str instructions to a single stp if one or more of the stores is to a tracked stack-local GC variable. I haven't yet seen or been able to construct a test for this. Note that emitIns_R_S and emitIns_S_R save the stack-local variable in the id and use it for GC info. In the case of emitIns_R_S I don't think it matters, because we're reading from a stack local into a register and we set the register GC bits properly. However, with emitIns_S_R we're writing to the stack local. When the optimization kicks in, it calls emitIns_R_R_R_I and loses the fact that the target is a stack local. This means that emitInsWritesToLclVarStackLocPair will never return true. There is a function, emitIns_S_S_R_R, that is designed to handle two register writes to the same stack variable (used for 16-byte SIMD types and outgoing stack arguments). We have no support for a single stp instruction writing to two coincidentally adjacent (tracked, GC-ref) stack locals. It seems like we shouldn't do this optimization if the stack location is a GC/byref pointer.

  2. The optimization handles cases like:

            ldr     w1, [x20, #0x10]
            ldr     w2, [x20, #0x14]

=>

            ldp     w1, w2, [x20, #0x10]

but doesn't handle:

            ldr     w1, [x20, #0x14]
            ldr     w2, [x20, #0x10]

=>

            ldp     w2, w1, [x20, #0x10]
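Handling the second case amounts to also accepting offsets that are adjacent in descending order and swapping the register positions, so the pair is still encoded from the lower offset; a minimal sketch (hypothetical names):

    // Sketch: accept ascending and descending adjacent offsets.
    //   ldr w1, [x20, #0x14]; ldr w2, [x20, #0x10] => ldp w2, w1, [x20, #0x10]
    bool tryOrderPair(int offs1, int offs2, int size, int reg1, int reg2,
                      int* loReg, int* hiReg, int* pairOffs)
    {
        if (offs2 == offs1 + size) {   // ascending: registers already in order
            *loReg = reg1; *hiReg = reg2; *pairOffs = offs1;
            return true;
        }
        if (offs1 == offs2 + size) {   // descending: swap the register order
            *loReg = reg2; *hiReg = reg1; *pairOffs = offs2;
            return true;
        }
        return false;
    }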

@kunalspathak
Member

I think there's a GC hole

Saw them here too: #77540 (comment)

@BruceForstall
Member

Saw them here too: #77540 (comment)

I think that one has been addressed, but I think there is a potential str/str=>stp GC hole.

@kunalspathak
Member

You might also want to take into account emitForceNewIG. See #78074

@BruceForstall
Member

@AndyJGraham Can you please fix the conflict so tests can run?

@BruceForstall
Member

@a74nh @AndyJGraham @kunalspathak Assuming the perf regressions in #81551 are directly attributable to this work: are there cases where a single ldp/stp is slower on some platforms than two consecutive ldr/str?

@a74nh
Contributor

a74nh commented Feb 7, 2023

@a74nh @AndyJGraham @kunalspathak Assuming the perf regressions in #81551 are directly attributable to this work: are there cases where a single ldp/stp is slower on some platforms than two consecutive ldr/str?

ldp/stp should never be slower than two consecutive ldr/str.

I ran this myself on an Altra, running Ubuntu:

10 times with current head:

|            Method | BytesCount |     Mean |     Error |    StdDev |   Median |      Min |      Max | Allocated |
| GetStringHashCode |         10 | 5.775 ns | 0.0008 ns | 0.0007 ns | 5.774 ns | 5.773 ns | 5.776 ns |         - |
| GetStringHashCode |         10 | 5.774 ns | 0.0009 ns | 0.0008 ns | 5.774 ns | 5.773 ns | 5.776 ns |         - |
| GetStringHashCode |         10 | 7.209 ns | 0.0010 ns | 0.0008 ns | 7.209 ns | 7.208 ns | 7.211 ns |         - |
| GetStringHashCode |         10 | 7.172 ns | 0.0014 ns | 0.0013 ns | 7.172 ns | 7.170 ns | 7.174 ns |         - |
| GetStringHashCode |         10 | 5.802 ns | 0.0011 ns | 0.0009 ns | 5.802 ns | 5.800 ns | 5.803 ns |         - |
| GetStringHashCode |         10 | 5.554 ns | 0.0010 ns | 0.0009 ns | 5.555 ns | 5.553 ns | 5.556 ns |         - |
| GetStringHashCode |         10 | 6.352 ns | 0.0007 ns | 0.0007 ns | 6.352 ns | 6.351 ns | 6.353 ns |         - |
| GetStringHashCode |         10 | 5.555 ns | 0.0009 ns | 0.0009 ns | 5.554 ns | 5.553 ns | 5.557 ns |         - |
| GetStringHashCode |         10 | 7.204 ns | 0.0013 ns | 0.0011 ns | 7.204 ns | 7.203 ns | 7.207 ns |         - |
| GetStringHashCode |         10 | 6.220 ns | 0.0007 ns | 0.0007 ns | 6.220 ns | 6.219 ns | 6.221 ns |         - |

10 times with this ldr/str patch reverted:

|            Method | BytesCount |     Mean |     Error |    StdDev |   Median |      Min |      Max | Allocated |
| GetStringHashCode |         10 | 7.014 ns | 0.0034 ns | 0.0029 ns | 7.014 ns | 7.009 ns | 7.020 ns |         - |
| GetStringHashCode |         10 | 5.815 ns | 0.0021 ns | 0.0019 ns | 5.815 ns | 5.811 ns | 5.818 ns |         - |
| GetStringHashCode |         10 | 7.026 ns | 0.0013 ns | 0.0012 ns | 7.026 ns | 7.023 ns | 7.027 ns |         - |
| GetStringHashCode |         10 | 7.007 ns | 0.0020 ns | 0.0019 ns | 7.007 ns | 7.005 ns | 7.010 ns |         - |
| GetStringHashCode |         10 | 5.606 ns | 0.0009 ns | 0.0008 ns | 5.606 ns | 5.604 ns | 5.607 ns |         - |
| GetStringHashCode |         10 | 5.598 ns | 0.0315 ns | 0.0294 ns | 5.614 ns | 5.537 ns | 5.616 ns |         - |
| GetStringHashCode |         10 | 6.440 ns | 0.0019 ns | 0.0017 ns | 6.440 ns | 6.437 ns | 6.443 ns |         - |
| GetStringHashCode |         10 | 7.017 ns | 0.0021 ns | 0.0019 ns | 7.017 ns | 7.014 ns | 7.019 ns |         - |
| GetStringHashCode |         10 | 7.013 ns | 0.0023 ns | 0.0019 ns | 7.013 ns | 7.010 ns | 7.016 ns |         - |
| GetStringHashCode |         10 | 6.352 ns | 0.0022 ns | 0.0020 ns | 6.351 ns | 6.349 ns | 6.356 ns |         - |

The regression report shows a move from 4ns to 6ns.
My testing shows a range of anywhere from 5ns to 7ns.

Looking at the 100,1000,10000 variations:

current HEAD:

|            Method | BytesCount |         Mean |     Error |    StdDev |       Median |          Min |          Max | Allocated |
| GetStringHashCode |         10 |     5.588 ns | 0.0010 ns | 0.0010 ns |     5.588 ns |     5.586 ns |     5.590 ns |         - |
| GetStringHashCode |        100 |    47.271 ns | 0.0049 ns | 0.0043 ns |    47.270 ns |    47.265 ns |    47.280 ns |         - |
| GetStringHashCode |       1000 |   452.993 ns | 0.0301 ns | 0.0267 ns |   452.992 ns |   452.953 ns |   453.038 ns |         - |
| GetStringHashCode |      10000 | 4,666.836 ns | 3.2538 ns | 3.0436 ns | 4,666.813 ns | 4,660.594 ns | 4,671.444 ns |         - |
|            Method | BytesCount |         Mean |     Error |    StdDev |       Median |          Min |          Max | Allocated |
| GetStringHashCode |         10 |     5.774 ns | 0.0009 ns | 0.0008 ns |     5.774 ns |     5.773 ns |     5.776 ns |         - |
| GetStringHashCode |        100 |    45.450 ns | 0.0039 ns | 0.0037 ns |    45.450 ns |    45.446 ns |    45.458 ns |         - |
| GetStringHashCode |       1000 |   453.829 ns | 0.0321 ns | 0.0301 ns |   453.814 ns |   453.793 ns |   453.885 ns |         - |
| GetStringHashCode |      10000 | 4,675.019 ns | 0.3641 ns | 0.3406 ns | 4,675.028 ns | 4,674.435 ns | 4,675.676 ns |         - |
|            Method | BytesCount |         Mean |      Error |     StdDev |       Median |          Min |          Max | Allocated |
| GetStringHashCode |         10 |     7.171 ns |  0.0009 ns |  0.0008 ns |     7.171 ns |     7.170 ns |     7.173 ns |         - |
| GetStringHashCode |        100 |    45.751 ns |  0.0086 ns |  0.0077 ns |    45.748 ns |    45.744 ns |    45.766 ns |         - |
| GetStringHashCode |       1000 |   453.007 ns |  0.0540 ns |  0.0478 ns |   452.995 ns |   452.948 ns |   453.104 ns |         - |
| GetStringHashCode |      10000 | 4,558.079 ns | 11.4109 ns | 10.6738 ns | 4,561.511 ns | 4,520.597 ns | 4,561.850 ns |         - |

With the ldp/stp patch reverted:

|            Method | BytesCount |         Mean |     Error |    StdDev |       Median |          Min |          Max | Allocated |
| GetStringHashCode |         10 |     6.464 ns | 0.0008 ns | 0.0008 ns |     6.464 ns |     6.463 ns |     6.466 ns |         - |
| GetStringHashCode |        100 |    45.622 ns | 0.0059 ns | 0.0052 ns |    45.621 ns |    45.613 ns |    45.633 ns |         - |
| GetStringHashCode |       1000 |   454.581 ns | 0.0586 ns | 0.0548 ns |   454.589 ns |   454.505 ns |   454.674 ns |         - |
| GetStringHashCode |      10000 | 4,548.124 ns | 8.2210 ns | 7.6900 ns | 4,550.542 ns | 4,521.367 ns | 4,550.865 ns |         - |
|            Method | BytesCount |         Mean |     Error |    StdDev |       Median |          Min |          Max | Allocated |
| GetStringHashCode |         10 |     6.463 ns | 0.0018 ns | 0.0017 ns |     6.463 ns |     6.460 ns |     6.466 ns |         - |
| GetStringHashCode |        100 |    45.624 ns | 0.0054 ns | 0.0048 ns |    45.623 ns |    45.618 ns |    45.633 ns |         - |
| GetStringHashCode |       1000 |   453.109 ns | 0.0251 ns | 0.0235 ns |   453.101 ns |   453.083 ns |   453.153 ns |         - |
| GetStringHashCode |      10000 | 4,552.636 ns | 0.5026 ns | 0.4455 ns | 4,552.749 ns | 4,551.203 ns | 4,553.072 ns |         - |
|            Method | BytesCount |         Mean |     Error |    StdDev |       Median |          Min |          Max | Allocated |
| GetStringHashCode |         10 |     6.410 ns | 0.0013 ns | 0.0012 ns |     6.410 ns |     6.408 ns |     6.412 ns |         - |
| GetStringHashCode |        100 |    45.629 ns | 0.0360 ns | 0.0301 ns |    45.620 ns |    45.609 ns |    45.703 ns |         - |
| GetStringHashCode |       1000 |   454.546 ns | 0.0397 ns | 0.0371 ns |   454.560 ns |   454.465 ns |   454.591 ns |         - |
| GetStringHashCode |      10000 | 4,551.589 ns | 4.3182 ns | 3.8280 ns | 4,552.570 ns | 4,538.298 ns | 4,552.890 ns |         - |

Again, we're only seeing a few nanoseconds difference on a much larger range.

My gut says we are within variance and this difference should just vanish on the next run of the test suite. How easy is it to rerun the CI for those tests?

I'll give IterateForEach a run to see if I get similar results.

@tannergooding
Member

It's possible that loop alignment or some other peephole optimization is regressed by the differing instruction.

Might be worth getting the disassembly to validate the before/after.

@a74nh
Contributor

a74nh commented Feb 7, 2023

Might be worth getting the disassembly to validate the before/after

Assembly for the main routine under test hasn't changed at all: (The LDP and STP here are prologue/epilogue entries outside the scope of this patch)

; Assembly listing for method System.String:GetHashCode():int:this
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; Tier-1 compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  4,  4   )     ref  ->   x1         this class-hnd single-def
;* V01 loc0         [V01    ] (  0,  0   )    long  ->  zero-ref   
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M27075_IG01:
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
            mov     x1, x0
						;; size=12 bbWeight=1 PerfScore 2.00
G_M27075_IG02:
            add     x0, x1, #12
            ldr     w1, [x1, #0x08]
            lsl     w1, w1, #1
            movz    w2, #0xD1FFAB1E
            movk    w2, #0xD1FFAB1E LSL #16
            movz    w3, #0xD1FFAB1E
            movk    w3, #0xD1FFAB1E LSL #16
            movz    x4, #0xD1FFAB1E      // code for System.Marvin:ComputeHash32(byref,uint,uint,uint):int
            movk    x4, #0xD1FFAB1E LSL #16
            movk    x4, #0xD1FFAB1E LSL #32
            ldr     x4, [x4]
						;; size=44 bbWeight=1 PerfScore 11.00
G_M27075_IG03:
            ldp     fp, lr, [sp], #0x10
            br      x4

@a74nh
Contributor

a74nh commented Feb 7, 2023

Meanwhile, I'm getting a consistent perf drop of 0.1us for System.Collections.IterateForEach when the optimisation is enabled. Will investigate this a little more.

@kunalspathak
Member

You might want to check the disassembly of Marvin:ComputeHash32().

@a74nh
Contributor

a74nh commented Feb 8, 2023

Narrowed down the IterateForEach issue a little...

Disabling the LDP/STP optimization only on System.Collections.Generic.Dictionary`2+Enumerator[int,int]:MoveNext() regains all the lost performance.

MoveNext() has a single use of LDP.

Full assembly for MoveNext.

; Assembly listing for method System.Collections.Generic.Dictionary`2+Enumerator[int,int]:MoveNext():bool:this
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; Tier-1 compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] ( 13, 28   )   byref  ->   x0         this single-def
;  V01 loc0         [V01,T03] (  4,  5   )   byref  ->   x1        
;  V02 loc1         [V02,T02] (  4,  8   )     int  ->   x2        
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;  V04 tmp1         [V04,T01] (  3, 12   )     ref  ->   x1         class-hnd "impAppendStmt"
;* V05 tmp2         [V05    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op "NewObj constructor temp"
;  V06 tmp3         [V06,T05] (  2,  2   )     int  ->   x2         "Inlining Arg"
;  V07 tmp4         [V07,T06] (  2,  2   )     int  ->   x1         "Inlining Arg"
;  V08 tmp5         [V08,T07] (  2,  1   )     int  ->   x2         V05.key(offs=0x00) P-INDEP "field V05.key (fldOffset=0x0)"
;  V09 tmp6         [V09,T08] (  2,  1   )     int  ->   x1         V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)"
;  V10 tmp7         [V10,T04] (  3,  3   )   byref  ->   x0         single-def "BlockOp address local"
;
; Lcl frame size = 0
G_M39015_IG01:
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
						;; size=8 bbWeight=1 PerfScore 1.50
G_M39015_IG02:
            ldr     w1, [x0, #0x08]
            ldr     x2, [x0]
            ldr     w2, [x2, #0x44]
            cmp     w1, w2
            bne     G_M39015_IG09
						;; size=20 bbWeight=1 PerfScore 10.50
G_M39015_IG03:
            ldr     w1, [x0, #0x0C]
            ldr     x2, [x0]
            ldr     w2, [x2, #0x38]
            cmp     w1, w2
            blo     G_M39015_IG06
						;; size=20 bbWeight=8 PerfScore 84.00
G_M39015_IG04:
            ldr     x1, [x0]
            ldr     w1, [x1, #0x38]
            add     w1, w1, #1
            str     w1, [x0, #0x0C]
            str     xzr, [x0, #0x14]
            mov     w0, wzr
						;; size=24 bbWeight=0.50 PerfScore 4.50
G_M39015_IG05:
            ldp     fp, lr, [sp], #0x10
            ret     lr
						;; size=8 bbWeight=0.50 PerfScore 1.00
G_M39015_IG06:
            ldr     x1, [x0]
            ldr     x1, [x1, #0x10]
            ldr     w2, [x0, #0x0C]
            add     w3, w2, #1
            str     w3, [x0, #0x0C]
            ldr     w3, [x1, #0x08]
            cmp     w2, w3
            bhs     G_M39015_IG10
            ubfiz   x2, x2, #4, #32
            add     x2, x2, #16
            add     x1, x1, x2
            ldr     w2, [x1, #0x04]
            cmn     w2, #1
            blt     G_M39015_IG03
						;; size=56 bbWeight=2 PerfScore 43.00
G_M39015_IG07:
            ldp     w2, w1, [x1, #0x08]
            add     x0, x0, #20
            str     w2, [x0]
            str     w1, [x0, #0x04]
            mov     w0, #1
						;; size=20 bbWeight=0.50 PerfScore 3.00
G_M39015_IG08:
            ldp     fp, lr, [sp], #0x10
            ret     lr
						;; size=8 bbWeight=0.50 PerfScore 1.00
G_M39015_IG09:
            movz    x0, #0xD1FFAB1E      // code for System.ThrowHelper:ThrowInvalidOperationException_InvalidOperation_EnumFailedVersion()
            movk    x0, #0xD1FFAB1E LSL #16
            movk    x0, #0xD1FFAB1E LSL #32
            ldr     x0, [x0]
            blr     x0
            brk_unix #0
						;; size=24 bbWeight=0 PerfScore 0.00
G_M39015_IG10:
            bl      CORINFO_HELP_RNGCHKFAIL
            brk_unix #0
						;; size=8 bbWeight=0 PerfScore 0.00

The LDP is in G_M39015_IG07. This is outside of a loop. The only branches to code after the LDP are error cases.

Code for MoveNext()

            public bool MoveNext()
            {
                if (_version != _dictionary._version)
                {
                    ThrowHelper.ThrowInvalidOperationException_InvalidOperation_EnumFailedVersion();
                }

                // Use unsigned comparison since we set index to dictionary.count+1 when the enumeration ends.
                // dictionary.count+1 could be negative if dictionary.count is int.MaxValue
                while ((uint)_index < (uint)_dictionary._count)
                {
                    ref Entry entry = ref _dictionary._entries![_index++];

                    if (entry.next >= -1)
                    {
                        _current = new KeyValuePair<TKey, TValue>(entry.key, entry.value);
                        return true;
                    }
                }

                _index = _dictionary._count + 1;
                _current = default;
                return false;
            }

LDP is used for the load of entry.key and entry.value.

Entry struct

        private struct Entry
        {
            public uint hashCode;
            /// <summary>
            /// 0-based index of next entry in chain: -1 means end of chain
            /// also encodes whether this entry _itself_ is part of the free list by changing sign and subtracting 3,
            /// so -2 means end of free list, -3 means index 0 but on free list, -4 means index 1 but on free list, etc.
            /// </summary>
            public int next;
            public TKey key;     // Key of entry
            public TValue value; // Value of entry
        }

TKey and TValue are both ints, so we shouldn't have any alignment issues within the struct. And I would hope that everything else in the dictionary is generally aligned too.
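For reference, with both type arguments being int, the expected layout (assuming no padding) puts key and value in adjacent 4-byte slots, which is exactly what the ldp w2, w1, [x1, #0x08] above reads:

    // Expected layout of Entry when TKey = TValue = int (assuming no padding):
    struct EntryLayout {
        unsigned hashCode;  // offset 0x00
        int      next;      // offset 0x04
        int      key;       // offset 0x08  <- ldp w2, w1, [x1, #0x08] reads
        int      value;     // offset 0x0C  <- these two adjacent fields
    };
    static_assert(sizeof(EntryLayout) == 16, "four naturally aligned 4-byte fields");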

I'm a little concerned about register dependencies. x1/w1 is being used as source and dest. But that shouldn't cause any issues.

Still digging....

@a74nh
Contributor

a74nh commented Feb 8, 2023

It's possible that loop alignment or some other peephole optimization is regressed by the differing instruction.

Looks like @tannergooding was right with the alignment issues.

  1. Current head: 2.37 us
  2. Disable all LDP/STP peepholes: 2.27 us
  3. Disable removing the previous instruction when peepholing (giving us LDR+LDP): 2.27 us
  4. When peepholing, generate LDP+NOP: 2.18 us

2 and 3 are the same because they are both doing two loads.
4 is probably what we should be getting with peepholes working correctly.

Disassembly with addresses:
   0x0000ffffb74f2620:	stp	x29, x30, [sp, #-16]!
   0x0000ffffb74f2624:	mov	x29, sp
   0x0000ffffb74f2628:	ldr	w1, [x0, #8]
   0x0000ffffb74f262c:	ldr	x2, [x0]
   0x0000ffffb74f2630:	ldr	w2, [x2, #68]
   0x0000ffffb74f2634:	cmp	w1, w2
   0x0000ffffb74f2638:	b.ne	0xffffb74f26c4  // b.any
   0x0000ffffb74f263c:	ldr	w1, [x0, #12]
   0x0000ffffb74f2640:	ldr	x2, [x0]
   0x0000ffffb74f2644:	ldr	w2, [x2, #56]
   0x0000ffffb74f2648:	cmp	w1, w2
   0x0000ffffb74f264c:	b.cc	0xffffb74f2670  // b.lo, b.ul, b.last
   0x0000ffffb74f2650:	ldr	x1, [x0]
   0x0000ffffb74f2654:	ldr	w1, [x1, #56]
   0x0000ffffb74f2658:	add	w1, w1, #0x1
   0x0000ffffb74f265c:	str	w1, [x0, #12]
   0x0000ffffb74f2660:	stur	xzr, [x0, #20]
   0x0000ffffb74f2664:	mov	w0, wzr
   0x0000ffffb74f2668:	ldp	x29, x30, [sp], #16
   0x0000ffffb74f266c:	ret
   0x0000ffffb74f2670:	ldr	x1, [x0]
   0x0000ffffb74f2674:	ldr	x1, [x1, #16]
   0x0000ffffb74f2678:	ldr	w2, [x0, #12]
   0x0000ffffb74f267c:	add	w3, w2, #0x1
   0x0000ffffb74f2680:	str	w3, [x0, #12]
   0x0000ffffb74f2684:	ldr	w3, [x1, #8]
   0x0000ffffb74f2688:	cmp	w2, w3
   0x0000ffffb74f268c:	b.cs	0xffffb74f26dc  // b.hs, b.nlast
   0x0000ffffb74f2690:	ubfiz	x2, x2, #4, #32
   0x0000ffffb74f2694:	add	x2, x2, #0x10
   0x0000ffffb74f2698:	add	x1, x1, x2
   0x0000ffffb74f269c:	ldr	w2, [x1, #4]
   0x0000ffffb74f26a0:	cmn	w2, #0x1
   0x0000ffffb74f26a4:	b.lt	0xffffb74f263c  // b.tstop

   0x0000ffffb74f26a8:	ldp	w2, w1, [x1, #8]

   0x0000ffffb74f26ac:	add	x0, x0, #0x14
   0x0000ffffb74f26b0:	str	w2, [x0]
   0x0000ffffb74f26b4:	str	w1, [x0, #4]
   0x0000ffffb74f26b8:	mov	w0, #0x1                   	// #1
   0x0000ffffb74f26bc:	ldp	x29, x30, [sp], #16
   0x0000ffffb74f26c0:	ret

   0x0000ffffb74f26c4:	mov	x0, #0xb078                	// #45176
   0x0000ffffb74f26c8:	movk	x0, #0xb75e, lsl #16
   0x0000ffffb74f26cc:	movk	x0, #0xffff, lsl #32
   0x0000ffffb74f26d0:	ldr	x0, [x0]
   0x0000ffffb74f26d4:	blr	x0
   0x0000ffffb74f26d8:	brk	#0x0

   0x0000ffffb74f26dc:	bl	0xffffb74f00f8
   0x0000ffffb74f26e0:	brk	#0x0

   0x0000ffffb74f26e4:	stllrb	w17, [x1]
   0x0000ffffb74f26e8:	.inst	0x00400012 ; undefined
   0x0000ffffb74f26ec:	.inst	0x00400027 ; undefined
   0x0000ffffb74f26f0:	st1h	{z1.s}, p0, [x15, z4.s, uxtw #1]
   0x0000ffffb74f26f4:	udf	#0
   0x0000ffffb74f26f8:	tbnz	x16, #50, 0xffffb74efa04
   0x0000ffffb74f26fc:	udf	#65535
   0x0000ffffb74f2700:	stp	x29, x30, [sp, #-16]!
   0x0000ffffb74f2704:	mov	x29, sp
   0x0000ffffb74f2708:	ldp	x29, x30, [sp], #16
   0x0000ffffb74f270c:	ret
   0x0000ffffb74f2710:	ldxrb	w4, [x0]
   0x0000ffffb74f2714:	.inst	0x00400002 ; undefined
   0x0000ffffb74f2718:	st1h	{z1.s}, p0, [x15, z4.s, uxtw #1]
   0x0000ffffb74f271c:	udf	#0

It looks like some of those branch targets are now at misaligned addresses. When the LDP (at 0x0000ffffb74f26a8) is two LDRs, the misaligned addresses become aligned.

I think we need to check that the targets of branches are aligned, and if not, insert a NOP to align them - that'll be the start of every basic block where the predecessor isn't the previous block.

LLVM has something already for this (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64Subtarget.cpp#L94), and is quite specific depending on the exact Arm processor.

Before I start trying anything out - has aligning targets been discussed before? Is there already any code which does similar?
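For reference, the padding computation itself is small; a minimal sketch (the choice of alignment boundary, e.g. 8 or 16 bytes, would be the per-core tuning question):

    // Number of 4-byte A64 NOPs needed to bring a branch target up to
    // `alignment` bytes (alignment must be a power of two).
    unsigned nopsForAlignment(unsigned long long targetAddr, unsigned alignment)
    {
        unsigned misalign = (unsigned)(targetAddr & (alignment - 1));
        unsigned padBytes = (misalign == 0) ? 0 : (alignment - misalign);
        return padBytes / 4;  // every A64 instruction, including nop, is 4 bytes
    }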

@tannergooding
Member

I think we need to check that the targets of branches are aligned, and if not, insert a NOP to align them - that'll be the start of every basic block where the predecessor isn't the previous block.

We already have this, but it is a heuristic and isn't always done. CC. @kunalspathak

@BruceForstall
Member

Are you suggesting that the loop back-edge branch b.lt 0xffffb74f263c is to an address that is not 16-byte aligned, and thus perhaps we're getting sub-optimal cache behavior, or similar? The other branches to non-16-byte-aligned addresses are to error cases. What do the addresses look like in the ldr/ldr (pre-optimization) case? I would think the ldp would only affect the alignment of addresses that follow it, but perhaps there's some interaction with the existing loop alignment code.

@BruceForstall
Member

My gut is to say we are within variance and this difference should just vanish on the next run of the test suite. How easy is it to rerun the CI for those tests?

I believe the perf jobs run every night. @kunalspathak would know for sure.

@BruceForstall
Member

Overall, if we can verify that the issue is due to loop alignment (or cache effect or some non-deterministic feature) and there isn't a bad interaction between the peephole optimization and the loop alignment implementation that is breaking the loop alignment implementation, then we might choose to simply ignore a regression. It's still worthwhile doing the due diligence to ensure we understand the regression to the best of our ability.

@kunalspathak
Member

I believe the perf jobs run every night. @kunalspathak would know for sure.

They run every few hours on a batch of commits. However, looking at the overall history of the benchmark at https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_arm64_ubuntu%2020.04/System.Hashing.GetStringHashCode(BytesCount%3a%2010).html , the regression looks fairly stable.

[benchmark history graph]

Before I start trying anything out - has aligning targets been discussed before? Is there already any code which does similar?

We added loop alignment back in .NET 6 and you can read the detailed heuristics in https://devblogs.microsoft.com/dotnet/loop-alignment-in-net-6/. Essentially, we try to align the start of the loop (the target of the backedge) as much as possible, provided it fits various criteria like the size of the loop body, how much padding is needed, and whether the loop is call-free. We have seen cases in the past where loops in code/benchmarks were aligned and, because of optimizations, stopped being aligned and ended up as regressions. This usually happens because the algorithm decides the loop size is too large for alignment to make sense, or the amount of padding needed to align it is more than we can afford to waste.

Overall, if we can verify that the issue is due to loop alignment (or cache effect or some non-deterministic feature) and there isn't a bad interaction between the peephole optimization and the loop alignment implementation that is breaking the loop alignment implementation, then we might choose to simply ignore a regression.

I agree.

@kunalspathak
Member

Just to be sure, try disabling loop alignment using DOTNET_JitAlignLoops=0 (the flag works on Release builds as well) before and after your changes. If that doesn't show any regressions, then we would know for sure that the regression is from the alignment.

@BruceForstall
Member

This usually happens because the algorithm decides the loop size is too large for alignment to make sense, or the amount of padding needed to align it is more than we can afford to waste.

@kunalspathak Since this optimization only reduces code size, these cases shouldn't occur, right?

Has the alignment padding already been committed when this peephole optimization occurs? Or will the alignment padding required be adjusted after the peep?

@kunalspathak
Member

@kunalspathak Since this optimization only reduces code size, these cases shouldn't occur, right?

Ideally yes, but to confirm that, we really need to make sure that a nop was present to align that target before this change. If it was, then @a74nh , could you please provide a JitDump of before and after and I can check what is going on. The other reason I can think of is that the loop size in this case is 108 bytes, which takes almost 4 blocks of 32B to fit (108/32 rounds up to 4). If I recall the Arm64 heuristics correctly, we would only allow a max of 4 bytes of padding here, so it is highly unlikely that we would have aligned the loop before this change, given that the difference in loop body size is just 4 bytes (2 ldrs replaced with 1 ldp).

Has the alignment padding already been committed when this peephole optimization occurs? Or will the alignment padding required be adjusted after the peep?

No, the alignment padding adjustment happens after the peep.

@a74nh
Contributor

a74nh commented Feb 9, 2023

DOTNET_JitAlignLoops=0

Setting this didn't cause any difference.

I then hacked coreclr so that it inserted a NOP at the start of the next block after the LDP, giving:

G_M39015_IG07:
IN001f: 000088  29410422          ldp     w2, w1, [x1, #0x08]
IN0020: 00008C  91005000          add     x0, x0, #20
IN0021: 000090  B9000002          str     w2, [x0]
IN0022: 000094  B9000401          str     w1, [x0, #0x04]
IN0023: 000098  52800020          mov     w0, #1
G_M39015_IG08: 
IN0030: 00009C  D503201F          nop     
IN0031: 0000A0  A8C17BFD          ldp     fp, lr, [sp], #0x10
IN0032: 0000A4  D65F03C0          ret     lr

And this regained all the lost performance! Back to 2.18us.

Note that this is the only function in which I'm allowing peepholes to occur.

could you please provide JitDump of before and after and I can check what is going on

This is with LDP:
dump_ldp.txt

This is with LDP and a NOP:
dump_ldp_with_nop.txt

Are you suggesting that the loop back-edge branch b.lt 0xffffb74f263c

Not quite, it would be the jumps to 0x0000ffffb74f26c4 or 0x0000ffffb74f26dc. My NOP causes both of these to become aligned.

@tannergooding
Member

Not quite, it would be the jumps to 0x0000ffffb74f26c4 or 0x0000ffffb74f26dc. My NOP causes both of these to become aligned.

IG09 and IG10 should both be cold blocks (they throw), so it's a bit unclear why this is impactful.

I'd also expect the nop to go after the ret so it doesn't impact normal code-flow execution.

@kunalspathak
Member

Setting this didn't cause any difference.

Which means that loop alignment is definitely not affecting it, although your experiment shows that aligning a few places would improve the performance (but not necessarily recover the performance that was lost with the ldp change). Basically, if you add a NOP in the same places before your change, do you not see similar improvements?

My NOP causes both of these to become aligned.

Were they aligned before your ldp change?

@a74nh
Contributor

a74nh commented Feb 9, 2023

Were they aligned before your ldp change?

Before the LDP change, they were both aligned. So adding the NOP puts the rest of the instructions in the same positions they were in with two LDRs.

I tried moving it around some more, and moving the NOP to after the ret (or anywhere else afterwards) drops the performance again, back to 2.3us.

IN001f: 000088                    ldp     w2, w1, [x1, #0x08]
IN0020: 00008C                    add     x0, x0, #20
IN0021: 000090                    str     w2, [x0]
IN0022: 000094                    str     w1, [x0, #0x04]
IN0023: 000098                    mov     w0, #1
G_M39015_IG08:
IN0030: 00009C                    ldp     fp, lr, [sp], #0x10
IN0031: 0000A0                    ret     lr
IN0032: 0000A4                    nop     
G_M39015_IG09: 

Which is odd as G_M39015_IG09 is aligned and there is nothing branching to G_M39015_IG08.

it shows that aligning a few places would improve the performance (but not necessarily recover the performance that was lost with the ldp change). Basically, if you add a NOP in the same places before your change, do you not see similar improvements?

I can give this a try too.

The next step would be to recreate this as a standalone binary using that block of assembly and get it in a simulator. It might take a bit of time to get it showing the exact same behaviour. If we think it's important enough, then I can give it a go.

@BruceForstall
Member

It certainly seems like there is some odd micro-architectural effect here. E.g., and this seems like grasping at straws, maybe there's an instruction prefetcher grabbing instructions at the presumably (!) cold, not-taken branch targets that are newly unaligned, causing conflicts with fetching the fall-through path?

I'm not sure how much more I would invest into this investigation, although understanding more might save time the next time we see an unexplained regression.

@kunalspathak
Member

kunalspathak commented Feb 13, 2023

@kunalspathak
Member

I went through the issues and I don't see any other regressions.

@a74nh
Contributor

a74nh commented Feb 16, 2023

Some updates.....

I extracted the assembly for the entire function into a test program and set some dummy memory values for the dictionary. I ran this on a cycle-accurate simulator for the N1, and extracted the traces (including pipeline stages). I did this once for the program with LDP, and once with an LDP plus a NOP. There was nothing to suggest any difference between the two, except for the NOP adding a slight delay. Sadly, I'm unable to share any of the traces.

What my test app doesn't replicate is the exact memory setup of the coreclr version (e.g. the code has the same alignment but is in a different location; the contents of the dictionary are different and live in a different location). So it's possible this is causing a difference. There are also differences from coreclr (e.g. the GC) to take into account.

As a diversion, now that I have some code to insert NOPs in arbitrary places during clr codegen, I experimented a bit more with moving the NOP around. I've annotated the code below with the benchmark result when a NOP (or 2 NOPs, or 3) is placed there.
The benchmark speed without a NOP is 2.37 us.

G_M39015_IG01:        ; func=00, offs=000000H, size=0008H, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
IN002c: 000000                    stp     fp, lr, [sp, #-0x10]!
IN002d: 000004                    mov     fp, sp
G_M39015_IG02:        ; offs=000008H, size=0014H, bbWeight=1, PerfScore 10.50, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, BB01 [0000], byref, isz
IN0001: 000008                    ldr     w1, [x0, #0x08]
IN0002: 00000C                    ldr     x2, [x0]
IN0003: 000010                    ldr     w2, [x2, #0x44]
IN0004: 000014                    cmp     w1, w2
IN0005: 000018                    bne     G_M39015_IG09
G_M39015_IG03:        ; offs=00001CH, size=0014H, bbWeight=8, PerfScore 84.00, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, BB05 [0004], byref, isz
IN0006: 00001C                    ldr     w1, [x0, #0x0C]
IN0007: 000020                    ldr     x2, [x0]
IN0008: 000024                    ldr     w2, [x2, #0x38]
IN0009: 000028                    cmp     w1, w2
IN000a: 00002C                    blo     G_M39015_IG06
G_M39015_IG04:        ; offs=000030H, size=0018H, bbWeight=0.50, PerfScore 4.50, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, BB06 [0005], byref
IN000b: 000030                    ldr     x1, [x0]
IN000c: 000034                    ldr     w1, [x1, #0x38]
IN000d: 000038                    add     w1, w1, #1
IN000e: 00003C                    str     w1, [x0, #0x0C]
IN000f: 000040                    str     xzr, [x0, #0x14]
IN0010: 000044                    mov     w0, wzr
G_M39015_IG05:        ; offs=000048H, size=0008H, bbWeight=0.50, PerfScore 1.00, epilog, nogc, extend
IN002e: 000048                    ldp     fp, lr, [sp], #0x10
IN002f: 00004C                    ret     lr
                      nop//2.37 us
                    2nops//2.35 us
G_M39015_IG06:        ; offs=000050H, size=0038H, bbWeight=2, PerfScore 43.00, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, BB03 [0002], gcvars, byref, isz
IN0011: 000050                    ldr     x1, [x0]
                      nop//2.18 us
                    2nops//2.30 us
                    3nops//2.25 us
IN0012: 000054                    ldr     x1, [x1, #0x10]
IN0013: 000058                    ldr     w2, [x0, #0x0C]
IN0014: 00005C                    add     w3, w2, #1
IN0015: 000060                    str     w3, [x0, #0x0C]
IN0016: 000064                    ldr     w3, [x1, #0x08]
IN0017: 000068                    cmp     w2, w3
IN0018: 00006C                    bhs     G_M39015_IG10
IN0019: 000070                    ubfiz   x2, x2, #4, #32
IN001a: 000074                    add     x2, x2, #16
IN001b: 000078                    add     x1, x1, x2
IN001c: 00007C                    ldr     w2, [x1, #0x04]
                      nop//2.20 us
IN001d: 000080                    cmn     w2, #1
                      nop//2.20 us
IN001e: 000084                    blt     G_M39015_IG03
G_M39015_IG07:        ; offs=000088H, size=0014H, bbWeight=0.50, PerfScore 3.00, gcrefRegs=0000 {}, byrefRegs=0003 {x0 x1}, BB04 [0003], byref
                      nop//2.18 us
                    2nops//2.20 us
IN001f: 000088                    ldp     w2, w1, [x1, #0x08]
                      nop//can't place here as it interferes with the peephole.
IN0020: 00008C                    add     x0, x0, #20
                      nop//2.18 us
                    2nops//2.25 us
IN0021: 000090                    str     w2, [x0]
                      nop//2.18 us
IN0022: 000094                    str     w1, [x0, #0x04]
                      nop//2.18 us
IN0023: 000098                    mov     w0, #1

G_M39015_IG08:        ; offs=00009CH, size=0008H, bbWeight=0.50, PerfScore 1.00, epilog, nogc, extend
                      nop//2.18 us
IN0030: 00009C                    ldp     fp, lr, [sp], #0x10
                      nop//2.18 us
                    2nops//2.22 us
                    3nops//2.35 us
IN0031: 0000A0                    ret     lr
                      nop//2.37 us
G_M39015_IG09:        ; offs=0000A4H, size=0018H, bbWeight=0, PerfScore 0.00, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB02 [0001], gcvars, byref
IN0024: 0000A4                    movz    x0, #0xB078      // code for System.ThrowHelper:ThrowInvalidOperationException_InvalidOperation_EnumFailedVersion()
IN0025: 0000A8                    movk    x0, #0x707C LSL #16
IN0026: 0000AC                    movk    x0, #0xFFFF LSL #32
IN0027: 0000B0                    ldr     x0, [x0]
IN0028: 0000B4                    blr     x0
IN0029: 0000B8                    brk_unix #0
G_M39015_IG10:        ; offs=0000BCH, size=0008H, bbWeight=0, PerfScore 0.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB07 [0007], byref
IN002a: 0000BC                    bl      CORINFO_HELP_RNGCHKFAIL
IN002b: 0000C0                    brk_unix #0

It's hard to make firm statements here, but:

  • Adding 2 NOPs is usually only slightly slower than a single NOP. This suggests to me the slowdown isn't related to instruction alignment (as 2 NOPs should give us the same 8-byte alignment as no NOPs).
  • Moving the NOP to after IN0030 still gives the improvement. This tells me it's not a register dependency between the instructions (and the simulator would have told me that).
  • It's possible some of these speed-ups are happening due to different effects.

@kunalspathak
Member

Both the regressions seem to have recovered after that, even though I don't see any PR that would have improved them.

[benchmark history graphs]

The diff range is: 6ad1205...dce07a8

At this point, I won't spend much more time on this, given that your experiments proved it was about general alignment (and not necessarily loop alignment). Thank you @a74nh for spending the time to investigate.

@BruceForstall
Member

@a74nh That's some amazing in-depth analysis. I agree with @kunalspathak that it doesn't seem worth spending any more time on it at this point.

ghost locked as resolved and limited conversation to collaborators Mar 19, 2023