-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT should propagate assertions and elide custom (bounds) checks #44040
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
@briansull PTAL at bound check issues. |
I will investigate |
In your first test case the register edx holds the loop counter and changes from 0 to 99 |
Here is my slightly modified variant of your first test case:
|
And here is the assembly code for Repro, with Test(int) inlined three times:
|
Looks optimal to me: Only one compare and jump to Throw() |
For your second test case we have many known issue with optimizations of multi-dimensional arrays.
|
@briansull Thank you for taking a look! As for the second case, that's not actually an ND array, but rather a custom type that lets users map arbitrary 2D memory regions (which might or might not be contiguous). It's the Span2D type (click to expand):public readonly ref struct Span2D<T>
{
private readonly Span<T> span;
private readonly int width;
internal readonly int Stride;
public int Height
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get => this.span.Length;
}
public int Width
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get => this.width;
}
public ref T this[int row, int column]
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get
{
if ((uint)row >= (uint)span.Length ||
(uint)column >= (uint)this.width)
{
static void Throw() => throw new IndexOutOfRangeException();
Throw();
}
return ref DangerousGetReferenceAt(row, column);
}
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public ref T DangerousGetReferenceAt(int i, int j)
{
ref T r0 = ref MemoryMarshal.GetReference(this.span);
nint index = ((nint)(uint)i * (nint)(uint)this.Stride) + (nint)(uint)j;
return ref Unsafe.Add(ref r0, index);
}
} With this and the sample code in this issue, I do get that unnecessary bounds checks in the resulting codegen. So the issue was not really about ND arrays per se, but rather about the possibility of the JIT to just be able to track these kind of assertions in general, and not just for a specific type. As in, ideally a fix for this would not just improve the codegen for ND arrays in the future, but to all similar cases such as this one, so that types outside of the BCL such as Right now this has quite the performance penalty unless users switch to completely unsafe code by using Do you think such an optimization could be added to the JIT in the future, our would something like this be technically too expensive to implement, at least for now? Thanks again for your time! 😊 |
Do you have a compile ready version of that demonstrates the issue, that would help me investigate this? In your last post you code uses In the test case I would expect to see a double nested loop, where you want the JIT to somehow hoist out one of the bounds checks. |
Also The JIT has to be fairly conservative in this area. We could have different thread that could change the values of |
You're absolutely right, but what I wanted to highlight here was that the code was already not safe against that scenario anyway, since the JIT is just caching the value of that field into a register for the entire time and never checking it again. So if another thread even used reflection to change that field, that method would never be able to see the change anyway, no? 🤔 EDIT: also in this case that's a ref struct passed by copy, so I don't think there would be a way for anyone else to really change any of those fields at all?
Sure thing! Consider the following full repro: C# repro code (click to expand)public static int Sum(Span2D<int> span)
{
int height = span.Height;
int width = span.Width;
int sum = 0;
for (int i = 0; i < height; i++)
{
for (int j = 0; j < width; j++)
{
sum += span[i, j];
}
}
return sum;
} public readonly ref struct Span2D<T>
{
private readonly Span<T> span;
private readonly int width;
internal readonly int Stride;
public int Height
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get => this.span.Length;
}
public int Width
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get => this.width;
}
public ref T this[int row, int column]
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get
{
if ((uint)row >= (uint)span.Length ||
(uint)column >= (uint)this.width)
{
static void Throw() => throw new IndexOutOfRangeException();
Throw();
}
return ref DangerousGetReferenceAt(row, column);
}
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public ref T DangerousGetReferenceAt(int i, int j)
{
ref T r0 = ref MemoryMarshal.GetReference(this.span);
nint index = ((nint)(uint)i * (nint)(uint)this.Stride) + (nint)(uint)j;
return ref Unsafe.Add(ref r0, index);
}
} Resulting x64 codegen (click to expand); Assembly listing for method RefToArrayTest.Program:Sum(RefToArrayTest.Span2D`1[Int32]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T01] ( 6, 72 ) byref -> rcx ld-addr-op
; V01 loc0 [V01,T10] ( 3, 6 ) int -> r8
; V02 loc1 [V02,T07] ( 3, 21 ) int -> r10
; V03 loc2 [V03,T05] ( 4, 34 ) int -> rax
; V04 loc3 [V04,T04] ( 6, 45 ) int -> r11
; V05 loc4 [V05,T00] ( 6, 76 ) int -> rsi
; V06 OutArgs [V06 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V07 tmp1 [V07,T06] ( 2, 32 ) long -> rbx "Inline stloc first use temp"
; V08 tmp2 [V08,T02] ( 2, 64 ) byref -> rdi "impAppendStmt"
; V09 tmp3 [V09,T03] ( 2, 64 ) struct (16) [rsp+0x28] do-not-enreg[SFB] must-init ld-addr-op "Inlining Arg"
; V10 cse0 [V10,T08] ( 3, 18 ) int -> rdx "CSE - aggressive"
; V11 cse1 [V11,T09] ( 3, 10 ) int -> r9 "CSE - aggressive"
;
; Lcl frame size = 56
G_M54696_IG01:
push rdi
push rsi
push rbp
push rbx
sub rsp, 56
vzeroupper
xor rax, rax
mov qword ptr [rsp+28H], rax
mov qword ptr [rsp+30H], rax
;; bbWeight=1 PerfScore 7.50
G_M54696_IG02:
mov edx, dword ptr [rcx+8]
mov r8d, edx
mov r9d, dword ptr [rcx+16]
mov r10d, r9d
xor eax, eax
xor r11d, r11d
test r8d, r8d
jle SHORT G_M54696_IG09
;; bbWeight=1 PerfScore 6.25
G_M54696_IG03:
xor esi, esi
test r10d, r10d
jle SHORT G_M54696_IG08
;; bbWeight=4 PerfScore 6.00
G_M54696_IG04:
cmp r11d, edx
jae SHORT G_M54696_IG10
;; bbWeight=16 PerfScore 20.00
G_M54696_IG05:
cmp esi, r9d
jae SHORT G_M54696_IG10
;; bbWeight=8 PerfScore 10.00
G_M54696_IG06:
vmovdqu xmm0, xmmword ptr [rcx]
vmovdqu xmmword ptr [rsp+28H], xmm0
;; bbWeight=16 PerfScore 48.00
G_M54696_IG07:
mov rdi, bword ptr [rsp+28H]
mov ebx, r11d
mov ebp, dword ptr [rcx+20]
imul rbx, rbp
mov ebp, esi
add rbx, rbp
add eax, dword ptr [rdi+4*rbx]
inc esi
cmp esi, r10d
jl SHORT G_M54696_IG04
;; bbWeight=16 PerfScore 148.00
G_M54696_IG08:
inc r11d
cmp r11d, r8d
jl SHORT G_M54696_IG03
;; bbWeight=4 PerfScore 6.00
G_M54696_IG09:
add rsp, 56
pop rbx
pop rbp
pop rsi
pop rdi
ret
;; bbWeight=1 PerfScore 3.25
G_M54696_IG10:
call RefToArrayTest.Span2D`1[Int32][System.Int32]:<get_Item>g__Throw|8_0()
int3
;; bbWeight=0 PerfScore 0.00
; Total bytes of code 126, prolog size 23, PerfScore 267.80, instruction count 47 (MethodHash=f1042a57) for method RefToArrayTest.Program:Sum(RefToArrayTest.Span2D`1[Int32]):int
; ============================================================ My issue is specifically with the following bit from the middle of the codegen: G_M54696_IG04:
cmp r11d, edx
jae SHORT G_M54696_IG10
G_M54696_IG05:
cmp esi, r9d
jae SHORT G_M54696_IG10
G_M54696_IG06:
vmovdqu xmm0, xmmword ptr [rcx]
vmovdqu xmmword ptr [rsp+28H], xmm0
G_M54696_IG07:
mov rdi, bword ptr [rsp+28H]
mov ebx, r11d
mov ebp, dword ptr [rcx+20]
imul rbx, rbp
mov ebp, esi
add rbx, rbp
add eax, dword ptr [rdi+4*rbx]
inc esi
cmp esi, r10d
jl SHORT G_M54696_IG04 From what I can see (please correct me if I'm mistaking!), the JIT is storing the following:
Now, the critical bit for this issue is that you can see the following:
Specifically, at the end of G_M54696_IG07:
; [...]
inc esi
cmp esi, r10d
jl SHORT G_M54696_IG04 My point is that all the code between This way, Hope this makes sense, thank you again for taking the time to look into this! 😊 |
Would the code that you have added to do tail duplication of conditional blocks possibly kick in here? If this block below were duplicated so that it had only one predecessor for each copy we could use assertion prop to fold away the conditional branch for the inner loop.
|
Not currently, as there's also an assignment in the block:
We could try and generalize what's there to allow a bit more code to be duplicated. We might also need other tweaks. |
OK, I will open an issue about this and link it to this issue. |
This looks fixed, the codegen for ; Method C:Repro(C+Wrapper):int
G_M58570_IG01:
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M58570_IG02:
xor eax, eax
xor edx, edx
align [0 bytes for IG03]
;; size=4 bbWeight=1 PerfScore 0.50
G_M58570_IG03:
cmp edx, ecx
jae SHORT G_M58570_IG05
inc eax
inc eax
inc eax
inc edx
cmp edx, 100
jl SHORT G_M58570_IG03
;; size=17 bbWeight=4 PerfScore 14.00
G_M58570_IG04:
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
G_M58570_IG05:
call [C+Wrapper:<Test>g__Throw|1_0()]
int3
;; size=7 bbWeight=0 PerfScore 0.00
; Total bytes of code: 37
Probably fixed by various enhancements around jump threading/RBO. |
Description
Hi, I've noticed that the JIT currently doesn't seem to track what assertions are done on the state of registers/locals between different jump labels in the generated code, resulting in completely redundant conditional branches being repeated even though by looking at the assembly one can actually see the variables being tested couldn't possibly have changed. I've discovered this issue while working on the
Span2D<T>
type in theMicrosoft.Toolkit.HighPerformance
package (here), which in some cases it's actually hurt quite a bit by the lack of this optimization (unless users just manually switch to using unsafe code to work around it).As suggested by @tannergooding - first here's a minimal repro that illustrates the issue:
C# repro code (click to expand)
This produces the following assembly code:
asm x86-64 codegen (click to expand)
You can see that each call to
Wrapper.Test
is inline and produces thatinc eax
as expected. But, each call also adds one extra conditional jump that checks the current index inedx
. After the first one (G_M50921_IG03
), the others are just not needed, as the test is only working onedx
andecx
, which are not modified by any of the following instructions until the end of the loop. The JIT doesn't seem to currently be aware of the state of these registers, so it just doesn't see that no instructions are writing to those registers to be able to know that those repeated conditional branches are just not necessary.My original example
C# repro code (click to expand)
This produces the following assembly code:
asm x86-64 codegen (click to expand)
Here the codegen is repeating the bounds check for
i
every single time the indexer is used, even though none of the registers being checked is every updated by the following code. Ideally the check should only be done once, and then thatjl SHORT G_M41662_IG04
should instead jump toG_M41662_IG05
and only repeat the check forj
, which is updated in that block of code. Like, that jump label to directly go back to the check forj
is in fact generated, it's just not being used.Benchmarks
I run a quick benchmark for this method against a version where I manually skipped the bounds check for
i
after the first iteration of the loop, and got about a 12% speed improvement due to this change alone. I think it might be worth it for the JIT to be able to track situations such as this one automatically, though admittedly I'm not sure how difficult it would be to enable this.Configuration
master
, cloned a couple weeks agoThe text was updated successfully, but these errors were encountered: