PGO: test failure in VolatileTest_op_mul on linux arm32 #57219
This doesn't seem to be PGO specific, just that PGO seems more likely to trigger it somehow. Running this slightly modified version on arm Linux, built release, with
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Text;
internal class Program
{
    private static int Main(string[] args)
    {
        Console.WriteLine("this test is designed to hang if jit cse doesnt honor volatile");
        for (int i = 0; i < 10; i++)
        {
            if (!TestCSE.Test()) return 1;
        }
        if (TestCSE.Test()) return 100;
        return 1;
    }
}
public class TestCSE
{
    private const int VAL1 = 0x404;
    private const int VAL2 = 0x03;
    private static volatile bool s_timeUp = false;
    private volatile int _a;
    private volatile int _b;
    private static int[] s_objs;
    static TestCSE()
    {
        s_objs = new int[3];
        s_objs[0] = VAL1;
        s_objs[1] = VAL1;
        s_objs[2] = VAL2;
    }
    public TestCSE()
    {
        _a = s_objs[0];
        _b = s_objs[1];
    }
    public static bool Equal(int val1, int val2)
    {
        if (val1 == val2)
            return true;
        return false;
    }
    public static bool TestFailed(int result, int expected1, int expected2, string tname)
    {
        if (result == expected1)
            return false;
        if (result == expected2)
            return false;
        Console.WriteLine("this failure may not repro everytime");
        Console.WriteLine("ERROR FAILED:" + tname + ",got val1=" + result + " expected value is, either " + expected1 + " or " + expected2);
        throw new Exception("check failed for " + tname);
    }
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public bool TestOp()
    {
        long i;
        Thread.Sleep(0);
        _a = VAL1;
        _b = VAL1;
        for (i = 0; ; i++)
        {
            if (!Equal(_a * _b, _a * _b)) break;
            if (!Equal(_a * _b, _a * _b)) break;
            i++;
        }
        Console.WriteLine("Test 1 passed after " + i + " tries");
        _a = VAL1;
        _b = VAL1;
        for (i = 0; ; i++)
        {
            if (!Equal(_a * _b, VAL1 * VAL2)) break;
            if (!Equal(_a * _b, VAL1 * VAL2)) break;
        }
        Console.WriteLine("Test 2 passed after " + i + " tries");
        bool passed = false;
        _a = VAL1;
        _b = VAL1;
        for (i = 0; ; i++)
        {
            int ans1 = _a * _b;
            int ans2 = _a * _b;
            if (TestFailed(ans1, VAL1 * VAL1, VAL1 * VAL2, "Test 3") || TestFailed(ans2, VAL1 * VAL1, VAL1 * VAL2, "Test 3"))
            {
                passed = false;
                break;
            }
            if (ans1 != ans2)
            {
                passed = true;
                break;
            }
        }
        Console.WriteLine("Test 3 " + (passed ? "passed" : "failed") + " after " + i + " tries");
        return passed;
    }
    private void Flip()
    {
        for (uint i = 0; !s_timeUp; i++)
        {
            _a = s_objs[i % 2];
            _b = s_objs[(i % 2) + 1];
        }
    }
    public static bool Test()
    {
        s_timeUp = false;
        TestCSE o = new TestCSE();
        Thread t = new Thread(new ThreadStart(o.Flip));
        t.Start();
        bool ans = o.TestOp();
        s_timeUp = true;
        t.Join();
        return ans;
    }
}
In runtime we eventually hit a debug break in thread suspend on the tiered compilation worker thread indicating that thread suspension is taking too long:
There are two threads in managed code, one in
; Assembly listing for method TestCSE:Equal(int,int):bool
; Emitting BLENDED_CODE for generic ARM CPU - Unix
; Tier-1 compilation
; optimized code
; sp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) int -> r0 single-def
; V01 arg1 [V01,T01] ( 3, 3 ) int -> r1 single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 4
G_M17750_IG01: ; gcVars=00000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, nogc <-- Prolog IG
000000 B508 push {r3,lr}
;; bbWeight=1 PerfScore 1.00
G_M17750_IG02: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
000002 4288 cmp r0, r1
000004 D101 bne SHORT G_M17750_IG05
;; bbWeight=1 PerfScore 2.00
G_M17750_IG03: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
000006 2001 movs r0, 1
;; bbWeight=0.50 PerfScore 0.50
G_M17750_IG04: ; , epilog, nogc, extend
000008 BD08 pop {r3,pc}
;; bbWeight=0.50 PerfScore 0.50
G_M17750_IG05: ; gcVars=00000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
00000A 2000 movs r0, 0
;; bbWeight=0.50 PerfScore 0.50
G_M17750_IG06: ; , epilog, nogc, extend
00000C BD08 pop {r3,pc}
;; bbWeight=0.50 PerfScore 0.50
; Total bytes of code 14, prolog size 2, PerfScore 6.40, instruction count 7, allocated bytes for code 14 (MethodHash=977bbaa9) for method TestCSE:Equal(int,int):bool
The jit reports
There's a seemingly relevant discussion over in #35274 -- seems like this fits the bill, a long running loop with short-lived calls. @janvorli any ideas on why we seemingly can't suspend threads via hijacks in this case? |
Doesn't a method have to be fully interruptible to support hijacking? Since Equal is (incorrectly) marked partially interruptible, the hijacking logic is not going to try to hijack when the PC is in that method. A partially interruptible method should always have a call site in it. |
I guess that for a method with straight-line code (no loops), we expect that the caller of the method will be partially interruptible, and that eventually the hijack retry logic will stop inside the caller method (and it will be partially interruptible). |
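The pattern under discussion (a long-running loop whose only calls are to a tiny leaf method) can be isolated from the CSE test with a small probe. The following is a minimal sketch, not part of the runtime test suite: `SuspendProbe`, `Spin`, and `WorstCollectMilliseconds` are hypothetical names, and it assumes that a blocking `GC.Collect()` is an acceptable way to force a thread suspension. The measured time also includes the collection itself, so it is only a rough upper bound on suspension latency.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical probe (an assumption for illustration, not runtime code).
public static class SuspendProbe
{
    private static volatile bool s_stop;
    private static volatile int s_a = 0x404;
    private static volatile int s_b = 0x404;

    // Tiny leaf method, kept out of line so the loop below contains real calls.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool Equal(int x, int y) => x == y;

    // Long-running loop whose only calls are to the short-lived leaf above.
    private static void Spin()
    {
        long mismatches = 0;
        while (!s_stop)
        {
            if (!Equal(s_a * s_b, s_a * s_b)) mismatches++;
        }
        GC.KeepAlive(mismatches);
    }

    // Forces 'collections' blocking GCs while the worker spins and returns
    // the slowest observed wall-clock time in milliseconds.
    public static long WorstCollectMilliseconds(int collections)
    {
        s_stop = false;
        var worker = new Thread(Spin) { IsBackground = true };
        worker.Start();
        Thread.Sleep(100); // crude warm-up so the loop is already running

        long worst = 0;
        for (int i = 0; i < collections; i++)
        {
            var sw = Stopwatch.StartNew();
            GC.Collect(); // blocking collection; managed threads must be suspended first
            sw.Stop();
            worst = Math.Max(worst, sw.ElapsedMilliseconds);
        }

        s_stop = true;
        worker.Join();
        return worst;
    }
}
```

Calling `Console.WriteLine(SuspendProbe.WorstCollectMilliseconds(20));` from a `Main` method prints the slowest of 20 forced collections. On a configuration affected by this issue the call might stall rather than return, which is the symptom described above.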
Also this statement/comment doesn't give me very much confidence that this test is reliable:
|
Also, on (some?) Linux systems there used to be an issue with hijacking, such that we need to add GC polls to the method. |
Yes, it seems a bit odd to mark a method as partially interruptible without having any places it can actually be interrupted. Still not clear to me if the jit is at fault here. We do something similar for win x64 codegen, and I can't get the test to hang there. |
From the stress log LF_SYNC data, it looks like we never managed to stop in
all the stopping points are in the caller, and (apparently) none of them is at a safepoint in the caller. |
CC VM team @mangod9, @davidwrighton, and @janvorli as an early heads-up because @AndyAyersMS suspects this is either a runtime problem or a bad test. |
VM team, please take a look. |
What's supposed to happen is that we hijack the thread running
If we did that, then we would hijack the return address. Unfortunately, from my understanding of the way instructions are handled by the various CPUs we support, I suspect the window for that happening is pretty small, as those are exceptionally cheap instructions, and the rest of the loop is heavy with a fair number of more expensive instructions such as CPU memory barriers, branches, and multiplies. |
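The "small window" argument above can be put into rough numbers with a toy model. This is only a sketch under strong assumptions: `HijackWindowModel` is a hypothetical helper, the cost figures are made up, and each suspension attempt is treated as an independent random sample of where the thread is executing, which the real retry logic need not satisfy.

```csharp
using System;

// Toy model only; every cost passed in is a made-up illustration.
public static class HijackWindowModel
{
    // Fraction of each loop iteration spent inside the hijackable window.
    public static double HitProbability(double windowCost, double restOfLoopCost)
        => windowCost / (windowCost + restOfLoopCost);

    // Chance that 'attempts' independent suspension attempts all miss the window.
    public static double AllMissProbability(double hitProbability, int attempts)
        => Math.Pow(1.0 - hitProbability, attempts);
}
```

For example, if the window accounted for 1 unit of cost out of 100 per iteration, `HitProbability(1, 99)` is 0.01, and `AllMissProbability(0.01, 100)` is about 0.37, so even 100 attempts would miss more than a third of the time. A model like this only illustrates slow suspension; it does not by itself account for a suspension that never completes.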
David looked at the stress logs and doesn't see anything we can address for .NET 6. So moving this to future. |
JIT/jit64/opt/cse/VolatileTest_op_xor also crashes on linux-arm with PGO - see console.log |
Believe this should be fixed by #95565 |
See CI Run -- this test has been failing on and off.
category:correctness
theme:profile-feedback