AV in WKS::gc_heap::relocate_address #34279

jkotas · 2020-03-30T16:24:36Z

Hit in Common.Tests on Windows x86 Checked in #34261:

ntdll!KiUserExceptionDispatcher+0xf [d:\rs1\minkernel\ntos\rtl\i386\userdisp.asm @ 597] 
coreclr!WKS::tree_search+0x6 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 24978] 
coreclr!WKS::gc_heap::relocate_address+0x5e [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 25068] 
coreclr!WKS::GCHeap::Relocate+0x45 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 36864] 
coreclr!PromoteCarefully+0x83 [F:\workspace\_work\1\s\src\coreclr\src\vm\siginfo.cpp @ 4875] 
coreclr!GcEnumObject+0x4b [F:\workspace\_work\1\s\src\coreclr\src\vm\gcenv.ee.common.cpp @ 167] 
coreclr!EECodeManager::EnumGcRefs+0xc48 [F:\workspace\_work\1\s\src\coreclr\src\vm\eetwain.cpp @ 4702] 
coreclr!GcStackCrawlCallBack+0x152 [F:\workspace\_work\1\s\src\coreclr\src\vm\gcenv.ee.common.cpp @ 290] 
coreclr!Thread::MakeStackwalkerCallback+0x48 [F:\workspace\_work\1\s\src\coreclr\src\vm\stackwalk.cpp @ 841] 
coreclr!Thread::StackWalkFramesEx+0x193 [F:\workspace\_work\1\s\src\coreclr\src\vm\stackwalk.cpp @ 917] 
coreclr!Thread::StackWalkFrames+0x159 [F:\workspace\_work\1\s\src\coreclr\src\vm\stackwalk.cpp @ 1001] 
coreclr!ScanStackRoots+0x1e9 [F:\workspace\_work\1\s\src\coreclr\src\vm\gcenv.ee.cpp @ 146] 
coreclr!GCToEEInterface::GcScanRoots+0xe4 [F:\workspace\_work\1\s\src\coreclr\src\vm\gcenv.ee.cpp @ 175] 
coreclr!WKS::gc_heap::relocate_phase+0x35 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 25922] 
coreclr!WKS::gc_heap::plan_phase+0x1aa8 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 24246] 
coreclr!WKS::gc_heap::gc1+0x173 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 16782] 
coreclr!WKS::gc_heap::garbage_collect+0x3bd [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 18414] 
coreclr!WKS::GCHeap::GarbageCollectGeneration+0x239 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 38169] 
coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x40 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 14039] 
coreclr!WKS::gc_heap::try_allocate_more_space+0x1ba [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 14157] 
coreclr!WKS::gc_heap::allocate_more_space+0x14 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 14587] 
coreclr!WKS::gc_heap::allocate+0x31 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 14618] 
coreclr!WKS::GCHeap::Alloc+0x199 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 37242] 
coreclr!Alloc+0x13e [F:\workspace\_work\1\s\src\coreclr\src\vm\gchelpers.cpp @ 231] 
coreclr!AllocateSzArray+0x2ea [F:\workspace\_work\1\s\src\coreclr\src\vm\gchelpers.cpp @ 486] 
coreclr!JIT_NewArr1+0x185 [F:\workspace\_work\1\s\src\coreclr\src\vm\jithelpers.cpp @ 2718] 
System_Private_CoreLib!System.Span`1[[System.Byte, System.Private.CoreLib]].ToArray()+0x9a6b2fa1 [/_/src/libraries/System.Private.CoreLib/src/System/Span.cs @ 477] 
Common_Tests!System.Net.Test.Common.VirtualNetworkStream.Write(Byte[], Int32, Int32)+0x67
Common_Tests!System.Net.Test.Common.VirtualNetworkStreamTest.VirtualNetworkStream_SingleThreadIntegrityTest_Ok()+0x137

The text was updated successfully, but these errors were encountered:

jkotas · 2020-03-30T16:27:27Z

Crash while stackwalking Interop+Kernel32.GetFullPathNameW(Char ByRef, UInt32, Char ByRef, IntPtr)

jkotas · 2020-03-30T16:30:08Z

@fadimounir This looks like a duplicate of #31809

fadimounir · 2020-03-30T16:34:05Z

Thanks @jkotas, I will take a look. Are there crash dumps in the failures since I wasn't able to repro this locally?

jkotas · 2020-03-30T16:34:40Z

I have saved the dump and binaries at \jkotas9\drops\issue-34279 so that it won't get lost.

fadimounir · 2020-04-01T19:05:19Z

I took a look at the dump, and it looks like we're enumerating the gs cookie of the InlinedCallFrame as a gcref object. Still not sure why

fadimounir · 2020-04-02T00:45:19Z

@jkotas I've been able to trace the root cause of this issue. The problem is here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/i386/PInvokeStubs.asm#L58

We're using the return address from the call to the JIT_PInvokeBegin helper, and not the return address of the actual PInvoke. The effect of this is that we report GC roots using incorrect GC data. I've been able to manually get a similar state under debugger, and saw that we were enumerating the gs cookie of the InlinedCallFrame. When I manually got this state under debugger however, the Promote() function was able to ignore this bad value, since it detected it wasn't a valid address in the GC heaps. In the crash dump though, things were not as lucky (maybe the gs cookie value we use in checked builds (0x12345678) fell between the low/high range of the gc heap?)

Anyways, I did test under debugger what the correct behavior would have been by manually patching the return address in the InlinedCallFrame to the correct one, and the GC refs enumeration seemed to be doing the correct thing (nothing was being enumerated from the pinvoke frame at that point).

I think the fix is to pass the correct return address value as a parameter to the JIT_PInvokeBegin helper here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/lower.cpp#L3635

@dotnet/jit-contrib I need some help figuring out how to do that.

fadimounir · 2020-04-02T00:46:06Z

BTW, the bug exists on all architectures, all platforms, and not just on Windows-x86. It's just extremely rare that it would ever fail

jkotas · 2020-04-02T01:54:38Z

I think the fix is to pass the correct return address value as a parameter to the JIT_PInvokeBegin helper

That is one option. Other option is to fix the GCInfo reporting so that the GCInfo at the JIT_PInvokeBegin IP is same as the GCInfo at the actual PInvoke.

fadimounir · 2020-04-03T20:39:47Z

@AndyAyersMS could you please forward this issue to someone on your team for a fix? Thanks!

AndyAyersMS · 2020-04-03T21:31:06Z

cc @BruceForstall

AndyAyersMS · 2020-04-11T01:19:10Z

@fadimounir do you still have the GC info reported for the method? Looking at the jit code I don't see offhand how the GC state could change between the PINVOKE_BEGIN and the actual pinvoke.

jkotas · 2020-04-11T03:15:11Z

This is specific to R2R codegen. If you would like to debug it under JIT, you can set flags.Set(CORJIT_FLAGS::CORJIT_FLAG_USE_PINVOKE_HELPERS); flag. It will hit the same problem.

The GC Info looks like this:

GC info 142CFEB8
Method info block:
    method      size   = 0065
    prolog      size   = 20 
    epilog      size   = 10 
    epilog     count   =  1 
    epilog      end    = yes  
    callee-saved regs  = EDI ESI EBX EBP 
    ebp frame          = yes  
    fully interruptible= no  
    double align       = no  
    arguments size     =  2 DWORDs
    stack frame size   = 12 DWORDs
    untracked count    =  2 
    var ptr tab count  =  0 
    epilog        at   005B
65 AA DC AC F4 | 
BD BE 3F       | 

Pointer table:
0D             |             [EBP-10H] an untracked pinned byref local
04             |             [EBP-14H] an untracked pinned byref local
F9 26 C0 C0    | 0026        call [ ESI'EBX'] argMask=00
F9 1F 05 05    | 0045        call [ ] argMask=05 (iargs=05)
FF             |

The problem is the F9 1F 05 05 | 0045 call [ ] argMask=05 (iargs=05). It corresponds to PINVOKE_BEGIN call. PINVOKE_BEGIN should have no arg mask, just like the actual call that does the PInvoke.

jkotas · 2020-04-14T14:38:43Z

I am trying to understand why we are doing the GC reporting for CORINFO_HELP_JIT_PINVOKE_BEGIN call, but not the actual PInvoke call instruction. Is the problem that the actual PInvoke arguments are pushed on the stack when the helper is called? If it is the case, would it help to emit the CORINFO_HELP_JIT_PINVOKE_BEGIN helper call before the arguments are pushed to the stack?

Thinking about consequences of adding yet another argument for the CORINFO_HELP_JIT_PINVOKE_BEGIN helper: It will mean that the helper is going to take 3 arguments on x86 (PInvoke Frame address, return address, stack size to pop). One of the arguments is going to be passed via the stack. Is it going to cause more problems with making sure that the helper can accurately capture the bits of context that it needs?

AndyAyersMS · 2020-04-14T16:33:53Z

In the case I'm looking at we currently don't emit anything for the pinvoke call because of this early out in emitRecordGCcall:

runtime/src/coreclr/src/jit/emit.cpp

Lines 6078 to 6087 in 4335c82

    
           #ifdef JIT32_GCENCODER 
        
               // The JIT32 GCInfo encoder allows us to (as the comment previously here said): 
        
               // "Bail if this is a totally boring call", but the GCInfoEncoder/Decoder interface 
        
               // requires a definition for every call site, so we skip these "early outs" when we're 
        
               // using the general encoder. 
        
               if (regs == 0) 
        
               { 
        
           #if EMIT_TRACK_STACK_DEPTH 
        
                   if (emitCurStackLvl == 0) 
        
                       return;

AndyAyersMS · 2020-04-14T17:54:13Z

If I disable the early bail out then we see reports for the pinvoke and the pivoke end calls:

Pointer table:
0D             |             [EBP-10H] an untracked pinned byref local
04             |             [EBP-14H] an untracked pinned byref local
F9 32 03 03    | 0032        call [ ] argMask=03 (iargs=03)
02             | 0034        call [ ] argMask=00
0B             | 003F        call [ ] argMask=00
FF

I don't understand how the early bail out is safe. Seems like the logic should be tracking whether the call we're looking at has different GC reporting than the previous call, or something like that.

briansull · 2020-04-14T18:35:16Z

I thought that the way that partially interruptible GC reporting worked on x86 was that if nothing is recorded for a particular offset then nothing get reported as being alive. It should not be reporting what it saw alive at a earlier callsite. Fully interruptible does keep updating the state and reports what it has incrementally computed as being alive. But that isn't how partially interruptible works.
It just looks up the GC info at the exact address and reports whatever is there. If there isn't anything for the current offset then nothing is reported as being alive.

AndyAyersMS · 2020-04-14T18:48:07Z

Is the problem that the actual PInvoke arguments are pushed on the stack when the helper is called?

This is why the pinvoke begin gc descriptor has a nonzero arg mask, yes.

Maybe the issue is that the pinvoke args are byrefs and not native ints?

IL_0010  07                ldloc.1     
IL_0011  08                ldloc.2     
IL_0012  28 01 00 00 06    call         0x6000001
IL_0017  20 28 00 00 00    ldc.i4       0x28
IL_001c  58                add         
IL_001d  4d                ldind.i     
IL_001e  4d                ldind.i     
IL_001f  29 ff ff ff 11    calli        0x11FFFFFF

;  V03 loc1            byref  pinned
;  V04 loc2            byref  pinned

               [000021] -ACXG-------              *  ASG       int   
               [000020] D------N----              +--*  LCL_VAR   int    V06 loc4         
               [000019] --CXG-------              \--*  CALL ind unman int   
               [000010] ------------ arg0            +--*  LCL_VAR   byref  V04 loc2         
               [000009] ------------ arg1            +--*  LCL_VAR   byref  V03 loc1         
               [000018] ------------ calli tgt       \--*  LCL_VAR   int    V08 tmp1

Seems like the IL or the jit could cast these to non-gc types.

briansull · 2020-04-14T19:05:26Z

What is the signature of the callsite? I believe that is what determines if it reports a byref or not.

jkotas · 2020-04-14T19:56:35Z

Maybe the issue is that the pinvoke args are byrefs and not native ints?

Yes, good point. The ECMA specs says:

When a method is called, an unmanaged pointer (type native unsigned int or *) is
permitted to match a parameter that requires a managed pointer (type &). The reverse,
however, is not permitted since it would allow a managed pointer to be “lost” by the
memory manager.

I guess that we can start with declaring this to be invalid IL, and fix the IL stub generation accordingly.

jkotas · 2020-04-14T20:00:13Z

But it would be also useful to add at least an assert to the JIT that fires when GC tracked value is passed into PInvoke.

AndyAyersMS · 2020-04-14T21:36:39Z

Looks like the pinvoke call site sig has nativeint, but the jit doesn't do anything with this particular mismatch (see impPopCallArgs).

Seems like this mostly happens with pinvokes that take byrefs.

Assert is something like:

@@ -6737,9 +6737,16 @@ void Compiler::impPopArgsForUnmanagedCall(GenTree* call, CORINFO_SIG_INFO* sig)
         assert(thisPtr->TypeGet() == TYP_I_IMPL || thisPtr->TypeGet() == TYP_BYREF);
     }

-    for (GenTreeCall::Use& use : GenTreeCall::UseList(args))
+    for (GenTreeCall::Use& argUse : GenTreeCall::UseList(args))
     {
-        call->gtFlags |= use.GetNode()->gtFlags & GTF_GLOB_EFFECT;
+        // We should not be passing gc typed args to an unmanaged call.
+        GenTree* arg = argUse.GetNode();
+        if (varTypeIsGC(arg->TypeGet()))
+        {
+            assert(!"*** invalid IL: gc type passed to unmanaged call");
+        }
+
+        call->gtFlags |= arg->gtFlags & GTF_GLOB_EFFECT;

jkotas · 2020-04-14T21:41:23Z

@jkoritzinsky @AaronRobinsonMSFT The interop stub generators are generating invalid IL (passing byref values as pointer arguments) that is hitting subtle issues in the JIT.

Can you please take a look at fixing the stub generators and adding the assert that Andy suggested?

AndyAyersMS · 2020-04-14T21:56:28Z

I see 93 cases when crossgenning SPC for x86 (changing the assert to a printf) -- some of these stubs have multiple byref args, so maybe 50 stubs in all?

*** invalid IL: gc type passed to unmanaged call, in ILStubClass:IL_STUB_PInvoke(System.String,byref,int):int
*** invalid IL: gc type passed to unmanaged call, in ILStubClass:IL_STUB_PInvoke(System.String,int,byref,int,int,int):Microsoft.Win32.SafeHandles.SafeFindHandle
*** invalid IL: gc type passed to unmanaged call, in ILStubClass:IL_STUB_PInvoke(byref,byref):int
... etc ...

AaronRobinsonMSFT · 2020-04-14T22:11:21Z

@jkotas and @AndyAyersMS Will take over issue.

Make the jit more robust in cases where the IL producer is passing a byref to an unmanaged caller, by retyping the argument as native int. Allows the jit to produce self-consistent GC info and avoid the issues seen in dotnet#34279, at least for byrefs. Closes dotnet#39040.

Make the jit more robust in cases where the IL producer is passing a byref to an unmanaged caller, by retyping the argument as native int. Allows the jit to produce self-consistent GC info and avoid the issues seen in #34279, at least for byrefs. Closes #39040.

mjsabby · 2020-07-13T04:38:01Z

A couple of questions:

(1) Does this not impact x64 for some reason? I don't understand why not if that is the case.
(2) Should we add this to IL Verification also?

I'm trying to determine if I'm hitting this or not, and I don't have a good way to audit all my generated IL at runtime in a system where this is happening.

AndyAyersMS · 2020-07-15T00:32:52Z

I would guess it is x86 specific. x86 uses a different gc encoding and may rely more heavily on the pinvoke signature to make sense of the stack.

I can get you a version of the jit that will (optionally) fail with "invalid program" errors instead of asserts or silent casts, if you want to look for this sort of IL in your apps by running tests.

mjsabby · 2020-07-15T04:35:01Z

@AndyAyersMS I see you've made a change that ignores this assert. I suppose if I use that as a baseline build for my testing I can assume that it is not the problem. And you've also said it doesn't seem likely on x64.

AndyAyersMS · 2020-07-15T19:56:39Z

I see you've made a change that ignores this assert

Yes, more or less, at least for byrefs, motivated by IL from C++/CLI that passes byrefs to unmanaged callers.

We're hoping that the IL producer has somehow guaranteed that the byref won't change values across a GC, and want to make sure we don't cause the program to fail for other reasons.

We could just reject the IL, but since we've tolerated this for a long time it would likely cause more trouble than it's worth. But you might want to do that if you are able/willing to fix up your IL producers.

you've also said it doesn't seem likely on x64.

I might be wrong.... @jkotas ?

jkotas · 2020-07-15T20:02:09Z

I agree with your statement: "It doesn't seem likely on x64". We have not seen an actual problem caused by it on x64.

Dotnet-GitSync-Bot added area-GC-coreclr untriaged New issue has not been triaged by the area owner labels Mar 30, 2020

jkotas added area-ReadyToRun-coreclr arch-x86 os-windows and removed area-GC-coreclr labels Mar 30, 2020

fadimounir added bug GCStress non-deterministic and removed arch-x86 os-windows labels Apr 2, 2020

fadimounir self-assigned this Apr 2, 2020

fadimounir removed the untriaged New issue has not been triaged by the area owner label Apr 2, 2020

fadimounir assigned AndyAyersMS and unassigned fadimounir Apr 3, 2020

BruceForstall added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-ReadyToRun-coreclr labels Apr 8, 2020

BruceForstall added this to the 5.0 milestone Apr 8, 2020

jkotas added area-Interop-coreclr and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Apr 14, 2020

AaronRobinsonMSFT self-assigned this Apr 14, 2020

AaronRobinsonMSFT mentioned this issue Apr 15, 2020

Fix all runtime IL generation paths from passing GC types to unmanaged function calls. #35026

Merged

This was referenced Apr 17, 2020

System.IO.FileSystem.Tests crashed in CI #35117

Closed

[x86 Windows] GC stackwalking crash with R2R PInvoke frames #31809

Closed

ghost closed this as completed in #35026 Apr 18, 2020

ooooolivia mentioned this issue May 8, 2020

Turn weblarge tests back on dotnet/performance#1319

Merged

jkotas mentioned this issue Jul 10, 2020

Implicit conversion from uint32& to uint32* should be reconsidered #39040

Closed

AndyAyersMS mentioned this issue Jul 10, 2020

JIT: retype byrefs passed to unmanaged callers as native int #39105

Merged

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AV in WKS::gc_heap::relocate_address #34279

AV in WKS::gc_heap::relocate_address #34279

jkotas commented Mar 30, 2020 •

edited

Loading

jkotas commented Mar 30, 2020

jkotas commented Mar 30, 2020

fadimounir commented Mar 30, 2020

jkotas commented Mar 30, 2020

fadimounir commented Apr 1, 2020

fadimounir commented Apr 2, 2020

fadimounir commented Apr 2, 2020 •

edited

Loading

jkotas commented Apr 2, 2020

fadimounir commented Apr 3, 2020

AndyAyersMS commented Apr 3, 2020

AndyAyersMS commented Apr 11, 2020

jkotas commented Apr 11, 2020

jkotas commented Apr 14, 2020

AndyAyersMS commented Apr 14, 2020

AndyAyersMS commented Apr 14, 2020

briansull commented Apr 14, 2020 •

edited

Loading

AndyAyersMS commented Apr 14, 2020

briansull commented Apr 14, 2020

jkotas commented Apr 14, 2020

jkotas commented Apr 14, 2020

AndyAyersMS commented Apr 14, 2020

jkotas commented Apr 14, 2020

AndyAyersMS commented Apr 14, 2020

AaronRobinsonMSFT commented Apr 14, 2020

mjsabby commented Jul 13, 2020 •

edited

Loading

AndyAyersMS commented Jul 15, 2020

mjsabby commented Jul 15, 2020

AndyAyersMS commented Jul 15, 2020

jkotas commented Jul 15, 2020

AV in WKS::gc_heap::relocate_address #34279

AV in WKS::gc_heap::relocate_address #34279

Comments

jkotas commented Mar 30, 2020 • edited Loading

jkotas commented Mar 30, 2020

jkotas commented Mar 30, 2020

fadimounir commented Mar 30, 2020

jkotas commented Mar 30, 2020

fadimounir commented Apr 1, 2020

fadimounir commented Apr 2, 2020

fadimounir commented Apr 2, 2020 • edited Loading

jkotas commented Apr 2, 2020

fadimounir commented Apr 3, 2020

AndyAyersMS commented Apr 3, 2020

AndyAyersMS commented Apr 11, 2020

jkotas commented Apr 11, 2020

jkotas commented Apr 14, 2020

AndyAyersMS commented Apr 14, 2020

AndyAyersMS commented Apr 14, 2020

briansull commented Apr 14, 2020 • edited Loading

AndyAyersMS commented Apr 14, 2020

briansull commented Apr 14, 2020

jkotas commented Apr 14, 2020

jkotas commented Apr 14, 2020

AndyAyersMS commented Apr 14, 2020

jkotas commented Apr 14, 2020

AndyAyersMS commented Apr 14, 2020

AaronRobinsonMSFT commented Apr 14, 2020

mjsabby commented Jul 13, 2020 • edited Loading

AndyAyersMS commented Jul 15, 2020

mjsabby commented Jul 15, 2020

AndyAyersMS commented Jul 15, 2020

jkotas commented Jul 15, 2020

jkotas commented Mar 30, 2020 •

edited

Loading

fadimounir commented Apr 2, 2020 •

edited

Loading

briansull commented Apr 14, 2020 •

edited

Loading

mjsabby commented Jul 13, 2020 •

edited

Loading