Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Profiler use after release on process exit #11885

Closed
iskiselev opened this issue Jan 24, 2019 · 13 comments
Closed

Profiler use after release on process exit #11885

iskiselev opened this issue Jan 24, 2019 · 13 comments
Assignees
Milestone

Comments

@iskiselev
Copy link

CoreCLR version: 2.1, 2.2 (probably all of them)
OS: Ubuntu/CentOs/MacOs (probably any non-Windows)

On application exit CLR sometimes crashes with SEGFAULT if profiler has subscribed to COR_PRF_MONITOR_JIT_COMPILATION.
Based on coredump analysis crash happened in EEToProfInterfaceImpl::JITCompilationStarted(unsigned long, int) ().
Looks like it is possible to crash on other profiler callback too - but was not able to stable reproduce with other, but colleague reported that it was observed once on ModuleUnloadFinished.
We've started investigation on it, as have seen exit code 1 sometimes from dotnet test run with XUnit test, after test reported that all test passed successfully on Linux CI environment. We've never seen it on Windows CI environment.

After further investigation, looks like it is enough that profiler will do nothing except:

  • SetEventMask(COR_PRF_MONITOR_JIT_COMPILATION | COR_PRF_DISABLE_ALL_NGEN_IMAGES | COR_PRF_DISABLE_OPTIMIZATIONS | COR_PRF_DISABLE_INLINING) inside Initialize
  • return S_OK for all callbacks
  • proper implement AddRef / Release with deleting itself if ref count = 0.

The issue can be stable reproduced with XUnit tests (one test that will do nothing is enough).
After first execution of dotnet test, I've repeated execution of dotnet exec /usr/share/dotnet/sdk/<version>/vstest.console.dll --framework:.NETCoreApp,Version=v2.0 --logger:trx --Diag:TestResults/testDiagnostics.txt <path_to_test>/Release/netcoreapp2.0/<testname>.dll
It fails after ~20 executions with segfault and exit code 139.

I can provide sample application/profiler, but it will require some additional work on my side.

When I've looked on coreclr sources, I've found very suspicious place here: https://github.com/dotnet/coreclr/blob/a28b25aacdcd2adb0fdfa70bd869f53ba6565976/src/vm/profilinghelper.cpp#L1304
IsProfilerEvacuated check is done only when FEATURE_PROFAPI_ATTACH_DETACH enabled, which may be a hint why have we seen problem only on non-Windows OS.

CC: @noahfalk

@iskiselev iskiselev changed the title Profiler use after release Profiler use after release on process exit Jan 24, 2019
@noahfalk
Copy link
Member

Thanks for the report @iskiselev! You mentioned analyzing a coredump, do you have a stack trace where the failure occurred? I think we could set something up given the repro instructions you provided but we'll need to make a little time to look into it.

@noahfalk
Copy link
Member

noahfalk commented Feb 1, 2019

Hi again @iskiselev, I set up a test that seemed similar to what you described but unfortunately I haven't been able to repro the problem yet. If you have any more details to share that would be helpful.

I can believe that enabling the attach/detach logic you saw there would remedy the AV, but doing it that way suggests we might be leaving other symptoms of the problem behind. For example the profiler might now miss events at the end of the process that it should have been receiving.

I don't mind using the attach/detach mechanism as an extra line of defence and leaving the rest unresolved if we have to, but given that you had a repro in hand it would great to understand the root cause if possible. Thanks!

@iskiselev
Copy link
Author

I've prepared test sample here: https://github.com/iskiselev/clr_profiler_segfault
You'll need to run run.sh. It requires to have CoreCLR SDK and Clang installed.
It is based on https://github.com/Microsoft/clr-samples/tree/master/ProfilingAPI/ELTProfiler

The key thing to increase probability of crash was adding console output to profiler destructor: https://github.com/iskiselev/clr_profiler_segfault/blob/38ec09bf84d2554bd6757ae73ed3d7bcd0895c9c/profiler/CorProfiler.cpp#L16

@iskiselev
Copy link
Author

Hi @noahfalk, if you still need it - I can provide crash dump additionally, but hope you'll be able to reproduce it yourself.

@noahfalk
Copy link
Member

noahfalk commented Feb 9, 2019

Thanks @iskiselev, its been a long time since anyone gave me such an easy to use repro! I really appreciate it. After some false starts I did manage to observe the crash, capture a core dump, and get it under a debugger. A few things have to line up timing-wise to get the failure, but when they do we get one thread doing shutdown:

libc.so.6`__sleep // added by me after the delete to enlarge the race window on this thread
CorProfiler.so`CorProfiler::Release()
libcoreclr.so`EEToProfInterfaceImpl::~EEToProfInterfaceImpl
libcoreclr.so`ProfilingAPIUtility::TerminateProfiling
libcoreclr.so`EEShutDownHelper(fIsDllUnloading=NO)
libcoreclr.so`EEShutDown
...

And another managed thread finishing a JIT compilation and issuing a callback:

libcoreclr.so`EEToProfInterfaceImpl::JITCompilationFinished
libcoreclr.so`MethodDesc::JitCompileCodeLockedEventWrapper
libcoreclr.so`MethodDesc::JitCompileCode
libcoreclr.so`MethodDesc::PrepareILBasedCode
libcoreclr.so`MethodDesc::DoPrestub
libcoreclr.so`::PreStubWorker
libcoreclr.so`ThePreStub

SOS DumpMD shows we were compiling token 6000057 in Microsoft.TestPlatform.CommunicationUtilities.dll but failed to resolve it to an exact method name. We can work it out later if it matters, but its probably just luck of the draw which method gets hit in the race window.

I think we can put together a fix based on this and reviewing the code it appears the runtime is shutting down as expected overall, just the profiler was left exposed to the unexpected callbacks. Should I assume you are interested in having this fix ported to servicing branches for already released runtime versions?

Unfortunately I don't see much that could be done to workaround this prior to getting a runtime fix. Even if you leaked your implementation of ICorProfilerCallback, the runtime is going to delete its EEToProfilerInterface wrapper object and NULL out the pointer to it so an AV will still occur if a callback happens within the race window. You could mitigate the issue somewhat by minimizing the time in that race window which starts here:
https://github.com/dotnet/coreclr/blob/master/src/vm/profilinghelper.cpp#L1329
and ends here:
https://github.com/dotnet/coreclr/blob/master/src/vm/profilinghelper.cpp#L1359
Mainly that implies minimizing the time you spend between deleting the ICorProfilerCallback implementation and exiting the call to ICorProfilerCallback::Release(). I suspect deleting the memory is already the last thing you do before returning, so there is unlikely more benefit to be had.

Thanks for reporting this and let me know about the servicing interest and any other questions. I'll look into a fix and let you know when I've got more info.

@iskiselev
Copy link
Author

iskiselev commented Feb 16, 2019

Thank you for information, @noahfalk. Based on your findings I suspect that windows .Net Core and .Net Framework are also affected by the issue. Is it correct? (I've not done tests to confirm it).

We are very interested this issue to be fixed both in .Net Core and .Net Framework future releases.
We will be glad if this fix will be ported to servicing branches for already released runtime versions, but we can not insist on it. Looks like this issue has low probability and I would not expect a lot of serious consequences from this issue. I'll update this issue if we will have real cases that would be greatly benefit of back-porting it to servicing branches.

noahfalk referenced this issue in noahfalk/coreclr Feb 20, 2019
Fixes issue #22176. Use the profiler evacuation counters to ensure that we
don't callback into the profiler when it has already been released.
Previously we only did this as part of the attach/detach feature, but this
is required for correctness during standard shutdown given that managed
threads are still running concurrently.

Thanks to @iskiselev for a very nice repro as well some accurate analysis
of the runtime that made finding and fixing this much easier!
@noahfalk
Copy link
Member

Based on your findings I suspect that windows .Net Core and .Net Framework are also affected by the issue. Is it correct?

The opposite I think. .Net Framework and .Net Core on Windows both appear unaffected by this issue because enabling FEATURE_PROFAPI_ATTACH_DETACH caused the evacuations counters to be used. Doing those checks prevents the profiler from being released if a callback is in progress, as well as preventing future callbacks after the profiler has been released.

We are very interested this issue to be fixed both in .Net Core and .Net Framework future releases.

I've just opened PR dotnet/coreclr#22712 that should resolve the issue for .Net Core. I can't repro the failure any longer but of course please let me know if you see any sign that the issue wasn't fixed.

@iskiselev
Copy link
Author

iskiselev commented Feb 20, 2019

@noahfalk, thank you for confirmation that, as I originally supposed, windows builds should not be affected by issue. I misunderstand your statement that pointed on race condition lines that have not included IsEvacuated check.

@davmason davmason assigned davmason and unassigned noahfalk Apr 25, 2019
davmason referenced this issue in dotnet/coreclr May 22, 2019
Fixes issue #22176. Use the profiler evacuation counters to ensure that we
don't callback into the profiler when it has already been released.
Previously we only did this as part of the attach/detach feature, but this
is required for correctness during standard shutdown given that managed
threads are still running concurrently.
@noahfalk
Copy link
Member

@davmason - The fix for this was merged right? Good to close?

@davmason
Copy link
Member

Yes, thanks

@k15tfu
Copy link
Contributor

k15tfu commented Nov 5, 2019

Hi! I faced with the same problem on .NET Core 2.2.7. We run CI tests and periodically they fail with exit code 139.

SIGSEGV happens in EEToProfInterfaceImpl::JITCompilationFinished:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f5cf229471d in EEToProfInterfaceImpl::JITCompilationFinished(unsigned long, int, int) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
[Current thread is 1 (Thread 0x7f5c5cda4700 (LWP 22665))]
(gdb) bt
#0  0x00007f5cf229471d in EEToProfInterfaceImpl::JITCompilationFinished(unsigned long, int, int) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#1  0x00007f5cf23d16b3 in MethodDesc::JitCompileCodeLockedEventWrapper(PrepareCodeConfig*, ListLockEntryBase<NativeCodeVersion>*) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#2  0x00007f5cf23d0e50 in MethodDesc::JitCompileCode(PrepareCodeConfig*) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#3  0x00007f5cf23d0bd7 in MethodDesc::PrepareILBasedCode(PrepareCodeConfig*) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#4  0x00007f5cf23d31d5 in MethodDesc::DoPrestub(MethodTable*) () from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#5  0x00007f5cf23d2b04 in PreStubWorker () from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#6  0x00007f5cf2353e64 in ThePreStub () from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#7  0x00007f5cf235317f in CallDescrWorkerInternal () from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#8  0x00007f5cf227167a in MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#9  0x00007f5cf23f63a2 in AppDomainTimerCallback_Worker(void*) () from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#10 0x00007f5cf22427df in ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#11 0x00007f5cf2242f90 in ManagedThreadBase::ThreadPool(ADID, void (*)(void*), void*) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#12 0x00007f5cf23f64a8 in AppDomainTimerCallback(void*, unsigned char) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#13 0x00007f5cf22651e7 in ThreadpoolMgr::AsyncTimerCallbackCompletion(void*) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#14 0x00007f5cf23d8ae6 in UnManagedPerAppDomainTPCount::DispatchWorkItem(bool*, bool*) ()
   from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#15 0x00007f5cf2261903 in ThreadpoolMgr::WorkerThreadStart(void*) () from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#16 0x00007f5cf25e0525 in CorUnix::CPalThread::ThreadEntry(void*) () from /tmp/DotNetTestCodeCompiler_Output/Pytepaj/libcoreclr.so
dotnet/coreclr#17 0x00007f5cf3ba76db in start_thread (arg=0x7f5c5cda4700) at pthread_create.c:463
dotnet/coreclr#18 0x00007f5cf2f9188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

disasm:

(gdb) disas $pc
Dump of assembler code for function EEToProfInterfaceImpl::JITCompilationFinished(unsigned long, int, int):
   0x00007f5cf22946c0 <+0>:     push   %rbp
   0x00007f5cf22946c1 <+1>:     mov    %rsp,%rbp
   0x00007f5cf22946c4 <+4>:     push   %r15
   0x00007f5cf22946c6 <+6>:     push   %r14
   0x00007f5cf22946c8 <+8>:     push   %r13
   0x00007f5cf22946ca <+10>:    push   %r12
   0x00007f5cf22946cc <+12>:    push   %rbx
   0x00007f5cf22946cd <+13>:    push   %rax
   0x00007f5cf22946ce <+14>:    mov    %ecx,%r14d
   0x00007f5cf22946d1 <+17>:    mov    %edx,%r15d
   0x00007f5cf22946d4 <+20>:    mov    %rsi,%r12
   0x00007f5cf22946d7 <+23>:    mov    %rdi,%r13
   0x00007f5cf22946da <+26>:    mov    0x5d970f(%rip),%rax        # 0x7f5cf286ddf0
   0x00007f5cf22946e1 <+33>:    mov    0x10(%rax),%ecx
   0x00007f5cf22946e4 <+36>:    xor    %eax,%eax
   0x00007f5cf22946e6 <+38>:    cmp    $0x4,%ecx
   0x00007f5cf22946e9 <+41>:    jne    0x7f5cf2294740 <EEToProfInterfaceImpl::JITCompilationFinished(unsigned long, int, int)+128>
   0x00007f5cf22946eb <+43>:    data16 lea 0x5d90fd(%rip),%rdi        # 0x7f5cf286d7f0
   0x00007f5cf22946f3 <+51>:    data16 data16 callq 0x7f5cf2190be0 <__tls_get_addr@plt>
   0x00007f5cf22946fb <+59>:    mov    (%rax),%rbx
   0x00007f5cf22946fe <+62>:    xor    %ecx,%ecx
   0x00007f5cf2294700 <+64>:    test   %rbx,%rbx
   0x00007f5cf2294703 <+67>:    je     0x7f5cf2294716 <EEToProfInterfaceImpl::JITCompilationFinished(unsigned long, int, int)+86>
   0x00007f5cf2294705 <+69>:    mov    0x404(%rbx),%ecx
   0x00007f5cf229470b <+75>:    mov    %ecx,%eax
   0x00007f5cf229470d <+77>:    or     $0x3,%eax
   0x00007f5cf2294710 <+80>:    mov    %eax,0x404(%rbx)
   0x00007f5cf2294716 <+86>:    mov    %ecx,-0x2c(%rbp)
   0x00007f5cf2294719 <+89>:    mov    0x0(%r13),%rdi
=> 0x00007f5cf229471d <+93>:    mov    (%rdi),%rax
   0x00007f5cf2294720 <+96>:    mov    0xc0(%rax),%rax
   0x00007f5cf2294727 <+103>:   mov    %r12,%rsi
   0x00007f5cf229472a <+106>:   mov    %r15d,%edx
   0x00007f5cf229472d <+109>:   mov    %r14d,%ecx
   0x00007f5cf2294730 <+112>:   callq  *%rax
   0x00007f5cf2294732 <+114>:   test   %rbx,%rbx
   0x00007f5cf2294735 <+117>:   je     0x7f5cf2294740 <EEToProfInterfaceImpl::JITCompilationFinished(unsigned long, int, int)+128>
   0x00007f5cf2294737 <+119>:   mov    -0x2c(%rbp),%ecx
---Type <return> to continue, or q <return> to quit---
   0x00007f5cf229473a <+122>:   mov    %ecx,0x404(%rbx)
   0x00007f5cf2294740 <+128>:   add    $0x8,%rsp
   0x00007f5cf2294744 <+132>:   pop    %rbx
   0x00007f5cf2294745 <+133>:   pop    %r12
   0x00007f5cf2294747 <+135>:   pop    %r13
   0x00007f5cf2294749 <+137>:   pop    %r14
   0x00007f5cf229474b <+139>:   pop    %r15
   0x00007f5cf229474d <+141>:   pop    %rbp
   0x00007f5cf229474e <+142>:   retq
   0x00007f5cf229474f <+143>:   test   %rbx,%rbx
   0x00007f5cf2294752 <+146>:   je     0x7f5cf229475d <EEToProfInterfaceImpl::JITCompilationFinished(unsigned long, int, int)+157>
   0x00007f5cf2294754 <+148>:   mov    -0x2c(%rbp),%ecx
   0x00007f5cf2294757 <+151>:   mov    %ecx,0x404(%rbx)
   0x00007f5cf229475d <+157>:   mov    %rax,%rdi
   0x00007f5cf2294760 <+160>:   callq  0x7f5cf2190ad0 <_Unwind_Resume@plt>
End of assembler dump.

Should I assume you are interested in having this fix ported to servicing branches for already released runtime versions?

@noahfalk @davmason Is it still possible? Support for .NET Core 2.2 ends in December. I'm very interested in stable tests =))

@k15tfu
Copy link
Contributor

k15tfu commented Jan 29, 2020

@noahfalk @davmason Okay. What about fixing this in 2.1 release?

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@k15tfu
Copy link
Contributor

k15tfu commented Feb 11, 2020

@davmason Friendly ping.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants