Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pProfInterface is used after being freed by TerminateProfiling #13413

Closed
echesakov opened this issue Sep 12, 2019 · 11 comments · Fixed by dotnet/coreclr#26762
Closed

pProfInterface is used after being freed by TerminateProfiling #13413

echesakov opened this issue Sep 12, 2019 · 11 comments · Fixed by dotnet/coreclr#26762

Comments

@echesakov
Copy link
Contributor

It seems that pProfInterface can be accessed by ProfileEnter/ProfileLeave/ProfileTailcall callbacks after TerminateProfiling() was called.

I was running our tests Pri1 with a sample profiler (https://github.com/microsoft/clr-samples/tree/master/ProfilingAPI/ELTProfiler) on linux-arm64 against my PR dotnet/coreclr#26460 and I hit segmentation faults in GC/Scenarios/FinalizeTimeout test.

The issue can also be reproduced on linux-x64.

What happens:

  1. ProfilingAPIUtility::TerminateProfiling() is called during system shutdown. After that point pProfInterface is freed and set to nullptr.
  2. ProfileEnter or ProfileLeave callback is called by https://github.com/dotnet/coreclr/blob/9c8ba7773e506db05016f1c278a7a1ea27816dbf/tests/src/GC/Scenarios/FinalizeTimeout/FinalizeTimeout.cs#L75
    from a finalizer thread.
  3. Asserts when tries to access pProfInterface (e.g. https://github.com/dotnet/coreclr/blob/master/src/vm/proftoeeinterfaceimpl.cpp#L10259)

Steps to repro:

  1. build CorProfiler.so - I suggest to alter the code in a sample profiler by removing the printf-s in EnterStub, LeaveStub and TailcallStub in CorProfiler.cpp to avoid cluttering your output.
  2. create file FinalizeTimeoutUnderProfiles.lldb with the following steps:
b ProfilingAPIUtility::LoadProfiler
r
watchpoint set expression -- &g_profControlBlock.pProfInterface.m_val
cont
cont
  1. set the environment
export CORECLR_ENABLE_PROFILING=1
export CORECLR_PROFILER={cf0d821e-299b-5307-a3d8-b283c03916dd}
export CORECLR_PROFILER_PATH=<Path to CorProfiler.so>
export CORE_ROOT=<Path to coreclr/bin/tests/Linux.x64.Checked/Tests/Core_Root>
  1. run the test under LLDB
lldb-8 -s FinalizeTimeoutUnderProfiles.lldb -- $CORE_ROOT/corerun $CORE_ROOT/../../GC/Scenarios/FinalizeTimeout/FinalizeTimeout/FinalizeTimeout.dll

You will see that the debugger will stop at two watchpoints - the first time is during ProfilingAPIUtility::LoadProfiler, the second time - during ProfilingAPIUtility::TerminateProfiling().

Then it will stop during SIGSEGV.

@davmason @noahfalk Is this a supported scenario for profiling?
cc @dotnet/dotnet-diag

Related issues: #11885 dotnet/coreclr#22712

@davmason
Copy link
Member

Thanks @echesakovMSFT, this is probably a bug in the runtime but i need to do a little investigating to make sure. The profiler is not expected to ever unregister the ELT hooks (in fact I think we explicitly block them from doing so). We would run finalizers on shutdown on desktop, so I am curious to know if anything's change or if this is a longstanding bug.

@echesakov
Copy link
Contributor Author

Similar race is observed in CoreMangLib/system/span/RefStructWithSpan test where the ELT hooks is called from a background thread after pProfInterface is destroyed.

@viewizard
Copy link
Member

@davmason with patch you provide I have same issue on exit, but with a bit different trace:

Thread 18 received signal SIGSEGV, Segmentation fault.
[Switching to LWP 2376]
ProfileEnter (clientData=<error reading variable: Cannot access memory at address 0xa747e35c>, 
    platformSpecificHandle=<error reading variable: Cannot access memory at address 0xa747e358>)
    at /usr/src/debug/coreclr-3.0.0/src/vm/proftoeeinterfaceimpl.cpp:10284
10284	/usr/src/debug/coreclr-3.0.0/src/vm/proftoeeinterfaceimpl.cpp: No such file or directory.
(gdb) bt
#0  ProfileEnter (clientData=<error reading variable: Cannot access memory at address 0xa747e35c>, 
    platformSpecificHandle=<error reading variable: Cannot access memory at address 0xa747e358>)
    at /usr/src/debug/coreclr-3.0.0/src/vm/proftoeeinterfaceimpl.cpp:10284
dotnet/coreclr#1  0xb3285f8c in GetThread () at /usr/src/debug/coreclr-3.0.0/src/vm/threads.inl:34
dotnet/coreclr#2  SetCallbackStateFlagsHolder::SetCallbackStateFlagsHolder (dwFlags=3, this=<optimized out>) at /usr/src/debug/coreclr-3.0.0/src/vm/profilinghelper.inl:23
dotnet/coreclr#3  ProfileEnter (
inline-frame.c:167: internal-error: void inline_frame_this_id(frame_info*, void**, frame_id*): Assertion `frame_id_p (*this_id)' failed.

it still SIGSEGV on pProfInterface access, but now in proftoeeinterfaceimpl.cpp:10284

(g_profControlBlock.pProfInterface->GetEnter3Hook()         == NULL) &&

@davmason
Copy link
Member

Thanks @viewizard, there is a race in my proposed fix and I will have to update it

@davmason
Copy link
Member

@viewizard can you rerun your test with the update fix I have? I ran all of our tests with ELT enabled and didn't see any crashes, so I believe the race conditions are solved.

@echesakov
Copy link
Contributor Author

@davmason I am about to start the final testing for Arm64 ELT hooks. If you want - I can pull down your changes and test them as well.

@davmason
Copy link
Member

@echesakovMSFT That would be great, thanks!

@echesakov
Copy link
Contributor Author

@davmason I have a question. What a profiler is expected to do during its Shutdown() call? I presume it should deallocate the memory/destroy all the objects that were created before? What if the ELT callbacks rely on the data?

@davmason
Copy link
Member

It’s up to the profiler when to release memory. If the runtime is shutting down, then the profiler will be unloaded too when the process dies. You could just let the OS reclaim the memory. If the profiler wanted to free things on shutdown you would have to set them to null and return early from the ELT stubs after they are null.

@echesakov
Copy link
Contributor Author

It’s up to the profiler when to release memory. If the runtime is shutting down, then the profiler will be unloaded too when the process dies. You could just let the OS reclaim the memory. If the profiler wanted to free things on shutdown you would have to set them to null and return early from the ELT stubs after they are null.

@davmason Basically, this means that every ELT hook should have some synchronization mechanism (ideally, lock-free) to be able to bypass the logic if the system is shutting down?

With you changes I don't see the issue with null pProfInterface but I see that ELT hooks are being called after the system was shut down (and SIGSEGV since the profiler data was disposed). If my profiler implementation accounted for this then I think I wouldn't see any issue at all.

@viewizard
Copy link
Member

@davmason I have tested CoreCLR with new patch on arm32, looks like this solve my issue with SIGSEGV.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants