
Relax condition to require no exclusive VM access when generating jitdump #9136

Closed
fjeremic opened this issue Apr 6, 2020 · 10 comments · Fixed by #10143
@fjeremic
Contributor

fjeremic commented Apr 6, 2020

Oftentimes when we crash, the jitdump contains the following:

<?xml version="1.0" standalone="no"?>
<jitDump>
Some thread is holding exclusive VM access. No log created.

This is generated due to the code found in [1]. That if statement bails out of generating the jitdump trace whenever some thread holds exclusive VM access. Oftentimes some other thread is holding exclusive VM access at the time of the crash, and thus we are unable to proceed and we don't generate a jitdump.

This poses two questions:

  1. Why does jitdump generation require exclusive VM access?
  2. Assuming jitdump needs exclusive VM access, what is the best way to acquire it?

[1] https://github.com/eclipse/openj9/blob/958d2f9b581efc1ddf774753520d709e64c7d6e7/runtime/compiler/control/HookedByTheJit.cpp#L2067-L2074

@fjeremic
Contributor Author

fjeremic commented Apr 6, 2020

@dsouzai @mpirvu for help answering 1. and @gacholio for help answering 2.

For 2. adding the following seems to be working well so far for me:

   if (J9_XACCESS_NONE != jitConfig->javaVM->exclusiveAccessState)
      {
      jitConfig->javaVM->internalVMFunctions->releaseExclusiveVMAccessFromExternalThread(jitConfig->javaVM);
      }

From reading the implementation of this API, it seems to clear the wait queue for exclusive access and then wait for the current thread which has VM access to release it? Is this correct? Would this be the best way to ensure jitdump gets a hold of exclusive VM access if someone else is holding onto it at the point when we wish to generate the jitdump?

@gacholio
Contributor

gacholio commented Apr 6, 2020

If jitdump is a RAS dump facility, then the RAS code should be in charge of exclusive. @keithc-ca

@dsouzai
Contributor

dsouzai commented Apr 6, 2020

I believe at least one reason we can't have some other thread holding exclusive VM access is because during the jit dump compilation we'll call acquireVMAccessIfNeeded to get information via the runtime; if some other thread holds exclusive VM access, the jit dump thread will hang indefinitely (especially if the thread holding exclusive VM access is the thread that's running dumpJITInfo).

If the thread that's running dumpJITInfo is the thread that's holding exclusive VM access, then it might be fine to release it and re-acquire it, but it depends on why exclusive VM access was taken to begin with.

@mpirvu
Contributor

mpirvu commented Apr 6, 2020

Many VM helpers called by the JIT require VM access, so the diagnostic compilation thread will attempt to acquire VM access and will be blocked if someone else is holding exclusive VM access.

@fjeremic
Contributor Author

Summarizing my findings here:


Goal

The goal of this issue is to alter/remove the following code in question so that we are always able to generate a jitdump, regardless of exclusive VM access state:
https://github.com/eclipse/openj9/blob/958d2f9b581efc1ddf774753520d709e64c7d6e7/runtime/compiler/control/HookedByTheJit.cpp#L2067-L2074

Summary of cases to handle

There are two cases in which we want to generate a jitdump at the moment:

  1. A crash in a JIT compiled method body
    • Executing Java application thread will hold shared VM access, meaning no other thread will be holding exclusive access, thus we should always be able to generate a jitdump for methods on the thread stack
  2. A crash in the JIT module (libj9jit29.so)
    a. The compilation thread is holding shared VM access
    - In this case no other thread can be holding exclusive access, so we should be able to generate a jitdump
    b. The compilation thread is not holding shared VM access (most common case)
    - Some other thread may be holding exclusive VM access (GC for example)
    - Currently we will fail to generate a jitdump because of the above code

Next steps

According to a recent Slack thread discussion [1], and by examining the JIT codebase, the only place where we request exclusive VM access in the JIT is when we have to allocate a new code cache. Thus the only case we really have to handle from above is 2. b.

To me this implies we should be able to just remove that piece of code above and everything should work, since if someone holds exclusive access, the jitdump compilation thread will just block when attempting to acquire shared access. In theory it should work. Giving this a shot next.

[1] https://openj9.slack.com/archives/C8312LCV9/p1592332963289100

@fjeremic
Contributor Author

Just a quick update. I'm debugging two deadlocks that result from removing the above code. They both happen in stress modes where I've hacked the JIT to randomly crash during compilation. I'm working with others to try and understand how we end up in these deadlock situations and see what can be done.

@fjeremic
Contributor Author

fjeremic commented Jun 26, 2020

I've been debugging the deadlocks for the last few days and with the help of @gacholio and @dsouzai we have figured out the issue. I introduced a random crash in doInstructionSelection as follows:

   static int32_t crashThreshold = feGetEnv("TR_CrashThreshold") == NULL ? 10 : atoi(feGetEnv("TR_CrashThreshold"));

   if (rand() % 1000 < crashThreshold || self()->comp()->ilGenRequest().details().isJitDumpMethod())
      {
      *((int32_t*)0) = 5;
      }

Then I ran java -version with varying values for TR_CrashThreshold which ensures we crash almost always. I ran the JVM as follows:

export TR_CrashThreshold=100;
for i in {1..50}; do rm -f log* core* Snap* jitdu* javaco*;./bin/java -Xdump:none -Xdump:jit+java -Xjit:count=1,ignoreAssert -Xmx1M -Xms1M -XX:+DebugVMAccess -version; done

This resulted in two types of deadlocks:

  1. A deadlock right before VM is about to terminate

    • The jitdump file contains === EXCEPTION THROWN (JIT Shutdown) ===
    • The crashed thread is generating the jitdump and an interruption exception is thrown forcing the thread to shut down
    • A number of other compilation threads have crashed and are waiting to acquire the dump lock so they can execute their dump agents
    • The deadlock happens because the thread generating the jitdump has somehow not unwound the stack to release the dump lock and so the other crashed threads can never proceed
  2. A deadlock where a thread is trying to acquire exclusive VM access but is unable to

    • GC thread attempting to acquire exclusive, but it is waiting for vm->exclusiveAccessResponseCount to be zero [1]
    • All compilation threads except one are stuck waiting on the dump monitor to trigger a dump agent [2]
    • One thread is executing the jitdump and it has just released VM access and is waiting on the diagnostic thread to finish compiling the queued JIT method on the compilation queue [3]
    • The diagnostic thread is attempting to process the compilation queue but it attempts to acquire VM access, so it blocks because GC wants exclusive [4]

I've only debugged and figured out the problem for case 2 because that is the much more common case. The bullet points above result in a deadlock as the value of vm->exclusiveAccessResponseCount ends up being -1 and we are never able to acquire VM access.

The reason this happens is because the jitdump thread requests the compilation to be performed synchronously [5], regardless of the fact that we are a crashed compilation thread! What ends up happening is that because the compilation is synchronous, the code assumes we are an application thread, and it ends up attempting to release VM access [6] when it actually doesn't have it! This results in decrementing vm->exclusiveAccessResponseCount which causes problems later.

I don't quite follow why we did not assert that we had VM access when we attempt to release it, as I think we should. I'm going to debug this next, however I placed a quick assert right before the call to releaseVMAccess in [6] and I hit the assert instantly:

      if ((vmThread->publicFlags & J9_PUBLIC_FLAGS_VM_ACCESS) == 0)
         {
         TR_ASSERT_FATAL(false, "Filip: At CompilationThread.cpp:6229 attempting to release VM access on thread [%p] while not holding it", vmThread);
         }
      releaseVMAccess(vmThread);

The solution is simple for case 2. Acquire VM access if we are a crashed compilation thread before requesting the synchronous compilation and release it after we have returned.

[1] https://github.com/fjeremic/openj9/blob/2eb56b8edb6c15c8ef2929aa1710cc1aa008e1f8/runtime/vm/VMAccess.cpp#L302-L304
[2] https://github.com/fjeremic/openj9/blob/2eb56b8edb6c15c8ef2929aa1710cc1aa008e1f8/runtime/rasdump/trigger.c#L590-L598
[3] https://github.com/fjeremic/openj9/blob/2eb56b8edb6c15c8ef2929aa1710cc1aa008e1f8/runtime/compiler/control/CompilationThread.cpp#L6229
[4] https://github.com/fjeremic/openj9/blob/2eb56b8edb6c15c8ef2929aa1710cc1aa008e1f8/runtime/compiler/control/CompilationThread.cpp#L4166
[5] https://github.com/fjeremic/openj9/blob/2eb56b8edb6c15c8ef2929aa1710cc1aa008e1f8/runtime/compiler/control/JitDump.cpp#L327-L330
[6] https://github.com/fjeremic/openj9/blob/2eb56b8edb6c15c8ef2929aa1710cc1aa008e1f8/runtime/compiler/control/CompilationThread.cpp#L6222

@fjeremic fjeremic changed the title Acquire exclusive VM access when generating jitdump Relax condition to require exclusive VM access when generating jitdump Jun 29, 2020
@fjeremic
Contributor Author

fjeremic commented Jul 8, 2020

I've found out the reason for deadlock 1. as described above. The backtrace looks as follows:

(gdb) bt
#0  0x000003ff8863182e in acquireVMaccessIfNeeded (vmThread=0x6fc00, isCompThread=<optimized out>) at /openjdk/openj9/runtime/compiler/env/VMJ9.cpp:387
#1  0x000003ff8861927a in TR_J9VMBase::acquireVMAccessIfNeeded (this=0x3ff64006f90) at /openjdk/openj9/runtime/compiler/env/VMJ9.h:884
#2  0x000003ff88614566 in J9::VMEnv::acquireVMAccessIfNeeded (this=<optimized out>, fej9=fej9@entry=0x3ff64006f90) at /openjdk/openj9/runtime/compiler/env/J9VMEnv.cpp:255
#3  0x000003ff886300ba in J9::VMAccessCriticalSection::VMAccessCriticalSection (comp=0x0, protocol=OMR::VMAccessCriticalSection::acquireVMAccessIfNeeded, fej9=0x3ff64006f90,
    this=0x3ff72bee708) at /openjdk/openj9/runtime/compiler/env/J9VMAccessCriticalSection.hpp:62
#4  TR::VMAccessCriticalSection::VMAccessCriticalSection (comp=0x0, protocol=OMR::VMAccessCriticalSection::acquireVMAccessIfNeeded, fej9=0x3ff64006f90, this=0x3ff72bee708)
    at /openjdk/openj9/runtime/compiler/env/VMAccessCriticalSection.hpp:42
#5  TR_J9VM::getClassFromSignature (this=this@entry=0x3ff64006f90, sig=sig@entry=0x3ff842b5a30 "java/lang/Integer", sigLength=<optimized out>, constantPool=0x836a0)
    at /openjdk/openj9/runtime/compiler/env/VMJ9.cpp:7013
#6  0x000003ff88630454 in TR_J9VM::getClassFromSignature (this=this@entry=0x3ff64006f90, sig=sig@entry=0x3ff842b5a30 "java/lang/Integer", sigLength=sigLength@entry=17,
    method=<optimized out>, isVettedForAOT=false) at /openjdk/openj9/runtime/compiler/env/VMJ9.cpp:7007
#7  0x000003ff8861d6a0 in TR_J9VM::getClassFromSignature (this=this@entry=0x3ff64006f90, sig=sig@entry=0x3ff842b5a30 "java/lang/Integer", sigLength=sigLength@entry=17,
    method=<optimized out>, isVettedForAOT=isVettedForAOT@entry=false) at /openjdk/openj9/runtime/compiler/env/VMJ9.cpp:7000
#8  0x000003ff88d6bde8 in TR_Debug::containingClass (this=this@entry=0x3ff7126c4f0, symRef=symRef@entry=0x3ff711f05a0) at /openjdk/omr/compiler/ras/Debug.cpp:975
#9  0x000003ff88d71938 in TR_Debug::print (this=this@entry=0x3ff7126c4f0, symRef=0x3ff711f05a0, output=..., hideHelperMethodInfo=hideHelperMethodInfo@entry=false,
    verbose=verbose@entry=false) at /openjdk/omr/compiler/ras/Debug.cpp:1247
#10 0x000003ff88d82602 in TR_Debug::printNodeInfo (this=0x3ff7126c4f0, node=0x3ff71184d90, output=..., prettyPrint=<optimized out>) at /openjdk/omr/compiler/ras/Tree.cpp:1605
#11 0x000003ff88d8034a in TR_Debug::printNodeInfo (this=0x3ff7126c4f0, pOutFile=0x3ff6401b620, node=0x3ff71184d90) at /openjdk/omr/compiler/ras/Tree.cpp:1563
#12 0x000003ff88d869ec in TR_Debug::print (this=0x3ff7126c4f0, pOutFile=0x3ff6401b620, node=0x3ff71184d90, indentation=<optimized out>, printChildren=<optimized out>)
    at /openjdk/omr/compiler/ras/Tree.cpp:1088
#13 0x000003ff88d8772a in TR_Debug::print (this=0x3ff7126c4f0, pOutFile=0x3ff6401b620, node=0x3ff71184de0, indentation=4, printChildren=<optimized out>)
    at /openjdk/omr/compiler/ras/Tree.cpp:1219
#14 0x000003ff88d7ff80 in TR_Debug::print (this=<optimized out>, pOutFile=<optimized out>, tt=<optimized out>) at /openjdk/omr/compiler/ras/Tree.cpp:1032
#15 0x000003ff88d7f760 in TR_Debug::printIRTrees (this=0x3ff7126c4f0, pOutFile=0x3ff6401b620, title=title@entry=0x3ff88f532dc "Trees", methodSymbol=methodSymbol@entry=0x3ff71184f00)
    at /openjdk/omr/compiler/ras/Tree.cpp:920
#16 0x000003ff88981c30 in OMR::Compilation::dumpMethodTrees (this=this@entry=0x3ff7117f000, title=title@entry=0x3ff88f532dc "Trees", methodSymbol=0x3ff71184f00, methodSymbol@entry=0x0)
    at /openjdk/omr/compiler/compile/OMRCompilation.cpp:1960
#17 0x000003ff885e762a in dumpCurrentILProtected (portLib=portLib@entry=0x3ff89e41380 <j9portLibrary>, opaqueParameters=opaqueParameters@entry=0x3ff72bf5ba0)
    at /openjdk/openj9/runtime/compiler/control/JitDump.cpp:126
#18 0x000003ff896b40e8 in omrsig_protect (portLibrary=0x3ff89e41380 <j9portLibrary>, fn=0x3ff885e7318 <dumpCurrentILProtected(J9PortLibrary*, void*)>, fn_arg=0x3ff72bf5ba0,
    handler=0x3ff885e6d88 <blankDumpCurrentILSignalHandler(J9PortLibrary*, U_32, void*, void*)>, handler_arg=0x6fc00, flags=249, result=0x3ff72bf5b98)
    at /openjdk/omr/port/unix/omrsignal.c:425
#19 0x000003ff885e7934 in dumpCurrentIL (comp=comp@entry=0x3ff7117f000, vmThread=vmThread@entry=0x6fc00, jitConfig=jitConfig@entry=0x3ff84132000, logFile=logFile@entry=0x3ff6401b620)
    at /openjdk/openj9/runtime/compiler/control/JitDump.cpp:189
#20 0x000003ff885e818e in runJitdump (label=<optimized out>, context=<optimized out>, agent=<optimized out>) at /openjdk/openj9/runtime/compiler/control/JitDump.cpp:649
#21 0x000003ff89489064 in doJitDump (agent=0x3ff8406a250, label=0x3ff72bf69a0 "/openjdk/jitdump.20200708.180227.12906.0008.dmp", context=0x3ff72bf6eb0)
    at /openjdk/openj9/runtime/rasdump/dmpagent.c:1063
#22 0x000003ff89484a36 in protectedDumpFunction (portLibrary=portLibrary@entry=0x3ff89e41380 <j9portLibrary>, userData=userData@entry=0x3ff72bf68a0)
    at /openjdk/openj9/runtime/rasdump/dmpagent.c:2904
#23 0x000003ff896b40e8 in omrsig_protect (portLibrary=0x3ff89e41380 <j9portLibrary>, fn=fn@entry=0x3ff89484a10 <protectedDumpFunction>, fn_arg=fn_arg@entry=0x3ff72bf68a0,
    handler=handler@entry=0x3ff89484a48 <signalHandler>, handler_arg=handler_arg@entry=0x0, flags=505, result=0x3ff72bf6898) at /openjdk/omr/port/unix/omrsignal.c:425
#24 0x000003ff8948825e in runDumpFunction (agent=agent@entry=0x3ff8406a250, label=label@entry=0x3ff72bf69a0 "/openjdk/jitdump.20200708.180227.12906.0008.dmp",
    context=context@entry=0x3ff72bf6eb0) at /openjdk/openj9/runtime/rasdump/dmpagent.c:2878

This backtrace shows that we try to acquire VM access while dumping the crashed thread's IL. In the acquireVMaccessIfNeeded API we have a check for whether the compilation should be interrupted [1]. If this code path is hit, then because there is no catch handler, the exception will flow up and get caught in the crashed compilation thread. This means that the dump locks acquired in [2] will not be released and the other threads will be stuck in a loop in [3].

The fix here is to introduce a catch handler for the entire jitdump process and properly handle any exception.

[1] https://github.com/eclipse/openj9/blob/566007db997abe3f47f25565506c0f5a0a2f1b00/runtime/compiler/env/VMJ9.cpp#L379-L388
[2] https://github.com/eclipse/openj9/blob/566007db997abe3f47f25565506c0f5a0a2f1b00/runtime/rasdump/trigger.c#L758-L762
[3] https://github.com/eclipse/openj9/blob/566007db997abe3f47f25565506c0f5a0a2f1b00/runtime/rasdump/trigger.c#L590-L598

@fjeremic
Contributor Author

fjeremic commented Jul 9, 2020

I fixed the two jitdump related deadlocks encountered before. In my stress test now I'm encountering a new deadlock which I believe can happen without the jitdump scenario. This is reproducible by the way and verifiable in gdb. Assume that some compilation thread crashed and is waiting for the diagnostic thread to finish the compile.

            Application Thread            Diagnostic Thread            Compilation Thread
-----------------------------------------------------------------------------------------------------
                                         postCompilationTasks
                                         <acquires comp monitor>

        Count of method M goes to 0
        Attempt to queue async compile
        compileOnSeparateThread
        acquireCompMonitor <blocked>

                                         freeKnownObjectTable

                                                                        performReserveCodeCachePhase
                                                                        ...
                                                                        reserveCodeCache
                                                                        ...
                                                                        acquireExclusiveVMAccess <block>

                                         acquireVMAccessNoSuspend

All three threads are executing concurrently and the sequence of events can be read from top to bottom. In this scenario we have a deadlock because the application thread wants to queue an asynchronous compile, but it can't because it wants to acquire the comp monitor which the diagnostic thread owns. The diagnostic thread acquired the comp monitor in postCompilationTasks and is trying to free the known object table, for which it requires VM access. It tries to get it but it can't because some other compilation thread has requested exclusive. The compilation thread which requested exclusive is trying to reserve a code cache but it can't get exclusive because the application thread has VM access since it is still executing.

IMO one can swap the diagnostic thread with some other compilation thread and I would imagine the same deadlock could happen without a crashed thread. It almost seems as if we need to release the comp monitor before we attempt to acquire VM access in acquireVMaccessIfNeeded. In fact, if you look at [1] it seems that this API assumes no monitor is held by the current thread.

If I uncomment that piece of code and make the assert fatal, I do end up hitting it.

I just did a quick test fix:

      _compInfo.releaseCompMonitor(vmThread);
      if (_compiler->getKnownObjectTable())
         _compiler->freeKnownObjectTable();
      _compInfo.acquireCompMonitor(vmThread);

I cannot reproduce any deadlocks in my unit test anymore. 0 / 1000 failure rate. This looks to be the last deadlock to fix.

[1] https://github.com/eclipse/openj9/blob/5daf4597ca4162b26105c4061c1fd26404a3e84c/runtime/compiler/env/VMJ9.cpp#L335-L339

@gacholio
Contributor

gacholio commented Jul 9, 2020

This sounds like the correct approach to me. VM access/exclusive operate very much like a monitor, so they should be ordered accordingly.

@fjeremic fjeremic changed the title Relax condition to require exclusive VM access when generating jitdump Relax condition to require no exclusive VM access when generating jitdump Jul 10, 2020