Fix deadlock and potential assert in JitDump #12278

fjeremic · 2021-03-23T23:00:55Z

Release OMRVMThreadName lock in JitDump

Holding on to this lock will prevent the VM from being able to shut down
properly. This problem can easily be observed via:

java -Xdump:jit:events=vmstart -version

Suspend diagnostic thread after JitDump

If the user requested a JitDump via an event, the JVM continues
execution normally; however, the diagnostic thread remains active. This
is not desirable because the diagnostic thread keeps looping, waiting
for work, and can start picking up non-JitDump compilation requests,
which will cause an assert.

To prevent this and adhere to the contract outlined in processEntries
we suspend the diagnostic thread once the JitDump process is complete.

Fixes: #12336

Signed-off-by: Filip Jeremic [email protected]

fjeremic · 2021-03-23T23:01:16Z

Suggesting @dsouzai as the committer.

fjeremic · 2021-03-23T23:02:54Z

runtime/compiler/control/JitDump.cpp

+   recompilationThreadInfo->suspendCompilationThread();
+   while (recompilationThreadInfo->getCompilationThreadState() != COMPTHREAD_SUSPENDED)
+      {
+      //compInfo->getCompilationMonitor()->notifyAll();
+      //compInfo->waitOnCompMonitor(recompilationThreadInfo->getCompilationThread());
+      }
+
   compInfo->getPersistentInfo()->setDisableFurtherCompilation(false);


Converting to draft until this is sorted out. @dsouzai this is not really ideal. See commit message from 3a7af3184ffc2715409da693906ef5eb3e0a6ee0 as to why we want this. I would like to avoid spinning here wasting cycles waiting for the diagnostic thread to suspend. Is there some monitor or something I can wait on instead?
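For illustration, one generic alternative to spinning is to block on a condition variable until the state changes. This is a minimal sketch in standard C++, not OpenJ9 code: `ThreadStateCell`, `State`, and `suspendAndWait` are hypothetical names, and the real compilation monitor API may look quite different.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a compilation thread's state; not an OpenJ9 type.
enum class State { Active, SignalSuspend, Suspended };

class ThreadStateCell
   {
public:
   // Change the state under the lock and wake every waiter.
   void set(State s)
      {
      std::lock_guard<std::mutex> guard(_mutex);
      _state = s;
      _changed.notify_all();
      }

   // Sleep (no busy loop) until the state equals `wanted`.
   void waitFor(State wanted)
      {
      std::unique_lock<std::mutex> lock(_mutex);
      _changed.wait(lock, [&] { return _state == wanted; });
      }

   State get()
      {
      std::lock_guard<std::mutex> guard(_mutex);
      return _state;
      }

private:
   std::mutex _mutex;
   std::condition_variable _changed;
   State _state = State::Active;
   };

// Request suspension, then block until the worker acknowledges it.
State suspendAndWait(ThreadStateCell &cell)
   {
   std::thread worker([&cell]
      {
      cell.waitFor(State::SignalSuspend);   // worker sleeps until signalled
      cell.set(State::Suspended);           // acknowledge the request
      });
   cell.set(State::SignalSuspend);
   cell.waitFor(State::Suspended);          // blocks instead of spinning
   worker.join();
   return cell.get();
   }
```

The waiting thread consumes no cycles while blocked, which is the property the spin loop in the diff lacks.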

Why do we need to wait? What happens if we let the current thread go and the diagnostic thread suspends itself in the near future?

The reason is that 213d769 made it so that diagnostic threads are the only ones that can process JitDump compile requests. The diagnostic threads also cannot self-suspend; see this line in the same commit:

https://github.com/eclipse/openj9/blob/726f491e4c0644d49d224f560bb10095bf4aec05/runtime/compiler/control/CompilationThread.cpp#L5240-L5241

The reason we want to avoid self-suspend for diagnostic threads is that there is a timing hole between when the diagnostic thread is resumed in JitDump.cpp and when the actual JitDump compile request is added to the queue. I've encountered scenarios where, within this timing hole, the diagnostic thread self-suspends (prior to 213d769) and we end up with normal compilation threads picking up JitDump method compile requests, which is definitely something we don't want. See #11772 for details of how that occurred.

I suppose an alternative is to remove this assert:

https://github.com/eclipse/openj9/blob/726f491e4c0644d49d224f560bb10095bf4aec05/runtime/compiler/control/CompilationThread.cpp#L5248-L5249

And allow diagnostic threads to handle normal compiles. Then we could get away with not waiting in the code being reviewed above. However to me this is not an ideal solution as there are "special" things about the diagnostic thread.

Any thoughts on the above?

Looking at the code again, even with the while loop above we have a timing hole. The diagnostic thread returns GO_TO_SLEEP_EMPTY_QUEUE from getNextMethodToBeCompiled if there are no methods on the queue:

https://github.com/eclipse/openj9/blob/726f491e4c0644d49d224f560bb10095bf4aec05/runtime/compiler/control/CompilationThread.cpp#L5238-L5253

But in processEntries we change the state of the thread to COMPTHREAD_WAITING. This means that there is a race condition between the JitDump thread signaling the diagnostic thread to suspend and the diagnostic thread executing this code:

https://github.com/eclipse/openj9/blob/726f491e4c0644d49d224f560bb10095bf4aec05/runtime/compiler/control/CompilationThread.cpp#L3999-L4018

We can encounter this situation with two threads:

1. The crashed thread signals the diagnostic thread to suspend, as per the code being reviewed above.
2. Meanwhile, the diagnostic thread is in processEntries, just about to call getNextMethodToBeCompiled.
3. getNextMethodToBeCompiled returns GO_TO_SLEEP_EMPTY_QUEUE as the next action because the queue is empty.
4. We reach setCompilationThreadState(COMPTHREAD_WAITING); which overwrites the signal to suspend.
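The hypothesized interleaving above can be replayed deterministically in a small sketch. Everything here is illustrative: `CompThreadState` only borrows the state names from the discussion, and `goToSleepUnchecked`, `goToSleepChecked`, and `replay` are hypothetical helpers, not functions in CompilationThread.cpp.

```cpp
#include <cassert>

// Hypothetical state names modelled on those mentioned in the discussion.
enum class CompThreadState { Active, Waiting, SignalSuspend };

// Unconditional transition: clobbers whatever state was there before.
CompThreadState goToSleepUnchecked(CompThreadState /*current*/)
   {
   return CompThreadState::Waiting;
   }

// Guarded transition: only an Active thread may start waiting,
// so a pending suspend request survives.
CompThreadState goToSleepChecked(CompThreadState current)
   {
   return current == CompThreadState::Active ? CompThreadState::Waiting : current;
   }

// Replays the interleaving: suspend is signalled first, then the worker
// finds an empty queue and applies its go-to-sleep transition.
CompThreadState replay(CompThreadState (*goToSleep)(CompThreadState))
   {
   CompThreadState state = CompThreadState::Active;
   state = CompThreadState::SignalSuspend;   // crashed thread signals suspend
   state = goToSleep(state);                 // worker sees an empty queue
   return state;
   }
```

With the unchecked transition the suspend request is lost; with the guarded one it is preserved. Serializing both writers on one monitor is another way to close the same hole.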

This scenario would exist today even without diagnostic threads, right? For example, if we only have one active compile thread then it will reach this line:

https://github.com/eclipse/openj9/blob/726f491e4c0644d49d224f560bb10095bf4aec05/runtime/compiler/control/CompilationThread.cpp#L5407-L5420

which means that if, for example, the VM signaled us to terminate the thread by changing the state to COMPTHREAD_SIGNAL_TERMINATE, then this state would get overwritten in the same way as above.

@mpirvu thoughts on the above?

The sequence of actions I see is the following:

1. The crashed thread calls compileMethod() to queue a synchronous compilation for the diagnostic thread.
2. The diagnostic thread compiles the method and notifies waiting threads (i.e. the crashing thread). At this point the diagnostic thread is in processEntry() and holds the compilation queue monitor.
3. The diagnostic thread returns to processEntries() and, because the state is still ACTIVE, executes getNextMethodToBeCompiled(). Since there is nothing left in the queue, it indicates an action of GO_TO_SLEEP_EMPTY_QUEUE and its state changes to COMPTHREAD_WAITING.
4. The diagnostic thread performs a timed wait on the compilation queue monitor, which releases said monitor.
5. The crashed thread can finally change the state of the diagnostic thread (the state can only be changed with the compilation queue monitor in hand) to SIGNAL_SUSPEND.
6. When new entries are added to the queue, the diagnostic thread may be notified, but since its state is no longer COMPTHREAD_WAITING, it does nothing: it leaves processEntries() and in run() executes doSuspend(), which results in thread suspension.
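The sequence above can be modelled with a single mutex and condition variable standing in for the compilation queue monitor. This is a simplified sketch under stated assumptions: `Monitor`, `diagnosticThread`, and `runSequence` are hypothetical names, and the real processEntries()/run() logic is considerably more involved.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

enum class State { Active, Waiting, SignalSuspend, Suspended };

// Hypothetical stand-in for the compilation queue monitor and the state it guards.
struct Monitor
   {
   std::mutex mutex;
   std::condition_variable cond;
   State state = State::Active;
   bool compileDone = false;
   std::vector<State> history;   // every state change, in order

   // Caller must hold `mutex`.
   void setState(State s) { state = s; history.push_back(s); }
   };

void diagnosticThread(Monitor &m)
   {
   std::unique_lock<std::mutex> lock(m.mutex);
   m.compileDone = true;                  // compile finished, notify the requester
   m.cond.notify_all();
   m.setState(State::Waiting);            // empty queue; the monitor is still held
   while (m.state == State::Waiting)      // the timed wait releases the monitor
      m.cond.wait_for(lock, std::chrono::milliseconds(50));
   m.setState(State::Suspended);          // state is no longer Waiting, so suspend
   m.cond.notify_all();
   }

std::vector<State> runSequence()
   {
   Monitor m;
   std::unique_lock<std::mutex> lock(m.mutex);
   std::thread diag([&m] { diagnosticThread(m); });
   // Synchronous request: returns only once the monitor can be re-acquired,
   // which is after the diagnostic thread has started its timed wait.
   m.cond.wait(lock, [&m] { return m.compileDone; });
   m.setState(State::SignalSuspend);      // state changed with the monitor in hand
   m.cond.notify_all();
   m.cond.wait(lock, [&m] { return m.state == State::Suspended; });
   lock.unlock();
   diag.join();
   return m.history;
   }
```

Because every state change happens with the monitor held, the suspend request can only land after the transition to the waiting state, so it cannot be overwritten.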

Another idea: since normal compilation threads cannot execute jitdump compilation requests and diagnostic threads cannot execute normal compilation requests, why not use a separate queue for jitdump compilation requests?
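A rough sketch of what the separate-queue idea could look like, purely as an illustration. `CompilationQueues` and `ThreadKind` are hypothetical and do not correspond to existing OpenJ9 classes; synchronization is omitted for brevity.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical thread kinds; not OpenJ9 types.
enum class ThreadKind { Normal, Diagnostic };

class CompilationQueues
   {
public:
   void addNormal(const std::string &method) { _normal.push_back(method); }
   void addJitDump(const std::string &method) { _jitDump.push_back(method); }

   // Each kind of thread only ever looks at its own queue.
   // Returns false when that queue is empty.
   bool next(ThreadKind kind, std::string &method)
      {
      std::deque<std::string> &queue =
         (kind == ThreadKind::Diagnostic) ? _jitDump : _normal;
      if (queue.empty())
         return false;
      method = queue.front();
      queue.pop_front();
      return true;
      }

private:
   std::deque<std::string> _normal;
   std::deque<std::string> _jitDump;
   };
```

With this split, a diagnostic thread that stays active can never pick up a normal request, and a normal thread can never pick up a JitDump request, regardless of timing.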

I updated my branch to the latest state with the latest fixes. I can no longer reproduce the assert after calling suspendCompilationThread in JitDump.cpp without the while loop. I've also verified that in TR::CompilationInfoPerThread::processEntries() the compilation monitor is owned by the current thread, meaning that the scenario I described previously is not possible and the scenario outlined by @mpirvu in the second-last comment is correct.

I've updated the PR to simply call suspendCompilationThread.

dsouzai · 2021-03-24T14:58:02Z

Maybe @mpirvu might have a different opinion, but I don't see why we should support the jitdump for events like events=vmstart or even -Xdump:jit:events=user. Fundamentally, the jitdump is a mechanism to re-trace compilations that may have contributed to an error condition; it doesn't really make sense to want a jitdump at well defined points like vmstart or user.

fjeremic · 2021-03-24T18:02:56Z

> Maybe @mpirvu might have a different opinion, but I don't see why we should support the jitdump for events like events=vmstart or even -Xdump:jit:events=user. Fundamentally, the jitdump is a mechanism to re-trace compilations that may have contributed to an error condition; it doesn't really make sense to want a jitdump at well defined points like vmstart or user.

We'll go over this in tomorrow's Vitality Talk presentation. One example of when you may want to do this is for exceptions. For example failures in lambda methods don't have a defined name that you can trace, and oftentimes tracing such methods via -Xjit results in Heisenbugs, so we need to be reactive. You can use JitDumps to generate trace files after the failure has occurred, so for example:

-Xdump:jit:
    events=throw,
    filter=*AssertionError#some/test/Bucket.testBla()V#2,
    range=1..1

This would cause a JitDump when an AssertionError is thrown 2 stack frames below testBla. In this scenario you don't need to know the name of the method that generated the AssertionError. The JitDump mechanism will walk the stack and trace all the methods on the backtrace.

Other common cases are NPEs, AIOOBs, etc.

fjeremic · 2021-04-05T20:07:10Z

Jenkins test sanity all jdk11

mpirvu · 2021-04-05T20:18:49Z

runtime/compiler/control/JitDump.cpp

+                        bodyInfo->getStartPCAfterPreviousCompile(),
+                        jitdumpFile
+                     );
+                     }
                  }
               }
            }


What is the purpose of the code that follows below? We end up calling crashedThread->javaVM->walkStackFrames(crashedThread, &walkState); but to what end?

It will walk the stack, calling the callback function we registered (jitDumpStackFrameIterator). This function will then recompile all JIT methods encountered on the stack. This can be useful if the problem originates a few frames above the crashed method and we are able to walk the entire stack.

A typical example is when we call a VM helper from inside JITted code and crash somewhere in the VM (e.g. a bad argument was passed). We should then be able to walk the entire stack and recompile all JIT methods.
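The callback-driven walk described above can be sketched as follows. This is only an illustration of the pattern: `Frame`, `walkFrames`, and `collectJittedMethod` are hypothetical, whereas the real code goes through walkStackFrames with a J9 walk state and the jitDumpStackFrameIterator callback.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical frame description; not the J9 stack walk state.
struct Frame
   {
   std::string method;
   bool isJitted;
   };

// Callback invoked once per frame.
typedef void (*FrameCallback)(const Frame &frame, void *userData);

// Walk from the crashed (top) frame downwards, invoking the callback on each frame.
void walkFrames(const std::vector<Frame> &stack, FrameCallback callback, void *userData)
   {
   for (const Frame &frame : stack)
      callback(frame, userData);
   }

// Record JIT-compiled methods as candidates for recompilation with tracing;
// frames that are not JIT-compiled are skipped.
void collectJittedMethod(const Frame &frame, void *userData)
   {
   if (frame.isJitted)
      static_cast<std::vector<std::string> *>(userData)->push_back(frame.method);
   }
```

In this model a crash inside a VM helper still yields every JIT-compiled caller further down the stack, which is the case the comment describes.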

mpirvu

LGTM

mpirvu · 2021-04-06T00:29:33Z

Merging, since all tests passed

fjeremic added the comp:jit label Mar 23, 2021

fjeremic changed the title ~~Fix deadlock and potential assert in JitDump~~ WIP: Fix deadlock and potential assert in JitDump Mar 23, 2021

fjeremic marked this pull request as draft March 23, 2021 23:01

fjeremic mentioned this pull request Mar 23, 2021

Improve jitdump functionality #9120

Closed

23 tasks

pshipton mentioned this pull request Mar 31, 2021

Hang creating jitdump in cmdLineTest_gpTest Testing: abort #12336

Closed

fjeremic added 4 commits April 5, 2021 16:01

Release OMRVMThreadName lock in JitDump

ab2e34f

Holding on to this lock will prevent the VM from being able to shutdown properly. This problem can easily be observed via:

```
java -Xdump:jit:events=vmstart -version
```

Signed-off-by: Filip Jeremic <[email protected]>

Rename getCompilationInfoForDumpThread to getCompilationInfoForDiagno…

3281df2

…sticThread Signed-off-by: Filip Jeremic <[email protected]>

Null check gpInfo before calling j9sig_info

40ae89f

If we don't do this we may encounter a crash within OMR which will result in us aborting the JVM and not completing the JitDump process.

Signed-off-by: Filip Jeremic <[email protected]>

fjeremic force-pushed the fix-jitdump-lock branch from f1d26fc to 40ae89f Compare April 5, 2021 20:03

fjeremic changed the title ~~WIP: Fix deadlock and potential assert in JitDump~~ Fix deadlock and potential assert in JitDump Apr 5, 2021

fjeremic marked this pull request as ready for review April 5, 2021 20:06

fjeremic requested a review from mpirvu April 5, 2021 20:07

mpirvu reviewed Apr 5, 2021

mpirvu approved these changes Apr 5, 2021

mpirvu self-assigned this Apr 6, 2021

mpirvu merged commit 0d86a4f into eclipse-openj9:master Apr 6, 2021

fjeremic deleted the fix-jitdump-lock branch April 8, 2021 18:52
