Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock and potential assert in JitDump #12278

Merged
merged 4 commits into from
Apr 6, 2021

Conversation

fjeremic
Copy link
Contributor

@fjeremic fjeremic commented Mar 23, 2021

Release OMRVMThreadName lock in JitDump

Holding on to this lock will prevent the VM from being able to shutdown
properly. This problem can easily be observed via:

java -Xdump:jit:events=vmstart -version

Suspend diagnostic thread after JitDump

If the user requested a JitDump via an event then the JVM continues
execution normally, however the diagnostic thread remains active. This
is not desirable because the diagnostic thread is looping waiting for
work and can start picking up non-JitDump compilation requests which
will cause an assert.

To prevent this and adhere to the contract outlined in processEntries
we suspend the diagnostic thread once the JitDump process is complete.

Fixes: #12336

Signed-off-by: Filip Jeremic [email protected]

@fjeremic
Copy link
Contributor Author

Suggesting @dsouzai as the committer.

@fjeremic fjeremic changed the title Fix deadlock and potential assert in JitDump WIP: Fix deadlock and potential assert in JitDump Mar 23, 2021
@fjeremic fjeremic marked this pull request as draft March 23, 2021 23:01
Comment on lines 519 to 556
recompilationThreadInfo->suspendCompilationThread();
while (recompilationThreadInfo->getCompilationThreadState() != COMPTHREAD_SUSPENDED)
{
//compInfo->getCompilationMonitor()->notifyAll();
//compInfo->waitOnCompMonitor(recompilationThreadInfo->getCompilationThread());
}

compInfo->getPersistentInfo()->setDisableFurtherCompilation(false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting to draft until this is sorted out. @dsouzai this is not really ideal. See commit message from 3a7af3184ffc2715409da693906ef5eb3e0a6ee0 as to why we want this. I would like to avoid spinning here wasting cycles waiting for the diagnostic thread to suspend. Is there some monitor or something I can wait on instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to wait? What happens if we let the current thread go and the diagnostic will thread will suspend itself in the near future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is because 213d769 made it so that a diagnostic threads should be the only ones who can process JitDump compile requests. The diagnostic threads also cannot self-suspend, i.e. this line in the same commit above:

https://github.com/eclipse/openj9/blob/726f491e4c0644d49d224f560bb10095bf4aec05/runtime/compiler/control/CompilationThread.cpp#L5240-L5241

The reason we want to avoid self-suspend for diagnostic threads is because there is a timing hole between when the diagnostic thread is resumed in JitDump.cpp and when the actual JitDump compile request is added to the queue. I've encountered scenarios where within this timing hole, the diagnostic thread self-suspends (prior to 213d769) and we end up with normal compilation threads picking up JitDump method compile requests, which is definitely something we don't want. See #11772 for details of how that occured.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose an alternative is to remove this assert:

https://github.com/eclipse/openj9/blob/726f491e4c0644d49d224f560bb10095bf4aec05/runtime/compiler/control/CompilationThread.cpp#L5248-L5249

And allow diagnostic threads to handle normal compiles. Then we could get away with not waiting in the code being reviewed above. However to me this is not an ideal solution as there are "special" things about the diagnostic thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on the above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code again, even with the while loop above we have a timing hole. The diagnostic thread returns GO_TO_SLEEP_EMPTY_QUEUE from getNextMethodToBeCompiled if there are no methods on the queue:

https://github.com/eclipse/openj9/blob/726f491e4c0644d49d224f560bb10095bf4aec05/runtime/compiler/control/CompilationThread.cpp#L5238-L5253

But in processEntries we change the state of the thread to COMPTHREAD_WAITING. This means that there is a race condition between the JitDump thread signaling the diagnostic thread to suspend and the diagnostic thread executing this code:

https://github.com/eclipse/openj9/blob/726f491e4c0644d49d224f560bb10095bf4aec05/runtime/compiler/control/CompilationThread.cpp#L3999-L4018

We can encounter this situation with two threads:

  1. Crashed thread signals diagnostic thread to suspend as per above code being reviewed
  2. Meanwhile diagnostic thread is in processEntries just about to call getNextMethodToBeCompiled
  3. getNextMethodToBeCompiled returns next action is GO_TO_SLEEP_EMPTY_QUEUE because queue is empty
  4. We reach setCompilationThreadState(COMPTHREAD_WAITING); which overwrites the signal to suspend

This scenario would exist today even without diagnostic threads right? For example if we only have one active compile thread then it will reach this line:

https://github.com/eclipse/openj9/blob/726f491e4c0644d49d224f560bb10095bf4aec05/runtime/compiler/control/CompilationThread.cpp#L5407-L5420

which means if for example the VM singaled us to terminate the thread by changing the state to COMPTHREAD_SIGNAL_TERMINATE then this state would get overwritten in the same way as above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpirvu thoughts on the above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sequence of actions I see is the following:

  1. crashed thread calls compileMethod() to queue a synchronous compilation for the diagnostic thread
  2. Diagnostic thread compiles the method and notifies waiting threads (i.e. the crashing thread). At this point the diagnostic thread is in processEntry() and holds the compilation queue monitor.
  3. Diagnostic thread returns to processEntries() and because the stats is still ACTIVE, executes getNextMethodToBeCompiled(). Since there is nothing left in the queue, it indicates an action of GO_TO_SLEEP_EMPTY_QUEYE and its state changes to COMPTHREAD_WAITING.
  4. Diagnostic thread performs a times wait on the compilation queue monitor which releases said monitor.
  5. Crashed thread can finally change the state of the diagnostic thread (state can only be changed with compilation queue monitor in hand) to SIGNAL_SUSPEND.
  6. When new entries are added to the queue the diagnostic thread may be notified, but since its state is no longer COMPTHREAD_WAITING, it does nothing, it leaves processEntries() and in run() it will execute doSuspend() which will result in thread suspension.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea: since normal compilation threads cannot executed jitdump compilation requests and diagnostic threads cannot execute normal compilation requests, why not use a separate queue for jitdump compilation requests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated my branch to the latest state with the latest fixes. I cannot any longer reproduce the assert after calling suspendCompilationThread in JitDump.cpp without the while loop. I've also verified that in TR::CompilationInfoPerThread::processEntries() the compilation monitor is owned by the current thread, meaning that the scenario I described previously is not possible and the scenario outlined by @mpirvu in the second-last comment is correct.

I've updated the PR to simply call suspendCompilationThread.

@fjeremic fjeremic mentioned this pull request Mar 23, 2021
23 tasks
@dsouzai
Copy link
Contributor

dsouzai commented Mar 24, 2021

Maybe @mpirvu might have a different opinion, but I don't see why we should support the jitdump for events like events=vmstart or even -Xdump:jit:events=user. Fundamentally, the jitdump is a mechanism to re-trace compilations that may have contributed to an error condition; it doesn't really make sense to want a jitdump at well defined points like vmstart or user.

@fjeremic
Copy link
Contributor Author

fjeremic commented Mar 24, 2021

Maybe @mpirvu might have a different opinion, but I don't see why we should support the jitdump for events like events=vmstart or even -Xdump:jit:events=user. Fundamentally, the jitdump is a mechanism to re-trace compilations that may have contributed to an error condition; it doesn't really make sense to want a jitdump at well defined points like vmstart or user.

We'll go over this in tomorrow's Vitality Talk presentation. One example of when you may want to do this is for exceptions. For example failures in lambda methods don't have a defined name that you can trace, and oftentimes tracing such methods via -Xjit results in Heisenbugs, so we need to be reactive. You can use JitDumps to generate trace files after the failure has occurred, so for example:

-Xdump:jit:
    events=throw,
    filter=*AssertionError#some/test/Bucket.testBla()V#2
    range=1..1

This would cause a JitDump when an AssertionError is thrown 2 stack frames below testBla. In this scenario you don't need to know the name of the method that generated the AssertionError. The JitDump mechanism will walk the stack and trace all the methods on the backtrace.

Other common cases are NPEs, AIOOBs, etc.

fjeremic added 4 commits April 5, 2021 16:01
Holding on to this lock will prevent the VM from being able to shutdown
properly. This problem can easily be observed via:

```
java -Xdump:jit:events=vmstart -version
```

Signed-off-by: Filip Jeremic <[email protected]>
If the user requested a JitDump via an event then the JVM continues
execution normally, however the diagnostic thread remains active. This
is not desirable because the diagnostic thread is looping waiting for
work and can start picking up non-JitDump compilation requests which
will cause an assert.

To prevent this and adhere to the contract outlined in `processEntries`
we suspend the diagnostic thread once the JitDump process is complete.

Signed-off-by: Filip Jeremic <[email protected]>
If we don't do this we may encounter a crash within OMR which will
result in us aborting the JVM and not completing the JitDump process.

Signed-off-by: Filip Jeremic <[email protected]>
@fjeremic fjeremic changed the title WIP: Fix deadlock and potential assert in JitDump Fix deadlock and potential assert in JitDump Apr 5, 2021
@fjeremic fjeremic marked this pull request as ready for review April 5, 2021 20:06
@fjeremic fjeremic requested a review from mpirvu April 5, 2021 20:07
@fjeremic
Copy link
Contributor Author

fjeremic commented Apr 5, 2021

Jenkins test sanity all jdk11

bodyInfo->getStartPCAfterPreviousCompile(),
jitdumpFile
);
}
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the code that follows below? We end up calling crashedThread->javaVM->walkStackFrames(crashedThread, &walkState); but to what end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will walk the stack calling the callback function we registered (jitDumpStackFrameIterator). This function will then recompile all JIT methods encountered on the stack. This can be useful if the problem originates a few frames above the crashed method, and we are able to walk the entire stack frame.

Typical example of this is we call a VM helper from inside JITted code and we crash somewhere in the VM (ex. bad argument passed). We should then be able to walk the entire stack and recompile all JIT methods.

Copy link
Contributor

@mpirvu mpirvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mpirvu mpirvu self-assigned this Apr 6, 2021
@mpirvu
Copy link
Contributor

mpirvu commented Apr 6, 2021

Merging, since all tests passed

@mpirvu mpirvu merged commit 0d86a4f into eclipse-openj9:master Apr 6, 2021
@fjeremic fjeremic deleted the fix-jitdump-lock branch April 8, 2021 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hang creating jitdump in cmdLineTest_gpTest Testing: abort
3 participants