-
Notifications
You must be signed in to change notification settings - Fork 736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Relax condition to require no exclusive VM access when generating jitdump #9136
Comments
@dsouzai @mpirvu for help answering 1. and @gacholio for help answering 2. For 2. adding the following seems to be working well so far for me:
From reading the implementation of this API it seems to clear the wait queue for exclusive access, and then it waits for current thread which has VM access to release it? Is this correct? Would this be the best way to ensure jitdump gets a hold of exclusive VM access if someone else is holding onto it at the point when we wish to generate the jitdump? |
If jitdump is a RAS dump facitility, then the RAS code should be in charge of exclusive. @keithc-ca |
I believe at least one reason we can't have some other thread holding exclusive vm access is because during the jit dump compilation we'll call If the thread that's running |
Many VM helpers called by the JIT require VM access, so the diagnostic compilation thread will attempt to acquire VM access and will be blocked is someone else is holding exclusive VM access |
Summarizing my findings here: GoalThe goal of this issue is to alter/remove the following code in question so that we are always able to generate a jitdump, irregardless of exclusive VM access state: Summary of cases to handleThere are two cases in which we want to generate a jitdump at the moment:
Next stepsAccording to a recent Slack thread discussion [1], and by examining the JIT codebase, the only place where we request exclusive VM access in the JIT is when we have to allocate a new code cache. Thus the only case we really have to handle from above is 2. b. To me this implies we should be able to just remove that piece of code above and everything should work, since if someone holds exclusive access, the jitdump compilation thread will just block when attempting to acquire shared access. In theory it should work. Giving this a shot next. [1] https://openj9.slack.com/archives/C8312LCV9/p1592332963289100 |
Just a quick update. I'm debugging two deadlocks that result from removing the above code. They both happen in stress modes where I've hacked the JIT to randomly crash during compilation. I'm working with others to try and understand how we end up in these deadlock situations and see what can be done. |
I've been debugging the deadlocks for the last few days and with the help of @gacholio and @dsouzai we have figured out the issue. I introduced a random crash in
Then I ran
This resulted in two types of deadlocks:
I've only debugged and figured out the problem for case 2 because that is the much more common case. The bullet points above result in a deadlock as the value of The reason this happens is because the jitdump thread requests the compilation to be performed synchronously [5], irregardless of the fact that we are a crashed compilation thread! What ends up happening is that because the compilation is synchronous the code assumes we are an application thread, and it ends up attempting to release VM access [6] when it actually doesn't have it! This results in decrementing I don't quite follow why we did not assert that we had VM access when we attempt to release it, as I think we should. I'm going to debug this next, however I placed a quick assert right before the call to
The solution is simple for case 2. Acquire VM access if we are a crashed compilation thread before requesting the synchronous compilation and release it after we have returned. [1] https://github.com/fjeremic/openj9/blob/2eb56b8edb6c15c8ef2929aa1710cc1aa008e1f8/runtime/vm/VMAccess.cpp#L302-L304 |
I've found out the reason for deadlock 1. as described above. The backtrace looks as follows:
This backtrace shows that we try to acquire VM access while dumping the crashed thread IL. In the The fix here is to introduce a catch handler for the entire jitdump process and properly handle any exception. [1] https://github.com/eclipse/openj9/blob/566007db997abe3f47f25565506c0f5a0a2f1b00/runtime/compiler/env/VMJ9.cpp#L379-L388 |
I fixed the two jitdump related deadlocks encountered before. In my stress test now I'm encountering a new deadlock which I believe can happen without the jitdump scenario. This is reproducible by the way and verifiable in gdb. Assume that some compilation thread crashed and is waiting for the diagnostic thread to finish the compile.
All three threads are executing concurrently and the sequence of events can be read from top to bottom. In this scenario we have a deadlock because the application thread wants to queue an asynchronous compile, but it can't because it wants to acquire the comp monitor which the diagnostic thread owns. The diagnostic thread acquired the comp monitor in IMO one can swap the diagnostic thread with some other compilation thread and I would imagine the same deadlock could happen without a crashed thread. It almost seems as if we need to release the comp monitor before we attempt to acquire VM access in If I uncomment that piece of code and turn the assert fatal I do end up hitting it. I just did a quick test fix:
I cannot reproduce any deadlocks in my unit test anymore. 0 / 1000 failure rate. This looks to be the last deadlock to fix. |
This sounds like the correct approach to me. VM access/exclusive operate very much like a monitor, so they should be ordered accordingly. |
Often times when we crash the jitdump contains the following:
This is generated due to the code found in [1]. That
if
statement seems to want exclusive access when generating the jitdump trace. Often times someone else is holding exclusive VM access and thus we are unable to proceed and we don't generate a jitdump.This poses two questions:
[1] https://github.com/eclipse/openj9/blob/958d2f9b581efc1ddf774753520d709e64c7d6e7/runtime/compiler/control/HookedByTheJit.cpp#L2067-L2074
The text was updated successfully, but these errors were encountered: