Avoid deadlock between GC and thread inspection/suspension #12257

gacholio · 2021-03-22T14:15:10Z

When reacquiring VM access in the GC, do not block on thread inspection
or suspension - block only on an exclusive access attempt.

[ci skip]

Signed-off-by: Graham Chapman [email protected]

When reacquiring VM access in the GC, do not block on thread inspection or suspension - block only on an exclusive access attempt. [ci skip] Signed-off-by: Graham Chapman <[email protected]>

gacholio · 2021-03-22T14:19:08Z

Here's a timeline for what I believe is happening with the inspection/GC/OSR hang issue.

Numbers indicate threads. The indented values are the publicFlags of the given thread after each operation.

Initial state
   1 VM_ACCESS
   2 VM_ACCESS
   3 VM_ACCESS
1 attempt to halt 2
   2 VM_ACCESS | HALT_INSPECT
1 release VM access
   1 NONE
1 wait for 2 to have HALT_INSPECT and no VM_ACCESS         <- 1 BLOCKED
3 acquire safe point exclusive
3 release VM access
   3 NONE
3 halt threads
   1 HALTED_SAFE
   2 VM_ACCESS | HALT_INSPECT | REQUEST_SAFE
3 wait for response from 2 to have neither VM_ACCESS nor NOT_SAFE <- 3 BLOCKED
2 out of line allocate causing a GC which has to back off exclusive
2 release critical heap access
   2 NOT_SAFE | HALT_INSPECT | REQUEST_SAFE
2 reacquire critical heap access blocks on HALT_INSPECT       <- 2 BLOCKED

The problem occurs when the GC releases/reacquires VM access. Beause the HALT_INSPECT bit is set on the GCing thread, the VM access acquire blocks, resulting in deadlock.

gacholio · 2021-03-22T14:20:33Z

This change allows the GC to acquire VM access while the master thread is being inspected. This should not cause any issues as the GC will not modify the thread stack in any way until it acquires exclusive VM access, which would be blocked by the inspecting thread.

I have run full cross-platform testing on this change with no issues.

gacholio · 2021-03-22T14:20:58Z

FYI @tajila @IBMJimmyk

gacholio · 2021-03-22T15:39:00Z

jenkins test sanity,extended plinux jdk11

dmitripivkine

Looks good to me but in case I am missing something I added Aleks to review as well - just to double check

zl-wang · 2021-03-23T16:00:29Z

how likely is this PR making its way into 0.26? Enabling GLR by default seemed dependent on this one.
Not strictly dependent, but deadlock easier to happen with that.

gacholio · 2021-03-23T16:04:23Z

Given that we're just finding problems in GLR, it seems a bit late in the cycle to be enabling things by default. @pshipton

Changes to do with VM access/exclusive don't seem like the kind of thing to just slip in at the last moment.

amicic · 2021-03-23T16:16:06Z

Looks good.

I think that checking only HALT_EXCLUSIVE prior to reacquiring VM access is reasonable, even it's more intrusive change than just excluding HALT_INSPECT from the check. This is a GC specific path, where we released VM access due to exclusive VM request for GC by another thread, so we just need to restore the original state here - to reacquire VM access when that GC completed and released exclusive VM access and no new one was requested by a third thread (so we just care if HALT_EXCLUSIVE is removed).

This PR is a variant of eclipse-openj9#15201 with handling of saveObjects NULL case Primary issue: It was possible for a thread which is halted for inspection to resume execution after a GC, resulting in the stack walk crashing in the inspecting thread. This was introduced in eclipse-openj9#12257 . Secondary issue: If instrumentable object allocate is hooked, it was possible to HCR at JIT object allocation points. New restriction: The JVMTI extension event for instrumentable object allocate may now only be acquired at startup, not during late attach. As far as I know, there are no active users of this event. Note that NOT_AT_SAFE_POINT has two distinct if related meanings: - Normal exclusive VM access has priority over safe point Setting NOT_AT_SAFE_POINT around all GCing operations accomplishes this. - JIT optimization requires that there be no possibility of HCR at object allocation points Extending the range of NOT_AT_SAFE_POINT to cover the entire allocation path and disabling safe point if the instumentable object allocate event is hooked ensures this. This fix has several parts: - Pause after GC if halt has been requested After any possibly-GCing path, check to see if the thread should halt before resuming mutation. This fixes the reported problem of inspected threads continuing to run. This also requires that NOT_AT_SAFE_POINT be set across the entirety of the allocation path to prevent safe point from being acquired by the new halting code. - Fix object allocation event reporting If the instrumentable object allocation extension event was hooked, there was a timing hole where HCR could occur at an object allocation from the JIT, which is specificially what safe point HCR is meant to avoid. This is fixed by marking the thread NOT_AT_SAFE_POINT for the duration of the allocate. If the instrumentable object allocate event is hooked, disable safe point HCR as there is no way to safely report the event. - Mark all possibly GCing paths NOT_AT_SAFE_POINT This ensures that the GC has priority over safe point HCR. haltThreadForInspection also now marks the inspecting thread NOT_SAFE for the duration of the halt (see timing below). Signed-off-by: Aleksandar Micic <[email protected]>

Avoid deadlock between GC and thread inspection/suspension

c741161

When reacquiring VM access in the GC, do not block on thread inspection or suspension - block only on an exclusive access attempt. [ci skip] Signed-off-by: Graham Chapman <[email protected]>

gacholio added the comp:gc label Mar 22, 2021

gacholio requested a review from dmitripivkine March 22, 2021 14:15

dmitripivkine requested a review from amicic March 23, 2021 14:25

dmitripivkine approved these changes Mar 23, 2021

View reviewed changes

amicic approved these changes Mar 23, 2021

View reviewed changes

amicic merged commit e749042 into eclipse-openj9:master Mar 23, 2021

gacholio deleted the gcinspect branch March 23, 2021 16:18

This was referenced Mar 24, 2021

Enabled Global Lock Reservation by default on Power #12125

Merged

Enable Global Lock Reservation by Default #11424

Closed

amicic mentioned this pull request May 3, 2022

WIP: Handle ANY halt bits in reacquireAccess #14997

Closed

amicic mentioned this pull request Jun 2, 2022

Fix halted thread being allowed to continue execution #15201

Merged

amicic mentioned this pull request Jun 4, 2022

Fix halted thread being allowed to continue execution v2 #15227

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid deadlock between GC and thread inspection/suspension #12257

Avoid deadlock between GC and thread inspection/suspension #12257

gacholio commented Mar 22, 2021

gacholio commented Mar 22, 2021 •

edited

Loading

gacholio commented Mar 22, 2021

gacholio commented Mar 22, 2021

gacholio commented Mar 22, 2021

dmitripivkine left a comment

zl-wang commented Mar 23, 2021 •

edited

Loading

gacholio commented Mar 23, 2021

amicic commented Mar 23, 2021

Avoid deadlock between GC and thread inspection/suspension #12257

Avoid deadlock between GC and thread inspection/suspension #12257

Conversation

gacholio commented Mar 22, 2021

gacholio commented Mar 22, 2021 • edited Loading

gacholio commented Mar 22, 2021

gacholio commented Mar 22, 2021

gacholio commented Mar 22, 2021

dmitripivkine left a comment

Choose a reason for hiding this comment

zl-wang commented Mar 23, 2021 • edited Loading

gacholio commented Mar 23, 2021

amicic commented Mar 23, 2021

gacholio commented Mar 22, 2021 •

edited

Loading

zl-wang commented Mar 23, 2021 •

edited

Loading