Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

i#5786: Add precise clean call mangling identification #5791

Merged
merged 18 commits into from
Jan 14, 2023

Conversation

derekbruening
Copy link
Contributor

@derekbruening derekbruening commented Dec 15, 2022

Adds new labels delimiting clean call sequences.
Converts into a translation record flag when storing translations.

Uses the new labels and flag to precisely identify clean call mangling,
replacing the previous scheme which incorrectly thought mangled tool
pc-relative was a clean call, resulting in incorrect translations and crashes.

To enable using labels after emitting, stops using instr_t.note to hold encoding
offsets for pc-releative operands. Adds a new field instr_t.offset which is used for
this purpose. This leaves note values in place across encodings, which is needed
for new clean call marking labels and also simplifies rseq handling code.

This is a compatibility break and as such we increase the version and
OLDEST_COMPATIBLE_VERSION to 990.

Since we only reserved 16 labels for DR, and we're already breaking compatilibity,
we go ahead and reduce DR_NOTE_FIRST_RESERVED to give DR more reserved labels.

Adds a test case to api.detach_state by adding a client (by converting it to use static DR) which inserts a pc-relative load. This reproduces the crash on detach, and is fixed with this fix. The added instrumentation caused periodic detach failures which were solved by setting the translation and adding a restore-state event: i#4232 covers trying to improve the situation.

Changing the test to use static DR resulted in some change in some Linux kernel lazy
AVX state restores on sigreturn causing the AVX values set on detach to be ignored.
We solve this by writing the real xstate_bv into the signal frame when setting the
xstate context.

Issue: #5786, #4232
Fixes #5786

Adds new labels delimiting clean call sequences.
Converts into a translation record flag when storing translations.

Uses the new labels and flag to precisely identify clean call
mangling, replacing the previous scheme which incorrectly thought
mangled tool pc-relative was a clean call, resulting in incorrect
translations and crashes.

Adds a test case to api.detach_state by adding a client (by converting
it to use static DR) which inserts a pc-relative load.  This
reproduces the crash on detach, and is fixed with this fix.
The added instrumentation caused periodic detach failures which were
solved by setting the translation and adding a restore-state event:
i#4232 covers trying to improve the situation.

Issue: #5786, #4232
Fixes #5786
@derekbruening
Copy link
Contributor Author

WIll add a ref to #4219 as well which covers xl8 problems within the clean call sequence, not solved here.

Stops using instr_t.note to hold encoding offsets for pc-releative
operands.  Adds a new field instr_t.offset which is used for this
purpose.  This leaves note values in place across encodings, which is
needed for new clean call marking labels and also simplifies rseq
handling code.

This is a compatibility break and we should bump the version.

This should be committed separately; adding to this PR to ensure it
all works together.
…SION to 990; Replace note instances in core/emit.c; Replace in ir_aarch64 and disable for non-DR_FAST_IR; Update dr_get_note docs
core/lib/globals_api.h Outdated Show resolved Hide resolved
@derekbruening
Copy link
Contributor Author

detach_state is failing on the GA CI machine with an interesting ymmh state not being preserved issue which does not reproduce locally. Going to go ahead and ask for an initial review while I try repro/debug on other platforms or tmate. Probably the instr_t.offset would be separated into its own PR: but first let's agree that using labels for translation is the way to go.

CMakeLists.txt Show resolved Hide resolved
core/arch/mangle_shared.c Outdated Show resolved Hide resolved
core/ir/instr.h Show resolved Hide resolved
core/heap.c Show resolved Hide resolved
core/lib/globals_api.h Outdated Show resolved Hide resolved
suite/tests/api/detach_state.c Show resolved Hide resolved
suite/tests/api/detach_state.c Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
@derekbruening
Copy link
Contributor Author

An update here: still trying to figure out the detach_state ymm high bit failure on GA CI. It does not reproduce in my work workstation; it does reproduce on an Ubuntu22 machine.

I did track down the key details: the issue is that xstate_bv is set to 0x203 on the suspend signals when DR is statically linked into api.detach_state. That means the AVX state (the high ymm bits) are ignored by sigreturn and end up 0. This has nothing to do with any other part of this PR: solely with DR being statically linked in.

It is weird: xstate_bv is 0x203 in the 1st SIGSEGV in the main thread for both dynamic and static DR; but for dynamic all subsequent signals have 0x207. The actual hardware is always 0x207 the entire time. So it seems like a lazy signal state feature in the kernel but only on some kernels and somehow static DR keeps it lazy: not sure how b/c dynamic doesn't really do anything different wrt AVX with the start/stop interface.

@derekbruening
Copy link
Contributor Author

Adding an AVX instr to the app main thread and sideline thread doesn't change
anything (these were already there in the sideline thread anyway to store the sentinels):

    __asm("vmovdqu %%ymm0, %%ymm0" : : : "ymm0");//NOCHECK

I still don't understand how it can be so locally lazy: but I solved by adding
what save_xmm() does, setting the signal's xstate_bv to the real value inside
mcontext_to_sigcontext_simd():

mcontext_to_sigcontext_simd: setting xstate_bv from 0x0000000000000203 to 0x0000000000000207

@derekbruening
Copy link
Contributor Author

x64 failure is api.thread_churn assert: have not seen that before. Does not repro locally. Likely a flake; re-running; will file after re-run.

@derekbruening
Copy link
Contributor Author

It's thread_churn again.
I found a machine where it reproduces -- the 2nd run does have 10 more unreachable heap blocks.
But, this test passed on the instr_t field and heap bucket size change: it is failing after the final changes none of which seem like they would affect heap usage. Adding a 3rd run shows zero growth so I do not think this is a real problem:

peak unreach heap: 248 A => 258 B => 258 C

@derekbruening
Copy link
Contributor Author

If I run -loglevel 1 to get more info it suddenly drops:

peak unreach heap: 248 A => 234 B => 234 C

Hack-printing the heap category breakdowns shows it getting smaller

Total max (not nec. all used simult.):  11018 KB
Total max (not nec. all used simult.):  10980 KB
Total max (not nec. all used simult.):  10980 KB
peak unreach heap: 248 A => 258 B => 258 C

api/docs/release.dox Outdated Show resolved Hide resolved
core/lib/globals_api.h Outdated Show resolved Hide resolved
core/unix/signal_linux_x86.c Show resolved Hide resolved
@derekbruening
Copy link
Contributor Author

I looked closely at the heap usage and it seems that one run happens to push into a new unit and with the double-sizing it ends up with 10 more blocks due to that one larger-capacity unit. Setting -initial_global_heap_unit_size 256K makes that go away: but then we hit this:

peak reach cache: 128 A => 144 B

This is getting rather frustrating, all the weird failures popping up.

@derekbruening
Copy link
Contributor Author

Searching backward on my local machine: it is the instr_t field and heap bucket commit which triggers the thread_churn failure; yet on GA CI that commit passed and it showed up later, which is unexplained.

@derekbruening
Copy link
Contributor Author

The cache jump from 128 to 144 happens on the 5 to 6 thread transition: still 144 Actually it's on 6 threads! So if I start with 6 it passes. Grrr.

@derekbruening
Copy link
Contributor Author

thread_churn passes on the one local machine where the first problem reproduced; but it still fails on GA CI.

Plus the Windows runs here seem to be hanging. This PR has been a nightmare -- and so far all the issues do not seem to be its fault but are other limitations or latent issues or inherent non-determinism that happen to be triggered by these changes.

@derekbruening
Copy link
Contributor Author

The Windows runs get cancelled after 6 hours (I think that's the GA time limit) -- yet the Run Suite step has no logs that can be viewed or downloaded so it is not at all clear what is happening.

@derekbruening
Copy link
Contributor Author

Looking at other PR's, their Windows runs complete as you'd expect after 15-20 mins (though they seem to have some new test failures that appeared a few days ago). Maybe something in this PR is causing hangs and some Github problem is preventing the logs from showing which tests are hanging.

@derekbruening derekbruening merged commit 8c0be09 into master Jan 14, 2023
@derekbruening derekbruening deleted the i5786-clean-call-xl8 branch January 14, 2023 22:07
dolanzhao pushed a commit that referenced this pull request Jan 30, 2023
Adds new labels delimiting clean call sequences.
Converts into a translation record flag when storing translations.

Uses the new labels and flag to precisely identify clean call
mangling, replacing the previous scheme which incorrectly thought
mangled tool pc-relative was a clean call, resulting in incorrect
translations and crashes.

Adds a test case to api.detach_state by adding a client (by converting
it to use static DR) which inserts a pc-relative load.  This
reproduces the crash on detach, and is fixed with this fix.
The added instrumentation caused periodic detach failures which were
solved by setting the translation and adding a restore-state event:
i#4232 covers trying to improve the situation.

Adds a new instr_t.offset field.
Stops using instr_t.note to hold encoding offsets for pc-releative
operands.  Adds a new field instr_t.offset which is used for this
purpose.  This leaves note values in place across encodings, which is
needed for new clean call marking labels and also simplifies rseq
handling code.

This instr_t field is a compatibility break and we bump the version and 
OLDEST_COMPATIBLE_VERSION here to 990.

Updates dr_get_note docs.

Augments logging of xl8 info with new flag info.

Reduces DR_NOTE_FIRST_RESERVED to give DR more reserved labels.
This is another compatibility break, while at it.

Fixes several issues hit in tests that happened to trigger on the
heap bucket size and other changes:
+ Fixes a rank order violation at loglevel 5: xref #1649
+ Writes real xstate_bv into signal frame when setting the xstate context to
   avoid lazy AVX restore problems.
+ Tweaks the thread_churn test to work around non-linearities.

Issue: #5786, #4232
Fixes #5786
abhinav92003 added a commit that referenced this pull request Apr 21, 2023
Updates the documentation for various encode APIs to correctly state that
the offset field of instr_t is used for instr_t target offset computation,
instead of the note field which was used prior to #5791.
abhinav92003 added a commit that referenced this pull request Apr 25, 2023
Updates the documentation for various encode APIs to correctly state that
the offset field of instr_t is used for instr_t target offset computation, instead
of the note field which was used prior to #5791.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CRASH: tool pc-rel instru mangled by DR can be mistaken as clean call instru and mis-translated
2 participants