Stack overflow during signal handling in AArch64 #4425
I reproduced this issue and I am looking into it. After a69bac3
It appears that 32K is indeed not enough to handle this particular two-level signal frame nest. It is surprising that this overflow has not occurred in more cases. As a fix, it may be worth permanently allocating 64K for the signal stack. I added a few prints to stderr in core/unix/signal.c for a clearer view of the problem; the final segfault now occurs in a slightly different place, but the story is unchanged. As you can see from the figure, the allocated signal stack is smaller than the stack space actually used. This failure does not occur in release mode, either because the release build uses slightly less stack space and just fits within 32K, or because much of the stack space for the first signal is already freed before the second one is caught (i.e., different timing).
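For readers unfamiliar with how a separate signal stack is set up, here is a minimal sketch using the POSIX sigaltstack and SA_ONSTACK interfaces. It is not DynamoRIO's code: the names ALT_STACK_SIZE, handler, and run_altstack_demo are made up for illustration, and 64K is simply the size suggested above.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size: the 64K suggested in the comment above. */
#define ALT_STACK_SIZE (64 * 1024)

static char *alt_base;                        /* start of the alternate stack */
static volatile sig_atomic_t ran_on_altstack; /* set by the handler */

static void handler(int sig)
{
    char marker; /* a local, so its address lies in this handler's frame */
    uintptr_t p = (uintptr_t)&marker;
    (void)sig;
    ran_on_altstack =
        p >= (uintptr_t)alt_base && p < (uintptr_t)alt_base + ALT_STACK_SIZE;
}

/* Returns 1 if the handler ran on the alternate stack, 0 if it did not. */
int run_altstack_demo(void)
{
    stack_t ss;
    struct sigaction sa;
    int rc, result;

    assert((long)ALT_STACK_SIZE >= (long)MINSIGSTKSZ);
    alt_base = malloc(ALT_STACK_SIZE);
    assert(alt_base != NULL);

    /* Register the buffer as this thread's alternate signal stack. */
    ss.ss_sp = alt_base;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    rc = sigaltstack(&ss, NULL);
    assert(rc == 0);

    /* SA_ONSTACK asks the kernel to deliver SIGUSR1 on that stack. */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sa.sa_flags = SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void)rc;

    ran_on_altstack = 0;
    raise(SIGUSR1); /* delivered to this thread before raise() returns */
    result = ran_on_altstack;

    /* Disable the alternate stack before releasing its memory. */
    ss.ss_flags = SS_DISABLE;
    sigaltstack(&ss, NULL);
    free(alt_base);
    return result;
}
```

Calling run_altstack_demo() should return 1 on Linux. Every frame pushed while handling the signal, including any nested signal frame, has to fit in that one buffer, which is why its size matters here.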
For completeness here is some info from the gdb run from which the above signal stack was extracted:
Summarizing offline discussion with some action items:
Prevents stack overflow in signal processing for the burst_aarch64_sys test without increasing the size of signal stack. The implementation of signal() varies across different systems and involves unspecified behavior for various scenarios. Instead, it is universally recommended to use sigaction. This might not have been a problem for other applications because most of them must be using sigaction. Fixes: #4425
The problem was the use of signal(). With sigaction, the seg fault no longer occurs and there is no need to increase the size of the signal stack.
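As a rough illustration of the signal() to sigaction switch described here, the sketch below installs a handler with sigaction and fully specified fields. It is not the test's actual code; on_signal, deliveries, and run_sigaction_demo are names invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t deliveries; /* number of times the handler ran */

static void on_signal(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)info;
    (void)ucontext;
    deliveries++;
}

/* Installs the handler with sigaction and raises SIGUSR1 twice.
 * Returns how many times the handler ran. */
int run_sigaction_demo(void)
{
    struct sigaction sa;
    int rc;

    /* Zero the whole struct so that no field is left unspecified. */
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO; /* no SA_RESETHAND: the handler stays installed */
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void)rc;

    deliveries = 0;
    raise(SIGUSR1);
    raise(SIGUSR1);
    return deliveries;
}
```

Here run_sigaction_demo() returns 2, since the disposition is not reset after the first delivery. With signal(), whether the handler stays installed depends on the platform, which is one of the unspecified behaviors mentioned above.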
Prevents a seg fault in the burst_aarch64_sys test that was caused by reading an unspecified sigaction restorer in sig_has_restorer() in unix/signal.c. Does so by returning false early in sig_has_restorer() for AArch64 when the SA_RESTORER flag is not set. By preventing the seg fault, it also prevents the nested signal handling and consequently the stack overflow in burst_aarch64_sys. Issue: #4425
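The early return described in this commit message can be pictured roughly as follows. This is only a sketch: sketch_sigaction_t, SKETCH_SA_RESTORER, sketch_has_restorer, and example_restorer are hypothetical stand-ins, not the definitions in unix/signal.c, and the flag value is assumed to match the one the Linux kernel uses for SA_RESTORER.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: the numeric value Linux uses for SA_RESTORER. */
#define SKETCH_SA_RESTORER 0x04000000UL

/* Hypothetical stand-in for a kernel-level sigaction record. */
typedef struct {
    void (*handler)(int);
    unsigned long flags;
    void (*restorer)(void); /* only meaningful when the flag is set */
} sketch_sigaction_t;

/* Placeholder restorer, used only by the usage example. */
void example_restorer(void)
{
}

/* Reports a restorer only when the flag says the field was filled in. */
bool sketch_has_restorer(const sketch_sigaction_t *act)
{
    if (act == NULL)
        return false;
    /* Flag clear: the restorer field is unspecified, so do not inspect it. */
    if ((act->flags & SKETCH_SA_RESTORER) == 0)
        return false;
    return act->restorer != NULL;
}
```

With flags set to 0, sketch_has_restorer() returns false no matter what the restorer field holds, so a stale or garbage pointer is never read through.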
PR #4840 prevents the first SIGSEGV in the test. However, before closing this issue we need to figure out why the frames for signal handling on AArch64 are so big.
There is a 10x increase in frame size, which still needs to be understood but at least explains why signal nesting never caused problems on x86.
AIUI AArch64 reserves space for the Scalable Vector Extension (SVE and SVE2).
Neither SIGSEGV nor the suspend signal is blocked in the handler, so we have to handle a nested signal. The x86 SIMD state is dynamically sized: it is surprising that the AArch64 state is not as well. I'm not sure your listing of function frame sizes is enough: we want to know the signal frame size as well, which includes all the expanded state ("xstate" on x86), and that is certainly bigger than 400-ish bytes on x86: just the 32 32-byte xmm registers is 1K, right? I'm guessing the x86 signal frames are between 1K and 2K? And the a64 ones are >4K, so it's probably more like a 2x-3x difference there, coming from the dynamic precise sizing for x86? For the function frame sizes: there's some kind of context structure on the stack in these frames, sigcontext_t. Is the size difference again from the x86 xstate being separated out, and if so, how does the code get away with that through its translations without a copy of the xstate? Or does it dynamically make a copy?
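To make the nesting concrete, here is a small standalone sketch (not DynamoRIO code; SIGUSR1 and SIGUSR2 stand in for the SIGSEGV and suspend signals, and all function names are invented). It records how deep the handlers nest, first with the second signal left unblocked as described above, then with it added to sa_mask for contrast.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t depth;     /* handlers currently on the stack */
static volatile sig_atomic_t max_depth; /* deepest nesting observed */

static void enter(void)
{
    depth++;
    if (depth > max_depth)
        max_depth = depth;
}

static void inner_handler(int sig)
{
    (void)sig;
    enter();
    depth--;
}

static void outer_handler(int sig)
{
    (void)sig;
    enter();
    raise(SIGUSR2); /* nests here unless SIGUSR2 is blocked by sa_mask */
    depth--;
}

static void install(int sig, void (*fn)(int), int block_usr2)
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    if (block_usr2)
        sigaddset(&sa.sa_mask, SIGUSR2); /* defer SIGUSR2 while fn runs */
    rc = sigaction(sig, &sa, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Returns the deepest handler nesting seen for one SIGUSR1 delivery. */
int max_nesting_depth(int block_usr2)
{
    install(SIGUSR1, outer_handler, block_usr2);
    install(SIGUSR2, inner_handler, 0);
    depth = 0;
    max_depth = 0;
    raise(SIGUSR1);
    return max_depth;
}
```

max_nesting_depth(0) returns 2 because the second signal frame is pushed on top of the first, so both must fit on the signal stack at once. max_nesting_depth(1) returns 1 because the second signal stays pending until the first handler has returned.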
Is AVX-512 enabled on the machine you measured on? Is the xmm execution in signal_arch_init() enough to ensure we get the right size for AVX-512 lazy saving (though I thought, other than the old x87, the lazy saving was only for avoiding stores and it still had the full size)? |
Right, AVX-512 is not supported on the machine I used, only AVX and AVX2. So, the x86 signal frames should be much larger on a machine with AVX-512.
Summarizing offline discussion:
…4840) Prevents a seg fault in the burst_aarch64_sys test that was caused by reading an unspecified sigaction restorer in sig_has_restorer() in unix/signal.c. Does so by returning false early in sig_has_restorer() for AArch64 when the SA_RESTORER flag is not set. By preventing the seg fault, it also prevents the nested signal handling and consequently the stack overflow in burst_aarch64_sys test when the -signal_stack_size is not specified. Issue: #4425
Avoids stack allocations during signal handling for conditionally used copies of sigcontext_t by hiding them within callsites. Issue: #4425
PR #4888 reduces the copies of sigcontext_t allocated and yields some decent savings (~9K savings).
This essentially addresses point 1 from the last comment. |
Avoids redundant stack allocations during signal handling for copies of sigcontext_t by hiding them within callsites. Issue: #4425
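One possible reading of "hiding them within callsites" is sketched below: the large copy is moved out of the common function and into a non-inlined helper, so its stack space is only used while the rarely taken path runs. This is an assumption about the general technique, not the actual change in the PR. big_context_t, slow_path, and handle are hypothetical names, and big_context_t is only a stand-in for a large struct such as sigcontext_t.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a large context struct. */
typedef struct {
    unsigned char bytes[4096];
} big_context_t;

#if defined(__GNUC__)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_NOINLINE
#endif

/* The 4K copy lives in this helper's frame, so it occupies stack space
 * only while the conditional path is actually executing. */
static SKETCH_NOINLINE int slow_path(const big_context_t *src)
{
    big_context_t copy;
    memcpy(&copy, src, sizeof(copy));
    copy.bytes[0] ^= 0xff; /* pretend to modify the private copy */
    return copy.bytes[0];
}

int handle(const big_context_t *ctx, int need_copy)
{
    /* No big_context_t local here: the common path keeps a small frame. */
    if (need_copy)
        return slow_path(ctx);
    return ctx->bytes[0];
}
```

Without the helper, a compiler is free to reserve space for the copy in the frame of handle() on every call, even when need_copy is 0. Keeping the copy in a non-inlined function is one way to avoid paying for it on the common path.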
For x86, the issue reported in #1615 seems to be the only problem. #4649 (and maybe some followups) should eventually address it. Dynamic allocation seems to be unnecessary for now: the stack allocations for copies of sigcontext_t are tightly scoped to their usage. After #4888, even a 24K signal stack seems to be enough for two nested signals regardless of the code paths used.
While working on PR #4397, I found that the burst_flush_aarch64 test crashes with a SIGSEGV when signal_stack_size = 32K (which is the value automatically set by DR after adjustment). The crash is due to a stack overflow and is limited to debug builds.
Note that burst_flush_aarch64 intentionally causes a SIGILL too, which is handled as expected by the test and doesn't cause any crash.
To Reproduce
-signal_stack_size 64K in clients/drcachesim/tests/burst_flush_aarch64.cpp.
clients/bin64/tool.drcacheoff.burst_flush_aarch64
Expected behavior
The SIGILL thrown by the test is expected and is handled too. But the crashing SIGSEGV is unexpected.
Screenshots or Pasted Text
Details in GDB:
The second SIGSEGV is caused by the unexpected stack overflow in d_r_notify while pushing registers onto the stack. This d_r_notify is invoked at dynamorio/core/unix/signal.c, Line 5126 in 70be2df.
Versions
What version of DynamoRIO are you using?
At commit 70be2df
System details
Additional context
Increasing signal_stack_size to 64K for the burst_flush_aarch64 test solved the issue. But it is unclear why the stack overflowed in the first place, as the stack doesn't seem to be too deep.
#4397 (comment)