Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack overflow during signal handling in AArch64 #4425

Closed
abhinav92003 opened this issue Aug 27, 2020 · 13 comments
Closed

Stack overflow during signal handling in AArch64 #4425

abhinav92003 opened this issue Aug 27, 2020 · 13 comments

Comments

@abhinav92003
Copy link
Contributor

While working on PR #4397, I found that the burst_flush_aarch64 test crashes with a SIGSEGV when signal_stack_size = 32K (which is the value automatically set by DR after adjustment). The crash is due to a stack overflow and is limited to debug builds.

Note that burst_flush_aarch64 intentionally causes a SIGILL too, which is handled as expected by the test and doesn't cause any crash.

To Reproduce

  1. Remove custom -signal_stack_size 64K in clients/drcachesim/tests/burst_flush_aarch64.cpp.
  2. Build with debug on.
  3. Run clients/bin64/tool.drcacheoff.burst_flush_aarch64

Expected behavior
The SIGILL thrown by the test is expected and is handled too. But the crashing SIGSEGV is unexpected.

Screenshots or Pasted Text

Details in GDB:

(gdb) r
Starting program: /home/abhinavas/dr/build/i4328-2_debug/clients/bin64/tool.drcacheoff.burst_flush_aarch64 
pre-DR init
pre-DR start

Program received signal SIGILL, Illegal instruction.
0x0000aaaa6aaba448 in ?? ()
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
safe_read_asm_pre () at /home/abhinavas/dr/src/i4328-2/core/arch/aarch64/aarch64.asm:416
416	        strb     w3, [ARG1]
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000aaaaaab8c9fc in d_r_notify (priority=SYSLOG_NONE, internal=false, synch=false, substitution_num=0, prefix=0x0, fmt=0x0) at /home/abhinavas/dr/src/i4328-2/core/utils.c:1884
1884	{
(gdb) disassemble
Dump of assembler code for function d_r_notify:
   0x0000aaaaaab8c9f8 <+0>:	sub	sp, sp, #0x900
=> 0x0000aaaaaab8c9fc <+4>:	stp	x29, x30, [sp,#-16]!
   0x0000aaaaaab8ca00 <+8>:	mov	x29, sp

The second SIGSEGV is caused by the unexpected stack overflow in d_r_notify while pushing registers onto the stack. This d_r_notify is invoked at

SYSLOG_INTERNAL_WARNING_ONCE("(1+x) Handling our fault in a TRY at " PFX, pc);
.

Versions

  • What version of DynamoRIO are you using?
    At commit 70be2df

  • System details

abhinavas@tx1:~/dr/build/i4328-2_debug$ uname -a
Linux tx1.fhahn.com 4.9.0-4-arm64 #1 SMP Debian 4.9.51-1 (2017-09-28) aarch64 GNU/Linux

Additional context

Increasing signal_stack_size to 64K for the burst_flush_aarch64 test solved the issue. But it is unclear why the stack overflowed in the first place, as the stack doesn't seem to be too deep.

#4397 (comment)

@sapostolakis
Copy link
Contributor

sapostolakis commented Jan 25, 2021

I reproduced this issue and I am looking into it. After a69bac3 burst_flush_aarch64 test was changed to burst_aarch64_sys, but the process to reproduce the bug is the same.

@sapostolakis
Copy link
Contributor

It appears that 32K is indeed not enough to handle this particular two-level signal frame nest. It is surprising that this overflow has not occurred in more cases. As a fix, maybe it is worth permanently allocating 64K for the signal stack.

I added a few prints to stderr in core/unix/signal.c for a clearer view of the problem and the final segfault now occurs in a slightly different place without though changing the story.
Below I append a detailed view of the signal stack as I re-constructed using gdb:

signal_stack

As you can see from the figure, the allocated signal stack is not enough for the used stack space.
I am not sure about the signal frame parts. I inferred what these missing parts might be by looking at section 5.3.5 (and fig. 5.9) from Derek's thesis and the code in signal.c.

This failure does not occur in release mode either because the stack space used for the release build is slightly less and it just fits within 32K or because a lot of stack space for the first signal is already freed before the second one is caught (i.e., different timing).

@sapostolakis
Copy link
Contributor

For completeness here is some info from the gdb run from which the above signal stack was extracted:

(gdb) r

pre-DR init
signal stack is 0x0000fffdb7c79000 - 0x0000fffdb7c81000
pre-DR start

Program received signal SIGILL, Illegal instruction.
0x0000aaaa6aaca770 in ?? ()
(gdb) c
Continuing.

master_signal_handler: thread=96767, sig=4, xsp=0x0000fffdb7c7fda0, retaddr=0x0000000000000004
copy_frame_to_stack: rt=1, src=0x0000fffdb7c7fda0, sp=0x0000ffffffffdff0

Program received signal SIGSEGV, Segmentation fault.
safe_read_asm_pre () at /home/apostolakis/dynamorio/core/arch/aarch64/aarch64.asm:416
416             strb     w3, [ARG1]

(gdb) info s
#0  safe_read_asm_pre () at /home/apostolakis/dynamorio/core/arch/aarch64/aarch64.asm:416
#1  0x0000aaaaaadf2de0 in safe_read_fast (base=0xb7ffb500, size=8, out_buf=0xfffdb7c7b6c8, bytes_read=0x0) at /home/apostolakis/dynamorio/core/arch/x86_code.c:430
#2  0x0000aaaaaadfc0bc in safe_read_ex (base=0xb7ffb500, size=8, out_buf=0xfffdb7c7b6c8, bytes_read=0x0) at /home/apostolakis/dynamorio/core/unix/os.c:4582
#3  0x0000aaaaaadfc158 in d_r_safe_read (base=0xb7ffb500, size=8, out_buf=0xfffdb7c7b6c8) at /home/apostolakis/dynamorio/core/unix/os.c:4601
#4  0x0000aaaaaae12824 in sig_has_restorer (info=0xfffdb7c60148, sig=4) at /home/apostolakis/dynamorio/core/unix/signal.c:2925
#5  0x0000aaaaaae1af0c in execute_handler_from_cache (dcontext=0xfffdb7c3b080, sig=4, our_frame=0xfffdb7c7fda0, sc_orig=0xfffdb7c7da40, f=0xfffdb7c929f8, access_address=0x0) at /home/apostolakis/dynamorio/core/unix/signal.c:5658
#6  0x0000aaaaaae17814 in record_pending_signal (dcontext=0xfffdb7c3b080, sig=4, ucxt=0xfffdb7c7fe20, frame=0xfffdb7c7fda0, forged=false, access_address=0x0) at /home/apostolakis/dynamorio/core/unix/signal.c:4556
#7  0x0000aaaaaae1a7d4 in master_signal_handler_C (sig=4, siginfo=0xfffdb7c7fda0, ucxt=0xfffdb7c7fe20, xsp=0xfffdb7c7fda0 "\004") at /home/apostolakis/dynamorio/core/unix/signal.c:5501
#8  0x0000aaaaaadf31f4 in dynamorio_clone_parent () at /home/apostolakis/dynamorio/core/arch/aarch64/aarch64.asm:558

(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000aaaaaabaae78 in do_file_write (f=0, fmt=0x0, ap=...) at /home/apostolakis/dynamorio/core/utils.c:1651
1651    {

(gdb)  info s
#0  0x0000aaaaaabaae78 in do_file_write (f=0, fmt=0x0, ap=...) at /home/apostolakis/dynamorio/core/utils.c:1651
#1  0x0000aaaaaacceb84 in dr_vfprintf (f=2, fmt=0xaaaaaaf59c90 "\nmaster_signal_handler: thread=%d, sig=%d, xsp=%p, retaddr=%p\n", ap=...) at /home/apostolakis/dynamorio/core/lib/instrument.c:4842
#2  0x0000aaaaaaccec34 in dr_fprintf (f=2, fmt=0xaaaaaaf59c90 "\nmaster_signal_handler: thread=%d, sig=%d, xsp=%p, retaddr=%p\n") at /home/apostolakis/dynamorio/core/lib/instrument.c:4852
#3  0x0000aaaaaae19914 in master_signal_handler_C (sig=11, siginfo=0xfffdb7c7a390, ucxt=0xfffdb7c7a410, xsp=0xfffdb7c7a390 "\v") at /home/apostolakis/dynamorio/core/unix/signal.c:5224
#4  0x0000aaaaaadf31f4 in dynamorio_clone_parent () at /home/apostolakis/dynamorio/core/arch/aarch64/aarch64.asm:558

(gdb) c
Continuing.

master_signal_handler: thread=96767, sig=11, xsp=0x0000fffdb7c7fda0, retaddr=0x000000000000000b
<Application /home/apostolakis/dynamorio/build/clients/bin64/tool.drcacheoff.burst_aarch64_sys (96767). Cannot correctly handle received signal 11 in thread 96767: default action in native thread.>

@derekbruening
Copy link
Contributor

Summarizing offline discussion with some action items:

  • It seems surprising that sig_has_restorer would crash on this vanilla signal handler case: are we faulting every time and we need to tweak something b/c of the aarch64 restorer setup being different?

  • The x86 signal stack is just 24K (32K here b/c it's aligned to page size I assume and this machine has 16K or 32K pages?), yet even with AVX-512 (checked: GA CI does have AVX-512) we've never seen a stack overflow. Are the aarch64 signal frames somehow larger than the x86 ones?

  • The stack space used by record_pending_signal seems large: is there an easy way to reduce it?

sapostolakis added a commit that referenced this issue Feb 16, 2021
Prevents stack overflow in signal processing for the burst_aarch64_sys test without increasing the size of signal stack.
The implementation of signal() varies across different systems and involves unspecified behavior for various scenarios.
Instead, it is universally recommended to use sigaction. This might not have been a problem for other applications
because most of them must be using sigaction.

Fixes: #4425
@sapostolakis
Copy link
Contributor

The problem was the use of signal(). By using sigaction, there is no seg faulting and no need to increase the size of the signal stack.
See #4735.

sapostolakis added a commit that referenced this issue Apr 5, 2021
Prevents a seg fault in the burst_aarch64_sys test that was caused by reading an unspecified
sigaction restorer in sig_has_restorer() in unix/signal.c. Does so by returning false early
in sig_has_restorer() for AArch64 when the SA_RESTORER flag is not set.

By preventing the seg fault, it also prevents the nested signal handling and consequently
the stack overflow in burst_aarch64_sys.

Issue: #4425
@sapostolakis
Copy link
Contributor

PR #4840 prevents the first SIGSEGV in the burst_aarch64_sys test effectively preventing signal nesting and thus the stack overflow.

However, before closing this issue we need to figure out why the frames for signal handling on AArch64 are so big.
Using an equivalent test in x86_64 (using ud2 as the asm inst that causes SIGILL), I measured the following frame sizes while handling the SIGILL signal (appended the AArch64 frame sizes for comparison):

x86_64 frame sizes:
master_signal_handler_C:      416 bytes
record_pending_signal:        784 bytes
execute_handler_from_cache:   400 bytes


AArch64 frame sizes:
master_signal_handler_C:      4560 bytes
record_pending_signal:        9056 bytes
execute_handler_from_cache:   4528 bytes

There is a 10x increase in frame size which needs to be understood but justifies at least why signal nesting never caused problems for x86.

@AssadHashmi
Copy link
Contributor

There is a 10x increase in frame size which needs to be understood...

AIUI AArch64 reserves space for the Scalable Vector Extension (SVE and SVE2).
From core/unix/include/sigcontext.h:

typedef struct _kernel_sigcontext_t {                                     
    unsigned long long fault_address;                                     
    unsigned long long regs[31];                                          
    unsigned long long sp;                                                
    unsigned long long pc;                                                
    unsigned long long pstate;                                            
    /* 4K reserved for FP/SIMD state and future expansion */              <--
    unsigned char __reserved[4096] __attribute__((__aligned__(16)));      <--
} kernel_sigcontext_t;

@derekbruening
Copy link
Contributor

Neither SIGSEGV nor the suspend signal are blocked in the handler so we have to handle a nested signal.

The x86 SIMD state is dynamically sized: surprising the aarch64 is not as well.

I'm not sure your listing of function frame sizes is enough: we want to know the signal frame size as well, which includes all the expanded state ("xstate" on x86), and that is certainly bigger than 400-ish bytes on x86: just the 32 32-byte xmm registers is 1K, right?

I'm guessing the x86 signal frames are between 1K and 2K? And the a64 are >4K, so it's probably more like a 2x-3x difference there, coming from the dynamic precise sizing for x86?

For the function frame sizes: there's some kind of context structure on the stack in these frames, sigcontext_t. Is the size difference again from the x86 xstate being separated out, and if so how does the code get away with that through its translations without a copy of the xstate? Or does it dynamically make a copy?

@derekbruening
Copy link
Contributor

Is AVX-512 enabled on the machine you measured on? Is the xmm execution in signal_arch_init() enough to ensure we get the right size for AVX-512 lazy saving (though I thought, other than the old x87, the lazy saving was only for avoiding stores and it still had the full size)?

@sapostolakis
Copy link
Contributor

Is AVX-512 enabled on the machine you measured on?

Right, AVX-512 not supported on the machine I used, only AVX and AVX2. So, the x86 signal frames should be much larger for a machine with AVX-512.

@derekbruening
Copy link
Contributor

Summarizing offline discussion:

  1. Probably we can union the two structs (I assume it's two from the size) in record_pending_signal if not needed simultaneously
  2. Confirm the x86 bug mentiond
  3. Dynamic-alloc all sigcontext_t+xstate for all arches? Share pending special instead of new to avoid per-thread overhead from special alloc commit -- make sure we save vs just increasing stack size

sapostolakis added a commit that referenced this issue Apr 7, 2021
…4840)

Prevents a seg fault in the burst_aarch64_sys test that was caused by reading an unspecified
sigaction restorer in sig_has_restorer() in unix/signal.c. Does so by returning false early
in sig_has_restorer() for AArch64 when the SA_RESTORER flag is not set.

By preventing the seg fault, it also prevents the nested signal handling and consequently
the stack overflow in burst_aarch64_sys test when the -signal_stack_size is not specified.

Issue: #4425
sapostolakis added a commit that referenced this issue Apr 29, 2021
Avoids stack allocations during signal handling for conditionally used copies of sigcontext_t by
hiding them within callsites.

Issue: #4425
@sapostolakis
Copy link
Contributor

sapostolakis commented Apr 29, 2021

PR #4888 reduces the copies of sigcontext_t allocated and yields some decent savings (~9K savings).
The function frame sizes for AArch64 become:

master_signal_handler_C:   4560 bytes  -> 176 bytes
record_pending_signal:     9056 bytes  -> 4704 bytes

This essentially addresses point 1 from the last comment.
Yet to do 2 and 3.

sapostolakis added a commit that referenced this issue Apr 29, 2021
Avoids redundant stack allocations during signal handling for copies of sigcontext_t
by hiding them within callsites.

Issue: #4425
@sapostolakis
Copy link
Contributor

For x86 issues the issue reported in #1615 seems to be the only problem there. #4649 (and maybe some followups) should eventually address it.

For dynamic-alloc, it seems to be unnecessary for now. The stack allocs for copies of sig_context are tightly scoped with its usage. After #4888, even 24K signal stack seems to be enough for two nested signals regardless of the used code paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants