Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rr sometimes breaks traces of dynamically loaded .so libraries #3845

Closed
midenok opened this issue Oct 7, 2024 · 4 comments
Closed

rr sometimes breaks traces of dynamically loaded .so libraries #3845

midenok opened this issue Oct 7, 2024 · 4 comments

Comments

@midenok
Copy link

midenok commented Oct 7, 2024

While supplying full repro environment is hard I will try to explain what do we have. There are 2 methods of running rr in our testing environment: good and bad. The bad environment is the complex one: we run rr under several perl scripts forked each other. The good environment is the simple one: only rr and debugged program. But rr arguments are similar in both runs.

The bad envirnment produces broken stacks like this:

Broken stacktrace
Thread 14 (Thread 3415313.3415527 (mariadbd)):
#0  0x0000000070000002 in syscall_traced ()
#1  0x00007503ea93ec9b in _raw_syscall () at /usr/src/rr-5.8.0-1/src/preload/raw_syscall.S:120
#2  0x00007503ea9388bd in traced_raw_syscall (call=call@entry=0x7503c7bfffa0) at /usr/src/rr-5.8.0-1/src/preload/syscallbuf.c:377
#3  0x00007503ea93a337 in sys_poll (call=call@entry=0x7503c7bfffa0) at /usr/src/rr-5.8.0-1/src/preload/syscallbuf.c:2660
#4  0x00007503ea93ca64 in syscall_hook_internal (call=0x7503c7bfffa0) at /usr/src/rr-5.8.0-1/src/preload/syscallbuf.c:4218
#5  syscall_hook (call=0x7503c7bfffa0) at /usr/src/rr-5.8.0-1/src/preload/syscallbuf.c:4355
#6  syscall_hook (call=0x7503c7bfffa0) at /usr/src/rr-5.8.0-1/src/preload/syscallbuf.c:4339
#7  0x00007503ea9382d3 in _syscall_hook_trampoline () at /usr/src/rr-5.8.0-1/src/preload/syscall_hook.S:308
#8  0x00007503ea93833d in __morestack () at /usr/src/rr-5.8.0-1/src/preload/syscall_hook.S:443
#9  0x00007503ea938359 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /usr/src/rr-5.8.0-1/src/preload/syscall_hook.S:462
#10 0x00007503ea11a215 in __GI___poll (fds=0x7503e807f978, nfds=1, timeout=600000) at ../sysdeps/unix/sysv/linux/poll.c:29
#11 0x00005e28cb88528e in vio_io_wait (vio=0x7503cc405028, event=VIO_IO_EVENT_READ, timeout=600000) at ../src/vio/viosocket.c:1000
#12 0x00005e28cb884ff5 in vio_socket_io_wait (vio=0x7503cc405028, event=VIO_IO_EVENT_READ) at ../src/vio/viosocket.c:118
#13 0x00005e28cb88564f in vio_read (vio=0x7503cc405028, buf=0x7503cc0788d8 "\a", size=16384) at ../src/vio/viosocket.c:199
#14 0x00005e28cb885ba6 in vio_read_buff (vio=0x7503cc405028, buf=0x7503cc2a9b88 "\230", size=4) at ../src/vio/viosocket.c:247
#15 0x00005e28cb3449fb in my_real_read (net=0x7503cc1fad78, complen=0x7503e807fbf0, header=0 '\000') at ../src/sql/net_serv.cc:996
#16 0x00005e28cb34447e in my_net_read_packet_reallen (net=0x7503cc1fad78, read_from_server=0 '\000', reallen=0x7503e807fcb0) at ../src/sql/net_serv.cc:1277
#17 0x00005e28cb3390b9 in cli_safe_read_reallen (mysql=0x7503cc1fad78, reallen=0x7503e807fcb0) at ../src/sql-common/client.c:385
#18 0x00005e28cb339071 in cli_safe_read (mysql=0x7503cc1fad78) at ../src/sql-common/client.c:374
#19 0x00005e28cb3425dc in cli_read_query_result (mysql=0x7503cc1fad78) at ../src/sql-common/client.c:3533
#20 0x00005e28cb34025d in server_mysql_real_query (mysql=0x7503cc1fad78, query=0x7503cc393818 "select `x`,`row_start`,`row_end` from `test`.`t10` where ((`row_end` > '2024-10-07 17:35:46.398977') and (`row_start` <= '2024-10-07 17:35:46.398977'))", length=151) at ../src/sql-common/client.c:3627
#21 0x00007503e0193648 in ?? ()
#22 0x0000000100010000 in ?? ()
#23 0x00005e28cbf3bb62 in _my_thread_var () at ../src/mysys/my_thr_init.c:389
#24 0x00007503e00cac39 in ?? ()
#25 0x00007503e8080170 in ?? ()
#26 0x00005e28cbf3c4e0 in my_thread_var_dbug () at ../src/mysys/my_thr_init.c:439
#27 0x00007503e01c42fc in ?? ()
#28 0x00007503e80802a0 in ?? ()
#29 0x00007503cc406bb8 in ?? ()
#30 0x00007503cc406558 in ?? ()
#31 0x00000000cb5ad320 in ?? ()
#32 0x0000010000000001 in ?? ()
#33 0x00007503cc001618 in ?? ()
#34 0x00007503e0082234 in ?? ()
#35 0x00007503e0084f65 in ?? ()
#36 0x000005e38000000e in ?? ()
#37 0x00007503cc0015b0 in ?? ()
#38 0x0000000000000097 in ?? ()
#39 0x0000000000000001 in ?? ()
#40 0x00000000e8080320 in ?? ()
#41 0x00007503cc2d3fb0 in ?? ()
#42 0x0000000dcc18ad88 in ?? ()
#43 0x00000258e80802e8 in ?? ()
#44 0x00007503e0075476 in ?? ()
#45 0x00007503e00819de in ?? ()
#46 0x000031de8000000d in ?? ()
#47 0x00007503e8080438 in ?? ()
#48 0x00000097e0075476 in ?? ()
#49 0x00007503cc2d4000 in ?? ()
#50 0x00007503cc3d6a98 in ?? ()
#51 0x00000003e8080438 in ?? ()
#52 0x00007503cc406bb8 in ?? ()
#53 0x0000000000000001 in ?? ()
#54 0x00007503cc2d3fb0 in ?? ()
#55 0x00007503cc406bb8 in ?? ()
#56 0x00007503e8080370 in ?? ()
#57 0x00007503e0164fb2 in ?? ()
#58 0x00007503cc2d3fb0 in ?? ()
#59 0x00007503cc406bb8 in ?? ()
#60 0x0000000000000001 in ?? ()
#61 0x00007503e8080464 in ?? ()
#62 0x0000000000000000 in ?? ()

Logging with:

export RR_LOG=all:debug

produces following differences:

bad.log

[record_signal] 3459971: handling signal SIGSTKFLT (pevent: PTRACE_EVENT(0), event: (none)
[record_signal] Not safe to deliver signal at 0x73de3154345a
[RecordTask] Set signal mask to block all signals (bar SYSCALLBUF_DESCHED_SIGNAL/TIME_SLICE_SIGNAL) while we  have a stashed signal
[record_signal] Reached final syscallbuf instruction, singlestepping to enable signal dispatch
[record_signal] 3459971: handling signal SIGSTKFLT (pevent: PTRACE_EVENT(0), event: (none)
[record_signal] Safe to deliver signal at 0x73de30d2a882 because not in syscallbuf
[record_signal] 3459971: handling signal SIGPWR (pevent: PTRACE_EVENT(0), event: (none)
[record_signal]   dropping ignored SIGPWR
[record_signal]   (at disarm-desched, so finished buffered syscall; resuming)
...
[record_signal] 3459971: handling signal SIGSTKFLT (pevent: PTRACE_EVENT(0), event: (none)
[record_signal] Not safe to deliver signal at 0x73de3154345a
[RecordTask] Set signal mask to block all signals (bar SYSCALLBUF_DESCHED_SIGNAL/TIME_SLICE_SIGNAL) while we  have a stashed signal
[record_signal] Reached final syscallbuf instruction, singlestepping to enable signal dispatch
[record_signal] 3459971: handling signal SIGSTKFLT (pevent: PTRACE_EVENT(0), event: (none)
[record_signal] Safe to deliver signal at 0x73de30d2a882 because not in syscallbuf
[record_signal] 3459971: handling signal SIGPWR (pevent: PTRACE_EVENT(0), event: (none)
[record_signal]   dropping ignored SIGPWR
[record_signal]   (at disarm-desched, so finished buffered syscall; resuming)
2024-10-07 19:54:43 0 [Note] Got signal 15 to shutdown mysqld
[record_signal] 3459739: handling signal SIGUSR1 (pevent: PTRACE_EVENT(0), event: SYSCALL_INTERRUPTION: poll

good.log

[record_signal] 3462104: handling signal SIGSTKFLT (pevent: PTRACE_EVENT(0), event: (none)
[record_signal] Safe to deliver signal at 0x601027a14905 because not in syscallbuf

Full logs attached.

full-logs.zip

signal-bad.log
signal-good.log

Can you please give an insight of what may cause Not safe to deliver signal and why this results into broken trace.

rr version 5.8.0
@khuey
Copy link
Collaborator

khuey commented Oct 7, 2024

By broken stacks you mean that the trace replays fine just the backtrace in gdb is wrong? rr record -n will work around this problem at the cost of slower recording performance. What specifically is wrong with this backtrace? Just that frames 21 and below are garbage? Or is there more to it than that?

The "Not safe to deliver signal" log message just means we were in some part of the syscallbuf code (what the -n argument disables) that expects to run atomically when we tried to deliver that signal (we use SIGSTKFLT internally for context switches) and we had to defer it to a safe place before delivering it.

@midenok
Copy link
Author

midenok commented Oct 7, 2024

Yes, it seems to replay fine. I replay it to the end, then breakpoint and reverse-continue. On the breakpoint I examine threads' stack traces. Everything from the executable is fine, but the frames from .so library are not unwinding.

So I tried rr record -n and it is still reproducing. "Not safe to deliver" has been gone, so my guess about this log difference was wrong.

Now what only difference I can see from grepped by "signal" logs is:

signal-bad.log

2024-10-07 19:54:43 0 [Note] Got signal 15 to shutdown mysqld
[record_signal] 3459739: handling signal SIGUSR1 (pevent: PTRACE_EVENT(0), event: SYSCALL_INTERRUPTION: poll
[record_signal] Safe to deliver signal at 0x70000002 because in traced syscall
[Scheduler]   (3459739 is un-switchable at SIGNAL: SIGUSR1(async))
[INFO log_pending_events()] SIGNAL: SIGUSR1(async)
[RecordTask] Wrote event SIGNAL: SIGUSR1(async) for time 8035
[RecordSession]     in signal-stop for SIGUSR1
[RecordSession]     injecting signal SIGUSR1
[Task] resuming execution of 3459739 with PTRACE_SINGLESTEP, signal SIGUSR1 tick_period -2 wait 2
[RecordTask] Wrote event SIGNAL_HANDLER: SIGUSR1(async) for time 8036
[Scheduler] Task event is SIGNAL_HANDLER: SIGUSR1(async)
[INFO log_pending_events()] SIGNAL_HANDLER: SIGUSR1(async)
[Scheduler]   (3459739 is un-switchable at SIGNAL_HANDLER: SIGUSR1(async))
[INFO log_pending_events()] SIGNAL_HANDLER: SIGUSR1(async)
[RecordSession]   3459739: handle_ptrace_event PTRACE_EVENT_SECCOMP: event SIGNAL_HANDLER: SIGUSR1(async)
[INFO log_pending_events()] SIGNAL_HANDLER: SIGUSR1(async)
[INFO log_pending_events()] SIGNAL_HANDLER: SIGUSR1(async)

signal-good.log

[record_signal] Safe to deliver signal at 0x70000002 because in traced syscall
[Scheduler]   (3462104 is un-switchable at SIGNAL: SIGINT(async))
[INFO log_pending_events()] SIGNAL: SIGINT(async)
[RecordTask] Wrote event SIGNAL: SIGINT(async) for time 9285
[RecordSession]     in signal-stop for SIGINT
[RecordSession]     injecting signal SIGINT
[Task] resuming execution of 3462104 with PTRACE_SINGLESTEP, signal SIGINT tick_period -2 wait 2
[RecordTask] Wrote event SIGNAL_HANDLER: SIGINT(async) for time 9286
[Scheduler] Task event is SIGNAL_HANDLER: SIGINT(async)
[INFO log_pending_events()] SIGNAL_HANDLER: SIGINT(async)
[Scheduler]   (3462104 is un-switchable at SIGNAL_HANDLER: SIGINT(async))
[INFO log_pending_events()] SIGNAL_HANDLER: SIGINT(async)
[RecordSession]   3462104: handle_ptrace_event PTRACE_EVENT_SECCOMP: event SIGNAL_HANDLER: SIGINT(async)
[INFO log_pending_events()] SIGNAL_HANDLER: SIGINT(async)
[INFO log_pending_events()] SIGNAL_HANDLER: SIGINT(async)

It is different terminate procedure. Good is just stopped by Ctrl+C, while bad goes through SIGTERM. How SIGTERM is turned into SIGUSR1 I don't understand. Maybe it is internal rr hack to avoid its own SIGTERM. But I see nowhere SIGUSR1 in our code that might be sent during shutdown process.

@midenok midenok changed the title rr sometimes breaks traces of dynamically loaded .so libraries (Not safe to deliver signal SIGSTKFLT) rr sometimes breaks traces of dynamically loaded .so libraries Oct 7, 2024
@midenok
Copy link
Author

midenok commented Oct 9, 2024

I just noticed that on Bad there is no subject .so in info shared:

Bad
(rr) info shared
From                To                  Syms Read   Shared Object Library
0x00007ed5e6804000  0x00007ed5e682d455  Yes         /lib64/ld-linux-x86-64.so.2
0x00007ed5e67eb080  0x00007ed5e67f1ca9  Yes         /usr/bin/../lib/rr/librrpreload.so
0x00007ed5e67ba040  0x00007ed5e67ba105  Yes         /lib/x86_64-linux-gnu/libdl.so.2
0x00007ed5e67242e0  0x00007ed5e678e1cf  Yes (*)     /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0
0x00007ed5e66ea040  0x00007ed5e66fd97b  Yes (*)     /lib/x86_64-linux-gnu/libcrypt.so.1
0x00007ed5e66c9180  0x00007ed5e66e3e8c  Yes (*)     /usr/lib/x86_64-linux-gnu/liblz4.so.1
0x00007ed5e6695400  0x00007ed5e66b65ef  Yes (*)     /usr/lib/x86_64-linux-gnu/liblzma.so.5
0x00007ed5e668e0a0  0x00007ed5e668e5eb  Yes (*)     /usr/lib/x86_64-linux-gnu/libaio.so.1
0x00007ed5e6683800  0x00007ed5e6688822  Yes (*)     /usr/lib/x86_64-linux-gnu/libnuma.so.1
0x00007ed5e6664280  0x00007ed5e6676103  Yes (*)     /usr/lib/x86_64-linux-gnu/libz.so.1
0x00007ed5e65863c0  0x00007ed5e6605878  Yes         /lib/x86_64-linux-gnu/libm.so.6
0x00007ed5e62a55c0  0x00007ed5e63e4ce2  Yes (*)     /usr/lib/x86_64-linux-gnu/libstdc++.so.6
0x00007ed5e654a780  0x00007ed5e656d745  Yes (*)     /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
0x00007ed5e5e267c0  0x00007ed5e5fa45ad  Yes         /lib/x86_64-linux-gnu/libc.so.6
Good
(rr) info shared
From                To                  Syms Read   Shared Object Library
0x00000636f94d8000  0x00000636f9501455  Yes         /lib64/ld-linux-x86-64.so.2
0x00000636f94bf080  0x00000636f94c5ca9  Yes         /usr/bin/../lib/rr/librrpreload.so
0x00000636f948e040  0x00000636f948e105  Yes         /lib/x86_64-linux-gnu/libdl.so.2
0x00000636f93f82e0  0x00000636f94621cf  Yes (*)     /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0
0x00000636f93be040  0x00000636f93d197b  Yes (*)     /lib/x86_64-linux-gnu/libcrypt.so.1
0x00000636f939d180  0x00000636f93b7e8c  Yes (*)     /usr/lib/x86_64-linux-gnu/liblz4.so.1
0x00000636f9369400  0x00000636f938a5ef  Yes (*)     /usr/lib/x86_64-linux-gnu/liblzma.so.5
0x00000636f93620a0  0x00000636f93625eb  Yes (*)     /usr/lib/x86_64-linux-gnu/libaio.so.1
0x00000636f9357800  0x00000636f935c822  Yes (*)     /usr/lib/x86_64-linux-gnu/libnuma.so.1
0x00000636f9338280  0x00000636f934a103  Yes (*)     /usr/lib/x86_64-linux-gnu/libz.so.1
0x00000636f925a3c0  0x00000636f92d9878  Yes         /lib/x86_64-linux-gnu/libm.so.6
0x00000636f8ea55c0  0x00000636f8fe4ce2  Yes (*)     /usr/lib/x86_64-linux-gnu/libstdc++.so.6
0x00000636f921e780  0x00000636f9241745  Yes (*)     /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
0x00000636f8a267c0  0x00000636f8ba45ad  Yes         /lib/x86_64-linux-gnu/libc.so.6
0x00000636f05dcc00  0x00000636f070cefc  Yes         /home/midenok/src/mariadb/10.5b/build/mysql-test/var/plugins/ha_spider.so

An attempt to load it fails:

(rr) load /home/midenok/src/mariadb/10.5b/build/mysql-test/var/plugins/ha_spider.so
`/home/midenok/src/mariadb/10.5b/build/mysql-test/var/log/mysqld.1.1.rr/mariadbd-0/mmap_hardlink_4_mariadbd' has changed; re-reading symbols.
Loading section .note.gnu.build-id, size 0x24 lma 0x270
[ERROR ./src/GdbServer.cc:570:dispatch_debugger_request()] Attempt to write memory outside diversion session
Load failed

But that is expected, I guess, as this is read-only session, though RR might print a better message. The problem is after dlclose() .so goes away from the list and does not return back when replaying back.

@midenok
Copy link
Author

midenok commented Oct 9, 2024

I'm closing this bug as the original description is wrong and reopening a new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants