memory_prefaulter leaves a zombie pthread, confuses gdb during coredump debugging #2623
Comments
Arguably a bug in gdb?
It would be better to tell the reactor (perhaps via seastar::alien) to collect the thread. However, it could be tricky since the infrastructure around cleanup is very dated. Maybe it should be hooked into app_template.
I realized that I didn't explicitly say what the effect of this bug is. It has caused us some problems in Scylla over the last year: several times we needed coredumps to debug an issue, but they seemed to be corrupted, and we couldn't use them. (I linked some affected Scylla issues in the opening post, but there were more.) But AFAIK, until now nobody tried to understand the source of this "corruption". This issue can be worked around by manually editing the core so that the pthread handle of memory_prefaulter points to the TLS of shard 0. (Or by copying the TLS of shard 0 over the TLS of memory_prefaulter. Or any other equivalent. Or by forking gdb and ignoring zombie threads in
Interesting. I wanted to collect the thread while writing the prefaulter, but only from a dislike of leaving garbage around. This is better motivation.
Could we fix up gdb to use a real reactor thread to read these thread-local variables? I guess the problem will appear even if the thread isn't a zombie (and it's quite likely to happen if we crash while prefaulting, due to some startup bug).
Well, as I said, it seems to me that the right solution to all of this would be to teach gdb (specifically
No. When the thread is alive, then its Only after the thread becomes a zombie, its
After the memory_prefaulter threads do their job and exit, their handles aren't join()ed until smp::~smp runs. This means that they leave a zombie entry in pthread's thread lists.
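To illustrate the alternative of collecting a helper thread as soon as it has finished, rather than leaving its handle unjoined until a destructor runs, here is a minimal sketch. It uses std::thread from the standard library, not Seastar's own thread machinery, and the class name one_shot_worker and its methods are hypothetical names made up for this example.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper (not part of Seastar): runs one job on its own thread
// and lets the owner join that thread as soon as the job has finished.
class one_shot_worker {
    std::promise<void> _finished;
    std::future<void> _finished_future = _finished.get_future();
    std::thread _thread;  // declared last, so it starts after the members above exist
public:
    explicit one_shot_worker(std::function<void()> work)
        : _thread([this, work = std::move(work)] {
              work();
              _finished.set_value();  // signal that the job is complete
          }) {}

    // Joins the thread if its job is already done. Does not wait for the job
    // itself; returns false when the job is still running.
    bool try_collect() {
        if (!_thread.joinable()) {
            return true;  // already collected
        }
        if (_finished_future.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
            return false;
        }
        _thread.join();  // the thread is about to exit, so this returns promptly
        return true;
    }

    bool collected() const { return !_thread.joinable(); }

    // Fallback: join at destruction if nobody collected the thread earlier.
    ~one_shot_worker() {
        if (_thread.joinable()) {
            _thread.join();
        }
    }
};
```

An owner could call try_collect() from a periodic poll, so that the exited thread is joined long before the owner is destroyed and does not linger as an unjoined entry in the meantime.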
As it turns out, this confuses gdb. When looking for threads and their TLS segments, gdb uses td_ta_thr_iter (from libthread-db/nptl_db) to iterate over all threads:
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/linux-thread-db.c;h=9d84187a9ad0897ced25c7639e92ebb2e1e96746;hb=refs/heads/master#l1540
Then, it uses the ti_lid field (which is supposed to contain the PID of the thread) returned by libthread-db as an identifier for the thread:
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/linux-thread-db.c;h=9d84187a9ad0897ced25c7639e92ebb2e1e96746;hb=refs/heads/master#l1513
The issue is: for those zombie entries, libthread-db reports the PID of the process (which in Seastar is equal to the PID of reactor-0), not the PID of the thread:
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl_db/td_thr_get_info.c;h=7a64ef4c63614cf86714df9fc56f235b04f52253;hb=refs/heads/master#l108
So what happens is that gdb sees the handle of the zombie memory_prefaulter thread, thinks it's the entry of reactor-0, and uses this entry to serve queries about thread-local variables of reactor-0. The real entry of reactor-0 is ignored, since an entry for this PID has already been recorded by the time the real entry is seen.
Refs scylladb/scylladb#15665 (comment)
Refs scylladb/scylladb#19110 (comment)
Refs scylladb/scylladb#19110 (comment)
Refs scylladb/scylladb#22245 (comment)
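The "first entry for a PID wins" effect described above can be modelled with a toy example. This is not gdb's actual code: the struct thread_entry, the function index_by_lid, and the PID values are all hypothetical, chosen only to show how a zombie entry that reports the process PID can shadow the real entry of reactor-0.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy model of one entry reported while iterating over threads.
struct thread_entry {
    int lid;                // the ID reported for the thread (ti_lid in the text above)
    std::string tls_owner;  // whose TLS this entry actually describes
};

// Records entries keyed by lid. emplace() does not overwrite an existing key,
// so the first entry seen for a given lid wins and later ones are ignored.
std::map<int, std::string> index_by_lid(const std::vector<thread_entry>& entries) {
    std::map<int, std::string> by_lid;
    for (const auto& e : entries) {
        by_lid.emplace(e.lid, e.tls_owner);
    }
    return by_lid;
}

// Example input: the zombie entry reports the process PID (100, assumed here),
// which is also the PID of reactor-0, and it happens to be seen first.
std::vector<thread_entry> example_entries() {
    return {
        {100, "memory_prefaulter (zombie)"},
        {100, "reactor-0"},
        {101, "reactor-1"},
    };
}
```

With this input, a lookup for lid 100 returns the zombie's TLS rather than that of reactor-0, which matches the symptom of thread-local variables of reactor-0 looking corrupted in the coredump.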