Task block detector deadlocks with exception throwing #298

Open
gleb-cloudius opened this issue Jun 6, 2017 · 10 comments
Comments

@gleb-cloudius
Contributor

Exception throwing takes the symbol table lock (inside dl_iterate_phdr) while unwinding the stack. If many threads throw exceptions simultaneously, a thread may wait on that lock for a long time; at that point the task block detector fires, tries to unwind the stack as well, and deadlocks because it needs the same lock:

#0  0x00007ff8bda511bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007ff8bda4cd1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007ff8bda4cc3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007ff8bd7b399f in dl_iterate_phdr () from /lib64/libc.so.6
#4  0x00007ff8bf6d7b5f in _ULx86_64_dwarf_find_proc_info () from /lib64/libunwind.so.8
#5  0x00007ff8bf6d5b45 in fetch_proc_info () from /lib64/libunwind.so.8
#6  0x00007ff8bf6d6de9 in _ULx86_64_dwarf_find_save_locs () from /lib64/libunwind.so.8
#7  0x00007ff8bf6d7519 in _ULx86_64_dwarf_step () from /lib64/libunwind.so.8
#8  0x00007ff8bf6d3901 in _ULx86_64_step () from /lib64/libunwind.so.8
#9  0x000000000055785d in backtrace<backtrace_buffer::append_backtrace()::{lambda(unsigned long)#1}>(backtrace_buffer::append_backtrace()::{lambda(unsigned long)#1}&&) (func=func@entry=<unknown type in /usr/bin/scylla, CU 0x0, DIE 0x3088bd>) at ./util/backtrace.hh:42
#10 0x00000000004ee2b9 in backtrace_buffer::append_backtrace (this=0x7ff8b7ff4220) at core/reactor.cc:281
#11 print_with_backtrace (buf=...) at core/reactor.cc:292
#12 0x00000000004ee5a2 in reactor::block_notifier () at core/reactor.cc:553
#13 <signal handler called>
#14 0x00007ff8bda4cc34 in pthread_mutex_lock () from /lib64/libpthread.so.0
#15 0x00007ff8bd7b399f in dl_iterate_phdr () from /lib64/libc.so.6
#16 0x00007ff8bf8fbc1f in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#17 0x00007ff8bf8f8d8c in ?? () from /lib64/libgcc_s.so.1
#18 0x00007ff8bf8f9c33 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#19 0x000000000196824c in __cxa_throw ()
#20 0x0000000000fb84e7 in std::make_exception_ptr<timed_out_error> (__ex=...) at /opt/scylladb/include/c++/5.3.1/bits/exception_ptr.h:174
#21 0x0000000000ff0f09 in promise<>::set_exception<timed_out_error>(timed_out_error&&) (e=<unknown type in /usr/bin/scylla, CU 0x13dd157e, DIE 0x1409e4ea>, this=0x604093668680) at /root/scylla/seastar/core/future.hh:489
#22 basic_semaphore<default_timeout_exception_factory, lowres_clock>::expiry_handler::operator() (this=<optimized out>, e=...) at /root/scylla/seastar/core/semaphore.hh:104
#23 expiring_fifo<basic_semaphore<default_timeout_exception_factory, lowres_clock>::entry, basic_semaphore<default_timeout_exception_factory, lowres_clock>::expiry_handler, lowres_clock>::entry::entry(basic_semaphore<default_timeout_exception_factory, lowres_clock>::entry, expiring_fifo<basic_semaphore<default_timeout_exception_factory, lowres_clock>::entry, basic_semaphore<default_timeout_exception_factory, lowres_clock>::expiry_handler, lowres_clock>&, std::chrono::time_point<lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}::operator()() const (__closure=0x6040936686c0) at /root/scylla/seastar/core/expiring_fifo.hh:66
#24 std::_Function_handler<void (), expiring_fifo<basic_semaphore<default_timeout_exception_factory, lowres_clock>::entry, basic_semaphore<default_timeout_exception_factory, lowres_clock>::expiry_handler, lowres_clock>::entry::entry(basic_semaphore<default_timeout_exception_factory, lowres_clock>::entry, expiring_fifo<basic_semaphore<default_timeout_exception_factory, lowres_clock>::entry, basic_semaphore<default_timeout_exception_factory, lowres_clock>::expiry_handler, lowres_clock>&, std::chrono::time_point<lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /opt/scylladb/include/c++/5.3.1/functional:1871
#25 0x0000000000511598 in std::function<void ()>::operator()() const (this=<optimized out>) at /opt/scylladb/include/c++/5.3.1/functional:2271
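
(For reference, a minimal reproducer sketch of the same shape; this is hypothetical and is not the test program mentioned later in the thread. Worker threads throw in a loop while a periodic signal handler walks the interrupted stack with libunwind, the way reactor::block_notifier does; both paths end up in dl_iterate_phdr(), which takes glibc's non-recursive loader lock.)

// Hypothetical reproducer sketch; build with: g++ -O2 -pthread repro.cc -lunwind
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <signal.h>
#include <unistd.h>
#include <stdexcept>
#include <thread>
#include <vector>

// Stand-in for reactor::block_notifier: unwinds the interrupted stack, which
// may call dl_iterate_phdr() to locate unwind tables and therefore block on
// the loader lock held (or awaited) by the interrupted code.
static void block_notifier(int) {
    unw_context_t ctx;
    unw_cursor_t cursor;
    unw_getcontext(&ctx);
    unw_init_local(&cursor, &ctx);
    while (unw_step(&cursor) > 0) {
        // just walk the frames, as backtrace_buffer::append_backtrace() does
    }
}

int main() {
    struct sigaction sa {};
    sa.sa_handler = block_notifier;
    sigaction(SIGALRM, &sa, nullptr);

    // Many threads throwing concurrently: __cxa_throw -> _Unwind_RaiseException
    // -> _Unwind_Find_FDE -> dl_iterate_phdr(), all contending on the same lock.
    std::vector<std::thread> throwers;
    for (int i = 0; i < 16; ++i) {
        throwers.emplace_back([] {
            for (;;) {
                try {
                    throw std::runtime_error("timed out");
                } catch (...) {
                }
            }
        });
    }

    // Periodically interrupt the process, as the task block detector does.
    for (;;) {
        alarm(1);
        pause();
    }
}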

@avikivity
Member

Not sure how we can fix this. Does that lock need to be recursive?

@gleb-cloudius
Contributor Author

gleb-cloudius commented Jun 6, 2017 via email

@avikivity
Member

Just hit this again.

@avikivity
Member

Possible solution: hijack __cxa_throw, set a thread-local flag, call original __cxa_throw, unset flag.

Signal handler can then just look at the flag and look the other way if it is set.

@gleb-cloudius ?

@avikivity
Member

Managed to reproduce easily with a test program, but only with 464f5e3 reverted.

Hijacking __cxa_throw won't work, because it never returns. However we might build on the existing hijack by 464f5e3.
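
(For illustration only: a minimal sketch of the thread-local-flag idea, assuming the flag is cleared by also interposing __cxa_begin_catch, since __cxa_throw itself never returns. This is a hypothetical sketch, not necessarily what the existing hijack in 464f5e3 does.)

// Hypothetical sketch: interpose __cxa_throw to mark "unwind in progress" in a
// thread-local flag, and clear it again in __cxa_begin_catch once the landing
// pad is reached. The block detector skips unwinding while the flag is set.
#include <dlfcn.h>     // dlsym, RTLD_NEXT
#include <typeinfo>

static thread_local bool unwinding = false;

extern "C" void __cxa_throw(void* obj, std::type_info* tinfo, void (*dest)(void*)) {
    using throw_fn = void (*)(void*, std::type_info*, void (*)(void*));
    static auto real_throw = reinterpret_cast<throw_fn>(dlsym(RTLD_NEXT, "__cxa_throw"));
    unwinding = true;             // set before the unwinder enters dl_iterate_phdr()
    real_throw(obj, tinfo, dest); // never returns
    __builtin_unreachable();
}

extern "C" void* __cxa_begin_catch(void* exc) noexcept {
    using begin_fn = void* (*)(void*);
    static auto real_begin = reinterpret_cast<begin_fn>(dlsym(RTLD_NEXT, "__cxa_begin_catch"));
    unwinding = false;            // handler reached, safe to backtrace again
    return real_begin(exc);
}

// In reactor::block_notifier (sketch):
//     if (unwinding) { return; }   // look the other way, as proposed above
//     print_with_backtrace(...);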

@gleb-cloudius
Contributor Author

gleb-cloudius commented Aug 27, 2017 via email

@avikivity
Member

I suspected it, so I tried with gcc 6.3 (also to try on a bigger machine).

@frank8989

I also hit this problem with scylla-2.0.0 on CentOS Linux release 7.2.1511. Many tasks hang because of the deadlock, which is caused by exceptions (write timeouts).

Thread 1 (Thread 0x7fb4a4c2f080 (LWP 48314)):
#0 0x00007fb4a120bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fb4a1207d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fb4a1207c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fb4a2e91267 in _ULx86_64_dwarf_find_save_locs () from /lib64/libunwind.so.8
#4 0x00007fb4a2e91b49 in _ULx86_64_dwarf_step () from /lib64/libunwind.so.8
#5 0x00007fb4a2e8db31 in _ULx86_64_step () from /lib64/libunwind.so.8
#6 0x000000000055716d in seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(unsigned long)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(unsigned long)#1}&&) (
func=func@entry=<unknown type in /usr/lib/debug/usr/bin/scylla.debug, CU 0x0, DIE 0x2f4971>) at /usr/src/debug/scylla-2.0.0/seastar/util/backtrace.hh:44
#7 0x00000000004eef59 in seastar::backtrace_buffer::append_backtrace (this=0x7ffcfcf67ce0) at core/reactor.cc:277
#8 (anonymous namespace)::print_with_backtrace (buf=...) at core/reactor.cc:288
#9 0x00000000004ef23a in seastar::reactor::block_notifier () at core/reactor.cc:514
#10 <signal handler called>
#11 0x00007fb4a120bf4b in __lll_lock_wait () from /lib64/libpthread.so.0
#12 0x00007fb4a1207d1d in _L_lock_840 () from /lib64/libpthread.so.0
#13 0x00007fb4a1207c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#14 0x00007fb4a0f6ceef in dl_iterate_phdr () from /lib64/libc.so.6
#15 0x00007fb4a30b5c1f in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#16 0x00007fb4a30b2d8c in ?? () from /lib64/libgcc_s.so.1
#17 0x00007fb4a30b374d in ?? () from /lib64/libgcc_s.so.1
#18 0x00007fb4a30b3bde in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#19 0x0000000001b9caac in __cxa_throw ()
#20 0x000000000149a88c in std::make_exception_ptr<seastar::broken_condition_variable> (__ex=...) at /opt/scylladb/include/c++/5.3.1/bits/exception_ptr.h:174
#21 seastar::basic_semaphore<seastar::condition_variable::condition_variable_exception_factory, std::chrono::_V2::steady_clock>::broken (this=0x6000062aa428) at /usr/src/debug/scylla-2.0.0/seastar/core/semaphore.hh:255
#22 seastar::condition_variable::broken (this=0x6000062aa428) at /usr/src/debug/scylla-2.0.0/seastar/core/condition-variable.hh:174
#23 seastar::rpc::protocol<netw::serializer, netw::messaging_verb>::connection::stop_send_loop (this=0x6000062aa300) at /usr/src/debug/scylla-2.0.0/seastar/rpc/rpc.hh:225
#24 seastar::rpc::protocol<netw::serializer, netw::messaging_verb>::server::connection::process()::{lambda(seastar::future<>)#2}::operator()(seastar::future<>) const (__closure=__closure@entry=0x600072fe9fb8, f=...)
at /usr/src/debug/scylla-2.0.0/seastar/rpc/rpc_impl.hh:999
#25 0x000000000149ad47 in seastar::futurize<seastar::future<> >::apply<seastar::rpc::protocol<netw::serializer, netw::messaging_verb>::server::connection::process()::{lambda(seastar::future<>)#2}, seastar::future<> >(seastar::rpc::protocol<netw::serializer, netw::messaging_verb>::server::connection::process()::{lambda(seastar::future<>)#2}&&, seastar::future<>&&) (func=func@entry=<unknown type in /usr/lib/debug/usr/bin/scylla.debug, CU 0x19b3f38a, DIE 0x1a5afc3d>)
at /usr/src/debug/scylla-2.0.0/seastar/core/future.hh:1312
#26 0x000000000149b234 in ZZN7seastar6futureIJEE12then_wrappedIZNS_3rpc8protocolIN4netw10serializerENS5_14messaging_verbEE6server10connection7processEvEUlS1_E0_S1_EET0_OT_ENUlSE_E_clINS_12future_stateIJEEEEEDaSE (
state=<unknown type in /usr/lib/debug/usr/bin/scylla.debug, CU 0x19b3f38a, DIE 0x1a5b09ae>, __closure=0x600072fe9f98) at /usr/src/debug/scylla-2.0.0/seastar/core/future.hh:940
#27 _ZN7seastar12continuationIZNS_6futureIJEE12then_wrappedIZNS_3rpc8protocolIN4netw10serializerENS6_14messaging_verbEE6server10connection7processEvEUlS2_E0_S2_EET0_OT_EUlSF_E_JEE3runEv (this=0x600072fe9f88)
at /usr/src/debug/scylla-2.0.0/seastar/core/future.hh:395
#28 0x00000000004ecd10 in seastar::reactor::run_tasks (this=this@entry=0x600000493000, tasks=...) at core/reactor.cc:2316
#29 0x000000000054083b in seastar::reactor::run (this=0x600000493000) at core/reactor.cc:2774
#30 0x00000000005c4f85 in seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) (this=this@entry=0x7ffcfcf6b670, ac=ac@entry=16, av=av@entry=0x7ffcfcf6b8d8,
func=func@entry=<unknown type in /usr/lib/debug/usr/bin/scylla.debug, CU 0xa2cbc7, DIE 0xb2de88>) at core/app-template.cc:142
#31 0x00000000004751a0 in main (ac=16, av=0x7ffcfcf6b8d8) at main.cc:682

@glommer
Contributor

glommer commented Jun 8, 2018

@gleb-cloudius this is fixed. Let's close it.

@glommer
Contributor

glommer commented Nov 14, 2019

@avikivity this seems fixed and I don't have permissions to close seastar issues.
Please close.
