Potential deadlock at the shutdown #939

Closed
fanlong opened this issue Feb 23, 2020 · 4 comments · Fixed by #1714
Labels: bug (Something isn't working), P2 (Low Priority Issue that awaits fix)

Comments

@fanlong
Contributor

fanlong commented Feb 23, 2020

This issue occurs non-deterministically when running our test scripts.

Traceback (most recent call last):
  File "/Users/fanl/Workspace/conflux/conflux-rust/tests/test_framework/test_framework.py", line 199, in main
    self.run_test()
  File "./tests/p2p_era_test.py", line 46, in run_test
    self.stop_node(chosen_peer)
  File "/Users/fanl/Workspace/conflux/conflux-rust/tests/test_framework/test_framework.py", line 367, in stop_node
    self.nodes[i].stop_node(expected_stderr, kill, wait)
  File "/Users/fanl/Workspace/conflux/conflux-rust/tests/test_framework/test_node.py", line 232, in stop_node
    stderr, expected_stderr, self.ip, self.port, self.index))
AssertionError: Unexpected stderr thread 'io_service' panicked at 'failed to join thread: Resource deadlock avoided (os error 11)', src/libstd/sys/unix/thread.rs:180:13
stack backtrace:
   0:        0x108c38f75 - backtrace::backtrace::libunwind::trace::hb16ec6045891ce5a
                               at /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1:        0x108c38f75 - backtrace::backtrace::trace_unsynchronized::hcacbd0efdffd74c6
                               at /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2:        0x108c38f75 - std::sys_common::backtrace::_print_fmt::h39e22de9d6757d12
                               at src/libstd/sys_common/backtrace.rs:77
   3:        0x108c38f75 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h415ddd0ba88caaaf
                               at src/libstd/sys_common/backtrace.rs:61
   4:        0x108c608b0 - core::fmt::ArgumentV1::show_usize::hca33af0aed7db5c5
   5:        0x108c31bfb - std::io::Write::write_fmt::he6837371b9a45188
                               at src/libstd/io/mod.rs:1412
   6:        0x108c3b2c3 - std::sys_common::backtrace::_print::h89459d14ba97f5fa
                               at src/libstd/sys_common/backtrace.rs:65
   7:        0x108c3b2c3 - std::sys_common::backtrace::print::ha4c6688e811b8829
                               at src/libstd/sys_common/backtrace.rs:50
   8:        0x108c3b2c3 - std::panicking::default_hook::{{closure}}::h708e66cfeb0483ba
                               at src/libstd/panicking.rs:188
   9:        0x108c3afca - std::panicking::default_hook::h39ea8ddf674c04ec
                               at src/libstd/panicking.rs:205
  10:        0x108c3b98b - std::panicking::rust_panic_with_hook::h9db77b22c2255a16
                               at src/libstd/panicking.rs:464
  11:        0x108c3b519 - std::panicking::continue_panic_fmt::h2dfa3a5b90265361
                               at src/libstd/panicking.rs:373
  12:        0x108ce61df - std::panicking::begin_panic_fmt::h90396a215538fa8f
                               at src/libstd/panicking.rs:328
  13:        0x108c444b0 - std::sys::unix::thread::Thread::join::hac21be441f6419ed
                               at src/libstd/sys/unix/thread.rs:180
  14:        0x1089106ba - std::thread::JoinInner<T>::join::hcd9a73a2135e8bd8
                               at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/thread/mod.rs:1324
  15:        0x1089106ba - std::thread::JoinHandle<T>::join::h1491e38ebdad9fd1
                               at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/thread/mod.rs:1457
  16:        0x1089103a1 - <io::worker::SocketWorker as core::ops::drop::Drop>::drop::h2356c64dfa15aafb
                               at util/io/src/worker.rs:121
  17:        0x1087971c4 - core::ptr::real_drop_in_place::h689d1bdda513af5b
  18:        0x108774cb5 - core::ptr::real_drop_in_place::hda9ac802461a14ee
  19:        0x1087713dc - io::service_mio::IoManager<Message>::start::hc5190eb589390067
  20:        0x1087a68b7 - std::sys_common::backtrace::__rust_begin_short_backtrace::hf722f3ffb79abe0d
  21:        0x108772afb - std::panicking::try::do_call::h36c7a05bc7b14608
  22:        0x108c44f4f - __rust_maybe_catch_panic
                               at src/libpanic_unwind/lib.rs:78
  23:        0x1087a86e6 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hc5268dd56e0b7ee7
  24:        0x108c2a4ee - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h49b3841a036d0711
                               at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/liballoc/boxed.rs:942
  25:        0x108c4439e - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h700a96d301634ce0
                               at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/liballoc/boxed.rs:942
  26:        0x108c4439e - std::sys_common::thread::start_thread::hf3cc4eddc63c33e0
                               at src/libstd/sys_common/thread.rs:13
  27:        0x108c4439e - std::sys::unix::thread::Thread::new::thread_start::hd9de55a9c1593989
                               at src/libstd/sys/unix/thread.rs:79
  28:     0x7fff6da94e65 - <unknown> !=  from 127.0.0.1:13688 index=3
@fanlong fanlong added bug Something isn't working P1 Important issue labels Feb 23, 2020
@Thegaram
Contributor

I've also seen this when running p2p_era_test last week but have been unable to reproduce ever since.

One of our io_service threads seems to panic on thread.join() here. Based on similar issues discussed online, this (and os error 11) might happen when we try to join a thread within itself, but I don't see how that could be the case.

Also, the error message printed by that line does not help, because the error e is an opaque Box<dyn Any + Send> (it prints Error joining IO service event loop thread: Any for me). I suggest changing it to this, at least for debugging:

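// join() returns Err(Box<dyn Any + Send>) carrying the panic payload of the joined thread.
thread.join().unwrap_or_else(|e| {
    // A panic with a literal message carries a &'static str; a formatted one carries a String.
    if let Some(e) = e.downcast_ref::<&'static str>() {
        println!("Error joining IO service event loop thread: {}", e);
    } else if let Some(e) = e.downcast_ref::<String>() {
        println!("Error joining IO service event loop thread: {}", e);
    } else {
        println!("Error joining IO service event loop thread: Unknown error: {:?}", e);
    }
});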
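
For anyone who wants to try the downcasting idea in isolation, here is a minimal standalone sketch. It is not code from conflux-rust: the helper names describe_panic and join_and_describe are made up for illustration. It only shows how the Box<dyn Any + Send> returned by JoinHandle::join can be turned into a readable message for the two common string payload types, with a fallback for anything else.

```rust
use std::any::Any;
use std::thread;

/// Turn a panic payload (as returned in the `Err` of `JoinHandle::join`)
/// into a readable message. Hypothetical helper, for illustration only.
fn describe_panic(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&'static str>() {
        // Payload of a panic raised with a plain string literal.
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        // Payload of a panic raised with a formatted message.
        s.clone()
    } else {
        // Any other payload type, e.g. one raised via `std::panic::panic_any`.
        "unknown (non-string) panic payload".to_string()
    }
}

/// Spawn a thread running `f`, join it, and return the panic message,
/// or `None` if the thread finished without panicking.
fn join_and_describe<F>(f: F) -> Option<String>
where
    F: FnOnce() + Send + 'static,
{
    match thread::spawn(f).join() {
        Ok(()) => None,
        Err(payload) => Some(describe_panic(&*payload)),
    }
}
```

Calling join_and_describe(|| panic!("code {}", 7)) returns the formatted message, while a non-string payload falls through to the generic branch, which is the case where the original logging only printed Any.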

@fanlong fanlong added P2 Low Priority Issue that awaits fix and removed P1 Important issue labels Feb 27, 2020
@fanlong
Contributor Author

fanlong commented Mar 1, 2020

@Thegaram I think your change is reasonable. Could you submit a PR to include this change?

@sparkmiw
Contributor

This might have been fixed by a later commit. Closing for now; we will reopen if it shows up again.

@Thegaram
Contributor

Just ran into this issue on commit 28229f1 (branch Thegaram/storage-root, branched off from commit 9776b9c on master).

The output is still Error joining IO service event loop thread: Unknown error: Any so it is not a panic, at least not one with a string message.

I think it might be possible that we're occasionally trying to join the event loop thread from within itself. That might happen if the last reference to Arc<IoService> is dropped in an event loop thread, which will call IoService::stop, which in turn will join itself. We could test this by printing the thread ids on launch. It might also be something else though, but I have no other ideas.
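
To make the self-join hypothesis concrete, here is a rough sketch of one possible guard. It is an assumption-laden toy, not the actual IoService code and not necessarily what the eventual fix does: Service, StopOutcome, stop_from_worker and stop_from_outside are hypothetical names. The idea is to compare the id of the thread behind the JoinHandle with the id of the calling thread, and to skip the join (which detaches the thread) when they match.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread::{self, JoinHandle};

#[derive(Debug, PartialEq)]
enum StopOutcome {
    Joined,
    SkippedSelfJoin,
    NotRunning,
}

/// Hypothetical stand-in for a service that owns its event loop thread.
struct Service {
    handle: Mutex<Option<JoinHandle<()>>>,
}

impl Service {
    fn stop(&self) -> StopOutcome {
        let handle = self.handle.lock().unwrap().take();
        let handle = match handle {
            Some(h) => h,
            None => return StopOutcome::NotRunning,
        };
        if handle.thread().id() == thread::current().id() {
            // We are running on the thread we were about to join.
            // Dropping the handle detaches the thread instead of joining it.
            return StopOutcome::SkippedSelfJoin;
        }
        handle.join().expect("worker thread panicked");
        StopOutcome::Joined
    }
}

/// Reproduces the suspected scenario: the worker thread itself ends up
/// holding the last reference to the service and calls `stop`.
fn stop_from_worker() -> StopOutcome {
    let service = Arc::new(Service { handle: Mutex::new(None) });
    let (svc_tx, svc_rx) = mpsc::channel::<Arc<Service>>();
    let (out_tx, out_rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        let svc = svc_rx.recv().unwrap();
        out_tx.send(svc.stop()).unwrap();
    });
    *service.handle.lock().unwrap() = Some(worker);
    svc_tx.send(service).unwrap();
    out_rx.recv().unwrap()
}

/// The normal case: another thread stops the service and joins the worker.
fn stop_from_outside() -> StopOutcome {
    let service = Service {
        handle: Mutex::new(Some(thread::spawn(|| {}))),
    };
    service.stop()
}
```

Logging both thread ids at the point of the comparison would also serve as the diagnostic suggested above, since a match confirms that stop was reached from the event loop thread.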

node-4-conflux-log.tar.gz
