Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: unsoundness of rmw_shutdown #170

Closed
YuanYuYuan opened this issue May 8, 2024 · 12 comments
Closed

bug: unsoundness of rmw_shutdown #170

YuanYuYuan opened this issue May 8, 2024 · 12 comments
Assignees
Labels
bug Something isn't working

Comments

@YuanYuYuan
Copy link
Contributor

YuanYuYuan commented May 8, 2024

We notice that ApexAI performance test with rmw_zenoh keep failing with the error.

thread '<unnamed>' panicked at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/local.rs:262:26:
cannot access a Thread Local Storage value during or after destruction: AccessError
stack backtrace:
   0:     0x7e09df3dfab6 - std::backtrace_rs::backtrace::libunwind::trace::h92ba5ac126ecc5ad
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:     0x7e09df3dfab6 - std::backtrace_rs::backtrace::trace_unsynchronized::hfaa118b6e3a8f31a
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7e09df3dfab6 - std::sys_common::backtrace::_print_fmt::h2075f185adf13bbd
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x7e09df3dfab6 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h9c4bd387f9f3f544
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7e09df40ab20 - core::fmt::rt::Argument::fmt::h018ecebdf268b5bb
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/fmt/rt.rs:142:9
   5:     0x7e09df40ab20 - core::fmt::write::h938c332fdab924eb
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/fmt/mod.rs:1120:17
   6:     0x7e09df3dcc5f - std::io::Write::write_fmt::h4a694b02e44e6363
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/io/mod.rs:1846:15
   7:     0x7e09df3df894 - std::sys_common::backtrace::_print::h3595c0f4303f3ccd
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x7e09df3df894 - std::sys_common::backtrace::print::ha888e6736b0bc71f
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x7e09df3e1117 - std::panicking::default_hook::{{closure}}::he19a7f79f7beab5e
  10:     0x7e09df3e0e79 - std::panicking::default_hook::h67efe04e9a5d446e
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:292:9
  11:     0x7e09df3e15a8 - std::panicking::rust_panic_with_hook::h49021cdbc4b22349
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:781:13
  12:     0x7e09df3e1482 - std::panicking::begin_panic_handler::{{closure}}::hfbf601f3d8c62d13
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:659:13
  13:     0x7e09df3dffb6 - std::sys_common::backtrace::__rust_end_short_backtrace::h98dd020b6e913806
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
  14:     0x7e09df3e11d4 - rust_begin_unwind
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
  15:     0x7e09de3978e5 - core::panicking::panic_fmt::h0d3f1893e38be419
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
  16:     0x7e09de397e33 - core::result::unwrap_failed::h1356562ef8bc898c
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1649:5
  17:     0x7e09df3826d9 - core::result::Result<T,E>::expect::h6b8b29ca8643e857
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1030:23
  18:     0x7e09df343d8b - std::thread::local::LocalKey<T>::with::h1752a920e6b97eb1
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/local.rs:262:9
  19:     0x7e09df364e6e - tokio::runtime::context::runtime_mt::current_enter_context::haf9beb3610bddfe5
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context/runtime_mt.rs:5:5
  20:     0x7e09de6c3f24 - tokio::runtime::scheduler::multi_thread::worker::block_in_place::{{closure}}::h9c06599e5c088372
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:355:13
  21:     0x7e09de6c31aa - tokio::runtime::scheduler::multi_thread::worker::with_current::{{closure}}::hc2aa2ec0b046bb24
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:1223:14
  22:     0x7e09de786485 - tokio::runtime::context::with_scheduler::{{closure}}::h4358f2c1fed3380d
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:183:33
  23:     0x7e09de7ada3f - core::result::Result<T,E>::unwrap_or_else::hf45d4956e923594f
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1426:23
  24:     0x7e09de7861af - tokio::runtime::context::with_scheduler::h60fbeb43725fdaf0
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:182:9
  25:     0x7e09de6c2f9b - tokio::runtime::scheduler::multi_thread::worker::with_current::he3f8c8d1928a9152
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:1221:5
  26:     0x7e09de6c3541 - tokio::runtime::scheduler::multi_thread::worker::block_in_place::h812065477b42e82f
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:353:24
  27:     0x7e09de786afa - tokio::runtime::scheduler::block_in_place::block_in_place::he4fdf8f5613ebc2b
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/block_in_place.rs:20:5
  28:     0x7e09de5b9f8a - tokio::task::blocking::block_in_place::hd8175eae66af42f5
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/blocking.rs:78:9
  29:     0x7e09de7a6764 - zenoh_runtime::ZRuntime::block_in_place::hc7706bb648ae8ea0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-runtime/src/lib.rs:130:9
  30:     0x7e09de83f367 - <zenoh_core::ResolveFuture<F,To> as zenoh_core::SyncResolve>::res_sync::hc9409ed43c1f44e0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-core/src/lib.rs:166:9
  31:     0x7e09de45d0ea - zenoh_core::SyncResolve::res::hb45a81dafd5ac98f
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-core/src/lib.rs:57:9
  32:     0x7e09de46c518 - z_close
                               at /home/zstw/Workspace/benchmark_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/session.rs:144:11
  33:     0x7e09df9d2d5b - rmw_shutdown
  34:     0x7e09e15c2104 - rcl_shutdown
  35:     0x7e09e12fa41e - _ZN6rclcpp7Context8shutdownERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
  36:     0x7e09e12fac91 - _ZN6rclcpp7ContextD1Ev
  37:     0x7e09e12f3bea - <unknown>
  38:     0x7e09e0a45495 - __run_exit_handlers
                               at ./stdlib/exit.c:113:8
  39:     0x7e09e0a45610 - __GI_exit
                               at ./stdlib/exit.c:143:3
  40:     0x7e09e0a29d97 - __libc_start_call_main
                               at ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
  41:     0x7e09e0a29e40 - __libc_start_main_impl
                               at ./csu/../csu/libc-start.c:392:3
  42:     0x584c927db915 - _start
  43:                0x0 - <unknown>
thread '<unnamed>' panicked at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/local.rs:262:26:
cannot access a Thread Local Storage value during or after destruction: AccessError
stack backtrace:
   0:     0x7e09df3dfab6 - std::backtrace_rs::backtrace::libunwind::trace::h92ba5ac126ecc5ad
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:     0x7e09df3dfab6 - std::backtrace_rs::backtrace::trace_unsynchronized::hfaa118b6e3a8f31a
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7e09df3dfab6 - std::sys_common::backtrace::_print_fmt::h2075f185adf13bbd
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x7e09df3dfab6 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h9c4bd387f9f3f544
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7e09df40ab20 - core::fmt::rt::Argument::fmt::h018ecebdf268b5bb
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/fmt/rt.rs:142:9
   5:     0x7e09df40ab20 - core::fmt::write::h938c332fdab924eb
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/fmt/mod.rs:1120:17
   6:     0x7e09df3dcc5f - std::io::Write::write_fmt::h4a694b02e44e6363
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/io/mod.rs:1846:15
   7:     0x7e09df3df894 - std::sys_common::backtrace::_print::h3595c0f4303f3ccd
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x7e09df3df894 - std::sys_common::backtrace::print::ha888e6736b0bc71f
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x7e09df3e1117 - std::panicking::default_hook::{{closure}}::he19a7f79f7beab5e
  10:     0x7e09df3e0e79 - std::panicking::default_hook::h67efe04e9a5d446e
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:292:9
  11:     0x7e09df3e15a8 - std::panicking::rust_panic_with_hook::h49021cdbc4b22349
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:781:13
  12:     0x7e09df3e1482 - std::panicking::begin_panic_handler::{{closure}}::hfbf601f3d8c62d13
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:659:13
  13:     0x7e09df3dffb6 - std::sys_common::backtrace::__rust_end_short_backtrace::h98dd020b6e913806
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
  14:     0x7e09df3e11d4 - rust_begin_unwind
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
  15:     0x7e09de3978e5 - core::panicking::panic_fmt::h0d3f1893e38be419
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
  16:     0x7e09de397e33 - core::result::unwrap_failed::h1356562ef8bc898c
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1649:5
  17:     0x7e09df3826d9 - core::result::Result<T,E>::expect::h6b8b29ca8643e857
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1030:23
  18:     0x7e09df343d8b - std::thread::local::LocalKey<T>::with::h1752a920e6b97eb1
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/local.rs:262:9
  19:     0x7e09df364e6e - tokio::runtime::context::runtime_mt::current_enter_context::haf9beb3610bddfe5
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context/runtime_mt.rs:5:5
  20:     0x7e09de6c3f24 - tokio::runtime::scheduler::multi_thread::worker::block_in_place::{{closure}}::h9c06599e5c088372
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:355:13
  21:     0x7e09de6c31aa - tokio::runtime::scheduler::multi_thread::worker::with_current::{{closure}}::hc2aa2ec0b046bb24
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:1223:14
  22:     0x7e09de786485 - tokio::runtime::context::with_scheduler::{{closure}}::h4358f2c1fed3380d
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:183:33
  23:     0x7e09de7ada3f - core::result::Result<T,E>::unwrap_or_else::hf45d4956e923594f
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1426:23
  24:     0x7e09de7861af - tokio::runtime::context::with_scheduler::h60fbeb43725fdaf0
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:182:9
  25:     0x7e09de6c2f9b - tokio::runtime::scheduler::multi_thread::worker::with_current::he3f8c8d1928a9152
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:1221:5
  26:     0x7e09de6c3541 - tokio::runtime::scheduler::multi_thread::worker::block_in_place::h812065477b42e82f
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:353:24
  27:     0x7e09de786afa - tokio::runtime::scheduler::block_in_place::block_in_place::he4fdf8f5613ebc2b
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/block_in_place.rs:20:5
  28:     0x7e09de5b9f8a - tokio::task::blocking::block_in_place::hd8175eae66af42f5
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/blocking.rs:78:9
  29:     0x7e09de7a6764 - zenoh_runtime::ZRuntime::block_in_place::hc7706bb648ae8ea0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-runtime/src/lib.rs:130:9
  30:     0x7e09de83f367 - <zenoh_core::ResolveFuture<F,To> as zenoh_core::SyncResolve>::res_sync::hc9409ed43c1f44e0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-core/src/lib.rs:166:9
  31:     0x7e09de7e6e83 - <zenoh::session::Session as core::ops::drop::Drop>::drop::h19c93fde394c5c5e
                               at /home/zstw/Workspace/src/zenoh/zenoh/src/session.rs:2529:21
  32:     0x7e09de830457 - core::ptr::drop_in_place<zenoh::session::Session>::hdc7d9bce22baadd0
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/ptr/mod.rs:507:1
  33:     0x7e09de83900b - core::ptr::drop_in_place<zenoh::session::Session::close::{{closure}}>::h764a6fc98a47f868
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/ptr/mod.rs:525:28
  34:     0x7e09de82dcaf - core::ptr::drop_in_place<zenoh_runtime::ZRuntime::block_in_place<zenoh::session::Session::close::{{closure}},core::result::Result<(),alloc::boxed::Box<dyn core::error::Error+core::marker::Send+core::marker::Sync>>>::{{closure}}>::h229fbbc24e853948
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/ptr/mod.rs:507:1
  35:     0x7e09de6c374d - tokio::runtime::scheduler::multi_thread::worker::block_in_place::h812065477b42e82f
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:442:1
  36:     0x7e09de786afa - tokio::runtime::scheduler::block_in_place::block_in_place::he4fdf8f5613ebc2b
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/block_in_place.rs:20:5
  37:     0x7e09de5b9f8a - tokio::task::blocking::block_in_place::hd8175eae66af42f5
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/blocking.rs:78:9
  38:     0x7e09de7a6764 - zenoh_runtime::ZRuntime::block_in_place::hc7706bb648ae8ea0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-runtime/src/lib.rs:130:9
  39:     0x7e09de83f367 - <zenoh_core::ResolveFuture<F,To> as zenoh_core::SyncResolve>::res_sync::hc9409ed43c1f44e0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-core/src/lib.rs:166:9
  40:     0x7e09de45d0ea - zenoh_core::SyncResolve::res::hb45a81dafd5ac98f
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-core/src/lib.rs:57:9
  41:     0x7e09de46c518 - z_close
                               at /home/zstw/Workspace/benchmark_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/session.rs:144:11
  42:     0x7e09df9d2d5b - rmw_shutdown
  43:     0x7e09e15c2104 - rcl_shutdown
  44:     0x7e09e12fa41e - _ZN6rclcpp7Context8shutdownERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
  45:     0x7e09e12fac91 - _ZN6rclcpp7ContextD1Ev
  46:     0x7e09e12f3bea - <unknown>
  47:     0x7e09e0a45495 - __run_exit_handlers
                               at ./stdlib/exit.c:113:8
  48:     0x7e09e0a45610 - __GI_exit
                               at ./stdlib/exit.c:143:3
  49:     0x7e09e0a29d97 - __libc_start_call_main
                               at ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
  50:     0x7e09e0a29e40 - __libc_start_main_impl
                               at ./csu/../csu/libc-start.c:392:3
  51:     0x584c927db915 - _start
  52:                0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:163:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

This happens since we call z_close within __run_exit_handlers. We had concluded that termination of tokio runtime within the atexit handler is unsound
and leads to undefined behavior. See this and this. Basically, the termination during exit stage is beyond what tokio runtime covering. Rust is designed not to drop any static variable until the end of the program. Calling destruction in exit handlers is also error-prone even in C/C++.

On the other hand, rmw_cyclonedds seems not to use any explicit termination during the exit stage. https://github.com/ros2/rmw_cyclonedds/blob/c6dbe24b2f2be87cf8e4750d89501657ab83566f/rmw_cyclonedds_cpp/src/rmw_node.cpp#L1629. I've ran the same test with the following patch. The error disappears and the rmw_zenoh seems fine.

diff --git a/rmw_zenoh_cpp/src/rmw_init.cpp b/rmw_zenoh_cpp/src/rmw_init.cpp
index edd9545..90ef360 100644
--- a/rmw_zenoh_cpp/src/rmw_init.cpp
+++ b/rmw_zenoh_cpp/src/rmw_init.cpp
@@ -371,11 +371,12 @@ rmw_shutdown(rmw_context_t * context)
   if (context->impl->shm_manager.has_value()) {
     z_drop(z_move(context->impl->shm_manager.value()));
   }
-  // Close the zenoh session
-  if (z_close(z_move(context->impl->session)) < 0) {
-    RMW_SET_ERROR_MSG("Error while closing zenoh session");
-    return RMW_RET_ERROR;
-  }
+
+  // // Close the zenoh session
+  // if (z_close(z_move(context->impl->session)) < 0) {
+  //   RMW_SET_ERROR_MSG("Error while closing zenoh session");
+  //   return RMW_RET_ERROR;
+  // }

   context->impl->is_shutdown = true;

In conclusion, we have a few solutions to it,

  1. Apply the patch above.
  2. Using a "Guard" to terminate the zenoh sessions properly. An example looks like this.
  3. Terminate the rmw_zenoh proactively instead of within the exit phase.

TBH, I'm not sure if 2. and 3. are feasible in ROS RMW. Any comment is welcome! 🙂

@Yadunund
Copy link
Member

Yadunund commented May 8, 2024

@YuanYuYuan could you provide steps to reproduce the error?

With the patch above, are you suggesting that it is ok to not close the zenoh session? Wouldn't that lead to memory leaks?

@clalancette should we try moving this block back into rmw_context_fini?

@clalancette
Copy link
Collaborator

@clalancette should we try moving this block back into rmw_context_fini?

We need to take a close look before making changes here again. I've moved it twice already for 2 different bugs, so we need to go back, look at the contexts these are being called in (for both rclcpp and rclpy), and see what the previous bugs were.

@YuanYuYuan
Copy link
Contributor Author

@YuanYuYuan could you provide steps to reproduce the error?

With the patch above, are you suggesting that it is ok to not close the zenoh session? Wouldn't that lead to memory leaks?

@clalancette should we try moving this block back into rmw_context_fini?

Please follow the scripts written by @evshary.

@evshary
Copy link
Contributor

evshary commented May 9, 2024

The issue disappeared after the patch. If you want to use the scripts, please note that you need to update the repository here. The issue occurs no matter running intra-process or inter-process test with rmw_zenoh.

@takasehideki
Copy link
Contributor

Same here. I'm trying to operate ApexAI/performance_test, and encounted the same issue. But I'd like to report that applying the patch suggested in this post is working fine in my environment.

Question just in case: I cannot access evshary/benchmark_rmw_zenoh that was mentioned in this thread. I'm wondering the method (scripts?) of dealing with this issue is the same as suggested in this post?

@YuanYuYuan
Copy link
Contributor Author

Same here. I'm trying to operate ApexAI/performance_test, and encounted the same issue. But I'd like to report that applying the patch suggested in this post is working fine in my environment.

Question just in case: I cannot access evshary/benchmark_rmw_zenoh that was mentioned in this thread. I'm wondering the method (scripts?) of dealing with this issue is the same as suggested in this post?

Hi @takasehideki! The script has been moved here. And the fix is still the same.

@Yadunund
Copy link
Member

@YuanYuYuan i'm looking into this issue more closely now and could benefit from your insights.

Summarizing some basic information first:

  • With the current implementation in rolling branch, we call z_close() when rmw_shutdown() is called.
  • With ROS 2 nodes written with rclcpp, rmw_shutdown() is invoked within rcl_shutdown() and rcl_shutdown is called within Context::shutdown().
  • Context::shutdown() is called in three scenarios
    1. Within deferred_signal_handler which is a function that executes upon receiving a SIGINT/SIGTERM.
    2. Within rclcpp::shutdown() which the user should call before their program terminates.
    3. Within ~Context() in case the user forgot to do 2).

Above you mentioned

This happens since we call z_close within __run_exit_handlers. We had concluded that termination of tokio runtime within the atexit handler is unsound
and leads to undefined behavior. See eclipse-zenoh/zenoh#973 and eclipse-zenoh/zenoh#1004 (comment). Basically, the termination during exit stage is beyond what tokio runtime covering. Rust is designed not to drop any static variable until the end of the program. Calling destruction in exit handlers is also error-prone even in C/C++.

This would imply that anytime I terminate a ROS executable (written with rclcpp), I should expect to see the above panic/backtrace since we will close the session as part of i that I described. But this does not seem to be the case. I can run ros2 run demo_nodes_cpp talker, hit CTRL+C and I don't see the panic. So is it really related to exit handlers?

Or could it be that we see this unsoundness when iii occurs? ie, the user does not call rclcpp::shutdown and the ~Context() destructor is the one closing the Zenoh session? This destructor will be called when the program terminates at which point the Thread Local Storage has been cleared?

I did a quick search for rclcpp::shutdown() in the Apex repo and couldn't find any references 🤔

@YuanYuYuan
Copy link
Contributor Author

Hi @Yadunund, thanks for spending time on this issue. I think the panic depends on some race conditions I'm still investigating. Here is one that can steadily reproduce. eclipse-zenoh/zenoh-cpp#198.

@takasehideki
Copy link
Contributor

Hi @YuanYuYuan

Same here. I'm trying to operate ApexAI/performance_test, and encounted the same issue. But I'd like to report that applying the patch suggested in this post is working fine in my environment.
Question just in case: I cannot access evshary/benchmark_rmw_zenoh that was mentioned in this thread. I'm wondering the method (scripts?) of dealing with this issue is the same as suggested in this post?

Hi @takasehideki! The script has been moved here. And the fix is still the same.

Thank you so much for pointing this out. But unfortunately I cannot still access this link (maybe due to the repository is set to private). But it is not a problem for me because I can now catch up that the fix is the same.

@YuanYuYuan
Copy link
Contributor Author

Thank you so much for pointing this out. But unfortunately I cannot still access this link (maybe due to the repository is set to private). But it is not a problem for me because I can now catch up that the fix is the same.

My bad! I'm not even aware that the repo has been closed 😅. Yup, the fix should be the same.

@Yadunund
Copy link
Member

Hi @Yadunund, thanks for spending time on this issue. I think the panic depends on some race conditions I'm still investigating. Here is one that can steadily reproduce. eclipse-zenoh/zenoh-cpp#198.

@YuanYuYuan I'm not sure if this is a race condition. I have a much simpler reproducible scenario that confirms my hypothesis above. See https://github.com/Yadunund/rmw_zenoh_shutdown_test.

Basically, if we let the rclcpp::Context's destructor to call Context::shutdown() (which in turn calls rmw_shutdown) when the program terminates, tokio will panic because the Thread local storage gets cleared before the destructor executes. Perhaps due to ordering of static initialization.

Therefore we need to ensure that the Context is shutdown before termination. The signal handler thread does this and so will explicitly calling rclcpp::shutdown() in the program.

This also explains why we see the panic all the time if we close the Zenoh session in rmw_context_fini instead of rmw_shutdown. The rmw_context_fini is only called when the custom deleter for the rcl_context executes which happens always be within ~Context() as seen here.

Therefore until the tokio behavior is fixed upstream, I propose the following workarounds

  1. Add a "Known Issues" section to the README of rmw_zenoh_cpp informing users that they will need to explicitly call rclcpp::shutdown() in their programs.
  2. Updating the Apex test and other tests in ros2 to ensure rclcpp::shutdown() is invoked. Some of the tests now implicitly rely on ~Context() to shutdown the context.

cc: @JEnoch @clalancette @ahcorde

@Yadunund
Copy link
Member

Yadunund commented Sep 20, 2024

With ros2/rclcpp#2633, the shutdown issue has disappeared for all tests in rclcpp. We should do the same for the Apex repo. @takasehideki would you be able to apply similar changes to the Apex repo and see if that resolves the issue there as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants