-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid double-sending on replayed channels #14
Comments
frc971-automation
pushed a commit
that referenced
this issue
Aug 17, 2023
This patch came out of trying to make `message_bridge_test` less flaky. The test currently fails occasionally without an error message. It just stops. When run under `gdb`, we see the following: ... [New Thread 0x7fffeda496c0 (LWP 59)] I20230806 21:23:50.636907 59 message_bridge_test_lib.cc:28] Started pong [New Thread 0x7fffed3e06c0 (LWP 60)] I20230806 21:23:50.648432 60 message_bridge_test_lib.cc:28] Started ping [New Thread 0x7fffecd776c0 (LWP 61)] I20230806 21:23:50.658921 61 message_bridge_test_lib.cc:28] Started pi1_message_bridge_server I20230806 21:23:50.702045 61 shm_event_loop.cc:1026] Setting priority to 1 [New Thread 0x7fffec70e6c0 (LWP 62)] I20230806 21:23:50.705500 62 message_bridge_test_lib.cc:28] Started pi1_message_bridge_client I20230806 21:23:50.718152 62 shm_event_loop.cc:1026] Setting priority to 1 [New Thread 0x7fffec0a56c0 (LWP 63)] I20230806 21:23:50.724790 63 message_bridge_test_lib.cc:28] Started pi2_message_bridge_client I20230806 21:23:50.735529 63 shm_event_loop.cc:1026] Setting priority to 1 [New Thread 0x7fffeba3c6c0 (LWP 64)] I20230806 21:23:50.739696 64 message_bridge_test_lib.cc:28] Started pi2_message_bridge_server I20230806 21:23:50.751094 64 shm_event_loop.cc:1026] Setting priority to 1 Thread 3 "ping" received signal SIG36, Real-time event 36. [Switching to Thread 0x7fffed3e06c0 (LWP 60)] __GI___pthread_sigmask (how=1, newmask=<optimized out>, oldmask=0x7fffed3de2f8) at ./nptl/pthread_sigmask.c:43 43 ./nptl/pthread_sigmask.c: No such file or directory. #0 __GI___pthread_sigmask (how=1, newmask=<optimized out>, oldmask=0x7fffed3de2f8) at ./nptl/pthread_sigmask.c:43 #1 0x00007ffff530c503 in aos::ipc_lib::(anonymous namespace)::wrapped_pthread_sigmask (how=1, set=0xb2cf38, oldset=0x7fffed3de2f8) at aos/ipc_lib/signalfd.cc:65 #2 0x00007ffff530c558 in aos::ipc_lib::SignalFd::~SignalFd (this=0xb2cf30) at aos/ipc_lib/signalfd.cc:113 #3 0x00007ffff58b2c9b in std::default_delete<aos::ipc_lib::SignalFd>::operator() (this=0x7fffffffd4e0, __ptr=0xb2cf30) at external/amd64_debian_sysroot//usr/include/c++/10/bits/unique_ptr.h:85 #4 0x00007ffff58b473c in std::__uniq_ptr_impl<aos::ipc_lib::SignalFd, std::default_delete<aos::ipc_lib::SignalFd> >::reset (this=0x7fffffffd4e0, __p=0x0) at external/amd64_debian_sysroot//usr/include/c++/10/bits/unique_ptr.h:182 #5 0x00007ffff58a535d in std::unique_ptr<aos::ipc_lib::SignalFd, std::default_delete<aos::ipc_lib::SignalFd> >::reset (this=0x7fffffffd4e0, __p=0x0) at external/amd64_debian_sysroot//usr/include/c++/10/bits/unique_ptr.h:456 #6 0x00007ffff58a213e in aos::ShmEventLoop::Run (this=0x7fffffffd200) at aos/events/shm_event_loop.cc:1063 #7 0x00007ffff7eb3980 in aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0::operator()() const ( this=0xaef998) at aos/network/message_bridge_test_lib.cc:30 #8 0x00007ffff7eb382d in std::__invoke_impl<void, aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0>(std::__invoke_other, aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0&&) (__f=...) at external/amd64_debian_sysroot//usr/include/c++/10/bits/invoke.h:60 #9 0x00007ffff7eb37bd in std::__invoke<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0>(aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0&&) (__fn=...) at external/amd64_debian_sysroot//usr/include/c++/10/bits/invoke.h:95 #10 0x00007ffff7eb3795 in std::thread::_Invoker<std::tuple<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0xaef998) at external/amd64_debian_sysroot//usr/include/c++/10/thread:264 #11 0x00007ffff7eb3765 in std::thread::_Invoker<std::tuple<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0> >::operator()() (this=0xaef998) at external/amd64_debian_sysroot//usr/include/c++/10/thread:271 #12 0x00007ffff7eb36c9 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0> > >::_M_run() (this=0xaef990) at external/amd64_debian_sysroot//usr/include/c++/10/thread:215 #13 0x00007ffff7cd44a3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6 #14 0x00007ffff48a8044 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #15 0x00007ffff49285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 Sometimes, as soon as the `ping` thread starts exiting, a wakeup signal arrives. That causes the default signal handler to be called. Which for SIGRT2 is to terminate the program. For most applications that's probably fine in the grand scheme of things, but it's a bit annoying for tests. Especially for `message_bridge_test` which has multiple threads with watchers. I am not 100% convinced that the issue is happening because of a race condition in `lockless_queue.cc`, but it seems the most likely candidate. Quoted from the `LocklessQueueWakeUpper::Wakeup` function: // Grab a copy so it won't change out from underneath us, and we can sort it // nicely in C++. // Do note that there is still a window where the process can die *after* we // read everything. We will still PI boost and send a signal to the thread in // question. There is no way without pidfd's to close this window, and // creating a pidfd is likely not RT. The way I'm reading this is that there's a window between the sender grabbing a copy of the watchers and the watcher unblocking the signal. That window allows for the receiver to be killed before it's finished tearing down. Rough sequence of events: 1. The sender reads the list of watchers from SHMEM. 2. The reader shuts down and unblocks the wakeup signal. 3. The sender sends the wakeup signal to all watchers. 4. The reader invokes the default signal handler for the wakeup signal. By default this causes the process to exit uncleanly immediately. Before this patch, I saw 20-30 tests out of 1000 fail due to this error. After this patch, there are no failures of this kind. Change-Id: I12a18e2977b50b9cb33334373e131ae9774e7579 Signed-off-by: Philipp Schrader <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add a check to logger to ensure that applications in log replay aren't sending on channels that are also getting replayed (i.e., all the relevant channels from the log are remapped).
Should pretty much just be a matter of doing this TODO:
971-Robot-Code/aos/events/logging/log_reader.h
Lines 409 to 413 in 890c249
and then seeing what blows up.
The text was updated successfully, but these errors were encountered: