Skip to content

Commit

Permalink
Ignore ipc_lib::kWakeupSignal by default
Browse files Browse the repository at this point in the history
This patch came out of trying to make `message_bridge_test` less
flaky. The test currently fails occasionally without an error message.
It just stops.

When run under `gdb`, we see the following:

    ...
    [New Thread 0x7fffeda496c0 (LWP 59)]
    I20230806 21:23:50.636907    59 message_bridge_test_lib.cc:28] Started pong
    [New Thread 0x7fffed3e06c0 (LWP 60)]
    I20230806 21:23:50.648432    60 message_bridge_test_lib.cc:28] Started ping
    [New Thread 0x7fffecd776c0 (LWP 61)]
    I20230806 21:23:50.658921    61 message_bridge_test_lib.cc:28] Started pi1_message_bridge_server
    I20230806 21:23:50.702045    61 shm_event_loop.cc:1026] Setting priority to 1
    [New Thread 0x7fffec70e6c0 (LWP 62)]
    I20230806 21:23:50.705500    62 message_bridge_test_lib.cc:28] Started pi1_message_bridge_client
    I20230806 21:23:50.718152    62 shm_event_loop.cc:1026] Setting priority to 1
    [New Thread 0x7fffec0a56c0 (LWP 63)]
    I20230806 21:23:50.724790    63 message_bridge_test_lib.cc:28] Started pi2_message_bridge_client
    I20230806 21:23:50.735529    63 shm_event_loop.cc:1026] Setting priority to 1
    [New Thread 0x7fffeba3c6c0 (LWP 64)]
    I20230806 21:23:50.739696    64 message_bridge_test_lib.cc:28] Started pi2_message_bridge_server
    I20230806 21:23:50.751094    64 shm_event_loop.cc:1026] Setting priority to 1

    Thread 3 "ping" received signal SIG36, Real-time event 36.
    [Switching to Thread 0x7fffed3e06c0 (LWP 60)]
    __GI___pthread_sigmask (how=1, newmask=<optimized out>,
        oldmask=0x7fffed3de2f8) at ./nptl/pthread_sigmask.c:43
    43      ./nptl/pthread_sigmask.c: No such file or directory.
    #0  __GI___pthread_sigmask (how=1, newmask=<optimized out>,
        oldmask=0x7fffed3de2f8) at ./nptl/pthread_sigmask.c:43
    #1  0x00007ffff530c503 in aos::ipc_lib::(anonymous namespace)::wrapped_pthread_sigmask (how=1, set=0xb2cf38, oldset=0x7fffed3de2f8)
        at aos/ipc_lib/signalfd.cc:65
    #2  0x00007ffff530c558 in aos::ipc_lib::SignalFd::~SignalFd (this=0xb2cf30)
        at aos/ipc_lib/signalfd.cc:113
    #3  0x00007ffff58b2c9b in std::default_delete<aos::ipc_lib::SignalFd>::operator() (this=0x7fffffffd4e0, __ptr=0xb2cf30)
        at external/amd64_debian_sysroot//usr/include/c++/10/bits/unique_ptr.h:85
    #4  0x00007ffff58b473c in std::__uniq_ptr_impl<aos::ipc_lib::SignalFd, std::default_delete<aos::ipc_lib::SignalFd> >::reset (this=0x7fffffffd4e0, __p=0x0)
        at external/amd64_debian_sysroot//usr/include/c++/10/bits/unique_ptr.h:182
    #5  0x00007ffff58a535d in std::unique_ptr<aos::ipc_lib::SignalFd, std::default_delete<aos::ipc_lib::SignalFd> >::reset (this=0x7fffffffd4e0, __p=0x0)
        at external/amd64_debian_sysroot//usr/include/c++/10/bits/unique_ptr.h:456
    #6  0x00007ffff58a213e in aos::ShmEventLoop::Run (this=0x7fffffffd200)
        at aos/events/shm_event_loop.cc:1063
    #7  0x00007ffff7eb3980 in aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0::operator()() const (
        this=0xaef998) at aos/network/message_bridge_test_lib.cc:30
    #8  0x00007ffff7eb382d in std::__invoke_impl<void, aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0>(std::__invoke_other, aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0&&) (__f=...)
        at external/amd64_debian_sysroot//usr/include/c++/10/bits/invoke.h:60
    #9  0x00007ffff7eb37bd in std::__invoke<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0>(aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0&&) (__fn=...)
        at external/amd64_debian_sysroot//usr/include/c++/10/bits/invoke.h:95
    #10 0x00007ffff7eb3795 in std::thread::_Invoker<std::tuple<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0xaef998)
        at external/amd64_debian_sysroot//usr/include/c++/10/thread:264
    #11 0x00007ffff7eb3765 in std::thread::_Invoker<std::tuple<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0> >::operator()() (this=0xaef998)
        at external/amd64_debian_sysroot//usr/include/c++/10/thread:271
    #12 0x00007ffff7eb36c9 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0> > >::_M_run() (this=0xaef990)
        at external/amd64_debian_sysroot//usr/include/c++/10/thread:215
    #13 0x00007ffff7cd44a3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
    #14 0x00007ffff48a8044 in start_thread (arg=<optimized out>)
        at ./nptl/pthread_create.c:442
    #15 0x00007ffff49285fc in clone3 ()
        at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Sometimes, as soon as the `ping` thread starts exiting, a wakeup
signal arrives. That causes the default signal handler to be called.
Which for SIGRT2 is to terminate the program.

For most applications that's probably fine in the grand scheme of
things, but it's a bit annoying for tests. Especially for
`message_bridge_test` which has multiple threads with watchers.

I am not 100% convinced that the issue is happening because of a race
condition in `lockless_queue.cc`, but it seems the most likely
candidate. Quoted from the `LocklessQueueWakeUpper::Wakeup` function:

    // Grab a copy so it won't change out from underneath us, and we can sort it
    // nicely in C++.
    // Do note that there is still a window where the process can die *after* we
    // read everything.  We will still PI boost and send a signal to the thread in
    // question.  There is no way without pidfd's to close this window, and
    // creating a pidfd is likely not RT.

The way I'm reading this is that there's a window between the sender
grabbing a copy of the watchers and the watcher unblocking the signal.
That window allows for the receiver to be killed before it's finished
tearing down.

Rough sequence of events:

1. The sender reads the list of watchers from SHMEM.
2. The reader shuts down and unblocks the wakeup signal.
3. The sender sends the wakeup signal to all watchers.
4. The reader invokes the default signal handler for the wakeup
   signal. By default this causes the process to exit uncleanly
   immediately.

Before this patch, I saw 20-30 tests out of 1000 fail due to this
error. After this patch, there are no failures of this kind.

Change-Id: I12a18e2977b50b9cb33334373e131ae9774e7579
Signed-off-by: Philipp Schrader <[email protected]>
  • Loading branch information
philsc committed Aug 17, 2023
1 parent 0925766 commit a873466
Showing 1 changed file with 13 additions and 0 deletions.
13 changes: 13 additions & 0 deletions aos/events/shm_event_loop.cc
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,14 @@ const Node *MaybeMyNode(const Configuration *configuration) {
return configuration::GetMyNode(configuration);
}

void IgnoreWakeupSignal() {
struct sigaction action;
action.sa_handler = SIG_IGN;
PCHECK(sigemptyset(&action.sa_mask) == 0);
action.sa_flags = 0;
PCHECK(sigaction(ipc_lib::kWakeupSignal, &action, nullptr) == 0);
}

} // namespace

ShmEventLoop::ShmEventLoop(const Configuration *configuration)
Expand All @@ -81,6 +89,11 @@ ShmEventLoop::ShmEventLoop(const Configuration *configuration)
shm_base_(FLAGS_shm_base),
name_(FLAGS_application_name),
node_(MaybeMyNode(configuration)) {
// Ignore the wakeup signal by default. Otherwise, we have race conditions on
// shutdown where a wakeup signal will uncleanly terminate the process.
// See LocklessQueueWakeUpper::Wakeup() for some more information.
IgnoreWakeupSignal();

CHECK(IsInitialized()) << ": Need to initialize AOS first.";
ClearContext();
if (configuration->has_nodes()) {
Expand Down

0 comments on commit a873466

Please sign in to comment.