Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid double-sending on replayed channels #14

Open
jkuszmaul opened this issue May 6, 2022 · 0 comments
Open

Avoid double-sending on replayed channels #14

jkuszmaul opened this issue May 6, 2022 · 0 comments

Comments

@jkuszmaul
Copy link
Contributor

Add a check to logger to ensure that applications in log replay aren't sending on channels that are also getting replayed (i.e., all the relevant channels from the log are remapped).

Should pretty much just be a matter of doing this TODO:

// TODO(james): Enable exclusive senders on LogReader to allow us to
// ensure we are remapping channels correctly.
event_loop_unique_ptr_ = node_event_loop_factory_->MakeEventLoop(
"log_reader", {NodeEventLoopFactory::CheckSentTooFast::kNo,
NodeEventLoopFactory::ExclusiveSenders::kNo});

and then seeing what blows up.

frc971-automation pushed a commit that referenced this issue Aug 17, 2023
This patch came out of trying to make `message_bridge_test` less
flaky. The test currently fails occasionally without an error message.
It just stops.

When run under `gdb`, we see the following:

    ...
    [New Thread 0x7fffeda496c0 (LWP 59)]
    I20230806 21:23:50.636907    59 message_bridge_test_lib.cc:28] Started pong
    [New Thread 0x7fffed3e06c0 (LWP 60)]
    I20230806 21:23:50.648432    60 message_bridge_test_lib.cc:28] Started ping
    [New Thread 0x7fffecd776c0 (LWP 61)]
    I20230806 21:23:50.658921    61 message_bridge_test_lib.cc:28] Started pi1_message_bridge_server
    I20230806 21:23:50.702045    61 shm_event_loop.cc:1026] Setting priority to 1
    [New Thread 0x7fffec70e6c0 (LWP 62)]
    I20230806 21:23:50.705500    62 message_bridge_test_lib.cc:28] Started pi1_message_bridge_client
    I20230806 21:23:50.718152    62 shm_event_loop.cc:1026] Setting priority to 1
    [New Thread 0x7fffec0a56c0 (LWP 63)]
    I20230806 21:23:50.724790    63 message_bridge_test_lib.cc:28] Started pi2_message_bridge_client
    I20230806 21:23:50.735529    63 shm_event_loop.cc:1026] Setting priority to 1
    [New Thread 0x7fffeba3c6c0 (LWP 64)]
    I20230806 21:23:50.739696    64 message_bridge_test_lib.cc:28] Started pi2_message_bridge_server
    I20230806 21:23:50.751094    64 shm_event_loop.cc:1026] Setting priority to 1

    Thread 3 "ping" received signal SIG36, Real-time event 36.
    [Switching to Thread 0x7fffed3e06c0 (LWP 60)]
    __GI___pthread_sigmask (how=1, newmask=<optimized out>,
        oldmask=0x7fffed3de2f8) at ./nptl/pthread_sigmask.c:43
    43      ./nptl/pthread_sigmask.c: No such file or directory.
    #0  __GI___pthread_sigmask (how=1, newmask=<optimized out>,
        oldmask=0x7fffed3de2f8) at ./nptl/pthread_sigmask.c:43
    #1  0x00007ffff530c503 in aos::ipc_lib::(anonymous namespace)::wrapped_pthread_sigmask (how=1, set=0xb2cf38, oldset=0x7fffed3de2f8)
        at aos/ipc_lib/signalfd.cc:65
    #2  0x00007ffff530c558 in aos::ipc_lib::SignalFd::~SignalFd (this=0xb2cf30)
        at aos/ipc_lib/signalfd.cc:113
    #3  0x00007ffff58b2c9b in std::default_delete<aos::ipc_lib::SignalFd>::operator() (this=0x7fffffffd4e0, __ptr=0xb2cf30)
        at external/amd64_debian_sysroot//usr/include/c++/10/bits/unique_ptr.h:85
    #4  0x00007ffff58b473c in std::__uniq_ptr_impl<aos::ipc_lib::SignalFd, std::default_delete<aos::ipc_lib::SignalFd> >::reset (this=0x7fffffffd4e0, __p=0x0)
        at external/amd64_debian_sysroot//usr/include/c++/10/bits/unique_ptr.h:182
    #5  0x00007ffff58a535d in std::unique_ptr<aos::ipc_lib::SignalFd, std::default_delete<aos::ipc_lib::SignalFd> >::reset (this=0x7fffffffd4e0, __p=0x0)
        at external/amd64_debian_sysroot//usr/include/c++/10/bits/unique_ptr.h:456
    #6  0x00007ffff58a213e in aos::ShmEventLoop::Run (this=0x7fffffffd200)
        at aos/events/shm_event_loop.cc:1063
    #7  0x00007ffff7eb3980 in aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0::operator()() const (
        this=0xaef998) at aos/network/message_bridge_test_lib.cc:30
    #8  0x00007ffff7eb382d in std::__invoke_impl<void, aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0>(std::__invoke_other, aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0&&) (__f=...)
        at external/amd64_debian_sysroot//usr/include/c++/10/bits/invoke.h:60
    #9  0x00007ffff7eb37bd in std::__invoke<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0>(aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0&&) (__fn=...)
        at external/amd64_debian_sysroot//usr/include/c++/10/bits/invoke.h:95
    #10 0x00007ffff7eb3795 in std::thread::_Invoker<std::tuple<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0xaef998)
        at external/amd64_debian_sysroot//usr/include/c++/10/thread:264
    #11 0x00007ffff7eb3765 in std::thread::_Invoker<std::tuple<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0> >::operator()() (this=0xaef998)
        at external/amd64_debian_sysroot//usr/include/c++/10/thread:271
    #12 0x00007ffff7eb36c9 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<aos::message_bridge::testing::ThreadedEventLoopRunner::ThreadedEventLoopRunner(aos::ShmEventLoop*)::$_0> > >::_M_run() (this=0xaef990)
        at external/amd64_debian_sysroot//usr/include/c++/10/thread:215
    #13 0x00007ffff7cd44a3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
    #14 0x00007ffff48a8044 in start_thread (arg=<optimized out>)
        at ./nptl/pthread_create.c:442
    #15 0x00007ffff49285fc in clone3 ()
        at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Sometimes, as soon as the `ping` thread starts exiting, a wakeup
signal arrives. That causes the default signal handler to be called.
Which for SIGRT2 is to terminate the program.

For most applications that's probably fine in the grand scheme of
things, but it's a bit annoying for tests. Especially for
`message_bridge_test` which has multiple threads with watchers.

I am not 100% convinced that the issue is happening because of a race
condition in `lockless_queue.cc`, but it seems the most likely
candidate. Quoted from the `LocklessQueueWakeUpper::Wakeup` function:

    // Grab a copy so it won't change out from underneath us, and we can sort it
    // nicely in C++.
    // Do note that there is still a window where the process can die *after* we
    // read everything.  We will still PI boost and send a signal to the thread in
    // question.  There is no way without pidfd's to close this window, and
    // creating a pidfd is likely not RT.

The way I'm reading this is that there's a window between the sender
grabbing a copy of the watchers and the watcher unblocking the signal.
That window allows for the receiver to be killed before it's finished
tearing down.

Rough sequence of events:

1. The sender reads the list of watchers from SHMEM.
2. The reader shuts down and unblocks the wakeup signal.
3. The sender sends the wakeup signal to all watchers.
4. The reader invokes the default signal handler for the wakeup
   signal. By default this causes the process to exit uncleanly
   immediately.

Before this patch, I saw 20-30 tests out of 1000 fail due to this
error. After this patch, there are no failures of this kind.

Change-Id: I12a18e2977b50b9cb33334373e131ae9774e7579
Signed-off-by: Philipp Schrader <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant