[docker-wait-any]: Exit worker thread if main thread is expected to exit #12255
Merged
Conversation
There's an odd crash that intermittently happens after the teamd container exits, and a signal is raised to the main thread to exit. This thread (watching teamd) continues execution because it's in a `while True`. The subsequent wait call on the teamd container very likely returns immediately, and it calls `is_warm_restart_enabled` and `is_fast_reboot_enabled`. In either of these cases, sometimes, there is a crash in the transition from C code to Python code (after the function gets executed). Python sees that this thread got a signal to exit, because the main thread is exiting, and tells pthread to exit the thread. However, during the stack unwinding, _something_ is telling the unwinder to call `std::terminate`. The reason is unknown.

This then results in a python3 SIGABRT, and systemd then doesn't call the stop script to actually stop the container (possibly because the main process exited with a SIGABRT, so it's a hard crash). This means that the container doesn't actually get stopped or restarted, resulting in an inconsistent state afterwards.

The workaround appears to be that if we know the main thread needs to exit, just return here, and don't continue execution. This at least tries to avoid getting into the problematic code path. However, it's still feasible to get a SIGABRT, depending on thread/process timings (i.e. teamd exits, signals the main thread to exit, and then syncd exits, and syncd calls one of the two C functions, potentially hitting the issue).

Signed-off-by: Saikrishna Arcot <[email protected]>
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
yxieca approved these changes Oct 6, 2022
yxieca pushed a commit that referenced this pull request Oct 6, 2022
[docker-wait-any]: Exit worker thread if main thread is expected to exit (#12255)
richardyu-ms pushed a commit to richardyu-ms/sonic-buildimage that referenced this pull request Nov 16, 2022
Related work items: sonic-net#2151, sonic-net#2194, sonic-net#2224, sonic-net#2237, sonic-net#2264, sonic-net#2281, sonic-net#2286, sonic-net#2297, sonic-net#2299, sonic-net#2305, sonic-net#2325, sonic-net#2335, sonic-net#2338, sonic-net#2341, sonic-net#2343, sonic-net#2347, sonic-net#2350, sonic-net#2355, sonic-net#2356, sonic-net#2358, sonic-net#2360, sonic-net#2363, sonic-net#2367, sonic-net#2368, sonic-net#2370, sonic-net#2374, sonic-net#2392, sonic-net#2398, sonic-net#2408, sonic-net#2414, sonic-net#2415, sonic-net#2419, sonic-net#2421, sonic-net#2422, sonic-net#2423, sonic-net#2426, sonic-net#2427, sonic-net#2430, sonic-net#2431, sonic-net#2433, sonic-net#2434, sonic-net#2436, sonic-net#2437, sonic-net#2441, sonic-net#2444, sonic-net#2445, sonic-net#2446, sonic-net#2456, sonic-net#2458, sonic-net#2460, sonic-net#2461, sonic-net#2463, sonic-net#2472, sonic-net#2475, sonic-net#11877, sonic-net#12024, sonic-net#12065, sonic-net#12097, sonic-net#12130, sonic-net#12209, sonic-net#12217, sonic-net#12244, sonic-net#12251, sonic-net#12255, sonic-net#12276, sonic-net#12284
Signed-off-by: Saikrishna Arcot <[email protected]>
Why I did it
There's an odd crash that intermittently happens during `config reload` or `config load_minigraph`. The swss container has a Python script at `/usr/bin/docker-wait-any` that waits for either the swss, syncd, or teamd containers to exit (one container being monitored per thread). When a container exits, a signal is sent to the main thread to exit the Python script, which then tells systemd to stop the container.

Snippet of the backtrace:
What's happening is that after the teamd container exits, the signal is sent to the main thread, but because there's no return or exit out of the `while True` loop that it's in, the thread waits for the teamd container to exit again. Because the container isn't running, this is effectively a no-op, and execution moves on to the `device_info.is_warm_restart_enabled(container_name)` and `device_info.is_fast_reboot_enabled()` function calls (which call into C++ code). Meanwhile, the main thread has called `sys.exit(0)`, and Python is bringing down all of its references/data structures, and (more importantly here) is telling the other threads to exit.

For the teamd thread, when it returns from calling the C++ function, the wrapper code generated by SWIG destructs a C++ object that it had created (for the purposes of saving/restoring the Python thread state). It calls `PyEval_RestoreThread()` in the destructor, which sees that the thread is supposed to exit, and proceeds to call `pthread_exit()`. This is shown in frames 12 and 13 above.

`pthread_exit()` then calls a function that will unwind the stack, so that any cleanup or other handler functions can be called. This is so that there's a graceful exit to the thread. However, the way that unwinding works is that a special exception (`abi::__forced_unwind` or `__cxxabiv1::__forced_unwind`) is raised and is expected to be propagated up to the first frame. As the unwinder works frame-by-frame, if a frame has an exception handler (a `try`/`catch` block) with a `catch` block matching this exception, then that handler will get called.

For most frames, nothing probably happens. However, one of the frames on the stack here is a C++ destructor (which had called `PyEval_RestoreThread()` earlier). In C++11 and newer, C++ destructors are not allowed to let an exception propagate outside of the destructor, and if they do, `std::terminate()` gets called. In other words, any exceptions that could be caused by functions that the destructor calls must be handled within the destructor, and must not be propagated up the stack. (If the destructor specifies `noexcept(false)` to signify that exceptions could be propagated up, then maybe it's fine; I'm not entirely certain about this.) Because the unwinder essentially uses a special exception to go up the stack, `std::terminate` gets called, which then results in a SIGABRT for the process. Because of this SIGABRT, systemd appears to treat the service as stopped, and doesn't call the `ExecStop=` command, which means the containers don't actually go down.

All of this is a timing issue; if it's unlucky enough that the thread-exiting check is done around the call to C++ code, then a SIGABRT could happen. This, unfortunately, appears to be happening sufficiently often in some cases, as well as in some forced cases (see below).
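To make the sequence above concrete, here is a rough Python sketch of the worker-thread loop as described. This is not the actual `/usr/bin/docker-wait-any` code: the `docker wait` subprocess call, the `threading.Event`-based signalling, and the `wait_for_container` name are stand-ins for whatever the real script uses; only the `device_info` checks come from the description above.

```python
import subprocess
import threading

# Assumed import location for the two checks named above.
from sonic_py_common import device_info

# Stand-in for however the real script tells the main thread to exit.
g_exit_event = threading.Event()

def wait_for_container(container_name):
    """Hypothetical simplified worker: one thread per monitored container."""
    while True:
        # Block until the container exits. If it has already exited,
        # `docker wait` returns (almost) immediately, so the loop becomes
        # a near no-op on the next iteration.
        subprocess.run(["docker", "wait", container_name], check=False)

        # These calls cross into C++ via the SWIG-generated wrappers. If
        # Python is already shutting this thread down (because the main
        # thread was told to exit on a previous iteration), the return
        # path from C++ can end in pthread_exit() inside a destructor,
        # which is where the SIGABRT described above comes from.
        if (device_info.is_warm_restart_enabled(container_name)
                or device_info.is_fast_reboot_enabled()):
            continue  # planned restart: keep waiting, don't stop the service

        # Tell the main thread to exit the script (it will then ask systemd
        # to stop the container). Before this change, execution simply
        # looped around again, re-entering the code path above.
        g_exit_event.set()
```

In this sketch, the main thread would start one such worker per container (swss, syncd, teamd), block on `g_exit_event`, and then exit so that systemd stops the containers.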
How I did it
A quick workaround is that if we know the main thread needs to exit, just return after sending the signal to the main thread, and don't continue execution. This at least tries to avoid it from getting into the problematic code path. However, it's still possible to get a SIGABRT because of the above, depending on thread/process timings (i.e. teamd exits, signals the main thread to exit, and then syncd exits, and syncd calls one of the two C++ functions, potentially hitting the issue).
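Applied to the hypothetical sketch above (reusing its imports and names), the workaround is just an early return after the worker signals the main thread, so the thread never loops back into the wait and the C++ checks while the interpreter is tearing it down:

```python
def wait_for_container(container_name):
    """Same hypothetical worker as above, with the workaround applied."""
    while True:
        subprocess.run(["docker", "wait", container_name], check=False)

        if (device_info.is_warm_restart_enabled(container_name)
                or device_info.is_fast_reboot_enabled()):
            continue  # planned restart: keep waiting

        # Tell the main thread to exit the script...
        g_exit_event.set()

        # ...and stop this worker right away, instead of looping back into
        # the wait + C++ calls while Python is shutting the thread down.
        return
```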
A proper fix would likely be to make sure `PyEval_RestoreThread()` (and, in turn, `pthread_exit()`) gets called from a regular C++ function, and not from a destructor. The SWIG wrapper code generated with the `-threads` option does this, but there's still a gap there where it might get called from the destructor, so it's not immune. (Currently, the swsscommon wrapper code is not using the `-threads` option, and is manually adding support for multithreading.)

How to verify it
This was tested with the following Bash script. On my dev VM, at least, with this script, the core file was repro'ed in 1-2 iterations. With my fix, 90+ iterations were successfully done with no core file:
Which release branch to backport (provide reason below if selected)
Description for the changelog
Ensure to add label/tag for the feature raised.
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)