Fix deadlock-induced macOS PW Pool collapse #214
Conversation
Force-pushed ef3b8f5 to 39dcbda
@edsantiago @Luap99 when you have a moment, any major SNAFUs with this change? Once merged I'll manually update it on the management VM and monitor to be double sure it's working. P.S. I know most of this stuff is really ugly; it was all intended to be temporary 😢 Paul suggested a refactor using systemd timers, which seems like a good idea.
Successfully triggered github-actions/success task to indicate successful run of cirrus-ci_retrospective integration and unit testing from this PR's 39dcbda0fc4f1a401e876e33df056e52a1413d18.
LGTM
[[ $fd_nr -ge 3 ]] || \
    continue
Why this usage? Why not a simple `if`? (Just curious. My brain does not process `test ||` as easily as it does `if/then`.)
Agreed, I also find `if/else` easier to read in scripts.
My brain has an easier time understanding this as a "filter", "only greater/equal to 3 may proceed". If it were a more complex condition in a different context, I'd probably agree with you though 😄
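For comparison, the two spellings behave the same inside the loop; here is a minimal sketch (only `$fd_nr` comes from the diff above, the rest is illustrative):

```bash
# Filter style, as used in the diff: FDs below 3 (stdin/stdout/stderr)
# are skipped; everything else falls through to the rest of the loop body.
[[ $fd_nr -ge 3 ]] || continue

# Equivalent if/then form:
if [[ $fd_nr -lt 3 ]]; then
    continue
fi
```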
Every night a script runs to check and possibly update all the scripts in the repo. When this happens, two important activities take place:

1. The script is restarted (presuming its own code changed).
2. The container running nginx (for the usage graph) is restarted.

For unknown reasons, possibly due to a system update, a pasta (previously slirp4netns) sub-process spawned by podman is holding open the lock-file required by both the maintenance script and the (very important) `Cron.sh`. This leads to a deadlock situation where the entire pool becomes unmanaged since `Cron.sh` can't run.

To prevent unchecked nefarious/unintended use, all workers automatically recycle themselves after some time should they become unmanaged. Therefore, without `Cron.sh` operating, the entire pool will eventually collapse.

Though complex, as a (hopefully) temporary fix, ensure all non-stdio FDs are closed (in a sub-shell) prior to restarting the container.

Signed-off-by: Chris Evich <[email protected]>
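A rough sketch of that approach, assuming a loop over `/proc/self/fd`; the container-name variable and overall structure are illustrative, not the repo's actual code:

```bash
# Restart the web container from a sub-shell that has closed every
# non-stdio FD, so the new podman/pasta child processes cannot inherit
# (and then keep holding) the lock-file descriptor.
relaunch_web_container() {
    (
        for fd_path in /proc/self/fd/*; do
            fd_nr=${fd_path##*/}
            # Leave stdin/stdout/stderr (FDs 0-2) alone.
            [[ $fd_nr -ge 3 ]] || continue
            # Close the FD; ignore failures for FDs that no longer exist
            # (e.g. the FD bash briefly used to read the directory).
            eval "exec ${fd_nr}>&-" 2>/dev/null || true
        done
        podman restart "$WEB_CONTAINER_NAME"  # container name is assumed
    )
}
```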
Force-pushed 39dcbda to 47a5015
Force-push: Simplified loop as Ed suggested. Tested w/ test-script.
Successfully triggered github-actions/success task to indicate successful run of cirrus-ci_retrospective integration and unit testing from this PR's 47a5015b075c84fa08cd11cf6e1f125a2e85d897.
Manually updated maintenance VM w/ this new code.
Pasta now closes all additional fds on startup, which should help others avoid running into such a problem in the future. But I still think it is best to keep this even once an updated pasta is available, to avoid issues with other child processes that might face the same issue.
That's great news, thanks Paul for pursuing that. I'm sure it will help someone else avoid running into similar issues. Yes, I completely agree about keeping this in place.
Tested this change manually: Using a modified copy of the script which simply obtains the lock and always calls `relaunch_web_container()`. Examining a before/after `lsof` embedded in that function, there is no more leaked `Cron.sh` FD open by pasta.
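A hedged sketch of that kind of before/after check; the lock-file path and the echo labels are assumptions for illustration only:

```bash
# List processes still holding the lock file open before and after the
# container restart; with the fix, pasta should no longer appear afterwards.
lockfile="/path/to/pool.lock"   # hypothetical path, not the repo's real one

echo "FD holders before restart:"
lsof "$lockfile" || true

relaunch_web_container

echo "FD holders after restart:"
lsof "$lockfile" || true
```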