Fix deadlock induced MacOS PW Pool collapse #214

cevich · 2024-08-05T16:33:18Z

Every night a script runs to check and possibly update all the scripts in the repo. When this happens, two important activities take place:

1. The script is restarted (presuming its own code changed).
2. The container running nginx (for the usage graph) is restarted.

For unknown reasons, possibly due to a system update, a pasta (previously slirp4netns) sub-process spawned by podman is holding open the lock-file required by both the maintenance script and the (very important) Cron.sh. This leads to a deadlock situation where the entire pool becomes unmanaged since Cron.sh can't run.
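The mechanism can be reproduced in isolation. The following is a minimal sketch, assuming the lock is an `flock(1)`-style lock (the description above only says "lock-file") and that the `flock` utility from util-linux is installed. The function name `demo_leaked_lock`, the use of FD 9, and the `sleep` stand-in for pasta are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: show how a background child can keep a lock held after the
# process that took the lock has exited. Assumes flock(1) from util-linux.
# demo_leaked_lock, FD 9, and the sleep stand-in are hypothetical.

demo_leaked_lock() {
    local lockfile=$1
    (
        exec 9>"$lockfile"        # open the lock-file on FD 9
        flock -n 9 || exit 1      # take the exclusive lock
        # Stand-in for pasta: a long-lived child that inherits FD 9.
        sleep 3 </dev/null >/dev/null 2>&1 &
    )
    # The sub-shell that took the lock is gone, but the lock is only
    # released once every inherited copy of FD 9 is closed, including
    # the copy still held by the background sleep.
    if flock -n "$lockfile" true; then
        echo "lock is free"
    else
        echo "lock is still held"
    fi
}
```

Calling `demo_leaked_lock "$(mktemp)"` should report that the lock is still held until the background child exits, which mirrors how a lingering pasta process could block `Cron.sh`.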

To prevent unchecked nefarious/unintended use, all workers automatically recycle themselves after some time should they become unmanaged. Therefore, without Cron.sh operating, the entire pool will eventually collapse.

Though complex, as a (hopefully) temporary fix, ensure all non-stdio FDs are closed (in a sub-shell) prior to restarting the container.
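A rough sketch of that idea follows. It is not the code from this PR: the helper name `run_without_extra_fds` is hypothetical, and it assumes a Linux host where `/proc/self/fd` lists the open FDs of the calling process.

```shell
#!/usr/bin/env bash
# Sketch only: run a command with every inherited FD >= 3 closed.
# Assumes Linux, where /proc/self/fd lists the calling process's open FDs.
# run_without_extra_fds is a hypothetical name, not taken from the PR.

run_without_extra_fds() {
    (
        local fd_path fd_nr
        for fd_path in /proc/self/fd/*; do
            fd_nr=${fd_path##*/}
            # Skip anything that is not a plain FD number.
            [[ $fd_nr =~ ^[0-9]+$ ]] || continue
            # Keep stdin, stdout, and stderr.
            [[ $fd_nr -ge 3 ]] || continue
            # fd_nr is digits-only at this point, so eval is safe here.
            eval "exec ${fd_nr}>&-"
        done
        # Run the requested command; it inherits only FDs 0, 1, and 2.
        "$@"
    )
}
```

Because the FDs are closed inside a sub-shell, the calling script keeps its own lock FD open and only the command passed as arguments (for example, the container restart) starts without it.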

Tested this change manually: Using a modified copy of the script which simply obtains the lock and always calls relaunch_web_container(). Examining a before/after lsof embedded in that function, there is no more leaked Cron.sh FD open by pasta.

cevich · 2024-08-05T17:21:07Z

@edsantiago @Luap99 when you have a moment, any major SNAFUs with this change? Once merged I'll manually update it on the management VM and monitor to be double sure it's working.

P.S. I know most of this stuff is really ugly, it was all intended to be temporary 😢 Paul suggested a refactor using systemd timers which seems like a good idea.

github-actions · 2024-08-05T17:21:45Z

Successfully triggered github-actions/success task to indicate successful run of cirrus-ci_retrospective integration and unit testing from this PR's 39dcbda0fc4f1a401e876e33df056e52a1413d18.

Luap99

LGTM

edsantiago · 2024-08-05T17:42:34Z

mac_pw_pool/nightly_maintenance.sh

+        [[ $fd_nr -ge 3 ]] || \
+            continue


Why this usage? Why not a simple if? (Just curious. My brain does not process test || as easily as it does if/then).

Agreed, I also find if/else easier to read in scripts.

My brain has an easier time understanding this as a "filter", "only greater/equal to 3 may proceed". If it were a more complex condition in a different context, I'd probably agree with you though 😄
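For comparison, the two styles discussed in this thread can be put side by side. This is an illustrative sketch only; the function names `filter_with_or` and `filter_with_if` are hypothetical and the loop bodies simply print the FD numbers that pass the filter.

```shell
#!/usr/bin/env bash
# Sketch only: the same "FD number must be >= 3" filter written both ways.
# Function names are hypothetical; input is a list of FD numbers.

# "Filter" style: only values greater than or equal to 3 may proceed.
filter_with_or() {
    local fd_nr
    for fd_nr in "$@"; do
        [[ $fd_nr -ge 3 ]] || continue
        echo "$fd_nr"
    done
}

# if/then style: the skip condition is spelled out explicitly.
filter_with_if() {
    local fd_nr
    for fd_nr in "$@"; do
        if [[ $fd_nr -lt 3 ]]; then
            continue
        fi
        echo "$fd_nr"
    done
}
```

For numeric input both functions print the same values, so the choice between them is about readability rather than behavior.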

cevich · 2024-08-05T19:10:13Z

Force-push: Simplified loop as Ed suggested. Tested w/ test-script.

github-actions · 2024-08-05T19:14:24Z

Successfully triggered github-actions/success task to indicate successful run of cirrus-ci_retrospective integration and unit testing from this PR's 47a5015b075c84fa08cd11cf6e1f125a2e85d897.

cevich · 2024-08-05T19:20:53Z

Manually updated maintenance VM w/ this new code.

Luap99 · 2024-08-09T10:57:07Z

Pasta now closes all additional FDs on startup, which should help others avoid running into this problem in the future:
https://passt.top/passt/commit/?id=09603cab28f9883baf1d7b48bdc102d6641dc300

But I still think it is best to keep this change even once an updated pasta is available, to avoid problems with other child processes that might inherit the lock FD in the same way.

cevich · 2024-08-09T14:26:42Z

That's great news, thanks Paul for pursuing that. I'm sure it will help someone else avoid running into similar issues. Yes, I completely agree about keeping this in place.

cevich force-pushed the mac_pw_pool_fix_deadlock branch from ef3b8f5 to 39dcbda on August 5, 2024 17:15

cevich requested review from edsantiago and Luap99 August 5, 2024 17:18

cevich marked this pull request as ready for review August 5, 2024 17:18

Luap99 approved these changes Aug 5, 2024

edsantiago approved these changes Aug 5, 2024

cevich force-pushed the mac_pw_pool_fix_deadlock branch from 39dcbda to 47a5015 on August 5, 2024 19:08

cevich merged commit 13be116 into containers:main Aug 5, 2024
9 checks passed

cevich mentioned this pull request Aug 6, 2024

MacOS PW Pool Collapse due to Cron.sh deadlock #218

Closed
