Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock induced MacOS PW Pool collapse #214

Merged
merged 1 commit into from
Aug 5, 2024

Conversation

cevich
Copy link
Member

@cevich cevich commented Aug 5, 2024

Every night a script runs to check and possibly update all the scripts in the repo. When this happens, two important activities take place:

  1. The script is restarted (presuming it's own code changed).
  2. The container running nginx (for the usage graph) is restarted.

For unknown reasons, possibly due to a system update, a pasta (previously slirp4netns) sub-process spawned by podman is holding open the lock-file required by both the maintenance script and the (very important) Cron.sh. This leads to a deadlock situation where the entire pool becomes unmanaged since Cron.sh can't run.

To prevent unchecked nefarious/unintended use, all workers automatically recycle themselves after some time should they become unmanaged. Therefore, without Cron.sh operating, the entire pool will eventually collapse.

Though complex, as a (hopefully) temporary fix, ensure all non-stdio FDs are closed (in a sub-shell) prior to restarting the container.


Tested this change manually: Using a modified copy of the script which simply obtains the lock and always calls relaunch_web_container(). Examining a before/after lsof embedded in that function, there is no more leaked Cron.sh FD open by pasta.

@cevich cevich force-pushed the mac_pw_pool_fix_deadlock branch from ef3b8f5 to 39dcbda Compare August 5, 2024 17:15
@cevich cevich requested review from edsantiago and Luap99 August 5, 2024 17:18
@cevich cevich marked this pull request as ready for review August 5, 2024 17:18
@cevich
Copy link
Member Author

cevich commented Aug 5, 2024

@edsantiago @Luap99 when you have a moment, any major SNAFU's with this change? Once merged I'll manually update it on the management VM and monitor to be double sure it's working.

P.S. I know most of this stuff is really ugly, it was all intended to be temporary 😢 Paul suggested a refactor using systemd timers which seems like a good idea.

Copy link

github-actions bot commented Aug 5, 2024

Successfully triggered github-actions/success task to indicate successful run of cirrus-ci_retrospective integration and unit testing from this PR's 39dcbda0fc4f1a401e876e33df056e52a1413d18.

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

mac_pw_pool/nightly_maintenance.sh Outdated Show resolved Hide resolved
[[ $fd_nr -ge 3 ]] || \
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this usage? Why not a simple if? (Just curious. My brain does not process test || as easily as it does if/then).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed I also find if else easier to read in scripts

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My brain has an easier time understanding this as a "filter", "only greater/equal to 3 may proceed". If it were a more complex condition in a different context, I'd probably agree with you though 😄

Every night a script runs to check and possibly update all the scripts
in the repo.  When this happens, two important activities take place:

1. The script is restarted (presuming it's own code changed).
2. The container running nginx (for the usage graph) is restarted.

For unknown reasons, possibly due to a system update, a pasta
(previously slirp4netns) sub-process spawned by podman is holding open
the lock-file required by both the maintenance script and the (very
important) `Cron.sh`.  This leads to a deadlock situation where
the entire pool becomes unmanaged since `Cron.sh` can't run.

To prevent unchecked nefarious/unintended use, all workers automatically
recycle themselves after some time should they become unmanaged.
Therefore, without `Cron.sh` operating, the entire pool will eventually
collapse.

Though complex, as a (hopefully) temporary fix, ensure all non-stdio FDs
are closed (in a sub-shell) prior to restarting the container.

Signed-off-by: Chris Evich <[email protected]>
@cevich cevich force-pushed the mac_pw_pool_fix_deadlock branch from 39dcbda to 47a5015 Compare August 5, 2024 19:08
@cevich
Copy link
Member Author

cevich commented Aug 5, 2024

Force-push: Simplified loop as Ed suggested. Tested w/ test-script.

Copy link

github-actions bot commented Aug 5, 2024

Successfully triggered github-actions/success task to indicate successful run of cirrus-ci_retrospective integration and unit testing from this PR's 47a5015b075c84fa08cd11cf6e1f125a2e85d897.

@cevich cevich merged commit 13be116 into containers:main Aug 5, 2024
9 checks passed
@cevich
Copy link
Member Author

cevich commented Aug 5, 2024

Manually updated maintenance VM w/ this new code.

@Luap99
Copy link
Member

Luap99 commented Aug 9, 2024

Pasta now closes all additional fds on startup which should help others from running into such a problem in the future
https://passt.top/passt/commit/?id=09603cab28f9883baf1d7b48bdc102d6641dc300

But I still think it is best to keep this even once a updated pasta is available to avoid issues with other child processes that might face the same issue.

@cevich
Copy link
Member Author

cevich commented Aug 9, 2024

That's great news, thanks Paul for pursuing that. I'm sure it will help someone else avoid running into similar issues. Yes, I completely agree about keeping this in place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants