sdnotify tests: try real hard to kill socat processes
podman gating tests are hanging in the new Fedora CI setup;
long and tedious investigation suggests that 'socat' processes
are being left unkilled, which then causes BATS to hang when
it (presumably) runs a final 'wait' in its end cleanup.

The two principal changes are to exec socat in a subshell
with fd3 closed, and to pkill its child processes before
killing the process itself. I don't know if both are needed.
The pkill definitely is; the exec may just be superstition.
Since I've wasted more than a day of PTO time on this, I'm
okay with a little superstition. What I do know is that with
these two changes, my reproducer fails to reproduce in over
one hour of trying (normally it fails within 5 minutes).
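The fd3 half of the change can be checked in isolation. The following is a minimal sketch, not part of the commit: `fd3_state` is a hypothetical helper, and `/dev/null` stands in for whatever BATS actually keeps open on fd 3. It only shows that a subshell inherits fd 3 by default and that `3>&-` closes it for that one command.

```shell
#!/usr/bin/env bash
# Sketch only: fd3_state is a hypothetical helper, not from the commit.

# Report whether fd 3 is open in the calling process.
fd3_state() {
    if { true >&3; } 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

# Stand-in for the descriptor that BATS keeps open on fd 3.
exec 3>/dev/null

# A plain subshell inherits fd 3 from its parent.
inherited=$( ( fd3_state ) )

# With 3>&- the descriptor is closed for that command only.
closed=$( ( fd3_state 3>&- ) )

# The parent shell still has fd 3 open afterwards.
parent=$(fd3_state)

echo "inherited=$inherited closed=$closed parent=$parent"
```

Whether an inherited fd 3 is what actually keeps BATS waiting is not established by the commit message; the sketch only demonstrates the redirection itself.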

AND, update: only rawhide (f35) leaves stray socat processes
behind. f33 and ubuntu do not, so on those 'pkill -P' finds nothing
to kill and exits nonzero; hence the '|| true'.

I really have no idea what's going on.

Signed-off-by: Ed Santiago <[email protected]>
edsantiago committed Mar 11, 2021
1 parent 8d33bfa commit 660a729
Showing 1 changed file with 11 additions and 2 deletions.
13 changes: 11 additions & 2 deletions test/system/260-sdnotify.bats
@@ -42,21 +42,30 @@ function _start_socat() {
     _SOCAT_LOG="$PODMAN_TMPDIR/socat.log"

     rm -f $_SOCAT_LOG
-    socat unix-recvfrom:"$NOTIFY_SOCKET",fork \
-          system:"(cat;echo) >> $_SOCAT_LOG" &
+    # Execute in subshell so we can close fd3 (which BATS uses).
+    # This is a superstitious ritual to try to avoid leaving processes behind,
+    # and thus prevent CI hangs.
+    (exec socat unix-recvfrom:"$NOTIFY_SOCKET",fork \
+          system:"(cat;echo) >> $_SOCAT_LOG" 3>&-) &
     _SOCAT_PID=$!
 }

 # Stop the socat background process and clean up logs
 function _stop_socat() {
     if [[ -n "$_SOCAT_PID" ]]; then
+        # Kill all child processes, then the process itself.
+        # This is a superstitious incantation to avoid leaving processes behind.
+        # The '|| true' is because only f35 leaves behind socat processes;
+        # f33 (and perhaps others?) behave nicely. ARGH!
+        pkill -P $_SOCAT_PID || true
         kill $_SOCAT_PID
     fi
     _SOCAT_PID=

     if [[ -n "$_SOCAT_LOG" ]]; then
         rm -f $_SOCAT_LOG
     fi
     _SOCAT_LOG=
 }

# Check that MAINPID=xxxxx points to a running conmon process
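The children-first stop sequence from `_stop_socat` can be tried outside the test suite. This is a rough sketch under stated assumptions: `stop_tree` is a hypothetical name, a `bash -c` process with a `sleep` child stands in for socat and its forked handlers, and the final `wait` is an addition that the commit does not make.

```shell
#!/usr/bin/env bash
# Sketch only: stop_tree and the sleep-based helper are hypothetical
# stand-ins for _stop_socat and the socat process in the commit.

stop_tree() {
    local pid=$1
    # Kill direct children first. pkill exits nonzero when nothing
    # matches, so a helper without children must not abort the caller.
    pkill -P "$pid" 2>/dev/null || true
    # Then signal the helper itself; it may already have exited.
    kill "$pid" 2>/dev/null || true
    # Reap it so nothing is left for a later 'wait' (not in the commit).
    wait "$pid" 2>/dev/null || true
}

# Background helper that spawns one child, loosely like socat with 'fork'.
bash -c 'sleep 5 & wait' >/dev/null 2>&1 &
helper_pid=$!
sleep 1    # give the helper time to start its child

stop_tree "$helper_pid"

# kill -0 only tests whether the PID can still be signalled.
if kill -0 "$helper_pid" 2>/dev/null; then
    helper_state=running
else
    helper_state=stopped
fi
echo "helper is $helper_state"
```

Note that `pkill -P` only matches direct children of the given PID, so grandchildren would need a different approach such as a process group.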