sdnotify tests: try real hard to kill socat processes #9697

Merged


edsantiago
Member

podman gating tests are hanging in the new Fedora CI setup;
long and tedious investigation suggests that 'socat' processes
are being left unkilled, which then causes BATS to hang when
it (presumably) runs a final 'wait' in its end cleanup.

The two principal changes are to exec socat in a subshell
with fd3 closed, and to pkill its child processes before
killing the process itself. I don't know if both are needed.
The pkill definitely is; the exec may just be superstition.
Since I've wasted more than a day of PTO time on this, I'm
okay with a little superstition. What I do know is that with
these two changes, my reproducer fails to reproduce in over
one hour of trying (normally it fails within 5 minutes).

Signed-off-by: Ed Santiago [email protected]
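The two changes described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the helper names start_listener and stop_listener and the LISTENER_PID variable are hypothetical. The fd 3 handling assumes the behavior noted in the BATS documentation, where background processes that inherit file descriptor 3 can keep BATS waiting.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative and not taken from
# the podman test suite.

# Start a command in the background with fd 3 closed, so that it cannot
# hold open the descriptor BATS uses. Sets the global LISTENER_PID.
start_listener() {
    # 'exec' replaces the subshell, so $! is the command's own PID.
    ( exec "$@" 3>&- ) &
    LISTENER_PID=$!
}

# Kill the listener's direct children first, then the listener itself,
# then reap it. Each step tolerates "nothing to kill".
stop_listener() {
    pkill -P "$LISTENER_PID" 2>/dev/null || true
    kill "$LISTENER_PID" 2>/dev/null || true
    wait "$LISTENER_PID" 2>/dev/null || true
}
```

In a test this might be used as start_listener socat ... in setup and stop_listener in teardown. Killing children before the parent matters because pkill -P matches on parent PID, and once the parent is gone its children are reparented.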

@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the "approved" label Mar 11, 2021
@mheon
Member

mheon commented Mar 11, 2021

Sure, LGTM

@rhatdan
Member

rhatdan commented Mar 11, 2021

LGTM

podman gating tests are hanging in the new Fedora CI setup;
long and tedious investigation suggests that 'socat' processes
are being left unkilled, which then causes BATS to hang when
it (presumably) runs a final 'wait' in its end cleanup.

The two principal changes are to exec socat in a subshell
with fd3 closed, and to pkill its child processes before
killing the process itself. I don't know if both are needed.
The pkill definitely is; the exec may just be superstition.
Since I've wasted more than a day of PTO time on this, I'm
okay with a little superstition. What I do know is that with
these two changes, my reproducer fails to reproduce in over
one hour of trying (normally it fails within 5 minutes).

AND, update: only rawhide (f35) leaves stray socat processes
behind. f33 and ubuntu do not, so 'pkill -P' fails.

I really have no idea what's going on.

Signed-off-by: Ed Santiago <[email protected]>
edsantiago force-pushed the fedora_gating_test_hang branch from 2e94cff to 660a729 on March 11, 2021 23:23
@edsantiago
Member Author

Update: changed pkill to pkill ... || true, because (sigh) f33 and ubuntu do not leave behind a set of stray socat processes. Only rawhide does:

  [parent socat process has been killed]
  ├─socat,358891 unix-recvfrom:/tmp/podman_bats.OBYLSk/conmon.sock,fork system:(cat;echo) >> /tmp/podman_bats.OBYLSk/socat.log
  │   └─socat,358893 unix-recvfrom:/tmp/podman_bats.OBYLSk/conmon.sock,fork system:(cat;echo) >> /tmp/podman_bats.OBYLSk/socat.log
  │       └─sh,358895 -c (cat;echo) >> /tmp/podman_bats.OBYLSk/socat.log
  │           └─sh,358897 -c (cat;echo) >> /tmp/podman_bats.OBYLSk/socat.log
  │               └─cat,358900

Anyhow, on f33 and ubuntu, pkill -P pid exits with a nonzero status because pid has no children, hence the || true. Only rawhide has child socats. I have no idea what's going on.

on f33: socat-1.7.4.1-1.fc33, kernel 5.10.19-200.fc33
on rawhide: socat-1.7.4.1-2.fc34, kernel 5.12.0-0.rc2.165.fc35
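For reference, pkill exits with status 1 when no process matches, which is what happens with pkill -P on a PID that has no children. The change here uses a plain || true. A slightly stricter variant, sketched below with the hypothetical helper name kill_children, tolerates only the "nothing matched" case and still reports other failures.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this PR: kill the direct children of
# a PID, treating "no children" as success.
kill_children() {
    local parent=$1 rc=0
    pkill -P "$parent" || rc=$?
    # 0 = at least one process was signalled, 1 = nothing matched.
    # Anything higher is a usage or fatal error and is passed back.
    if [ "$rc" -gt 1 ]; then
        echo "pkill -P $parent failed with status $rc" >&2
        return "$rc"
    fi
    return 0
}
```

Whether the extra strictness is worth it over || true is a judgment call; in a teardown path the simpler form may be preferable.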

@TomSweeneyRedHat
Member

LGTM

@rhatdan
Member

rhatdan commented Mar 12, 2021

/lgtm
/hold

openshift-ci-robot added the "do-not-merge/hold" label Mar 12, 2021
openshift-ci-robot added the "lgtm" label Mar 12, 2021
@rhatdan
Member

rhatdan commented Mar 12, 2021

/hold cancel

openshift-ci-robot removed the "do-not-merge/hold" label Mar 12, 2021
openshift-merge-robot merged commit 5b22ddd into containers:master Mar 12, 2021
edsantiago deleted the fedora_gating_test_hang branch March 13, 2021 15:07
github-actions bot added the "locked - please file new issue/PR" label Sep 23, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023
Labels: approved, lgtm, locked - please file new issue/PR

6 participants