Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unconditionally remove conmon files before starting #4065

Merged

Conversation

mheon
Copy link
Member

@mheon mheon commented Sep 20, 2019

We've been seeing a lot of issues (ref: #4061, but there are others) where Podman hiccups on trying to start a container, because some temporary files have been retained and Conmon will not overwrite them.

If we're calling start() we can safely assume that we really want those files gone so the container starts without error, so invoke the cleanup routine. It's relatively cheap (four file removes) so it shouldn't hurt us that much.

Also contains a small simplification to the removeConmonFiles logic - we don't need to stat-then-remove when ignoring ENOENT is fine.

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S labels Sep 20, 2019
@rhatdan
Copy link
Member

rhatdan commented Sep 20, 2019

LGTM
Nice change and cleanup.
@vrothberg @giuseppe @TomSweeneyRedHat @baude @jwhonce @QiWang19 PTAL

@giuseppe
Copy link
Member

LGTM

@rhatdan
Copy link
Member

rhatdan commented Sep 20, 2019

Testing is not happy though.

@debarshiray
Copy link
Member

This broke starting containers on my Workstation 29 where things were somehow working regardless of the warning about crun.

$ /opt/bin/podman --log-level debug start fedora-toolbox-29
ERRO[0000] could not get runtime: eventer creation: No support for journald logging

@mheon
Copy link
Member Author

mheon commented Sep 20, 2019

Oh, wait, wait, wait. This is happening too late in the process. Needs to be attached to init() (when we initially spawn conmon) and not start(). This is what I get for working after midnight.

We've been seeing a lot of issues (ref: containers#4061, but there are
others) where Podman hiccups on trying to start a container,
because some temporary files have been retained and Conmon will
not overwrite them.

If we're calling start() we can safely assume that we really want
those files gone so the container starts without error, so invoke
the cleanup routine. It's relatively cheap (four file removes) so
it shouldn't hurt us that much.

Also contains a small simplification to the removeConmonFiles
logic - we don't need to stat-then-remove when ignoring ENOENT is
fine.

Signed-off-by: Matthew Heon <[email protected]>
@mheon mheon force-pushed the unconditional_conmon_rm branch from 65a7db2 to 407fba4 Compare September 20, 2019 13:30
@debarshiray
Copy link
Member

Oh, wait, wait, wait. This is happening too late in the process. Needs
to be attached to init() (when we initially spawn conmon) and not start().
This is what I get for working after midnight.

Ok, so my Workstation 29 is back to being as it used to be. podman start works but I continue to see the crun warning without bumping the log level.

WARN[0000] Error initializing configured OCI runtime crun: no valid executable found for OCI runtime crun: invalid argument 

Let me now see if it unbreaks podman start my Silverblue 30.

@mheon
Copy link
Member Author

mheon commented Sep 20, 2019

Next step on the logs is to get a print statement in stating our default log level, to see if it changed

@debarshiray
Copy link
Member

podman start continues to be broken on my Silverblue 30 with the same spew from podman start about not being able to mount /usr:

Error: unable to start container "fedora-toolbox-30": container_linux.go:346: starting container process caused "process_linux.go:446: container init caused \"rootfs_linux.go:58: mounting \\\"/usr\\\" to rootfs \\\"/var/home/rishi/.local/share/containers/storage/overlay/25d8867e47b79ef5a590bb748eb0feecd6fbdf12de930d1114f531218c0fac45/merged\\\" at \\\"/var/home/rishi/.local/share/containers/storage/overlay/25d8867e47b79ef5a590bb748eb0feecd6fbdf12de930d1114f531218c0fac45/merged/run/host/usr\\\" caused \\\"operation not permitted\\\"\"": OCI runtime permission denied error

As before, the first failed podman start doesn't put anything in the journal, but the second attempt puts a slightly different set of logs than before. It's no longer about a failed mkfifo but Failed to create container: exit status 1:

Sep 20 17:11:31 bollard conmon[8799]: conmon 7f293d9029e04fba2793 <ndebug>: failed to write to /proc/self/oom_score_adj: Permission denied
Sep 20 17:11:31 bollard conmon[8800]: conmon 7f293d9029e04fba2793 <ninfo>: attach sock path: /run/user/1000/libpod/tmp/socket/7f293d9029e04fba27934561472b3293754e5baf22cb4d950b6eafdb87255aac/attach
Sep 20 17:11:31 bollard conmon[8800]: conmon 7f293d9029e04fba2793 <ninfo>: addr{sun_family=AF_UNIX, sun_path=/run/user/1000/libpod/tmp/socket/7f293d9029e04fba27934561472b3293754e5baf22cb4d950b6eafdb87255aac/attach}
Sep 20 17:11:31 bollard conmon[8800]: conmon 7f293d9029e04fba2793 <ninfo>: ctl fifo path: /var/home/rishi/.local/share/containers/storage/overlay-containers/7f293d9029e04fba27934561472b3293754e5baf22cb4d950b6eafdb87255aac/userdata/ctl
Sep 20 17:11:31 bollard conmon[8800]: conmon 7f293d9029e04fba2793 <ninfo>: terminal_ctrl_fd: 12
Sep 20 17:11:31 bollard conmon[8800]: conmon 7f293d9029e04fba2793 <error>: Failed to create container: exit status 1

Copy link
Member

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @mheon ready to merge?

return errors.Wrapf(err, "error removing container %s exit file", c.ID())
}
if err := os.Remove(exitFile); err != nil && !os.IsNotExist(err) {
return errors.Wrapf(err, "error removing container %s exit file", c.ID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to throw out the question as I'm not sure. Would it be better to not return an error for each of these attempts? I.e. If one fails, note it, then try removing the next file, if that fails note it, then try next? Then at the end report all of the errors encountered? Otherwise, if there are multiple things wrong, will the user be able to clean up/investigate fully?

@mheon
Copy link
Member Author

mheon commented Sep 24, 2019

@vrothberg I don't know if this one actually solved the issue, which is why I'm sitting on it - @debarshiray indicates that it didn't, which makes me concerned that I hit a symptom and the underlying problem is unaddressed.

@vrothberg vrothberg changed the title Unconditionally remove conmon files before starting WIP - Unconditionally remove conmon files before starting Sep 24, 2019
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2019
@vrothberg
Copy link
Member

Added a WIP - to make sure we're not accidentally merging it.

@debarshiray
Copy link
Member

@vrothberg I don't know if this one actually solved the issue, which is why I'm sitting
on it - @debarshiray indicates that it didn't, which makes me concerned that I hit a
symptom and the underlying problem is unaddressed.

If you look very closely at #4061 and read between the lines, then it seems that this pull request at least made things move further. podman start still fails but with a different error.

@debarshiray
Copy link
Member

@vrothberg I don't know if this one actually solved the issue, which is why I'm sitting
on it - @debarshiray indicates that it didn't, which makes me concerned that I hit a
symptom and the underlying problem is unaddressed.

If you look very closely at #4061 and read between the lines, then it seems that this
pull request at least made things move further. podman start still fails but with a different
error.

Ok, I have a better understanding of #4061 now. I don't think we should block this pull request just because it doesn't address #4061 fully. The remaining issue in #4061 is unrelated to this PR in my opinion.

@mheon
Copy link
Member Author

mheon commented Sep 26, 2019

/hold cancel
@rhatdan @vrothberg @giuseppe @baude Alright, sounds like we're ready to merge, then.

@mheon mheon changed the title WIP - Unconditionally remove conmon files before starting Unconditionally remove conmon files before starting Sep 26, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 26, 2019
@jwhonce
Copy link
Member

jwhonce commented Sep 27, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2019
@openshift-merge-robot openshift-merge-robot merged commit e87012d into containers:master Sep 27, 2019
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 26, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants