Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kill: resync the container if runtime fails #16320

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 9 additions & 1 deletion libpod/container_internal.go
Original file line number Diff line number Diff line change
Expand Up @@ -1307,7 +1307,15 @@ func (c *Container) stop(timeout uint) error {

// We have to check stopErr *after* we lock again - otherwise, we have a
// change of panicking on a double-unlock. Ref: GH Issue 9615
if stopErr != nil {
if errors.Is(stopErr, errSendingSignal) {
// Maybe sending the signal has failed because the container
// exited after unlocking above? So let's check the state.
if !c.ensureState(define.ContainerStateStopped, define.ContainerStateExited) {
// Return the error unless the container has been
// stopped or exited.
return stopErr
}
} else if stopErr != nil {
return stopErr
}

Expand Down
25 changes: 9 additions & 16 deletions libpod/oci_conmon_common.go
Original file line number Diff line number Diff line change
Expand Up @@ -352,6 +352,8 @@ func generateResourceFile(res *spec.LinuxResources) (string, []string, error) {
return f.Name(), flags, nil
}

var errSendingSignal = errors.New("sending signal to container")

// KillContainer sends the given signal to the given container.
// If all is set, send to all PIDs in the container.
// All is only supported if the container created cgroups.
Expand All @@ -370,15 +372,7 @@ func (r *ConmonOCIRuntime) KillContainer(ctr *Container, signal uint, all bool)
args = append(args, "kill", ctr.ID(), fmt.Sprintf("%d", signal))
}
if err := utils.ExecCmdWithStdStreams(os.Stdin, os.Stdout, os.Stderr, env, r.path, args...); err != nil {
// Update container state - there's a chance we failed because
// the container exited in the meantime.
if err2 := r.UpdateContainerStatus(ctr); err2 != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't do this, this is going to break things further up the stack re: Sigproxy. The update needs to remain here.

logrus.Infof("Error updating status for container %s: %v", ctr.ID(), err2)
}
if ctr.ensureState(define.ContainerStateStopped, define.ContainerStateExited) {
return fmt.Errorf("%w: %s", define.ErrCtrStateInvalid, ctr.state.State)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also going to break SigProxy. We need this to remain ErrCtrStateInvalid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate a bit more? At least for kill and stop, the container is not locked, so we should not fiddle with the state.

Maybe we need to make this conditional?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container is locked during kill.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see only one place where KillContainer is run unlocked. I'll fix that.

We should not be touching KillContainer as such.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It's unlocked during stop but not kill.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#16323 catches the last case where KillContainer is run unlocked

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see only one place where KillContainer is run unlocked. I'll fix that.

So you will fix #16142?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

}
return fmt.Errorf("sending signal to container %s: %w", ctr.ID(), err)
return fmt.Errorf("%w %s: %v", errSendingSignal, ctr.ID(), err)
}

return nil
Expand Down Expand Up @@ -407,15 +401,14 @@ func (r *ConmonOCIRuntime) StopContainer(ctr *Container, timeout uint, all bool)

if timeout > 0 {
if err := r.KillContainer(ctr, stopSignal, all); err != nil {
// Is the container gone?
// If so, it probably died between the first check and
// our sending the signal
// The container is stopped, so exit cleanly
err := unix.Kill(ctr.state.PID, 0)
if err == unix.ESRCH {
if !errors.Is(err, errSendingSignal) {
return err
}
// Maybe sending the signal has failed because the
// container is already gone.
if goneErr := unix.Kill(ctr.state.PID, 0); goneErr == unix.ESRCH {
return nil
}

return err
}

Expand Down