Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

virtcontainers: Properly remove the container when shim gets killed #275

Merged
merged 1 commit into from
Apr 30, 2018

Conversation

sboeuf
Copy link

@sboeuf sboeuf commented Apr 28, 2018

Here is an interesting case I have been debugging. I was trying to
understand why a "kubeadm reset" was not working for kata-runtime
compared to runc. In this case, the only pod started with Kata is
the kube-dns pod. For some reasons, when this pod is stopped and
removed, its containers receive some signals, 2 of them being SIGTERM
signals, which seems the way to properly stop them, but the third
container receives a SIGCONT. Obviously, nothing happens in this
case, but apparently CRI-O considers this should be the end of the
container and after a few seconds, it kills the container process
(being the shim in Kata case). Because it is using a SIGKILL, the
signal does not get forwarded to the agent because the shim itself
is killed right away. After this happened, CRI-O calls into
"kata-runtime state", we detect the shim is not running anymore
and we try to stop the container. The code will eventually call
into agent.RemoveContainer(), but this will fail and return an
error because inside the agent, the container is still running.

The approach to solve this issue here is to send a SIGKILL signal
to the container after the shim has been waited for. This call does
not check for the error returned because most of the cases, regular
use cases, will end up returning an error because the shim itself
not being there actually represents the container inside the VM has
already terminated.
And in case the shim has been killed without the possibility to
forward the signal (like described in first paragraph), the SIGKILL
will work and will allow the following call to agent.stopContainer()
to proceed to the removal of the container inside the agent.

Fixes #274

Signed-off-by: Sebastien Boeuf [email protected]

Here is an interesting case I have been debugging. I was trying to
understand why a "kubeadm reset" was not working for kata-runtime
compared to runc. In this case, the only pod started with Kata is
the kube-dns pod. For some reasons, when this pod is stopped and
removed, its containers receive some signals, 2 of them being SIGTERM
signals, which seems the way to properly stop them, but the third
container receives a SIGCONT. Obviously, nothing happens in this
case, but apparently CRI-O considers this should be the end of the
container and after a few seconds, it kills the container process
(being the shim in Kata case). Because it is using a SIGKILL, the
signal does not get forwarded to the agent because the shim itself
is killed right away. After this happened, CRI-O calls into
"kata-runtime state", we detect the shim is not running anymore
and we try to stop the container. The code will eventually call
into agent.RemoveContainer(), but this will fail and return an
error because inside the agent, the container is still running.

The approach to solve this issue here is to send a SIGKILL signal
to the container after the shim has been waited for. This call does
not check for the error returned because most of the cases, regular
use cases, will end up returning an error because the shim itself
not being there actually represents the container inside the VM has
already terminated.
And in case the shim has been killed without the possibility to
forward the signal (like described in first paragraph), the SIGKILL
will work and will allow the following call to agent.stopContainer()
to proceed to the removal of the container inside the agent.

Fixes kata-containers#274

Signed-off-by: Sebastien Boeuf <[email protected]>
@sboeuf sboeuf added the review label Apr 28, 2018
@sboeuf
Copy link
Author

sboeuf commented Apr 28, 2018

@sboeuf
Copy link
Author

sboeuf commented Apr 28, 2018

@chavafg if you try this out, you should be able to properly reset a k8s cluster with kubeadm reset. Given that you have your agent patched with kata-containers/agent#225.

@codecov
Copy link

codecov bot commented Apr 28, 2018

Codecov Report

Merging #275 into master will increase coverage by 0.14%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #275      +/-   ##
=========================================
+ Coverage   65.35%   65.5%   +0.14%     
=========================================
  Files          74      74              
  Lines        7892    7893       +1     
=========================================
+ Hits         5158    5170      +12     
+ Misses       2180    2167      -13     
- Partials      554     556       +2
Impacted Files Coverage Δ
virtcontainers/container.go 47.08% <100%> (+0.14%) ⬆️
virtcontainers/hyperstart_agent.go 61% <0%> (+2.63%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4225ed...789dbca. Read the comment docs.

Copy link

@devimc devimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@grahamwhaley
Copy link
Contributor

@sboeuf Did you figure out exactly why we see the SIGCONT - and what actually happens in the runc case? I didn't see an explanation of if that flow is something different, and for good reason, and if we are exactly emulating that flow or just working around something?
This old k8s Issue maybe looks relevant? kubernetes/kubernetes#33050

I just want to make sure we understand what we are working around and why, and not making a fix that may not hold if other k8s things change in the future etc.

@sboeuf
Copy link
Author

sboeuf commented Apr 30, 2018

@grahamwhaley I 100% agree we should also investigate why the SIGCONT does not work. But the issue highlighted because the SIGCONT didn't terminate the container is also a valid issue, solved by this PR and explained by #274.
A separate issue for SIGCONT seems appropriate in this case, don't you think ?

Copy link
Contributor

@grahamwhaley grahamwhaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That must be the highest commit message-to-lines-changed ratio so far for the project - which is not a bad thing ;-)
lgtm

@grahamwhaley
Copy link
Contributor

Agreed @sboeuf - if we can open an Issue to note we should figure out the oddity of workflow with the kube-dns pod and signals, that would be great.

@grahamwhaley grahamwhaley merged commit f92d7dd into kata-containers:master Apr 30, 2018
@sboeuf
Copy link
Author

sboeuf commented Apr 30, 2018

@grahamwhaley I have just opened #276 to track this down !

@chavafg
Copy link
Contributor

chavafg commented May 2, 2018

I am late on this, but yes it works, now the jenkins jobs don't have the errors when running the kubeadm reset. Thanks @sboeuf

@sboeuf
Copy link
Author

sboeuf commented May 2, 2018

@chavafg thanks for confirmation!

zklei pushed a commit to zklei/runtime that referenced this pull request Jun 13, 2019
jcvenegas added a commit to jcvenegas/runtime that referenced this pull request Aug 6, 2020
Highlights for cloud-hypervisor version 0.9.0 include:
virtiofs updates to new dax implementation based in qemu 5.0
Fixed random issues caused due to seccomp filters

io_uring Based Block Device Support

If the io_uring feature is enabled and the host kernel supports it then io_uring will be used for block devices. This results a very significant performance improvement.
Block and Network Device Statistics

Statistics for activity of the virtio network and block devices is now exposed through a new vm.counters HTTP API entry point. These take the form of simple counters which can be used to observe the activity of the VM.
HTTP API Responses

The HTTP API for adding devices now responds with the name that was assigned to the device as well the PCI BDF.
CPU Topology

A topology parameter has been added to --cpus which allows the configuration of the guest CPU topology allowing the user to specify the numbers of sockets, packages per socket, cores per package and threads per core.
Release Build Optimization

Our release build is now built with LTO (Link Time Optimization) which results in a ~20% reduction in the binary size.
Hypervisor Abstraction

A new abstraction has been introduced, in the form of a hypervisor crate so as to enable the support of additional hypervisors beyond KVM.
Snapshot/Restore Improvements

Multiple improvements have been made to the VM snapshot/restore support that was added in the last release. This includes persisting more vCPU state and in particular preserving the guest paravirtualized clock in order to avoid vCPU hangs inside the guest when running with multiple vCPUs.
Virtio Memory Ballooning Support

A virtio-balloon device has been added, controlled through the resize control, which allows the reclamation of host memory by resizing a memory balloon inside the guest.
Enhancements to ARM64 Support

The ARM64 support introduced in the last release has been further enhanced with support for using PCI for exposing devices into the guest as well as multiple bug fixes. It also now supports using an initramfs when booting.
Intel SGX Support

The guest can now use Intel SGX if the host supports it. Details can be found in the dedicated SGX documentation.
Seccomp Sandbox Improvements

The most frequently used virtio devices are now isolated with their own seccomp filters. It is also now possible to pass --seccomp=log which result in the logging of requests that would have otherwise been denied to further aid development.
Notable Bug Fixes

    Our virtio-vsock implementation has been resynced with the implementation from Firecracker and includes multiple bug fixes.
    CPU hotplug has been fixed so that it is now possible to add, remove, and re-add vCPUs (kata-containers#1338)
    A workaround is now in place for when KVM reports MSRs available MSRs that are in fact unreadable preventing snapshot/restore from working correctly (kata-containers#1543).
    virtio-mmio based devices are now more widely tested (kata-containers#275).
    Multiple issues have been fixed with virtio device configuration (kata-containers#1217)
    Console input was wrongly consumed by both virtio-console and the serial. (kata-containers#1521)

Fixes: kata-containers#2864

Signed-off-by: Jose Carlos Venegas Munoz <[email protected]>
jcvenegas added a commit to jcvenegas/runtime that referenced this pull request Aug 11, 2020
Highlights for cloud-hypervisor version 0.9.0 include:
virtiofs updates to new dax implementation based in qemu 5.0
Fixed random issues caused due to seccomp filters

io_uring Based Block Device Support

If the io_uring feature is enabled and the host kernel supports it then io_uring will be used for block devices. This results a very significant performance improvement.
Block and Network Device Statistics

Statistics for activity of the virtio network and block devices is now exposed through a new vm.counters HTTP API entry point. These take the form of simple counters which can be used to observe the activity of the VM.
HTTP API Responses

The HTTP API for adding devices now responds with the name that was assigned to the device as well the PCI BDF.
CPU Topology

A topology parameter has been added to --cpus which allows the configuration of the guest CPU topology allowing the user to specify the numbers of sockets, packages per socket, cores per package and threads per core.
Release Build Optimization

Our release build is now built with LTO (Link Time Optimization) which results in a ~20% reduction in the binary size.
Hypervisor Abstraction

A new abstraction has been introduced, in the form of a hypervisor crate so as to enable the support of additional hypervisors beyond KVM.
Snapshot/Restore Improvements

Multiple improvements have been made to the VM snapshot/restore support that was added in the last release. This includes persisting more vCPU state and in particular preserving the guest paravirtualized clock in order to avoid vCPU hangs inside the guest when running with multiple vCPUs.
Virtio Memory Ballooning Support

A virtio-balloon device has been added, controlled through the resize control, which allows the reclamation of host memory by resizing a memory balloon inside the guest.
Enhancements to ARM64 Support

The ARM64 support introduced in the last release has been further enhanced with support for using PCI for exposing devices into the guest as well as multiple bug fixes. It also now supports using an initramfs when booting.
Intel SGX Support

The guest can now use Intel SGX if the host supports it. Details can be found in the dedicated SGX documentation.
Seccomp Sandbox Improvements

The most frequently used virtio devices are now isolated with their own seccomp filters. It is also now possible to pass --seccomp=log which result in the logging of requests that would have otherwise been denied to further aid development.
Notable Bug Fixes

    Our virtio-vsock implementation has been resynced with the implementation from Firecracker and includes multiple bug fixes.
    CPU hotplug has been fixed so that it is now possible to add, remove, and re-add vCPUs (kata-containers#1338)
    A workaround is now in place for when KVM reports MSRs available MSRs that are in fact unreadable preventing snapshot/restore from working correctly (kata-containers#1543).
    virtio-mmio based devices are now more widely tested (kata-containers#275).
    Multiple issues have been fixed with virtio device configuration (kata-containers#1217)
    Console input was wrongly consumed by both virtio-console and the serial. (kata-containers#1521)

Fixes: kata-containers#2864

Signed-off-by: Jose Carlos Venegas Munoz <[email protected]>
jcvenegas added a commit to jcvenegas/runtime that referenced this pull request Aug 11, 2020
Highlights for cloud-hypervisor version 0.9.0 include:
virtiofs updates to new dax implementation based in qemu 5.0
Fixed random issues caused due to seccomp filters

io_uring Based Block Device Support

If the io_uring feature is enabled and the host kernel supports it then io_uring will be used for block devices. This results a very significant performance improvement.
Block and Network Device Statistics

Statistics for activity of the virtio network and block devices is now exposed through a new vm.counters HTTP API entry point. These take the form of simple counters which can be used to observe the activity of the VM.
HTTP API Responses

The HTTP API for adding devices now responds with the name that was assigned to the device as well the PCI BDF.
CPU Topology

A topology parameter has been added to --cpus which allows the configuration of the guest CPU topology allowing the user to specify the numbers of sockets, packages per socket, cores per package and threads per core.
Release Build Optimization

Our release build is now built with LTO (Link Time Optimization) which results in a ~20% reduction in the binary size.
Hypervisor Abstraction

A new abstraction has been introduced, in the form of a hypervisor crate so as to enable the support of additional hypervisors beyond KVM.
Snapshot/Restore Improvements

Multiple improvements have been made to the VM snapshot/restore support that was added in the last release. This includes persisting more vCPU state and in particular preserving the guest paravirtualized clock in order to avoid vCPU hangs inside the guest when running with multiple vCPUs.
Virtio Memory Ballooning Support

A virtio-balloon device has been added, controlled through the resize control, which allows the reclamation of host memory by resizing a memory balloon inside the guest.
Enhancements to ARM64 Support

The ARM64 support introduced in the last release has been further enhanced with support for using PCI for exposing devices into the guest as well as multiple bug fixes. It also now supports using an initramfs when booting.
Intel SGX Support

The guest can now use Intel SGX if the host supports it. Details can be found in the dedicated SGX documentation.
Seccomp Sandbox Improvements

The most frequently used virtio devices are now isolated with their own seccomp filters. It is also now possible to pass --seccomp=log which result in the logging of requests that would have otherwise been denied to further aid development.
Notable Bug Fixes

    Our virtio-vsock implementation has been resynced with the implementation from Firecracker and includes multiple bug fixes.
    CPU hotplug has been fixed so that it is now possible to add, remove, and re-add vCPUs (kata-containers#1338)
    A workaround is now in place for when KVM reports MSRs available MSRs that are in fact unreadable preventing snapshot/restore from working correctly (kata-containers#1543).
    virtio-mmio based devices are now more widely tested (kata-containers#275).
    Multiple issues have been fixed with virtio device configuration (kata-containers#1217)
    Console input was wrongly consumed by both virtio-console and the serial. (kata-containers#1521)

Fixes: kata-containers#2864

Signed-off-by: Jose Carlos Venegas Munoz <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants