virtcontainers: Properly remove the container when shim gets killed #275
Conversation
Here is an interesting case I have been debugging. I was trying to understand why a "kubeadm reset" was not working for kata-runtime compared to runc. In this case, the only pod started with Kata is the kube-dns pod. For some reason, when this pod is stopped and removed, its containers receive several signals: two of them are SIGTERM signals, which seems to be the proper way to stop them, but the third container receives a SIGCONT. Obviously nothing happens in that case, but CRI-O apparently considers this to be the end of the container, and after a few seconds it kills the container process (the shim, in the Kata case). Because it uses a SIGKILL, the signal does not get forwarded to the agent, since the shim itself is killed right away. After this happens, CRI-O calls into "kata-runtime state", we detect the shim is not running anymore, and we try to stop the container. The code eventually calls into agent.RemoveContainer(), but this fails and returns an error because, inside the agent, the container is still running.

The approach taken here is to send a SIGKILL signal to the container after the shim has been waited for. This call does not check the returned error, because in most regular use cases an error is expected: the shim not being there usually means the container inside the VM has already terminated.

In case the shim was killed without the possibility to forward the signal (as described in the first paragraph), the SIGKILL will work and allow the subsequent call to agent.stopContainer() to proceed with the removal of the container inside the agent.

Fixes kata-containers#274

Signed-off-by: Sebastien Boeuf <[email protected]>
@chavafg if you try this out, you should be able to properly reset a k8s cluster with `kubeadm reset`.
Codecov Report
@@ Coverage Diff @@
## master #275 +/- ##
=========================================
+ Coverage 65.35% 65.5% +0.14%
=========================================
Files 74 74
Lines 7892 7893 +1
=========================================
+ Hits 5158 5170 +12
+ Misses 2180 2167 -13
- Partials 554 556 +2
Continue to review full report at Codecov.
lgtm
@sboeuf Did you figure out exactly why we see the SIGCONT - and what actually happens in that case? I just want to make sure we understand what we are working around and why, and that we are not making a fix that may not hold if other k8s things change in the future etc.
@grahamwhaley I 100% agree we should also investigate why the SIGCONT does not work. But the issue that surfaced because the SIGCONT didn't terminate the container is also a valid one, solved by this PR and explained in #274.
That must be the highest commit message-to-lines-changed ratio so far for the project - which is not a bad thing ;-)
lgtm
Agreed @sboeuf - if we can open an Issue to note we should figure out the oddity of the workflow with the SIGCONT.
@grahamwhaley I have just opened #276 to track this down!
I am late on this, but yes, it works: the Jenkins jobs no longer show the errors when running `kubeadm reset`.
@chavafg thanks for the confirmation!