OOM Kill not being propagated #2150
Hi @awprice
@kata-containers/community, if we do want to add the feature of propagating the OOMEvent from the guest to containerd/cri, there are two ways in my mind: the first is changing the agent API by adding a field such as exitReason to WaitProcessResponse and letting it carry the OOMEvent back from the guest; the second is adding an agent API such as monitorEvent() to monitor a specific container's OOMEvents synchronously. WDYT?
This might introduce an API compatibility issue (noted here: kata-containers/agent#684).
Hi @awprice On further consideration, since not all OOM events result in a container exit, I think the best way is to add an agent API such as monitorEvents() to monitor guest events synchronously; kata shimv2 can call this API once it starts the sandbox. Just as Xu said, this will introduce an API compatibility issue between the runtime and the agent. Do you think that's a serious impact or not?
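For concreteness, here is a rough sketch of what that second option could look like from the runtime side, assuming a hypothetical unary GetOOMEvent call is added to the agent protocol. All names below (OOMEvent, agentClient, GetOOMEvent) are illustrative, not the final agent API:

```go
package main

import (
	"context"
	"fmt"
)

// OOMEvent is an illustrative stand-in for whatever message the agent
// would return; the real addition would live in agent.proto.
type OOMEvent struct {
	ContainerID string
}

// agentClient models the proposed addition: a blocking, unary call that
// returns the next OOM event observed in the guest (no gRPC streaming).
type agentClient interface {
	GetOOMEvent(ctx context.Context) (*OOMEvent, error)
}

// monitorOOMEvents is the loop the shim could run right after starting the
// sandbox: keep asking the agent for the next event and hand each container
// ID to a callback that propagates it towards containerd/cri.
func monitorOOMEvents(ctx context.Context, agent agentClient, notify func(containerID string)) {
	for {
		ev, err := agent.GetOOMEvent(ctx)
		if err != nil {
			fmt.Printf("stopping OOM monitor: %v\n", err)
			return
		}
		notify(ev.ContainerID)
	}
}

// fakeAgent returns a single event and then an error, just so the sketch runs.
type fakeAgent struct{ sent bool }

func (f *fakeAgent) GetOOMEvent(ctx context.Context) (*OOMEvent, error) {
	if f.sent {
		return nil, fmt.Errorf("no more events")
	}
	f.sent = true
	return &OOMEvent{ContainerID: "container-abc"}, nil
}

func main() {
	monitorOOMEvents(context.Background(), &fakeAgent{}, func(id string) {
		fmt.Println("OOM killed container:", id)
	})
}
```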
@lifupan thanks for the update. Regarding the API compatibility issue, could the runtime's event goroutine be designed to gracefully handle the agent being older and not having the monitorEvent endpoint?
@awprice Hmm, yeah, in that case we can let the runtime ignore the error returned by calling the monitor API and just output a log line, so a newer runtime can still work well with an old agent.
@lifupan Sounds like a good approach to me.
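Assuming the 1.x agent protocol stays on gRPC, that "log and carry on" behaviour could be as simple as checking for an Unimplemented status from an older agent before deciding whether to keep the monitoring goroutine alive. A minimal sketch, not the actual runtime code:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// keepMonitoringOOM decides whether the shim's OOM-monitoring goroutine
// should stay alive. An old agent that predates the new RPC answers with
// codes.Unimplemented, which we treat as "feature not supported" and log,
// rather than as a fatal error for the sandbox.
func keepMonitoringOOM(err error) bool {
	if err == nil {
		return true
	}
	if status.Code(err) == codes.Unimplemented {
		fmt.Println("agent does not support OOM event monitoring; continuing without it")
		return false
	}
	// Other errors are left to the caller (retry, back off, etc.).
	return true
}

func main() {
	// Simulate the response an older agent would produce.
	oldAgentErr := status.Error(codes.Unimplemented, "unknown method GetOOMEvent")
	fmt.Println("keep monitoring:", keepMonitoringOOM(oldAgentErr))
}
```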
@lifupan I'm happy to take this ticket on - I've looked at how the OOM events are being propagated in runc and have a good plan of attack for implementing this. A couple of questions:
That's great. Thanks!
In my opinion I don't recommend using streaming; I'd like to keep the protocol as simple as possible, and I may want to replace gRPC with ttRPC between the agent client and agent server, which doesn't support the streaming feature.
For the Rust agent, we are trying to enable all of the CI to run against it; once all of the CI is ready and passing, I think it'll be a candidate for the default agent.
@awprice, I agree with @lifupan. In neither the Kata agent protocol nor the K8s CRI do we use gRPC streaming, and ttRPC from containerd (the RPC underlying shim-v2) doesn't support streaming at all. If we used ttRPC instead of gRPC for the agent protocol, we could reduce the anonymous page memory consumption of the rust-agent to less than 1MB. However, I think in 1.x we should keep gRPC for compatibility, though we may have the chance to promote the rust-agent as the default. So far I think the rust-agent is in good shape, but we still need more CI for it.
cool, thanks @awprice
I am working on collecting some detailed information related to the pod. It means I'll add an API to agent.proto too.
Thanks @yyyeerbo! Are you able to provide the patch you are working on? The only addition I need to make to the API is a new function to retrieve the latest OOM events; maybe this could be included in the pod information as an additional field?
I think you can carry on. As far as I can see, the two kinds of information are for different purposes, and we will use different ways to populate them externally. I cannot post my patch because it needs to go through an open-source review procedure. :(
No worries @yyyeerbo I will forge ahead. |
After starting the sandbox the shim will start polling for oom events from the agent. Propagates oom events through to containerd/cri. fixes kata-containers#2150 Signed-off-by: Alex Price <[email protected]>
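A rough sketch of the containerd-facing half of that change, assuming containerd's TaskOOM event type and its /tasks/oom topic; the publisher interface below is a placeholder for whatever event-forwarding mechanism the shim already has, not the real shim-v2 API:

```go
package main

import (
	"context"
	"fmt"

	eventstypes "github.com/containerd/containerd/api/events"
)

// taskOOMTopic mirrors containerd's runtime.TaskOOMEventTopic.
const taskOOMTopic = "/tasks/oom"

// publisher is a placeholder for the shim's existing event publisher.
type publisher interface {
	Publish(ctx context.Context, topic string, event interface{}) error
}

// forwardOOMEvents drains container IDs reported by the agent-polling
// goroutine and republishes each one as a containerd TaskOOM event, which
// containerd/cri then surfaces to Kubernetes as the OOMKilled reason.
func forwardOOMEvents(ctx context.Context, oomKilled <-chan string, pub publisher) {
	for id := range oomKilled {
		ev := &eventstypes.TaskOOM{ContainerID: id}
		if err := pub.Publish(ctx, taskOOMTopic, ev); err != nil {
			fmt.Printf("failed to publish TaskOOM for %s: %v\n", id, err)
		}
	}
}

// logPublisher just prints events, so the sketch can run without containerd.
type logPublisher struct{}

func (logPublisher) Publish(_ context.Context, topic string, event interface{}) error {
	fmt.Printf("publish %s: %+v\n", topic, event)
	return nil
}

func main() {
	ch := make(chan string, 1)
	ch <- "container-abc"
	close(ch)
	forwardOOMEvents(context.Background(), ch, logPublisher{})
}
```

With something along these lines in place, describing the pod should show an OOMKilled reason for the container, matching the runc behaviour described below.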
Description of problem
We've noticed that when a process within a container inside the guest is OOM killed, the event is not being propagated properly through to Kubernetes.
The following pod spec can be used to replicate this:
Expected result
When using runc, the reason/status for exit is set properly:
and when we describe the pod:
Actual result
When using Kata, the reason/status for exit is wrong:
and when we describe the pod:
Notes
This seems tangentially related to the following issue: OOM event not supported by events cli #308

Show kata-collect-data.sh details
Meta details

Running kata-collect-data.sh version 1.8.3 (commit 6b55a537e5811f853c8ec381fa525fea440649aa) at 2019-10-29.01:44:13.795200214+0000.
Runtime is /opt/kata/bin/kata-runtime.

kata-env

Output of "/opt/kata/bin/kata-runtime kata-env": (output not captured)

Runtime config files

Runtime default config files
Runtime config file contents
Output of "cat "/etc/kata-containers/configuration.toml"": (output not captured)
Output of "cat "/opt/kata/share/defaults/kata-containers/configuration.toml"": (output not captured)
Config file /usr/share/defaults/kata-containers/configuration.toml not found

KSM throttler

version
Output of "--version": (output not captured)
systemd service

Image details

Initrd details

No initrd

Logfiles

Runtime logs
No recent runtime problems found in system journal.
Proxy logs
No recent proxy problems found in system journal.
Shim logs
No recent shim problems found in system journal.
Throttler logs
No recent throttler problems found in system journal.

Container manager details

Have docker

Docker
Output of "docker version": (output not captured)
Output of "docker info": (output not captured)
Output of "systemctl show docker": (output not captured)

No kubectl
No crio

Have containerd

containerd
Output of "containerd --version": (output not captured)
Output of "systemctl show containerd": (output not captured)
Output of "cat /etc/containerd/config.toml": (output not captured)

Packages

No dpkg
No rpm