
Job keeps retrying: Yarn container exit handler: EXIT signal received in yarn container, performing clean up action... #1793

Closed
hao1939 opened this issue Nov 29, 2018 · 12 comments

@hao1939
Contributor

hao1939 commented Nov 29, 2018

Please help figure out the cause of the retry.

http://*.239/view.html?username=core&jobName=debug-005-one-gpu
[screenshot: job detail page showing the retried task]

// the IP has been replaced with a dummy value

@mzmssg
Member

mzmssg commented Nov 29, 2018

From the log, it seems that the docker container was killed for some reason.
http://*.239/yarn/10.0.0.4:8188/applicationhistory/logs/10.0.0.14:8041/container_1542939248333_0648_01_000002/container_1542939248333_0648_01_000002/core
// the IP has been replaced with a dummy value

+[09:19:42] echo '[DEBUG] Yarn container exit handler: trying to kill docker container core-debug-005-one-gpu-container_1542939248333_0648_01_000002'
++[09:19:42] docker inspect '--format={{.State.Pid}}' core-debug-005-one-gpu-container_1542939248333_0648_01_000002
+[09:19:42] pid=
+[09:19:42] '[' ']'
+[09:19:42] debug_log 'Yarn container exit handler' 'docker container core-debug-005-one-gpu-container_1542939248333_0648_01_000002 has already exited'

I think the most likely cause is that the container consumed more resources than its limit and was then killed by the OS.
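
A quick way to confirm this hypothesis, assuming the stopped container has not been removed yet (i.e. it was not started with --rm), is to look at the exit code and OOM flag that docker records: an OOM-killed container typically shows exit code 137 and .State.OOMKilled=true. A minimal sketch, using the container name from the log above:

```bash
# Hedged check: only works while the stopped container still exists.
docker inspect \
  --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' \
  core-debug-005-one-gpu-container_1542939248333_0648_01_000002
```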

@hao1939
Contributor Author

hao1939 commented Nov 29, 2018

Took a quick check; it was caused by the job exceeding its memory limit. Similar to #1089.
Here are some system logs:

core@next-a-gpu-0006:~$ journalctl -k | grep -i -e memory -e oom|grep 'cgroup out of memory'
Nov 28 22:07:55 next-a-gpu-0006 kernel: Memory cgroup out of memory: Kill process 16590 (python3) score 1973 or sacrifice child
Nov 28 22:07:55 next-a-gpu-0006 kernel: Memory cgroup out of memory: Kill process 16590 (python3) score 1973 or sacrifice child
Nov 28 22:07:55 next-a-gpu-0006 kernel: Memory cgroup out of memory: Kill process 16590 (python3) score 1973 or sacrifice child
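
For a container that is still running, the cgroup memory limit and failure counter can also confirm that the limit is being hit. This is only a rough sketch assuming cgroup v1 with docker's cgroupfs driver (the path differs under the systemd driver or cgroup v2); the container name is the one from this job.

```bash
# Hedged sketch: read the memory limit and how many times it was hit
# for a running container (cgroup v1, cgroupfs driver assumed).
CID=$(docker inspect --format '{{.Id}}' \
  core-debug-005-one-gpu-container_1542939248333_0648_01_000002)
cat "/sys/fs/cgroup/memory/docker/${CID}/memory.limit_in_bytes"  # configured limit
cat "/sys/fs/cgroup/memory/docker/${CID}/memory.failcnt"         # times the limit was hit
```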

@hao1939
Contributor Author

hao1939 commented Nov 29, 2018

@Gerhut, maybe you could test your fix with this job; it's from an end user.

@fanyangCS
Copy link
Contributor

Is it possible to expose the log to users?

@hao1939
Contributor Author

hao1939 commented Nov 30, 2018

Not so easy.

We run the job container with the --rm option, so we can't get its status with docker inspect after it exits.
Those logs come from the system log.
@Gerhut was trying to collect such info from the system log on the node, but the PR hasn't been merged yet.

@Gerhut
Member

Gerhut commented Nov 30, 2018

Currently it is the docker container itself that gets killed: the exit code is 137 (SIGKILL) and .State.OOMKilled is true, but no "cgroup out of memory" entry can be found in dmesg. Will do more investigation.

@Gerhut
Member

Gerhut commented Dec 3, 2018

It is confirmed that the python3 process is killed by the system, while .State.Pid yields the PID of bash, which is the parent process of python3.
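
For reference, this relationship can be seen on a live container: .State.Pid is the container's top-level process (bash here), and the actual workload (python3) runs as its child, which is why only the child shows up in the kernel OOM log. A minimal sketch, reusing the container name from the logs above:

```bash
# .State.Pid gives the container's init process (bash); its children
# are the real workload, e.g. python3.
PID=$(docker inspect --format '{{.State.Pid}}' \
  core-debug-005-one-gpu-container_1542939248333_0648_01_000002)
ps --ppid "$PID" -o pid,comm   # lists child processes such as python3
```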

@scarlett2018
Member

Adding @qfyin for reference.

@scarlett2018
Member

In addition to exposing the logs to users for this issue, we should also consider reporting it to the IT admin. But that has a lower priority than diagnosability.

@fanyangCS
Contributor

@hao1939, maybe we can remove the "--rm" flag and run a separate docker rm command after collecting the log?
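
Something along these lines, sketched with illustrative placeholder names (not the actual runtime scripts): run without --rm, wait for the container to stop, capture its state, then remove it explicitly.

```bash
# Hedged sketch of the proposed flow; CONTAINER_NAME, IMAGE and CMD are placeholders.
docker run -d --name "$CONTAINER_NAME" "$IMAGE" $CMD       # note: no --rm
docker wait "$CONTAINER_NAME"                               # blocks until the container exits
docker inspect \
  --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' \
  "$CONTAINER_NAME"                                         # collect status for the log
docker rm "$CONTAINER_NAME"                                 # explicit cleanup afterwards
```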

@hao1939
Contributor Author

hao1939 commented Jan 22, 2019

So we will need to carefully design the container cleanup logic to make sure we collect the info before the container is cleaned up.
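
For example, the exit handler could snapshot the container state before removing it. This is only a hedged sketch of the idea, not the actual PAI handler; CONTAINER_NAME and LOG_DIR are placeholders.

```bash
# Sketch: capture docker's view of the container before cleanup.
cleanup() {
  docker inspect --format '{{json .State}}' "$CONTAINER_NAME" \
    > "$LOG_DIR/docker_state.json" 2>/dev/null || true
  docker rm -f "$CONTAINER_NAME" 2>/dev/null || true
}
trap cleanup EXIT
```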

@Gerhut
Member

Gerhut commented Mar 4, 2019

Fixed by #1108
