
Job keeps retrying: Yarn container exit handler: EXIT signal received in yarn container, performing clean up action... #1793

Closed
hao1939 opened this issue Nov 29, 2018 · 12 comments

@hao1939
Contributor

hao1939 commented Nov 29, 2018

Please help figure out the cause of the retry.

http://*.239/view.html?username=core&jobName=debug-005-one-gpu
[screenshot: job detail page showing the retried task]

// the IP has been replaced with a dummy value

@mzmssg
Member

mzmssg commented Nov 29, 2018

From the log, it seems that the docker container was killed for some reason.
http://*.239/yarn/10.0.0.4:8188/applicationhistory/logs/10.0.0.14:8041/container_1542939248333_0648_01_000002/container_1542939248333_0648_01_000002/core
// the IP has been replaced with a dummy value

+[09:19:42] echo '[DEBUG] Yarn container exit handler: trying to kill docker container core-debug-005-one-gpu-container_1542939248333_0648_01_000002'
++[09:19:42] docker inspect '--format={{.State.Pid}}' core-debug-005-one-gpu-container_1542939248333_0648_01_000002
+[09:19:42] pid=
+[09:19:42] '[' ']'
+[09:19:42] debug_log 'Yarn container exit handler' 'docker container core-debug-005-one-gpu-container_1542939248333_0648_01_000002 has already exited'

I think the most likely cause is that the container consumed more resources than its limit and was then killed by the OS.
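
A quick way to confirm this hypothesis, assuming the stopped container has not been removed yet (i.e. it was not started with --rm), is to look at the exit code and OOM flag that docker records: an OOM-killed container typically shows exit code 137 and .State.OOMKilled=true. A minimal sketch, using the container name from the log above:

```bash
# Hedged check: only works while the stopped container still exists.
docker inspect \
  --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' \
  core-debug-005-one-gpu-container_1542939248333_0648_01_000002
```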

@hao1939
Contributor Author

hao1939 commented Nov 29, 2018

Took a quick check; it was caused by the job exceeding its memory limit. Similar to #1089.
Here are some system logs:

core@next-a-gpu-0006:~$ journalctl -k | grep -i -e memory -e oom|grep 'cgroup out of memory'
Nov 28 22:07:55 next-a-gpu-0006 kernel: Memory cgroup out of memory: Kill process 16590 (python3) score 1973 or sacrifice child
Nov 28 22:07:55 next-a-gpu-0006 kernel: Memory cgroup out of memory: Kill process 16590 (python3) score 1973 or sacrifice child
Nov 28 22:07:55 next-a-gpu-0006 kernel: Memory cgroup out of memory: Kill process 16590 (python3) score 1973 or sacrifice child
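
For a container that is still running, the cgroup memory limit and failure counter can also confirm that the limit is being hit. This is only a rough sketch assuming cgroup v1 with docker's cgroupfs driver (the path differs under the systemd driver or cgroup v2); the container name is the one from this job.

```bash
# Hedged sketch: read the memory limit and how many times it was hit
# for a running container (cgroup v1, cgroupfs driver assumed).
CID=$(docker inspect --format '{{.Id}}' \
  core-debug-005-one-gpu-container_1542939248333_0648_01_000002)
cat "/sys/fs/cgroup/memory/docker/${CID}/memory.limit_in_bytes"  # configured limit
cat "/sys/fs/cgroup/memory/docker/${CID}/memory.failcnt"         # times the limit was hit
```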

@hao1939
Contributor Author

hao1939 commented Nov 29, 2018

@Gerhut, maybe you could test your fix with this job; it's from an end user.

@fanyangCS
Copy link
Contributor

Is it possible to expose the log to users?

@hao1939
Contributor Author

hao1939 commented Nov 30, 2018

Not so easy.

We run the job container with the --rm option, so we can't get its status with docker inspect after it exits.
Those logs come from the system log.
@Gerhut was trying to collect such info from the system log on the node, but the PR hasn't been merged yet.

@Gerhut
Member

Gerhut commented Nov 30, 2018

Currently it is the docker container itself that gets killed: the exit code is 137 (SIGKILL) and .State.OOMKilled is true, but no "cgroup out of memory" entry can be found in dmesg. Will do more investigation.

@Gerhut
Member

Gerhut commented Dec 3, 2018

It is confirmed that the python3 process is killed by the system, while .State.Pid yields the PID of bash, which is the parent process of python3.
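
For reference, this relationship can be seen on a live container: .State.Pid is the container's top-level process (bash here), and the actual workload (python3) runs as its child, which is why only the child shows up in the kernel OOM log. A minimal sketch, reusing the container name from the logs above:

```bash
# .State.Pid gives the container's init process (bash); its children
# are the real workload, e.g. python3.
PID=$(docker inspect --format '{{.State.Pid}}' \
  core-debug-005-one-gpu-container_1542939248333_0648_01_000002)
ps --ppid "$PID" -o pid,comm   # lists child processes such as python3
```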

@scarlett2018
Member

Adding @qfyin for reference.

@scarlett2018
Member

In addition to exposing the logs to users for this issue, we should also consider reporting it to the IT admin. But that has a lower priority than diagnosability.

@fanyangCS
Contributor

@hao1939, maybe we can remove the "--rm" flag and run a separate docker rm command after collecting the log?
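
Something along these lines, sketched with illustrative placeholder names (not the actual runtime scripts): run without --rm, wait for the container to stop, capture its state, then remove it explicitly.

```bash
# Hedged sketch of the proposed flow; CONTAINER_NAME, IMAGE and CMD are placeholders.
docker run -d --name "$CONTAINER_NAME" "$IMAGE" $CMD       # note: no --rm
docker wait "$CONTAINER_NAME"                               # blocks until the container exits
docker inspect \
  --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' \
  "$CONTAINER_NAME"                                         # collect status for the log
docker rm "$CONTAINER_NAME"                                 # explicit cleanup afterwards
```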

@hao1939
Contributor Author

hao1939 commented Jan 22, 2019

So we will need to carefully design the container cleanup logic to make sure we collect the info before the container is cleaned up.
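
For example, the exit handler could snapshot the container state before removing it. This is only a hedged sketch of the idea, not the actual PAI handler; CONTAINER_NAME and LOG_DIR are placeholders.

```bash
# Sketch: capture docker's view of the container before cleanup.
cleanup() {
  docker inspect --format '{{json .State}}' "$CONTAINER_NAME" \
    > "$LOG_DIR/docker_state.json" 2>/dev/null || true
  docker rm -f "$CONTAINER_NAME" 2>/dev/null || true
}
trap cleanup EXIT
```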

@Gerhut
Member

Gerhut commented Mar 4, 2019

Fixed by #1108
