Job keeps retrying: Yarn container exit handler: EXIT signal received in yarn container, performing clean up action...
#1793
Comments
From the log, it seems that the Docker container was killed for some reason.
Most likely the container exceeded its resource limit and was then killed by the OS.
Took a quick look: it was caused by the job exceeding its memory limit. Similar to #1089.
@Gerhut, maybe you could test your fix with this job; it comes from an end user.
Is it possible to expose the log to users?
Not so easy. We run job container with the |
Currently, it is the Docker container itself that gets killed; the exit code is 137 (SIGKILL),
It is confirmed that the |
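For reference, the exit code 137 mentioned above follows the usual convention of 128 plus the signal number, so 137 - 128 = 9, which is SIGKILL. A minimal sketch of decoding such codes is below; `describe_exit_code` is a hypothetical helper, not part of this project. Note that 137 alone only says the process received SIGKILL; it does not prove the kill came from the memory limit.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short description.

    Assumes the common "128 + signal number" convention for processes
    terminated by a signal (hypothetical helper for illustration).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```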
Adding @qfyin for reference.
In addition to exposing logs for this issue, we should also consider reporting such failures to the IT admin. But that has a lower priority than diagnosability.
@hao1939, maybe we can remove the "--rm" flag and run a separate "docker rm" command after collecting the log?
So we will need to carefully design the container cleanup logic and make sure we get the info before the container is cleaned up.
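A rough sketch of the ordering discussed here (wait for the container, record its state and logs, and only then remove it) might look like the following. It assumes the container was started without automatic removal and that the `docker` CLI is on PATH; `run_docker` and `collect_then_remove` are hypothetical names for illustration, not the actual fix.

```python
import subprocess
from typing import Callable, Dict, List

Runner = Callable[[List[str]], str]


def run_docker(args: List[str]) -> str:
    """Run a docker CLI command and return its combined output.

    stderr is merged into stdout because `docker logs` replays the
    container's stderr stream on its own stderr.
    """
    result = subprocess.run(
        ["docker", *args],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def collect_then_remove(container: str, run: Runner = run_docker) -> Dict[str, object]:
    """Gather diagnostics from a finished container, then remove it."""
    # Block until the container stops; `docker wait` prints its exit code.
    exit_code = int(run(["wait", container]))
    # Ask Docker whether the kernel OOM killer terminated the container.
    oom_flag = run(["inspect", "--format", "{{.State.OOMKilled}}", container])
    # Keep the tail of the log so it can be shown to the user later.
    logs = run(["logs", "--tail", "200", container])
    # Only now is it safe to delete the container.
    run(["rm", container])
    return {
        "exit_code": exit_code,
        "oom_killed": oom_flag == "true",
        "logs": logs,
    }
```

The runner is passed in as a parameter so the ordering can be checked without a Docker daemon. A real cleanup path would also need to handle failures between the steps, for example a container that has already been removed.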
Fixed by #1108
Please help figure out the cause of the retries.
http://*.239/view.html?username=core&jobName=debug-005-one-gpu

//changing the ip to a dummy value