Memory limits are not respected by all Batch containers #131
The same error is present after running fmriprep on this dataset: https://openneuro.org/datasets/ds000006/versions/00001?app=FMRIPREP&version=18&job=16ab5f1e-cb72-4c21-bbb1-f1a062b4412a
Looks like some of these have gone missing because the OOM killer terminated the containers. #124 also means that full utilization leaves the host running very close to out of memory, which can cause bigger issues. @chrisfilo Do we want to allocate more memory per job for fmriprep so it does not hit its memory limit? So far this seems to have happened only to fmriprep.
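For context, here is a minimal sketch of what "more memory per job" could look like via the AWS Batch API with boto3: registering a new revision of the job definition with a larger memory reservation. The job definition name, image, and sizes are assumptions for illustration, not the platform's actual configuration.

```python
import boto3

batch = boto3.client("batch")

response = batch.register_job_definition(
    jobDefinitionName="fmriprep",                 # illustrative name, not the real job definition
    type="container",
    containerProperties={
        "image": "poldracklab/fmriprep:latest",   # illustrative image tag
        "vcpus": 4,
        "memory": 20480,                          # MiB; bumped above the current ~14 GB limit
    },
)
print(response["jobDefinitionArn"], "revision", response["revision"])
```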
Have you seen evidence of the OOM issues in the logs of failed jobs? I'm asking because I did not expect FMRIPREP to run out of memory with 14 GB. Is there a way to limit the memory the docker container gets, so that the docker daemon always has some memory to spare and can continue to work? That would help with debugging situations like this.
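As a rough sketch of that idea (not the platform's actual code), the Docker SDK for Python can set a hard memory limit per container so the daemon and OS keep some headroom. The image, host size, and headroom below are assumptions.

```python
import docker

HOST_MEMORY_MB = 16384        # assumed total memory on the host
DAEMON_HEADROOM_MB = 2048     # assumed reserve for dockerd and the OS

client = docker.from_env()
container = client.containers.run(
    "poldracklab/fmriprep:latest",                # illustrative image
    command=["/data", "/out", "participant"],     # illustrative arguments
    mem_limit=f"{HOST_MEMORY_MB - DAEMON_HEADROOM_MB}m",      # hard cap for the container
    memswap_limit=f"{HOST_MEMORY_MB - DAEMON_HEADROOM_MB}m",  # no extra swap beyond the cap
    detach=True,
)
print(container.short_id)
```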
I'm seeing the OOM killer in the kernel logs killing fmriprep containers. What is causing the log failures and the other job failures is a secondary problem: the docker service has gotten stuck or crashed on some hosts. I'm still investigating that, but in the meantime I created a new compute environment and removed the current one from the pool. This will start up new hosts rather than continue scheduling work on the bad ones.

There is some buffer for the host built into the limits, and the app limit is a hard limit for the container, but I've found an issue with how we allocate this limit. Since we launch some secondary containers, those do not have the limit applied. They don't use much memory, but this does push total usage slightly above what ECS has allocated.
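A minimal sketch of the fix described above, assuming the containers are launched through the Docker SDK for Python: give the secondary (helper) containers their own small cap so the sum of all per-container limits stays within the memory ECS has allocated for the job. Names and sizes are illustrative assumptions.

```python
import docker

ECS_ALLOCATION_MB = 14336     # assumed memory ECS reserves for the job
HELPER_LIMIT_MB = 256         # assumed cap for each secondary container
APP_LIMIT_MB = ECS_ALLOCATION_MB - 2 * HELPER_LIMIT_MB  # leave room for two helpers

client = docker.from_env()

# Main analysis container keeps a hard limit, as before.
app = client.containers.run(
    "poldracklab/fmriprep:latest",     # illustrative image
    mem_limit=f"{APP_LIMIT_MB}m",
    detach=True,
)

# Secondary container now gets a limit too, instead of running uncapped.
helper = client.containers.run(
    "busybox:latest",                  # stands in for a lock/cleanup helper
    command=["sleep", "3600"],
    mem_limit=f"{HELPER_LIMIT_MB}m",
    detach=True,
)
```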
I think this is unlikely to recur after switching the compute environments and stopping the affected hosts, so it should be safe to retry any jobs that failed because of this or to start new jobs. To close this issue, I'll fix the memory limits issue in the host container and come up with a way to monitor for this, so that if it recurs we can tell whether it is a leak over time or an overallocation.
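One way the monitoring could look, as a sketch only: periodically sample each running container's memory usage so a slow upward drift (a leak) can be told apart from a limit that is simply set too low (overallocation). The sampling interval and output format are assumptions.

```python
import time
import docker

client = docker.from_env()

def sample_memory():
    """Return {container name: (usage_bytes, limit_bytes)} for running containers."""
    samples = {}
    for c in client.containers.list():
        stats = c.stats(stream=False)        # one-shot stats snapshot
        mem = stats.get("memory_stats", {})
        samples[c.name] = (mem.get("usage", 0), mem.get("limit", 0))
    return samples

if __name__ == "__main__":
    while True:
        for name, (used, limit) in sample_memory().items():
            pct = 100 * used / limit if limit else 0
            print(f"{name}: {used / 2**20:.0f} MiB / {limit / 2**20:.0f} MiB ({pct:.0f}%)")
        time.sleep(60)                       # assumed sampling interval
```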
Great to hear a solution is on the way. Let us know when it is deployed so @sadevp will be able to rerun the jobs.
I reran the job that was previously failing for me and it ran without errors. @sadevp did you rerun your jobs?
@chris they're running, I will let you know when they're done. Thanks.
Still running, FYI.
It's been too long, they probably froze. Could you cancel them and rerun?
They have been running since yesterday; do you think they timed out again?
Give me more time - if they are still running after 48h then we have an issue.
Best,
Chris
Hello, I am seeing this issue again after running fmriprep on this dataset: https://openneuro.org/datasets/ds000148/versions/00001?app=FMRIPREP&version=32&job=67629e33-bb62-41ff-8fbb-2068df24a514

As you can see in the error logs, two subjects ran successfully, but the rest seem to have hit memory limits with the "Error response from daemon" message.
I don't think this is a platform issue, but a problem with fmriprep: nipreps/fmriprep#841
I ran fmriprep on this dataset: https://openneuro.org/datasets/ds000005/versions/00001?app=FMRIPREP&version=18&job=0125e5c3-50a0-4437-9e92-48c7bb4e65fe
I received a failure notice, but am seeing that some of the participants ran successfully, as shown here:
However, after clicking into the logs for one of the 'success' participants, I see the following error:
```
Error response from daemon: No such container: dfcdf4d8-8cbd-4ab3-bc55-cf74d238df1b-lock
No lock found for dfcdf4d8-8cbd-4ab3-bc55-cf74d238df1b
```