Memory limits are not respected by all Batch containers #131
The same error is present after running fmriprep on this dataset: https://openneuro.org/datasets/ds000006/versions/00001?app=FMRIPREP&version=18&job=16ab5f1e-cb72-4c21-bbb1-f1a062b4412a
Looks like some of these have gone missing because the OOM killer terminated the containers. #124 also means that full utilization leaves the host running very close to out of memory, which can cause bigger issues. @chrisfilo Do we want to allocate more memory per job for fmriprep so it does not hit its memory limit? So far this seems to have happened only to fmriprep.
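For context, here is a minimal sketch of what "more memory per job" could look like via the AWS Batch API with boto3: registering a new revision of the job definition with a larger memory reservation. The job definition name, image, and sizes are assumptions for illustration, not the platform's actual configuration.

```python
import boto3

batch = boto3.client("batch")

response = batch.register_job_definition(
    jobDefinitionName="fmriprep",                 # illustrative name, not the real job definition
    type="container",
    containerProperties={
        "image": "poldracklab/fmriprep:latest",   # illustrative image tag
        "vcpus": 4,
        "memory": 20480,                          # MiB; bumped above the current ~14 GB limit
    },
)
print(response["jobDefinitionArn"], "revision", response["revision"])
```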
Have you seen evidence of the OOM issues in the logs of failed jobs? I'm asking because I did not expect FMRIPREP to run out of memory with 14 GB. Is there a way to limit the memory the docker container gets, so that the docker daemon always has some memory to spare and can continue to work? That would help with debugging situations like this.
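As a rough sketch of that idea (not the platform's actual code), the Docker SDK for Python can set a hard memory limit per container so the daemon and OS keep some headroom. The image, host size, and headroom below are assumptions.

```python
import docker

HOST_MEMORY_MB = 16384        # assumed total memory on the host
DAEMON_HEADROOM_MB = 2048     # assumed reserve for dockerd and the OS

client = docker.from_env()
container = client.containers.run(
    "poldracklab/fmriprep:latest",                # illustrative image
    command=["/data", "/out", "participant"],     # illustrative arguments
    mem_limit=f"{HOST_MEMORY_MB - DAEMON_HEADROOM_MB}m",      # hard cap for the container
    memswap_limit=f"{HOST_MEMORY_MB - DAEMON_HEADROOM_MB}m",  # no extra swap beyond the cap
    detach=True,
)
print(container.short_id)
```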
I'm seeing the OOM killer in the kernel logs killing fmriprep containers. What is causing the log failures and the other job failures is a secondary problem: the docker service has gotten stuck or crashed on some hosts. I'm still investigating that, but in the meantime I created a new compute environment and removed the current one from the pool. This will start up new hosts rather than continue scheduling work on the bad ones.

There is some buffer for the host built into the limits, and the app limit is a hard limit for the container, but I've found an issue with how we allocate this limit. Since we launch some secondary containers, those do not have the limit applied. They don't use much memory, but this does push total usage slightly above what ECS has allocated.
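A minimal sketch of the fix described above, assuming the containers are launched through the Docker SDK for Python: give the secondary (helper) containers their own small cap so the sum of all per-container limits stays within the memory ECS has allocated for the job. Names and sizes are illustrative assumptions.

```python
import docker

ECS_ALLOCATION_MB = 14336     # assumed memory ECS reserves for the job
HELPER_LIMIT_MB = 256         # assumed cap for each secondary container
APP_LIMIT_MB = ECS_ALLOCATION_MB - 2 * HELPER_LIMIT_MB  # leave room for two helpers

client = docker.from_env()

# Main analysis container keeps a hard limit, as before.
app = client.containers.run(
    "poldracklab/fmriprep:latest",     # illustrative image
    mem_limit=f"{APP_LIMIT_MB}m",
    detach=True,
)

# Secondary container now gets a limit too, instead of running uncapped.
helper = client.containers.run(
    "busybox:latest",                  # stands in for a lock/cleanup helper
    command=["sleep", "3600"],
    mem_limit=f"{HELPER_LIMIT_MB}m",
    detach=True,
)
```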
I think this is unlikely to recur after switching the compute environments and stopping the affected hosts, so it should be safe to retry any jobs that failed because of this or to start new jobs. To close this issue, I'll fix the memory limits issue in the host container and come up with a way to monitor for this, so that if it recurs we can tell whether it is a leak over time or an overallocation.
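One way the monitoring could look, as a sketch only: periodically sample each running container's memory usage so a slow upward drift (a leak) can be told apart from a limit that is simply set too low (overallocation). The sampling interval and output format are assumptions.

```python
import time
import docker

client = docker.from_env()

def sample_memory():
    """Return {container name: (usage_bytes, limit_bytes)} for running containers."""
    samples = {}
    for c in client.containers.list():
        stats = c.stats(stream=False)        # one-shot stats snapshot
        mem = stats.get("memory_stats", {})
        samples[c.name] = (mem.get("usage", 0), mem.get("limit", 0))
    return samples

if __name__ == "__main__":
    while True:
        for name, (used, limit) in sample_memory().items():
            pct = 100 * used / limit if limit else 0
            print(f"{name}: {used / 2**20:.0f} MiB / {limit / 2**20:.0f} MiB ({pct:.0f}%)")
        time.sleep(60)                       # assumed sampling interval
```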
Great to hear a solution is on the way. Let us know when it is deployed so @sadevp will be able to rerun the jobs.
I reran the job that was previously failing for me and it ran without errors. @sadevp did you rerun your jobs?
@chris they're running, I will let you know when they're done. Thanks.
Still running, FYI.
It's been too long, they probably froze. Could you cancel them and rerun?
They have been running since yesterday; do you think they timed out again?
Give me more time - if they are still running after 48h then we have an issue.
Best,
Chris
Hello, I am seeing this issue again after running fmriprep on this dataset: https://openneuro.org/datasets/ds000148/versions/00001?app=FMRIPREP&version=32&job=67629e33-bb62-41ff-8fbb-2068df24a514

As you can see in the error logs, two subjects ran successfully, but the rest seem to have hit memory limits with the "Error response from daemon" message.
I don't think this is a platform issue, but a problem with fmriprep: nipreps/fmriprep#841
I ran fmriprep on this dataset: https://openneuro.org/datasets/ds000005/versions/00001?app=FMRIPREP&version=18&job=0125e5c3-50a0-4437-9e92-48c7bb4e65fe
I received a failure notice, but am seeing that some of the participants ran successfully, as shown here:
However, after clicking into the logs for one of the 'success' participants, I see the following error:
```
Error response from daemon: No such container: dfcdf4d8-8cbd-4ab3-bc55-cf74d238df1b-lock
No lock found for dfcdf4d8-8cbd-4ab3-bc55-cf74d238df1b
```