Memory limits are not respected by all Batch containers #131

Closed
sadevp opened this issue Oct 11, 2017 · 14 comments
@sadevp

sadevp commented Oct 11, 2017

I ran fmriprep on this dataset: https://openneuro.org/datasets/ds000005/versions/00001?app=FMRIPREP&version=18&job=0125e5c3-50a0-4437-9e92-48c7bb4e65fe

I received a failure notice, but am seeing that some of the participants ran successfully, as shown here:

[screenshot: screen shot 2017-10-11 at 10 28 25 am]

However, after clicking into the logs for one of the 'success' participants, I see the following error:

Error response from daemon: No such container: dfcdf4d8-8cbd-4ab3-bc55-cf74d238df1b-lock
No lock found for dfcdf4d8-8cbd-4ab3-bc55-cf74d238df1b

@sadevp
Author

sadevp commented Oct 11, 2017

The same error is present after running fmriprep on this dataset: https://openneuro.org/datasets/ds000006/versions/00001?app=FMRIPREP&version=18&job=16ab5f1e-cb72-4c21-bbb1-f1a062b4412a

@nellh nellh self-assigned this Oct 11, 2017
@nellh
Contributor

nellh commented Oct 11, 2017

Looks like some of these containers went missing because the OOM killer terminated them. #124 also means that full utilization leaves the host running very close to out of memory, which can cause bigger issues.

@chrisfilo Do we want to allocate more memory per job so fmriprep doesn't run into its memory limit? This seems to have happened only to fmriprep.
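
If we do bump the per-job memory, something like the boto3 sketch below is roughly what the change would look like. This is a sketch only: the job definition name, image, vCPU count, and the 16384 MiB figure are illustrative assumptions, not our actual configuration.

```python
import boto3

# Sketch only: register a new revision of the Batch job definition with a
# higher memory reservation. Names and numbers are placeholders.
batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="fmriprep-app",            # assumed name
    type="container",
    containerProperties={
        "image": "poldracklab/fmriprep:latest",  # assumed image
        "vcpus": 4,
        "memory": 16384,  # MiB; raised above the current ~14 GB per-job limit
    },
)
```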

@chrisgorgo
Contributor

Have you seen evidence of the OOM issues in the logs of the failed jobs? I'm asking because I did not expect FMRIPREP to run out of memory with 14 GB.

Is there a way to limit the memory the Docker container gets so that the Docker daemon always has some memory to spare and can keep working? That would help with debugging situations like this.
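
Capping the app container below the host total is supported by Docker itself; here is a minimal sketch using the Python Docker SDK. The image, command, and the 12 GB cap (leaving ~2 GB of headroom for the daemon) are illustrative assumptions.

```python
import docker

client = docker.from_env()

# Sketch: run the app container with a hard memory cap so the daemon and
# the rest of the host keep some headroom. If the cap is exceeded, the
# kernel OOM-kills this container rather than destabilizing the host.
container = client.containers.run(
    "poldracklab/fmriprep:latest",        # assumed image
    command=["/bids", "/out", "participant"],
    mem_limit="12g",       # hard memory cap for the container
    memswap_limit="12g",   # no extra swap beyond the cap
    detach=True,
)
print(container.id)
```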

@nellh
Contributor

nellh commented Oct 11, 2017

I'm seeing the OOM killer in the kernel logs killing fmriprep containers. The log failures and other job failures are a secondary problem: the Docker service has gotten stuck or crashed on some hosts.

I'm still investigating, but in the meantime I created a new compute environment and removed the current one from the pool. This will start up new hosts rather than continue scheduling jobs on the bad ones.

There is some buffer for the host built into the limits, and the app limit is a hard limit for the container, but I've found an issue with how we allocate it: the secondary containers we launch do not have the limit applied. They don't use much memory, but this pushes total usage slightly above what ECS has allocated.
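
For anyone checking their own jobs: besides the kernel logs, Docker records OOM kills in the container state, so they can be confirmed per container. A rough sketch with the Python Docker SDK (run on the Batch host; which containers you have is whatever Batch scheduled there):

```python
import docker

client = docker.from_env()

# Sketch: report which exited containers on this host were OOM-killed,
# using the OOMKilled flag that `docker inspect` exposes in State.
for c in client.containers.list(all=True, filters={"status": "exited"}):
    state = c.attrs["State"]
    if state.get("OOMKilled"):
        print(f"{c.name} ({c.short_id}) was OOM-killed, exit code {state['ExitCode']}")
```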

@nellh
Contributor

nellh commented Oct 12, 2017

I think this is unlikely to recur after switching compute environments and stopping the affected hosts, so it should be safe to retry any jobs that failed because of this, or to start new ones.

To close this issue, I'll fix the memory-limit allocation in the host container and set up monitoring so that, if it recurs, we can tell whether it's a leak over time or overallocation.
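
One way to monitor this (leak over time vs. overallocation) is to sample per-container memory usage periodically. A rough sketch with the Python Docker SDK; the 60-second interval and plain stdout logging are arbitrary choices, not what we'll necessarily deploy:

```python
import time
import docker

client = docker.from_env()

# Sketch: periodically log memory usage vs. limit for running containers.
# A slow, steady climb toward the limit suggests a leak; starting out near
# the limit suggests overallocation.
while True:
    for c in client.containers.list():
        stats = c.stats(stream=False)          # one-shot stats snapshot
        mem = stats["memory_stats"]
        usage_mb = mem.get("usage", 0) / 2**20
        limit_mb = mem.get("limit", 0) / 2**20
        print(f"{time.strftime('%H:%M:%S')} {c.name}: {usage_mb:.0f} / {limit_mb:.0f} MiB")
    time.sleep(60)
```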

@chrisgorgo
Contributor

Great to hear a solution is on the way. Let us know when it is deployed so @sadevp can rerun the jobs.

@chrisgorgo
Contributor

I reran the job that was previously failing for me and it ran without errors. @sadevp, did you rerun your jobs?

@sadevp
Author

sadevp commented Oct 13, 2017 via email

@sadevp
Author

sadevp commented Oct 16, 2017 via email

@chrisgorgo
Contributor

chrisgorgo commented Oct 16, 2017 via email

@sadevp
Author

sadevp commented Oct 16, 2017 via email

@chrisgorgo
Contributor

chrisgorgo commented Oct 16, 2017 via email

@nellh nellh changed the title from 'Error Response from daemon: No Such Container' [fmriPREP] to Memory limits are not respected by all Batch containers Nov 7, 2017
@nellh nellh added the backlog label Nov 7, 2017
@sadevp
Author

sadevp commented Nov 20, 2017

Hello,

I am seeing this issue again after running fmriprep on this dataset: https://openneuro.org/datasets/ds000148/versions/00001?app=FMRIPREP&version=32&job=67629e33-bb62-41ff-8fbb-2068df24a514

As you can see in the error logs, two subjects ran successfully, but the rest seem to have hit memory limits with the 'error response from daemon' message.

[screenshot: screen shot 2017-11-20 at 8 26 19 am]

@chrisgorgo
Contributor

I don't think this is a platform issue, but a problem with fmriprep: nipreps/fmriprep#841
