OOM Killed during batch processing when running cellbender on slurm #251

Open · ggconnell opened this issue Aug 21, 2023 · 8 comments
Labels: enhancement (New feature or improvement)

@ggconnell

Hi there! I'm running into a similar issue. I am running cellbender v0.3.0 on an RTX A5000 GPU with 24 GB of memory, using the following command:
cellbender remove-background --cuda --input 2_alignment/DCK-sc43_alignment/outs/raw_feature_bc_matrix.h5 --output 2_alignment/DCK-sc43_alignment/outs/cellbender_cleaned_feature_bc_matrix.h5 --fpr 0.01 --epochs 150

I keep getting this error:
cellbender:remove-background: Working on chunk (173/476)
cellbender:remove-background: Working on chunk (174/476)
cellbender:remove-background: Working on chunk (175/476)
cellbender:remove-background: Working on chunk (176/476)
slurmstepd: error: Detected 1 oom_kill event in StepId=39347527.0. Some of the step tasks have been OOM Killed.

Small adjustments only change which chunk it fails on.
The .h5 file is 676 MB. Any help would be greatly appreciated; I have been troubleshooting this for over a week.

@sjfleming (Member)

Hi @ggconnell, unfortunately it seems I need to track down why memory usage during posterior computation is so high. I recently ran into this problem myself.

The good news is that the problem is with CPU memory, not GPU memory. So if you have access to a lot of RAM and are willing to put up with the cost, you can increase the requested CPU memory (it might take something like 128 GB) and the task will finish. (Since you've already trained the model, I'd recommend restarting from the ckpt.tar.gz checkpoint from the previous run via --checkpoint, because cellbender can pick up where it left off.)

To help me figure out why this is happening, can you tell me more about your experiment? 676 MB is an extraordinarily large h5 file. Is this experiment overloaded somehow? I ran into this in the context of an overloaded experiment with many donors (to be demultiplexed later) run with PIP-seq. Any details on your case might be helpful.

sjfleming self-assigned this Aug 21, 2023
sjfleming added the enhancement (New feature or improvement) label Aug 21, 2023
@sjfleming (Member)

Related to #248

@ggconnell (Author)

Hi @sjfleming, thanks for your response! Increasing my CPU memory to 450 GB did the trick! (450 GB was likely excessive, but I wanted to greatly overestimate it.) Yes, we overloaded the sequencer to recover ~70k cells with scRNA-seq. The sample itself also consists of more than 300 different cell lines.

@sjfleming (Member)

Very interesting. Okay, great to hear that the increased memory worked. I will work on tracking down what must be a bug on my part, because there's no reason cellbender should need that much memory during posterior calculation.

@jg9zk commented Aug 29, 2023

Hi, I have this problem too. I used the tracemalloc package to track which lines of posterior.py are using all this memory. After 27 iterations of "Working on chunk (x/xxx)", it was taking up about 7 GB of memory. It seems to be the chunk of code where you're extending the lists containing the sparse matrix values (lines 530-534). Would directly saving them as a sparse matrix at each step decrease memory usage? Or maybe writing to a file instead of keeping everything in memory? I'm not really sure what the code is doing, so I might be completely off base.
[attached screenshot: cellbender_tracemalloc — tracemalloc output]
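
For reference, here is a minimal sketch of the tracemalloc setup I mean: snapshot memory every few "chunks" and print the top allocating lines. The loop below only simulates the list-extend pattern with made-up sizes and hypothetical names; it is not posterior.py itself.

import tracemalloc

import numpy as np

tracemalloc.start(25)  # keep up to 25 stack frames per allocation

# Stand-ins for the lists that get extended on every chunk (hypothetical names).
rows, cols, vals = [], [], []

for chunk in range(1, 31):
    n = 100_000  # made-up chunk size
    rows.extend(np.random.randint(0, 50_000, size=n).tolist())
    cols.extend(np.random.randint(0, 30_000, size=n).tolist())
    vals.extend(np.random.rand(n).tolist())

    if chunk % 10 == 0:
        snapshot = tracemalloc.take_snapshot()
        print(f"--- after chunk {chunk} ---")
        for stat in snapshot.statistics("lineno")[:5]:
            print(stat)  # shows file:line and cumulative size of allocations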

@sjfleming (Member)

Indeed @jg9zk, you are right on the money, and thanks for writing in with this! I have to say I do not fully understand why this kind of list extend operation behaves the way it does. It must be that Python keeps references to the objects in the list (which keeps growing), and so the original objects are kept alive as well. This was not at all the behavior I intended, and I did not realize it was happening until recently.

It might be the case that I'll have to write to a file as I go, but I am working on another fix to try to keep things in memory (just use less).

I think this change should help, but I need to run tracemalloc or something similar myself and see:
86a0c90
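
To sketch the general shape of that kind of fix (hypothetical names, not the actual diff in 86a0c90): keep each chunk's indices and values as packed numpy arrays and assemble one scipy sparse matrix at the end, instead of extending Python lists of boxed scalars, which cost far more memory per element.

import numpy as np
import scipy.sparse as sp

def accumulate_chunks(chunk_iter, shape):
    # chunk_iter yields (row_idx, col_idx, values) arrays for one chunk at a
    # time; this sketch assumes the indices are global coordinates in the
    # final matrix of the given shape.
    row_parts, col_parts, val_parts = [], [], []
    for rows, cols, vals in chunk_iter:
        # Append compact arrays rather than extending with individual numbers.
        row_parts.append(np.asarray(rows, dtype=np.int64))
        col_parts.append(np.asarray(cols, dtype=np.int64))
        val_parts.append(np.asarray(vals, dtype=np.float32))
    return sp.coo_matrix(
        (np.concatenate(val_parts),
         (np.concatenate(row_parts), np.concatenate(col_parts))),
        shape=shape,
    ).tocsr()

Writing each per-chunk COO matrix to disk and stacking at the end, as suggested above, would be the fallback if even the packed arrays don't fit in memory.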

@sjfleming (Member)

In the meantime, if anybody wants to try it urgently and see whether the issue resolves in your case, feel free to run the code from the sf_memory_efficient_posterior_generation branch in this repo. You can track progress toward a fix here: #263

@jg9zk commented Aug 29, 2023

This branch stopped memory from being used up during the for loop, but my job was still killed due to OOM sometime afterward.
