OOM Killed during batch processing when running cellbender on slurm #251

Open · ggconnell opened this issue Aug 21, 2023 · 8 comments
Labels: enhancement (New feature or improvement)

@ggconnell

Hi there! I'm running into a similar issue. I am running cellbender v0.3.0 on an RTX A5000 GPU with 24 GB of memory, using the following command:
cellbender remove-background --cuda --input 2_alignment/DCK-sc43_alignment/outs/raw_feature_bc_matrix.h5 --output 2_alignment/DCK-sc43_alignment/outs/cellbender_cleaned_feature_bc_matrix.h5 --fpr 0.01 --epochs 150

I keep getting this error:
cellbender:remove-background: Working on chunk (173/476)
cellbender:remove-background: Working on chunk (174/476)
cellbender:remove-background: Working on chunk (175/476)
cellbender:remove-background: Working on chunk (176/476)
slurmstepd: error: Detected 1 oom_kill event in StepId=39347527.0. Some of the step tasks have been OOM Killed.

Small adjustments only change which chunk it fails on.
The .h5 file is 676 MB. Any help would be greatly appreciated; I have been troubleshooting this for over a week.

@sjfleming (Member)

Hi @ggconnell, unfortunately it seems I need to track down why memory usage during posterior computation is so high. I recently ran into this problem myself.

The good news is that the problem is with CPU memory, not GPU memory. So if you have access to a lot of RAM and are willing to put up with the cost, you can increase the requested CPU memory (it might take something like 128 GB) and the task will finish. (Since you've already trained the model, I'd recommend restarting from the ckpt.tar.gz checkpoint from the previous run via --checkpoint, because cellbender can pick up where it left off.)

To help me figure out why this is happening, can you tell me more about your experiment? 676 MB is an extraordinarily large h5 file. Is this experiment overloaded somehow? I ran into this in the context of an overloaded experiment with many donors (to be demultiplexed later) run with PIP-seq. Any details on your case might be helpful.

sjfleming self-assigned this Aug 21, 2023
sjfleming added the enhancement (New feature or improvement) label Aug 21, 2023
@sjfleming (Member)

Related to #248

@ggconnell (Author)

Hi @sjfleming, thanks for your response! Increasing my CPU memory to 450 GB did the trick! (450 GB was likely excessive, but I wanted to greatly overestimate it.) Yes, we overloaded the sequencer to recover ~70k cells with scRNA-seq. The sample itself also consists of more than 300 different cell lines.

@sjfleming (Member)

Very interesting. Okay, great to hear that the increased memory worked. I will work on tracking down what must be a bug on my part, because there's no reason cellbender should need that much memory during posterior calculation.

@jg9zk commented Aug 29, 2023

Hi, I have this problem too. I used the tracemalloc package to track which lines of posterior.py are using all this memory. After 27 iterations of "Working on chunk (x/xxx)", it was taking up about 7 GB of memory. It seems to be the chunk of code where you're extending the lists containing the sparse matrix values (lines 530-534). Would directly saving them as a sparse matrix at each step decrease memory usage? Or maybe writing to a file instead of keeping everything in memory? I'm not really sure what the code is doing, so I might be completely off base.
[attached screenshot: cellbender_tracemalloc — tracemalloc output]
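
For reference, here is a minimal sketch of the tracemalloc setup I mean: snapshot memory every few "chunks" and print the top allocating lines. The loop below only simulates the list-extend pattern with made-up sizes and hypothetical names; it is not posterior.py itself.

import tracemalloc

import numpy as np

tracemalloc.start(25)  # keep up to 25 stack frames per allocation

# Stand-ins for the lists that get extended on every chunk (hypothetical names).
rows, cols, vals = [], [], []

for chunk in range(1, 31):
    n = 100_000  # made-up chunk size
    rows.extend(np.random.randint(0, 50_000, size=n).tolist())
    cols.extend(np.random.randint(0, 30_000, size=n).tolist())
    vals.extend(np.random.rand(n).tolist())

    if chunk % 10 == 0:
        snapshot = tracemalloc.take_snapshot()
        print(f"--- after chunk {chunk} ---")
        for stat in snapshot.statistics("lineno")[:5]:
            print(stat)  # shows file:line and cumulative size of allocations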

@sjfleming (Member)

Indeed @jg9zk, you are right on the money, and thanks for writing in with this! I have to say I do not fully understand why this kind of list extend operation behaves the way it does. It must be that Python keeps references to the objects in the list (which keeps growing), and so the original objects are kept alive as well. This was not at all the behavior I intended, and I did not realize it was happening until recently.

It might be the case that I'll have to write to a file as I go, but I am working on another fix to try to keep things in memory (just use less).

I think this change should help, but I need to run tracemalloc or something similar myself and see:
86a0c90
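
To sketch the general shape of that kind of fix (hypothetical names, not the actual diff in 86a0c90): keep each chunk's indices and values as packed numpy arrays and assemble one scipy sparse matrix at the end, instead of extending Python lists of boxed scalars, which cost far more memory per element.

import numpy as np
import scipy.sparse as sp

def accumulate_chunks(chunk_iter, shape):
    # chunk_iter yields (row_idx, col_idx, values) arrays for one chunk at a
    # time; this sketch assumes the indices are global coordinates in the
    # final matrix of the given shape.
    row_parts, col_parts, val_parts = [], [], []
    for rows, cols, vals in chunk_iter:
        # Append compact arrays rather than extending with individual numbers.
        row_parts.append(np.asarray(rows, dtype=np.int64))
        col_parts.append(np.asarray(cols, dtype=np.int64))
        val_parts.append(np.asarray(vals, dtype=np.float32))
    return sp.coo_matrix(
        (np.concatenate(val_parts),
         (np.concatenate(row_parts), np.concatenate(col_parts))),
        shape=shape,
    ).tocsr()

Writing each per-chunk COO matrix to disk and stacking at the end, as suggested above, would be the fallback if even the packed arrays don't fit in memory.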

@sjfleming (Member)

In the meantime, if anybody wants to try it urgently and see whether the issue resolves in your case, feel free to run the code from the sf_memory_efficient_posterior_generation branch in this repo. You can track progress toward a fix here: #263

@jg9zk commented Aug 29, 2023

This branch stopped memory from being used up during the for loop, but my job was still killed due to OOM sometime afterward.
