OOM Killed during batch processing when running cellbender on slurm #251
Hi @ggconnell, unfortunately it seems like I need to track down why my memory usage during posterior computation is so high. I recently ran into this problem myself. The good news is that the problem is with CPU memory, not GPU memory. So if you're willing to put up with the cost and you have access to a lot of RAM, you can increase the CPU memory (it might take something like 128GB) and the task will finish. (Since you've already trained the model, I'd recommend starting from that rather than re-training.)

To help me figure out why this is happening, can you tell me more about your experiment? I see that 676MB is an extraordinarily large h5 file. Is this experiment overloaded somehow? I ran into this in the context of an overloaded experiment with many donors (to be demultiplexed later) run with PIP-seq. Any details on your case would be helpful.
Related to #248
Hi @sjfleming, thanks for your response! Increasing my CPU memory to 450G did the trick! (450G was likely excessive, but I tried to greatly overestimate it.) Yes, we overloaded the sequencer to get ~70k cells recovered with scRNA-seq. The sample itself consists of more than 300 different cell lines as well.
Very interesting. Okay, great to hear that the increased memory worked. I will work on tracking down what must be a bug on my part, because there's no reason cellbender should need to use that much memory during posterior calculation.
Indeed @jg9zk, you are right on the money, and thanks for writing in with this! I have to say that I do not understand why this kind of list extend operation is behaving the way it is. It must be the case that Python is keeping pointers to objects in the list (which keeps growing), and so it must also be keeping the original objects around. This was not at all the behavior I intended, and I did not realize it was happening until recently. It might be the case that I'll have to write to a file as I go, but I am working on another fix to try to keep things in memory (just use less of it). I think this change should help, but I need to run tracemalloc or something similar myself and see.
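To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual cellbender code; all names and sizes below are made up for illustration) of how extending a list with views of a large per-chunk tensor keeps every full tensor alive, and how copying out just the needed values lets each chunk be freed:

```python
import numpy as np
import torch


def posterior_chunk(batch_size: int = 500, n_genes: int = 30_000) -> torch.Tensor:
    """Stand-in for one large intermediate tensor produced per chunk."""
    return torch.randn(batch_size, n_genes)


def leaky_loop(n_chunks: int) -> list:
    # Each element appended here is a *view* sharing storage with the big
    # per-chunk tensor, so the growing list keeps every full
    # (batch_size x n_genes) tensor alive until the loop finishes.
    results = []
    for _ in range(n_chunks):
        big = posterior_chunk()
        results.extend(big[:, :10])  # row views keep `big`'s storage alive
    return results


def compact_loop(n_chunks: int) -> np.ndarray:
    # Copy only the values that are actually needed into a small CPU array;
    # the large intermediate becomes garbage at the end of each iteration.
    results = []
    for _ in range(n_chunks):
        big = posterior_chunk()
        results.append(big[:, :10].clone().numpy())
        del big
    return np.concatenate(results)
```

Writing each chunk to disk as you go (the other option mentioned above) bounds the peak in the same way, at the cost of some I/O.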
In the meantime, if anybody urgently wants to try it and see whether the issue resolves in your case, feel free to run the code from the linked branch.
This branch stopped memory from being used up during the for loop, but my job was still killed due to OOM sometime afterward.
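For anyone who wants to see where the memory is actually going, the tracemalloc approach mentioned above is easy to wire in. A minimal self-contained sketch (the deliberately leaking loop is just a stand-in for the real chunk loop):

```python
import tracemalloc


def report_memory(label: str) -> None:
    """Print total traced memory and the top allocation sites so far."""
    current, peak = tracemalloc.get_traced_memory()
    print(f"{label}: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
    for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
        print("   ", stat)


if __name__ == "__main__":
    tracemalloc.start()
    hoarder = []  # stands in for the growing results list
    for i in range(301):
        hoarder.extend(bytearray(100_000) for _ in range(10))
        if i % 100 == 0:
            report_memory(f"chunk {i}")
    tracemalloc.stop()
```

Wrapping the real chunk loop like this should show which line's allocations keep growing from chunk to chunk.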
Hi there! I'm running into a similar issue. I am running cellbender v0.3.0 on an RTX A5000 GPU with 24GB of memory, using the following command:
cellbender remove-background --cuda --input 2_alignment/DCK-sc43_alignment/outs/raw_feature_bc_matrix.h5 --output 2_alignment/DCK-sc43_alignment/outs/cellbender_cleaned_feature_bc_matrix.h5 --fpr 0.01 --epochs 150
I keep getting this error:
cellbender:remove-background: Working on chunk (173/476)
cellbender:remove-background: Working on chunk (174/476)
cellbender:remove-background: Working on chunk (175/476)
cellbender:remove-background: Working on chunk (176/476)
slurmstepd: error: Detected 1 oom_kill event in StepId=39347527.0. Some of the step tasks have been OOM Killed.
Small adjustments have only changed at which batch it fails.
The .h5 file is 676MB. Any help would be greatly appreciated. I have been troubleshooting this for over a week.