Memory issues: requiring massive amounts of memory for a few genomes. HELP? #124
Comments
We believe this is an issue with how Linux reports memory usage when pplacer runs with multiple threads. Try running with a single CPU and see if the problem goes away.
I tried with two genomes. So try a single CPU? But won't I run out of memory?
Hmm, well that worked for two genomes. So strange. Is there a way to fix this?
We haven't been able to find a solution to this. We strongly believe it has to do with pplacer, a third-party dependency that is no longer being actively developed.
Hmm, bummer man. Could FastTree be a replacement for pplacer?
pplacer does a maximum-likelihood placement per genome instead of inferring a de novo tree. The only real replacement is EPA, but it requires more memory than pplacer.
Hi, same issue here. I'm running gtdbtk classify_wf for 31 MAGs; I requested 24 CPUs and 250 GB RAM on a node, and the error message says the job required 2.7 TB of memory. So the solution is to run with 1 CPU? How would that affect the speed?
It isn't ideal, though with just 31 genomes it isn't an issue.
Yup, use 1 CPU but give it all the memory you can. It was very fast for me.
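For anyone who wants the concrete invocation, a minimal sketch of that single-CPU workaround, reusing the flags from the original post (paths are placeholders):

gtdbtk classify_wf --cpus 1 -x fasta --genome_dir /path/to/bins --out_dir /path/to/bins_out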
Amazing, fixed now, thank you guys!
Welcome! It drove me insane for like a week.
@dparks1134 would EPA-ng (https://github.com/Pbdas/epa-ng) be any better at RAM usage than EPA? ...just a thought, but you're probably referring to the stand-alone EPA-ng and not the original within-RAxML EPA.
For what it's worth, I'm experiencing the same memory issue, but on AWS. In my case GTDB-Tk's test fails due to [...]. Could I use [...]? I ask because I am not running [...].
Any other tips for successfully running GTDB-Tk on AWS would be much appreciated!
Hello. You can run [...]
Where does the 2.7 TB come from? On our system, we run GTDB-Tk using 64 CPUs to process 1,000 genomes, which uses ~100 GB of memory and takes around 1 hour. We are running Ubuntu. I know a number of people have had issues with pplacer on OS X, and it also seems to behave oddly with some batch systems.
It comes from the post above from @ganiatgithub, and I also got a similar error when I tried to use 16 CPUs. I ran GTDB-Tk on an HPC under Slurm, and it looks like it requires about 150 GB of RAM per thread (so at least 2 TB of memory for 16 CPUs). Now I often use 3 CPUs and 500 GB of memory for GTDB-Tk, and it works fine. Yeah, you are right, maybe this is a bug in some batch systems.
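For reference, a Slurm resource request matching that 3-CPU / 500 GB setup might look like the following (job name, partition, and walltime omitted; the values are taken from the comment above):

#SBATCH --cpus-per-task=3
#SBATCH --mem=500G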
We've had a few people report this. pplacer is written in OCaml, so my best guess is that there is something odd about how it spawns threads, which can make it look like every thread is using 150 GB when in fact this is the usage for the entire program. If you run it with 16 CPUs, does GTDB-Tk (pplacer) actually crash with a memory error?
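One way to sanity-check this while the placement step is running is to compare resident versus virtual memory for the pplacer process; if VSZ is enormous but RSS is modest, the batch system is likely accounting for reserved address space rather than real usage. This diagnostic is a suggestion, not part of GTDB-Tk itself:

ps -C pplacer -o pid,nlwp,rss,vsz,args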
Yes, it crashed because of [...]
Closing this issue as the cause has been identified and a workaround provided. Unfortunately we can't do much about pplacer; EPA may be a solution in the future. Until then, the workaround is documented in the FAQ for reference. Additionally, overriding the number of threads pplacer uses will be available as a feature in the next release (#195).
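Once that feature lands, and assuming the flag ships as --pplacer_cpus (the name used in later GTDB-Tk help output; check your version's CLI help), usage would be along these lines:

gtdbtk classify_wf --cpus 16 --pplacer_cpus 1 -x fasta --genome_dir /path/to/bins --out_dir /path/to/bins_out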
I've been encountering some memory issues as well, so I decided to take a dive into any new programs that may have come out since the last post. I found one: Balaban M, Sarmashghi S, Mirarab S. APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments. Syst Biol. 2020 May 1;69(3):566-578. doi: 10.1093/sysbio/syz063. Keywords: distance-based methods; genome skimming; phylogenetic placement. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7164367/
There's also a version 2 of this, but it hasn't been peer reviewed yet: https://www.biorxiv.org/content/10.1101/2021.02.14.431150v1 Not sure if this is helpful at all; I just wanted to look into a few alternatives in case.
I've a fair bit of experience successfully running GTDB-Tk on datasets containing low hundreds of genome bins. Strangely, today I have been repeatedly thwarted by PBS Pro job failures due to exceeding requested memory resources. Didn't earlier versions of GTDB-Tk default [...]? I see from the CLI help that it currently uses a default of [...]. Anyway, restricting pplacer to 2 threads still resulted in a memory termination as follows: [...]
So even if pplacer's true memory consumption isn't scaling the way it appears to on Linux (CentOS 8.4 here), extrapolating backwards it looks like GTDB-Tk v1.5.1 with GTDB Release 202 has jumped to a minimum footprint of ~300 GB? I guess I just didn't notice a month ago on a high-memory machine. [ed] I see there was a GTDB-Tk version bump to 1.6, and elsewhere the memory footprint is expected to be 200 GB.
Hello,
I have been trying to place my genomes in the tree with pplacer, but it fails with a memory error every time.
I have 512 GB of RAM on this node.
First run on 28 genomes:
slurmstepd: error: Job 9369956 exceeded memory limit (701324448 > 258048000)
slurmstepd: error: Exceeded job memory limit
Second run with 13 genomes:
Job 9370293 exceeded memory limit (1314666388 > 263856128)
slurmstepd: error: Exceeded job memory limit
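(Assuming these slurmstepd figures are in kB, as is usual for Slurm, the first run used roughly 701 GB against a ~258 GB limit, and the second roughly 1.31 TB against a ~264 GB limit, so the reported usage grew even as the genome count dropped from 28 to 13.)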
Why is it requiring so much memory?
The options I am using:
export GTDBTK_DATA_PATH=/opt/apps/data/gtdbtk/release86/
gtdbtk classify_wf --cpus 28 -x fasta --genome_dir /data/bins --out_dir bins_out
HELP?