
memory issues. requiring mass amounts of memory for few genomes. HELP? #124

Closed
raw937 opened this issue May 14, 2019 · 21 comments
@raw937

raw937 commented May 14, 2019

Hello,

I have been trying to place my genomes in the tree with pplacer, but it fails on memory every time.
I have 512 GB of RAM on this node:
First run on 28 genomes:
slurmstepd: error: Job 9369956 exceeded memory limit (701324448 > 258048000)
slurmstepd: error: Exceeded job memory limit

Second run with 13 genomes:
Job 9370293 exceeded memory limit (1314666388 > 263856128)
slurmstepd: error: Exceeded job memory limit

Why is it requiring so much memory?

The options I am using:
export GTDBTK_DATA_PATH=/opt/apps/data/gtdbtk/release86/
gtdbtk classify_wf --cpus 28 -x fasta --genome_dir /data/bins --out_dir bins_out

HELP?

@donovan-h-parks
Collaborator

We believe this is an issue with how Linux reports memory usage when pplacer is run with multiple CPUs. It appears that Linux believes the amount of memory being requested is the per-process memory multiplied by the number of CPUs. As such, this can cause issues with queuing systems. Can you try using a single CPU? It will naturally run slower.
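A minimal sketch of that workaround as a Slurm job script, reusing the paths from the post above (the memory request is illustrative, not a recommendation):

```shell
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=250G   # give the single pplacer process as much memory as the node allows

# Same invocation as above, but with a single CPU so the batch system's
# per-thread memory accounting doesn't multiply the apparent request.
export GTDBTK_DATA_PATH=/opt/apps/data/gtdbtk/release86/
gtdbtk classify_wf --cpus 1 -x fasta --genome_dir /data/bins --out_dir bins_out
```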

@raw937
Author

raw937 commented May 14, 2019

I tried with two genomes.
Required even more memory!!
slurmstepd: error: Job 9370432 exceeded memory limit (1691934724 > 263856128), being killed
slurmstepd: error: Exceeded job memory limit

So try a single CPU? But won't I run out of memory?

@raw937
Author

raw937 commented May 14, 2019

hmm, well that worked for two genomes. So strange. Is there a way to fix this?

@donovan-h-parks
Collaborator

We haven't been able to find a solution to this. We strongly believe it has to do with pplacer, which is a third-party dependency that is no longer being actively developed.

@raw937
Author

raw937 commented May 14, 2019

hmm, bummer man. could fasttree be a replacement for pplacer?

@donovan-h-parks
Collaborator

pplacer does a maximum-likelihood placement per genome instead of inferring a de novo tree. The only real replacement is EPA, but it requires more memory than pplacer.

@ganiatgithub

Hi, same issue here. I'm running gtdbtk classify_wf for 31 MAGs. I requested 24 CPUs and 250 GB RAM on a node, and the error message says the job required 2.7 TB of memory. So the solution is to run with 1 CPU? How would that affect the speed?

@donovan-h-parks
Collaborator

It isn't ideal, though with just 31 genomes it isn't an issue.

@raw937
Author

raw937 commented May 16, 2019

Yup, use 1 CPU but give it all the memory you can. It was very fast for me.

@ganiatgithub

Amazing, fixed now, thank you guys!

@raw937
Author

raw937 commented May 16, 2019

Welcome! It drove me insane for like a week.

@danfulop

danfulop commented Jul 11, 2019

@dparks1134 would EPA-ng (https://github.com/Pbdas/epa-ng) be any better at RAM usage than EPA? Just a thought, but you're probably referring to the stand-alone EPA-ng and not the original within-RAxML EPA.

For what it's worth, I'm experiencing the same memory issue, but on AWS. In my case GTDBtk's test fails due to pplacer, but without any indication that it's a memory limitation. However, I'm pretty sure that's what's happening.

Could I use mmap to circumvent the memory issue, as suggested on the pplacer site (http://matsen.github.io/pplacer/generated_rst/pplacer.html#memory-usage)?

I ask because I am not running pplacer directly, so I don't know how (or if) I can pass it the --mmap-file flag through GTDBtk.

Any other tips for successfully running GTDBtk on AWS would be much appreciated!

@donovan-h-parks
Collaborator

Hello. You can run pplacer in this mode using the --scratch_dir flag.
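A sketch of what that looks like, reusing the options from earlier in the thread (the scratch path is a placeholder; it needs roughly as much free disk space as pplacer would otherwise use in RAM):

```shell
# --scratch_dir makes pplacer write its likelihood vectors to a
# memory-mapped file on disk (its --mmap-file mode) instead of RAM.
gtdbtk classify_wf --cpus 1 -x fasta \
    --genome_dir /data/bins --out_dir bins_out \
    --scratch_dir /scratch/pplacer_mmap
```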

@SilentGene

pplacer requires a lot of memory since it caches likelihood vectors for all of the internal nodes, so GTDBtk can consume a huge amount of RAM when trying to place users' genomes in a huge reference tree. The --mmap-file flag could probably be a workaround, using disk I/O instead of RAM. However, it's still a problem when you need something like 2.7 TB of available space on your disk.
But I was also curious how the authors managed to use 64 CPUs to analyze 1,000 genomes in 1 hour, as shown in the Hardware requirements section. Is it because this bug only affects Linux, not Mac?

@donovan-h-parks
Collaborator

Where does the 2.7 TB come from? On our system, we run GTDB-Tk using 64 CPUs to process 1,000 genomes, which uses ~100 GB of memory. This takes around 1 hour. We are running Ubuntu. I know a number of people have had issues with pplacer on OS X, and it also seems to behave oddly with some batch systems.

@SilentGene

It comes from the post above from @ganiatgithub. I also got a similar error when I tried to use 16 CPUs. I ran GTDBtk on an HPC using the Slurm system, and it looks like it requires about 150 GB of RAM for every thread (so it would be at least 2 TB of memory for 16 CPUs). Now I often use 3 CPUs and 500 GB of memory for GTDBtk, and it works fine. Yeah, you are right, maybe this is a bug in some batch systems.

@donovan-h-parks
Collaborator

We've had a few people report this. pplacer is written in OCaml, so my best guess is that there is something odd about how it spawns threads which can make it look like every thread is using 150 GB, when in fact this is the usage for the entire program. If you run it with 16 CPUs, does GTDB-Tk (pplacer) actually crash with a memory error?
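That accounting quirk can be sketched numerically, using the ~150 GB per-"thread" figure and the 16-CPU run reported above (illustrative numbers only):

```python
# If a batch system attributes the whole process's resident memory to each
# thread, the usage it enforces against scales with the thread count, even
# though the program's real footprint does not.
def apparent_usage_gb(actual_gb, threads):
    """Memory a mis-accounting scheduler would charge the job."""
    return actual_gb * threads

# Figures from this thread: ~150 GB real footprint.
print(apparent_usage_gb(150, 1))   # 150 -> fits on a large-memory node
print(apparent_usage_gb(150, 16))  # 2400 -> ~2.4 TB, job gets killed
```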

@SilentGene

Yes, it crashed with an out-of-memory error when I tried to run it with 16 CPUs.

@aaronmussig
Member

Closing this issue, as the cause has been identified and a workaround provided. Unfortunately we can't do much with pplacer; EPA may be a solution in the future.

Until then, the workaround is available for reference in the FAQ. Additionally, overriding the number of threads pplacer can use will be available as a feature in the next release (#195).
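Once that feature lands, a sketch of the intended usage (the flag name follows #195; other options are reused from the original post):

```shell
# Keep the rest of the workflow parallel, but pin pplacer to one CPU so
# batch-system memory accounting doesn't multiply the apparent request.
gtdbtk classify_wf --cpus 28 --pplacer_cpus 1 \
    -x fasta --genome_dir /data/bins --out_dir bins_out
```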

@jolespin

jolespin commented Jun 8, 2021

pplacer does a maximum-likelihood placement per genome instead of inferring a de novo tree. Only real replacement is EPA, but this requires more memory than pplacer.

I've been encountering some memory issues as well so I decided to take a dive into any new programs that may have come out after the last post. I found one:

Syst Biol. 2020 May 1;69(3):566-578. doi: 10.1093/sysbio/syz063.
APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments
Metin Balaban 1, Shahab Sarmashghi 2, Siavash Mirarab 2
PMID: 31545363 PMCID: PMC7164367 DOI: 10.1093/sysbio/syz063
Abstract
Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to enjoy benefits of dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze data sets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially so that APPLES on dense trees is more accurate than ML on sparser trees, where it can run. Finally, APPLES can accurately identify samples without assembled reference or aligned queries using kmer-based distances, a scenario that ML cannot handle. APPLES is available publically at github.com/balabanmetin/apples.

Keywords: Distance-based methods; genome skimming; phylogenetic placement.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7164367/

APPLES was an order of magnitude or more faster and less memory-hungry than ML tools (pplacer and EPA-ng) for single query runs. However, for placing large numbers of queries (e.g., as found in metagenomic data sets) on a relatively small backbone, EPA-ng had an advantage since it is specifically designed to tackle scalability of multiple queries.

There's also a version 2 of this, but it hasn't been peer reviewed: https://www.biorxiv.org/content/10.1101/2021.02.14.431150v1

Not sure if this is helpful at all; I just wanted to look into a few alternatives just in case.

@cerebis

cerebis commented Oct 2, 2021

I have a fair bit of experience successfully running GTDB-Tk on datasets containing low hundreds of genome bins. Strangely, today I have been repeatedly thwarted by PBS Pro job failures due to exceeding requested memory resources.

Didn't earlier versions of GTDB-Tk default --pplacer_cpus to 1 or 2 threads? I could have sworn that was the case, since I immediately burnt my fingers trying to squeeze in as many pplacer threads as possible a few years ago. The concurrency benefit didn't seem worth the memory demand, however, and since then I've just left it alone.

I see from the CLI help that it currently defaults to the value of --cpus, and my recent initial attempts used --cpus 50.

Anyway, restricting pplacer to 2 threads still resulted in a memory termination as follows:

PBS: job killed: mem 690743080kb exceeded limit 262144000kb

So even if pplacer's true memory consumption isn't scaling the way it appears to on Linux (CentOS 8.4 here), extrapolating backwards, it looks like GTDB-Tk v1.5.1 with GTDB release 202 has jumped to a minimum footprint of ~300 GB?
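As a back-of-envelope check on that extrapolation (PBS reports memory in KiB; two pplacer threads were used in the run above):

```python
# Convert the PBS kill message above from KiB to GiB, then divide by the
# two pplacer threads to back out the apparent per-process footprint.
KIB_PER_GIB = 1024 ** 2  # 1,048,576 KiB in a GiB

used_gb = 690_743_080 / KIB_PER_GIB    # ~658.7 GiB charged to the job
limit_gb = 262_144_000 / KIB_PER_GIB   # 250 GiB requested
per_thread_gb = used_gb / 2            # ~329 GiB, i.e. a ~300 GB footprint

print(round(used_gb, 1), round(limit_gb, 1), round(per_thread_gb, 1))
```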

I guess I just didn't notice a month ago on a high-memory machine.

[ed] I see there was a GTDB-Tk version bump to 1.6, and elsewhere the memory footprint is expected to be 200 GB.
