
memory issues. requiring mass amounts of memory for few genomes. HELP? #124

Closed
raw937 opened this issue May 14, 2019 · 21 comments
@raw937

raw937 commented May 14, 2019

Hello,

I have been trying to place my genomes in the tree with pplacer, but it fails on memory every time.
I have 512 GB of RAM on this node:
First run on 28 genomes:
slurmstepd: error: Job 9369956 exceeded memory limit (701324448 > 258048000)
slurmstepd: error: Exceeded job memory limit

Second run with 13 genomes:
Job 9370293 exceeded memory limit (1314666388 > 263856128)
slurmstepd: error: Exceeded job memory limit

Why is it requiring so much memory?

The options I am using:
export GTDBTK_DATA_PATH=/opt/apps/data/gtdbtk/release86/
gtdbtk classify_wf --cpus 28 -x fasta --genome_dir /data/bins --out_dir bins_out

HELP?

@donovan-h-parks
Collaborator

We believe this is an issue with how Linux reports memory usage when pplacer is run with multiple CPUs. It appears that Linux believes the amount of memory being requested is the per-process memory multiplied by the number of CPUs. As such, this can cause issues with queuing systems. Can you try using a single CPU? It will naturally run slower.
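A minimal sketch of that workaround as a Slurm job script, reusing the paths from the post above (the memory request is illustrative, not a recommendation):

```shell
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=250G   # give the single pplacer process as much memory as the node allows

# Same invocation as above, but with a single CPU so the batch system's
# per-thread memory accounting doesn't multiply the apparent request.
export GTDBTK_DATA_PATH=/opt/apps/data/gtdbtk/release86/
gtdbtk classify_wf --cpus 1 -x fasta --genome_dir /data/bins --out_dir bins_out
```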

@raw937
Author

raw937 commented May 14, 2019

I tried with two genomes.
Required even more memory!!
slurmstepd: error: Job 9370432 exceeded memory limit (1691934724 > 263856128), being killed
slurmstepd: error: Exceeded job memory limit

So try a single CPU? But won't I run out of memory?

@raw937
Author

raw937 commented May 14, 2019

hmm, well that worked for two genomes. So strange. Is there a way to fix this?

@donovan-h-parks
Collaborator

We haven't been able to find a solution to this. We strongly believe it has to do with pplacer, which is a third-party dependency that is no longer being actively developed.

@raw937
Author

raw937 commented May 14, 2019

hmm, bummer man. could fasttree be a replacement for pplacer?

@donovan-h-parks
Collaborator

pplacer does a maximum-likelihood placement per genome instead of inferring a de novo tree. The only real replacement is EPA, but it requires more memory than pplacer.

@ganiatgithub

Hi, same issue here. I'm running gtdbtk classify_wf for 31 MAGs. I requested 24 CPUs and 250 GB RAM on a node, and the error message says the job required 2.7 TB of memory. So the solution is to run with 1 CPU? How would that affect the speed?

@donovan-h-parks
Collaborator

It isn't ideal, though with just 31 genomes it isn't an issue.

@raw937
Author

raw937 commented May 16, 2019

Yup, use 1 CPU but give it all the memory you can. It was very fast for me.

@ganiatgithub

Amazing, fixed now, thank you guys!

@raw937
Author

raw937 commented May 16, 2019

Welcome! It drove me insane for like a week.

@danfulop

danfulop commented Jul 11, 2019

@dparks1134 would EPA-ng (https://github.com/Pbdas/epa-ng) be any better at RAM usage than EPA? Just a thought, but you're probably referring to the stand-alone EPA-ng and not the original within-RAxML EPA.

For what it's worth, I'm experiencing the same memory issue, but on AWS. In my case GTDBtk's test fails due to pplacer, but without any indication that it's a memory limitation. However, I'm pretty sure that's what's happening.

Could I use mmap to circumvent the memory issue, as suggested on the pplacer site (http://matsen.github.io/pplacer/generated_rst/pplacer.html#memory-usage)?

I ask because I am not running pplacer directly, so I don't know how (or if) I can pass it the --mmap-file flag through GTDBtk.

Any other tips for successfully running GTDBtk on AWS would be much appreciated!

@donovan-h-parks
Collaborator

Hello. You can run pplacer in this mode using the --scratch_dir flag.
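A sketch of what that looks like, reusing the options from earlier in the thread (the scratch path is a placeholder; it needs roughly as much free disk space as pplacer would otherwise use in RAM):

```shell
# --scratch_dir makes pplacer write its likelihood vectors to a
# memory-mapped file on disk (its --mmap-file mode) instead of RAM.
gtdbtk classify_wf --cpus 1 -x fasta \
    --genome_dir /data/bins --out_dir bins_out \
    --scratch_dir /scratch/pplacer_mmap
```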

@SilentGene

pplacer requires a lot of memory since it caches likelihood vectors for all of the internal nodes, so GTDBtk can consume a huge amount of RAM when trying to place users' genomes in a huge reference tree. The --mmap-file flag could probably be a workaround, using disk I/O instead of RAM. However, it's still a problem when you need something like 2.7 TB of available space on your disk.
But I was also curious how the authors managed to use 64 CPUs to analyze 1,000 genomes in 1 hour, as shown in the Hardware requirements section. Is it because this bug only affects Linux, not Mac?

@donovan-h-parks
Collaborator

Where does the 2.7 TB come from? On our system, we run GTDB-Tk using 64 CPUs to process 1,000 genomes, which uses ~100 GB of memory. This takes around 1 hour. We are running Ubuntu. I know a number of people have had issues with pplacer on OS X, and it also seems to behave oddly with some batch systems.

@SilentGene

It comes from the post above from @ganiatgithub. I also got a similar error when I tried to use 16 CPUs. I ran GTDBtk on an HPC using the Slurm system, and it looks like it requires about 150 GB of RAM for every thread (so it would be at least 2 TB of memory for 16 CPUs). Now I often use 3 CPUs and 500 GB of memory for GTDBtk, and it works fine. Yeah, you are right, maybe this is a bug in some batch systems.

@donovan-h-parks
Collaborator

We've had a few people report this. pplacer is written in OCaml, so my best guess is that there is something odd about how it spawns threads which can make it look like every thread is using 150 GB, when in fact this is the usage for the entire program. If you run it with 16 CPUs, does GTDB-Tk (pplacer) actually crash with a memory error?
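That accounting quirk can be sketched numerically, using the ~150 GB per-"thread" figure and the 16-CPU run reported above (illustrative numbers only):

```python
# If a batch system attributes the whole process's resident memory to each
# thread, the usage it enforces against scales with the thread count, even
# though the program's real footprint does not.
def apparent_usage_gb(actual_gb, threads):
    """Memory a mis-accounting scheduler would charge the job."""
    return actual_gb * threads

# Figures from this thread: ~150 GB real footprint.
print(apparent_usage_gb(150, 1))   # 150 -> fits on a large-memory node
print(apparent_usage_gb(150, 16))  # 2400 -> ~2.4 TB, job gets killed
```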

@SilentGene

Yes, it crashed with an out-of-memory error when I tried to run it with 16 CPUs.

@aaronmussig
Member

Closing this issue, as the cause has been identified and a workaround provided. Unfortunately we can't do much with pplacer; EPA may be a solution in the future.

Until then, the workaround is available for reference in the FAQ. Additionally, overriding the number of threads pplacer can use will be available as a feature in the next release (#195).
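Once that feature lands, a sketch of the intended usage (the flag name follows #195; other options are reused from the original post):

```shell
# Keep the rest of the workflow parallel, but pin pplacer to one CPU so
# batch-system memory accounting doesn't multiply the apparent request.
gtdbtk classify_wf --cpus 28 --pplacer_cpus 1 \
    -x fasta --genome_dir /data/bins --out_dir bins_out
```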

@jolespin

jolespin commented Jun 8, 2021

pplacer does a maximum-likelihood placement per genome instead of inferring a de novo tree. Only real replacement is EPA, but this requires more memory than pplacer.

I've been encountering some memory issues as well so I decided to take a dive into any new programs that may have come out after the last post. I found one:

Syst Biol. 2020 May 1;69(3):566-578. doi: 10.1093/sysbio/syz063.
APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments
Metin Balaban 1, Shahab Sarmashghi 2, Siavash Mirarab 2
PMID: 31545363 PMCID: PMC7164367 DOI: 10.1093/sysbio/syz063
Abstract
Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to enjoy benefits of dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze data sets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially so that APPLES on dense trees is more accurate than ML on sparser trees, where it can run. Finally, APPLES can accurately identify samples without assembled reference or aligned queries using kmer-based distances, a scenario that ML cannot handle. APPLES is available publically at github.com/balabanmetin/apples.

Keywords: Distance-based methods; genome skimming; phylogenetic placement.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7164367/

APPLES was an order of magnitude or more faster and less memory-hungry than ML tools (pplacer and EPA-ng) for single query runs. However, for placing large numbers of queries (e.g., as found in metagenomic data sets) on a relatively small backbone, EPA-ng had an advantage since it is specifically designed to tackle scalability of multiple queries.

There's also a version 2 of this, but it hasn't been peer reviewed: https://www.biorxiv.org/content/10.1101/2021.02.14.431150v1

Not sure if this is helpful at all; I just wanted to look into a few alternatives just in case.

@cerebis

cerebis commented Oct 2, 2021

I have a fair bit of experience successfully running GTDB-Tk on datasets containing low hundreds of genome bins. Strangely, today I have been repeatedly thwarted by PBS Pro job failures due to exceeding requested memory resources.

Didn't earlier versions of GTDB-Tk default --pplacer_cpus to 1 or 2 threads? I could have sworn that was the case, since I immediately burnt my fingers trying to squeeze in as many pplacer threads as possible a few years ago. The concurrency benefit didn't seem worth the memory demand, however, and since then I've just left it alone.

I see from the CLI help that it currently defaults to the value of --cpus, and my recent initial attempts used --cpus 50.

Anyway, restricting pplacer to 2 threads still resulted in a memory termination as follows:

PBS: job killed: mem 690743080kb exceeded limit 262144000kb

So even if pplacer's true memory consumption isn't scaling the way it appears to on Linux (CentOS 8.4 here), extrapolating backwards, it looks like GTDB-Tk v1.5.1 with GTDB release 202 has jumped to a minimum footprint of ~300 GB?
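As a back-of-envelope check on that extrapolation (PBS reports memory in KiB; two pplacer threads were used in the run above):

```python
# Convert the PBS kill message above from KiB to GiB, then divide by the
# two pplacer threads to back out the apparent per-process footprint.
KIB_PER_GIB = 1024 ** 2  # 1,048,576 KiB in a GiB

used_gb = 690_743_080 / KIB_PER_GIB    # ~658.7 GiB charged to the job
limit_gb = 262_144_000 / KIB_PER_GIB   # 250 GiB requested
per_thread_gb = used_gb / 2            # ~329 GiB, i.e. a ~300 GB footprint

print(round(used_gb, 1), round(limit_gb, 1), round(per_thread_gb, 1))
```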

I guess I just didn't notice a month ago on a high-memory machine.

[ed] I see there was a GTDB-Tk version bump to 1.6, and elsewhere the memory footprint is expected to be 200 GB.
