Parallelization #10

pabloati · 2023-11-09T13:49:37Z

Hi, I would like to run MAGinator on a fairly large dataset. I have around 420 samples, with 60 bins per sample on average, and the preprocessed reads are around 6 GB per sample.

I have been running a subset of the samples (5) as a trial run on a cluster (40ppn and 180GB), and it has been running for more than 24 hours already.

Is there any possibility to run MAGinator in parallel to speed up the process? I am running the following command:

maginator -v trial/maginator_clusters.tsv \
  -r trial/maginator_reads.csv \
  -c trial/maginator_contigs.fasta \
  -o trial/maginator \
  -g /home/people/pablop/workdir/databases/gtdb_release207_v2

Thank you,
Pablo

Russel88 · 2023-11-15T10:36:14Z

Hi Pablo

If you're on a compute cluster, the best way to speed it up is to use multiple nodes. If you use the qsub system, you can add this to the maginator command:
--cluster qsub --cluster_info "-l nodes=1:ppn={cores}:thinnode,mem={memory}gb,walltime={runtime}"
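
For reference, here is a minimal sketch of the command from the first post with these two options appended. It only prints the command (a dry run) so it can be reviewed before submitting. The paths are the ones from the original post, the CLUSTER_INFO variable name is just for illustration, and the {cores}, {memory} and {runtime} placeholders are kept literally as written above, on the assumption that MAGinator fills them in per job:

```shell
# Resource template handed to qsub; placeholders are left literal on purpose
# (assumption: MAGinator substitutes them for each submitted job).
CLUSTER_INFO='-l nodes=1:ppn={cores}:thinnode,mem={memory}gb,walltime={runtime}'

# Dry run: print the full command for review.
# Remove the leading "echo" to actually launch MAGinator.
echo maginator \
  -v trial/maginator_clusters.tsv \
  -r trial/maginator_reads.csv \
  -c trial/maginator_contigs.fasta \
  -o trial/maginator \
  -g /home/people/pablop/workdir/databases/gtdb_release207_v2 \
  --cluster qsub \
  --cluster_info "$CLUSTER_INFO"
```

The resource string (for example the thinnode property) is specific to the cluster, so it will likely need adjusting for other schedulers.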

Can you see in the logs how far in the process maginator is? With 5 samples it shouldn't take that long.

pabloati · 2023-11-15T10:43:45Z

Hi Russel,

My bad, I didn't include those options at the beginning. I have added them now, but it seems to have gotten stuck after the refinement step. This is the output from MAGinator's log:

[2023-11-14 11:56:38] INFO: Running MAGinator version 0.1.18
[2023-11-14 11:56:40] INFO: Filtering bins
[2023-11-14 11:58:38] INFO: 297 bins in 76 VAMB clusters left after filtering
[2023-11-14 11:58:38] INFO: Classifying genomes with GTDB-tk
[2023-11-14 12:42:58] INFO: 76 clusters could be classified
[2023-11-14 12:42:58] INFO: Clustering genes and parsing GTDB-tk results
[2023-11-14 12:47:59] INFO: 76 VAMB clusters merged into 76 metagenomic species
[2023-11-14 12:47:59] INFO: Filtering of the gene clusters and readmapping
[2023-11-14 13:23:24] INFO: Identifying signature genes
[2023-11-14 14:26:57] INFO: A total of 76 clusters are included in the analysis.

Russel88 · 2023-11-15T12:41:20Z

Can you post the log for the signature_gene workflow?

pabloati · 2023-11-15T13:47:39Z

I have not been able to find that log. Should it be in the logs directory created by maginator?

I have been looking at your code, and the process stops at the refinement rule. I get the output file from that step, and its logs are there and show no error, but the next rule (gene_counts) is never executed.
