Paralellization #10

Open
pabloati opened this issue Nov 9, 2023 · 4 comments

Comments

@pabloati
Contributor

pabloati commented Nov 9, 2023

Hi, I would like to run MAGinator on a fairly large dataset. I have around 420 samples, with 60 bins per sample on average, and the preprocessed reads are around 6 GB per sample.

As a trial, I have been running a subset of 5 samples on a cluster node (40 ppn and 180 GB of memory), and it has already been running for more than 24 hours.

Is there any way to run MAGinator in parallel to speed up the process? I am running the following command:

maginator -v trial/maginator_clusters.tsv \
    -r trial/maginator_reads.csv \
    -c trial/maginator_contigs.fasta \
    -o trial/maginator \
    -g /home/people/pablop/workdir/databases/gtdb_release207_v2

Thank you,
Pablo

@Russel88
Owner

Hi Pablo

If you're on a compute cluster, the best way to speed it up is to use multiple nodes. If you use the qsub system, this can be added to the maginator command:
--cluster qsub --cluster_info "-l nodes=1:ppn={cores}:thinnode,mem={memory}gb,walltime={runtime}"
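
For reference, the two pieces can be combined as in the sketch below, which only assembles the full command from the options quoted in this thread and prints it, one argument per line, without running anything. The `{cores}`, `{memory}` and `{runtime}` placeholders are kept literally, on the assumption that MAGinator fills them in per job, and `thinnode` is presumably specific to one cluster's queue setup, so it may need to be changed or dropped elsewhere.

```shell
#!/usr/bin/env bash
# Dry-run sketch: build the MAGinator call from this thread and print it.
set -euo pipefail

# Input and output paths copied from the original post.
args=(
  -v trial/maginator_clusters.tsv
  -r trial/maginator_reads.csv
  -c trial/maginator_contigs.fasta
  -o trial/maginator
  -g /home/people/pablop/workdir/databases/gtdb_release207_v2
)

# Cluster options as suggested in the reply. The quoted string is a single
# argument; the {…} placeholders are left untouched (assumption: MAGinator
# substitutes them for each submitted job).
cluster_args=(
  --cluster qsub
  --cluster_info "-l nodes=1:ppn={cores}:thinnode,mem={memory}gb,walltime={runtime}"
)

cmd=(maginator "${args[@]}" "${cluster_args[@]}")

# Print one argument per line for inspection.
# To actually run the command, replace this line with: "${cmd[@]}"
printf '%s\n' "${cmd[@]}"
```

Keeping the arguments in an array preserves the quoting of the `--cluster_info` value, which contains spaces and braces and is easy to break when the command is wrapped across several lines by hand.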

Can you see in the logs how far in the process maginator is? With 5 samples it shouldn't take that long.

@pabloati
Contributor Author

pabloati commented Nov 15, 2023

Hi Russel,

It was my mistake not to include those options at the beginning. I have added them now, and the run seems to have gotten stuck after the refinement step. This is the output from MAGinator's log:

[2023-11-14 11:56:38] INFO: Running MAGinator version 0.1.18
[2023-11-14 11:56:40] INFO: Filtering bins
[2023-11-14 11:58:38] INFO: 297 bins in 76 VAMB clusters left after filtering
[2023-11-14 11:58:38] INFO: Classifying genomes with GTDB-tk
[2023-11-14 12:42:58] INFO: 76 clusters could be classified
[2023-11-14 12:42:58] INFO: Clustering genes and parsing GTDB-tk results
[2023-11-14 12:47:59] INFO: 76 VAMB clusters merged into 76 metagenomic species
[2023-11-14 12:47:59] INFO: Filtering of the gene clusters and readmapping
[2023-11-14 13:23:24] INFO: Identifying signature genes
[2023-11-14 14:26:57] INFO: A total of 76 clusters are included in the analysis.

@Russel88
Owner

Can you post the log for the signature_gene workflow?

@pabloati
Contributor Author

I have not been able to find that log. Should it be in the logs directory created by maginator?

I have been looking at your code, and the process stops at the refinement rule. The output file from that step is produced, and its logs are there and show no error, but the next rule (gene_counts) is never executed.
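
One generic way to hunt for the missing log is to list the most recently modified files under the output directory, since whichever step ran last should have written there. This is only a sketch: `newest_files` is a hypothetical helper, not part of MAGinator, the directory is the `-o` value from the original command, and it relies on GNU `find` (for `-printf`), which is an assumption about the cluster.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of MAGinator): print the most recently
# modified files under a directory, newest first.
# Usage: newest_files DIR [COUNT]
newest_files() {
  local dir="$1" count="${2:-10}"
  # %T@ is the modification time in seconds since the epoch (GNU find).
  # awk keeps the first COUNT lines and strips the timestamp column.
  find "$dir" -type f -printf '%T@ %p\n' \
    | sort -rn \
    | awk -v n="$count" 'NR <= n { sub(/^[^ ]+ /, ""); print }'
}

# Output directory taken from the -o option in the original command.
outdir="trial/maginator"
if [ -d "$outdir" ]; then
  newest_files "$outdir" 20
fi
```

Piping the result through `grep -i signature` would narrow it to files from the signature gene step, assuming their names contain that word.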
