- GCC with C++11, HTSlib.
- Python3 with numpy, scipy, networkx and Bio.
- Initial binning tools MaxBin2 and MetaBat2.
- Alignment tools bwa and samtools.
- HTSlib: After compiling, there would be
include
andlib
under your specified folder. You need append/path/to/include
to$CPLUS_INCLUDE_PATH
, and add/path/to/lib
to both$LIBRARY_PATH
and$LD_LIBRARY_PATH
. - MaxBin2 :
run_MaxBin.pl
should be avalible. - MetaBat2:
jgi_summarize_bam_contig_depths
andmetabat2
should be avalible.
Download and compile:
git clone https://github.com/ZhangZhenmiao/METAMVGL.git
cd METAMVGL && make && chmod +x *.py
Add components of METAMVGL to $PATH:
export PATH=/path/to/METAMVGL:$PATH
For metaSPAdes, the assembly graph (assembly_graph.fastg) is already in the output folder.
For MEGAHIT, the assembly graph is derived from final.contigs.fa:
megahit_toolkit contig2fastg <k_mer> final.contigs.fa > final.contigs.fastg
The simplest way to run METAMVGL is the wrapper, that runs initial binning (MetaBat2 or MaxBin2), paired-end graph extraction, and METAMVGL in one command:
usage: run_METAMVGL.py [-h] -a ASSEMBLER [-b INITIAL_BINNING] -r1 READS1 -r2
READS2 -c CONTIGS [-p PATHS] -g ASSEMBLY_GRAPH
[-m MAPPING_QUALITY] [-s IDENTITY] [-i INSERT_SIZE]
[-n PE] [-u MAX_ITER] [-d THRESHOLD] [-t THREADS] -o
OUTPUT
optional arguments:
-h, --help show this help message and exit
-a ASSEMBLER, --assembler ASSEMBLER
the assembler that was used to generate contigs
(choose from metaSPAdes and MEGAHIT)
-b INITIAL_BINNING, --initial_binning INITIAL_BINNING
the initial binning tool to create initial binning
results (choose from MetaBat2 and MaxBin2, default
MetaBat2)
-r1 READS1, --reads1 READS1
the path to reads1
-r2 READS2, --reads2 READS2
the path to reads2
-c CONTIGS, --contigs CONTIGS
the path to the contigs
-p PATHS, --paths PATHS
the path to the contigs.paths file under metaSPAdes
output folder, needed for metaSPAdes
-g ASSEMBLY_GRAPH, --assembly_graph ASSEMBLY_GRAPH
the path to the assembly_graph.fastg file under
metaSPAdes output folder (metaSPAdes), or fastg format
file ceretad by megahit_toolkit (MEGAHIT)
-m MAPPING_QUALITY, --mapping_quality MAPPING_QUALITY
the threshold of mapping quality for reads alignment
(default 10)
-s IDENTITY, --identity IDENTITY
the threshold of alignment identity for reads
alignment (default 0.95)
-i INSERT_SIZE, --insert_size INSERT_SIZE
the insert size of paired-end reads (default 270)
-n PE, --pe PE the minimum number of paired-end reads to support a PE
link (default 3)
-u MAX_ITER, --max_iter MAX_ITER
the maximun number of iteration of label propagation
(default 100)
-d THRESHOLD, --threshold THRESHOLD
the threshold to stop iteration of label propagation
(default 0.00000001)
-t THREADS, --threads THREADS
the number of threads for initial binning tools
(default 16)
-o OUTPUT, --output OUTPUT
output dir
Example 1: To run METAMVGL for metaSPAdes assembly, and use MetaBat2 to initial bin:
run_METAMVGL.py -a metaspades -r1 /path/to/reads1.fq.gz -r2 /path/to/reads2.fq.gz -c metaspades/contigs.fasta -p metaspades/contigs.paths -g metaspades/assembly_graph.fastg -o METAMVGL_output -t 100
Example2: To run METAMVGL for MEGAHIT assembly, and use MaxBin2 to initial bin:
run_METAMVGL.py -a megahit -b maxbin2 -r1 /path/to/reads1.fq.gz -r2 /path/to/reads2.fq.gz -c megahit/final.contigs.fa -g megahit/final.contigs.fastg -o METAMVGL_output -t 100
The results would be in METAMVGL_output/METAMVGL
.
We support metaSPAdes and MEGAHIT.
Initial binning can use any binning tools. To convert the initial binning results to the input format of METAMVGL, we modified prepResult.py, and the usage is:
prepResult.py --binned /path/to/initial_binning_result --assembler assembler_type_(SPAdes/MEGAIHT) --output /path/to/output_folder
It will create initial_contig_bins.csv
in /path/to/output_folder
. It is the input to METAMVGL.py (--binned).
We generate the assembly graph (.ag) and PE graph (.pe) by prep_graph
:
usage: prep_graph --assembler=string --assembly-graph=string --bam=string --output=string [options] ...
options:
-a, --assembler the assembler used to produce contigs, currently support metaSPAdes and MEGAHIT (string)
-c, --contigs the path to the contigs, only needed for MEGAHIT (string [=final.contigs.fa])
-p, --paths the path to the .paths file, only needed for metaSPAdes (string [=contigs.paths])
-g, --assembly-graph the path to the assembly graph in fastg (string)
-b, --bam the path to the alignment bam file (string)
-m, --mapping-quality the threshold of mapping quality (double [=10])
-i, --identity the threshold of identity (double [=0.95])
-s, --insert-size the insert size of paired-end reads (int [=270])
-n, --pe the minimum number of paired-end reads to support a link (int [=3])
-o, --output the prefix to output (string)
-?, --help print this message
We create the binning results by METAMVGL.py
:
usage: METAMVGL.py [-h] --contigs CONTIGS --assembler ASSEMBLER
--assembly_graph ASSEMBLY_GRAPH --PE_graph PE_GRAPH
--binned BINNED [--max_iter MAX_ITER] [--thresh THRESH]
--output OUTPUT
optional arguments:
-h, --help show this help message and exit
--contigs CONTIGS path to contigs file
--assembler ASSEMBLER
assembler used (metaSPAdes or MEGAHIT)
--assembly_graph ASSEMBLY_GRAPH
path to the .ag file
--PE_graph PE_GRAPH path to the .pe file
--binned BINNED path to the .csv file as initial binning
--max_iter MAX_ITER max iteration (default 100)
--thresh THRESH stop threshold (default 0.00000001)
--output OUTPUT output folder
In the output folder, we provide two types of binning results:
binning_result.csv
, each line is contig_name, cluster_idcluster.*.fasta
, the contigs in fasta format of each cluster
- The comparison results can be accessed here.
- The machine used for comparison is CentOS 8.2 (64-bit), with Dual 26-core Intel Xeon Gold 6230R 2.10GHz CPU and 768GB RAM.
- The measured time and memory include GraphBin/METAMVGL binning on MaxBin2/MetaBAT2 initial binning results from metaSPAdes/MEGAHIT assembly on BMock12, SYNTH64 and Sharon datasets.
- The
time_memory/README.md
has the commands for binning, the evaluation results are intime_memory/*/*/*.time
and generated bytime_memory/run_compare.sh
.