Module 5: Metagenomics 2

Assembly and Metagenome Assembled Genomes Tutorial

In this tutorial, you will work with a real-life set of metagenomic datasets taken from a bioreactor fed methane and ferric iron to enrich microorganisms involved in methane oxidation and iron reduction. The bioreactor was originally seeded with oxygen-depleted sediment from an iron- and methane-rich coastal site in the Bothnian Sea (1).

There are three time points and four samples associated with this dataset. There is also a published, full analysis of these sequences, including genome binning and metabolic examination – the paper is linked at the end of this document, but we recommend you not read it until you have worked through the tutorial. Note that for this tutorial we worked from the raw sequencing reads and our pipeline differs slightly from the original study's, so our conclusions may differ from those in the published paper.

Many of the steps involved in metagenome assembly and binning are computationally demanding and can take a long time. Our focus today is on the analysis of metabolic potential and ecosystem roles of the organisms in this bioreactor. We have completed many of the processing steps in advance so that we can focus on the ecology questions in this dataset. See the appendices for a complete walk-through of metagenome read processing, assembly, and binning. We will start with assembled, binned Metagenome Assembled Genomes (MAGs).

The datasets:

Table 1: Dataset information, including timepoint, accession, and metagenome size.

| Time point | Enrichment length | SRA accession | Total reads | Read length | Total raw sequence (Gbp) |
| --- | --- | --- | --- | --- | --- |
| T0 | 0 months | SRR12686322 | 6,155,126 (x2) | 300 bp | 3.69 |
| T1 | 16 months | SRR12686321 | 5,939,747 (x2) | 300 bp | 3.56 |
| T2A | 32 months | SRR12686319 | 7,678,399 (x2) | 300 bp | 4.61 |
| T2B | 32 months | SRR12686320 | 7,946,290 (x2) | 300 bp | 4.77 |
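(The final column follows from the read counts: for T0, 6,155,126 reads × 2 files × 300 bp ≈ 3.69 Gbp.)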

These datasets were downloaded using the SRA Toolkit from NCBI (2). Example command (do not run):

fasterq-dump --split-files SRR12686322

fasterq-dump with --split-files converts the SRA archive to fastq files automatically, and splits the reads into forward (_1) and reverse (_2) files.
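If you were fetching all four datasets yourself, a simple loop works (a sketch, not to be run here – it assumes the SRA Toolkit is installed and on your PATH):

for acc in SRR12686322 SRR12686321 SRR12686319 SRR12686320; do
    fasterq-dump --split-files "$acc"   # writes ${acc}_1.fastq and ${acc}_2.fastq
done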

Table 2: Number of bins generated by the different binning algorithms. Note the total number of bins from DAS Tool is lower than most of the other algorithms – DAS Tool implements an initial quality filter, so only bins passing this filter are retained (3).

| Program | T0 | T1 | T2A | T2B |
| --- | --- | --- | --- | --- |
| MetaBAT2 | 20 | 34 | 35 | 34 |
| CONCOCT | 40 | 46 | 45 | 49 |
| MaxBin2 | 9 | 29 | 28 | 28 |
| DAS Tool | 9 | 20 | 25 | 26 |

We will move forward using the DAS Tool quality-curated bins.

Phylogenetic, Abundance, and Metabolic Analyses

In this active part of the tutorial, we will work through some analyses of the bins generated by the workflow detailed in the appendices. The final bin set included 80 MAGs, which we will quality filter, taxonomically classify, and use to generate gene annotations and metabolic predictions.

Initiate the conda environment:

. "/usr/local/conda/etc/profile.d/conda.sh"

conda activate /usr/local/conda/envs/DRAM

3.1 CheckM

CheckM requires nearly all of the RAM available to you, and nearly the full length of this tutorial to run, so it was precomputed. We ran CheckM (4) on the 80 bins to identify high-quality (>70% complete, <10% contaminated) MAGs for subsequent analyses. We first set the directory for the databases. Example (do not run):

checkm data setRoot ~/CourseData/Module5/Data/databases/CheckM_data

We then ran CheckM, using 4 threads (-t 4) and --reduced_tree to allow for faster, lower-memory processing.

Example Base command:

checkm lineage_wf -t 4 -x .fas --reduced_tree [bin folder] [output folder]  

We ran:

checkm lineage_wf -t 4 -x .fas --reduced_tree ~/CourseData/Module5/Data/All_bins ~/CourseData/Module5/Data/CheckM_output 

From this, 51 of the 80 bins passed the quality thresholds.
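If you want to reproduce that filter yourself, CheckM's quality table can be filtered with awk. A minimal sketch, assuming the results were exported as a tab-separated table (e.g., with checkm qa --tab_table -f checkm_results.tsv) and that completeness and contamination sit in the 12th and 13th columns – check your own header first, as column positions vary between CheckM versions:

# print the IDs of bins that are >70% complete and <10% contaminated
awk -F'\t' 'NR > 1 && $12 > 70 && $13 < 10 {print $1}' checkm_results.tsv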

3.2 GTDB-Tk

GTDB-Tk is a supervised, automated taxonomic classifier (5). It works from the GTDB database and assigns each MAG a taxonomy at the lowest possible rank. GTDB-Tk requires more memory than is available on your AWS instance (~200 GB of RAM), but it is also available on KBase if you don't have sufficient computational capacity locally.

We ran GTDB-tk on the MAGs that passed the quality threshold as follows (do not run):

gtdbtk classify_wf --cpus 64 --genome_dir High_QC_bins/ --out_dir ./accessory_files

Take a look at the output:

head ~/CourseData/Module5/Data/High_QC_bins_with_taxonomy/accessory_files/GTDB_combined

From this output, we renamed the MAGs to their taxonomy. Compare the original files:

ls ~/CourseData/Module5/Data/High_QC_bins

To the renamed files:

ls ~/CourseData/Module5/Data/High_QC_bins_with_taxonomy
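The renaming itself can be scripted. A sketch, assuming GTDB_combined is a two-column, tab-separated file of bin name and classification (check the actual layout of your file first):

while IFS=$'\t' read -r bin taxonomy; do
    # make the classification string safe for file names, then prepend it
    safe=$(echo "$taxonomy" | tr ' ;' '__')
    cp High_QC_bins/"${bin}.fas" High_QC_bins_with_taxonomy/"${safe}_${bin}.fas"
done < GTDB_combined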

3.3 DRAM

DRAM (6) was precomputed, as installation of the full software package requires ~500 GB of RAM. DRAM is also available on KBase if you are unable to install or host it locally.

We ran DRAM on the 51 bins passing quality filters. The DRAM annotate step took approximately 30 minutes per MAG, for a total run time of 26 hours.

DRAM.py annotate -i '~/CourseData/Module5/Data/High_QC_bins/*.fas' -o ~/CourseData/Module5/Data/High_QC_bins_annotation --threads 18

Examine the format of the annotations file:

head ~/CourseData/Module5/Data/High_QC_bins_annotation/annotation/annotations.tsv

You will see many columns, with one row per gene. Genes have been called using Prodigal and are numbered by their position along their scaffold.

Q: What is the first gene's ID? What is it annotated as?

This annotations.tsv file is incredibly important for assigning metabolic function, and is a key strength of the DRAM program. Most other annotators do not maintain the connection between predicted ORF and annotation – DRAM allows you to go into the annotations.tsv file to find the gene associated with a prediction, and then into the genes.faa file to conduct further analyses.
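For example, once you have a gene ID from annotations.tsv, you can pull its protein sequence out of genes.faa with awk (a sketch – the gene ID below is a made-up placeholder, so substitute one from your own annotations.tsv):

# treat each fasta record as one unit and print the record whose ID matches
awk -v id="T0_bin.1_scaffold_12_3" 'BEGIN{RS=">"} $1 == id {printf ">%s", $0}' ~/CourseData/Module5/Data/High_QC_bins_annotation/annotation/genes.faa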

If you have a function of interest, you can search for it directly:

grep "pyruvate kinase" ~/CourseData/Module5/Data/High_QC_bins_annotation/annotation/annotations.tsv
grep "iron reductase" ~/CourseData/Module5/Data/High_QC_bins_annotation/annotation/annotations.tsv

Q: Compare the frequency of these two functions. Observe whether bins have one or more of these annotated proteins (remember, the bin name is at the front of the gene ID).
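One quick way to tally hits per bin (a sketch – it assumes the bin name is the first two underscore-separated tokens of the gene ID, so adjust the cut fields to match the IDs you actually see):

grep "iron reductase" ~/CourseData/Module5/Data/High_QC_bins_annotation/annotation/annotations.tsv | cut -f 1 | cut -d '_' -f 1-2 | sort | uniq -c | sort -rn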

However, the annotations.tsv file is bulky and difficult to interpret without a deep knowledge of microbial metabolism or an a priori expectation of what you are looking for. The DRAM distilled files are more user-friendly.

We distilled the data:

DRAM.py distill -i ~/CourseData/Module5/Data/High_QC_bins_annotation/annotation/annotations.tsv -o genome_summaries --trna_path ~/CourseData/Module5/Data/High_QC_bins_annotation/annotation/trnas.tsv --rrna_path ~/CourseData/Module5/Data/High_QC_bins_annotation/annotation/rrnas.tsv

Open the product.html file found in

~/CourseData/Module5/Data/High_QC_bins_annotation/genome_summaries/product.html

in your web browser (drag and drop).

You will see two linked displays. On the left is a heat map showing central carbon metabolism (Glycolysis, Pentose Phosphate Pathway, TCA cycle, electron transport chain). On the right are presence/absence plots of specific functions involved in carbohydrate metabolism, Nitrogen and Sulfur cycling, and other key functions. Hover over a cell for more information.

Q: Do all MAGs have a complex IV (cytochrome c oxidase) in their genomes? This is a hallmark gene for aerobic metabolism – organisms missing it are often obligate anaerobes.

Q: Which carbohydrate active enzyme (CAZy) class is most well-represented in this dataset?

Q: How are the MAGs organized?

Q: Are there any functions that are represented more frequently in the T2 timepoints compared to T0 or T1?

Q: Do any lineages have a complete denitrification pathway [Nitrate -> Nitrite -> Nitric oxide -> Nitrous oxide -> Dinitrogen]? Which are most complete?

3.4 FeGenie

FeGenie is a program that specifically examines the presence of iron-cycling genes within a genome or genomes of interest (7). If DRAM incorporated it, the results would appear as an additional blue/green presence/absence map. Since DRAM does not, and because our bioreactor was specifically fed an iron-rich enrichment medium, we will examine which organisms are involved in iron cycling using this more boutique functional analysis tool.

Run FeGenie:

FeGenie.py -bin_dir ~/CourseData/Module5/Data/High_QC_bins_with_taxonomy -bin_ext fas -t 4

When FeGenie has completed (~17 minutes):

Edit the fegenie_out/FeGenie-heatmap-data.csv file to have no trailing commas (no final comma on each line):

nano fegenie_out/FeGenie-heatmap-data.csv

You can use "option" to position your cursor where you need it to be. Save your output when done.
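Alternatively, a sed one-liner strips the trailing commas without manual editing (this rewrites the file in place, so consider keeping a backup copy first):

sed -i 's/,$//' fegenie_out/FeGenie-heatmap-data.csv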

Run the associated visualization scripts for FeGenie:

Rscript ~/CourseData/Module5/Data/FeGenie_scripts/dendro-heatmap.R fegenie_out/FeGenie-heatmap-data.csv fegenie_out
Rscript ~/CourseData/Module5/Data/FeGenie_scripts/DotPlot.R fegenie_out/FeGenie-heatmap-data.csv fegenie_out

View the .tiff files that were generated – these will be located in your fegenie_out directory.

Q: Which iron cycling functions are more strongly represented in the data: oxidation or reduction? Is there the potential for iron cycling or hand-offs within the community?

Q: Do the MAGs group according to taxonomy based on the iron cycling genes' presence/absence on the dendrogram?

Q: Who are the likely iron oxidizers?

Q: What role do the two Krumholzibacteriota play in iron cycling? Do they share similar iron cycling capacities? (hint: dotplot)

3.5 GToTree

GToTree is a rapid phylogenomic (whole genome phylogeny) program (8). It uses HMMs to identify genes of interest, aligns them, concatenates the single gene alignments, and generates a phylogeny from the concatenated alignment. It can work with fasta files, genbank files, or a combination, and will pull the NCBI reference genomes for you automatically.

You need to provide a list of fasta files, genbank files, or both.

For our MAGs:

ls ~/CourseData/Module5/Data/High_QC_bins_with_taxonomy/*.fas > GtoTree_fasta_files.txt

Run GToTree to make a tree containing only our MAGs. Leaf taxa will automatically be named the same as the fasta file:

GToTree -f GtoTree_fasta_files.txt -H Universal_Hug_et_al.hmm

This should take about 9 minutes.

We can view the trees using either iTOL (https://itol.embl.de/upload.cgi) or TreeView (http://etetoolkit.org/treeview/).

Upload the .tre file that GToTree has generated to one of these viewers and examine it.

Q: From the tree, what patterns do you see in terms of distribution of MAGs and their (a) taxonomy and (b) sample of origin? Is there evidence that MAGs are duplicated in this dataset?

Note: dRep is a commonly-used tool for dereplicating MAGs from across different samples, to leave one non-redundant set of MAGs for analysis (9).
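For reference only (we will not run it today), a minimal dRep invocation looks like the following sketch; note that dRep also needs CheckM available for its quality-filtering step:

dRep dereplicate dRep_output/ -g ~/CourseData/Module5/Data/High_QC_bins_with_taxonomy/*.fas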

Q: Based on branch lengths (or your own taxonomic knowledge), what do Thermoplasmatota and Halobacterota have in common? (hint: Halobacterota is named very poorly).

Appendix

The data for this tutorial was taken from "Enrichment of novel Verrucomicrobia, Bacteroidetes, and Krumholzibacteria in an oxygen-limited methane- and iron-fed bioreactor inoculated with Bothnian Sea sediments" by Martins et al., published in MicrobiologyOpen in February 2021 (1). https://onlinelibrary.wiley.com/doi/10.1002/mbo3.1175

Pre-computed pipeline

1. Assembly

We have pre-computed read quality processing and assembly steps as described below:

1.1 BBduk

BBDuk: decontaminates raw sequence reads, removing common contaminants. We used it to screen out sequencing adapters, PhiX phage, lambda phage, and BBDuk's suite of common sequencing artifacts.

Example command for T0 dataset:

bbduk.sh in=SRR12686322_1.fastq in2=SRR12686322_2.fastq ref=adapters,artifacts,phix,lambda out=SRR12686322_1_trim.fastq out2=SRR12686322_2_trim.fastq

BBDuk runs in ~7 seconds per sample. In this case, it removed ~0.38% of reads from each of the paired-end read files.

1.2 Sickle

Sickle: trims or removes low quality sequence reads based on the quality scores from the sequencer (10). We started from the BBduk-trimmed files. We specifically asked for only matched pairs to be kept in the paired files, so a singletons file (where one read passed the quality filter but its pair did not) is also created. We did not work with this singletons file further.

Example command for T0 dataset:

sickle pe -f SRR12686322_1_trim.fastq -r SRR12686322_2_trim.fastq -t sanger -o SRR12686322_1_trim_sickle.fastq -p SRR12686322_2_trim_sickle.fastq -s SRR12686322_singles.fastq

Sickle runs in ~10 minutes per sample. In this case, Sickle removed ~1.6% of reads from the dataset (including the 120,714 that were "kept" as singletons).

1.3 SPAdes

SPAdes: assembles genomes and metagenomes (11). It has a --meta flag that allows a wider range of k-mers to be used, which helps assemble multiple genomes that are present in the DNA at different relative abundances.

Example command for T0 dataset:

spades.py -m 2000 --tmp-dir tmpCBW_SRR12686322 -o SRR12686322_spades3 --only-assembler -k 33,55,77,99,127 -t 32 -1 SRR12686322_1_trim_sickle.fastq -2 SRR12686322_2_trim_sickle.fastq

For a dataset this size (3-4 Gbp), with 32 threads, SPAdes ran in 1-3 minutes. A larger metagenome (20+ Gbp) will typically take several hours to days. Our 40 Gbp metagenomes usually take 3-7 days to complete. Assembly run time does not scale linearly with size, as the comparisons made are combinatorial.

Example output: a folder containing scaffolds [scaffolds.fasta], contigs [contigs.fasta], and output files describing the success of the assembly.

1.4 BBMap's stats.sh script

stats.sh: generates summary statistics for a file of nucleotide sequences, including N50 and a breakdown of scaffold lengths (12). It allows a quick check of assembly success. In general, more total length captured in longer scaffolds is better.

Example command for T0:

stats.sh in=SRR12686322_spades3/scaffolds.fasta

One important interpretation of the provided summary is that 106 million base pairs of sequence assembled into scaffolds 2,500 bp or longer. These are the scaffolds we will work with moving forward.
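If you are curious how N50 is derived – it is the scaffold length at which, walking from the longest scaffold down, the running total first reaches half the total assembly length – you can reproduce it from the scaffolds file directly (a sketch, assuming a standard fasta file):

# emit one length per scaffold, sorted longest first
awk '/^>/{if (len) print len; len=0; next} {len += length($0)} END{if (len) print len}' SRR12686322_spades3/scaffolds.fasta | sort -rn > lengths.txt

# walk down the sorted lengths until half of the total assembly size is reached
awk '{tot += $1; lens[NR] = $1} END{run = 0; for (i = 1; i <= NR; i++) {run += lens[i]; if (run >= tot/2) {print "N50: " lens[i]; break}}}' lengths.txt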

1.5 pullseq

pullseq is a fast and handy tool for working with sequence files. You can extract sequences from a list of names or using a regex, or you can filter based on length requirements. It works on fastq and fasta files, and is written in C for speed.

We used pullseq to filter our assemblies down to 2,500 bp scaffolds and longer – this is a reasonable length for binning. If we use much shorter scaffolds, tetranucleotide frequencies become too poorly represented to appropriately match the overall genome composition.

Example command:

pullseq -i SRR12686322_spades3/scaffolds.fasta -m 2500 > T0_scaffolds_2.5K.fasta

2. Binning

We binned the assembled data using three binning algorithms: MetaBAT2 (13), CONCOCT (14), and MaxBin2 (15). We used tetranucleotide frequencies and series abundance as information for all three algorithms. We then merged, dereplicated, and quality-selected bins using DAS Tool (3). See table below for a summary of all bins generated for each sample, by each algorithm.

2.1 Bowtie2

To generate series abundance data, first the reads from each sample must be mapped to each assembled metagenome (2500 bp+ scaffolds only). We used Bowtie2 (16,17). This is an all-by-all mapping, as we want to know the depth of coverage for each scaffold across each sample.

First, we renamed scaffolds to include the sample name (time point) in their fasta header:

sed 's/>/>T0_/g' T0_scaffolds_2.5K.fasta > /CBW/2.5k_scaffolds/T0_scaffolds_2.5K.fasta

This sed command finds each instance of > and replaces it with >T0_, adding the T0 label to every scaffold header; the redirect writes the renamed copy into a new directory. (Note that sed's -i flag would edit the file in place and print nothing, so it is not used with a redirect.)

Then, we built a bowtie index for each metagenome's scaffolds file:

bowtie2-build T0_scaffolds_2.5K.fasta T0_scaffolds_2.5K

For each pair of samples, we mapped the reads from one to the scaffolds from the other using bowtie2. We then converted the output .sam file to a .bam file, sorted it, and indexed it using samtools (18).

Example code for mapping T1 reads to T0 scaffolds:

bowtie2 --threads 64 -x T0_scaffolds_2.5K -1 /CBW/reads/SRR12686321_1_trim_sickle.fastq -2 /CBW/reads/SRR12686321_2_trim_sickle.fastq --no-unal -S T1_to_T0.sam 2> T12T0_bowtie_error.txt

-x is the bowtie index and -1/-2 are the read files. --no-unal discards unaligned reads from the output (saving a huge amount of space, as the file sizes are otherwise enormous), and 2> saves the mapping summary that bowtie2 writes to stderr into the file T12T0_bowtie_error.txt.

Then we converted the .sam file into an easier, ordered and indexed format for subsequent use:

samtools view -bS -@ 64 T1_to_T0.sam > T1_to_T0.bam

samtools sort -@ 70 T1_to_T0.bam -o T1_to_T0.sorted.bam

samtools index -@ 70 T1_to_T0.sorted.bam
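Because this is an all-by-all mapping, the mapping and conversion steps above are repeated for every read set against every assembly – 16 combinations for our four samples. A sketch of the full loop (assuming the index and file naming used above; the associative array requires bash 4+):

declare -A acc=( [T0]=SRR12686322 [T1]=SRR12686321 [T2A]=SRR12686319 [T2B]=SRR12686320 )
for target in T0 T1 T2A T2B; do
    for source in T0 T1 T2A T2B; do
        r=${acc[$source]}
        # map reads from $source against the $target assembly
        bowtie2 --threads 64 -x ${target}_scaffolds_2.5K -1 /CBW/reads/${r}_1_trim_sickle.fastq -2 /CBW/reads/${r}_2_trim_sickle.fastq --no-unal -S ${source}_to_${target}.sam 2> ${source}2${target}_bowtie_error.txt
        # convert, sort, and index for the downstream binning tools
        samtools view -bS -@ 64 ${source}_to_${target}.sam > ${source}_to_${target}.bam
        samtools sort -@ 64 ${source}_to_${target}.bam -o ${source}_to_${target}.sorted.bam
        samtools index -@ 64 ${source}_to_${target}.sorted.bam
    done
done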

The bowtie2 mapping summaries give us a sense of how much read information was incorporated into the assembly. Use less to take a look at how the all-by-all mapping percentages differ in the following files:

less ~/CourseData/Module5/Data/Bowtie_files/T02T0_bowtie_error.txt

less ~/CourseData/Module5/Data/Bowtie_files/T02T2B_bowtie_error.txt

less ~/CourseData/Module5/Data/Bowtie_files/T12T2B_bowtie_error.txt

less ~/CourseData/Module5/Data/Bowtie_files/T2A2T2B_bowtie_error.txt

less ~/CourseData/Module5/Data/Bowtie_files/T2B2T2B_bowtie_error.txt

In this case, as the enrichment progressed, more of the sequence data was successfully assembled (compare T02T0 with T2B2T2B). The very low percentage of reads mapping back to the T0 assembly indicates that sample was not sequenced deeply enough for assembly – ideally you want 40%+ of reads assembling.

In addition, over time, more of the assembled data from the T2B enrichment timepoint is represented in the bioreactor samples, from a very low percentage in the T0 sample (T02T2B) up to quite high in the T2 samples (T2A2T2B, T2B2T2B).

2.2 MetaBAT2

We used MetaBAT2 as our first binning algorithm because it generates abundance depth files that can be reused for MaxBin2.

First we generated the depth information from the Bowtie2 .bam files:

/METABAT/berkeleylab-metabat-d764aa1ece13/jgi_summarize_bam_contig_depths --outputDepth T0_depth.txt CBW/mapping/T0/*.bam

Then we used that information as input to the binning:

/METABAT/berkeleylab-metabat-d764aa1ece13/metabat2 -i 2.5k_scaffolds/T0_scaffolds_2.5K.fasta -a mapping/T0_depth.txt -t 64 -o bins/T0/

MetaBAT2 wrote its output bins as hidden files (e.g., .bin1.fas) because our -o prefix ended in a slash – it's disconcerting, as the output folder will appear empty with a normal ls command. Make the files visible like so:

rename 's/^\.//' .*

2.3 CONCOCT

CONCOCT acts on slightly different data than MaxBin2 or MetaBAT2, in that it requires each scaffold to be shredded into 10 kbp pieces. Each piece is binned independently, but the connections between pieces are used to help inform bin boundaries (as each piece of a scaffold should logically end up in the same bin).

First, we shredded the scaffolds:

cut_up_fasta.py T0_scaffolds_2.5K.fasta -c 10000 -o 0 --merge_last -b T0_contigs_10k.bed > T0_contigs_10K.fa

Then we generated a depth/coverage table, similar to MetaBAT2, but with different formatting:

concoct_coverage_table.py T0_contigs_10k.bed mapping/T0/*bam > T0_concoct_coverage_table.tsv

Then we binned the scaffold fragments:

concoct --composition_file T0_contigs_10K.fa --coverage_file T0_concoct_coverage_table.tsv -b T0 -t 64

Shredded pieces were merged together:

merge_cutup_clustering.py T0_clustering_gt1000.csv > T0_clustering_merged.csv

And finally, bin information was extracted:

extract_fasta_bins.py T0_scaffolds_2.5K.fasta T0_clustering_merged.csv --output_path concoct_T0/

2.4 MaxBin2

To save time, we converted the MetaBAT2 depth files to MaxBin2-formatted abundance files (otherwise MaxBin2 will re-run the read mapping we did using Bowtie2):

awk '{print $1"\t"$4}' T0_depth.txt | sed 's/T0_to_T0.sorted.bam/abundance/g' | sed 's/contigName/contig header/g' > T0_to_T0_abundance

Then we ran MaxBin2:

perl /Maxbin/MaxBin-2.2.6/run_MaxBin.pl -contig T0_scaffolds_2.5K.fasta -out T0_Maxbin2 -abund_list mapping/T0_abundance -thread 64

2.5 DAS Tool

Start the conda environment for DAS_Tool [NB: DAS Tool is not installed on your AWS instance]:

conda activate DAS_tool

Each set of bins needs to be converted into a scaffolds-to-bins table for inclusion in DAS Tool.

MetaBAT2:

Fasta_to_Scaffolds2Bin.sh -i T0_Metabat/ -e fa > DAS_Tool_input/T0/T0_Metabat_bins.tsv

CONCOCT:

Fasta_to_Scaffolds2Bin.sh -i T0_concoct/ -e fa > DAS_Tool_input/T0/T0_concoct_bins.tsv

MaxBin2:

Fasta_to_Scaffolds2Bin.sh -i T0_Maxbin/ -e fasta > DAS_Tool_input/T0/T0_Maxbin_bins.tsv

Once ready, we ran DAS Tool:

DAS_Tool -i DAS_Tool_input/T0/T0_concoct_bins.tsv,DAS_Tool_input/T0/T0_Maxbin_bins.tsv,DAS_Tool_input/T0/T0_Metabat_bins.tsv -l concoct,maxbin2,metabat2 -c T0_scaffolds_2.5K.fasta --threads 64 --write_bins 1 -o DAS_Tool_output/T0/

References

1. Martins PD, de Jong A, Lenstra WK, van Helmond NAGM, Slomp CP, Jetten MSM, Welte CU, Rasigraf O. 2021. Enrichment of novel Verrucomicrobia, Bacteroidetes, and Krumholzibacteria in an oxygen-limited methane- and iron-fed bioreactor inoculated with Bothnian Sea sediments. MicrobiologyOpen 10:e1175.

2. SRA Toolkit Development Team. SRA Toolkit. http://ncbi.github.io/sra-tools/

3. Sieber CMK, Probst AJ, Sharrar A, Thomas BC, Hess M, Tringe SG, Banfield JF. 2018. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat Microbiol 3:836–843.

4. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25:1043–1055.

5. Chaumeil P-A, Mussig AJ, Hugenholtz P, Parks DH. 2020. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36:1925–1927.

6. Shaffer M, Borton MA, McGivern BB, Zayed AA, La Rosa SL, Solden LM, Liu P, Narrowe AB, Rodríguez-Ramos J, Bolduc B, Gazitúa MC, Daly RA, Smith GJ, Vik DR, Pope PB, Sullivan MB, Roux S, Wrighton KC. 2020. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Res 48:8883–8900.

7. Garber AI, Nealson KH, Okamoto A, McAllister SM, Chan CS, Barco RA, Merino N. 2020. FeGenie: a comprehensive tool for the identification of iron genes and iron gene neighborhoods in genome and metagenome assemblies. Front Microbiol 11:37.

8. Lee MD. 2019. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics 35:4162–4164.

9. Olm MR, Brown CT, Brooks B, Banfield JF. 2017. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J 11:2864–2868.

10. Joshi N, Fass J. 2011. Sickle: a sliding-window, adaptive, quality-based trimming tool for FastQ files (version 1.33) [software]. Available at https://github.com/najoshi/sickle.

11. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455–477.

12. Bushnell B. 2014. BBMap.

13. Kang DD, Froula J, Egan R, Wang Z. 2015. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3:e1165.

14. Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C. 2014. Binning metagenomic contigs by coverage and composition. Nat Methods 11:1144–1146.

15. Wu Y-W, Simmons BA, Singer SW. 2016. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32:605–607.

16. Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.

17. Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359.

18. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.
