-
Notifications
You must be signed in to change notification settings - Fork 7
CZID mNGS Pipeline Stage #2: Alignment and Taxonomic Aggregation
After host filtration, a series of pipeline steps are performed to assign taxonomic IDs to non-host reads. This process involves three main steps.
- First, an initial short-read alignment is done against both the NCBI NT and NR databases. From these alignments, putative accessions are identified and used to construct a BLAST database.
- Then, to improve sensitivity, an assembly-based alignment procedure is performed. Specifically, short reads are assembled into longer contigs that are then BLASTed against the database of putative accessions.
- Finally, accessions are mapped to taxonomic IDs using the accession2taxid database. The taxonomic lineage is computed for each read to enable aggregation of taxonomic counts at both the species and genus level.
The short-read alignment procedure uses GSNAP to align to the NCBI NT database and Rapsearch2 to align to the NCBI NR database. These initial alignments are then used to identify putative accessions from which a BLAST database is generated. The constructed BLAST database is used downstream to refine taxonomic assignment through alignment of assembled contigs.
- Gsnap alignment to NCBI NT
- Rapsearch2 alignment to NCBI NR
- Generation of BLAST database
Input:
- gsnap_filter_1.fa
- gsnap_filter_2.fa
Process:
gsnapl
-A m8
--batch=0
--use-shared-memory=0
--gmap-mode=none
--npaths=100
--ordered
-t 36
--max-mismatches=40
-D {remote_index_dir}
-d nt_k16
{remote_input_files} > {multihit_remote_outfile}
GSNAP documentation is available here.
Output:
- gsnap.m8: the raw .m8 file containing all alignment results
- gsnap.deduped.m8: the .m8 file after selecting one alignment for each read based on the accession with the lowest e-value. If there are multiple accessions with equally good alignments, then one will be selected randomly.
- gsnap.hitsummary.tab: tab-delimited file containing information on taxonomic identity of each read. note: this short-read alignment taxonomic ID is not necessarily the taxonomic ID that will be assigned to the read after assembly-based alignment and refinement.
- gsnap_counts.json: this contains one .json entry per taxonomic ID identified by GSNAP.
Rapsearch2 Alignment to NCBI NR:
Input: gsnap_filter_merged.fa
Process:
rapsearch
-d {remote_index_dir}/nr_rapsearch
-e -6
-l 10
-a T
-b 0
-v 50
-z 24
-q {remote_input_files}
-o {multihit_remote_outfile}
Rapsearch2 documentation is available here.
Output:
- rapsearch2.m8: the raw .m8 file containing all alignment results
- rapsearch2.deduped.m8: the .m8 file after selecting one alignment for each read based on the accession with the lowest e-value. If there are multiple accessions with equally good alignments, then one will be selected randomly.
- rapsearch2.hitsummary.tab: tab-delimited file containing information on taxonomic identity of each read. note: this short-read alignment taxonomic ID is not necessarily the taxonomic ID that will be assigned to the read after assembly-based alignment and refinement.
- rapsearch2_counts.json: this contains one .json entry per taxonomic ID identified by rapsearch2.
From GSNAP, we know the best match from the NT db, for each read. All sequences that have accessions in the list of NT matches are downloaded. A BLAST database is generated from the resulting .fa file via the following command:
makeblastdb
-in assembly/nt.refseq.fasta
-dbtype nt
-out {blast_index_path}
Similarly, from Rapsearch2, we know the best match from the NR db, for each read. All sequences that have accessions in the list of NR matches are downloaded. A BLAST database is generated from the resulting .fa file via the following command:
makeblastdb
-in assembly/nr.refseq.fasta
-dbtype nr
-out {blast_index_path}
Due to redundancy and genetic similarity between related micro-organisms, identifying unique alignments and best matches for short reads is one outstanding challenge for mNGS. To improve specificity of taxon calls, reads are assembled into contigs using de novo assembly and then aligned via BLAST to a database of putative accessions identified in the short read alignment step.
Reasoning: To obtain longer contigs for improved sensitivity in mapping, short reads must be de novo assembled using SPADES. The SPADES output loses the information about which contig each individual read belongs to. Therefore, we use bowtie2 to map the original reads onto their assembled contigs.
Input:
- gsnap_filter_1.fa
- gsnap_filter_2.fa
- gsnap_filter_merged.fa
Process: First, the short reads are assembled into contigs using SPADES.
spades.py
-1 {input_fasta}
-2 {input_fasta2}
-o {assembled_dir}
-m {memory}
-t 32
—only-assembler
SPADES documentation can be found here
The single-read identity of reads merged into each contig are lost by SPADES. To recover this information and identify which contig each read belongs to, the contigs are then used to build a Bowtie2 database:
bowtie2-build {assembled_contig} {bowtie_index_path}
Finally, the original reads are mapped back to their assembled contigs:
bowtie2
-x {bowtie_index_path}
-f
-U {fasta_file}
--very-sensitive
-p 32 > {output_bowtie_sam}
Output:
- assembly/contigs.fasta: - the assembled contigs output by SPADES
- assembly/scaffolds.fasta: the resulting scaffolds output by SPADES
- assembly/read-contig.sam: the mapping of the original reads to the contigs
- assembly/contig_stats.json: .json file containing one entry per contig which lists the number of reads that align to that contig
The BLAST step is run independently for the contigs first against the NT-BLAST database constructed from putative taxa identified from short read alignments to NCBI NT using GSNAP. Then, against the NR-BLAST database constructed from putative taxa identified short read alignments to NCBI NR with Rapsearch2.
Input: The input for the BlastContigs module script is the .fa file, but this particular step just requires the pre-made BLAST db built from the .fa files.
Process:
blast_command
-query {assembled_contig}
-db {blast_index_path}
-out {blast_m8}
-outfmt 6
-num_alignments 5
-num_threads 32
Output:
- assembly/gsnap.blast.m8: the contigs mapping to accession IDs, one mapping per contig
- assembly/gsnap.reassigned.m8: the reads with accessions updated to match those in contig blast results. note: There is potential for this file to be confusing. In particular, the query sequence (first column) here is listed as a read ID, but all the other columns (length, query start, percent identity, etc.) refer to the results obtained from the contig blast alignment. These values were pulled out of gsnap.blast.m8, where the contig was the query, and appended to each read that mapped to that respective contig.
- assembly/gsnap.hitsummary2.tab: the summary of accessions from each step (single read alignment, contig alignment, and final alignment). See format specifics below, under Final Read Assignment.
- assembly/refined_gsnap_counts.json: contains one .json entry per taxonomic ID identified by GSNAP. Each .json entry contains the following fields:
- tax_id
- tax_level
- genus_taxid
- family_taxid
- count
- percent_identity (average over all reads that aligned to this tax_id)
- alignment_length (average over all reads that aligned to this tax_id)
- e_value (average over all reads that aligned to this tax_id)
- count_type - NT for all entries in gsnap_counts.json
- assembly/gsnap_contig_summary.json: contains one .json entry per taxonomic ID identified by GSNAP, after re-assignment of reads to their respective taxons based on assembly blast results. Within each taxonomic ID entry, there are entries for each contig that aligned to that taxonomic ID along with the read coverage of that contig.
[the same file formats are generated for contig alignment to NR]
- assembly/rapsearch2.blast.m8:
- assembly/rapsearch2.reassigned.m8:
- assembly/rapsearch2.hitsummary2.tab:
- assembly/refined_rapsearch2_counts.json:
- assembly/rapsearch2_contig_summary.json:
After alignment, CZ ID uses the NCBI accession2taxid database -is used- to map accessions to taxonomic IDs. The full database contains billions of entries, but only ~15% of those are found in either NR or NT databases. Therefore, the full NCBI accession2taxid database is subset to include only the relevant entries and then used to map accessions to taxonomic IDs. Finally, the taxonomic lineage for each read is computed using the ncbitax2lin (https://github.com/chanzuckerberg/ncbitax2lin) script. For each taxonomic ID this results in the following: taxid → [superkingdom, phylum, class, order ..., species].
Final Read Assignment For any given read, the tax ID is assigned by the following steps:
- If the read was assembled into a contig and the contig maps to an accession via blast, then the read is assigned the taxID of its respective contig.
- If the read was not assembled into a contig, then it is assigned to the taxID identified by short read alignment (NT with GSNAP and NR with Rapsearch2).
The progression of read assignment (from initial short read TaxID to refined assembly-based TaxID) can be identified in the hitsummary2 files (gsnap.hitsummary2.tab and rapsearch2.hitsummary2.tab)
NB501961:211:HGWKCBGX9:1:11205:5935:16320/1 1 155900 GQ881617.1 155900 -200 -300 NODE_1497_length_451_cov_1335.782828 GQ881617.1 155900.0 -200.0 -300.0
The .tab file contains 12 columns:
- Read ID
- Taxonomy level (all -1)
- Final TaxID assignment
- Initial (single short-read) GenBank alignment
- TaxID (species), from single alignment - obtained from the accession2taxid mapping after GSNAP or Rapsearch2. *note: *the pipeline outputs species-level counts as well as genus-level counts. For rows corresponding to genus-level counts (tax_level = 2), the species taxID is listed as “-100”.
- TaxID (genus), from single alignment - obtained by walking the phylogenetic tree backwards. If there is no genus-level classification, this value will be “-200”.
- TaxID (family), from single alignment - obtained by walking the phylogenetic tree backwards. If there is no family-level classification, this value will be “-300”.
- Assembled contig that this read maps to
- The GenBank ID that the contig mapped to
- TaxID (species), from assembled contig - obtained from the accession2taxid mapping after BLAST
- TaxID (genus), from assembled contig
- TaxID (family), from assembled contig
- “from_assembly” - this flag indicates whether this read was mapped ONLY through assembly. If this flag is present, then fields (4) - (7) are duplicates of the assembly-based TaxID call, and are not single short-read alignments.
note: columns 3 and 10 of the hitsummary2.xxx.tab files should always be identical.
Reference database generation
CZ ID uses the curated NCBI datasets to build indexes and lookup tables to enable metagenomic analysis as part of the czid pipeline. The following datasets are downloaded periodically from NCBI and a one-click index generation process is setup for periodical reference database updates.
- nt.gz (NCBI curated nucleotide database): we use gsnap (http://research-pub.gene.com/gmap/) to build/search the index from nt.
- nr.gz (NCBI curated protein database): we use rapsearch2 to build/search the index from nr.
- accession2taxid: accession(sequence identifier as shown in nt/nr) to taxon id mapping. We build the mapping into a berkeley db table through python shelve. (We are working on switching to sqlite3)
- taxdump.tar.gz: taxonomy information including the whole NCBI taxonomy information. We build the taxon to lineage mapping into a berkeley db table similar to above.
Generate NCBI NT Database:
gmap_build
-d nt_k16
-k 16
$${REF_DIR}/nt
Generate NCBI NR Database:
prerapsearch
-d $${REF_DIR}/nr
-n nr_rapsearch