The analysis flow logic is implemented in aligner-asa program. The program requires Bowtie2 and Samtools utilities as well as RDP II database. The software is designed to run under Linux operating system.
The pre-processed MBARC-26 dataset is available in the NCBI SRA archive with the accession number SRX1836716. It can be downloaded using the SRA toolkit
docker pull elolab/asa
Once download completes, installation can be checked:
docker run -it --rm elolab/asa /opt/asa/aligner-asa.py --help
The ASA container can be run with following command. In this example the computer has /data -folder, which is visible also to the container. The output is written to /data/out/ folder, which is required to exist.
docker run -it --rm -v /data:/data elolab/asa /opt/asa/aligner-asa.py \
-i /data/R1.fastq.gz,R2.fastq.gz \
-t /data/pro_rdp_11_5_T_I_L1200_taxonomy.txt \
--db_fasta /data/pro_rdp_11_5_T_I_L1200_seqs.fa \
--top_refs 2 \
--rounds 2 \
--best_match_only \
--candlist_bt2_score_min 0.99 \
--top_refs_file top_refs-example.txt \
--aligner_threads 6 \
--embed_in_n \
-o /data/out/consensus.txt
First, the tool aligns the reads to the RDP database in order to discover the best aligning references. The result can be found in top_refs-example.txt
Next, the tool chooses a number of the best matching references (specified by --top_refs
) and starts iteratively building each sequence. The maximum amount of iterations can be set by the --rounds
parameter
First, the RDP database was split into subsets on family level in order to decrease computational load.
Next, one tree per family was generated using clustalw and fasttree. Finally, the minimum distances between genera were calculated using the phylotool.py
script. It takes a directory of trees and creates a pickled object (min_family_dists.pickle
) containing the cross-genera cutoff values.
The consensus-reference tree is generated using the /opt/asa/create_con_ref_tree.R
script. Modify the paths in this file to point to the correct locations
/opt/asa/phylotool.py
processes the consensus-reference tree and reports the species mapping. Furthermore, a note is made if the mapping is within the cross-genera cutoff. If the cutoff is exceeded, then the mapping is not too confident, since it is possibly mapped to the wrong genera.
ASA can use any fasta file as reference file. However, if there is taxonomy associated with fasta sequences, there are two files: 1) fasta file and 2) related taxonomy file.
Fasta file example:
>0
AGAGTTTGATCATGGCTCAGGACGA
>1
TATTTCGACGGGTTCCG
>2
TAAAGAGCTCG
Corresponding taxonomy file example:
0[TAB]r__Root;d__Bacteria;p__Actinobacteria;c__Actinobacteria;b__Acidimicrobidae;o__Acidimicrobiales;u__Acidimicrobineae;f__Acidimicrobiaceae;g__Acidimicrobium;s__Acidimicrobium_ferrooxidans_(T)_ICP
1[TAB]r__Root;d__Bacteria;p__Actinobacteria;c__Actinobacteria;b__Acidimicrobidae;o__Acidimicrobiales;u__Acidimicrobineae;f__Acidimicrobiaceae;g__Ferrimicrobium;s__Ferrimicrobium_acidiphilum_(T)_T23
2[TAB]r__Root;d__Bacteria;p__Actinobacteria;c__Actinobacteria;b__Acidimicrobidae;o__Acidimicrobiales;u__Acidimicrobineae;f__Acidimicrobiaceae;g__Ferrithrix;s__Ferrithrix_thermotolerans_(T)_Y005
Where [TAB] is the tabulator character.
Download and unzip Bacterial 16S sequences (as Genbank and FASTA) from RDP. The files are: current_Bacteria_unaligned.fa.gz and current_Bacteria_unaligned.gb.gz. Note. RDP II project website seems to be available currently.
cd ..
wget \
https://rdp.cme.msu.edu/download/current_Bacteria_unaligned.gb.gz \
--no-check-certificate
wget \
https://rdp.cme.msu.edu/download/current_Bacteria_unaligned.fa.gz \
--no-check-certificate
Next, convert and filter the data to the format required by ASA:
gunzip current_Bacteria_unaligned.gb.gz current_Bacteria_unaligned.fa.gz
./filter_database.py \
--output_gb rdp_11_5_T_I_L1200.gb \
-f rdp_11_5_T_I_L1200.fasta \
--rdp_gb current_Bacteria_unaligned.gb \
--rdp_fasta current_Bacteria_unaligned.fa \
-t rdp_11_5_T_I_L1200_taxonomy.txt \
-s rdp_11_5_T_I_L1200_seqs.fa \
--filter_min_seqlen 1200 \
--filter_db_typestrain \
--filter_db_isolates
The repository includes a script (import_silva.py) to facilitate the loading of the SILVA (v. 138.1) database (SILVA_138.1_SSURef_NR99_tax_silva.fasta) and the creation of silva_138.1.fasta. This allows for the utilization of the SILVA database for conducting ASA and manually identifying sample taxonomy from phylogenetic trees.