-
Notifications
You must be signed in to change notification settings - Fork 10
Tutorial: classifying entire assemblies (MAGs or contigs) against 85,000 genomes in under 2 minutes
This is a continuation of the tutorial: Tutorial: setting up the GTDB genome database to search against. Please complete that first before doing this one.
Requirements
- skani is installed and in PATH (i.e. typing
skani -h
works). See skani's README for how to install skani. - Tutorial: setting up the GTDB genome database to search against is completed. That is, the GTDB database is indexed and ready to be searched.
What we will do
- see how skani can search contigs and MAGs against the GTDB database in minutes using the
search
command with ANI and AF information - discuss how skani's AF information can give biology beyond just classification, unveiling a conserved 20kb element in an assembly
We already saw an example of this in Tutorial: setting up the GTDB genome database to search against. If you want to classify multiple MAGs from a metagenomic assembly pipeline, run
skani search my_mag_folder/* -d gtdb_skani_database_ani -o results_file.txt -t 10
The search
command can take multiple genomes at once, or even a list of genomes. See skani search -h
.
This will search each MAG in my_mag_folder
against the GTDB database (using 10 threads). Using standard guidelines, if the best match to your MAG has > 95% ANI and > 50% aligned fraction, then your MAG can be assigned to this species.
This workflow is magnitudes faster than GTDB-tk, a classification tool associated with the GTDB. However, GTDB-tk is much more sensitive if your assembled genome has no direct species representative in the database. Furthermore, skani does not put the genome on a tree.
skani works great if you want to classify MAGs that are well-characterized but does not work as well for novel genomes (< 80% ANI). Try GTDB-tk instead for novel genomes and tree placement.
If you have contigs from metagenome assembly, you can search the contigs in the assembly as long as the contigs are long enough.
By default, > 10kb seems okay for species level (> 95% ANI) searching, but "long enough" depends on the exact parameters specified by skani. If you want to classify shorter contigs, see the advanced guide for how to choose parameters.
As an example, I ran metaFlye on a public human gut nanopore sample (accession ERR7625404). The assembly is available here. The following download command worked for me:
wget https://www.dropbox.com/s/j95a220kbqqctci/ERR7625404_assembly.fasta.gz?dl=0
mv 'ERR7625404_assembly.fasta.gz?dl=0' ERR7625404_assembly.fasta.gz
gzip -d ERR7625404_assembly.fasta.gz
To search your contigs, run
skani search --qi ERR7625404_assembly.fasta -d gtdb_skani_database_ani -o results_file.txt -t 10
Use the --qi
option to search individual contigs. The output looks like:
Ref_file Query_file ANI Align_fraction_ref Align_fraction_query Ref_name Query_name
gtdb_genomes_reps_r214/database/GCF/000/172/175/GCF_000172175.1_genomic.fna.gz assembly.fasta 98.20 0.04 94.54 NZ_ABJL02000008.1 Bacteroides intestinalis DSM 17393 B_intestinalis-2.0.1_Cont607, whole genome shotgun sequence contig_10
gtdb_genomes_reps_r214/database/GCF/000/382/445/GCF_000382445.1_genomic.fna.gz assembly.fasta 97.48 1.68 68.69 NZ_KB905472.1 Phocaeicola massiliensis B84634 = Timone 84634 = DSM 17679 = JCM 13223 strain DSM 17679 aczJl-supercont1.1, whole genome shotgun sequence contig_100
gtdb_genomes_reps_r214/database/GCF/013/618/865/GCF_013618865.1_genomic.fna.gz assembly.fasta 93.40 0.77 31.03 NZ_JABSOE010000001.1 Phocaeicola faecicola strain AGMB03916 AGMB03916-1, whole genome shotgun sequence contig_100
gtdb_genomes_reps_r214/database/GCF/021/730/445/GCF_021730445.1_genomic.fna.gz assembly.fasta 92.37 1.52 57.72 NZ_JADNRM010000001.1 Phocaeicola faecalis strain FXJYN30E22 Scaffold1, whole genome shotgun sequence contig_100
gtdb_genomes_reps_r214/database/GCA/902/363/135/GCA_902363135.1_genomic.fna.gz assembly.fasta 95.79 2.12 89.79 CABIZF010000001.1 Lachnospiraceae bacterium isolate MGYG-HGUT-00092 genome assembly, contig: .20287_6_63.1, whole genome shotgun sequence contig_1001
gtdb_genomes_reps_r214/database/GCA/009/929/715/GCA_009929715.1_genomic.fna.gz assembly.fasta 99.11 0.59 17.24 RZZX01000132.1 MAG: Bacteroidia bacterium isolate INTA.CYC.015 contig-100_10062, whole genome shotgun sequence contig_1002
gtdb_genomes_reps_r214/database/GCF/001/314/995/GCF_001314995.1_genomic.fna.gz assembly.fasta 98.43 1.47 81.59 NZ_CP012938.1 Bacteroides ovatus strain ATCC 8483 chromosome, complete genome contig_1002
gtdb_genomes_reps_r214/database/GCA/900/555/635/GCA_900555635.1_genomic.fna.gz assembly.fasta 98.10 0.36 18.28 USLU01000001.1 uncultured Bacteroides sp. isolate UMGS1845 genome assembly, contig: NODE_7_length_223807_cov_5.96235, whole genome shotgun sequence contig_1002
The query_file
column is always the same because we only have one input file. The contig being compared is shown in the last column under Query_name
(you have to scroll to the right in the box above).
Notice the contig_1002 in the above list (last 3 entries). It has
- 99.11 ANI to
RZZX01000132.1 Bacteroidia bacterium isolate
, but only 17.24% of the genome is mapped. - 98.43 ANI to
NZ_CP012938.1 Bacteroides ovatus strain ATCC 8483
, a different Bacteroides genome, but 81.59% of the genome is mapped. - 98.10 ANI to
SLU01000001.1 uncultured Bacteroides sp. isolate UMGS1845 genome assembly
, and only 15.85% of the genome is mapped.
Most people would believe 2. is the correct classification, even though it has lower ANI than 1. This means that you have to use both AF and ANI for contig classification.
I recommend ranking the classifications by a combination of AF and ANI and taking ANI > 95% for species-level classification. You should also consider the contig length: a contig of 20kb could map to many species with > 95% ANI and 100% AF if it is a mobile element.
Here is the full command:
Command being timed: "skani search ERR7625404_assembly.fasta --qi -t 10 -d gtdb_skani_database_ani -o test"
User time (seconds): 404.22
System time (seconds): 15.78
Percent of CPU this job got: 439%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:35.48
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 23910440
This took around about 1 minute and 30 seconds and 24 GB of RAM to search 3700 contigs. Pretty fast! But beware that the classification is limited by contig length and genomic distance, so searching small contigs below the species level is not possible.
In this example, of the ~3700 contigs, 3133 of them had a hit against the GTDB-database (version 214). This will be worse for short-read assemblies and for samples that are not as well characterized (this assembly was human gut, which is very well characterized).
Why did the previous contig_1002 map to at least 3 genomes quite well? This contig is ~ 120kb long, so this is a non-trivial amount of mapping to all 3 genomes. We can guess what's going on here:
- A long shared "conserved" element (mobile? contamination?)
- A misassembly by metaFlye caused a chimeric contig for two Bacteroides organisms
All three of these Bacteroides genomes have < 75% ANI to each other (as given by BLAST-based ANI). Without further investigation, 1. seems likely as 17.24% and 15.85% are suspiciously similar -- perhaps the same region is being aligned to both genomes?
To further investigate, I used minimap2 to map contig_1002
against the RZZX01000132.1 Bacteroidia bacterium isolate
and SLU01000001.1 uncultured Bacteroides sp. isolate UMGS1845 genome assembly
assemblies (you can try this out yourself as well) giving the following:
#RZZX01000132.1 Bacteroidia bacterium isolate mapping
RZZX01000036.1 23859 158 20210 + contig_1002 116935 69388 89426 18399 20070 60 tp:A:P cm:i:3146 s1:i:18394 s2:i:0 dv:f:0.0000 rl:i:0
...
#LU01000001.1 uncultured Bacteroides sp. isolate UMGS1845 genome assembly mapping
...
...
USLU01000094.1 17802 33 17778 - contig_1002 116935 72705 91792 15555 19184 60 tp:A:P cm:i:2568 s1:i:15364 s2:i:0 dv:f:0.0001 rl:i:0
USLU01000175.1 3861 10 3726 - contig_1002 116935 68978 72693 3160 3720 60 tp:A:P cm:i:504 s1:i:3160 s2:i:0 dv:f:0.0001 rl:i:0
Notice the 69kb - 90kb mappings in the 7th and 8th column for both mappings. The cm:i flag gives a indication of alignment length and goodness (i.e. how many minimizer k-mers are on minimap2's chain; don't worry about this if you're unfamiliar with minimap2).
As expected, bases 69kb - 90ish kb on the contig mapped to both genomes very well, but had little sequence homology outside these two bases. It appears that bases 69kb - 89kb on the contig is some sort of conserved element.
I blasted the 69kb - 90kb region of this contig, and it had 100% aligned fraction and 99% sequence identity to a 400 kb phage -- it looks like we found some sort of mobile viral element.
Key takeaway: Because skani is an approximate mapper, it can give some information about sequence homology and can unveil "conserved elements".
You can try looking at the results file from this assembly; it is littered with examples of contigs mapping to multiple different genera with > 95% ANI:
gtdb_genomes_reps_r214/database/GCA/900/760/455/GCA_900760455.1_genomic.fna.gz assembly.fasta 99.72 0.38 23.12 CAAEZC010000001.1 TPA_asm: uncultured Paraprevotella sp., isolate HGM04805, WGS project CAAEZC010000000 data, contig: SRS016095_8_k99_143, whole genome shotgun sequencecontig_103
gtdb_genomes_reps_r214/database/GCA/025/147/325/GCA_025147325.1_genomic.fna.gz assembly.fasta 99.72 0.29 17.43 CP102262.1 Bacteroides stercoris ATCC 43183 chromosome, complete genome contig_103
gtdb_genomes_reps_r214/database/GCA/025/151/215/GCA_025151215.1_genomic.fna.gz assembly.fasta 99.28 0.36 24.07 CP102286.1 Parabacteroides merdae ATCC 43184 chromosome, complete genome contig_103
gtdb_genomes_reps_r214/database/GCA/025/147/485/GCA_025147485.1_genomic.fna.gz assembly.fasta 99.15 0.28 19.27 CP102263.1 Bacteroides uniformis strain ATCC 8492 chromosome, complete genome contig_103
gtdb_genomes_reps_r214/database/GCF/000/382/445/GCF_000382445.1_genomic.fna.gz assembly.fasta 98.75 0.25 17.60 NZ_KB905472.1 Phocaeicola massiliensis B84634 = Timone 84634 = DSM 17679 = JCM 13223 strain DSM 17679 aczJl-supercont1.1, whole genome shotgun sequence contig_103
gtdb_genomes_reps_r214/database/GCF/000/012/845/GCF_000012845.1_genomic.fna.gz assembly.fasta 97.40 0.27 19.26 NC_009615.1 Parabacteroides distasonis ATCC 8503, complete sequence contig_103
gtdb_genomes_reps_r214/database/GCF/900/142/015/GCF_900142015.1_genomic.fna.gz assembly.fasta 97.08 0.18 16.68 NZ_FQZN01000089.1 Bacteroides stercorirosoris strain DSM 26884, whole genome shotgun sequence contig_103
gtdb_genomes_reps_r214/database/GCF/000/614/125/GCF_000614125.1_genomic.fna.gz assembly.fasta 97.08 0.23 16.68 NZ_BAKJ01000148.1 Bacteroides rodentium JCM 16496, whole genome shotgun sequence contig_103
gtdb_genomes_reps_r214/database/GCF/000/513/195/GCF_000513195.1_genomic.fna.gz assembly.fasta 95.70 0.18 18.93 NZ_CBVI010000001.1 Bacteroides timonensis isolate AP1, whole genome shotgun sequence contig_103
gtdb_genomes_reps_r214/database/GCF/000/614/185/GCF_000614185.1_genomic.fna.gz assembly.fasta 89.42 0.20 15.84 NZ_BAKM01000160.1 Phocaeicola sartorii JCM 17136 = DSM 21941 strain JCM 17136, whole genome shotgun sequence contig_103
gtdb_genomes_reps_r214/database/GCF/000/012/825/GCF_000012825.1_genomic.fna.gz assembly.fasta 89.18 0.43 33.54 NC_009614.1 Phocaeicola vulgatus ATCC 8482, complete sequence contig_103
All of the above results are for one contig: contig_103. skani thus also has an interesting ability to find highly conserved elements very efficiently.