Tutorial: classifying entire assemblies (MAGs or contigs) against 85,000 genomes in under 2 minutes

This is a continuation of the tutorial: Tutorial: setting up the GTDB genome database to search against. Please complete that first before doing this one.

Requirements

skani is installed and in PATH (i.e. typing skani -h works). See skani's README for how to install skani.
Tutorial: setting up the GTDB genome database to search against is completed. That is, the GTDB database is indexed and ready to be searched.

What we will do

see how skani can search contigs and MAGs against the GTDB database in minutes using the search command with ANI and AF information
discuss how skani's AF information can give biology beyond just classification, unveiling a conserved 20kb element in an assembly

Use case 1: MAG classification

We already saw an example of this in Tutorial: setting up the GTDB genome database to search against. If you want to classify multiple MAGs from a metagenomic assembly pipeline, run

skani search my_mag_folder/* -d gtdb_skani_database_ani -o results_file.txt -t 10

The search command can take multiple genomes at once, or even a list of genomes. See skani search -h.

This will search each MAG in my_mag_folder against the GTDB database (using 10 threads). Using standard guidelines, if the best match to your MAG has > 95% ANI and > 50% aligned fraction, then your MAG can be assigned to this species.

This workflow is magnitudes faster than GTDB-tk, a classification tool associated with the GTDB. However, GTDB-tk is much more sensitive if your assembled genome has no direct species representative in the database. Furthermore, skani does not put the genome on a tree.

Summary

skani works great if you want to classify MAGs that are well-characterized but does not work as well for novel genomes (< 80% ANI). Try GTDB-tk instead for novel genomes and tree placement.

Use case 2: metagenomic contig searching

If you have contigs from metagenome assembly, you can search the contigs in the assembly as long as the contigs are long enough.

By default, > 10kb seems okay for species level (> 95% ANI) searching, but "long enough" depends on the exact parameters specified by skani. If you want to classify shorter contigs, see the advanced guide for how to choose parameters.

Comparing contigs against a large database

As an example, I ran metaFlye on a public human gut nanopore sample (accession ERR7625404). The assembly is available here. The following download command worked for me:

wget https://www.dropbox.com/s/j95a220kbqqctci/ERR7625404_assembly.fasta.gz?dl=0
mv 'ERR7625404_assembly.fasta.gz?dl=0' ERR7625404_assembly.fasta.gz
gzip -d ERR7625404_assembly.fasta.gz

To search your contigs, run

skani search --qi ERR7625404_assembly.fasta -d gtdb_skani_database_ani -o results_file.txt -t 10

Use the --qi option to search individual contigs. The output looks like:

Ref_file       Query_file      ANI     Align_fraction_ref      Align_fraction_query    Ref_name        Query_name
gtdb_genomes_reps_r214/database/GCF/000/172/175/GCF_000172175.1_genomic.fna.gz  assembly.fasta     98.20   0.04    94.54   NZ_ABJL02000008.1 Bacteroides intestinalis DSM 17393 B_intestinalis-2.0.1_Cont607, whole genome shotgun sequence        contig_10
gtdb_genomes_reps_r214/database/GCF/000/382/445/GCF_000382445.1_genomic.fna.gz  assembly.fasta     97.48   1.68    68.69   NZ_KB905472.1 Phocaeicola massiliensis B84634 = Timone 84634 = DSM 17679 = JCM 13223 strain DSM 17679 aczJl-supercont1.1, whole genome shotgun sequence contig_100
gtdb_genomes_reps_r214/database/GCF/013/618/865/GCF_013618865.1_genomic.fna.gz  assembly.fasta     93.40   0.77    31.03   NZ_JABSOE010000001.1 Phocaeicola faecicola strain AGMB03916 AGMB03916-1, whole genome shotgun sequence  contig_100
gtdb_genomes_reps_r214/database/GCF/021/730/445/GCF_021730445.1_genomic.fna.gz  assembly.fasta     92.37   1.52    57.72   NZ_JADNRM010000001.1 Phocaeicola faecalis strain FXJYN30E22 Scaffold1, whole genome shotgun sequence    contig_100
gtdb_genomes_reps_r214/database/GCA/902/363/135/GCA_902363135.1_genomic.fna.gz  assembly.fasta     95.79   2.12    89.79   CABIZF010000001.1 Lachnospiraceae bacterium isolate MGYG-HGUT-00092 genome assembly, contig: .20287_6_63.1, whole genome shotgun sequence       contig_1001
gtdb_genomes_reps_r214/database/GCA/009/929/715/GCA_009929715.1_genomic.fna.gz  assembly.fasta     99.11   0.59    17.24   RZZX01000132.1 MAG: Bacteroidia bacterium isolate INTA.CYC.015 contig-100_10062, whole genome shotgun sequence  contig_1002
gtdb_genomes_reps_r214/database/GCF/001/314/995/GCF_001314995.1_genomic.fna.gz  assembly.fasta     98.43   1.47    81.59   NZ_CP012938.1 Bacteroides ovatus strain ATCC 8483 chromosome, complete genome   contig_1002
gtdb_genomes_reps_r214/database/GCA/900/555/635/GCA_900555635.1_genomic.fna.gz  assembly.fasta     98.10   0.36    18.28   USLU01000001.1 uncultured Bacteroides sp. isolate UMGS1845 genome assembly, contig: NODE_7_length_223807_cov_5.96235, whole genome shotgun sequence     contig_1002

The query_file column is always the same because we only have one input file. The contig being compared is shown in the last column under Query_name (you have to scroll to the right in the box above).

Notice the contig_1002 in the above list (last 3 entries). It has

99.11 ANI to RZZX01000132.1 Bacteroidia bacterium isolate, but only 17.24% of the genome is mapped.
98.43 ANI to NZ_CP012938.1 Bacteroides ovatus strain ATCC 8483, a different Bacteroides genome, but 81.59% of the genome is mapped.
98.10 ANI to SLU01000001.1 uncultured Bacteroides sp. isolate UMGS1845 genome assembly, and only 15.85% of the genome is mapped.

Most people would believe 2. is the correct classification, even though it has lower ANI than 1. This means that you have to use both AF and ANI for contig classification.

Summary

I recommend ranking the classifications by a combination of AF and ANI and taking ANI > 95% for species-level classification. You should also consider the contig length: a contig of 20kb could map to many species with > 95% ANI and 100% AF if it is a mobile element.

Runtime

Here is the full command:

Command being timed: "skani search ERR7625404_assembly.fasta --qi -t 10 -d gtdb_skani_database_ani -o test"
	User time (seconds): 404.22
	System time (seconds): 15.78
	Percent of CPU this job got: 439%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:35.48
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 23910440

This took around about 1 minute and 30 seconds and 24 GB of RAM to search 3700 contigs. Pretty fast! But beware that the classification is limited by contig length and genomic distance, so searching small contigs below the species level is not possible.

In this example, of the ~3700 contigs, 3133 of them had a hit against the GTDB-database (version 214). This will be worse for short-read assemblies and for samples that are not as well characterized (this assembly was human gut, which is very well characterized).

Aside: "Conserved" region given by aligned fraction information for contig_1002

Why did the previous contig_1002 map to at least 3 genomes quite well? This contig is ~ 120kb long, so this is a non-trivial amount of mapping to all 3 genomes. We can guess what's going on here:

A long shared "conserved" element (mobile? contamination?)
A misassembly by metaFlye caused a chimeric contig for two Bacteroides organisms

All three of these Bacteroides genomes have < 75% ANI to each other (as given by BLAST-based ANI). Without further investigation, 1. seems likely as 17.24% and 15.85% are suspiciously similar -- perhaps the same region is being aligned to both genomes?

To further investigate, I used minimap2 to map contig_1002 against the RZZX01000132.1 Bacteroidia bacterium isolate and SLU01000001.1 uncultured Bacteroides sp. isolate UMGS1845 genome assembly assemblies (you can try this out yourself as well) giving the following:

#RZZX01000132.1 Bacteroidia bacterium isolate mapping
RZZX01000036.1	23859	158	20210	+	contig_1002	116935	69388	89426	18399	20070	60	tp:A:P	cm:i:3146	s1:i:18394	s2:i:0	dv:f:0.0000	rl:i:0
...

#LU01000001.1 uncultured Bacteroides sp. isolate UMGS1845 genome assembly mapping
...
...
USLU01000094.1	17802	33	17778	-	contig_1002	116935	72705	91792	15555	19184	60	tp:A:P	cm:i:2568	s1:i:15364	s2:i:0	dv:f:0.0001	rl:i:0
USLU01000175.1	3861	10	3726	-	contig_1002	116935	68978	72693	3160	3720	60	tp:A:P	cm:i:504	s1:i:3160	s2:i:0	dv:f:0.0001	rl:i:0

Notice the 69kb - 90kb mappings in the 7th and 8th column for both mappings. The cm:i flag gives a indication of alignment length and goodness (i.e. how many minimizer k-mers are on minimap2's chain; don't worry about this if you're unfamiliar with minimap2).

As expected, bases 69kb - 90ish kb on the contig mapped to both genomes very well, but had little sequence homology outside these two bases. It appears that bases 69kb - 89kb on the contig is some sort of conserved element.

Spoiler: what was this element?

I blasted the 69kb - 90kb region of this contig, and it had 100% aligned fraction and 99% sequence identity to a 400 kb phage -- it looks like we found some sort of mobile viral element.

Key takeaway: Because skani is an approximate mapper, it can give some information about sequence homology and can unveil "conserved elements".

You can try looking at the results file from this assembly; it is littered with examples of contigs mapping to multiple different genera with > 95% ANI:

gtdb_genomes_reps_r214/database/GCA/900/760/455/GCA_900760455.1_genomic.fna.gz	assembly.fasta	99.72	0.38	23.12	CAAEZC010000001.1 TPA_asm: uncultured Paraprevotella sp., isolate HGM04805, WGS project CAAEZC010000000 data, contig: SRS016095_8_k99_143, whole genome shotgun sequencecontig_103
gtdb_genomes_reps_r214/database/GCA/025/147/325/GCA_025147325.1_genomic.fna.gz	assembly.fasta	99.72	0.29	17.43	CP102262.1 Bacteroides stercoris ATCC 43183 chromosome, complete genome	contig_103
gtdb_genomes_reps_r214/database/GCA/025/151/215/GCA_025151215.1_genomic.fna.gz	assembly.fasta	99.28	0.36	24.07	CP102286.1 Parabacteroides merdae ATCC 43184 chromosome, complete genome	contig_103
gtdb_genomes_reps_r214/database/GCA/025/147/485/GCA_025147485.1_genomic.fna.gz	assembly.fasta	99.15	0.28	19.27	CP102263.1 Bacteroides uniformis strain ATCC 8492 chromosome, complete genome	contig_103
gtdb_genomes_reps_r214/database/GCF/000/382/445/GCF_000382445.1_genomic.fna.gz	assembly.fasta	98.75	0.25	17.60	NZ_KB905472.1 Phocaeicola massiliensis B84634 = Timone 84634 = DSM 17679 = JCM 13223 strain DSM 17679 aczJl-supercont1.1, whole genome shotgun sequence	contig_103
gtdb_genomes_reps_r214/database/GCF/000/012/845/GCF_000012845.1_genomic.fna.gz	assembly.fasta	97.40	0.27	19.26	NC_009615.1 Parabacteroides distasonis ATCC 8503, complete sequence	contig_103
gtdb_genomes_reps_r214/database/GCF/900/142/015/GCF_900142015.1_genomic.fna.gz	assembly.fasta	97.08	0.18	16.68	NZ_FQZN01000089.1 Bacteroides stercorirosoris strain DSM 26884, whole genome shotgun sequence	contig_103
gtdb_genomes_reps_r214/database/GCF/000/614/125/GCF_000614125.1_genomic.fna.gz	assembly.fasta	97.08	0.23	16.68	NZ_BAKJ01000148.1 Bacteroides rodentium JCM 16496, whole genome shotgun sequence	contig_103
gtdb_genomes_reps_r214/database/GCF/000/513/195/GCF_000513195.1_genomic.fna.gz	assembly.fasta	95.70	0.18	18.93	NZ_CBVI010000001.1 Bacteroides timonensis isolate AP1, whole genome shotgun sequence	contig_103
gtdb_genomes_reps_r214/database/GCF/000/614/185/GCF_000614185.1_genomic.fna.gz	assembly.fasta	89.42	0.20	15.84	NZ_BAKM01000160.1 Phocaeicola sartorii JCM 17136 = DSM 21941 strain JCM 17136, whole genome shotgun sequence	contig_103
gtdb_genomes_reps_r214/database/GCF/000/012/825/GCF_000012825.1_genomic.fna.gz	assembly.fasta	89.18	0.43	33.54	NC_009614.1 Phocaeicola vulgatus ATCC 8482, complete sequence	contig_103

All of the above results are for one contig: contig_103. skani thus also has an interesting ability to find highly conserved elements very efficiently.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial: classifying entire assemblies (MAGs or contigs) against 85,000 genomes in under 2 minutes

This is a continuation of the tutorial: Tutorial: setting up the GTDB genome database to search against. Please complete that first before doing this one.

Use case 1: MAG classification

Summary

Use case 2: metagenomic contig searching

Comparing contigs against a large database

Summary

Runtime

Aside: "Conserved" region given by aligned fraction information for contig_1002

Spoiler: what was this element?

Clone this wiki locally