-
Notifications
You must be signed in to change notification settings - Fork 10
Tutorial: strain and species level clustering of MAGs with skani triangle
In this tutorial we will use skani to visualize strain level community structure in MAGs. We will:
- show how to use skani to compute all-to-all ANIs
- use a script to cluster and visualize the ANI matrix
- (Optional) show that Mash and FastANI have issues on this data set
- skani is installed.
- Python3 is installed and seaborn, numpy, and scipy are installed. To install, simply run
pip install seaborn numpy scipy
. This is needed to visualize the clustering. - (Optional) Mash is installed.
- (Optional) FastANI is installed.
We will be using MAGs from Pasolli et al. (2019), which constructed hundreds of thousands of MAGs from short-read sequencing.
We will work with a specific set of ~200 genomes that have already been binned at the species level. To get these genomes, do
git clone https://github.com/bluenote-1577/skani-test
The genomes are available in skani-test/2328
. These genomes were classified as Alistipes ihumii (also the same set of genomes used in our skani paper; figure 1).
Now we will have to calculate ANI between all genomes. This is equivalent to constructing a pairwise similarity matrix. To do this, simply do
skani triangle skani-test/2328/* -t (threads) > skani_matrix.txt
#skani triangle skani-test/2328/* -t (threads) --robust > robust_skani_matrix.txt
#optional if you have Mash installed
mash triangle skani-test/2328/* -p (threads) > mash_matrix.txt
#optional if you have FastANI installed
ls skani-test/2328/* > genome_list.txt
fastANI --rl genome_list.txt --ql genome_list.txt --matrix -o fastani -t (threads)
with the (threads)
replaced by the number of threads you want to use for computing the ANI similarity matrix.
The output skani_matrix.txt
is a lower-triangular phyllip matrix that we can process. With 20 threads, this took my workstation about ~4 seconds. Mash took about ~1 second. FastANI takes about ~200 seconds for me, so I would skip this computation if you don't have enough computing resources.
The first line uses skani to construct an all-to-all ANI matrix. The second line uses the --robust
option, which calculates the ANI by trimming off the bottom 90th and upper 10th quantile nucleotide identity estimates. This option biases the ANI upwards (e.g. a 99% ANI may appear to be 99.5% for the --robust option), but may be a useful option to play around with. Likewise, the --median
option may be of interest.
We provide a helper script called clustermap_triangle.py for clustering lower-triangular similarity matrices that Mash and skani output. To obtain this script, either clone the skani repository or do
wget https://raw.githubusercontent.com/bluenote-1577/skani/main/scripts/clustermap_triangle.py
Make sure the python libraries outlined in the Requirements are installed, then specify one of the following commands to visualize the clustered ANI matrix
python clustermap_triangle.py skani_matrix.txt 3
#python clustermap_triangle.py robust_skani_matrix.txt 3
#if the Mash ANI matrix was calculated
python clustermap_triangle.py mash_matrix.txt 3
#if the FastANI matrix was calculated
python clustermap_triangle fastani.matrix 3
#
The 3
argument limits the heatmap's minimum ANI to 100 - 3 = 97%. The bar on the top-left of the plot displays 100 - ANI.
To actually obtain clusters by setting, say, a 99% ANI threshold, look into the fcluster function from scipy, which takes a dendrogram (generated by the script) and outputs clusters. The clustermap_triangle.py script can be easily modified to do so, and can be extended for your own analysis.
If you also used Mash or FastANI to generate ANI matrices, you'll see that the clustering for Mash, skani, and FastANI are very different on this data set. The exact distance matrices are also shown in Fig. 1 of our paper. Why are the clusterings different?
Mash's ANI estimation is contingent on genomes being complete. Mash's ANI estimation becomes more biased as the genomes become more incomplete. This fact has been noted by others before, and we also give a brief theoretical discussion in our paper. See the very good discussion by dRep on why Mash has issues with incomplete genomes.
Mash is still faster when doing many comparisons of similar genomes. If you're comparing 40000 e.coli genomes to each other, skani will take a while compared to Mash. Also consider sourmash's --max-containment option to estimate ANI.
It was noted FastANI's original paper that FastANI is most reliable when N50 > 10 kb. This turns out to be an important assumption.
In this data set, the median N50 is 20375 bp. There are plenty of MAGs with N50 < 10,000, and FastANI estimates that these MAGs have lower ANI comparisons than they actually do. On MAGs with higher N50, e.g. long-read assembled MAGs, FastANI seems to perform much more reasonably and is slightly more accurate than skani on reference-quality genomes.
We argue in our paper that for high-ANI computations, ANIm (a mummer based ANI method), available in the pyani package seems to perform the best. This is because mummer is a relatively robust alignment method (for high ANI genomes) and is length agnostic. However, ANIm takes anywhere from 500 - 1000 times longer than skani to run.
We also showed that skani's clustering correlates much better with ANIm's in the paper via cophenetic correlation, quantifying our qualitative observations.