skani advanced usage guide
NOTE: since v0.1.0, skani now has a debiasing step, which makes ANI estimates more accurate within the ranges of ~92-98% ANI for prokaryotic MAGs. See "ANI debiasing by trained regression" below.
We found that skani's default ANI estimation can be slightly biased upwards in certain ranges and for eukaryotic MAGs. See the extended figures in our paper for more insight on skani's ANI estimate as a function of true ANI.
- Lowering `-c` sometimes gives a more accurate ANI estimate, but not always. In fact, there are cases where it gives a worse estimate.
- However, we found that lowering the `-c` parameter almost always makes the AF calculation more accurate, so consider this if AF is something you care about. For default c = 125, AF gets less accurate as ANI decreases; you should probably lower `-c` for ANI < 95% to get more accurate AFs.

We found that `--slow`, which is an alias for `-c 30`, works relatively well for AF for very fragmented and more distant genomes. It actually has slightly worse accuracy than with defaults and will make skani approximately 4 times slower, though.
It is important to remember that skani is still a noisier algorithm than an actual base-level alignment algorithm, so if you want the most precise and accurate measurements, consider using a mummer based method like ANIm.
If you want skani to run faster, the main parameter to adjust is the `-c` parameter. skani's runtime and memory usage are roughly inversely proportional to c, so doubling c makes skani about 2x faster and reduces memory usage. The default is c = 125.
For nice genomes of ANI > 95% (> 10kb N50, not too fragmented), c can be comfortably made higher up to 200 or even 300. ANI will get slightly less accurate (and biased upwards), but aligned fraction may be much less accurate.
In skani v0.1.2, see `skani dist -h` for information on the preset values of `--slow`, `--medium`, and `--fast`, corresponding to c = 30, 70, and 200.
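As a rough worked example of the scaling described above, the sketch below estimates runtime relative to the default c = 125, assuming runtime scales exactly as 1/c. The helper name `relative_runtime` is hypothetical and not part of skani; real timings will depend on your data.

```shell
# Hypothetical helper (not part of skani): runtime relative to the default
# c = 125, under the assumption that runtime is inversely proportional to c.
relative_runtime() {
  awk -v c="$1" 'BEGIN { printf "%.2f\n", 125 / c }'
}

relative_runtime 30    # --slow: roughly 4x the default runtime
relative_runtime 70    # --medium
relative_runtime 200   # --fast: less than the default runtime
```

The value for c = 30 agrees with the "approximately 4 times slower" figure quoted for `--slow`.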
Since v0.1.0, skani outputs a more accurate ANI by debiasing an initial ANI estimate using a trained regression model. This model is trained on bacterial MAGs, but it seems to work quite well on even complete genomes, and eukaryotes as well.
The debiasing step is turned on when there are > 150,000 bases mapped between the genomes and c >= 70. In particular, the default parameters enable the debiasing step. We've found that this step is helpful most of the time, but turning off ANI debiasing with the `--no-learned-ani` option may be beneficial in edge cases.
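The two conditions can be restated as a small check. This is only a sketch of the rule as described here, not skani's actual code; `debiasing_applies` is a hypothetical helper name.

```shell
# Hypothetical helper restating the rule above; not skani's implementation.
# $1 = number of bases mapped between the two genomes
# $2 = value of -c
debiasing_applies() {
  if [ "$1" -gt 150000 ] && [ "$2" -ge 70 ]; then
    echo yes
  else
    echo no
  fi
}

debiasing_applies 2000000 125   # default c with plenty of mapped bases: yes
debiasing_applies 2000000 30    # --slow, c is below 70: no
debiasing_applies 100000 125    # too few mapped bases: no
```

Note that under this rule `--slow` (c = 30) never uses the debiasing step, while `--medium` (c = 70) and the defaults can.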
Let us know if you find that the debiasing procedure is giving weird results, as this feature is still quite new.
IMPORTANT: skani has only been briefly tested with long-reads. If you try using skani for long-reads, I would be very interested in hearing about the issues you face or how your results are.
skani is not necessarily designed for comparing long-reads or very small contigs, but it seems to work relatively well for ANI when the reads/contigs are long enough (> 3kb at least).
- skani cannot classify short-reads. Use a taxonomic classifier such as kraken for this.
- skani cannot compare collections of short-reads. Use Mash or sourmash for this.
- skani is much less memory-efficient when querying many small sequences.
For small contigs or long-reads, here are some suggestions:
- Make sure to use the `--qi` option for `skani search` or `skani dist` if your contigs/reads are all in one file.
- `skani dist` will be much faster than `skani search` if you are using long-reads, since the bottleneck will be loading genomes into memory.
- If memory is a bottleneck, try `skani search --keep-refs` for a memory-efficient compromise; this keeps loaded genomes that pass the filter in memory so you don't have to reload the same genomes over again.
For parameters:
- The default marker size `-m` is set to 1000, so we take one marker per every 1000 k-mers. A good rule of thumb is that you want at least 20 markers on average, so set `-m` < avg_read_length / 20.
- Consider setting `-c` to lower values, e.g. 60, for less accurate reads. The longer and higher-identity the reads, the higher `-c` can be.
- skani currently loads each query file entirely into memory instead of processing one read at a time. Consider splitting large sets of reads.
- Setting `-s` to higher values, e.g. `-s 92`, is strongly recommended if using the `--keep-refs` option; otherwise you will get many spurious matches.
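The rule of thumb for `-m` can be computed directly from a FASTA file. The sketch below is one way to do it with awk; `suggest_m` is a hypothetical helper, and the demo file is made up for illustration.

```shell
# Hypothetical helper (not part of skani): print avg_read_length / 20
# for a FASTA file, i.e. the upper bound on -m from the rule of thumb.
suggest_m() {
  awk '/^>/ { n++; next }
       { total += length($0) }
       END { if (n > 0) print int(total / n / 20) }' "$1"
}

# Demo on a tiny made-up FASTA file: two reads of 4000 and 6000 bases.
demo=$(mktemp)
{
  echo ">read1"; head -c 4000 /dev/zero | tr '\0' 'A'; echo
  echo ">read2"; head -c 6000 /dev/zero | tr '\0' 'C'; echo
} > "$demo"
suggest_m "$demo"   # prints 250, so choose -m below 250
rm -f "$demo"
```

The printed value is an upper bound; picking something lower gives more markers per read at the cost of more memory.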
The `--marker-index` option is available for `skani dist` and `skani search`. This loads all marker k-mers into a hash table for constant time filtering. This is turned on automatically if:
- more than 100 query files are input
- the `--qi` option is enabled.
Otherwise, it is disabled.
Importantly, if `--marker-index` is enabled, whether automatically or not, make sure your genomes (or contigs if using `--qi`) have enough marker k-mers available; otherwise the genomes/contigs may get filtered out if no markers are shared between the genomes. By default, `-m` is 1000, so there is 1 marker per 1000 bases. 20 markers per genome is a safe, conservative value for comparing at the species level, so consider decreasing `-m` if you're using `--qi` with small contigs.
Building the table can take up to a minute (for large databases), and the table itself is ~10 GB for gtdb-r207 (65k genomes) and ~20 GB for gtdb-r214. Consider changing the `-m` option, which is inversely proportional to the memory of this table, if memory is an issue.
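As a back-of-the-envelope example, the sketch below scales the ~10 GB figure quoted for gtdb-r207 at the default `-m 1000`, assuming table memory is exactly inversely proportional to `-m`. `marker_index_gb` is a hypothetical helper and the results are estimates only.

```shell
# Hypothetical helper (not part of skani): estimated marker table size in GB
# for gtdb-r207, scaled from ~10 GB at -m 1000 under a 1/m assumption.
marker_index_gb() {
  awk -v m="$1" 'BEGIN { printf "%.1f\n", 10 * 1000 / m }'
}

marker_index_gb 2000   # about 5 GB
marker_index_gb 500    # about 20 GB
```

Raising `-m` saves memory but leaves fewer markers per genome, so keep the 20-markers guideline in mind for small genomes or contigs.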
In skani v0.1.2, we changed the `--marker-index` option to `--no-marker-index`, which just reverses the behavior (i.e. it now controls whether or not you disable the index). The same rules for automatic inverted indexing still apply.
For `skani search`, the `--keep-refs` option keeps genomes that pass the first approximate ANI filter (see the "Comparing only high ANI genomes with -s" section below) in memory and does not discard these genomes from RAM. This option is useful if you're comparing many sequences that come from the same reference, so you can avoid reloading the reference for each query, but may spike memory usage.
`skani triangle` should be used for all-to-all comparisons on reasonably sized data sets. However, it loads all genome indices into memory, so RAM may be an issue. If RAM is an issue, consider:
- Pre-sketch using `skani sketch -l list_of_genomes.txt -o sketched_genomes` and run `skani search -d sketched_genomes -l list_of_genomes.txt -o output` to do slower but low-memory all-to-all comparisons.
- Raising the `-c` parameter can help; see the above section on the `-c` parameter.
- Consider raising the parameter `-m` for faster screens. It defaults to 1000; 2000 is reasonable for most bacterial genomes but may lose sensitivity on small genomes such as viruses.
skani's ANI calculations are the most accurate for genomes with > 85% ANI, although with default parameters results down to 82% ANI will usually be shown. We only output results where the aligned fraction for either the query or the reference is > 15% by default. This can be changed with the `--min-af` option, but low aligned fraction results are not accurate.
To get more accurate results for low ANI values, one should use a lower value for `-c` and `-s`, and then possibly adjust the `--min-af` option.
For example, the supplied genome `refs/MN-03.fa` is a Klebsiella pneumoniae genome, and running `skani dist refs/MN-03.fa refs/e.coli-K12.fa` returns nothing because the two genomes do not have a good enough alignment. However, `skani dist refs/MN-03.fa refs/e.coli-K12.fa -c 30 -s 75` returns an ANI estimate of ~79%.
For distant genomes, the aligned fraction output becomes more accurate as c gets smaller. However, decreasing c may not necessarily make high ANI calculations more accurate. Nevertheless, I would not recommend ANI comparisons for genomes with < 75% ANI using skani.
The option `-s` controls an approximate ANI cutoff. Computations proceed only if the putative ANI (obtained by k-mer max-containment index) is higher than `-s`. By default, this is 80 (80%) for ANI.
A value of -s below 80% will usually not work, since k-mer based ANI estimation doesn't go below 80% very well. I would not recommend lowering -s.
You can use a higher value of `-s` if you're only interested in comparing more similar strains. **You will need enough marker k-mers, i.e. `-m` needs to be set low enough.** See the "ANI calculations for small genomes/reads" section on how to set `-m`.
This cutoff is only approximate. If the true predicted ANI is greater than `-s` but the putative ANI is smaller than `-s`, the calculation does not proceed. Therefore, if `-s` is too high, you'll lose sensitivity. The reverse also holds: a putative ANI can be greater than `-s` while the true predicted ANI is less than `-s`, in which case the calculation still proceeds.
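The key point is that only the putative ANI is compared against `-s`. The sketch below restates that decision; `passes_screen` is a hypothetical helper for illustration, not skani's code, and the ANI values are made up.

```shell
# Hypothetical helper (not part of skani): only the putative k-mer ANI
# is compared against the -s cutoff; the final ANI plays no role here.
# $1 = putative ANI, $2 = value of -s
passes_screen() {
  awk -v p="$1" -v s="$2" 'BEGIN { if (p > s) print "computed"; else print "skipped" }'
}

passes_screen 79.5 80   # skipped, even if the final ANI would have been 81
passes_screen 80.5 80   # computed, even if the final ANI turns out to be 79
```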