skani cookbook
Jim Shaw edited this page Jul 18, 2024 · 9 revisions
This cookbook presents some examples of common use cases for skani and how to set parameters.
This is not a definitive guide but may be helpful for further investigation. See the basic or advanced guides for documentation.
- For bacterial/archaeal/eukaryotic genomes: when in doubt, use the default parameters.
- For smaller genomes (plasmids, viruses, etc.): you may need some tuning; see the advanced guide and the examples below.
skani sketch -t 10 -l list_of_genome_names.txt -o database
skani search genomes_in_a_folder/* -d database > results.tsv
Important points
- skani's defaults usually work fine for bacterial/archaeal genomes.
- The `-l` option takes a text file that lists one genome file per line.
- `search` uses less memory and is fast for querying a few genomes. Use `dist` for querying many genomes or contigs.
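The list file passed to `-l` can be generated rather than written by hand. The following is a minimal sketch, assuming your genomes sit in one directory and end in `.fna`; the function name `make_genome_list`, the directory name, and the extension are placeholders, not part of skani.

```shell
# Write one genome path per line, the layout that -l expects.
# Assumption: genome files end in .fna; change the pattern to match your data.
make_genome_list() {
    # $1 = directory to scan, $2 = output list file
    find "$1" -type f -name '*.fna' | sort > "$2"
}

# Example (commented out so the block has no side effects):
# make_genome_list genomes_in_a_folder list_of_genome_names.txt
# skani sketch -t 10 -l list_of_genome_names.txt -o database
```

Sorting the paths keeps the list stable between runs, which makes it easier to compare databases built at different times.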
skani triangle -s 90 my_genome_folder/* -t (threads) -E > results.tsv
# OR
skani triangle -s 93 my_genome_folder/* -t (threads) -E --medium > results.tsv
Important points
- `triangle` sets better defaults than `dist` for all-to-all comparison.
- `-s 93` or `-s 90` means skani performs ANI computation only if the ANI is approximately > 93% or > 90%, which speeds up computation. This ensures that genomes with close to 95% ANI get compared. If you set `-s` to 95, you may screen out genomes with over 95% ANI.
- `-E` outputs results in TSV format instead of matrix format.
- `--medium` may give slightly more accurate results for very fragmented genomes or lower-ANI genomes (~90%) at the cost of speed, but the difference is usually small.
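Because the `-s` screen is set below the ANI you actually care about, the final cutoff can be applied to the TSV afterwards. The sketch below is one way to do that with `awk`. It assumes the `-E` output is tab-separated with a header row that contains a column named `ANI`; check the header of your own `results.tsv` before relying on it. The function name `filter_by_ani` is hypothetical.

```shell
# Print the header plus every row whose ANI is at least the given threshold.
# Assumption: tab-separated input whose header row has a column named "ANI".
filter_by_ani() {
    # $1 = TSV file, $2 = minimum ANI (e.g. 95)
    awk -F'\t' -v min="$2" '
        NR == 1 {
            # Locate the ANI column by name instead of hard-coding its position.
            for (i = 1; i <= NF; i++) if ($i == "ANI") col = i
            if (!col) { print "no ANI column found" > "/dev/stderr"; exit 1 }
            print
            next
        }
        $col + 0 >= min + 0
    ' "$1"
}

# Example (commented out so the block has no side effects):
# filter_by_ani results.tsv 95 > pairs_ani95.tsv
```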
Tip
Since v0.2.2, skani has the `--small-genomes` option, equivalent to `-c 30 -m 200 --faster-small`.
skani triangle viruses.fna -i -m 200 --slow (OR --medium) -t (threads) -E --faster-small -s 90 > results.tsv
Important points
- `-i` compares the individual contigs within the FASTA file instead of treating the file as a single genome.
- `-m 200` sets marker k-mers to appear once every 200 bases. Genome length / `-m` should ideally be > 20. Larger contigs -> set this higher. Smaller contigs -> set this lower.
- Small genomes may benefit from the `--slow` or `--medium` options. These set `-c` to be smaller and give better AFs (aligned fractions), and sometimes (but not always!) better ANIs.
- `--faster-small` makes skani faster by using more aggressive ANI filtering for very small genomes. This increases speed for large data sets (> 10k sequences) but loses a bit of sensitivity.
- `-s 90` sets skani to screen comparisons for only approximately > 90% ANI. Feel free to set this higher or lower. Do not expect filtering to be accurate for small genomes and < 85% ANI.
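The rule of thumb that genome length / `-m` should be above 20 can be checked against a FASTA file before running skani. The following sketch lists every sequence that fails the rule for a chosen `-m`. The function name `check_marker_ratio` is hypothetical, and the script only measures sequence lengths; it does not call skani.

```shell
# List sequences whose length divided by m is 20 or less.
# Output columns (tab-separated): sequence name, length, length / m.
check_marker_ratio() {
    # $1 = FASTA file, $2 = value you plan to pass to -m
    awk -v m="$2" '
        function report() {
            if (name != "" && len / m <= 20)
                printf "%s\t%d\t%.1f\n", name, len, len / m
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { report(); name = substr($1, 2); len = 0; next }
        # Sequence lines may be wrapped, so lengths are summed per record.
        { len += length($0) }
        END { report() }
    ' "$1"
}

# Example (commented out so the block has no side effects):
# check_marker_ratio viruses.fna 200
```

If many sequences are listed, that suggests trying a smaller `-m`; with `-m 200`, for instance, the rule is met only by sequences longer than 4,000 bases.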
skani sketch database_genomes/* -o database
skani dist --qi -q my_contigs_or_reads.fasta -r database_genomes/* -t (THREADS) --faster-small -m 300 --medium (OR default) > results.tsv
Important points
- Contigs/reads < 500 bp are ignored. Short reads do not work.
- When searching many small contigs or reads, `dist` is faster than `search`, but this depends on how large your database is and how many contigs you have.
- `--qi` makes skani treat each sequence/contig in the query files (`-q`) individually instead of treating each file as a whole.
- `--faster-small` makes screening more aggressive but loses sensitivity on very small reads/contigs.
- `-m 300` gives better screening for small contigs/reads. Contig length / `-m` should ideally be > 20.
- Consider `--medium` or even `--slow` if your reads/contigs are small. The default may also be OK.
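Since sequences under 500 bp are ignored, it can help to know how much of a query file falls below that length before searching. This is a small sketch that counts them; the function name `count_short_seqs` is hypothetical, and the 500 bp default is taken from the note above.

```shell
# Count sequences shorter than a minimum length in a FASTA file.
# Prints "<number below minimum> <total sequences>".
count_short_seqs() {
    # $1 = FASTA file, $2 = minimum length in bp (defaults to 500)
    awk -v min="${2:-500}" '
        function tally() {
            if (seen) { total++; if (len < min + 0) nshort++ }
        }
        /^>/ { tally(); seen = 1; len = 0; next }
        { len += length($0) }
        END { tally(); printf "%d %d\n", nshort, total }
    ' "$1"
}

# Example (commented out so the block has no side effects):
# count_short_seqs my_contigs_or_reads.fasta
```

A large first number means much of the query will not be compared at all, whatever values are chosen for `-m` or `-c`.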