3. Command line options
Commands are issued as the first parameter on the command line and set the task to be run by the program.
-
makedb
Create a DIAMOND formatted reference database from a FASTA input file.
-
prepdb
Prepare BLAST database for use with Diamond. This call requires the path to the BLAST database (option `-d`) and will write a number of small auxiliary files into the database directory.
-
blastp
Align protein query sequences against a protein reference database.
-
blastx
Align translated DNA query sequences against a protein reference database.
-
view
Generate formatted output from DAA files.
-
version
Print version information.
-
dbinfo
Print information about a database file.
-
help
Print help message.
-
test
Run a series of test cases and verify the output against reference hashes. This command will exit with code `0` if all tests have passed and `1` otherwise. Running this command requires write access to the current working directory.
-
--in <file>
Path to the input protein reference database file in FASTA format (may be gzip compressed). If this parameter is omitted, the input will be read from `stdin`.
-
--db/-d <file>
Path to the output DIAMOND database file.
-
--taxonmap <file>
Path to mapping file that maps NCBI protein accession numbers to taxon ids (gzip compressed). This parameter is optional and needs to be supplied in order to provide taxonomy features. The file can be downloaded from NCBI: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
Versions older than v2.0.7 only support the reduced mapping file: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
A custom file following the same format may be supplied here. Note that the first line of this file is assumed to contain headings and will be ignored.
-
--taxonnodes <file>
Path to the `nodes.dmp` file from the NCBI taxonomy. This parameter is optional and needs to be supplied in order to provide taxonomy features. The file is contained within this archive downloadable at NCBI: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip.
-
--taxonnames <file>
Path to the `names.dmp` file from the NCBI taxonomy. This parameter is optional and needs to be supplied in order to provide taxonomy features. The file is contained within this archive downloadable at NCBI: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip.
-
--no-parse-seqids
For the purpose of matching sequence accessions in the database file to sequence accessions in the taxonomy mapping file, they are normally subjected to parsing rules, such as deleting prefixes and suffixes separated by pipe characters or version numbers after dots. This option can be used to disable this behaviour and use accessions as they appear in the input files. (Option supported since v2.1.7)
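As a rough illustration of how the `makedb` options above fit together, the following Python sketch assembles an argument list. The helper name `build_makedb_args` and all file names are hypothetical; only the flags themselves come from this page.

```python
from typing import List, Optional


def build_makedb_args(
    fasta: str,
    db: str,
    taxonmap: Optional[str] = None,
    taxonnodes: Optional[str] = None,
    taxonnames: Optional[str] = None,
    parse_seqids: bool = True,
) -> List[str]:
    """Assemble an argument list for `diamond makedb` (hypothetical helper)."""
    args = ["diamond", "makedb", "--in", fasta, "--db", db]
    # The taxonomy files are optional; add each flag only when a path is given.
    for flag, path in (
        ("--taxonmap", taxonmap),
        ("--taxonnodes", taxonnodes),
        ("--taxonnames", taxonnames),
    ):
        if path is not None:
            args += [flag, path]
    # --no-parse-seqids keeps accessions exactly as they appear in the input.
    if not parse_seqids:
        args.append("--no-parse-seqids")
    return args


if __name__ == "__main__":
    # File names below are placeholders, not files shipped with the program.
    print(" ".join(build_makedb_args(
        "proteins.fasta.gz",
        "reference",
        taxonmap="prot.accession2taxid.FULL.gz",
        taxonnodes="nodes.dmp",
        taxonnames="names.dmp",
    )))
```

The resulting list can be handed to `subprocess.run` as is, which avoids shell quoting issues.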
-
--threads/-p #
Number of CPU threads. By default, the program will auto-detect and use all available virtual cores on the machine.
-
--quiet
Disable all terminal output.
-
--verbose/-v
Enable more verbose terminal output.
-
--log
Enable even more verbose terminal output, which is also written to a file named `diamond.log` in the current working directory.
-
--db/-d <file>
Path to the DIAMOND database file. Since v2.0.8, a BLAST database can also be used here. Specify the base path of the database without file extensions. Since v2.0.10, BLAST databases have to be prepared using the `prepdb` command. Note that for self-made BLAST databases, `makeblastdb` should be used with the `-parse_seqids` option.
-
--query/-q <file>
Path to the query input file in FASTA or FASTQ format (may be gzip compressed, or zstd compressed if compiled with zstd support). If this parameter is omitted, the input will be read from `stdin`.
Two files that contain the same number of sequences may be supplied when running in blastx mode. Supported since v2.0.7.
-
--taxonlist <list>
Comma-separated list of NCBI taxonomic IDs to filter the database by. Any taxonomic rank can be used, and only reference sequences matching one of the specified taxon ids will be searched against. Using this option requires setting the `--taxonmap` and `--taxonnodes` parameters for `makedb`.
-
--taxon-exclude <list>
Comma-separated list of NCBI taxonomic IDs to exclude from the database. Using this option requires setting the `--taxonmap` and `--taxonnodes` parameters for `makedb`.
-
--seqidlist <filename>
Filter the database by a list of accessions provided as a text file. Only supported when using a BLAST database.
-
--query-gencode #
Genetic code used for translation of query in BLASTX mode. A list of possible values can be found at the NCBI website. By default, the Standard Code is used. Note: changing the genetic code is currently not fully supported for the DAA format.
-
--strand {both, plus, minus}
Set strand of query to align for translated searches. By default both strands are searched.
-
--min-orf/-l #
Ignore translated sequences that do not contain an open reading frame of at least this length. By default this feature is disabled for sequences of length below 30, set to 20 for sequences of length below 100, and set to 40 otherwise. Setting this option to `1` will disable this feature.
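The default thresholds for `--min-orf` described above can be written as a small lookup. This is only a sketch of the stated rule, not the program's code: the function name is hypothetical, and representing "disabled" as `1` is an assumption based on the note that a value of `1` disables the feature.

```python
def default_min_orf(query_length: int) -> int:
    """Return the default --min-orf value for a query of the given length."""
    if query_length < 30:
        return 1   # feature disabled (assumed to be equivalent to --min-orf 1)
    if query_length < 100:
        return 20
    return 40


if __name__ == "__main__":
    for length in (25, 75, 150):
        print(length, default_min_orf(length))
```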
Without using any sensitivity option, the default mode will run, which is designed for finding hits of >60% identity and short read alignment. Its sensitivity is between `--fast` and `--mid-sensitive`.
-
--fast
Enable the fast sensitivity mode, which runs faster than default and is designed for finding hits of >90% identity. (Option supported since v2.0.10)
-
--mid-sensitive
Enable the mid-sensitive mode which is between the default mode and the sensitive mode in sensitivity. (Option supported since v2.0.3)
-
--sensitive
Enable the sensitive mode designed for full sensitivity for hits of >40% identity.
-
--more-sensitive
This mode is equivalent to the `--sensitive` mode except for soft-masking of certain motifs being disabled (same as setting `--motif-masking 0`).
-
--very-sensitive
Enable the very-sensitive mode designed for best sensitivity including the twilight zone range of <40% identity. (Option supported since v2.0.0)
-
--ultra-sensitive
Enable the ultra-sensitive mode which is yet more sensitive than the `--very-sensitive` mode. (Option supported since v2.0.0)
-
--iterate
Run multiple rounds of searches with increasing sensitivity. The query dataset will first be searched at a lower sensitivity setting, only searching those query sequences at the target sensitivity that fail to produce a significant alignment at a lower sensitivity. The target sensitivity is set using the options listed above. (Option supported since v2.0.10)
This option can substantially improve performance if only a single best alignment is required for each query, and is generally recommended for this use case.
The sensitivity modes of the earlier rounds prior to the target sensitivity are set automatically, but can also be specified using a space-separated list following `--iterate` (supported since v2.0.12). The keywords correspond to the flags listed above without the leading dashes, and `default` for the default mode.
-
--global-ranking/-g #
Set a hard limit on the number of Smith Waterman extensions that will be computed for each query. Target sequences will be ranked according to their ungapped extension scores at seed hits, and gapped extensions will only be computed for the best N targets for each query. Note that this option increases memory use. (Option supported since v2.0.10)
-
--frameshift/-F #
Penalty for frameshifts in DNA-vs-protein alignments. Values around 15 are reasonable for this parameter. Enabling this feature will have the aligner tolerate missing bases in DNA sequences and is most recommended for long, error-prone sequences like MinION reads.
In the pairwise output format, frameshifts will be indicated by `\` and `/` for a shift by +1 and -1 nucleotide in the direction of translation respectively. Note that this feature is disabled by default.
-
--gapopen #
Gap open penalty.
-
--gapextend #
Gap extension penalty.
-
--matrix <matrix name>
Scoring matrix. The following matrices are supported, with the default being BLOSUM62.
| Matrix | Supported values for (gap open)/(gap extend) | Default gap penalties |
|---|---|---|
| BLOSUM45 | (10-13)/3; (12-16)/2; (16-19)/1 | 14/2 |
| BLOSUM50 | (9-13)/3; (12-16)/2; (15-19)/1 | 13/2 |
| BLOSUM62 | (6-11)/2; (9-13)/1 | 11/1 |
| BLOSUM80 | (6-9)/2; 13/2; 25/2; (9-11)/1 | 10/1 |
| BLOSUM90 | (6-9)/2; (9-11)/1 | 10/1 |
| PAM250 | (11-15)/3; (13-17)/2; (17-21)/1 | 14/2 |
| PAM70 | (6-8)/2; (9-11)/1 | 10/1 |
| PAM30 | (5-7)/2; (8-10)/1 | 9/1 |

-
--custom-matrix <file>
Use a custom scoring matrix (example of the file format).
-
--masking (0,1,seg)
DIAMOND by default applies the tantan repeat masking algorithm to the query and target sequences as described in (Frith, 2011). This masking procedure increases the specificity of alignments and serves to filter out spurious hits. If this is not desired, repeat masking can be disabled using `--masking 0`, or the default BLASTP SEG masking can be used instead by setting `--masking seg` (supported since v2.0.12). Note that when using `--comp-based-stats (2,3,4)`, tantan masking is disabled by default.
-
--comp-based-stats (0,1,2,3,4)
Enable composition based statistics. These algorithms adjust alignment scores based on sequence composition in order to improve search specificity. The following modes are supported:
-
0
Disable composition based statistics.
-
1
Compositional correction as described in (Hauser, 2016). This mode is the default.
-
2
Compositional matrix adjust as described in (Yu, 2005), conditioned on sequence properties. An adjusted matrix is used if the compositional angle (Altschul, 2005) between the sequence pairs is less than 50 degrees, otherwise falling back on the (Hauser, 2016) method. This mode also uses a simplified version of the algorithm that runs faster, but produces slightly less accurate scores. Supported since v2.0.6.
-
3
Compositional matrix adjust as described in (Yu, 2005), conditioned on sequence properties. An adjusted matrix is used if the compositional angle (Altschul, 2005) between the sequence pairs is less than 50 degrees, otherwise falling back on the (Hauser, 2016) method. Supported since v2.0.6.
-
4
Compositional matrix adjust as described in (Yu, 2005), unconditionally. An adjusted matrix is computed for all alignments, which substantially reduces performance, but provides the highest accuracy. Supported since v2.0.6.
The (Yu, 2005) method is the same algorithm also used by NCBI BLAST.
Modes `0` and `1` use tantan repeat masking by default, while modes `2`, `3` and `4` do not use tantan repeat masking by default. These modes will instead apply a more conservative SEG masking to only the target sequences (parameters: window=10, locut=1.8, hicut=2.1). This is the same masking that is also used by BLASTP by default, and these modes will therefore produce scores and alignments that are more similar to those of BLAST.
Modes 2-4 are not yet supported for translated searches (blastx mode).
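The relationship between the `--comp-based-stats` mode, the default masking, and blastx support described above can be summarised in a few lines. This is a sketch for illustration only; the function names and the returned labels are hypothetical.

```python
def default_masking(comp_based_stats: int) -> str:
    """Return the masking applied by default for a --comp-based-stats mode."""
    if comp_based_stats in (0, 1):
        return "tantan"  # repeat masking of query and target sequences
    if comp_based_stats in (2, 3, 4):
        return "seg"     # SEG masking, applied to the target sequences only
    raise ValueError(f"unknown --comp-based-stats mode: {comp_based_stats}")


def mode_supported(command: str, comp_based_stats: int) -> bool:
    """Modes 2-4 are documented as not yet supported for blastx."""
    return not (command == "blastx" and comp_based_stats in (2, 3, 4))


if __name__ == "__main__":
    for mode in range(5):
        print(mode, default_masking(mode), mode_supported("blastx", mode))
```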
-
-
--algo (0,1,ctg)
Algorithm for seed search. `0` means double-indexed and is the main algorithm of the program, designed for large input files but less efficient for small query files. `1` means query-indexed and improves performance for small query files. This mode will be automatically triggered based on the input. `ctg` means contiguous-seed mode and further improves performance for small query files. This mode needs to be manually set by the user (available since v2.0.10).
The modes differ slightly in their sensitivity, so results are not guaranteed to be 100% identical for different settings of this option.
-
--out/-o <file>
Path to the output file. If this parameter is omitted, the results will be written to the standard output and all other program output will be suppressed.
-
--outfmt/-f #
Format of the output file. The following values are accepted:
-
0
BLAST pairwise format.
-
5
BLAST XML format.
-
6
BLAST tabular format (default). This format can be customized: the `6` may be followed by a space-separated list of the following keywords, each specifying a field of the output. N.B.: these additional arguments should not be quoted as is often required for other tools, e.g. use `diamond --outfmt 6 qseqid sseqid`, not `diamond --outfmt '6 qseqid sseqid'`.
-
qseqid
Query Seq-id
-
qlen
Query sequence length
-
sseqid
Subject Seq-id
-
sallseqid
All subject Seq-id(s), separated by a ';'
-
slen
Subject sequence length
-
qstart
Start of alignment in query*
-
qend
End of alignment in query*
-
sstart
Start of alignment in subject*
-
send
End of alignment in subject*
-
qseq
Aligned part of query sequence*
-
qseq_translated
Aligned part of query sequence (translated)*. Supported since v2.0.7.
-
full_qseq
Full query sequence
-
full_qseq_mate
Query sequence of the mate (requires two files for `--query`). Supported since v2.0.7.
-
sseq
Aligned part of subject sequence*
-
full_sseq
Full subject sequence
-
evalue
Expect value
-
bitscore
Bit score
-
score
Raw score
-
length
Alignment length*
-
pident
Percentage of identical matches*
-
nident
Number of identical matches*
-
mismatch
Number of mismatches*
-
positive
Number of positive-scoring matches*
-
gapopen
Number of gap openings*
-
gaps
Total number of gaps*
-
ppos
Percentage of positive-scoring matches*
-
qframe
Query frame
-
btop
Blast traceback operations (BTOP)*
-
cigar
CIGAR string*
-
staxids
Unique Subject Taxonomy ID(s), separated by a ';' (in numerical order). This field requires setting the `--taxonmap` parameter for `makedb`.
-
sscinames
Unique Subject Scientific Name(s), separated by a ';'. This field requires setting the `--taxonmap` and `--taxonnames` parameters for `makedb`.
-
sskingdoms
Unique Subject Super Kingdom(s), separated by a ';'. This field requires setting the `--taxonmap`, `--taxonnodes` and `--taxonnames` parameters for `makedb`.
-
skingdoms
Unique Subject Kingdom(s), separated by a ';'. This field requires setting the `--taxonmap`, `--taxonnodes` and `--taxonnames` parameters for `makedb`.
-
sphylums
Unique Subject Phylum(s), separated by a ';'. This field requires setting the `--taxonmap`, `--taxonnodes` and `--taxonnames` parameters for `makedb`.
-
stitle
Subject Title
-
salltitles
All Subject Title(s), separated by a '<>'
-
qcovhsp
Query Coverage Per HSP*
-
scovhsp
Subject Coverage Per HSP*
-
qtitle
Query title
-
qqual
Query quality values for the aligned part of the query*
-
full_qqual
Query quality values
-
qstrand
Query strand
By default, there are 12 preconfigured fields: `qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore`.
*These fields require alignment traceback. If none of these fields are selected, traceback computation will be disabled, which improves performance and reduces use of temporary disk space.
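To illustrate the tabular format (`--outfmt 6`), the Python sketch below builds the option with each field keyword as a separate, unquoted argument and parses one tab-separated output line into a dictionary. The helper names and the sample line are hypothetical, and the split into integer and float fields is an assumption made for this example.

```python
from typing import Dict, List, Sequence, Union

# The 12 default fields of --outfmt 6, in output order (from the text above).
DEFAULT_FIELDS: List[str] = (
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore"
).split()

# Assumption for this sketch: which of the default fields are numeric.
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def outfmt_args(fields: Sequence[str]) -> List[str]:
    """Return --outfmt 6 followed by each keyword as its own argument."""
    return ["--outfmt", "6", *fields]


def parse_hit(
    line: str, fields: Sequence[str] = DEFAULT_FIELDS
) -> Dict[str, Union[str, int, float]]:
    """Parse one tab-separated output line into a field-name -> value mapping."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} columns, got {len(values)}")
    record: Dict[str, Union[str, int, float]] = {}
    for name, value in zip(fields, values):
        if name in INT_FIELDS:
            record[name] = int(value)
        elif name in FLOAT_FIELDS:
            record[name] = float(value)
        else:
            record[name] = value
    return record


if __name__ == "__main__":
    # Made-up example line, not real program output.
    sample = "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t350.1"
    print(outfmt_args(["qseqid", "sseqid"]))
    print(parse_hit(sample))
```

Passing the keywords as separate list elements mirrors the requirement that they must not be quoted as a single argument.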
-
-
100
DIAMOND alignment archive (DAA). The DAA format is a proprietary binary format that can subsequently be used to generate other output formats using the `view` command. It is also supported by MEGAN and allows a quick import of results. Note that this format does not support streaming output or taxonomy features. It is considered a legacy format at this time.
-
101
SAM format.
-
102
Taxonomic classification. This format will not print alignments but only a taxonomic classification for each query using the LCA algorithm. The output lines consist of 3 tab-delimited fields:
-
Query ID
-
NCBI taxonomy ID (0 if unclassified)
-
E-value of the best alignment with a known taxonomic ID found for the query (0 if unclassified)
The score range for the LCA algorithm is set by the `--top` parameter. The default value is 10, which means that all alignments whose score is at most 10% lower than the best score are considered for the LCA computation.
Using this format requires setting the `--taxonmap` and `--taxonnodes` parameters for `makedb`.
-
-
103
PAF format. The custom fields in the format are AS (bit score), ZR (raw score) and ZE (e-value).
Additionally, for creating output in Apache Parquet or DuckDB format, see: File formats
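The three-column taxonomic classification output (format `102`) described above could be read along the following lines. This is a minimal sketch; the class and function names are hypothetical and the sample lines are made up.

```python
from typing import NamedTuple


class Classification(NamedTuple):
    query_id: str
    taxon_id: int   # NCBI taxonomy ID, 0 if unclassified
    evalue: float   # e-value of the best alignment, 0 if unclassified

    @property
    def classified(self) -> bool:
        return self.taxon_id != 0


def parse_classification(line: str) -> Classification:
    """Parse one line of --outfmt 102 output (3 tab-delimited fields)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 columns, got {len(fields)}")
    query_id, taxon_id, evalue = fields
    return Classification(query_id, int(taxon_id), float(evalue))


if __name__ == "__main__":
    # Made-up example lines, not real program output.
    for raw in ("read1\t562\t1e-30", "read2\t0\t0"):
        result = parse_classification(raw)
        print(result.query_id, result.taxon_id, result.classified)
```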
-
-
--salltitles
Include full length subject titles into the DAA format. By default, DAA files contain only the shortened sequence id (up to the first blank character).
-
--sallseqid
Include all subject ids into the DAA file. By default only the first id of each subject is included. As the subject ids are much shorter than the full titles this option will save space compared to the `--salltitles` option.
-
--compress (0,1,zstd)
Enable compression of the output file. `0` (default) means no compression, `1` means gzip compression, `zstd` means zstd compression (executable is required to have been compiled with zstd support).
-
--max-target-seqs/-k #
The maximum number of target sequences per query to report alignments for (default=25). Setting this to `-k0` will report all targets for which alignments were found.
Note that this parameter does not only affect the reporting, but also the algorithm as it is taken into account for heuristics that eliminate hits prior to full gapped extension.
-
--top #
Report alignments within the given percentage range of the top alignment score for a query (overrides `--max-target-seqs` option). For example, setting `--top 10` will report all alignments whose score is at most 10% lower than the best alignment score for a query. Using this option will cause targets to be sorted by bit score instead of e-value in the output.
Note that this parameter does not only affect the reporting, but also the algorithm as it is taken into account for heuristics that eliminate hits prior to full gapped extension.
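The reporting rule behind `--top` can be illustrated with a short filter over a list of scores. This sketch covers only the reporting criterion stated above, not the heuristics inside the aligner; the function name is hypothetical and the scores are made-up values.

```python
from typing import List, Sequence


def within_top(scores: Sequence[float], top_percent: float) -> List[float]:
    """Keep scores that are at most `top_percent` percent below the best score."""
    if not scores:
        return []
    best = max(scores)
    # s >= best * (1 - top/100), written without division to limit rounding error.
    kept = [s for s in scores if s * 100 >= best * (100 - top_percent)]
    # Mirror the documented ordering: highest score first.
    return sorted(kept, reverse=True)


if __name__ == "__main__":
    print(within_top([170.0, 200.0, 190.0], 10))
```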
-
--max-hsps #
The maximum number of HSPs (High-Scoring Segment Pairs) per target sequence to report for each query. The default policy is to report only the highest-scoring HSP for each target, while disregarding alternative, lower-scoring HSPs that are contained in the same target. This is not to be confused with the `--max-target-seqs` option.
This parameter can be increased to report an alternative HSP if its query and subject ranges are not enveloped by a higher scoring HSP and if it meets the e-value threshold. Setting this option to `--max-hsps 0` will report all alternative HSPs.
A non-default setting will always recompute a full-matrix Smith-Waterman alignment with the range of the best HSP masked in the target (since v2.0.12). Therefore using this setting will reduce performance substantially.
-
--range-culling
Enable hit culling with respect to the query range. This feature is designed for long query DNA sequences that may span several genes. In these cases, reporting the overall top N hits can cause hits to a lower-scoring gene to be superseded by a higher-scoring gene. Using this option, hit culling will be performed locally with respect to a hit's query range, thus reporting the locally top N hits while allowing more hits that span a different region of the query.
Using this feature along with `-k 25` (default), a hit will only be deleted if at least 50% of its query range is spanned by at least 25 higher or equal scoring hits.
Using this feature along with `--top 10`, a hit will only be deleted if its score is more than 10% lower than that of a higher scoring hit over at least 50% of its query range.
The overlap percentage is configurable using `--range-cover`. Note that this feature is currently only available in frameshift alignment mode.
-
--evalue/-e #
Maximum expected value to report an alignment (default=0.001).
-
--min-score #
Minimum bit score to report an alignment. Setting this option will override the `--evalue` parameter.
-
--id #
Report only alignments above the given percentage of sequence identity.
Note that using this option reduces performance.
-
--query-cover #
Report only alignments above the given percentage of query cover.
Note that using this option reduces performance.
-
--subject-cover #
Report only alignments above the given percentage of subject cover.
Note that using this option reduces performance.
-
--unal (0,1)
Report unaligned queries (0=no, 1=yes). By default, unaligned queries are reported for the BLAST pairwise, BLAST XML and SAM format.
-
--no-self-hits
Suppress reporting of identical self-hits between sequences. The FASTA sequence identifiers as well as the sequences of query and target need to be identical for a hit to be deleted.
-
--block-size/-b #
Block size in billions of sequence letters to be processed at a time. This is the main parameter for controlling the program’s memory and disk space usage. Bigger numbers will increase the use of memory and temporary disk space, but also improve performance. The program can be expected to use roughly six times this number in memory (in GB).
The default value is `-b2.0`. The parameter can be decreased to reduce memory use, or increased for better performance (values of >20 are not recommended).
The very-sensitive and ultra-sensitive modes use `-b0.4` as a default. Note that these two modes benefit only slightly from increasing this parameter.
Note that this parameter affects the algorithm and results will not be completely identical for different values of the block size.
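The rule of thumb above (memory in GB is roughly six times the block size) can be turned into a quick estimate. The sketch below is only that rule applied in both directions; the function names are hypothetical and actual memory use depends on the data and the other settings.

```python
MEMORY_FACTOR = 6.0  # "roughly six times this number", taken from the text above


def estimate_memory_gb(block_size: float) -> float:
    """Rough memory estimate in GB for a given --block-size value."""
    return MEMORY_FACTOR * block_size


def block_size_for_memory(memory_gb: float) -> float:
    """Largest --block-size whose rough estimate fits into `memory_gb`."""
    return memory_gb / MEMORY_FACTOR


if __name__ == "__main__":
    print(estimate_memory_gb(2.0))      # default block size -b2.0
    print(block_size_for_memory(48.0))  # block size for a 48 GB budget
```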
-
--tmpdir/-t <directory>
Directory to be used for temporary storage. This is set to the output directory by default. The amount of disk space that will be used depends on the program’s settings and the input data. As a general rule it should be ensured that 100 GB of disk space are available here.
If the program is being run in a cluster environment, and disk space is mounted over a network based file system, it is recommended to set this parameter to a fast local disk or to `/dev/shm` to avoid any I/O bottlenecks.
-
--index-chunks/-c #
The number of chunks for processing the seed index. This option can be additionally used to tune the performance. The default value is `-c4`, while setting this parameter to `-c1` instead will improve the performance at the cost of increased memory use. Note that the very-sensitive and ultra-sensitive modes use `-c1` by default.
-
--daa/-a <file>
Path to input file in DAA format.
-
--out/-o <file>
Path to output file. If this parameter is omitted, the results will be written to the standard output and all other program output will be suppressed.
These aligner parameters apply to the view command as well and work in the same way: `--outfmt`, `--compress`, `--max-target-seqs`, `--top`. Note that taxonomy features are currently not available for the DAA format.
Altschul SF, Wootton JC, Gertz EM, Agarwala R, Morgulis A, Schäffer AA, Yu YK. Protein database searches using compositionally adjusted substitution matrices. FEBS J. 2005 Oct;272(20):5101-9. doi: 10.1111/j.1742-4658.2005.04945.x. PMID: 16218944; PMCID: PMC1343503.
Martin C. Frith; A new repeat-masking method enables specific detection of homologous sequences, Nucleic Acids Research, Volume 39, Issue 4, 1 March 2011, Page e23, https://doi.org/10.1093/nar/gkq1212
Maria Hauser, Martin Steinegger, Johannes Söding; MMseqs software suite for fast and deep clustering and searching of large protein sequence sets, Bioinformatics, Volume 32, Issue 9, 1 May 2016, Pages 1323–1330
Yi-Kuo Yu, Stephen F. Altschul; The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions, Bioinformatics, Volume 21, Issue 7, 1 April 2005, Pages 902–911.