- `v-build.pl` example usage
- `v-build.pl` command-line options
  - basic options
  - options for controlling what feature types are stored in the output model info file
  - options for controlling what qualifiers are stored in the output model info file
  - options for including additional model attributes (`group` and `subgroup`)
  - options for controlling the CDS translation step (translation tables)
  - options for controlling the `cmbuild` step
  - options for skipping stages
  - options for additional output files
  - additional expert options
- Building a VADR model library
- Advanced tutorial: building an RSV model library
- How the VADR 1.0 model library was constructed
v-build.pl
creates the VADR model files for a specified reference
GenBank (typically RefSeq) sequence necessary for running
v-annotate.pl
to validate and annotate sequences similar to that
reference sequence. It is recommended to run v-build.pl
only on sequences
of length 25Kb (25,000 nucleotides) or less due to the prohibitively
large memory requirements of v-annotate.pl
for larger models. To
determine the command-line usage of v-build.pl
(or any VADR script),
use the -h
option, like this:
v-build.pl -h
You'll see something like the following output:
# v-build.pl :: build homology model of a single sequence for feature annotation
# VADR 1.6 (Nov 2023)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date: Tue Nov 7 10:07:21 2023
#
Usage: v-build.pl [-options] <accession> <path to output directory to create>
The first few lines are the banner, which shows the name of the VADR
script being run along with the version and release date. This is
followed by the time and date the command was executed. The Usage:
line details the expected command line arguments. v-build.pl
takes
as input two command line arguments, the name of the reference
accession (<accession>
) and the name of the output directory you
want it to create (<path to output directory to create>
) and
populate with output files, including model files for the accession to
use with v-annotate.pl
.
After that comes a list of all available command-line options. These are explained in more detail below.
Here is an example v-build.pl
command using the RefSeq
accession NC_039897,
a Norovirus GI complete genome sequence, and creating
an output directory with the same name as the accession:
v-build.pl NC_039897 NC_039897
If the above command fails with a (potentially very long) error that
starts with `Can't locate LWP/Protocol/https.pm in @INC` or something
similar, you will need to install one or more Perl modules before
`v-build.pl` will run, as explained more here.
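As a quick check before installing anything, you can ask Perl whether it can load the module named in the error message. The sketch below is a minimal helper, assuming `perl` is on your `PATH`; the function name `has_perl_module` is hypothetical and not part of VADR.

```shell
# Return success (exit status 0) if the named Perl module can be loaded.
# Assumes `perl` is on your PATH; the function name is hypothetical.
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Example: report whether the module from the error message is available.
# if has_perl_module LWP::Protocol::https; then
#     echo "LWP::Protocol::https is installed"
# else
#     echo "LWP::Protocol::https is missing; install it before running v-build.pl"
# fi
```

The file `LWP/Protocol/https.pm` in the error corresponds to the module name `LWP::Protocol::https`, which is the form the helper expects.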
The standard output of v-build.pl
that is printed to the screen
(which is also output to the .log
output file) begins with the
banner and date again followed by a list of relevant environment
variables, the command line arguments used and any command line
options used:
# $VADRBLASTDIR: /home/nawrocki/vadr-install-dir/ncbi-blast
# $VADREASELDIR: /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRINFERNALDIR: /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRSCRIPTSDIR: /home/nawrocki/vadr-install-dir/vadr
#
# accession/model name: NC_039897
# output directory: NC_039897
No command line options were used in our example output, but if they
were, information on them would have appeared after the output directory
line.
Next, information is output about each step the script is proceeding
through. When each step is completed, the elapsed time for that step
is output. The first few steps for v-build.pl
are to fetch and parse
a FASTA sequence file and feature table file for the input accession
NC_039897
.
# Fetching FASTA file ... done. [ 3.4 seconds]
# Parsing FASTA file ... done. [ 0.0 seconds]
# Fetching feature table file ... done. [ 3.4 seconds]
# Parsing feature table file ... done. [ 0.0 seconds]
Next, any relevant feature tables for proteins referenced in the
NC_039897
feature table are fetched and parsed. The fetched
data from GenBank is then pruned to remove information that will
not be stored in the VADR model files:
# Fetching and parsing protein feature table file(s) ... done. [ 10.3 seconds]
# Pruning data read from GenBank ... done. [ 0.0 seconds]
The fetched FASTA file is then reformatted to Stockholm format for input to
`cmbuild`, which will later create the model CM file. Next, the BLAST
nucleotide database is constructed, feature information read from the
feature tables is finalized internally, and CDS features are translated
and used to create the BLAST protein database with `makeblastdb` and the
HMMER protein HMM database with `hmmbuild`. Finally, any intron splice
sites are checked.
# Reformatting FASTA file to Stockholm file ... done. [ 0.0 seconds]
# Building BLAST nucleotide database ... done. [ 0.7 seconds]
# Finalizing feature information ... done. [ 0.0 seconds]
# Translating CDS ... done. [ 0.2 seconds]
# Building BLAST protein database ... done. [ 0.3 seconds]
# Building HMMER protein database ... done. [ 1.1 seconds]
# Checking intron splice sites, if any ... done. [ 0.0 seconds]
Next, the covariance model is built using cmbuild
. This is by far
the slowest step of v-build.pl
and a rough estimate of how long it
will take is output. The longer the sequence being modeled the longer
this step will take. When that completes the cm file is pressed with
cmpress
to prepare it for use with v-annotate.pl
, and a BLAST
nucleotide database is created from the CM consensus sequence. The final
step is to create the model info file:
# Building CM (should take roughly 10-30 minutes) ... done. [ 477.0 seconds]
# Pressing CM file ... done. [ 0.4 seconds]
# Creating model info file ... done. [ 0.0 seconds]
When all steps are complete, v-build.pl
ends by outputting a list of
relevant output files with brief descriptions:
# Output printed to screen saved in: NC_039897.vadr.log
# List of executed commands saved in: NC_039897.vadr.cmd
# List and description of all output files saved in: NC_039897.vadr.filelist
# fasta file for NC_039897 saved in: NC_039897.vadr.fa
# feature table format file for NC_039897 saved in: NC_039897.vadr.tbl
# feature table format file for YP_009538340.1 saved in: NC_039897.vadr.YP_009538340.1.tbl
# feature table format file for YP_009538341.1 saved in: NC_039897.vadr.YP_009538341.1.tbl
# feature table format file for YP_009538342.1 saved in: NC_039897.vadr.YP_009538342.1.tbl
# Stockholm alignment file for NC_039897 saved in: NC_039897.vadr.stk
# nucleotide blastn db fasta sequence file for NC_039897 saved in: NC_039897.vadr.fa
# BLAST db .nhr file for NC_039897 saved in: NC_039897.vadr.fa.nhr
# BLAST db .nin file for NC_039897 saved in: NC_039897.vadr.fa.nin
# BLAST db .nsq file for NC_039897 saved in: NC_039897.vadr.fa.nsq
# BLAST db .ndb file for NC_039897 saved in: NC_039897.vadr.fa.ndb
# BLAST db .not file for NC_039897 saved in: NC_039897.vadr.fa.not
# BLAST db .ntf file for NC_039897 saved in: NC_039897.vadr.fa.ntf
# BLAST db .nto file for NC_039897 saved in: NC_039897.vadr.fa.nto
# fasta sequence file for CDS from NC_039897 saved in: NC_039897.vadr.cds.fa
# fasta sequence file for translated CDS from NC_039897 saved in: NC_039897.vadr.protein.fa
# BLAST db .phr file for NC_039897 saved in: NC_039897.vadr.protein.fa.phr
# BLAST db .pin file for NC_039897 saved in: NC_039897.vadr.protein.fa.pin
# BLAST db .psq file for NC_039897 saved in: NC_039897.vadr.protein.fa.psq
# BLAST db .pdb file for NC_039897 saved in: NC_039897.vadr.protein.fa.pdb
# BLAST db .pot file for NC_039897 saved in: NC_039897.vadr.protein.fa.pot
# BLAST db .ptf file for NC_039897 saved in: NC_039897.vadr.protein.fa.ptf
# BLAST db .pto file for NC_039897 saved in: NC_039897.vadr.protein.fa.pto
# HMMER model db file for NC_039897 saved in: NC_039897.vadr.protein.hmm
# hmmbuild build output (concatenated) saved in: NC_039897.vadr.protein.hmmbuild
# binary HMM and p7 HMM filter file saved in: NC_039897.vadr.protein.hmm.h3m
# SSI index for binary HMM file saved in: NC_039897.vadr.protein.hmm.h3i
# optimized p7 HMM filters (MSV part) saved in: NC_039897.vadr.protein.hmm.h3f
# optimized p7 HMM filters (remainder) saved in: NC_039897.vadr.protein.hmm.h3p
# hmmpress output file saved in: NC_039897.vadr.hmmpress
# CM file saved in: NC_039897.vadr.cm
# cmbuild output file saved in: NC_039897.vadr.cmbuild
# binary CM and p7 HMM filter file saved in: NC_039897.vadr.cm.i1m
# SSI index for binary CM file saved in: NC_039897.vadr.cm.i1i
# optimized p7 HMM filters (MSV part) saved in: NC_039897.vadr.cm.i1f
# optimized p7 HMM filters (remainder) saved in: NC_039897.vadr.cm.i1p
# cmpress output file saved in: NC_039897.vadr.cmpress
# VADR 'model info' format file for NC_039897 saved in: NC_039897.vadr.minfo
#
# All output files created in directory ./NC_039897/
These files and their formats are described more here.
Only some of these files will be used by v-annotate.pl
. These are
the files with the following suffixes:
| file suffix | description | reference |
|---|---|---|
| `.protein.fa.p{hr,in,sq,db,ot,tf,to}` | BLAST protein database index files, created by `makeblastdb` | binary files, not meant to be human-readable |
| `.fa.n{hr,in,sq,db,ot,tf,to}` | BLAST nucleotide database index files, created by `makeblastdb` | binary files, not meant to be human-readable |
| `.cm` | Infernal 1.1x covariance model file | http://eddylab.org/infernal/Userguide.pdf (section 9: "File and output formats") |
| `.cm.i1{m,i,f,p}` | Infernal 1.1x covariance model index files, created by `cmpress` | binary files, not meant to be human-readable |
| `.hmm` | HMMER 3.x HMM file | http://eddylab.org/software/hmmer/Userguide.pdf ("HMMER profile HMM files" section) |
| `.hmm.h3{m,i,f,p}` | HMMER 3.x HMM index files, created by `hmmpress` | binary files, not meant to be human-readable |
| `.minfo` | VADR model info file | description of format |
You can use this model on its own with `v-annotate.pl` to annotate
sequences that are similar to NC_039897 by using the `--mdir` and
`--mkey` options with a command like
`v-annotate.pl --mdir NC_039897 --mkey NC_039897.vadr <fasta-seq-file> <output-directory>`,
as explained more here.
Alternatively, you can combine these files together with analogous
files from additional v-build.pl
runs for other accessions to create
a VADR model library and then use v-annotate.pl
--mdir
and --mkey
options to specify that the library be used. This is explained in more
detail below. The VADR 1.0 library was created in this
manner, as explained in another section.
To get a list of command-line options, execute:
v-build.pl -h
This will output the usage and available command-line options.
Each option has a short description, but additional information on some
of these options can be found below.
For v-build.pl
the available options are split into nine different categories,
each explained in its own subsection below.
In the tables describing options below, <s>
represents a string
and <n>
represents an integer.
The first category of options are the basic options:
| option | explanation |
|---|---|
| `-f` | if `<output directory>` already exists, then using this option will cause it to be overwritten, otherwise the program exits in error |
| `-v` | verbose mode: all commands will be output to standard output as they are run |
| `--stk <s>` | single sequence Stockholm alignment will be read from `<s>`, possibly with secondary structure annotation |
| `--infa <s>` | instead of fetching the FASTA for this sequence from GenBank, use the sequence in file `<s>` |
| `--inft <s>` | instead of fetching the feature table for this sequence from GenBank, use the feature table in file `<s>` |
| `--ftfetch1` | use `efetch` program (must be in your `PATH`) to fetch feature table with `efetch -format ft` instead of default method of fetching from an eutils URL |
| `--ftfetch2` | use `efetch` program (must be in your `PATH`) to fetch feature table with `efetch -format gbc \| …` |
| `--gb` | fetch and parse a GenBank-format file from GenBank instead of a feature table |
| `--ingb <s>` | read the GenBank-format file in `<s>` instead of a feature table file (requires `--gb`) |
| `--addminfo <s>` | add arbitrary feature info in file `<s>` to output `.minfo` file, see an example here |
| `--forcelong` | use at your own risk; allow long models > 25Kb in length; by default `v-build.pl` will fail for any model more than 25Kb (25,000 nucleotides) because `v-build.pl` will be very slow and the memory requirements of `v-annotate.pl` will be prohibitively large |
| `--keep` | keep additional output files that are normally removed |
By default, only CDS
, gene
and mat_peptide
feature types read from the GenBank feature table file
will be stored in the output .minfo
file. This default set can be changed using the following three
command line options. For an example of using the --fadd
option, see the construction of the dengue virus
RefSeq models for the VADR 1.0 model library here.
| option | explanation |
|---|---|
| `--fall` | specify that all feature types (except those in `<s>` from `--fskip <s>`) be added to the `.minfo` output file |
| `--fadd <s>` | add feature types listed in `<s>` to the default set, where `<s>` is a comma-separated string with each feature type separated by a comma with no whitespace |
| `--fskip <s>` | do not store information for feature types listed in `<s>`, where `<s>` is a comma-separated string with each feature type separated by a comma with no whitespace; `<s>` may contain feature types from the default set, or from other features (if `--fall` also used) |
By default, only product
, gene
and exception
qualifiers read
from the GenBank feature table file will be stored in the output
.minfo
file, and then only for the feature types that will be
stored. This default set can be changed using the following seven
command line options.
For an example of using the --qadd
and --qftradd
options, see
the construction of the dengue virus RefSeq models for the VADR 1.0
model library here.
| option | explanation |
|---|---|
| `--qall` | specify that all qualifiers (except those in `<s>` from `--qskip <s>`) be added to the `.minfo` output file |
| `--qadd <s>` | add qualifiers listed in `<s>` to the default set, where `<s>` is a comma-separated string with each qualifier separated by a comma with no whitespace |
| `--qftradd <s>` | specify that the qualifiers listed in `<s2>` from `--qadd <s2>` only apply for feature types in the string `<s>`, where `<s>` is a comma-separated string with each feature type separated by a comma with no whitespace |
| `--qskip <s>` | do not store information for qualifiers listed in `<s>`, where `<s>` is a comma-separated string with each qualifier separated by a comma with no whitespace; `<s>` may contain qualifiers from the default set, or from other qualifiers (if `--qall` also used) |
| `--noaddgene` | do not automatically add `gene` qualifiers from `gene` features to any overlapping non-gene features |
| `--nosplice` | do not automatically check for GT/AG intron splice sites and add `canon_splice_sites` qualifiers for introns with valid splice sites; an intron is defined as a gap between CDS segments >= `<n>` nucleotides from the `--intlen` option, by default `<n>` is 40 |
| `--ssplice` | exit if any introns in CDS are found that do not have valid GT/AG splice sites, introns defined as explained in the `--nosplice` description above |
Besides qualifiers read from GenBank and information included in the input
.minfo
file with the --addminfo
option, two additional attributes
can be added using the command line options --group
and
--subgroup
. For an example of using these options see
construction of the Norovirus VADR 1.0 library model files here.
| option | explanation |
|---|---|
| `--group <s>` | specify that the model `group` attribute is `<s>`, e.g. `Norovirus` |
| `--subgroup <s>` | specify that the model `subgroup` attribute is `<s>`, e.g. `GI`, requires `--group` |
By default, the NCBI translation table 1 is used to translate CDS
sequences into proteins. This can be changed to use translation table
<n>
with the --ttbl
option. A non-1 translation table will be
stored in the output .minfo
file so that v-annotate.pl
is aware of
the translation table to use when analyzing CDS predictions.
| option | explanation |
|---|---|
| `--ttbl <n>` | specify that NCBI translation table `<n>` be used instead of 1 |
Reference on NCBI translation tables: https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes
Several options exist for controlling the command-line options that will be passed
to Infernal's cmbuild
program. For more information on these options and how
they control cmbuild
, see the Infernal
User's Guide manual page for cmbuild
(section 8 of http://eddylab.org/infernal/Userguide.pdf).
| option | explanation |
|---|---|
| `--cmn <n>` | set the number of seqs for glocal forward profile HMM filter calibration to `<n>` (sets the `cmbuild --EgfN` option), default is to use default `cmbuild` value |
| `--cmp7ml` | set CM's filter profile HMM as the maximum likelihood profile HMM (sets the `cmbuild --p7ml` option) |
| `--cmere <x>` | set CM relative entropy target, in bits per position, to `<x>` (sets the `cmbuild --ere <x>` option), default is to use default `cmbuild` value |
| `--cmeset <x>` | set CM effective sequence number to `<x>` (sets the `cmbuild --eset <x>` option), default is to use default `cmbuild` value |
| `--cmemaxseq <x>` | set maximum allowed effective sequence number for CM to `<x>` (sets the `cmbuild --emaxseq <x>` option) |
| `--cmnoh3pri` | do not use the `--noh3pri` option with `cmbuild`, to allow CM to use prior from HMMER3 for zero basepair models |
| `--cminfile <s>` | read `cmbuild` options from an input file `<s>`; the contents of the file (after removing newlines) will be supplied directly to `cmbuild` as an options string (possibly with more than one option separated by whitespace) |
| option | explanation |
|---|---|
| `--skipbuild` | skip the `cmbuild` step; this is mostly useful for debugging purposes, but also possibly for creating a different `.minfo` file using options like `--fadd` and/or `--qadd` for an already created model without the need to wait for the slow `cmbuild` step |
| `--onlyurl` | output the url for the GenBank feature table file (or GenBank-format file if `--gb` also used), and exit; possibly helpful if `v-build.pl` is having trouble fetching from GenBank: you can manually download the feature table file from a browser with the output url and use the downloaded file with `--inft` |
| option | explanation |
|---|---|
| `--ftrinfo <s>` | output information on the internal data structure used to keep track of features to `<s>`, mainly useful for debugging |
| `--sgminfo <s>` | output information on the internal data structure used to keep track of segments of features to `<s>`, mainly useful for debugging |
| option | explanation |
|---|---|
| `--execname <s>` | in banner and usage output, replace `v-build.pl` with `<s>` |
| `--nosig2mat` | do not treat `sig_peptide` features as `mat_peptide` in model info file |
| `--intlen <n>` | define intron as any gap >= `<n>` nucleotides between segments in a CDS, only relevant for checking for canonical splice sites, the default value for `<n>` is 40 |
Follow these steps to build a VADR model library:

1. Run `v-build.pl` multiple (`N>1`) times for different accessions. If your model is larger than 30Kb, or you know that you will use the `-s` and `--glsearch` options with `v-annotate.pl` as is recommended for coronavirus annotation (https://github.com/ncbi/vadr/wiki/Coronavirus-annotation), you should also use the `--skipbuild` option with `v-build.pl`. If you do, no CM files will be created, so you can skip steps 4 and 9 below and ignore any other steps involving a `.cm` file.

2. Create a new directory; call it `my-vadr-model-dir`, for example.

3. Concatenate all resulting `N` `.vadr.minfo` files into a single file, call it `my.vadr.minfo`, and move it to the `my-vadr-model-dir` directory.

4. Concatenate all resulting `N` `.vadr.cm` files into a single file, call it `my.vadr.cm`, for example, and move it to the `my-vadr-model-dir` directory.

5. Concatenate all resulting `N` `.vadr.hmm` files into a single file, call it `my.vadr.hmm`, for example, and move it to the `my-vadr-model-dir` directory.

6. Concatenate all resulting `N` `.vadr.fa` files into a single file, call it `my.vadr.fa`, for example, and move it to the `my-vadr-model-dir` directory, and then create a BLAST nucleotide database from it with the command:

       $VADRBLASTDIR/makeblastdb -dbtype nucl -in my-vadr-model-dir/my.vadr.fa

7. Create an Easel index `.ssi` file for the `.vadr.fa` file you created and moved in step 6 with the command:

       $VADREASELDIR/esl-sfetch --index my-vadr-model-dir/my.vadr.fa

8. Move all resulting BLAST protein DB files (`.vadr.protein.fa`, `.vadr.protein.fa.p{hr,in,sq,db,ot,tf,to}`) from all `N` runs into the `my-vadr-model-dir` directory.

9. Run `cmpress` on the `my.vadr.cm` file created in step 4 like this:

       $VADRINFERNALDIR/cmpress my-vadr-model-dir/my.vadr.cm

10. Run `hmmpress` on the `my.vadr.hmm` file created in step 5 like this:

        $VADRHMMERDIR/hmmpress my-vadr-model-dir/my.vadr.hmm
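The file-gathering steps in the list above (creating the directory, concatenating the `.minfo`, `.cm`, `.hmm` and `.fa` files, and collecting the BLAST protein DB files) can be scripted. The following is a minimal sketch, not part of VADR: the function name `assemble_vadr_library` is hypothetical, it assumes each `v-build.pl` output directory is named after its accession as in the examples above, and it assumes the file names follow the suffixes named in the steps. If your `v-build.pl` version names a file differently (the example output above lists `.vadr.protein.hmm`), adjust the suffix accordingly.

```shell
# Sketch of the file-gathering steps: collect v-build.pl output into one library directory.
# Usage: assemble_vadr_library <library-dir> <mkey> <build-dir> [<build-dir> ...]
# Assumption: each <build-dir> is named after its accession (e.g. NC_039897/).
assemble_vadr_library() (
    set -e
    libdir=$1
    mkey=$2
    shift 2
    mkdir -p "$libdir"
    # concatenate the per-accession files for each suffix into <mkey>.<suffix>
    for suffix in minfo cm hmm fa; do
        out="$libdir/$mkey.$suffix"
        rm -f "$out"
        for dir in "$@"; do
            src="$dir/$(basename "$dir").vadr.$suffix"
            # .cm files are absent if --skipbuild was used, so missing files are skipped
            if [ -f "$src" ]; then
                cat "$src" >> "$out"
            fi
        done
    done
    # collect the BLAST protein DB files (copied here rather than moved)
    for dir in "$@"; do
        cp "$dir"/*.vadr.protein.fa* "$libdir"/
    done
)

# Example, followed by the indexing commands from the list above:
# assemble_vadr_library my-vadr-model-dir my.vadr NC_001959 NC_008311
# $VADRBLASTDIR/makeblastdb -dbtype nucl -in my-vadr-model-dir/my.vadr.fa
# $VADREASELDIR/esl-sfetch --index my-vadr-model-dir/my.vadr.fa
# $VADRINFERNALDIR/cmpress my-vadr-model-dir/my.vadr.cm
# $VADRHMMERDIR/hmmpress my-vadr-model-dir/my.vadr.hmm
```

The sketch copies files instead of moving them so that the original `v-build.pl` output directories stay intact. Because missing files are skipped silently, it is worth checking that the concatenated files exist and are non-empty before running the indexing commands.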
You can then use this library with v-annotate.pl
as follows:
v-annotate.pl --mdir my-vadr-model-dir --mkey my.vadr <fasta-file> <output directory>
Substitute the full path to `my-vadr-model-dir` if it is not a
subdirectory of the current directory. Alternatively, you can supply the
path to each of the relevant files or directories, which is useful if they
are in different directories or are not consistently named, using the `-m`,
`-a`, `-i`, `-n`, and `-x` options as listed here, with a command like:
v-annotate.pl -m my-vadr-model-dir/my.vadr.cm -a my-vadr-model-dir/my.vadr.hmm -i my-vadr-model-dir/my.vadr.minfo -n my-vadr-model-dir/my.vadr.fa -x my-vadr-model-dir <fasta-file> <output directory>
If you used --skipbuild
with v-build.pl
, you will also have
to use the -s
and --glsearch
options with v-annotate.pl
.
If you ever move .cm
, .hmm
, or BLAST .fa
files into new
directories, make sure you also move the corresponding index files
(.cm.*, .hmm.*
, .fa.*
) along with them.
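One way to avoid leaving index files behind is to move a model file and everything matching `<file>.*` in one go. The helper below is a minimal sketch under that assumption; the name `move_with_indexes` is hypothetical and not part of VADR.

```shell
# Move a model file together with its index files (<file>.*) to a new directory.
# Usage: move_with_indexes <file> <destination-dir>
move_with_indexes() (
    set -e
    file=$1
    dest=$2
    mkdir -p "$dest"
    mv "$file" "$dest"/
    for idx in "$file".*; do
        # the glob stays literal when there are no index files, so test first
        if [ -e "$idx" ]; then
            mv "$idx" "$dest"/
        fi
    done
)

# Example:
# move_with_indexes my-vadr-model-dir/my.vadr.cm new-model-dir
```

Note that the BLAST protein database files are named after each accession (`.vadr.protein.fa` and its index files), so they need to be moved separately, one call per protein FASTA file.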
The VADR 1.0 library was built with version 1.0 of VADR. It has not changed since version 1.0 and is still the default model library used in this version. It is included here as an example of how to build a VADR library, but also so it can be reproduced, because it is the model library used in the paper on VADR 1.0 (https://doi.org/10.1186/s12859-020-3537-3). If you want to reproduce it exactly, you'll want to install version 1.0. The install script for v1.0 is here:
https://raw.githubusercontent.com/ncbi/vadr/vadr-1.0/vadr-install.sh
Additionally, if the RefSeq annotation for any of these 197 VADR
models has changed since October 2019, then it may not be possible to
reproduce the VADR 1.0 model library identically using the steps
outlined above. This is because v-build.pl
fetches the current
RefSeq annotation data from GenBank when it is run. If necessary,
email [email protected] for additional files
needed to reproduce the library exactly.
Also note that the library has changed since version 1.0. For example,
the default set of models included with version 1.1 has 205 total
models, not 197. To see a list of changes, see the RELEASE-NOTES.txt
file in the directory pointed to by your $VADRMODELDIR
environment
variable after installing VADR. To reproduce the construction of the
1.1 library, you would run similar steps to those below but also
adding the additional models listed in the RELEASE-NOTES.txt
file.
Additionally, as of version 1.2, the Caliciviridae models and
Flaviviridae models have been split up into two different model
sets, but both are installed by the VADR install script
vadr-install.sh
. The Caliciviridae models are used by default with
v-annotate.pl
. To use the Flaviviridae models, use the options --mkey flavi --mdir $VADRMODELDIR/vadr-models-flavi
.
The VADR 1.0 model library consists of 197 VADR models. Nine of these are Norovirus RefSeq models, listed in vadr/documentation/build-files/1p0-models/norovirus.9.list. Four of these are Dengue virus RefSeq models, listed in vadr/documentation/build-files/1p0-models/dengue.4.list. Eight of these are Hepatitis C virus RefSeq models, listed in vadr/documentation/build-files/1p0-models/hcv.8.list. The remaining 173 are additional Caliciviridae and Flaviviridae RefSeq models, listed in vadr/documentation/build-files/1p0-models/non-noro-dengue-hcv.173.list.
To build models for each of the nine norovirus RefSeqs listed in
vadr/documentation/build-files/1p0-models/norovirus.9.list,
run v-build.pl
nine separate times as follows:
v-build.pl --group Norovirus --subgroup GI NC_001959 NC_001959
v-build.pl --group Norovirus --subgroup GV NC_008311 NC_008311
v-build.pl --group Norovirus --subgroup GIII NC_029645 NC_029645
v-build.pl --group Norovirus --subgroup GII --addminfo NC_029646.addminfo NC_029646 NC_029646
v-build.pl --group Norovirus --subgroup GIV NC_029647 NC_029647
v-build.pl --group Norovirus --subgroup GI NC_031324 NC_031324
v-build.pl --group Norovirus --subgroup GII --addminfo NC_039475.addminfo NC_039475 NC_039475
v-build.pl --group Norovirus --subgroup GII NC_039476 NC_039476
v-build.pl --group Norovirus --subgroup GII NC_039477 NC_039477
(The shell script vadr/documentation/build-files/1p0-models/norovirus.9.build.sh contains these commands.)
The --group
and --subgroup
options specify the group and subgroup
values that will be added to the output .minfo
file. These will
enable v-annotate.pl
to label sequences classified to these models
with the corresponding group and subgroup, as well as to fail
sequences that users expect to be norovirus but that are actually
classified best to a non-norovirus model.
The --addminfo
options specify an input file that contains information
on additional feature attributes not from GenBank that are desired in the output
.minfo
file. Those input files are located here:
vadr/documentation/build-files/1p0-models/NC_029646.addminfo
and
vadr/documentation/build-files/1p0-models/NC_039475.addminfo.
The NC_029646.addminfo
file is in the .minfo
format and looks like this:
MODEL NC_029646
FEATURE NC_029646 type:"CDS" coords:"5085..6692:+" xmaxins_exc:"297:36"
This file specifies that the additional <key>:<value>
pair of
xmaxins_exc:"297:36"
be added to the CDS feature with coordinates
5085..6692:+
. (The VADR coordinate string format is described
here). The NC_039475.addminfo
file is
similar except with the value "295:36"
. These two additions allow
the corresponding CDS features to have an exception to the default
maximum allowed insert length by v-annotate.pl
without causing an
insertnp alert, setting it as 36 after position 297 in NC_029646
and after position 295 in NC_039475
. This change to the default was
allowed after GenBank indexers observed a common biologically valid
insertion at these positions of length 36 nucleotides (nt), which
exceeds the default maximum of 27 nt.
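If you need to write a similar file for your own model, the two-line layout shown above is simple to generate. The helper below is a minimal sketch that only reproduces that layout; the function name `make_addminfo` is hypothetical, and it assumes every attribute is written as `key:"value"`, as in the example.

```shell
# Print a two-line .minfo-format fragment that adds one attribute to one feature.
# Usage: make_addminfo <model> <feature-type> <coords> <key> <value>
make_addminfo() {
    printf 'MODEL %s\n' "$1"
    printf 'FEATURE %s type:"%s" coords:"%s" %s:"%s"\n' "$1" "$2" "$3" "$4" "$5"
}

# Example, reproducing the NC_029646.addminfo content shown above:
# make_addminfo NC_029646 CDS "5085..6692:+" xmaxins_exc "297:36" > NC_029646.addminfo
```

The resulting file would then be passed to `v-build.pl` with `--addminfo`, as in the norovirus commands above.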
The four Dengue RefSeq models are built using the --stk
option
to specify the secondary structure of structured regions of the
genome at the 5' and 3' ends, as well as some additional options.
To build models for each of these RefSeqs,
run v-build.pl
from VADR v1.0 four separate times as follows:
v-build.pl --stk NC_001477.v1.stk --qftradd stem_loop,ncRNA --qadd note,ncRNA_class --fadd stem_loop,ncRNA --group Dengue --subgroup 1 NC_001477 NC_001477
v-build.pl --stk NC_001474.v1.stk --qftradd stem_loop,ncRNA --qadd note,ncRNA_class --fadd stem_loop,ncRNA --group Dengue --subgroup 2 NC_001474 NC_001474
v-build.pl --stk NC_001475.v1.stk --qftradd stem_loop,ncRNA --qadd note,ncRNA_class --fadd stem_loop,ncRNA --group Dengue --subgroup 3 NC_001475 NC_001475
v-build.pl --stk NC_002640.v1.stk --qftradd stem_loop,ncRNA --qadd note,ncRNA_class --fadd stem_loop,ncRNA --group Dengue --subgroup 4 NC_002640 NC_002640
(The shell script vadr/documentation/build-files/1p0-models/dengue.4.build.sh contains these commands.)
The --qftradd
, --qadd
, and --fadd
options all take comma-separated strings
as arguments. For example, --qftradd
takes the argument stem_loop,ncRNA
.
These options specify that `v-build.pl` should include GenBank feature
information for `stem_loop` and `ncRNA` features in addition to its
default set of `CDS`, `gene`, and `mat_peptide` features. Further, the
`note` and `ncRNA_class` GenBank qualifiers should be included in
addition to the default set of `product`, `gene` and `exception`
qualifiers for these `stem_loop` and `ncRNA` features.
The --group
and --subgroup
options are used in a similar way to how they were used
to build the norovirus models.
Not all the annotation in the Dengue RefSeqs as of October 2019 is in
the VADR 1.0 library. To make your .minfo
files match those in the
library exactly, do the following:

- Remove the `ncRNA` features with product names `sfRNA2`, `sfRNA3`, and `sfRNA4` from the `.minfo` files of `NC_001477`, `NC_001474`, and `NC_001475`.

- Remove the `ncRNA` features with product names `sfRNA3` and `sfRNA4` from the `.minfo` file of `NC_002640`.

- Remove all occurrences of the substring `gene:"-"` (only) from `FEATURE` lines in the `.minfo` files of `NC_001477`, `NC_001474`, `NC_001475` and `NC_002640`.
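These edits can be made by hand in a text editor, or with a small filter such as the sketch below. It is not part of VADR: the function name `clean_dengue_minfo` is hypothetical, and it assumes that attributes in `FEATURE` lines are written as `key:"value"` (as in the `.addminfo` example above) and that the product name is stored under the key `product`. It also removes one space before `gene:"-"`, if present, so that no double space is left behind. Compare the result against the library's `.minfo` file before relying on it.

```shell
# Filter a .minfo file read from standard input:
#  - drop ncRNA FEATURE lines whose product is one of the names given as arguments
#  - remove the substring gene:"-" (and one preceding space, if any) from FEATURE lines
# Usage: clean_dengue_minfo sfRNA2 sfRNA3 sfRNA4 < in.minfo > out.minfo
clean_dengue_minfo() {
    awk -v names="$*" '
        BEGIN { n = split(names, drop, " ") }
        /^FEATURE / {
            if (index($0, "type:\"ncRNA\"") > 0) {
                for (i = 1; i <= n; i++) {
                    if (index($0, "product:\"" drop[i] "\"") > 0) next
                }
            }
            gsub(/ ?gene:"-"/, "")
        }
        { print }
    '
}

# Example for NC_002640, where only sfRNA3 and sfRNA4 are removed:
# clean_dengue_minfo sfRNA3 sfRNA4 < NC_002640.vadr.minfo > NC_002640.vadr.cleaned.minfo
```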
To build models for each of the eight Hepatitis C RefSeqs listed in
vadr/documentation/build-files/1p0-models/hcv.8.list,
run v-build.pl
eight separate times as follows:
v-build.pl --group HCV --subgroup 1 NC_004102 NC_004102
v-build.pl --group HCV --subgroup 1 NC_038882 NC_038882
v-build.pl --group HCV --subgroup 2 NC_009823 NC_009823
v-build.pl --group HCV --subgroup 3 NC_009824 NC_009824
v-build.pl --group HCV --subgroup 4 NC_009825 NC_009825
v-build.pl --group HCV --subgroup 5 NC_009826 NC_009826
v-build.pl --group HCV --subgroup 6 NC_009827 NC_009827
v-build.pl --group HCV --subgroup 7 NC_030791 NC_030791
(The shell script vadr/documentation/build-files/1p0-models/hcv.8.build.sh contains these commands.)
The --group
and --subgroup
options are used in a similar way to how they were used
to build the norovirus models.
To build models for the other 173 Caliciviridae and Flaviviridae models listed in vadr/documentation/build-files/1p0-models/non-noro-dengue-hcv.173.list,
simply run v-build.pl
from VADR v1.0 using default parameters for each accession.
For example:
v-build.pl NC_034444 NC_034444
(The shell script vadr/documentation/build-files/1p0-models/non-noro-dengue.hcv.173.build.sh will execute these 173 commands.)
Each of these commands takes roughly between 10 minutes and an hour.
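If you prefer to generate such commands from a list of accessions yourself, a loop like the following sketch can help. It only prints the commands, so you can inspect them before running anything. The function name `print_build_commands` is hypothetical, and the sketch assumes the list file holds one accession per line, which may not match the exact format of the `.list` files in the VADR repository.

```shell
# Print one v-build.pl command per accession in a list file, without running them.
# Usage: print_build_commands <list-file> [v-build.pl options ...]
# Assumption: the list file holds one accession per line; blank lines are skipped.
print_build_commands() (
    list=$1
    shift
    while read -r acc rest || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue
        if [ $# -gt 0 ]; then
            echo "v-build.pl $* $acc $acc"
        else
            echo "v-build.pl $acc $acc"
        fi
    done < "$list"
)

# Example: print the commands, review them, then pipe them to sh to run them.
# print_build_commands non-noro-dengue-hcv.173.list
# print_build_commands non-noro-dengue-hcv.173.list | sh
```

As in the examples above, each accession is used both as the model name and as the name of the output directory.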
After completing the steps above to make the 197 models, you can make the VADR 1.0 library by following the instructions for creating a VADR library starting at step 2 here.