-
Notifications
You must be signed in to change notification settings - Fork 6
Home
The recommended method is to use a pre-built Docker or Singularity container.
Both the Docker and Singularity container have the main script
viridian
installed.
Get a Docker image of the latest release:
docker pull ghcr.io/iqbal-lab-org/viridian_workflow:latest
All Docker images are listed in the packages page.
To build a docker container, clone this repository and then from its root run:
docker build --network=host .
(without --network=host
you will likely get pip install
timing out and
the build failing).
Releases
include a Singularity image to download.
Each release has a singularity image file called
viridian_workflow_vX.Y.Z.img
, where X.Y.Z
is the release version,
which is built for x86 architecture. From release 1.2.0, there is also
an ARM container, called viridian_workflow_vX.Y.Z.arm.img
.
To build a singularity container, clone this repository and then from its root run:
singularity build viridian_workflow.img Singularity.def
The examples below will run the default pipeline, using the built-in SARS-CoV-2 amplicon schemes (at the time of writing) ampliseq, ARTIC V3-5, and Midnight-1200. The pipeline automatically detects the scheme that best matches the input reads. To use your own amplicon scheme and/or force the choice of scheme, please read the amplicon schemes page.
Viridian needs one of the following input sequencing data:
- Illumina reads, can be paired or unpaired
- Oxford Nanopore reads
- Ion Torrent reads, can be paired on unpaired
To run on paired Illumina reads:
viridian run_one_sample \
--tech illumina \
--reads1 reads_1.fastq.gz \
--reads2 reads_2.fastq.gz \
--outdir OUT
To run on unpaired nanopore reads:
viridian run_one_sample \
--tech ont \
--reads reads.fastq.gz \
--outdir OUT
The allowed values of --tech
are illumina
, iontorrent
, ont
.
Nanopore reads must be unpaired (ie use the --reads
option).
Illumina and iontorrent reads can be paired or unpaired.
The option --outdir
is required - Viridian will make a new
output directory with the give value. In the above examples, this
is called OUT
. Some combination of reads options is
required, as explained in detail below.
To run on paired or unpaired Ion Torrent reads, use either of the
above commands, but with the option --tech iontorrent
.
Instead of providing your own reads files, Viridian can
download reads from the ENA and then run Viridian on them.
In this mode, the
only required options are --run_accession
and
--outdir
. The sequencing tech and unpaired/paired
is determined from the metadata in the ENA.
Download reads with INSDC run accession SRR12345678 and run:
viridian run_one_sample --run_accession SRR12345678 --outdir OUT
Alternatively, download and run on a batch of sequencing runs,
using a file runs.txt
of INSDC run accessions:
viridian dl_and_run --acc_file runs.txt --outdir OUT
The default files in the output directory are:
-
consensus.fa.gz
: a gzipped FASTA file of the final masked consensus sequence. -
variants.vcf
: a VCF file of the identified variants between the consensus sequence and the reference genome. -
log.json.gz
: a gzipped JSON file of logging information for the viridian workflow run. This is described in detail in the JSON output file page. -
qc.tsv.gz
: a gzipped tab-delimited file containing per-base QC information about the consensus sequence. This is described in detail in the QC TSV file page. -
scheme_id.depth_across_genome.pdf
: a plot of the read depth across the genome, with amplicons coloured in the background. See the Plots page for an example. -
scheme_id.score_plot.pdf
: a plot of the scoring for amplicon scheme identification. See the Plots page for an example.
Other output file options:
-
--keep_bam
is used, then a sorted BAM file of the reads mapped to the reference will also be present, calledreference_mapped.bam
(and its index filereference_mapped.bam.bai
). -
--write_msa MSA_TYPE
: this is useful for building trees (for example, input to usher). All MSA variations are always in the filelog.json
, but this option lets you write some of them to FASTA files. It can be used more than once. Each sequence is written to separate FASTA file. Choose from:indel_as_N
and/orindel_as_ref
. Both options output the consensus sequence, but any insertions with respect to the reference are removed so that the consensus sequence length is the same as the reference length. Deletions (ie gaps) in the consensus can be replaced withN
usingindel_as_N
, or with the reference sequence usingindel_as_ref
.
The reads can be in FASTA, FASTQ or BAM format. FASTA/Q files can optionally be gzipped. Use the following options depending on the format:
- Unpaired FASTA/FASTQ:
--reads my_reads.fastq
(or--reads my_reads.fastq.gz
) - Paired FASTA/FASTQ:
--reads1 reads_1.fastq --reads2 reads_.fastq
(or--reads1 reads_1.fastq.gz --reads2 reads_.fastq.gz
) - Paired or unpaired in a sorted by coordinate
and indexed BAM file:
--reads_bam mapped.bam
. "Indexed" means that the filemapped.bam.bai
must exist (made withsamtools index mapped.bam
).
The sequencing technology of the reads must be specified
with the required option --tech
, with allowed values
illumina
, iontorrent
, or ont
. Nanopore (ont
) reads
must be unpaired. Illumuna and Ion Torrent reads can be
paired or unpaired.
Alternatively you can provide a run accession, of the form
ERR12345678, using the option --run_accession ERR12345678
.
The reads will be downloaded and assembled,
as long as they are from a supported sequencing technology.
If you use this option, then the --reads*
and --tech
options
are not required. There is also the option
--keep_downloaded_reads
, which will keep reads downloaded from the ENA.
By default they are deleted at the end of the Viridian run.
-
--sample_name MY_NAME
: use this to change the sample name (default is "sample") that is put in the final FASTA file, BAM file, and VCF file. -
--keep_bam
: use this option to keep the BAM file of original input reads mapped to the reference genome. -
--force
: use with caution - it will overwrite the output directory if it already exists. -
--tmp_dir DIRNAME
: directory in which to put a temporary directory for intermediate files that are not kept. The dir given by --tmp_dir must already exist. Default is platform dependent, and uses python's tempdir default locations (eg for linux, looks for /tmp first). If you use the --debug option, then the --tmp_dir option is ignored and a directory calledProcessing
is used in the output directory. -
--no_gzip
: use this to not gzip the final output FASTA file and JSON log file -
--debug
: produces very verbose logging output, and does not clean up intermediate files. Also affects temporary files directory (see--tmp_dir
).
The reads can be decontaminated using readItAndKeep
using
the flag --decontam COVID
, which uses the SARS-CoV-2 reference
genome with the polyA tail removed. This option is
incompatible with --reads_bam
. The reads are decontaminated up front,
before the usual start of the pipeline that maps reads to the
reference genome.
The basic options related to amplicon schemes are listed below. If you want to use your own scheme, please see the detailed amplicon schemes page
-
--detect_scheme_only
: map the reads and detect the amplicon scheme, then stop the pipeline instead of making a consensus etc -
--min_scheme_score INT
: minimum score required for a matching scheme. If all schemes are less than this cutoff, the pipeline is stopped unless--force_amp_scheme
is used. Default is 250. -
--max_scheme_ratio FLOAT
: maximum allowed value of (second best scheme score) / (best scheme score). Default is 0.5.
We do not recommend using these options unless you know what you are doing.
Initial read depth QC options:
-
--coverage_min_x X
,coverage_min_pc Y
: after initial read mapping to the reference, if less thanY
percent of the genome has at leastX
read coverage, then the pipeline is stopped. Default is to require at least 50 percent of the genome with at least 20X read depth
Assembly option:
-
--assemble_depth INT
: Target read depth for each amplicon when assembling. Default is 100. This means reads are sampled to 100X within each amplicon. If an amplicon has mean depth less than 100, then all reads are kept for that amplicon.
After assembly, reads are mapped to the consensus sequence to perform more QC and mask low quality positions. The relevant options are:
-
--qc_depth INT
: target coverage for each amplicon, when running QC and masking. Default is 1000. Similarly to the--assemble_depth
option, the default of 1000 means reads for each amplicon are randomly sampled to 1000X, or if the mean depth is less than 1000 then all reads are kept. -
--masking_min_depth INT
: consensus positions with total reads supporting the consensus call less than this number are masked. Default is 20. -
--masking_min_frs FLOAT
. Must be a value between 0 and 1. Consensus positions with a fraction of read support (FRS) less than the given number are masked. Default is 0.5, meaning if less than half of the mapped reads agree with the consensus, then the position is masked with anN
. -
--het_min_pc FLOAT
: cutoff for when to use ambiguous IUPAC codes. With--het_min_pc X
, if two or more alleles each have support of more thanX
percent of reads, then an ambiguous IUPAC code is used and the position is given the 'HET' filter. Default is 20.0.
Other QC options:
-
--max_percent_amps_fail FLOAT
: maximum percent of amplicons allowed to fail during read sampling or making consensus for each amplicon. The pipeline is stopped as soon as too many failed amplicons are detected. Default is 50.0. -
--max_cons_n_percent FLOAT
: maximum allowed percentage of Ns in the consensus sequence. Pipeline is stopped as soon as too many Ns in current consensus. Default is 50.0.