SVIM implements a pipeline of four consecutive components:
- COLLECT collects SV signatures from each individual read in the input BAM file
- CLUSTER clusters SV signatures coming from the same SV using a hierarchical clustering approach and a novel distance metric
- COMBINE combines clusters from different genomic regions and classifies them into six SV classes
- GENOTYPE uses read alignments in the genomic neighborhood to determine the genotype of SV candidates
![SVIM workflow](https://github.com/eldariont/svim/raw/a9b0985ef847f81a837ee851e75d1e84c744349f/docs/SVIM_Workflow.png)
SVIM can process two types of input. First, it can detect SVs from existing read alignments in SAM/BAM format. Alternatively, it can detect SVs from raw reads by first aligning them to a given reference genome.
SVIM accepts read alignments in indexed BAM format. The read alignments need to be sorted by coordinate, e.g. with `samtools sort`. For genotyping, a BAM index (`*.bam.bai`) is required, which can be generated with `samtools index`.
To start with a BAM file:
```
svim alignment my_sample my_alignments.bam my_genome.fa
svim alignment -h # display all parameters
```
SVIM also accepts uncompressed and gzipped FASTA or FASTQ files or a file list as input. The input file must have one of the following supported file endings:
- FASTA: .fa, .fasta, .FA, .fa.gz, .fa.gzip, .fasta.gz, .fasta.gzip
- FASTQ: .fq, .fastq, .FQ, .fq.gz, .fq.gzip, .fastq.gz, .fastq.gzip
- file list: .fa.fn, .fq.fn
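For illustration, such an ending check could be sketched as follows (a hypothetical helper for this wiki, not SVIM's actual code):

```python
# Illustrative sketch: classify an input file name by the endings listed above.
# This mirrors the documented endings only; it is not SVIM's implementation.
FASTA_ENDINGS = (".fa", ".fasta", ".FA", ".fa.gz", ".fa.gzip", ".fasta.gz", ".fasta.gzip")
FASTQ_ENDINGS = (".fq", ".fastq", ".FQ", ".fq.gz", ".fq.gzip", ".fastq.gz", ".fastq.gzip")
LIST_ENDINGS = (".fa.fn", ".fq.fn")

def input_type(path):
    """Return 'fasta', 'fastq', 'list' or None for an input file name."""
    if path.endswith(LIST_ENDINGS):
        return "list"
    if path.endswith(FASTA_ENDINGS):
        return "fasta"
    if path.endswith(FASTQ_ENDINGS):
        return "fastq"
    return None
```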
To start with reads in a FASTA file:
```
svim reads my_sample my_reads.fa my_genome.fa                    # align PacBio reads with ngmlr, then call SVs
svim reads --nanopore my_sample my_reads.fa my_genome.fa         # align Nanopore reads with ngmlr, then call SVs
svim reads --aligner minimap2 my_sample my_reads.fa my_genome.fa # align PacBio reads with minimap2, then call SVs
svim reads --min_mapq 30 my_sample my_reads.fa my_genome.fa      # use only high-quality alignments (minimum mapping quality of 30)
svim reads -h                                                    # display all parameters
```
To align the reads to the reference genome, SVIM needs the following tools to be installed and in the PATH:
- gunzip (for gzipped files)
- ngmlr
- minimap2
- samtools (for sorting and indexing the BAM output)
SVIM checks whether these tools are accessible before starting the alignment. If you don't have the tools installed, you can still pass a BAM file as input to SVIM (see above).
A complete list of all command line parameters including a description can be obtained by running `svim alignment --help` (SAM/BAM input) or `svim reads --help` (read input). A (hopefully) up-to-date list can also be found here.
SVIM outputs all its results into the given working directory (`my_sample` in our examples). The two main output files are:
- Final SV calls in VCF format, where each line represents one variant. The variant coordinates are contained in the CHROM, POS and INFO:END fields. The sequence changes are contained in the REF and ALT fields. The variant type is contained in the INFO:SVTYPE field; the six possible types are `DEL`, `INS`, `INV`, `DUP:TANDEM`, `DUP:INT` and `BND` (translocations). The number of supporting reads is contained in the INFO:SUPPORT field. **Important:** SVIM produces an unfiltered output of all calls! See Section 4 for details. The QUAL field contains the score of each call on a range between 0 and 100 (higher meaning better). Note that the scores are not phred-scaled (in contrast to what is defined in the VCF specification).
- Log file with runtime information. It contains the SVIM version and parameters as well as the command line output with precise time stamps.
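As an illustration, the VCF fields described above can be extracted like this (a minimal sketch for this wiki; the example record and helper name are made up, and real VCF files are better handled with a dedicated library such as pysam):

```python
# Minimal sketch: pull the fields described above out of a single VCF record.
# The example line is invented for illustration, not real SVIM output.
def parse_sv_record(line):
    """Extract coordinates, type, support and score from one VCF line."""
    chrom, pos, _id, _ref, _alt, qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    # INFO is a semicolon-separated list of KEY=VALUE pairs (plus flag keys).
    info_dict = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    return {
        "chrom": chrom,
        "pos": int(pos),
        "end": int(info_dict.get("END", pos)),
        "svtype": info_dict.get("SVTYPE"),
        "support": int(info_dict.get("SUPPORT", 0)),
        "qual": float(qual),
    }

record = parse_sv_record("chr1\t1000\tsvim.DEL.1\tN\t<DEL>\t42\tPASS\tSVTYPE=DEL;END=1500;SUPPORT=7")
```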
Besides the two output files described above, there are additional output files for debugging purposes:
- The SV calls in BED format (for debugging): `candidates/candidates_*.bed`
- Intermediate signature clusters in BED format (for debugging): `signatures/*.bed`
Final SV calls in BED format to view in any text editor or IGV. For deletions (`candidates_deletions.bed`), inversions (`candidates_inversions.bed`) and novel element insertions (`candidates_novel_insertions.bed`), one BED file each is written. For tandem and interspersed duplications, two files are written: one for the genomic origin (`candidate_tan_duplication_source.bed`, `candidate_int_duplication_source.bed`) and one for the genomic destination region (`candidate_tan_duplication_dest.bed`, `candidate_int_duplication_dest.bed`). Each file has the same 7 columns:
- Chromosome
- Start coordinate
- End coordinate
- SV type, standard deviation of position across the cluster, standard deviation of span across the cluster
- Score
- [Interspersed duplications only] whether evidence for a deletion of the origin region was found
- List of signatures in the cluster making up this candidate call (each consisting of chr, start, end, sv_type; signature_type, read)
Signature clusters (output of the CLUSTER component) in BED format to view in any text editor or IGV.
Unlike many other SV callers, SVIM does not filter its output but writes out all SV calls and their respective scores. This means that even low-scoring calls supported by only a single read are contained in the output. In fact, the majority of calls output by SVIM are often supported by only one or two reads. These very low-scoring calls are almost always caused by sequencing or alignment errors. It is therefore highly recommended to filter the variant calls that SVIM produces based on these scores.
Why does SVIM output all calls, even low-quality ones? We are convinced that it is more user-friendly to output all variant calls and leave the filtering to the user as a post-processing step. If the filtering were done during runtime (e.g. controlled by a score cutoff parameter to SVIM), changing the cutoff would require running SVIM again, which takes considerably more time and effort than filtering an output file that already contains all calls.
How is the score defined that each variant call receives? SVIM outputs a score between 0 and 100 for each variant call (a higher score meaning that the call is more trustworthy). You can find the scores in the QUAL column of the VCF output. Since the publication of our manuscript, we changed the score formula so that it is now mainly based on the number of supporting reads. Additionally, the agreement between the supporting reads concerning the span and position of the SV is taken into account.
The score is computed based on three features:
- The number `n` of signatures in the cluster (i.e. the number of supporting reads)
- A score `s_p` based on the standard deviation `s_pos` of the genomic positions of the signatures in the cluster, normalized by their average span: `s_p = 1 - min(1, s_pos / span)`
- A score `s_s` based on the standard deviation `s_span` of the genomic spans of the signatures in the cluster, normalized by their average span: `s_s = 1 - min(1, s_span / span)`

The three features are combined into a total score `S` with the following formula:

`S = min(80, n) * (1 + (s_p / 8) + (s_s / 8))`
The score formula puts the main emphasis on the number `n` of signatures in the cluster but takes at most 80 signatures into account. Clusters with very consistent genomic positions and spans, and consequently high standard deviation scores, can earn a score bonus of up to 25% (`(1/8) + (1/8) = (1/4)`) relative to the number `n` of signatures in the cluster. The maximum score of 100 is reached for clusters with `n >= 80` and `s_p = s_s = 1`. Thus, we obtain a score to discern trustworthy signature clusters from artifacts, such as sequencing or alignment artifacts. Trustworthy events are characterized by many supporting signatures that exhibit a high concordance of their genomic positions and spans.
How do I filter the SV calls after running SVIM? The filtering of variant calls is always a trade-off between precision and recall. Filtering more strictly usually increases precision and removes false positives but also decreases recall/sensitivity. Applying more lenient filtering has the opposite effect.
Generally, filtering the variant calls produced by SVIM is as easy as removing all calls with a score below a certain threshold, e.g. keeping only calls with score >= 10: `bcftools view -i 'QUAL >= 10' variants.vcf`.
Because the score depends on the number of supporting reads, its distribution varies with the sequencing coverage in the input. Therefore, we cannot make a general statement about suitable score cutoffs. For high-coverage datasets (>40x), we would recommend a threshold of 10-15. For low-coverage datasets, the threshold should be lower.
A good approach could also be to select a cutoff that returns the expected or desired number of calls. A recent study by the Human Genome Structural Variation Consortium (HGSVC), for instance, estimated the average number of deletions and insertions per individual to be 9,488 and 15,337, respectively (Chaisson et al., 2019, https://www.nature.com/articles/s41467-018-08148-z).
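That selection can be sketched as follows (an illustrative helper, assuming the QUAL scores have already been extracted from the VCF, e.g. with `bcftools query -f '%QUAL\n' variants.vcf`):

```python
# Illustrative sketch: given the QUAL scores of all calls, pick the highest
# integer cutoff (0-100) that still keeps at least `target` calls.
def cutoff_for_target(scores, target):
    for threshold in range(100, -1, -1):
        if sum(s >= threshold for s in scores) >= target:
            return threshold
    return 0  # even a cutoff of 0 keeps fewer than `target` calls
```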