SVIM implements a pipeline of four consecutive components:
- COLLECT collects SV signatures from each individual read in the input BAM file
- CLUSTER clusters SV signatures coming from the same SV using a hierarchical clustering approach and a novel distance metric
- COMBINE combines clusters from different genomic regions and classifies them into six SV classes
- GENOTYPE uses read alignments in the genomic neighborhood to determine the genotype of SV candidates
![SVIM workflow](https://github.com/eldariont/svim/raw/a9b0985ef847f81a837ee851e75d1e84c744349f/docs/SVIM_Workflow.png)
SVIM can process two types of input. First, it can detect SVs from existing read alignments in SAM/BAM format. Alternatively, it can detect SVs from raw reads by first aligning them to a given reference genome.
SVIM accepts read alignments in indexed BAM format. The read alignments need to be sorted by coordinate, e.g. with `samtools sort`. For genotyping, a BAM index (`*.bam.bai`) is required, which can be generated with `samtools index`.
To start with a BAM file:
```
svim alignment my_sample my_alignments.bam my_genome.fa
svim alignment -h # display all parameters
```
SVIM also accepts uncompressed and gzipped FASTA or FASTQ files or a file list as input. The input file must have one of the following supported file endings:
- FASTA: .fa, .fasta, .FA, .fa.gz, .fa.gzip, .fasta.gz, .fasta.gzip
- FASTQ: .fq, .fastq, .FQ, .fq.gz, .fq.gzip, .fastq.gz, .fastq.gzip
- file list: .fa.fn, .fq.fn
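For illustration, such an ending check could be sketched as follows (a hypothetical helper for this wiki, not SVIM's actual code):

```python
# Illustrative sketch: classify an input file name by the endings listed above.
# This mirrors the documented endings only; it is not SVIM's implementation.
FASTA_ENDINGS = (".fa", ".fasta", ".FA", ".fa.gz", ".fa.gzip", ".fasta.gz", ".fasta.gzip")
FASTQ_ENDINGS = (".fq", ".fastq", ".FQ", ".fq.gz", ".fq.gzip", ".fastq.gz", ".fastq.gzip")
LIST_ENDINGS = (".fa.fn", ".fq.fn")

def input_type(path):
    """Return 'fasta', 'fastq', 'list' or None for an input file name."""
    if path.endswith(LIST_ENDINGS):
        return "list"
    if path.endswith(FASTA_ENDINGS):
        return "fasta"
    if path.endswith(FASTQ_ENDINGS):
        return "fastq"
    return None
```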
To start with reads in a FASTA file:
```
svim reads my_sample my_reads.fa my_genome.fa                    # align PacBio reads with ngmlr, then call SVs
svim reads --nanopore my_sample my_reads.fa my_genome.fa         # align Nanopore reads with ngmlr, then call SVs
svim reads --aligner minimap2 my_sample my_reads.fa my_genome.fa # align PacBio reads with minimap2, then call SVs
svim reads --min_mapq 30 my_sample my_reads.fa my_genome.fa      # use only high-quality alignments (minimum mapping quality of 30)
svim reads -h                                                    # display all parameters
```
To align the reads to the reference genome, SVIM needs the following tools to be installed and in the PATH:
- gunzip (for gzipped files)
- ngmlr
- minimap2
- samtools (for sorting and indexing the BAM output)
SVIM checks whether these tools are accessible before starting the alignment. If you don't have the tools installed, you can still pass a BAM file as input to SVIM (see above).
A complete list of all command line parameters including a description can be obtained by running `svim alignment --help` (SAM/BAM input) or `svim reads --help` (read input). A (hopefully) up-to-date list can also be found here.
SVIM outputs all its results into the given working directory (`my_sample` in our examples). The two main output files are:
- Final SV calls in VCF format, where each line represents one variant. The variant coordinates are contained in the CHROM, POS and INFO:END fields. The sequence changes are contained in the REF and ALT fields. The variant type is contained in the INFO:SVTYPE field; the six possible types are `DEL`, `INS`, `INV`, `DUP:TANDEM`, `DUP:INT` and `BND` (translocations). The number of supporting reads is contained in the INFO:SUPPORT field. **Important:** SVIM produces an unfiltered output of all calls! See Section 4 for details. The QUAL field contains the score of each call on a range between 0 and 100 (higher meaning better). Note that the scores are not phred-scaled (in contrast to what is defined in the VCF specification).
- Log file with runtime information. It contains the SVIM version and parameters as well as the command line output with precise time stamps.
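As an illustration, the VCF fields described above can be extracted like this (a minimal sketch for this wiki; the example record and helper name are made up, and real VCF files are better handled with a dedicated library such as pysam):

```python
# Minimal sketch: pull the fields described above out of a single VCF record.
# The example line is invented for illustration, not real SVIM output.
def parse_sv_record(line):
    """Extract coordinates, type, support and score from one VCF line."""
    chrom, pos, _id, _ref, _alt, qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    # INFO is a semicolon-separated list of KEY=VALUE pairs (plus flag keys).
    info_dict = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    return {
        "chrom": chrom,
        "pos": int(pos),
        "end": int(info_dict.get("END", pos)),
        "svtype": info_dict.get("SVTYPE"),
        "support": int(info_dict.get("SUPPORT", 0)),
        "qual": float(qual),
    }

record = parse_sv_record("chr1\t1000\tsvim.DEL.1\tN\t<DEL>\t42\tPASS\tSVTYPE=DEL;END=1500;SUPPORT=7")
```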
Besides the two output files described above, there are additional output files for debugging purposes:
- The SV calls in BED format (for debugging): `candidates/candidates_*.bed`
- Intermediate signature clusters in BED format (for debugging): `signatures/*.bed`
Final SV calls in BED format to view in any text editor or IGV. For deletions (`candidates_deletions.bed`), inversions (`candidates_inversions.bed`) and novel element insertions (`candidates_novel_insertions.bed`), one BED file each is written. For tandem and interspersed duplications, two files are written: one for the genomic origin (`candidate_tan_duplication_source.bed`, `candidate_int_duplication_source.bed`) and one for the genomic destination region (`candidate_tan_duplication_dest.bed`, `candidate_int_duplication_dest.bed`). Each file has the same 7 columns:
- Chromosome
- Start coordinate
- End coordinate
- SV type, standard deviation of position across the cluster, standard deviation of span across the cluster
- Score
- [Interspersed duplications only] whether evidence for a deletion of the origin region was found
- List of signatures in the cluster making up this candidate call (each consisting of chr, start, end, sv_type; signature_type, read)
Signature clusters (output of the CLUSTER component) in BED format to view in any text editor or IGV.
Unlike many other SV callers, SVIM does not filter its output but writes out all SV calls and their respective scores. This means that even low-scoring calls supported by only a single read are contained in the output. In fact, the majority of calls output by SVIM are often supported by only one or two reads. These very low-scoring calls are almost always caused by sequencing or alignment errors. It is therefore highly recommended to filter the variant calls that SVIM produces based on these scores.
Why does SVIM output all calls, even low-quality ones? We are convinced that it is more user-friendly to output all variant calls and leave the filtering to the user as a post-processing step. If the filtering were done during runtime (e.g. controlled by a score cutoff parameter to SVIM), changing the cutoff would require running SVIM again, which takes considerably more time and effort than filtering an output file that already contains all calls.
How is the score defined that each variant call receives? SVIM outputs a score between 0 and 100 for each variant call (a higher score meaning that the call is more trustworthy). You can find the scores in the QUAL column of the VCF output. Since the publication of our manuscript, we changed the score formula so that it is now mainly based on the number of supporting reads. Additionally, the agreement between the supporting reads concerning the span and position of the SV is taken into account.
The score is computed based on three features:
- The number `n` of signatures in the cluster (i.e. the number of supporting reads)
- A score `s_p` based on the standard deviation `s_pos` of the genomic positions of the signatures in the cluster, normalized by their average span: `s_p = 1 - min(1, s_pos / span)`
- A score `s_s` based on the standard deviation `s_span` of the genomic spans of the signatures in the cluster, normalized by their average span: `s_s = 1 - min(1, s_span / span)`

The three features are combined into a total score `S` with the following formula:

`S = min(80, n) * (1 + (s_p / 8) + (s_s / 8))`
The score formula puts the main emphasis on the number `n` of signatures in the cluster but takes at most 80 signatures into account. Clusters with very consistent genomic positions and spans, and consequently high standard deviation scores, can earn a score bonus of up to 25% (`(1/8) + (1/8) = (1/4)`) relative to the number `n` of signatures in the cluster. The maximum score of 100 is reached for clusters with `n >= 80` and `s_p = s_s = 1`. Thus, we obtain a score to discern trustworthy signature clusters from artifacts, such as sequencing or alignment artifacts. Trustworthy events are characterized by many supporting signatures that exhibit a high concordance of their genomic positions and spans.
How do I filter the SV calls after running SVIM? The filtering of variant calls is always a trade-off between precision and recall. Filtering more strictly usually increases precision and removes false positives but also decreases recall/sensitivity. Applying more lenient filtering has the opposite effect.
Generally, filtering the variant calls produced by SVIM is as easy as removing all calls with a score below a certain threshold, e.g. keeping only calls with score >= 10: `bcftools view -i 'QUAL >= 10' variants.vcf`.
Because the score depends on the number of supporting reads, its distribution varies with the sequencing coverage in the input. Therefore, we cannot make a general statement about suitable score cutoffs. For high-coverage datasets (>40x), we would recommend a threshold of 10-15. For low-coverage datasets, the threshold should be lower.
A good approach could also be to select a cutoff that returns the expected or desired number of calls. A recent study by the Human Genome Structural Variation Consortium (HGSVC), for instance, estimated the average number of deletions and insertions per individual to be 9,488 and 15,337, respectively (Chaisson et al., 2019, https://www.nature.com/articles/s41467-018-08148-z).
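That selection can be sketched as follows (an illustrative helper, assuming the QUAL scores have already been extracted from the VCF, e.g. with `bcftools query -f '%QUAL\n' variants.vcf`):

```python
# Illustrative sketch: given the QUAL scores of all calls, pick the highest
# integer cutoff (0-100) that still keeps at least `target` calls.
def cutoff_for_target(scores, target):
    for threshold in range(100, -1, -1):
        if sum(s >= threshold for s in scores) >= target:
            return threshold
    return 0  # even a cutoff of 0 keeps fewer than `target` calls
```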