-
Notifications
You must be signed in to change notification settings - Fork 46
Home
SURVIVOR incorporates a set of tools to:
- Simulate SVs and evaluate existing callers
- Merge and compare SVs within a sample and among populations/samples.
- Convert different formats to vcf files
- Summarize the results within vcf files or results from SURVIVOR.
The methods are independent of the sequencing technology or variant caller.
wget https://github.com/fritzsedlazeck/SURVIVOR/archive/master.tar.gz -O SURVIVOR.tar.gz tar xzvf SURVIVOR.tar.gz cd SURVIVOR-master/Debug/ make ./SURVIVOR
Here we just list the three top methods implemented in SURVIVOR. See Methods & Parmater for a list of all methods and parameter.
The simulation of SVs are explained in more detailed here: Methods & Parmater
Frist, we generate a parameter file:
./SURVIVOR simSV parameter_file
You can alter the parameter file like any other text file. Note the placement of each line is critical so don't add any lines. Just alter the values to what you need.
Second, we simulate the SVs on a given reference sequence (e.g. ref.fa), which has to be a fasta file.
./SURVIVOR simSV ref.fa parameter_file 0.1 0 simulated
Note you can choose if we want to simulate the reads based on the generated sequence and map these to the original reference genome (option 0) or if we want to use e.g. real reads and map these to the so simulated genome (option 1). The modified fasta file will be named: simulated.fasta. Furthermore, we report the simulated events (simulated.bed) and the insertion sequences (simulated.insertions.fa).
We have been using e.g. Mason as a read simulator but you can use any simulator that you want.
To evaluate the call set:
./SURVIVOR eval caller.vcf simulated.bed 10 eval_res
This will evaluate the results of caller.vcf with our previous simulated data set allowing a max. 10bp wobble of the predicted breakpoints. The results are stored in eval_res*.
In the end of the program a line is printed:
Overall: 20 11/0/0/0/9 0/0/0/0/0 0/0/0/0/0 1 0
This shows that 20 SVs were simulated. Followed by the true positive (simulated and found) SVs, the false negative (simulated not found) and the false positive (not simulated but found). Each of which is reported by type (Del/DUP/INV/TRA/INS). The second last number is the sensitivity and the last number is the FDR rate. In this particular case, we simulated 20 SVs from which 11 deletions and 9 insertions were found again and no SV were missed or additionally reported.
Note SURVIVOR also incorporates modlues to simulate reads (see option 2).
To reduce the false positive rate from some caller while remaining a high sensitivity we suggest to use multiple caller for one sample. For example, we often run Manta, Delly and Lumpy on our short read data. All these methods provide a vcf file as a result.
SURVIVOR can be used to compare and obtain a consensus call set from these vcf files. Note in general SURVIVOR can incorporate any technology or other callers as long as the SVs are represented as a vcf file.
First, we collect all files that we want to merge in a file. You might want to consider sorting the vcf files prior to merging.
ls *vcf > sample_files
Next, we use SURVIVOR to obtain a call set.
./SURVIVOR merge sample_files 1000 2 1 1 0 30 sample_merged.vcf
This will merge all the vcf files specified in sample_files together using a maximum allowed distance of 1kb, as measured pairwise between breakpoints (begin1 vs begin2, end1 vs end2). Furthermore we ask SURVIVOR only to report calls supported by 2 callers and they have to agree on the type (1) and on the strand (1) of the SV. Note you can change this behavior by altering the numbers from 1 to e.g. 0. In addition, we told SURVIVOR to only compare SV that are at least 30bp long and print the output in sample_merged.vcf.
For simply merging different SV set one can set the minimum required support by caller to 1 instead of 2 and retrieve a union set across all samples.
After the merge, you can use the merged vcf file to obtain a graphical representation of the overlaps.
If you have many VCF files that you want to compare: The first step is to obtain a pairwise comparison matrix like this:
./SURVIVOR genComp sample_merged.vcf sample_merged.mat.txt
Subsequently you can read this file into R and plot it with a heatmap function. For example:
t=read.table("sample_merged.mat.txt",header=F)
dst <- data.matrix(t(t))
image(1:nrow(dst), 1:ncol(dst),log(dst), col=rev(heat.colors(10)),axes=F, xlab="", ylab="")
grid(col='black',nx=nrow(dst),ny=ncol(dst),lty=1)
You will need to modify the code with labels etc.
If you just have a less than 5 VCF files:
Extract the overlap information like this:
perl -ne 'print "$1\n" if /SUPP_VEC=([^,;]+)/' sample_merged.vcf | sed -e 's/\(.\)/\1 /g' > sample_merged_overlapp.txt
This will extract the string from the support vector representing 0 or 1 depending on if a sample/ input VCF file supports an SV or not. The sed command will add a space between every character, which is needed for R.
Next, I am using the R package called VennDiagram. You need to install/download it for your R instance.
library(VennDiagram)
venn.diagram(list(Sample1=which(t[,1]==1), His=which(t[,2]==1), AA=which(t[,3]==1)) , fill = c("gray", "orange" ,"blue") , alpha = c(0.5, 0.5, 0.5), cex = 2, lty =2, filename = "my_sample_overlapp.tiff");
You need to adjust this R code if you have more than 3 samples by extending the list, fill and alpha variable.
In our experiments, we often want to filter for tricky to align regions. This is often done by requiring a higher MQ (e.g 20) to filter reads. This will filter out less reliable reads, but there are reads mapping to the same region with a mapping quality higher than e.g. 20. This can result in artificial SV call. We developed this following approach to be more conservative.
First, we extract the low mapping quality reads from our e.g. sorted bam file:
samtools view -H our.sort.bam > lowMQ.sam samtools view our.sort.bam | awk '$5<5 {print $0}' >> lowMQ.sam samtools view -S -b -h lowMQ.sam > lowMQ.bam
Now we should have a sorted bam file of all the reads that have an MQ <5 (second line awk statement). Next we compute the bp coverage:
samtools depth lowMQ.bam > lowMQ.cov
Next we use SURVIVOR to cluster the coverage track into a bed file for filtering:
./SURVIVOR bincov lowMQ.cov 10 2 > lowMQ.bed
Here we allow to cluster regions together that are maximal 10bp away and only consider regions if their coverage is larger then 2. The resulting bed file can be used to filter SV or read file using e.g. bedtools.