-
Notifications
You must be signed in to change notification settings - Fork 7
SMuRF v1.6.4 vignette
Disclaimer: This readme version is for SMuRF v1.6.4
by Huang Weitai
20th Jan 2020
SMuRF R package predicts a consensus set of somatic mutation calls using RandomForest machine learning. SMuRF generates a set of point mutations and insertions/deletions (indels) trained on the latest community-curated tumor whole genome sequencing data (Alioto et. al., 2015, Nat. Comms.). Our method is fast and accurate and analyses both whole-genome and whole-exome sequencing data from different cancer types.
For more information see our Bioinformatics paper: https://doi.org/10.1093/bioinformatics/btz018
Citation Huang, W., et al., SMuRF: Portable and accurate ensemble prediction of somatic mutations. Bioinformatics, 2019: p. 3157-3159
Input from bcbio-nextgen pipeline Input directly from VCF Callers (optional) Test Dataset Requirements Installation Functions -Get SNV and indel predictions -Annotate genes -Subset region-of-interest Troubleshooting cutoffs Output format Running on multiple samples
Before running SMuRF, you require output data from the bcbio-nextgen pipeline that generates the VCF output for the variant callers: MuTect2, FreeBayes, VarDict and VarScan. Note that your vcf.gz files need to be tab-indexed (.tbi files required) for retrieving gene annotations in SMuRF. We would recommend the bcbio-nextgen pipeline for a better user experience.
SMuRF requires the VCF output from each caller (.vcf.gz) to be placed in the same directory and files tagged with the caller (eg. sample1-mutect.vcf.gz, sample1-freebayes.vcf.gz, sample1-vardict.vcf.gz, sample1-varscan.vcf.gz)
For Users not running bcbio-nextgen pipeline: Alternatively, install and execute the individual callers.
Refer to the installation and instructions for each caller:
- VarDict
- VarScan
- MuTect2
- FreeBayes
In this vignette, we utilise a partial output dataset derived from the chronic lymphocytic leukemia (CLL) data downloaded from the European Genome-phenome Archive (EGA) under the accession number EGAS00001001539. This test dataset is provided in the SMuRF package.
R 3.3 & 3.4 : bioconductor::VariantAnnotation
R >=3.5 : BiocManager::VariantAnnotation
h2o package : If h2o package takes a long time to install, try manually installing from their AWS page.
1. The latest version of the package is updated on Github https://github.com/skandlab/SMuRF
2. You can install the current SMuRF directly from Github via the following R commands:
#devtools is required
install.packages("devtools")
library(devtools)
install_github("skandlab/SMuRF", subdir="smurf")
(Alternative option) SMuRF installation via downloading of the package from Github:
#Clone or download package from Github https://github.com/skandlab/SMuRF/tree/master/smurf and save to your local directory
install.packages("my/current/directory/smurf", repos = NULL, type = "source")
SMuRF concurrently predicts single somatic nucleotide variants (SNV) as well as small insertions and deletions (indels) and saves time by parsing the VCF files once.
Missing packages will be installed the first time you run SMuRF.
library("smurf") #load SMuRF package
smurf() #check version and parameters
# "SMuRFv1.6.4 (20th Jan 2020)"
smurf(directory=NULL, mode=NULL, nthreads = -1,
annotation=F, output.dir=NULL, parse.dir=NULL, whitelist.file=NULL,
snv.cutoff = 'default', indel.cutoff = 'default',
build=NULL, change.build=F, t.label=NULL,
check.packages=T)
myresults = smurf(mydir, "combined") #save output into 'myresults' variable
Arguments | Description |
---|---|
directory | Choose directory where the Variant Caller Format(VCF) files are located |
output.dir | Path to output directory (if saving files as .txt) |
parse.dir | Specify if changing SMuRF default cutoffs. Path to the location of existing snv-parse.txt and indel-parse.txt files generated by SMuRF |
mode | Choose "snv", "indel" or "combined" (snv+indel). "combined" provides a separate list of SNVs and indels. |
annotation | TRUE or FALSE (default). Provide gene annotations for each variant call. |
nthreads=-1 | Number of cores used for RandomForest prediction. Default (-1) for maximum number of cores. For 32-bit Windows, only 1 core is allowed (nthreads=1). |
t.label | (Optional) Provide the sample name for your tumour sample to ease the identification of the normal and tumour sample names in your vcf |
build | Genome build='hg19' or "hg38". If unknown, SMuRF will try to detect it |
change.build | For conversion of your genomic coordinates |
snv.cutoff | Default SNV model cutoff, unless a number between 0 to 1 is stated |
indel.cutoff | Default indel model cutoff, unless a number between 0 to 1 is stated |
whitelist.file | Path to .BED or .txt file with region to be parsed into the SMuRF format. All SNVs and INDELS will be retrieved. SMuRF prediction will not be performed. |
check.packages=T | Developer mode |
For more information on the parameters see R documentation:
help(smurf)
library("smurf") #load SMuRF package
myresults = smurf(directory="/path/to/directory..",
mode="snv",
output.dir="/path/to/output", build='hg19')
#Alter snv or indel model score cutoff (see below)
myresults = smurf(directory="/path/to/directory..",
mode="combined",
nthreads = 1, #for Windows OS
snv.cutoff = 0.2, indel.cutoff = 'default',
parse.dir="/path/to/parse/files/", build='hg19')
#Include gene annotations for coding regions in output
myresults = smurf(directory="/path/to/directory..",
mode="combined",
annotation=T, build='hg19')
#Whitelist option to extract full list of genomic positions
myresults = smurf(directory="/path/to/directory..",
whitelist.dir="/path/to/dir/roi.bed")
#Change hg38 to hg19 coordinates in gene annotation output
myresults = smurf(directory="/path/to/directory..",
mode="combined",
annotation=T,
build='hg38', change.build=T)
#Specify tumor sample name eg."SampleA-001-T"
myresults = smurf(directory="/path/to/directory..",
mode="combined",
annotation=T,
build='hg38', t.label = '-T')
#Specify directories manually
dir.list = list(mutect='/path/to/mutect.vcf.gz',
freebayes='/path/to/freebayes.vcf.gz',
vardict='/path/to/vardict.vcf.gz',
varscan='/path/to/varscan.vcf.gz')
myresults = smurf(directory=dir.list,
mode="combined")
SMuRF is fine-tuned to achieve higher sensitivity. Re-adjust the stringency of the prediction with a specific cut-off value. Use parameters 'snv.cutoff' or 'indel.cutoff' to adjust the thresholds. To re-adjust the cut-off value of an existing SMuRF output, simply provide the parse.dir to the snv-parse and indel-parse files for immediate re-processing.
#start with a specific cutoff
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
mode="combined", #change cutoffs
snv.cutoff=0.2, indel.cutoff=0.1, #specify new cutoffs
output.dir = 'C:/Users/admin/myresults')
#modify cutoff from existing SMuRF parse files
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
mode="combined",
snv.cutoff=0.2, indel.cutoff=0.1,
parse.dir = 'C:/Users/admin/myresults', #SMuRF output path to existing parse.txt files
output.dir = 'C:/Users/admin/myresults2')
#Plot histogram
hist(as.numeric(myresults$smurf_indel$predicted_indel[,'SMuRF_score']), main = 'Re-adjusted predicted indels', xlab = 'SMuRF_score', col = 'grey50')
SMuRF can extract variant calls from a BED or (.txt) file containing the genomic coordinates of your region of interest. Use genomic coordinates referencing hg19/GRCh37. Specify the BED file path using 'whitelist.file'
# roi.df = read.delim(roi.dir, header = F)
# roi.df
# chrom start end
# 1 1 1 25000
# 2 1 1000 100000
# 3 1 77000000 78000000
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
mode="combined", #retrieve SNVs + indels from regions-of-interest
whitelist.file = paste0(find.package("smurf"), "/data/roi.bed"), #BED file containing ROIs
save.files = T,
output.dir = 'C:/Users/admin/myresults3')
Output files saved include:
-
Variant statistics (stats)
-
Predicted reads (predicted)
-
Parsed-raw file (parse)
-
Predicted reads with annotations (annotated)* #for smurf's "cdsannotation" function only
-
Time taken (time)
#Main predicted file (SNV & indel)
myresults$smurf_snv$predicted_snv
myresults$smurf_indel$predicted_indel
Column | Description |
---|---|
Chr | Chromosome number |
START_POS_REF/END_POS_REF | Start and End nucleotide position of the somatic mutation |
REF/ALT | Consensus Ref and Alt nucleotide changes of the highest likelihood |
REF_MFVdVs/ALT_MFVdVs | Reference and Alternative nucleotide changes from each caller; Mutect2 (M), Freebayes (F), Vardict (Vd), Varscan (Vs) |
FILTER | Passed (TRUE) or Reject (FALSE) [boolean] mutation calls from the individual callers |
Sample_Name | Sample name is extracted based on your labeled samples in the vcf files |
Alt_Allele_Freq | Mean Variant allele frequency calculated from the tumor reads of the callers |
Depth ref/alt N/T | Mean read depth from the N/T sample for ref/alt alleles |
SMuRF_score | SMuRF confidence score of the predicted mutation |
myresults$smurf_indel$stats_indel
# Passed_Calls
# Mutect2 1546
# FreeBayes 339
# VarDict 515
# VarScan 2228
# Atleast1 4343
# Atleast2 244
# Atleast3 37
# All4 4
# SMuRF_INDEL 6
myresults$smurf_snv$stats_snv
# Passed_Calls
# Mutect2 4906
# FreeBayes 247
# VarDict 315
# VarScan 5101
# Atleast1 10302
# Atleast2 170
# Atleast3 58
# All4 39
# SMuRF_SNV 1345
We added gene annotations using SnpEff (from bcbio) and SMuRF extracts the coding annotations from the canonical transcripts with the highest impact. Take note that your vcf.gz files should be tab-indexed (.tbi files required).
myresults = smurf(mydir, "cdsannotation") #runs SMuRF for SNV and indels + generate annotations
myresults$smurf_snv_annotation$annotated
# Chr START_POS_REF END_POS_REF REF ALT REF_MFVdVs ALT_MFVdVs FILTER_Mutect2 FILTER_Freebayes FILTER_Vardict FILTER_Varscan Sample_Name Alt_Allele_Freq
# 1 1 77712621 77712621 G A G/G/G/G A/A/A/A TRUE TRUE TRUE TRUE icgc_cll 0.296
# 2 1 77806132 77806132 G A G/G/G/G A/A/A/A TRUE TRUE TRUE TRUE icgc_cll 0.483
# T_altDepth T_refDepth N_refDepth N_altDepth Allele Annotation Impact Gene_name Gene_ID Feature_Type Feature_ID Transcript_BioType
# 1 8 19 40 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 2 15 16 22 0 A missense_variant MODERATE AK5 ENSG00000154027 transcript ENST00000354567 protein_coding
# Rank HGVS.c HGVS.p cDNA.pos CDS.pos AA.pos Distance REGION SMuRF_score
# 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> Non-coding 0.8148148
# 2 6/14 c.770G>A p.Arg257His 1033/3248 770/1689 257/562 . CDS 0.7777778
Time taken for your run:
myresults$time.taken
<!-- Time difference of 20.52405 secs -->
The raw parsed output:
myresults$smurf_indel$parse_indel
myresults$smurf_snv$parse_snv
Proceed to save the output in your desired formats State "output.dir" to generate .txt files in your desired directory Example:
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
mode ="combined", nthreads = 1,
annotation=T, build='hg19',
output.dir = 'C:/Users/admin/myresults')
Iterate over multiple samples by providing the list of directories of where your sample files are located.
#Example
project.dir = 'path/to/my/dir'
samples = c('sample_A', 'sample_B', 'sample_C') #sample dir where vcf files are located
for(i in 1:length(samples)) {
smurf(directory=paste0(project.dir, '/', samples[i]),
mode="combined", build='hg19',
output.dir = paste0('C:/Users/admin/myresults-',sample[i]))
}
For errors and bugs, please report on our Github page.