VQR 5.2.7 Design Document
The Variant Quality Recalibration tool (VQR) is a command line tool used to post-process gVCF files. VQR recalibrates the variant quality scores (Q scores) assigned to variants within a sample, based solely on whether particular types of variant are over-represented in that sample. The tool was developed specifically to facilitate filtering of FFPE artifacts in highly degraded samples, but it is not limited to these types of signature events. VQR self-discovers which types of variants are over-represented, and may be used to filter out a range of system artifacts or upstream sample issues. VQR requires a (g)VCF as input and outputs an adjusted (g)VCF in which variant Q scores have been downgraded accordingly.
Pisces VQR works for vcf and genome.vcf input files. It is not currently recommended for crushed/diploid input, because this is not an identified use case for VQR.
Anecdotally, Pisces VQR seems to work on Strelka vcfs.
VQR supports configuration of parameters so that its behavior can be fine tuned depending on the application context.
Format: dotnet VariantQualityRecalibration.dll [-options]
Example: dotnet VariantQualityRecalibration.dll -vcf C:\test.vcf -o C:\OutFolder
SDS ID | Specification |
---|---|
SDS-1 | VQR shall accept command line arguments as a whitespace-separated list of name and value pairs. |
SDS-2 | If an invalid command is given, VQR shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands. |
SDS-3 | VQR command line shall be capitalization invariant. |
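The three requirements above can be illustrated with a minimal Python sketch. This is not the VQR implementation (VQR is a .NET tool); the function name `parse_args` and the error message wording are hypothetical and only show one way to meet SDS-1 to SDS-3.

```python
def parse_args(argv, valid_names):
    """Parse whitespace-separated '-name value' pairs (SDS-1).

    Option names are matched case-insensitively (SDS-3). An unknown option
    or a missing value raises an error that names the failed argument, the
    reason, and the valid options (SDS-2). Hypothetical helper, not VQR code.
    """
    valid = sorted(name.lower() for name in valid_names)
    parsed = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        name = token.lstrip("-").lower()
        if not token.startswith("-") or name not in valid:
            raise ValueError(
                f"Invalid argument '{token}': unknown option. "
                f"Valid options: {', '.join(valid)}"
            )
        if i + 1 >= len(argv):
            raise ValueError(
                f"Invalid argument '{token}': missing value. "
                f"Valid options: {', '.join(valid)}"
            )
        parsed[name] = argv[i + 1]
        i += 2
    return parsed
```

For example, `parse_args(["-VCF", "test.vcf", "-O", "out"], ["vcf", "o", "locicount"])` yields the same result regardless of how the option names are capitalized.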
SDS ID | Specification |
---|---|
SDS-4 | VQR shall require the command line arguments listed below: |
Argument Name | Type | Default value | Description |
---|---|---|---|
vcf | string | none | File path for input vcf |
SDS ID | Specification |
---|---|
SDS-5 | VQR shall optionally support the command line arguments listed below: |
Argument Name | Type | Default value | Description |
---|---|---|---|
locicount | integer | none (-1) | If a vcf is given instead of a gvcf, VQR needs the approximate number of loci to assess the error rates. (When given a gvcf, VQR can determine this itself by counting the lines in the gvcf.) |
o | string | none. By default the output destination will be the original bam folder | destination for output bam |
log | integer | 20 | in case of a stitching conflict, bases with qscore less than this value will automatically be disregarded in favor of the mate's bases. |
b | integer | 1 | reads with map quality less than this value shall be filtered |
z | double | true | reads marked as duplicate reads shall be filtered |
f | integer | false | reads marked as not proper pairs shall be filtered |
q | integer | false | reads pairs with incompatible cigar strings shall be filtered |
VQR requires one gVCF file as input. The gVCF file should be formatted such that each variant allele has its own line in the file. Pisces output has this format by default.
SDS ID | Specification |
---|---|
SDS-6 | VQR shall require one gVCF file as input. |
VQR outputs one gVCF file, with the same convention and structure as the input file.
SDS ID | Specification |
---|---|
SDS-7 | VQR shall produce output files in the same directory as input gVCF file. |
SDS-8 | VQR shall output a gVCF as described in the https://git.illumina.com/Bioinformatics/Pisces5/wiki/Pisces-VCF-Specifications document. |
SDS-9 | The VQR output file name shall be the input file name with ".recal" appended to the file name. |
VQR reads in the gVCF file and generates a "counts" file, in which it records how many variants have been called in each mutation category. There are 12 point mutation categories, as shown below. The counter also tracks insertions, deletions, reference calls, and other categories of variant, but these are not used in the recalibration step.
Mutation Category | A | C | G | T |
---|---|---|---|---|
A | X | A>C | A>G | A>T |
C | C>A | X | C>G | C>T |
G | G>A | G>C | X | G>T |
T | T>A | T>C | T>G | X |
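The counting step can be sketched in Python as follows. This is a simplified illustration, not the VQR code: the function name `count_categories` and the labels `Reference`, `Insertion`, `Deletion`, and `Other` are hypothetical, and the sketch assumes one allele per line with REF and ALT in the standard VCF columns 4 and 5.

```python
from collections import Counter

BASES = "ACGT"
# The 12 point mutation categories from the table above, e.g. "C>T".
POINT_CATEGORIES = [f"{r}>{a}" for r in BASES for a in BASES if r != a]


def count_categories(vcf_lines):
    """Count calls per mutation category and the number of loci (data lines)."""
    counts = Counter({cat: 0 for cat in POINT_CATEGORIES})
    loci = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3].upper(), fields[4].upper()
        loci += 1
        if alt in (".", ref):
            counts["Reference"] += 1
        elif len(ref) == 1 and len(alt) == 1 and ref in BASES and alt in BASES:
            counts[f"{ref}>{alt}"] += 1
        elif len(ref) < len(alt):
            counts["Insertion"] += 1
        elif len(ref) > len(alt):
            counts["Deletion"] += 1
        else:
            counts["Other"] += 1  # e.g. MNVs; tracked but not recalibrated
    return counts, loci
```

Counting data lines as loci mirrors the behavior described for gVCF input; for a plain vcf, the locus count would instead come from the locicount argument.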
Once the counts are known, the recalibration step begins. The mutation rate is calculated for each category, and the mean and variance of these rates across the categories are also calculated. Each category whose rate exceeds the mean plus Z times the typical standard deviation is considered over-represented. The value of Z is configurable. Young samples typically have a very white profile. However, older samples with FFPE artifacts, oxidative damage, or characteristic sequencing artifacts might have a characteristic colored profile, where certain mutations are highly over-represented in the sample. These distributions generally look the same whether the observations are constrained to purely false positives (which are typically not known a priori) or include all called variants.
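A minimal sketch of the over-representation test is shown below. It assumes that the mutation rate of a category is its count divided by the number of loci, and it uses the plain population standard deviation across the 12 categories as a stand-in for the "typical standard deviation"; the exact estimator VQR uses is not specified here. The function name `over_represented` is hypothetical.

```python
from statistics import mean, pstdev


def over_represented(counts, loci, z):
    """Return {category: rate} for categories above mean + z * stdev.

    counts: mapping of point mutation category -> number of calls
    loci:   number of loci examined
    z:      the configurable Z threshold
    Assumption: plain population standard deviation across categories.
    """
    rates = {cat: n / loci for cat, n in counts.items()}
    values = list(rates.values())
    threshold = mean(values) + z * pstdev(values)
    return {cat: rate for cat, rate in rates.items() if rate > threshold}
```

With a white profile, every rate sits at or near the mean and nothing is flagged. Note that with a plain standard deviation, a single outlier among 12 categories can sit at most about 3.3 standard deviations above the mean, so large values of Z would never flag anything under this simplified estimator.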
For samples with a balanced profile, no recalibration is performed. For samples with a highly colored noise profile, the variant Q scores are recalibrated in the following manner: The 1% noise model used by Pisces, which assumes the same noise rate for all categories of mutations, is replaced with a noise model derived from the sample-specific noise profile. Specifically, the 1% noise assumption is raised to the observed mutation rate for the over-represented categories of mutations. In this way, for an over-represented mutation to get a passing Q score, it has to distinguish itself from the baseline over-represented state of the sample. This allows for better resolution in variant/noise discrimination.
This technique has reduced the FP count for a range of FFPE samples from 2 to 15 years old. For some samples, the FP count drops from several hundred calls to fewer than 10. However, not all samples see an improvement, possibly because other error modes are the source of their false positives.
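The effect of raising the noise assumption can be illustrated with a small Python sketch. It assumes, for illustration only, that a Q score is the Phred-scaled Poisson probability of seeing at least the observed number of variant reads from noise alone; the function names, the `max_q` cap, and the parameter names are hypothetical and are not taken from the VQR code.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), lam > 0, summed upward from k."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    while True:
        total += term
        i += 1
        term *= lam / i
        if i > lam and term <= total * 1e-15:
            break  # remaining terms are negligible
    return min(1.0, total)


def q_score(alt_count, depth, noise_rate, max_q=100):
    """Phred-scaled probability that alt_count reads arise from noise alone."""
    p = poisson_tail(alt_count, depth * noise_rate)
    if p <= 0.0:
        return max_q
    return min(max_q, int(-10 * math.log10(p)))


def recalibrate(original_q, alt_count, depth, category, elevated_rates,
                baseline=0.01):
    """Downgrade the Q score only for over-represented categories."""
    observed = elevated_rates.get(category)
    if observed is None:
        return original_q  # balanced category: leave the Q score alone
    noise = max(baseline, observed)  # the noise level is only ever raised
    return min(original_q, q_score(alt_count, depth, noise))
```

Under these assumptions, a C>T call with 30 supporting reads at depth 1000 scores well against 1% noise, but scores poorly if C>T is flagged with an observed rate of 3%, because 30 reads is then exactly what noise alone would be expected to produce.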
This technique only reduces the FPs that follow the particular pattern the algorithm is looking for, and is currently restricted to point mutations. This technique is adaptable and easily extensible for future work.