-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for FORMAT/FT VQSLod Filtering and cohort-wide LowQual filter #7248
Conversation
@@ -19,33 +19,30 @@ | |||
|
|||
parts = line.split("\t") | |||
|
|||
if ("ExcessHet" in parts[6]): | |||
# strip out hard filtered sites, so vcfeval can use "all-records" to plot the ROC curves |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably just because I am too new here, but seeing an example in the comments would be really helpful, esp for the transformation that starts on line 37 where we seem to be putting "new" data into parts[6]?
@@ -19,33 +19,30 @@ | |||
|
|||
parts = line.split("\t") | |||
|
|||
if ("ExcessHet" in parts[6]): | |||
# strip out hard filtered sites, so vcfeval can use "all-records" to plot the ROC curves |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be sure to remember to reindex any files you run this on...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep -- this is only used from a benchmarking pipeline (see scripts/variantstore/tieout/GIAB_TIEOUT.md) and it runs tabix to generate the index
if (parts[6] == "PASS" or parts[6] == "."): | ||
parts[6] = ft | ||
else: | ||
parts[6] = parts[6] + "," + ft |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will always be true:
if ft != "PASS" or ft != ".":
if think you want if ft != "PASS" and ft != ".":
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reworked this, and re-ran to make it didn't change the tieout results (which it did not)
// the genotype is passed, nothing to do here as non-filtered is the default | ||
} else { | ||
// get the minimum (worst) vqslod for all SNP non-Yay sites, and apply the filter | ||
Optional<Double> snpMin = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are actually getting the max. if that is what we want, update the comment and variable name for this and indels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch -- updating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just 2 comments that need some fixes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good some comments mostly for explanation
@@ -1,7 +1,6 @@ | |||
package org.broadinstitute.hellbender.tools.variantdb.nextgen; | |||
|
|||
import htsjdk.variant.vcf.VCFHeader; | |||
import htsjdk.variant.vcf.VCFHeaderLine; | |||
import htsjdk.variant.vcf.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this standard for java? I was taught always be explicit with imports
@@ -89,6 +89,13 @@ | |||
) | |||
private boolean disableGnarlyGenotyper = true; | |||
|
|||
@Argument( | |||
fullName = "vqslod-filter-genotypes", | |||
doc = "Should VQSLOD filtering be applied at the genotype level", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add what the alternative is? i.e., if this is false, filtering will be applied at.... site level?
|
||
// get the minimum (worst) vqslod for all INDEL non-Yay sites | ||
Optional<Double> indelMin = | ||
nonRefAlleles.stream().filter(a -> a.length() != ref.length()).map(a -> remappedVqsLodMap.get(a)).filter(Objects::nonNull).max(Double::compareTo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this calls .max
but we're getting the minimum? is that a mistake or is the logic reversed because we're filtering? if this is a straightforward java thing, ignore me, but a comment about this logic might be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outdated variables names/comment. fixed
} else { | ||
// get the minimum (worst) vqslod for all SNP non-Yay sites, and apply the filter | ||
Optional<Double> snpMin = | ||
nonRefAlleles.stream().filter(a -> a.length() == ref.length()).map(a -> remappedVqsLodMap.get(a)).filter(Objects::nonNull).max(Double::compareTo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(same question of min vs max as below for the indelMin)
but also - is len(a) == len(ref) a reliable SNP identifier? what if e.g. ref is ACGC and allele is AGGG ? (if this is How It's Done and beyond the scope of this PR, fine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this is somewhat "how it's done"... what you describe is an MNP but would actually be represented as two SNPs (one at position 2, and one at 4). It's possible it could be even uglier, but in that case we still would use the SNP model to threshold VQSR
The overarching goal of this PR is to reduce or eliminate the effect of cohort size on the filtering of variants for a specific sample. As an example this means the filtering for the genotypes for a GIAB sample should be the same whether you make a VCF of the full cohort and then subset to the GIAB sample (expensive) or you just make a callset with just the GIAB sample. This is good for users since their results won't get "better" with more samples that they don't care about in their VCF.
Before/After results showing minimal impact are at:
https://docs.google.com/spreadsheets/d/1LUrssKHBCwIzbA_9M3b01Ul0urMbOOmv4Z703dHwiyg/edit#gid=398306713