-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve MQ calculation accuracy #4969
Conversation
@jamesemery Can you take a look? (This doesn't have to go into the next release.) |
Codecov Report
@@ Coverage Diff @@
## master #4969 +/- ##
============================================
+ Coverage 86.75% 86.77% +0.01%
- Complexity 29834 29883 +49
============================================
Files 1828 1831 +3
Lines 138115 138452 +337
Branches 15227 15249 +22
============================================
+ Hits 119827 120138 +311
- Misses 12741 12765 +24
- Partials 5547 5549 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly there are some places that could use clarifying comments. Otherwise the code itself looks reasonable and contained. Are there plans to update AS_RMSMappingQuality? They are now out of synch with each-other with regards to what they do.
|
||
@Override | ||
public String getRawKeyName() { return GATKVCFConstants.RAW_RMS_MAPPING_QUALITY_KEY;} | ||
public String getRawKeyName() { return GATKVCFConstants.RAW_MAPPING_QUALITY_WITH_DEPTH_KEY;} //new key for the two-value MQ data to prevent version mismatch catastrophes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to bump this into a new class? This would mean that we are no longer able to handle the old RMS mapping quality code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that matter to us? I guess since the variant context wont have a key matching one of the discovered reducible annotations' keys then nothing will really happen and the key will be dropped...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could add a check and emit a warning to the user if there is a mismatch between the versions of the rmsMappingQualityKey in their files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check in the finalize method in case GenotypeGVCFs gets run on old data. GenomicsDB behavior is largely outside of our control here. I don't have a great idea for CombineGVCFs though. I thought about adding a check against ReducibleAnnotation.getDeprecatedKeyName(), but I feel like having getDeprecatedKeyName() return null for the other annotations and querying the annotationMap for null is an invitation for trouble.
@@ -53,12 +56,12 @@ | |||
|
|||
@Override | |||
public List<VCFInfoHeaderLine> getDescriptions() { | |||
return Arrays.asList(VCFStandardHeaderLines.getInfoLine(getKeyNames().get(0)), GATKVCFHeaderLines.getInfoLine(getRawKeyName())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heh, yeah this should probably have been changed when I ported that AS_stuff. I remember this was here purely because it broke tests. It looks like you've updated them which is good.
/** | ||
* Created by gauthier on 5/29/18. | ||
*/ | ||
public class VariantDepth { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you may have included this accidentally
@@ -223,6 +226,13 @@ private static double parseRawDataString(String rawDataString) { | |||
@VisibleForTesting | |||
static int getNumOfReads(final VariantContext vc, | |||
final ReadLikelihoods<Allele> likelihoods) { | |||
if(vc.hasAttribute(GATKVCFConstants.RAW_MAPPING_QUALITY_WITH_DEPTH_KEY)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the comment to reflect that it searches for the count data first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the GATK Javadoc for the annotation -- is that what you meant?
} | ||
|
||
|
||
public void combineAttributeMap(ReducibleAnnotationData<Double> toAdd, ReducibleAnnotationData<Double> combined) { | ||
public void combineAttributeMap(ReducibleAnnotationData<List<Integer>> toAdd, ReducibleAnnotationData<List<Integer>> combined) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method could use a comment, as its a little harder to read now (Allele.NO_CALL)
.mapToDouble(mq -> mq * mq).sum(); | ||
|
||
rawAnnotations.putAttribute(Allele.NO_CALL, squareSum); | ||
//GATK3.5 had a double, but change this to an int for the tuple representation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also just a quick comment explaining that this is counting both the squareSum AND the count.
} | ||
|
||
@Test (dataProvider = "VCFdata") | ||
public void assertMatchingGenotypesFromGenomicsDB_vidmapHack(File[] inputs, File expected, String interval, List<String> additionalArguments, String reference) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment explaining this test? It doesn't seem related to the rest of the branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I goofed and copied the genotypes comparison when I meant do compare annotations. I fixed the name and made sure MQ is actually being compared. (FYI the vidmap hack is because I need to change the GDB json since I changed the MQ annotation name. I'll update this once the protobuf update goes in.)
@@ -42,9 +42,12 @@ | |||
@DocumentedFeature(groupName=HelpConstants.DOC_CAT_ANNOTATORS, groupSummary=HelpConstants.DOC_CAT_ANNOTATORS_SUMMARY, summary="Root mean square of the mapping quality of reads across all samples (MQ)") | |||
public final class RMSMappingQuality extends InfoFieldAnnotation implements StandardAnnotation, ReducibleAnnotation { | |||
private static final RMSMappingQuality instance = new RMSMappingQuality(); | |||
public static final int NUM_LIST_ENTRIES = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you update the header comments to reflect the new annotation? Just add a line specifying what the count means
@@ -97,6 +100,8 @@ | |||
{getTestFile( "testAlleleSpecificAnnotations.CombineGVCF.output.g.vcf"), getTestFile( "testAlleleSpecificAnnotations.CombineGVCF.expected.g.vcf"), Arrays.asList( "-A", "ClippingRankSumTest", "-G", "AS_StandardAnnotation", "-G", "StandardAnnotation"), b37_reference_20_21}, | |||
//all sites not supported yet see https://github.com/broadinstitute/gatk-protected/issues/580 and https://github.com/broadinstitute/gatk/issues/2429 | |||
//{getTestFile(basePairGVCF), getTestFile( "gvcf.basepairResolution.includeNonVariantSites.gatk3.7_30_ga4f720357.expected.vcf"), Collections.singletonList("--"+GenotypeGVCFs.ALL_SITES_LONG_NAME) //allsites not supported yet | |||
//Test for new RAW_MQandDP annotation format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test doesn't actually test RAW_MWandDP output as they are added to attributes to ignore. Maybe spin this off into its own test to make sure that the assertions made on the annotations are run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, I still have the question of whether this should be ported to AS_MQ to include an AS_RAW_MQandDP tag? Perhaps we should open an issue to implement it at some point in the future?
@@ -118,7 +118,8 @@ private static void addFilterLine(final VCFFilterHeaderLine line) { | |||
addInfoLine(new VCFInfoHeaderLine(LIKELIHOOD_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score from Wilcoxon rank sum test of Alt Vs. Ref haplotype likelihoods")); | |||
addInfoLine(new VCFInfoHeaderLine(MAP_QUAL_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities")); | |||
addInfoLine(new VCFInfoHeaderLine(AS_MAP_QUAL_RANK_SUM_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "allele specific Z-score From Wilcoxon rank sum test of each Alt vs. Ref read mapping qualities")); | |||
addInfoLine(new VCFInfoHeaderLine(RAW_RMS_MAPPING_QUALITY_KEY, 1, VCFHeaderLineType.Float, "Raw data for RMS Mapping Quality")); | |||
addInfoLine(new VCFInfoHeaderLine(RAW_RMS_MAPPING_QUALITY_KEY, 2, VCFHeaderLineType.Integer, "Raw data for RMS Mapping Quality")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the vcf description change for the new key to reflect that it is functionally different and incompatible with the old format? Having two files with the same header lines except for a count seems potentially confusing.
The AS_MQ never suffered from this issue because it uses AD for (allele-specific) depth instead of the INFO DP. The sum of the squared MQs there was allocated based on informative reads and the AD represents informative reads, so the data there was always in lock-step. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me now. Unfortunately there are some merge conflicts, could you rebase onto master then I think it is good to merge 👍
c3c0ed5
to
e95075e
Compare
I put this into a different branch because I upgraded GDB to fix the weird error. I don't want this feature to go into the 4.0.9.0 release so I'll do a PR of the new branch after. |
317b4fe
to
475668c
Compare
…curacy where there are lots of uninformative reads
79d9e59
to
840bb25
Compare
Provided Travis tests pass, does this still have your 👍 @jamesemery ? The ReblockGVCF tool had a less elegant MQ solution so now I'm having it output both versions so that we don't need to re-reprocess the gnomAD v3 GVCFs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look good to me, this still has my 👍
* https://developers.google.com/protocol-buffers/docs/javatutorial#the-protocol-buffer-api | ||
* https://developers.google.com/protocol-buffers/docs/reference/java-generated | ||
*/ | ||
public class GenomicsDBUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve of this refactor
Change raw MQ to a tuple of (sumSquaredMQs, totalDepth) for better accuracy where there are lots of uninformative reads or called single-sample variants with homRef genotypes. Note that incorporating this change into a pipeline will require a concomitant update to this version for GenomicsDBImport and GenotypeGVCFs.
Just as a note. This change in line 251 of
results in the following type of GVCF/VCF parsing error:
when the |
After the switch to rawMQ for the reducible implementation we no longer take the median of MQ values across all samples (yay!) but there were some accuracy problems at sites with a lot of uninformative reads (boo!)
I tried to calculate the depth over variant samples previously, but the approximations weren't perfect. Now we keep track of the depth used to find the RMS of MQ at the end of joint calling.