-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SNVQ recalibration for flow based reads #8697
SNVQ recalibration for flow based reads #8697
Conversation
xgboost added, initial xgboost version update --svn-qual-model added, placeholder code too FlowFeatureMapper refactor towards AppleSVNQR ApplySVNQR Integration Test (placeholder) typos fixed model now loading conf file reading added initial matrix build move snvqr feature creation to processor xgboost matrix logging feature implementation (partial) non-called-base support optimize for non-called-base independent features apply snvqr test file update all features initial coding matrix build seems ok BQ and histograms lots of changes ... quality interpolation Interpolator edge condition handled additional aliasing --add-apply-snvqr-features safeguard against non category boolean features BQ length (and polarity?) fixed "quality_interpolation_data" made optional ApplySNVQR --model optional FlowFeatureMapper optimization - reduced by ~70% conf optional when no model minor performance increase enforce score 0 in SNVQ BQ zero warning added zeroCount report
Also updated the printout of flow matrix to have more digits
… GATK and updated t0 parsing
…g probability is not given
@meganshand - this should be ready for your review, there is a parallel PR in picard that adds tools that can collect statistics from the output of this tool, so would be good to merge this one first |
|
||
@Advanced | ||
@Hidden | ||
@Argument(fullName="debug-collect-stats-into", doc = "", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this debug argument here while the debug read name argument is in AddFlowSNVQualityArgumnetCollection
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved.
private SAMFileGATKReadWriter outputWriter; | ||
|
||
@Advanced | ||
@Argument(fullName=INCLUDE_QC_FAILED_READS_FULL_NAME, doc = "include reads with QC failed flag", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have this here and the keepSupplementaryAlignments
arg in the collection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved.
} | ||
|
||
// collect input stats | ||
if ( debugCollectStatsInto != null ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add braces to this one so it matches the others
} | ||
|
||
private void collectOutputStats(GATKRead read) { | ||
if ( read.hasAttribute("BQ") ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use BASE_QUALITY_ATTRIBUTE_NAME
here instead of "BQ"
(and below as well)
@Argument(fullName="debug-collect-stats-into", doc = "", optional = true) | ||
public String debugCollectStatsInto = null; | ||
|
||
public static final String BASE_QUALITY_ATTRIBUTE_NAME = "BQ"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the BQ
tag is defined in the spec as Offset to base alignment quality (BAQ)
. How much of a pain would it be to change this tag name (not sure how baked into your pipelines this is at this point).
Another option could be to put your updated base qualities in the output as the main base qualities, and move the old base qualities to the OQ tag (original qualities) which is how BQSR has worked in the past. This might make downstream tooling easier without deleting any data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ilyasoifer could you.please suggest? This tool seems to be collecting stats on BQ - i.e. it is not the one producing the field. I'm not that familiar with the upstream and downstream tooling,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dror27 - agree with Megan that BQ is a confusing name (somehow I missed this point). We should fix this also in the picard tool in a different PR
So the desired interface is:
- By default the tool overrides base qualities
- There is a parameter that tells the tool to set a different tag rather than base qualities, but it is unset by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed. Tool now override base qualities by default. If --output-quality-attribute is provided, then base qualities are preserved and an attribute by the value of the parameter is set.
} | ||
} | ||
|
||
public void aux(int v, int auxValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these aux
functions get used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
@@ -250,6 +254,33 @@ public FlowBasedRead(final SAMRecord samRecord, final String flowOrder, final in | |||
|
|||
} | |||
|
|||
public FlowBasedRead(final FlowBasedRead other, final boolean deepCopy) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used anywhere? If not is it still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
} | ||
|
||
// access qual, convert to flow representation | ||
final byte[] quals = samRecord.getBaseQualities(); | ||
final byte[] tp = samRecord.getSignedByteArrayAttribute(FLOW_MATRIX_TAG_NAME); | ||
// initialize matrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spurious comment?
@@ -820,9 +869,9 @@ private void applyFilteringFlowMatrix(){ | |||
private void clipProbs() { | |||
for ( int i = 0 ; i < getMaxHmer(); i++ ) { | |||
for ( int j =0; j < getNFlows(); j++) { | |||
if ((flowMatrix[i][j] <= fbargs.probabilityRatioThreshold) && | |||
if ((flowMatrix[i][j] <= perHmerMinErrorProb*3) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where did this *3
come from? Why is it needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean to add this file?
@meganshand - thank you for your comments. I have responded and updated. We have just a few left between @ilyasoifer and myself - then you can have it back for another round. |
@dror27 - I addressed the comments that were assigned to me. There is some documentation task left and that failing test and then we are ready, I think. |
I'm finding out that the failing test is due to different floating point accuracy between the platform on which the expected file is generated and the one running the test. I will change the way the expected and generated files are changed against to accommodate for that < -5.6956674728026115 < -5.6956674728026115 < -5.6956674728026115 --- > -5.695667472802612 > -5.695667472802612 > -5.695667472802612 |
@meganshand - I think that we addressed all your comments, could you take another look please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good. I just had a few small suggestions for documentation.
/** | ||
* alternate quality attribute to set instead of the usual quality string | ||
*/ | ||
@Argument(fullName = OUTPUT_QUALITY_ATTRIBUTE_FULL_NAME, doc = "alternate quality attribute to set instead of the usual quality string.", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Argument(fullName = OUTPUT_QUALITY_ATTRIBUTE_FULL_NAME, doc = "alternate quality attribute to set instead of the usual quality string.", optional = true) | |
@Argument(fullName = OUTPUT_QUALITY_ATTRIBUTE_FULL_NAME, doc = "alternate SAM tag to put original quality scores instead of overwriting the QUAL field. If not used, QUAL will be overwritten.", optional = true) |
public SnvqModeEnum snvMode = SnvqModeEnum.Geometric; | ||
|
||
/** | ||
* alternate quality attribute to set instead of the usual quality string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* alternate quality attribute to set instead of the usual quality string | |
* By default this tool overwrites the QUAL field with the new qualities. Setting this argument saves the original qualities in the specified SAM tag. |
} | ||
|
||
@VisibleForTesting | ||
protected double[] generateHmerBaseErrorProbabilities(final int[] key, final double[][] errorProbBands, final int flow, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking, this comment was added to AddFlowBaseQuality
rather than here, right?
import java.util.List; | ||
|
||
/** | ||
* Set of arguments for the {@link FlowFeatureMapper} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Set of arguments for the {@link FlowFeatureMapper} | |
* Set of arguments for {@link AddFlowSNVQuality} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I'll merge when tests pass
The pull request addresses two issues:
We believe that this tool is going to be important for users of the Ultima Genomics data that care about calling SNVs, especially in somatic setting, so the goal was to make documentation more accessible.
Happy to receive feedback about it