-
-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify an output sort order in FilterConsensusReads #782
Conversation
def close(): Unit = { | ||
this._sorter.foreach { rec => writer += regenerateNmUqMdTags(rec) } | ||
writer.close() | ||
refCloseMethod() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refCloseMethod() | |
refCloseMethod() | |
progress.logLast() |
Also ReferenceSequenceIterator closes the underlying reference file
0103932
to
a5a150a
Compare
src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala
Outdated
Show resolved
Hide resolved
Codecov Report
@@ Coverage Diff @@
## main #782 +/- ##
==========================================
+ Coverage 95.47% 95.88% +0.40%
==========================================
Files 121 122 +1
Lines 6855 7774 +919
Branches 463 553 +90
==========================================
+ Hits 6545 7454 +909
- Misses 310 320 +10
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Top level usage on FilterConsensusReads needs updating. The second to last paragraph talks about sorting, which is now incorrect. And it should mention the need to load the whole genome into memory in addition to potentially needing memory for sorting.
src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala
Outdated
Show resolved
Hide resolved
|
||
val progress = ProgressLogger(logger, verb="Filtered and masked") | ||
val in = SamSource(input) | ||
val out = buildOutputWriter(in.header, refMap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't comment on the line below, but I wonder if in Bams.templateIterator (or rather Bams.queryGroupedIterator that is calls) we should think about trying to auto-detect query-grouping by looking at say the first 100 records if the sort order and group order are both null? Could be a separate PR, but would be nice to avoid sorting more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this tool though it might be nice to test for query grouped or sorted and emit a warning that it is preferable to supply this tool query-grouped data and defer sorting until after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving it as is. See #799
src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala
Outdated
Show resolved
Hide resolved
val orderAfterFiltering = SamOrder(inHeader) match { | ||
case Some(order) if order == SamOrder.Queryname || order.groupOrder == GroupOrder.query => order | ||
case _ => SamOrder.Queryname // input will we resorted to queryname | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will work as intended in all cases. Because we don't have SamOrder
objects that have e.g. sort=unknown, group=query
, it won't find a SamOrder
if the header has that combination, and so this will fall to the default branch and assume sorting to queryname was done.
I think it's safer and simpler to just check if (inHeader.getSortOrder == SortOrder.queryname || inHeader.getGroupOrder == GroupOrder.queryname)
. And possibly we should extract this check into Bams
or elsewhere so it's the same check that is used in Bams.templateIterator
?
Or for extra credit, we could make a new iterator type in Bams that has a flag that just tells you whether the data got sorted or not to generate your iterator.
outputOrder.applyTo(outHeader) // remember to apply it | ||
|
||
// Build the writer | ||
val sort = if (orderAfterFiltering == outputOrder) None else Some(outputOrder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will have to become a slightly more complicated check
src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala
Outdated
Show resolved
Hide resolved
private case class ReadNames(in: Seq[String], out: Seq[String]) | ||
|
||
private def sortOrderTest(name1: String, start1R1: Int, start1R2: Int, name2: String, start2R1: Int, start2R2: Int, | ||
inOrder: SamOrder, outOrder: Option[SamOrder] = None): ReadNames = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that in order to test more combinations it would be better to take inSort
and inGroup
separately as SamOrder
has a limited set of combinations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One simplification suggested, and usage still needs updating. Merge when addressed.
logger.info(f"Output $keptReads%,d of $totalReads%,d primary consensus reads.") | ||
logger.info(f"Masked $maskedBases%,d of $totalBases%,d bases in retained primary consensus reads.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really do think that Intellij is wrong on this. Since braces become necessary the moment the expression becomes more than a simple variable reference (e.g. ${masedBases / totalBases}
) and the fact that braces make it much clearer where the expression ends, I try to use them 100% of the time, and would certainly prefer they not get removed.
} else { // output order and order after filtering do not match, we need to re-sort the output | ||
SamOrder.values.find { order => | ||
order.sortOrder == outputSortOrder && order.groupOrder == outputGroupOrder && order.subSort == outputSubSort | ||
}.orElse { | ||
// this can only happen if the input order is unrecognized | ||
throw new IllegalArgumentException( | ||
s"The input BAM had an unrecognized sort order (SO:$inSortOrder GO:$inGroupOrder SS: $inSubSort)" + | ||
s"\nTry re-running with --sort-order for a supported output order." in $input | ||
) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this whole else
block now just be this.sortOrder
? I think your intent is:
- If
this.sortOrder
is None, and input is query sorted or grouped, no sorting occurs anywhere and output == input - Else if
this.sortOrder
is None, and input is not query sorted or grouped, sorting to queryname occurs, and output is in SamOrder.Queryname - Else output will be in
this.sortOrder
and sorting will occur only if theafterFiltering
ordering is different thanthis.sortOrder
.
In this else
branch we've confirmed that afterFiltering and output orderings are different, and that can only occur if this.sortOrder
is defined, so we can just use it?
@@ -114,7 +118,9 @@ class FilterConsensusReads | |||
@arg(flag='q', doc="The minimum mean base quality across the consensus read.") | |||
val minMeanBaseQuality: Option[PhredScore] = None, | |||
@arg(flag='s', doc="Mask (make `N`) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only).") | |||
val requireSingleStrandAgreement: Boolean = false | |||
val requireSingleStrandAgreement: Boolean = false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't comment on the tool usage, but you seemed to miss my top-level review comment last time, so I'm replicating it here:
Top level usage on FilterConsensusReads needs updating. The second to last paragraph talks about sorting, which is now incorrect. And it should mention the need to load the whole genome into memory in addition to potentially needing memory for sorting.
Also ReferenceSequenceIterator closes the underlying reference file.
@tfenne not sure if all this work is worth it. But if the input order is queryname and the output order is queryname, then this will skip sorting when
--load-full-reference=true
. This could be nice if we want to rely on multi-threaded sort tools or don't care about the order after filtering.