Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify an output sort order in FilterConsensusReads #782

Merged
merged 7 commits into from
Mar 3, 2022

Conversation

nh13
Copy link
Member

@nh13 nh13 commented Feb 21, 2022

Also ReferenceSequenceIterator closes the underlying reference file.

@tfenne not sure if all this work is worth it. But if the input order is queryname and the output order is queryname, then this will skip sorting when --load-full-reference=true. This could be nice if we want to rely on multi-threaded sort tools or don't care about the order after filtering.

@nh13 nh13 requested a review from tfenne February 21, 2022 05:30
def close(): Unit = {
this._sorter.foreach { rec => writer += regenerateNmUqMdTags(rec) }
writer.close()
refCloseMethod()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
refCloseMethod()
refCloseMethod()
progress.logLast()

nh13 added 2 commits February 26, 2022 08:53
Also ReferenceSequenceIterator closes the underlying reference file
@nh13 nh13 force-pushed the feature/filter-consensus-reads-load-ref branch from 0103932 to a5a150a Compare February 26, 2022 15:54
@nh13 nh13 changed the base branch from feature/group-reads-by-umi-threading to master February 26, 2022 15:54
@nh13 nh13 marked this pull request as ready for review February 26, 2022 15:55
@codecov-commenter
Copy link

codecov-commenter commented Feb 26, 2022

Codecov Report

Merging #782 (0cd6d67) into main (7512807) will increase coverage by 0.40%.
The diff coverage is 97.36%.

❗ Current head 0cd6d67 differs from pull request most recent head d495fd4. Consider uploading reports for the commit d495fd4 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #782      +/-   ##
==========================================
+ Coverage   95.47%   95.88%   +0.40%     
==========================================
  Files         121      122       +1     
  Lines        6855     7774     +919     
  Branches      463      553      +90     
==========================================
+ Hits         6545     7454     +909     
- Misses        310      320      +10     
Flag Coverage Δ
unittests 95.88% <97.36%> (+0.40%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...com/fulcrumgenomics/umi/FilterConsensusReads.scala 97.77% <96.42%> (-0.48%) ⬇️
src/main/scala/com/fulcrumgenomics/bam/Bams.scala 95.74% <100.00%> (+1.30%) ⬆️
...c/main/scala/com/fulcrumgenomics/bam/ClipBam.scala 99.24% <100.00%> (+0.28%) ⬆️
...crumgenomics/fasta/ReferenceSequenceIterator.scala 90.90% <100.00%> (+7.57%) ⬆️
...cala/com/fulcrumgenomics/umi/ConsensusCaller.scala 93.97% <0.00%> (-0.47%) ⬇️
...cala/com/fulcrumgenomics/alignment/Alignment.scala 100.00% <0.00%> (ø)
...ala/com/fulcrumgenomics/bam/api/HeaderHelper.scala 100.00% <0.00%> (ø)
...om/fulcrumgenomics/umi/SimpleConsensusCaller.scala 100.00% <0.00%> (ø)
...fulcrumgenomics/umi/ConsensusCallingIterator.scala 100.00% <0.00%> (ø)
...crumgenomics/umi/CallMolecularConsensusReads.scala 100.00% <0.00%> (ø)
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7512807...d495fd4. Read the comment docs.

@tfenne tfenne self-assigned this Feb 28, 2022
@nh13 nh13 added the 2.0 label Feb 28, 2022
Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Top level usage on FilterConsensusReads needs updating. The second to last paragraph talks about sorting, which is now incorrect. And it should mention the need to load the whole genome into memory in addition to potentially needing memory for sorting.

src/main/scala/com/fulcrumgenomics/bam/Bams.scala Outdated Show resolved Hide resolved

val progress = ProgressLogger(logger, verb="Filtered and masked")
val in = SamSource(input)
val out = buildOutputWriter(in.header, refMap)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on the line below, but I wonder if in Bams.templateIterator (or rather Bams.queryGroupedIterator that is calls) we should think about trying to auto-detect query-grouping by looking at say the first 100 records if the sort order and group order are both null? Could be a separate PR, but would be nice to avoid sorting more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this tool though it might be nice to test for query grouped or sorted and emit a warning that it is preferable to supply this tool query-grouped data and defer sorting until after?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving it as is. See #799

Comment on lines 248 to 251
val orderAfterFiltering = SamOrder(inHeader) match {
case Some(order) if order == SamOrder.Queryname || order.groupOrder == GroupOrder.query => order
case _ => SamOrder.Queryname // input will we resorted to queryname
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work as intended in all cases. Because we don't have SamOrder objects that have e.g. sort=unknown, group=query, it won't find a SamOrder if the header has that combination, and so this will fall to the default branch and assume sorting to queryname was done.

I think it's safer and simpler to just check if (inHeader.getSortOrder == SortOrder.queryname || inHeader.getGroupOrder == GroupOrder.queryname). And possibly we should extract this check into Bams or elsewhere so it's the same check that is used in Bams.templateIterator?

Or for extra credit, we could make a new iterator type in Bams that has a flag that just tells you whether the data got sorted or not to generate your iterator.

outputOrder.applyTo(outHeader) // remember to apply it

// Build the writer
val sort = if (orderAfterFiltering == outputOrder) None else Some(outputOrder)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will have to become a slightly more complicated check

private case class ReadNames(in: Seq[String], out: Seq[String])

private def sortOrderTest(name1: String, start1R1: Int, start1R2: Int, name2: String, start2R1: Int, start2R2: Int,
inOrder: SamOrder, outOrder: Option[SamOrder] = None): ReadNames = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in order to test more combinations it would be better to take inSort and inGroup separately as SamOrder has a limited set of combinations.

@tfenne tfenne assigned nh13 and unassigned tfenne Mar 1, 2022
@nh13 nh13 assigned tfenne and unassigned nh13 Mar 2, 2022
Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One simplification suggested, and usage still needs updating. Merge when addressed.

Comment on lines +229 to +230
logger.info(f"Output $keptReads%,d of $totalReads%,d primary consensus reads.")
logger.info(f"Masked $maskedBases%,d of $totalBases%,d bases in retained primary consensus reads.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really do think that Intellij is wrong on this. Since braces become necessary the moment the expression becomes more than a simple variable reference (e.g. ${masedBases / totalBases}) and the fact that braces make it much clearer where the expression ends, I try to use them 100% of the time, and would certainly prefer they not get removed.

Comment on lines 268 to 278
} else { // output order and order after filtering do not match, we need to re-sort the output
SamOrder.values.find { order =>
order.sortOrder == outputSortOrder && order.groupOrder == outputGroupOrder && order.subSort == outputSubSort
}.orElse {
// this can only happen if the input order is unrecognized
throw new IllegalArgumentException(
s"The input BAM had an unrecognized sort order (SO:$inSortOrder GO:$inGroupOrder SS: $inSubSort)" +
s"\nTry re-running with --sort-order for a supported output order." in $input
)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this whole else block now just be this.sortOrder? I think your intent is:

  1. If this.sortOrder is None, and input is query sorted or grouped, no sorting occurs anywhere and output == input
  2. Else if this.sortOrder is None, and input is not query sorted or grouped, sorting to queryname occurs, and output is in SamOrder.Queryname
  3. Else output will be in this.sortOrder and sorting will occur only if the afterFiltering ordering is different than this.sortOrder.

In this else branch we've confirmed that afterFiltering and output orderings are different, and that can only occur if this.sortOrder is defined, so we can just use it?

@@ -114,7 +118,9 @@ class FilterConsensusReads
@arg(flag='q', doc="The minimum mean base quality across the consensus read.")
val minMeanBaseQuality: Option[PhredScore] = None,
@arg(flag='s', doc="Mask (make `N`) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only).")
val requireSingleStrandAgreement: Boolean = false
val requireSingleStrandAgreement: Boolean = false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on the tool usage, but you seemed to miss my top-level review comment last time, so I'm replicating it here:

Top level usage on FilterConsensusReads needs updating. The second to last paragraph talks about sorting, which is now incorrect. And it should mention the need to load the whole genome into memory in addition to potentially needing memory for sorting.

@tfenne tfenne assigned nh13 and unassigned tfenne Mar 2, 2022
@nh13 nh13 merged commit 6e0a53d into main Mar 3, 2022
@nh13 nh13 deleted the feature/filter-consensus-reads-load-ref branch March 3, 2022 03:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants