Specify an output sort order in FilterConsensusReads #782

nh13 · 2022-02-21T05:30:24Z

Also ReferenceSequenceIterator closes the underlying reference file.

@tfenne not sure if all this work is worth it. But if the input order is queryname and the output order is queryname, then this will skip sorting when --load-full-reference=true. This could be nice if we want to rely on multi-threaded sort tools or don't care about the order after filtering.

nh13 · 2022-02-21T05:31:32Z

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

+        def close(): Unit = {
+          this._sorter.foreach { rec => writer += regenerateNmUqMdTags(rec) }
+          writer.close()
+          refCloseMethod()


Suggested change

refCloseMethod()

refCloseMethod()

progress.logLast()

Also ReferenceSequenceIterator closes the underlying reference file

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

codecov-commenter · 2022-02-26T16:01:25Z

Codecov Report

Merging #782 (0cd6d67) into main (7512807) will increase coverage by 0.40%.
The diff coverage is 97.36%.

❗ Current head 0cd6d67 differs from pull request most recent head d495fd4. Consider uploading reports for the commit d495fd4 to get more accurate results

@@            Coverage Diff             @@
##             main     #782      +/-   ##
==========================================
+ Coverage   95.47%   95.88%   +0.40%     
==========================================
  Files         121      122       +1     
  Lines        6855     7774     +919     
  Branches      463      553      +90     
==========================================
+ Hits         6545     7454     +909     
- Misses        310      320      +10

Flag	Coverage Δ
unittests	`95.88% <97.36%> (+0.40%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...com/fulcrumgenomics/umi/FilterConsensusReads.scala	`97.77% <96.42%> (-0.48%)`	⬇️
src/main/scala/com/fulcrumgenomics/bam/Bams.scala	`95.74% <100.00%> (+1.30%)`	⬆️
...c/main/scala/com/fulcrumgenomics/bam/ClipBam.scala	`99.24% <100.00%> (+0.28%)`	⬆️
...crumgenomics/fasta/ReferenceSequenceIterator.scala	`90.90% <100.00%> (+7.57%)`	⬆️
...cala/com/fulcrumgenomics/umi/ConsensusCaller.scala	`93.97% <0.00%> (-0.47%)`	⬇️
...cala/com/fulcrumgenomics/alignment/Alignment.scala	`100.00% <0.00%> (ø)`
...ala/com/fulcrumgenomics/bam/api/HeaderHelper.scala	`100.00% <0.00%> (ø)`
...om/fulcrumgenomics/umi/SimpleConsensusCaller.scala	`100.00% <0.00%> (ø)`
...fulcrumgenomics/umi/ConsensusCallingIterator.scala	`100.00% <0.00%> (ø)`
...crumgenomics/umi/CallMolecularConsensusReads.scala	`100.00% <0.00%> (ø)`
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7512807...d495fd4. Read the comment docs.

tfenne

Top level usage on FilterConsensusReads needs updating. The second to last paragraph talks about sorting, which is now incorrect. And it should mention the need to load the whole genome into memory in addition to potentially needing memory for sorting.

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

tfenne · 2022-03-01T21:35:56Z

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

+
+    val progress = ProgressLogger(logger, verb="Filtered and masked")
+    val in       = SamSource(input)
+    val out      = buildOutputWriter(in.header, refMap)


I can't comment on the line below, but I wonder if in Bams.templateIterator (or rather Bams.queryGroupedIterator that is calls) we should think about trying to auto-detect query-grouping by looking at say the first 100 records if the sort order and group order are both null? Could be a separate PR, but would be nice to avoid sorting more.

In this tool though it might be nice to test for query grouped or sorted and emit a warning that it is preferable to supply this tool query-grouped data and defer sorting until after?

Leaving it as is. See #799

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

tfenne · 2022-03-01T21:55:56Z

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

+    val orderAfterFiltering = SamOrder(inHeader) match {
+      case Some(order) if order == SamOrder.Queryname || order.groupOrder == GroupOrder.query => order
+      case _ => SamOrder.Queryname // input will we resorted to queryname
+    }


I don't think this will work as intended in all cases. Because we don't have SamOrder objects that have e.g. sort=unknown, group=query, it won't find a SamOrder if the header has that combination, and so this will fall to the default branch and assume sorting to queryname was done.

I think it's safer and simpler to just check if (inHeader.getSortOrder == SortOrder.queryname || inHeader.getGroupOrder == GroupOrder.queryname). And possibly we should extract this check into Bams or elsewhere so it's the same check that is used in Bams.templateIterator?

Or for extra credit, we could make a new iterator type in Bams that has a flag that just tells you whether the data got sorted or not to generate your iterator.

tfenne · 2022-03-01T21:57:50Z

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

+    outputOrder.applyTo(outHeader) // remember to apply it
+
+    // Build the writer
+    val sort   = if (orderAfterFiltering == outputOrder) None else Some(outputOrder)


I think this will have to become a slightly more complicated check

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

tfenne · 2022-03-01T22:01:18Z

src/test/scala/com/fulcrumgenomics/umi/FilterConsensusReadsTest.scala

+  private case class ReadNames(in: Seq[String], out: Seq[String])
+
+  private def sortOrderTest(name1: String, start1R1: Int, start1R2: Int, name2: String, start2R1: Int, start2R2: Int,
+                            inOrder: SamOrder, outOrder: Option[SamOrder] = None): ReadNames = {


I think that in order to test more combinations it would be better to take inSort and inGroup separately as SamOrder has a limited set of combinations.

tfenne

One simplification suggested, and usage still needs updating. Merge when addressed.

tfenne · 2022-03-02T12:01:39Z

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

+    logger.info(f"Output $keptReads%,d of $totalReads%,d primary consensus reads.")
+    logger.info(f"Masked $maskedBases%,d of $totalBases%,d bases in retained primary consensus reads.")


I really do think that Intellij is wrong on this. Since braces become necessary the moment the expression becomes more than a simple variable reference (e.g. ${masedBases / totalBases}) and the fact that braces make it much clearer where the expression ends, I try to use them 100% of the time, and would certainly prefer they not get removed.

tfenne · 2022-03-02T12:09:37Z

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

+      } else { // output order and order after filtering do not match, we need to re-sort the output
+        SamOrder.values.find { order =>
+          order.sortOrder == outputSortOrder && order.groupOrder == outputGroupOrder && order.subSort == outputSubSort
+        }.orElse {
+          // this can only happen if the input order is unrecognized
+          throw new IllegalArgumentException(
+            s"The input BAM had an unrecognized sort order (SO:$inSortOrder GO:$inGroupOrder SS: $inSubSort)" +
+            s"\nTry re-running with --sort-order for a supported output order." in $input
+          )
+        }
+      }


Can't this whole else block now just be this.sortOrder? I think your intent is:

If this.sortOrder is None, and input is query sorted or grouped, no sorting occurs anywhere and output == input

Else if this.sortOrder is None, and input is not query sorted or grouped, sorting to queryname occurs, and output is in SamOrder.Queryname

Else output will be in this.sortOrder and sorting will occur only if the afterFiltering ordering is different than this.sortOrder.

In this else branch we've confirmed that afterFiltering and output orderings are different, and that can only occur if this.sortOrder is defined, so we can just use it?

tfenne · 2022-03-02T12:11:23Z

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala

@@ -114,7 +118,9 @@ class FilterConsensusReads
  @arg(flag='q', doc="The minimum mean base quality across the consensus read.")
  val minMeanBaseQuality: Option[PhredScore] = None,
  @arg(flag='s', doc="Mask (make `N`) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only).")
-  val requireSingleStrandAgreement: Boolean = false
+  val requireSingleStrandAgreement: Boolean = false,


I can't comment on the tool usage, but you seemed to miss my top-level review comment last time, so I'm replicating it here:

Top level usage on FilterConsensusReads needs updating. The second to last paragraph talks about sorting, which is now incorrect. And it should mention the need to load the whole genome into memory in addition to potentially needing memory for sorting.

nh13 requested a review from tfenne February 21, 2022 05:30

nh13 commented Feb 21, 2022

View reviewed changes

nh13 added 2 commits February 26, 2022 08:53

Specify an output sort order in FilterConsensusReads

610b100

Also ReferenceSequenceIterator closes the underlying reference file

simplify

a5a150a

nh13 force-pushed the feature/filter-consensus-reads-load-ref branch from 0103932 to a5a150a Compare February 26, 2022 15:54

nh13 changed the base branch from feature/group-reads-by-umi-threading to master February 26, 2022 15:54

nh13 marked this pull request as ready for review February 26, 2022 15:55

nh13 commented Feb 26, 2022

View reviewed changes

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala Outdated Show resolved Hide resolved

nh13 commented Feb 26, 2022

View reviewed changes

src/main/scala/com/fulcrumgenomics/umi/FilterConsensusReads.scala Outdated Show resolved Hide resolved

Apply suggestions from code review

1129b9a

tfenne self-assigned this Feb 28, 2022

nh13 added the 2.0 label Feb 28, 2022

tfenne requested changes Mar 1, 2022

View reviewed changes

tfenne assigned nh13 and unassigned tfenne Mar 1, 2022

nh13 added 2 commits March 1, 2022 17:15

fixups

0cd6d67

error message fixup

d495fd4

nh13 assigned tfenne and unassigned nh13 Mar 2, 2022

tfenne approved these changes Mar 2, 2022

View reviewed changes

tfenne assigned nh13 and unassigned tfenne Mar 2, 2022

nh13 added 2 commits March 2, 2022 15:48

fixup

7ccc2b7

usage

0c8b5fb

tfenne approved these changes Mar 2, 2022

View reviewed changes

nh13 merged commit 6e0a53d into main Mar 3, 2022

nh13 deleted the feature/filter-consensus-reads-load-ref branch March 3, 2022 03:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specify an output sort order in FilterConsensusReads #782

Specify an output sort order in FilterConsensusReads #782

nh13 commented Feb 21, 2022

nh13 Feb 21, 2022

codecov-commenter commented Feb 26, 2022 •

edited

Loading

tfenne left a comment

tfenne Mar 1, 2022

tfenne Mar 1, 2022

nh13 Mar 1, 2022

tfenne Mar 1, 2022

tfenne Mar 1, 2022

tfenne Mar 1, 2022

tfenne left a comment

tfenne Mar 2, 2022

tfenne Mar 2, 2022

tfenne Mar 2, 2022

		logger.info(f"Output $keptReads%,d of $totalReads%,d primary consensus reads.")
		logger.info(f"Masked $maskedBases%,d of $totalBases%,d bases in retained primary consensus reads.")

Specify an output sort order in FilterConsensusReads #782

Specify an output sort order in FilterConsensusReads #782

Conversation

nh13 commented Feb 21, 2022

Choose a reason for hiding this comment

codecov-commenter commented Feb 26, 2022 • edited Loading

Codecov Report

tfenne left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tfenne left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented Feb 26, 2022 •

edited

Loading