Speeding up the duplex consensus caller #493

Merged
merged 3 commits into from
Jul 24, 2019

Conversation

nh13
Member

@nh13 nh13 commented Jun 21, 2019

This can speed up the consensus caller by 1.5x with 2 threads and 2.15x with 4 threads. With a single thread the speed-up is 1.15x. If there is very high per-molecule coverage, the speed-up can be 20x or more (single-threaded).

@codecov-io

codecov-io commented Jul 5, 2019

Codecov Report

Merging #493 into master will increase coverage by 0.01%.
The diff coverage is 91.24%.

@@            Coverage Diff            @@
##           master    #493      +/-   ##
=========================================
+ Coverage   95.59%   95.6%   +0.01%     
=========================================
  Files          92      91       -1     
  Lines        5448    5512      +64     
  Branches      702     691      -11     
=========================================
+ Hits         5208    5270      +62     
- Misses        240     242       +2
Impacted Files Coverage Δ
.../scala/com/fulcrumgenomics/util/NumericTypes.scala 95.94% <ø> (ø) ⬆️
...cala/com/fulcrumgenomics/umi/ConsensusCaller.scala 94.44% <100%> (+0.1%) ⬆️
...crumgenomics/umi/CallMolecularConsensusReads.scala 100% <100%> (ø) ⬆️
.../scala/com/fulcrumgenomics/bam/api/SamRecord.scala 86.55% <100%> (+1.08%) ⬆️
...om/fulcrumgenomics/umi/DuplexConsensusCaller.scala 96% <100%> (+0.72%) ⬆️
...fulcrumgenomics/umi/CallDuplexConsensusReads.scala 100% <100%> (ø) ⬆️
...fulcrumgenomics/umi/ConsensusCallingIterator.scala 100% <100%> (+5.88%) ⬆️
...ulcrumgenomics/umi/VanillaUmiConsensusCaller.scala 89.77% <12.5%> (-5.53%) ⬇️
...a/com/fulcrumgenomics/umi/UmiConsensusCaller.scala 93.93% <88.23%> (-0.47%) ⬇️
...om/fulcrumgenomics/umi/SimpleConsensusCaller.scala 96% <88.88%> (+0.76%) ⬆️
... and 1 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4f2d06...da69e3c. Read the comment docs.

@nh13 nh13 requested a review from tfenne July 8, 2019 21:29
@nh13 nh13 marked this pull request as ready for review July 8, 2019 21:30
Member

@tfenne tfenne left a comment

Lots of comments. In general I'm on board, but there are a couple of places where the optimized implementation is much less readable than the original. In those places I'd really like to either a) revert the implementation, b) find a more readable optimized version, or c) be convinced that the trade-off is worth it.

@@ -313,36 +320,33 @@ trait UmiConsensusCaller[C <: SimpleRead] {
* NOTE: filtered out reads are sent to the [[rejectRecords]] method and do not need further handling
*/
protected[umi] def filterToMostCommonAlignment(recs: Seq[SourceRead]): Seq[SourceRead] = {
Member

I find this new implementation impossible to follow. I've literally been staring at it side-by-side in IntelliJ with the old implementation for ~10 minutes and I still struggle to follow it. Two thoughts:

  1. Unless this makes a very significant difference, I would just revert to the old method, which I think was easier to follow.
  2. If it does make a significant difference, then I think we need to find a clearer implementation. I wonder if something like the following would work:
case class AlignmentGroup(cigar: Cigar, reads: mutable.Buffer[SourceRead])

protected[umi] def filterToMostCommonAlignment(recs: Seq[SourceRead]): Seq[SourceRead] = {
  val groups = new ArrayBuffer[AlignmentGroup]
  // Visit the longest reads first so that shorter reads can match as prefixes of existing groups
  recs.sortBy(r => -r.length).foreach { rec =>
    val simpleCigar = simplifyCigar(rec.cigar)
    var found = false
    groups.foreach { g => if (simpleCigar.isPrefixOf(g.cigar)) { g.reads += rec; found = true } }
    if (!found) {
      val newGroup = AlignmentGroup(simpleCigar, new ArrayBuffer[SourceRead](recs.size))
      newGroup.reads += rec
      groups += newGroup
    }
  }

  if (groups.isEmpty) {
    Seq.empty
  }
  else {
    // Keep the largest group and reject every read that is not in it
    val sorted  = groups.sortBy(g => -g.reads.size)
    val keepers = sorted.head.reads
    val rejects = recs.filter(r => !keepers.contains(r))
    rejectRecords(rejects.flatMap(_.sam), FilterMinorityAlignment)
    keepers
  }
}

Member Author

The problem with both the previous implementation and yours is that the keepers.contains call can be really slow when many raw reads share the same cigar (think 20 cigar groups, each with 1000 reads). That is the point of this optimization, and it really does make a difference. I'll try to clean it up.

Member

I see. I'm curious if you tried either a) replacing the recs.filter with a recs.diff(keepers), or b) creating a val keepSet = keepers.toSet and then calling contains on that? You'd still pay the cost of the conversion to a set, but then the lookup time would be constant.
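To make the two options in this comment concrete, here is a minimal sketch comparing the linear-scan filter with the set-based one. The `Read` case class is a hypothetical stand-in for `SourceRead`, and the object and method names are illustrative only; none of this is taken from the PR.

```scala
object KeeperFilterSketch {
  // Hypothetical stand-in for SourceRead; only equality and hashing matter here.
  final case class Read(id: Int, cigar: String)

  /** Linear-scan version: every `contains` walks the Seq, so the cost grows as recs.size * keepers.size. */
  def rejectsBySeq(recs: Seq[Read], keepers: Seq[Read]): Seq[Read] =
    recs.filter(r => !keepers.contains(r))

  /** Set version: one pass to build the set, then hash-based lookups for each record. */
  def rejectsBySet(recs: Seq[Read], keepers: Seq[Read]): Seq[Read] = {
    val keepSet = keepers.toSet
    recs.filterNot(keepSet.contains)
  }
}
```

Both versions return the same rejects. Whether the set version is faster in practice depends on how expensive it is to hash and compare the real record type, which this sketch does not model. Note also that `recs.diff(keepers)` is a multiset difference, so it can behave differently from the filter when a record appears more than once.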

Member Author

@nh13 nh13 left a comment

@tfenne I fixed up everything that I agreed with, and left some of your comments open so we can discuss them offline, as I'd like more input. All but one of your suggestions were good ideas.

}
else {
val pool = new ForkJoinPool(threads, ForkJoinPool.defaultForkJoinWorkerThreadFactory, null, true)
val bufferedIter = groupingIterator.bufferBetter
Member Author

I actually tried this, but it slowed things down for some reason! You can see evidence of that via the unused import on line 29. I'd certainly welcome you helping figure out why.
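For readers unfamiliar with the pattern in the snippet under review, the following is a rough sketch of processing an iterator in chunks on a `ForkJoinPool` while preserving input order. It assumes nothing about how the PR itself wires the pool to the grouping iterator; `parMap`, `chunkSize` and the object name are hypothetical, and only the `ForkJoinPool` constructor call mirrors the line shown above.

```scala
import java.util.concurrent.{Callable, ForkJoinPool}

object ParallelChunksSketch {
  /** Lazily applies `f` to each item of `iter`, submitting `chunkSize` items at a time
    * to `pool`. Results are returned in the same order as the input. */
  def parMap[A, B](iter: Iterator[A], pool: ForkJoinPool, chunkSize: Int)(f: A => B): Iterator[B] = {
    require(chunkSize > 0, "chunkSize must be positive")
    iter.grouped(chunkSize).flatMap { chunk =>
      // Submit the whole chunk first so that its items can run concurrently ...
      val tasks = chunk.map(a => pool.submit(new Callable[B] { def call(): B = f(a) }))
      // ... then wait for each result in submission order.
      tasks.map(_.get())
    }
  }
}
```

A caller would create the pool once, for example with `new ForkJoinPool(threads, ForkJoinPool.defaultForkJoinWorkerThreadFactory, null, true)`, consume the returned iterator, and then call `pool.shutdown()`. Larger chunks amortize scheduling overhead but hold more items in memory at once.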

@nh13 nh13 force-pushed the nh_par_cc branch 2 times, most recently from 4f3d545 to 2cf9d3f Compare July 23, 2019 19:38
Changes to CallDuplexConsensusReads:
- added the --threads option to support multi-threading; 4-8 threads
  seem like a decent trade-off.
- added the --max-reads-per-strand option, for when the per-molecule
  coverage is very high, thus causing the tool to run slowly.

Consensus calling API
Implemented many performance optimizations found during profiling for
consensus calling.  Notable examples include:
- multi-threaded support in the consensus calling iterator; non-duplex
  consensus callers could support this in the future.
- faster grouping of raw reads based on simplified cigars
- caching of the expensive to retrieve per-read molecular identifier
- caching some expensive to compute log-probabilities in the core
  consensus caller

Both @nh13 and @tfenne contributed to this commit.
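
The commit message mentions caching expensive log-probabilities but does not say which ones. As an illustration of the general idea only, the sketch below precomputes ln(1 - p_error) for every Phred score once and then serves lookups from an array. The object name, the `MaxPhred` bound of 93 and the choice of quantity are assumptions for this example, not details from the PR.

```scala
object LogProbCacheSketch {
  private val Ln10 = math.log(10.0)

  /** Assumed upper bound on Phred scores for this sketch. */
  val MaxPhred: Int = 93

  /** ln(P(error)) for a Phred score: Q = -10 * log10(p), so ln(p) = -Q / 10 * ln(10). */
  def lnErrorProb(phred: Int): Double = -phred / 10.0 * Ln10

  /** ln(1 - P(error)), computed directly; uses exp and log1p on every call. */
  def lnCorrectProb(phred: Int): Double = math.log1p(-math.exp(lnErrorProb(phred)))

  // Computed once at initialization, indexed by Phred score.
  private val lnCorrectTable: Array[Double] = Array.tabulate(MaxPhred + 1)(lnCorrectProb)

  /** Cached version: a single array lookup instead of exp + log1p. */
  def cachedLnCorrectProb(phred: Int): Double = lnCorrectTable(phred)
}
```

The table trades a small fixed amount of memory for avoiding repeated transcendental-function calls in an inner loop, which only pays off when the same inputs recur many times.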
@nh13
Member Author

nh13 commented Jul 23, 2019

A TODO includes a reference to fulcrumgenomics/commons#51

@nh13 nh13 merged commit 426a258 into master Jul 24, 2019
@nh13 nh13 deleted the nh_par_cc branch July 24, 2019 17:59
nh13 added a commit to fulcrumgenomics/dagr that referenced this pull request Jul 24, 2019
nh13 added a commit to fulcrumgenomics/dagr that referenced this pull request Jul 24, 2019
* Add new options to CallDuplexConsensusReads

See: fulcrumgenomics/fgbio#493