-
-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speeding up the duplex consensus caller #493
Conversation
Codecov Report
@@ Coverage Diff @@
## master #493 +/- ##
=========================================
+ Coverage 95.59% 95.6% +0.01%
=========================================
Files 92 91 -1
Lines 5448 5512 +64
Branches 702 691 -11
=========================================
+ Hits 5208 5270 +62
- Misses 240 242 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of comments. In general I'm on board, but there are a couple of places where the optimized implementation is much less readable than the original. In those places I'd really like to either a) revert the implementation, b) find a more readable optimized version, or c) be convinced that the trade-off is worth it.
src/main/scala/com/fulcrumgenomics/umi/CallDuplexConsensusReads.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/fulcrumgenomics/umi/CallDuplexConsensusReads.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/fulcrumgenomics/umi/ConsensusCallingIterator.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/fulcrumgenomics/umi/UmiConsensusCaller.scala
Outdated
Show resolved
Hide resolved
@@ -313,36 +320,33 @@ trait UmiConsensusCaller[C <: SimpleRead] { | |||
* NOTE: filtered out reads are sent to the [[rejectRecords]] method and do not need further handling | |||
*/ | |||
protected[umi] def filterToMostCommonAlignment(recs: Seq[SourceRead]): Seq[SourceRead] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this new implementation impossible to follow. I've literally been staring at it side-by-side in IntelliJ with the old implementation for ~10 minutes and I still struggle to follow it. Two thoughts:
- Unless this makes a very significant different I would just revert to the old method that I think was easier to follow.
- If it does make a significant difference, then I think we need to find a clearer implementation. I wonder if something like the following would work:
case class AlignmentGroup(cigar: Cigar, reads: mutable.Buffer[SourceRead])
protected[umi] def filterToMostCommonAlignment(recs: Seq[SourceRead]): Seq[SourceRead] = {
val groups = new ArrayBuffer[AlignmentGroup]
recs.sortBy(r => -r.length).foreach { rec =>
val simpleCigar = simplifyCigar(rec.cigar)
var found = false
groups.foreach { g => if (simpleCigar.isPrefixOf(g.cigar) { g.reads += rec; found = true } }
if (!found) {
val newGroup = AlignmentGroup(simpleCigar, new ArrayBuffer[SourceRead](recs.size))
newGroup += rec
groups += newGroup
}
}
if (groups.isEmpty) {
Seq.empty
}
else {
val sorted = groups.sortBy(g => - g.size)
val keepers = sorted.head
val rejects = recs.filter(r => !keepers.contains(r))
rejectRecords(rejects.flatMap(_.sam), FilterMinorityAlignment)
keepers
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with the previous and your implementation is the keepers.contains
method can be really really slow if we have many raw reads with the same cigar (think 20 cigar groups, each with 1000 reads). This is the point of optimizing, and it really does make a difference. I'll try to clean it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I'm curious if you tried either a) replacing the recs.filter
with a recs.diff(keepers)
or be creating a val keepSet = keepers.toSet
and then calling contains on that? You'd still pay the cost of the conversion to a set, but then the lookup time would be constant.
src/main/scala/com/fulcrumgenomics/umi/VanillaUmiConsensusCaller.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/fulcrumgenomics/umi/VanillaUmiConsensusCaller.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/fulcrumgenomics/umi/VanillaUmiConsensusCaller.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfenne I fixed up everything that I agreed with, and left some of your comments so we can discuss offline as I'd like more input, with all but one a good idea.
src/main/scala/com/fulcrumgenomics/umi/ConsensusCallingIterator.scala
Outdated
Show resolved
Hide resolved
} | ||
else { | ||
val pool = new ForkJoinPool(threads, ForkJoinPool.defaultForkJoinWorkerThreadFactory, null, true) | ||
val bufferedIter = groupingIterator.bufferBetter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually tried this, but it slowed things down for some reason! You can see evidence of that via the unused import on line 29. I'd certainly welcome you helping figure out why.
src/main/scala/com/fulcrumgenomics/umi/DuplexConsensusCaller.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/fulcrumgenomics/umi/DuplexConsensusCaller.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/fulcrumgenomics/umi/VanillaUmiConsensusCaller.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/fulcrumgenomics/umi/VanillaUmiConsensusCaller.scala
Outdated
Show resolved
Hide resolved
@@ -313,36 +320,33 @@ trait UmiConsensusCaller[C <: SimpleRead] { | |||
* NOTE: filtered out reads are sent to the [[rejectRecords]] method and do not need further handling | |||
*/ | |||
protected[umi] def filterToMostCommonAlignment(recs: Seq[SourceRead]): Seq[SourceRead] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with the previous and your implementation is the keepers.contains
method can be really really slow if we have many raw reads with the same cigar (think 20 cigar groups, each with 1000 reads). This is the point of optimizing, and it really does make a difference. I'll try to clean it up.
Not faster in my hands and we already depend on apache math
4f3d545
to
2cf9d3f
Compare
Changes to CallDuplexConsensusReads: - added the --threads option to support multi-threading; 4-8 threads seems like a decent trade-off. - added the --max-reads-per-strand option, for when the per-molecule coverage is very high, thus causing the tool to run slowly. Consensus calling API Implemented many performance optmizations found during profiling for consensus calling. Notable examples include: - multi-threaded support in the consensus calling iterator; non-duplex consensus callers could support this in the future. - faster grouping of raw reads based on simplified cigars - caching of the expensive to retrieve per-read molecular identifier - caching some expensive to compute log-probabilities in the core consensus caller Both @nh13 and @tfenne contributed to this commit.
A TODO includes a reference to fulcrumgenomics/commons#51 |
* Add new options to CallDuplexConsensusReads See: fulcrumgenomics/fgbio#493
This can speed up the consensus caller by 1.5-2.15x for 2 and 4 threads respectively. Speeds up by 1.15x for a single thread. If there is very high per-molecule coverage, this can speed up by 20x or more (single threaded).