Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplified KBestHaplotypeFinder by replacing recursion with Dijkstra's algorithm #5462

Merged
merged 1 commit into from
Dec 27, 2018

Conversation

davidbenjamin
Copy link
Contributor

Closes #3561.

@vruano could you review this? Note that I ended up implementing Dijkstra's algorithm instead of using a library, but it's only a few lines of code. This PR does not affect the outputs of HC or M2 at all.

Also, @vruano, I recall your misgivings about the current haplotype enumeration (which this preserves):

However, the current algorithm and the k-dijkstra still would show the same problems in terms of doing a suboptimal selection of haplotypes in terms of their coverage of plausible variation. I had implemented an alternative that fixed that issue . . . simulate haplotypes based on those same furcation likelihoods and wstop when we have not discovered anything new for a while... the problem of such an approach is to make it deterministic.

Although this PR doesn't do that, it could easily be extended to do so just by running Dijkstra's algorithm until you have the amount of variation you want. That is, instead of terminating when the Dijkstra priority queue is empty or when we have discovered the maximum number of haplotypes, we could terminate based on some Predicate<List<KBestHaplotype>> on the list of haplotypes found so far. And it's deterministic since Dijkstra's algorithm is greedy.

So basically, it's a nice refactoring for now but it also sets up some worthwhile extensions if we want.

final List<KBestSubHaplotypeFinder> sourceFinders = new ArrayList<>(sources.size());
for (final SeqVertex source : sources) {
sourceFinders.add(createVertexFinder(source));
public List<KBestHaplotype> findBestHaplotypes(final int maxNumberOfHaplotypes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok... perhaps you can mention that this is described in https://en.wikipedia.org/wiki/K_shortest_path_routing. I now wonder what is the advantages in O() vs the previous solution... though it would be roughly the same but perhaps I was mistaken... that article says that this alg is O(m + n log n + k)... The dinamamic programming one seems to be O(m + n x log x + k) where x is = the number of outgoing edges from a vertex.... ~ (m / n). I would argument hat x is normally quite low like 1-3 and so O(x log x) is effective O(1)... that said perhaps the constants are much larger in the previous solution and in practice this solution is faster also is clearly much simpler. I bet the contribution of this part to the overall HC cost is so low that its not worth the difference anyway so i think that regardless of the O(.) your solution stands just based on the merits of simplicity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. Done.


final Path<SeqVertex,BaseEdge> refPath = bestPathFinder.get(0).path();
final Path<SeqVertex,BaseEdge> altPath = bestPathFinder.get(1).path();
final List<KBestHaplotype>bestPaths = new KBestHaplotypeFinder(graph,top,bot).findBestHaplotypes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing blank space just before bestPahts =?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@vruano
Copy link
Contributor

vruano commented Dec 24, 2018

No need to change anything really, go ahead and merge once conflicts are addressed and tests pass.

@vruano vruano assigned davidbenjamin and unassigned vruano Dec 24, 2018
@davidbenjamin davidbenjamin force-pushed the db_dijkstra branch 2 times, most recently from afd69ab to 5e74f30 Compare December 27, 2018 01:45
@codecov-io
Copy link

Codecov Report

Merging #5462 into master will increase coverage by 0.012%.
The diff coverage is 86.42%.

@@               Coverage Diff               @@
##              master     #5462       +/-   ##
===============================================
+ Coverage     87.075%   87.087%   +0.012%     
+ Complexity     31334     31225      -109     
===============================================
  Files           1921      1915        -6     
  Lines         144602    144079      -523     
  Branches       15951     15891       -60     
===============================================
- Hits          125912    125474      -438     
+ Misses         12896     12834       -62     
+ Partials        5794      5771       -23
Impacted Files Coverage Δ Complexity Δ
...s/walkers/haplotypecaller/graphs/PathUnitTest.java 93.258% <ø> (-0.22%) 7 <0> (ø)
...rs/haplotypecaller/graphs/AdaptiveChainPruner.java 95.349% <100%> (ø) 16 <0> (ø) ⬇️
...ller/readthreading/ReadThreadingGraphUnitTest.java 95.238% <100%> (+0.018%) 55 <0> (ø) ⬇️
...rs/haplotypecaller/graphs/ChainPrunerUnitTest.java 99.194% <100%> (-0.006%) 40 <0> (ø)
...der/tools/walkers/haplotypecaller/graphs/Path.java 96.491% <100%> (+1.33%) 24 <1> (-2) ⬇️
...walkers/haplotypecaller/graphs/KBestHaplotype.java 100% <100%> (+13.514%) 5 <5> (-8) ⬇️
...pecaller/readthreading/ReadThreadingAssembler.java 67.578% <33.333%> (-0.961%) 51 <0> (ø)
...r/graphs/SharedVertexSequenceSplitterUnitTest.java 94.545% <66.667%> (-4.35%) 33 <3> (-5)
.../readthreading/ReadThreadingAssemblerUnitTest.java 98.712% <66.667%> (ø) 38 <0> (ø) ⬇️
...otypecaller/graphs/CommonSuffixMergerUnitTest.java 94.286% <85.714%> (+1.603%) 20 <3> (-1) ⬇️
... and 9 more

@davidbenjamin davidbenjamin merged commit fef36e3 into master Dec 27, 2018
@davidbenjamin davidbenjamin deleted the db_dijkstra branch December 27, 2018 03:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use Dijkstra's algorithm for finding best haplotypes from assembly graph
3 participants