-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alignment with ksw2 and WFA2-lib #242
Conversation
This is great! I will also run a benchmark on the variant calling I did in the paper to see how the new extension perform once this branch is ready to be tested. Piecewise alignment is definitely the way to go. Just to think a bit ahead of a possible limitation with respect to my naive semi-global alignment: for the longer reads (say 250 and above) I think the benefit of my naive semi-global alignment was that it could easily span a insertion (in the query) of 50-100 nucleotides without breaking the alignment (our indel calling results were better than any other aligner on simulated data with ground truth). I am wondering how the piecewise approach will behave here. Starting to extend with ksw2 at the 20nt long NAM start/end sites won't make the score increase enough before the alignment hits the long indel, and would therefore be dropped by ksw2. I predict we will se a lot more long softclipps of reads not making it over the indels with our piecewise solution (which may also affect indel calling). This is at least what I saw with my (bug-prone) attempt at piecewise alignment. If this turns out to be a problem. I am wondering if this could be solved (i.e. mimicking semi global alignment) with increasing start score of ksw2 obtained from the alignment score of the middle part? Alternatively, I am thinking if a solution would be simply to look for NAMs on 'each side' of the long indel? (And then we are getting dangerously close to a solution for split(supplementary) alignments. |
Finding long indels in the extensions may be a matter of tuning the scoring function. ksw2 supports a dual gap cost function that is supposed to help retain long gaps, maybe we could switch that on. Actually, now that I think about it, if we use the same scoring function as before, then we should actually be able to span the same indels as before. But there’re probably cases I haven’t thought about. I just did some preliminary benchmarks for this PR and noticed that accuracy is slightly lower than before (~0.02 percentage points). This may be due to soft clipping, which happens more often now. Many alignments change from something like this: |
Right, the dual gap cost should offer an improvement over our old alignment scoring (or at least make up for not performing semi-global). I am not too confident in my opinions on this matter though. Probably because I have not build up enough intuition of the extension step yet. I would have to resort to test cases for this matter.
Okay, it will probably affect SNP calling as well. However, I think this issue is automatically taken care of if we first try hamming_align (which will put 4=1X95=). But perhaps you intended to fix this in itself first without 'masking' the issue with hamming distance? |
I added back the |
Here’s a summary of my findings:
Accuracy
SpeedThis was run on Drosophila only, reporting minimum runtime of three runs.
Other notes:
I’d like to look into what happens to some of the reads that now get aligned incorrectly, maybe we can avoid the drop in accuracy. |
To me this looks like great news! Although I would have to rerun the SNP/indel calling pipeline to know how/if it deteriorates around indels. Two notes (pulling in different directions):
Also, it could be worth a short discussion (at office) on how this partitioned alignment work. Perhaps after I get result from the indel calling benchmark. Would you like me to start the SNP/indel benchmarks for this commit? (or tell me otherwise when you believe you have something ready to test). |
Here’s one read ( The CIGAR changes from There are six repeats of Alignment before:
Alignment after:
When we use SSW, we can recover from the incorrect NAM start position because the entire alignment is re-done, but for extension alignment, we assume that the NAM start and end positions are correct, so it just stays that way. One remedy could be to detect this situation and then use SSW like before. |
I’m still thinking about the accuracy issue, but since it affects only 100 out of 1 million reads, it won’t make a difference even if we fix this, so go ahead. |
I looked at a second problematic read and the misalignment was also caused by repetitive sequence in the beginning ( There’s probably bigger potential for improved accuracy elsewhere, so instead of focusing on this rare corner case, I’ll leave this be for the moment and think about other things. |
Ok, I will try to start the benchmarks tomorrow.
I see, so the deletion is coming from the internal 'global' WFA alignment (i.e., not from the KSW ends)?
Ok, sounds reasonable! |
Makes sense that the accuracy decrease seems slightly larger for CHM which probably has more STRs due to the teleomer regions. |
I have updated the PR to not update the vendored SSW version. More recent SSW versions contain a fix for a segmentation fault that sometimes happens. Instead of segfaulting, a flag is set that signals that the alignment fails. The flag is not yet exposed in the C++ wrapper, so if the flag is set, the C++ This should fix #245. |
We need the aln_info available in a separate header file. It makes most sense to arrange things so that everything that produces aln_info objects is in aligner.cpp/.hpp
Instead, use aln_info.query_start and aln_info.query_end in the caller to get the same information (this is also more consistent with how the various other data structures represent local alignments).
This was the edit distance plus total number of soft-clipped bases. We can instead use the query_start and query_end attributes to compute how many bases are soft clipped.
Delete length and replace it with a ref_span() method.
At least not the way I expected it, so we just reverse the CIGAR ourselves.
This allows us to avoid parsing the CIGAR string in Aligner::align.
Yes, exactly. |
Ok to close this in favor of #251? |
yes agree to close |
(I just noticed this is not quite in a working state due to some linking problems, but I’ll add this here anyway.)
This replaces the local alignment with SSW with a hybrid approach where
There are also a lot of restructuring commits in here to make the way the various alignment functions are called more consistent with each other.
This needs some optimizations. In particular, I temporarily removed the optimization of trying a gapfree alignment first (with
hamming_align
). However, the speed is roughly the same as before, in particular when setting match scores to 0 (which makes WFA2 faster).To Do
libwfa2.so
not found – switch from dynamic to static linking)hamming_align
optimization