
Alignment with ksw2 and WFA2-lib #242

Closed · wants to merge 23 commits

Conversation

@marcelm (Collaborator) commented Feb 17, 2023

(I just noticed this is not quite in a working state due to some linking problems, but I’ll add this here anyway.)

This replaces the SSW-based local alignment with a hybrid approach where

  • WFA2-lib is used to globally align the NAM
  • ksw2 is used twice to extend the NAM towards the 5' and 3' ends of the read
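
The hybrid scheme above can be sketched in simplified form. This is a toy model, not the PR's actual code: the helper functions below are made-up stand-ins for WFA2-lib (global alignment of the NAM region) and ksw2 (X-drop extension of the flanks).

```python
# Toy sketch of piecewise alignment: global middle + two extensions.
# naive_global_score stands in for WFA2-lib, greedy_extend for ksw2.

def naive_global_score(a, b, match=2, mismatch=-8):
    # Stand-in for WFA2 global alignment: equal-length, gap-free scoring.
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def greedy_extend(query, ref, match=2, mismatch=-8, drop=10):
    # Stand-in for ksw2 extension: extend base by base, stop when the
    # score falls too far below the best seen so far (X-drop style).
    best = score = 0
    best_len = 0
    for i, (q, r) in enumerate(zip(query, ref), 1):
        score += match if q == r else mismatch
        if score > best:
            best, best_len = score, i
        if best - score > drop:
            break
    return best, best_len

def piecewise_align(read, ref, qs, qe, rs, re):
    # Global alignment of the NAM region...
    middle = naive_global_score(read[qs:qe], ref[rs:re])
    # ...then extend towards the 5' end (on reversed sequences)...
    left, _ = greedy_extend(read[:qs][::-1], ref[:rs][::-1])
    # ...and towards the 3' end.
    right, _ = greedy_extend(read[qe:], ref[re:])
    return left + middle + right
```

The key assumption, discussed further below, is that the NAM start and end coordinates are trusted; only the flanks outside the NAM are aligned freely.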

There are also many restructuring commits in here that make the various alignment functions more consistent in how they are called.

This still needs some optimizations. In particular, I temporarily removed the optimization of trying a gap-free alignment first (with hamming_align). Even so, the speed is roughly the same as before, particularly when setting the match score to 0 (which makes WFA2 faster).

To Do

  • Fix the linking problem (libwfa2.so not found – switch from dynamic to static linking)
  • Put back the hamming_align optimization
  • Run benchmarks
  • Fix the macOS build error
  • Replace SSW in rescue alignment with ksw2

@ksahlin (Owner) commented Feb 17, 2023

This is great! Once this branch is ready to be tested, I will also rerun the variant-calling benchmark from the paper to see how the new extension performs.

Piecewise alignment is definitely the way to go. Just to think a bit ahead about a possible limitation relative to my naive semi-global alignment: for longer reads (say 250 bp and above), I think the benefit of the naive semi-global alignment was that it could easily span an insertion (in the query) of 50–100 nucleotides without breaking the alignment (our indel-calling results were better than any other aligner's on simulated data with ground truth).

I am wondering how the piecewise approach will behave here. Starting to extend with ksw2 at the 20 nt long NAM start/end sites won't increase the score enough before the alignment hits the long indel, so the extension would be dropped by ksw2. I predict we will see a lot more long soft clips from reads not making it over the indels with our piecewise solution (which may also affect indel calling). This is at least what I saw with my (bug-prone) attempt at piecewise alignment.

If this turns out to be a problem, I am wondering whether it could be solved (i.e., mimicking semi-global alignment) by increasing the start score of ksw2 using the alignment score of the middle part.

Alternatively, I am wondering whether a solution would simply be to look for NAMs on 'each side' of the long indel? (And then we are getting dangerously close to a solution for split (supplementary) alignments.)

@marcelm (Collaborator, Author) commented Feb 17, 2023

Finding long indels in the extensions may be a matter of tuning the scoring function. ksw2 supports a dual gap cost function that is supposed to help retain long gaps; maybe we could switch that on. Actually, now that I think about it, if we use the same scoring function as before, then we should be able to span the same indels as before. But there are probably cases I haven't thought about.
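
The dual (two-piece) gap cost can be illustrated with toy numbers: a gap of length k costs min(q + k·e, q2 + k·e2), so with a steep-extend first piece and a cheap-extend second piece, long gaps switch to the cheaper branch and are penalized much less harshly than under a single affine function. The penalty values below are illustrative, not strobealign's or ksw2's defaults.

```python
# Two-piece affine gap cost as used by ksw2's extd variants:
# cost(k) = min(q + k*e, q2 + k*e2). Parameter values are illustrative.

def dual_gap_cost(k, q=12, e=2, q2=32, e2=1):
    return min(q + k * e, q2 + k * e2)

short_gap = dual_gap_cost(3)    # first branch wins: 12 + 3*2 = 18
long_gap = dual_gap_cost(60)    # second branch wins: 32 + 60*1 = 92
single_affine = 12 + 60 * 2     # same 60 bp gap under one affine piece: 132
```

A 60 bp deletion thus costs 92 instead of 132 here, which is exactly the "retain long gaps" effect mentioned above.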

I just did some preliminary benchmarks for this PR and noticed that accuracy is slightly lower than before (~0.02 percentage points). This may be due to soft clipping, which happens more often now. Many alignments change from something like this: 4=1X95= to this: 5S95=. Using the current scores, any mismatch 4 bases away from the end of the read is going to be soft clipped. BWA-MEM has a penalty for soft clipping to avoid this, and ksw2 appears to have a way to give a bonus to alignments that reach the end of the query, which would have the same effect. This is related to #54.
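
The clipping trade-off can be made concrete with a toy calculation, assuming a match score of +2 and a mismatch penalty of 8 (illustrative values; strobealign's actual parameters may differ). Without a clip penalty or end bonus, soft-clipping the end contributes 0 to the score, so the aligner keeps the end only if the matches there outweigh the mismatch.

```python
# Toy calculation: keep the read end (d matches + 1 mismatch) or clip it?
# Scores are illustrative: match = +2, mismatch = -8.

def end_score(d, match=2, mismatch=-8):
    # Extra score from aligning through the end: d matches plus the one
    # mismatch sitting d bases from the end of the read.
    return d * match + mismatch

# Clipping those d+1 bases contributes 0 (no clip penalty, no end bonus).
decisions = {
    d: "clip" if end_score(d) < 0 else "tie" if end_score(d) == 0 else "keep"
    for d in (3, 4, 5)
}
```

With these values the breakeven is exactly at d = 4, matching the 4=1X95= vs 5S95= example: the two alignments tie, and anything closer to the end gets clipped.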

@ksahlin (Owner) commented Feb 17, 2023

> Finding long indels in the extensions may be a matter of tuning the scoring function. ksw2 supports a dual gap cost function that is supposed to help retain long gaps; maybe we could switch that on. Actually, now that I think about it, if we use the same scoring function as before, then we should be able to span the same indels as before. But there are probably cases I haven't thought about.

Right, the dual gap cost should offer an improvement over our old alignment scoring (or at least make up for not performing semi-global alignment). I am not too confident in my opinions on this matter, though, probably because I have not built up enough intuition about the extension step yet. I would have to resort to test cases here.

> I just did some preliminary benchmarks for this PR and noticed that accuracy is slightly lower than before (~0.02 percentage points). This may be due to soft clipping, which happens more often now. Many alignments change from something like this: 4=1X95= to this: 5S95=. Using the current scores, any mismatch 4 bases away from the end of the read is going to be soft clipped. BWA-MEM has a penalty for soft clipping to avoid this, and ksw2 appears to have a way to give a bonus to alignments that reach the end of the query, which would have the same effect. This is related to #54.

Okay, it will probably affect SNP calling as well. However, I think this issue is automatically taken care of if we first try hamming_align (which will produce 4=1X95=). But perhaps you intended to fix the issue itself first, without 'masking' it with the Hamming-distance path?

@marcelm (Collaborator, Author) commented Feb 21, 2023

I added back the hamming_align optimization and with this, mapping appears to be a little bit faster than before. I still need to run more thorough benchmarks.

@marcelm (Collaborator, Author) commented Feb 23, 2023

Here’s a summary of my findings:

  • Accuracy drops by 0.01 percentage points across all datasets (it was 0.02 before I added back the hamming_align optimization)
  • The speedup ranges from none at all for short reads to 1.5× for longer reads

Accuracy

| dataset | main | this PR | accuracy reduction |
|---|---|---|---|
| drosophila/2x50-1M | 90.0643 | 90.0591 | 0.0052 |
| drosophila/2x100-1M | 92.38105 | 92.37705 | 0.004 |
| drosophila/2x150-1M | 93.2202 | 93.21225 | 0.00795 |
| drosophila/2x200-1M | 93.506 | 93.4974 | 0.0086 |
| drosophila/2x300-1M | 95.3651 | 95.35735 | 0.00775 |
| CHM13/2x50-1M | 90.46475 | 90.4603 | 0.00445 |
| CHM13/2x100-1M | 93.22665 | 93.2179 | 0.00875 |
| CHM13/2x150-1M | 94.1315 | 94.11335 | 0.01815 |
| CHM13/2x200-1M | 94.41465 | 94.4028 | 0.01185 |
| CHM13/2x300-1M | 95.62665 | 95.616 | 0.01065 |
| maize/2x50-1M | 71.1893 | 71.1803 | 0.009 |
| maize/2x100-1M | 87.1095 | 87.10545 | 0.00405 |
| maize/2x200-1M | 92.91535 | 92.9083 | 0.00705 |
| maize/2x300-1M | 96.71805 | 96.7116 | 0.00645 |

Speed

This was run on Drosophila only, reporting minimum runtime of three runs.

| dataset | runtime before (s) | runtime after (s) | speedup |
|---|---|---|---|
| 2x50 bp | 49.660 | 49.298 | 1.01 |
| 2x100 bp | 64.250 | 60.817 | 1.06 |
| 2x150 bp | 78.750 | 67.922 | 1.16 |
| 2x200 bp | 100.810 | 78.441 | 1.29 |
| 2x300 bp | 199.235 | 131.731 | 1.51 |

Other notes:

  • Alignment is not a bottleneck for short reads. The percentage of total runtime spent on it in main was ca. 19% (2x100 bp reads), so this limits how much total speedup we can have by making alignment faster. (This PR reduces the percentage to 14%).
  • For 2x200 bp reads, time spent on alignment drops from ~32% to ~9%

I’d like to look into what happens to some of the reads that now get aligned incorrectly, maybe we can avoid the drop in accuracy.

@ksahlin (Owner) commented Feb 23, 2023

To me this looks like great news! Although I would have to rerun the SNP/indel calling pipeline to know how/if it deteriorates around indels. Two notes (pulling in different directions):

  1. I think extension time takes up a larger fraction of the time on smaller genomes, so the relative speedup will not be as high on larger genomes (for simulated data).
  2. However, alignment is invoked less often for simulated reads than for biological datasets (for some reason), as seen in Figure 2C, so I suspect the speedup will be more pronounced on biological data.

Also, it could be worth a short discussion (at the office) on how this partitioned alignment works, perhaps after I get results from the indel-calling benchmark.

Would you like me to start the SNP/indel benchmarks for this commit? (or tell me otherwise when you believe you have something ready to test).

@marcelm (Collaborator, Author) commented Feb 23, 2023

Here’s one read (simulated.1612 in the D. melanogaster 2x100 bp dataset) that gets aligned incorrectly with the new method. We can perhaps discuss this in person, but I’ll write it down here anyway so I remember.

The CIGAR changes from 100= to 33=8D67= and the read maps to a position 8 bases earlier than would be correct.

There are six repeats of GAGCTGGT at the beginning of the read, except that one of the occurrences has two bases changed (marked with ^). On the reference, there is another occurrence preceding the others. The NAM starts in that additional occurrence, so it is one repeat "too early" (marked with NAM below).

Alignment before:

  NAM
GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT GGGCTAGT GAGCTGGT CAAAGCCAAAAGCCAGCCTTCACTTCACTTGACCGCCAGCCAGTGATGTTGC ref
         ======== ======== ======== ======== ======== ======== ====================================================
         GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT GGGCTAGT GAGCTGGT CAAAGCCAAAAGCCAGCCTTCACTTCACTTGACCGCCAGCCAGTGATGTTGC query
           NAM                                ^   ^

Alignment after:

  NAM
GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT GGGCTAGT GAGCTGGT CAAAGCCAAAAGCCAGCCTTCACTTCACTTGACCGCCAGCCAGTGATGTTGC ref
======== ======== ======== ======== =DDDDDDD D======= ======== ====================================================
GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT G------- -GGCTAGT GAGCTGGT CAAAGCCAAAAGCCAGCCTTCACTTCACTTGACCGCCAGCCAGTGATGTTGC query
  NAM

When we use SSW, we can recover from the incorrect NAM start position because the entire alignment is re-done, but for extension alignment, we assume that the NAM start and end positions are correct, so it just stays that way.

One remedy could be to detect this situation and then use SSW like before.
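
A hypothetical version of that detection (made-up heuristic and thresholds, not the PR's code) could flag piecewise results whose CIGAR contains a long indel close to a NAM boundary and rerun just those reads with full local alignment:

```python
# Hypothetical fallback heuristic: distrust a piecewise alignment whose
# CIGAR has a long indel near either end of the NAM-anchored region, and
# redo such reads with SSW. Thresholds are made up for illustration.

import re

def has_boundary_indel(cigar, max_ops_from_edge=2, min_indel=5):
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    for i, (length, op) in enumerate(ops):
        near_edge = i < max_ops_from_edge or i >= len(ops) - max_ops_from_edge
        if op in "ID" and int(length) >= min_indel and near_edge:
            return True
    return False
```

For the example read above, has_boundary_indel("33=8D67=") would trigger the fallback, while an alignment like 100= would not.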

@marcelm (Collaborator, Author) commented Feb 23, 2023

> Would you like me to start the SNP/indel benchmarks for this commit? (or tell me otherwise when you believe you have something ready to test).

I’m still thinking about the accuracy issue, but since it affects only 100 out of 1 million reads, it won’t make a noticeable difference even if we fix it, so go ahead.

@marcelm (Collaborator, Author) commented Feb 23, 2023

I looked at a second problematic read and the misalignment was also caused by repetitive sequence in the beginning (GTGTGT...). I think in both cases, this is some kind of "edge effect" from the strobemers being generated differently in the read and the reference. I think it looks like this: In the reference, there is a sequence of syncmers k1, k2, k3 with k1=k2, and the randstrobe is built from k1 and k3. The read starts after k1 and the randstrobe therefore pairs up k2 and k3.
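
The edge effect can be illustrated with a heavily simplified model (this is not strobealign's actual syncmer/randstrobe code; sequences and positions are made up). The randstrobe is modeled as the pair (first syncmer, downstream syncmer), keyed by its sequence content and anchored at the first syncmer's position. When the read starts after k1 and k1 equals k2, the reference strobe (k1, k3) and the read strobe (k2, k3) have identical content but different anchors, so the seed match places the read one repeat too early.

```python
# Toy model of the randstrobe edge effect. Syncmers are (sequence, position)
# pairs; the strobe key is the concatenated sequence content, anchored at
# the first syncmer. All sequences/positions are illustrative.

ref_syncmers = [("GAGCTGGT", 0), ("GAGCTGGT", 8), ("CAAAGCCA", 48)]
read_syncmers = ref_syncmers[1:]        # the read starts after k1

def randstrobe(syncmers):
    (s1, p1), (s2, _) = syncmers[0], syncmers[-1]
    return s1 + s2, p1                  # (sequence key, anchor position)

ref_strobe = randstrobe(ref_syncmers)   # built from k1 and k3, anchor 0
read_strobe = randstrobe(read_syncmers) # built from k2 and k3, anchor 8
# Identical keys with different anchors: the match reports the wrong offset.
```

Because k1 == k2, the two keys collide even though the anchors are 8 bases apart, which is exactly the off-by-one-repeat NAM start seen in the example.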

There’s probably bigger potential for improved accuracy elsewhere, so instead of focusing on this rare corner case, I’ll leave this be for the moment and think about other things.

@ksahlin (Owner) commented Feb 23, 2023

> […] it won’t make a difference even if we fix this, so go ahead.

Ok, I will try to start the benchmarks tomorrow.

> When we use SSW, we can recover from the incorrect NAM start position because the entire alignment is re-done, but for extension alignment, we assume that the NAM start and end positions are correct, so it just stays that way.

I see, so the deletion is coming from the internal 'global' WFA alignment (i.e., not from the KSW ends)?

> There’s probably bigger potential for improved accuracy elsewhere, so instead of focusing on this rare corner case, I’ll leave this be for the moment and think about other things.

Ok, sounds reasonable!

@ksahlin (Owner) commented Feb 23, 2023

Makes sense that the accuracy decrease seems slightly larger for CHM13, which probably has more STRs due to the telomere regions.

@marcelm (Collaborator, Author) commented Feb 24, 2023

I have updated the PR so that it no longer updates the vendored SSW version. More recent SSW versions contain a fix for a segmentation fault that sometimes happens: instead of segfaulting, a flag is set that signals that the alignment failed. The flag is not yet exposed in the C++ wrapper, so if it is set, the C++ Align() method just returns garbage as the CIGAR string. For some reason, the segfault does not get triggered with your custom modification that sets score_size to 1.

This should fix #245.

Commit messages (from the PR timeline):

  • We need the aln_info available in a separate header file. It makes most sense to arrange things so that everything that produces aln_info objects is in aligner.cpp/.hpp.
  • Instead, use aln_info.query_start and aln_info.query_end in the caller to get the same information (this is also more consistent with how the various other data structures represent local alignments).
  • This was the edit distance plus the total number of soft-clipped bases. We can instead use the query_start and query_end attributes to compute how many bases are soft-clipped.
  • Delete length and replace it with a ref_span() method.
@marcelm (Collaborator, Author) commented Feb 24, 2023

> I see, so the deletion is coming from the internal 'global' WFA alignment (i.e., not from the KSW ends)?

Yes, exactly.

@marcelm (Collaborator, Author) commented Apr 4, 2023

Ok to close this in favor of #251?

@ksahlin (Owner) commented Apr 4, 2023

Yes, I agree to close.

marcelm closed this Apr 4, 2023