
Alignment with ksw2 and WFA2-lib #242

Closed · wants to merge 23 commits

Conversation

@marcelm (Collaborator) commented Feb 17, 2023

(I just noticed this is not quite in a working state due to some linking problems, but I’ll add this here anyway.)

This replaces the SSW-based local alignment with a hybrid approach where

  • WFA2-lib is used to globally align the NAM
  • ksw2 is used twice to extend the NAM towards the 5' and 3' ends of the read
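
The hybrid scheme above can be sketched in simplified form. This is a toy model, not the PR's actual code: the helper functions below are made-up stand-ins for WFA2-lib (global alignment of the NAM region) and ksw2 (X-drop extension of the flanks).

```python
# Toy sketch of piecewise alignment: global middle + two extensions.
# naive_global_score stands in for WFA2-lib, greedy_extend for ksw2.

def naive_global_score(a, b, match=2, mismatch=-8):
    # Stand-in for WFA2 global alignment: equal-length, gap-free scoring.
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def greedy_extend(query, ref, match=2, mismatch=-8, drop=10):
    # Stand-in for ksw2 extension: extend base by base, stop when the
    # score falls too far below the best seen so far (X-drop style).
    best = score = 0
    best_len = 0
    for i, (q, r) in enumerate(zip(query, ref), 1):
        score += match if q == r else mismatch
        if score > best:
            best, best_len = score, i
        if best - score > drop:
            break
    return best, best_len

def piecewise_align(read, ref, qs, qe, rs, re):
    # Global alignment of the NAM region...
    middle = naive_global_score(read[qs:qe], ref[rs:re])
    # ...then extend towards the 5' end (on reversed sequences)...
    left, _ = greedy_extend(read[:qs][::-1], ref[:rs][::-1])
    # ...and towards the 3' end.
    right, _ = greedy_extend(read[qe:], ref[re:])
    return left + middle + right
```

The key assumption, discussed further below, is that the NAM start and end coordinates are trusted; only the flanks outside the NAM are aligned freely.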

There are also many restructuring commits in here that make the various alignment functions more consistent in how they are called.

This still needs some optimizations. In particular, I temporarily removed the optimization of trying a gap-free alignment first (with hamming_align). Even so, the speed is roughly the same as before, particularly when setting the match score to 0 (which makes WFA2 faster).

To Do

  • Fix the linking problem (libwfa2.so not found – switch from dynamic to static linking)
  • Put back the hamming_align optimization
  • Run benchmarks
  • Fix the macOS build error
  • Replace SSW in rescue alignment with ksw2

@ksahlin (Owner) commented Feb 17, 2023

This is great! Once this branch is ready to be tested, I will also rerun the variant-calling benchmark from the paper to see how the new extension performs.

Piecewise alignment is definitely the way to go. Just to think a bit ahead about a possible limitation relative to my naive semi-global alignment: for longer reads (say 250 bp and above), I think the benefit of the naive semi-global alignment was that it could easily span an insertion (in the query) of 50–100 nucleotides without breaking the alignment (our indel-calling results were better than any other aligner's on simulated data with ground truth).

I am wondering how the piecewise approach will behave here. Starting to extend with ksw2 at the 20 nt long NAM start/end sites won't increase the score enough before the alignment hits the long indel, so the extension would be dropped by ksw2. I predict we will see a lot more long soft clips from reads not making it over the indels with our piecewise solution (which may also affect indel calling). This is at least what I saw with my (bug-prone) attempt at piecewise alignment.

If this turns out to be a problem, I am wondering whether it could be solved (i.e., mimicking semi-global alignment) by increasing the start score of ksw2 using the alignment score of the middle part.

Alternatively, I am wondering whether a solution would simply be to look for NAMs on 'each side' of the long indel? (And then we are getting dangerously close to a solution for split (supplementary) alignments.)

@marcelm (Collaborator, Author) commented Feb 17, 2023

Finding long indels in the extensions may be a matter of tuning the scoring function. ksw2 supports a dual gap cost function that is supposed to help retain long gaps; maybe we could switch that on. Actually, now that I think about it, if we use the same scoring function as before, then we should be able to span the same indels as before. But there are probably cases I haven't thought about.
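
The dual (two-piece) gap cost can be illustrated with toy numbers: a gap of length k costs min(q + k·e, q2 + k·e2), so with a steep-extend first piece and a cheap-extend second piece, long gaps switch to the cheaper branch and are penalized much less harshly than under a single affine function. The penalty values below are illustrative, not strobealign's or ksw2's defaults.

```python
# Two-piece affine gap cost as used by ksw2's extd variants:
# cost(k) = min(q + k*e, q2 + k*e2). Parameter values are illustrative.

def dual_gap_cost(k, q=12, e=2, q2=32, e2=1):
    return min(q + k * e, q2 + k * e2)

short_gap = dual_gap_cost(3)    # first branch wins: 12 + 3*2 = 18
long_gap = dual_gap_cost(60)    # second branch wins: 32 + 60*1 = 92
single_affine = 12 + 60 * 2     # same 60 bp gap under one affine piece: 132
```

A 60 bp deletion thus costs 92 instead of 132 here, which is exactly the "retain long gaps" effect mentioned above.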

I just did some preliminary benchmarks for this PR and noticed that accuracy is slightly lower than before (~0.02 percentage points). This may be due to soft clipping, which happens more often now. Many alignments change from something like this: 4=1X95= to this: 5S95=. Using the current scores, any mismatch 4 bases away from the end of the read is going to be soft clipped. BWA-MEM has a penalty for soft clipping to avoid this, and ksw2 appears to have a way to give a bonus to alignments that reach the end of the query, which would have the same effect. This is related to #54.
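
The clipping trade-off can be made concrete with a toy calculation, assuming a match score of +2 and a mismatch penalty of 8 (illustrative values; strobealign's actual parameters may differ). Without a clip penalty or end bonus, soft-clipping the end contributes 0 to the score, so the aligner keeps the end only if the matches there outweigh the mismatch.

```python
# Toy calculation: keep the read end (d matches + 1 mismatch) or clip it?
# Scores are illustrative: match = +2, mismatch = -8.

def end_score(d, match=2, mismatch=-8):
    # Extra score from aligning through the end: d matches plus the one
    # mismatch sitting d bases from the end of the read.
    return d * match + mismatch

# Clipping those d+1 bases contributes 0 (no clip penalty, no end bonus).
decisions = {
    d: "clip" if end_score(d) < 0 else "tie" if end_score(d) == 0 else "keep"
    for d in (3, 4, 5)
}
```

With these values the breakeven is exactly at d = 4, matching the 4=1X95= vs 5S95= example: the two alignments tie, and anything closer to the end gets clipped.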

@ksahlin (Owner) commented Feb 17, 2023

> Finding long indels in the extensions may be a matter of tuning the scoring function. ksw2 supports a dual gap cost function that is supposed to help retain long gaps; maybe we could switch that on. Actually, now that I think about it, if we use the same scoring function as before, then we should be able to span the same indels as before. But there are probably cases I haven't thought about.

Right, the dual gap cost should offer an improvement over our old alignment scoring (or at least make up for not performing semi-global alignment). I am not too confident in my opinions on this matter, though, probably because I have not built up enough intuition about the extension step yet. I would have to resort to test cases here.

> I just did some preliminary benchmarks for this PR and noticed that accuracy is slightly lower than before (~0.02 percentage points). This may be due to soft clipping, which happens more often now. Many alignments change from something like this: 4=1X95= to this: 5S95=. Using the current scores, any mismatch 4 bases away from the end of the read is going to be soft clipped. BWA-MEM has a penalty for soft clipping to avoid this, and ksw2 appears to have a way to give a bonus to alignments that reach the end of the query, which would have the same effect. This is related to #54.

Okay, it will probably affect SNP calling as well. However, I think this issue is automatically taken care of if we first try hamming_align (which will produce 4=1X95=). But perhaps you intended to fix the issue itself first, without 'masking' it with the Hamming-distance path?

@marcelm (Collaborator, Author) commented Feb 21, 2023

I added back the hamming_align optimization and with this, mapping appears to be a little bit faster than before. I still need to run more thorough benchmarks.

@marcelm (Collaborator, Author) commented Feb 23, 2023

Here’s a summary of my findings:

  • Accuracy drops by 0.01 percentage points across all datasets (it was 0.02 before I added back the hamming_align optimization)
  • The speedup ranges from none at all for short reads to 1.5× for longer reads

Accuracy

| dataset | main | this PR | accuracy reduction |
|---|---|---|---|
| drosophila/2x50-1M | 90.0643 | 90.0591 | 0.0052 |
| drosophila/2x100-1M | 92.38105 | 92.37705 | 0.004 |
| drosophila/2x150-1M | 93.2202 | 93.21225 | 0.00795 |
| drosophila/2x200-1M | 93.506 | 93.4974 | 0.0086 |
| drosophila/2x300-1M | 95.3651 | 95.35735 | 0.00775 |
| CHM13/2x50-1M | 90.46475 | 90.4603 | 0.00445 |
| CHM13/2x100-1M | 93.22665 | 93.2179 | 0.00875 |
| CHM13/2x150-1M | 94.1315 | 94.11335 | 0.01815 |
| CHM13/2x200-1M | 94.41465 | 94.4028 | 0.01185 |
| CHM13/2x300-1M | 95.62665 | 95.616 | 0.01065 |
| maize/2x50-1M | 71.1893 | 71.1803 | 0.009 |
| maize/2x100-1M | 87.1095 | 87.10545 | 0.00405 |
| maize/2x200-1M | 92.91535 | 92.9083 | 0.00705 |
| maize/2x300-1M | 96.71805 | 96.7116 | 0.00645 |

Speed

This was run on Drosophila only, reporting minimum runtime of three runs.

| dataset | runtime before (s) | runtime after (s) | speedup |
|---|---|---|---|
| 2x50 bp | 49.660 | 49.298 | 1.01 |
| 2x100 bp | 64.250 | 60.817 | 1.06 |
| 2x150 bp | 78.750 | 67.922 | 1.16 |
| 2x200 bp | 100.810 | 78.441 | 1.29 |
| 2x300 bp | 199.235 | 131.731 | 1.51 |

Other notes:

  • Alignment is not a bottleneck for short reads. The percentage of total runtime spent on it in main was ca. 19% (2x100 bp reads), so this limits how much total speedup we can have by making alignment faster. (This PR reduces the percentage to 14%).
  • For 2x200 bp reads, time spent on alignment drops from ~32% to ~9%

I’d like to look into what happens to some of the reads that now get aligned incorrectly, maybe we can avoid the drop in accuracy.

@ksahlin (Owner) commented Feb 23, 2023

To me this looks like great news! Although I would have to rerun the SNP/indel calling pipeline to know how/if it deteriorates around indels. Two notes (pulling in different directions):

  1. I think extension time takes up a larger fraction of the time on smaller genomes, so the relative speedup will not be as high on larger genomes (for simulated data).
  2. However, alignment is invoked less often for simulated reads than for biological datasets (for some reason), as seen in Figure 2C, so I suspect the speedup will be more pronounced on biological data.

Also, it could be worth a short discussion (at the office) on how this partitioned alignment works, perhaps after I get results from the indel-calling benchmark.

Would you like me to start the SNP/indel benchmarks for this commit? (or tell me otherwise when you believe you have something ready to test).

@marcelm (Collaborator, Author) commented Feb 23, 2023

Here’s one read (simulated.1612 in the D. melanogaster 2x100 bp dataset) that gets aligned incorrectly with the new method. We can perhaps discuss this in person, but I’ll write it down here anyway so I remember.

The CIGAR changes from 100= to 33=8D67= and the read maps to a position 8 bases earlier than would be correct.

There are six repeats of GAGCTGGT at the beginning of the read, except that one of the occurrences has two bases changed (marked with ^). On the reference, there is another occurrence preceding the others. The NAM starts in that additional occurrence, so it is one repeat "too early" (marked with NAM below).

Alignment before:

  NAM
GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT GGGCTAGT GAGCTGGT CAAAGCCAAAAGCCAGCCTTCACTTCACTTGACCGCCAGCCAGTGATGTTGC ref
         ======== ======== ======== ======== ======== ======== ====================================================
         GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT GGGCTAGT GAGCTGGT CAAAGCCAAAAGCCAGCCTTCACTTCACTTGACCGCCAGCCAGTGATGTTGC query
           NAM                                ^   ^

Alignment after:

  NAM
GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT GGGCTAGT GAGCTGGT CAAAGCCAAAAGCCAGCCTTCACTTCACTTGACCGCCAGCCAGTGATGTTGC ref
======== ======== ======== ======== =DDDDDDD D======= ======== ====================================================
GAGCTGGT GAGCTGGT GAGCTGGT GAGCTGGT G------- -GGCTAGT GAGCTGGT CAAAGCCAAAAGCCAGCCTTCACTTCACTTGACCGCCAGCCAGTGATGTTGC query
  NAM

When we use SSW, we can recover from the incorrect NAM start position because the entire alignment is re-done, but for extension alignment, we assume that the NAM start and end positions are correct, so it just stays that way.

One remedy could be to detect this situation and then use SSW like before.
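
A hypothetical version of that detection (made-up heuristic and thresholds, not the PR's code) could flag piecewise results whose CIGAR contains a long indel close to a NAM boundary and rerun just those reads with full local alignment:

```python
# Hypothetical fallback heuristic: distrust a piecewise alignment whose
# CIGAR has a long indel near either end of the NAM-anchored region, and
# redo such reads with SSW. Thresholds are made up for illustration.

import re

def has_boundary_indel(cigar, max_ops_from_edge=2, min_indel=5):
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    for i, (length, op) in enumerate(ops):
        near_edge = i < max_ops_from_edge or i >= len(ops) - max_ops_from_edge
        if op in "ID" and int(length) >= min_indel and near_edge:
            return True
    return False
```

For the example read above, has_boundary_indel("33=8D67=") would trigger the fallback, while an alignment like 100= would not.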

@marcelm (Collaborator, Author) commented Feb 23, 2023

> Would you like me to start the SNP/indel benchmarks for this commit? (or tell me otherwise when you believe you have something ready to test).

I’m still thinking about the accuracy issue, but since it affects only 100 out of 1 million reads, it won’t make a noticeable difference even if we fix it, so go ahead.

@marcelm (Collaborator, Author) commented Feb 23, 2023

I looked at a second problematic read and the misalignment was also caused by repetitive sequence in the beginning (GTGTGT...). I think in both cases, this is some kind of "edge effect" from the strobemers being generated differently in the read and the reference. I think it looks like this: In the reference, there is a sequence of syncmers k1, k2, k3 with k1=k2, and the randstrobe is built from k1 and k3. The read starts after k1 and the randstrobe therefore pairs up k2 and k3.
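
The edge effect can be illustrated with a heavily simplified model (this is not strobealign's actual syncmer/randstrobe code; sequences and positions are made up). The randstrobe is modeled as the pair (first syncmer, downstream syncmer), keyed by its sequence content and anchored at the first syncmer's position. When the read starts after k1 and k1 equals k2, the reference strobe (k1, k3) and the read strobe (k2, k3) have identical content but different anchors, so the seed match places the read one repeat too early.

```python
# Toy model of the randstrobe edge effect. Syncmers are (sequence, position)
# pairs; the strobe key is the concatenated sequence content, anchored at
# the first syncmer. All sequences/positions are illustrative.

ref_syncmers = [("GAGCTGGT", 0), ("GAGCTGGT", 8), ("CAAAGCCA", 48)]
read_syncmers = ref_syncmers[1:]        # the read starts after k1

def randstrobe(syncmers):
    (s1, p1), (s2, _) = syncmers[0], syncmers[-1]
    return s1 + s2, p1                  # (sequence key, anchor position)

ref_strobe = randstrobe(ref_syncmers)   # built from k1 and k3, anchor 0
read_strobe = randstrobe(read_syncmers) # built from k2 and k3, anchor 8
# Identical keys with different anchors: the match reports the wrong offset.
```

Because k1 == k2, the two keys collide even though the anchors are 8 bases apart, which is exactly the off-by-one-repeat NAM start seen in the example.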

There’s probably bigger potential for improved accuracy elsewhere, so instead of focusing on this rare corner case, I’ll leave this be for the moment and think about other things.

@ksahlin (Owner) commented Feb 23, 2023

> […] it won’t make a difference even if we fix this, so go ahead.

Ok, I will try to start the benchmarks tomorrow.

> When we use SSW, we can recover from the incorrect NAM start position because the entire alignment is re-done, but for extension alignment, we assume that the NAM start and end positions are correct, so it just stays that way.

I see, so the deletion is coming from the internal 'global' WFA alignment (i.e., not from the KSW ends)?

> There’s probably bigger potential for improved accuracy elsewhere, so instead of focusing on this rare corner case, I’ll leave this be for the moment and think about other things.

Ok, sounds reasonable!

@ksahlin (Owner) commented Feb 23, 2023

Makes sense that the accuracy decrease seems slightly larger for CHM13, which probably has more STRs due to the telomere regions.

@marcelm (Collaborator, Author) commented Feb 24, 2023

I have updated the PR so that it no longer updates the vendored SSW version. More recent SSW versions contain a fix for a segmentation fault that sometimes happens: instead of segfaulting, a flag is set that signals that the alignment failed. The flag is not yet exposed in the C++ wrapper, so if it is set, the C++ Align() method just returns garbage as the CIGAR string. For some reason, the segfault does not get triggered with your custom modification that sets score_size to 1.

This should fix #245.

Commit messages (from the PR timeline):

  • We need the aln_info available in a separate header file. It makes most sense to arrange things so that everything that produces aln_info objects is in aligner.cpp/.hpp.
  • Instead, use aln_info.query_start and aln_info.query_end in the caller to get the same information (this is also more consistent with how the various other data structures represent local alignments).
  • This was the edit distance plus the total number of soft-clipped bases. We can instead use the query_start and query_end attributes to compute how many bases are soft-clipped.
  • Delete length and replace it with a ref_span() method.
@marcelm (Collaborator, Author) commented Feb 24, 2023

> I see, so the deletion is coming from the internal 'global' WFA alignment (i.e., not from the KSW ends)?

Yes, exactly.

@marcelm (Collaborator, Author) commented Apr 4, 2023

Ok to close this in favor of #251?

@ksahlin (Owner) commented Apr 4, 2023

Yes, I agree to close.

marcelm closed this Apr 4, 2023