-
Notifications
You must be signed in to change notification settings - Fork 10
Local realignment
Polypolish does not perform local realignment. However, the concept is relevant and I thought it deserved a wiki page. So if you're interested, read on...
Most read alignment strategies align each read independently. I.e. a read's alignment doesn't depend on other reads in the read set. However, there can often be multiple ways of aligning reads around indels, and some alignments will make the 'wrong' choice. This is especially common when an indel is near the end of a read.
For example, consider these alignments, where there is a 4-bp deletion in the reads relative to the reference:
read 5: GTGCGTGTGCGAGGAACCTT
read 4: TACG----TGCGTGTGCGAGGAAC
read 3: GCGACTACG----TGCGTGTGCGA
read 2: TACTGTGCGACTACG----TGCGT
read 1: CTTTTACTGTGCGACTACGT
——————————————————————————————————————————————————————————————
reference: AATTCGTGGACTTTTACTGTGCGACTACGACCATGCGTGTGCGAGGAACCTTAGACTGTCAG
Reads 1 and 5 didn't get the alignment quite right, because due to the scoring scheme, a 1-bp mismatch was preferable to a 4-bp deletion. When looking at just read 1 or just read 5, their alignment looks good. It's only when looking at all read alignments together that it becomes clear that there's a 4-bp deletion and reads 1 and 5 could be aligned better.
Local realignment adjusts read alignments by taking each other into account. In our example, this should yield the following, where reads 1 and 5 are in agreement with the rest:
read 5: G----TGCGTGTGCGAGGAACCTT
read 4: TACG----TGCGTGTGCGAGGAAC
read 3: GCGACTACG----TGCGTGTGCGA
read 2: TACTGTGCGACTACG----TGCGT
read 1: CTTTTACTGTGCGACTACG----T
——————————————————————————————————————————————————————————————
reference: AATTCGTGGACTTTTACTGTGCGACTACGACCATGCGTGTGCGAGGAACCTTAGACTGTCAG
Note that the words 'local' and 'realignment' can have different meanings in different contexts, so the term 'local realignment' might be a bit confusing:
- Usually in the context of alignment, 'local' means not 'global' (e.g. Smith-Waterman vs Needleman-Wunsch). But in 'local realignment', 'local' refers to the fact that the realignment only aims to make small changes to each read's alignment in the same vicinity of the reference (i.e. locally).
- 'Realignment' can also refer to the act of taking reads aligned to one reference genome and adjusting the alignments to a different reference genome. E.g. human short reads might be aligned to GRCh37 but you need them to be aligned to GRCh38. This is a different kind of 'realignment' and not what we're talking about here.
You can read more about local realignment in these papers:
- Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA
- A framework for variation discovery and genotyping using next-generation DNA sequencing data
Since Polypolish uses the read pileup to call variants, including indels, local realignment could be beneficial. So if you can locally realign your alignments prior to giving those alignments to Polypolish, you should do so!
However, I have found it to be very difficult in practice. GATK v3 could do local realignment, but this was removed from GATK in v4. Even if you're willing to use an old version of GATK, there are a lot of roadblocks along the way - it's not an easy process! SRMA doesn't support the all-per-read alignments that Polypolish needs (it outputs only one alignment per read). CLC Genomics Workbench seems to have a tool for local realignment, but it isn't open source so I haven't been able to try it. And some other tools like glia only realign around specific variants in a VCF - not the entire set of alignments. Samtools/Bcftools avoid local realignment, instead relying on their BAQ system to address the problem in a different way.
Since I can't find a good way to do it, I don't use local realignment before running Polypolish. But if there is a method that I've missed, I would love to hear about it! So if you know of one, please send me an email 😄