Corner case: cannot encode a SNP next to an indel in standard VCF encoding #893

hyanwong · 2024-01-23T14:16:45Z

In the VCF format, indels are encoded by using adding the letter before the variant to the allele list, and reporting the position of the variant as the indel position minus one (see 5.1.1 in http://samtools.github.io/hts-specs/VCFv4.4.pdf). E.g.:


ATGC---CC
ATGCGGGCC
123456789 (position)

Would be reported as a variant at position 4, with alleles "C" and "CGGG", rather than a variant at position 5 with alleles "" and "GGG". The initial C in the allele string is taken from the reference genome sequence. However, if there is also a SNP at position 4, we would register two sites, both at position 4. Tskit does not allow sites with duplicate positions, so this situation can't be easily encoded in a tree sequence, even though it's a valid representation of genetic variation.

Other than mask such sites out (if there are any), we could re-encode such sites as at position 5 with one blank allele (""). However, this could cause problems for interpreting the SGkit format, as we currently calculate allele numbers etc on the assumption that trailing "" values in the alleles list can be removed.

This is probably not much of an issue for human data. But in some other datasets, such as Anopheles, every other base is a SNP, so I imagine this situation occurs quite often.

Note that there is a corner case of this corner case: when the indel is at the start of the sequence, the reference letter after the site is appended.

The text was updated successfully, but these errors were encountered:

hyanwong · 2024-02-02T08:57:18Z

On reflection, I think that the SGkit format should use the VCF style (i.e. use the position one base pair before the actual indel, e.g. pos=9, REF="ATCT", ALT="A"), whereas tsinfer should translate that to a tree sequence which has a position at the indel site (pos=10, REF="TCT", ALT=""). That way we can reasonably encode indels with an adjacent SNP.

The TreeSequence.to_vcf() method should be able to deal with tree sequences of both forms, however (outputting the VCF-compatible form in both cases). I'm unclear what we would use as the reference sequence letter to append to the allele string in this case, however.

hyanwong mentioned this issue Sep 4, 2024

Check for duplicate alleles #960

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Corner case: cannot encode a SNP next to an indel in standard VCF encoding #893

Corner case: cannot encode a SNP next to an indel in standard VCF encoding #893

hyanwong commented Jan 23, 2024 •

edited

Loading

hyanwong commented Feb 2, 2024 •

edited

Loading

Corner case: cannot encode a SNP next to an indel in standard VCF encoding #893

Corner case: cannot encode a SNP next to an indel in standard VCF encoding #893

Comments

hyanwong commented Jan 23, 2024 • edited Loading

hyanwong commented Feb 2, 2024 • edited Loading

hyanwong commented Jan 23, 2024 •

edited

Loading

hyanwong commented Feb 2, 2024 •

edited

Loading