You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In the VCF format, indels are encoded by using adding the letter before the variant to the allele list, and reporting the position of the variant as the indel position minus one (see 5.1.1 in http://samtools.github.io/hts-specs/VCFv4.4.pdf). E.g.:
ATGC---CC
ATGCGGGCC
123456789 (position)
Would be reported as a variant at position 4, with alleles "C" and "CGGG", rather than a variant at position 5 with alleles "" and "GGG". The initial C in the allele string is taken from the reference genome sequence. However, if there is also a SNP at position 4, we would register two sites, both at position 4. Tskit does not allow sites with duplicate positions, so this situation can't be easily encoded in a tree sequence, even though it's a valid representation of genetic variation.
Other than mask such sites out (if there are any), we could re-encode such sites as at position 5 with one blank allele (""). However, this could cause problems for interpreting the SGkit format, as we currently calculate allele numbers etc on the assumption that trailing "" values in the alleles list can be removed.
This is probably not much of an issue for human data. But in some other datasets, such as Anopheles, every other base is a SNP, so I imagine this situation occurs quite often.
Note that there is a corner case of this corner case: when the indel is at the start of the sequence, the reference letter after the site is appended.
The text was updated successfully, but these errors were encountered:
On reflection, I think that the SGkit format should use the VCF style (i.e. use the position one base pair before the actual indel, e.g. pos=9, REF="ATCT", ALT="A"), whereas tsinfer should translate that to a tree sequence which has a position at the indel site (pos=10, REF="TCT", ALT=""). That way we can reasonably encode indels with an adjacent SNP.
The TreeSequence.to_vcf() method should be able to deal with tree sequences of both forms, however (outputting the VCF-compatible form in both cases). I'm unclear what we would use as the reference sequence letter to append to the allele string in this case, however.
In the VCF format, indels are encoded by using adding the letter before the variant to the allele list, and reporting the position of the variant as the indel position minus one (see 5.1.1 in http://samtools.github.io/hts-specs/VCFv4.4.pdf). E.g.:
Would be reported as a variant at position 4, with alleles
"C"
and"CGGG"
, rather than a variant at position 5 with alleles""
and"GGG"
. The initialC
in the allele string is taken from the reference genome sequence. However, if there is also a SNP at position 4, we would register two sites, both at position 4. Tskit does not allow sites with duplicate positions, so this situation can't be easily encoded in a tree sequence, even though it's a valid representation of genetic variation.Other than mask such sites out (if there are any), we could re-encode such sites as at position 5 with one blank allele (
""
). However, this could cause problems for interpreting the SGkit format, as we currently calculate allele numbers etc on the assumption that trailing""
values in the alleles list can be removed.This is probably not much of an issue for human data. But in some other datasets, such as Anopheles, every other base is a SNP, so I imagine this situation occurs quite often.
Note that there is a corner case of this corner case: when the indel is at the start of the sequence, the reference letter after the site is appended.
The text was updated successfully, but these errors were encountered: