Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect INFO and genotypes in HPRCv1.1 decomposed VCFs #29

Open
Han-Cao opened this issue Jul 5, 2024 · 0 comments
Open

Incorrect INFO and genotypes in HPRCv1.1 decomposed VCFs #29

Han-Cao opened this issue Jul 5, 2024 · 0 comments

Comments

@Han-Cao
Copy link

Han-Cao commented Jul 5, 2024

Hi @glennhickey ,

There are some errors in the decomposed VCFs, which is likely caused by vcfwave:

  • Some of the decomposed variants have incorrect genotypes and INFO
  • The decomposed INFO/AT do not match the alleles. It looks like vcfwave treat this field as Number=A instead of Number=R

To quickly replicate this issue:

bcftools view -H -r chr22:11021446-11021447 -s HG01109 hprc-v1.1-mc-chm13.raw.vcf.gz
chr22   11021446        >55070592>55070595      TT      T,TTT   60      .       AC=1,0;AF=0.0857143,0.0571429;AN=2;AT=>55070592>55070593>55070595,>55070592>55070595,>55070592>55070593>55070594>55070595;NS=44;LV=0    GT      0|1

bcftools view -H -r chr22:11021446-11021447 -s HG01109 hprc-v1.1-mc-chm13.vcfbub.a100k.wave.vcf.gz
[W::bcf_calc_ac] Incorrect number of AC fields at chr22:11021447. (This message is printed only once.)

chr22   11021446        >55070592>55070595_1    TT      T       60      .       AC=1;AF=0.085714;AN=2;AT=>55070592>55070593>55070595;NS=44;LV=0;ORIGIN=chr22:11021446;LEN=1;TYPE=del    GT      0|1
chr22   11021447        >55070592>55070595_2    T       TT      60      .       AC=1;AF=0.0857143,0.0571429;AN=2;AT=>55070592>55070593>55070595,>55070592>55070595,>55070592>55070593>55070594>55070595;NS=44;LV=0;ORIGIN=chr22:11021446;LEN=1;TYPE=ins  GT      0|1

Genotype error:
The >55070592>55070595 record has 2 alternative alleles, the first one is deletion (TT -> T) and the second one is insertion (TT -> TTT). In raw VCF, HG01109 has one copy of the deletion on haplotype2. While in the decomposed VCF, HG01109 has both the deletion and insertion on haplotype2.

INFO error
The AT in decomposed VCF don't match the input VCF. Moreover, it seems all INFO for >55070592>55070595_2 are still the same as the raw VCF.

I also report this issue to vcfwave, you may refer to vcflib/vcflib#403 for details .

Besides, this issue was initially identified when I tried to run this Pangenie pipeline.

bcftools view -s ^GRCh38 -r chr22 -i 'F_MISSING < 0.2' hprc-v1.1-mc-chm13.vcfbub.a100k.wave.vcf.gz | \
python3 scripts/add-ids.py | \
bcftools norm -m - > callset-biallelic.vcf
python3 scripts/merge_vcfs.py merge -vcf callset-biallelic.vcf -r CHM13v11Y.fa -ploidy 2 1> graph.vcf 2> graph.vcf.log

The error records are reported in graph.vcf.log like below. In chr22, there are 3111 errors.

Two overlapping variants at same haplotype at chr22:11021446, set allele to missing.
Two overlapping variants at same haplotype at chr22:11023348, set allele to missing.
Two overlapping variants at same haplotype at chr22:11028129, set allele to missing.
...
@Han-Cao Han-Cao changed the title Incorrect INFO and genotypes in the decomposed VCFs Incorrect INFO and genotypes in HPRCv1.1 decomposed VCFs Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant