bcftools annotate can output an INFO field value with unquoted semicolons (;). This causes the part after the semicolon to be interpreted as another INFO field when parsed. If the part after the semicolon contains the comma character, the resulting file cannot be viewed using bcftools view, instead producing an error.
Steps to reproduce:
A minimal VCF file to annotate:
$ cat repro.vcf
##fileformat=VCFv4.3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr20>
##reference=hg38
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT repro
chr20 33791101 . GC G . . . GT 0/1
Annotations file, containing an annotation value that contains a semicolon:
$ cat annots.txt
chr20 33791101 GC G ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47
A header line to use for the new annotation:
$ cat header.txt
##INFO=<ID=FOO,Number=1,Type=String,Description="Yet another header line">
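Annotating the VCF file, using the command recorded in the ##bcftools_annotateCommand header line of the output (annotate -a annots.txt.gz -h header.txt -c CHROM,POS,REF,ALT,FOO repro.vcf), produced a VCF file where the INFO field separator ; appears unquoted: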
$ cat out.vcf
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr20>
##reference=hg38
##INFO=<ID=FOO,Number=1,Type=String,Description="Yet another header line">
##bcftools_annotateVersion=1.20+htslib-1.20
##bcftools_annotateCommand=annotate -a annots.txt.gz -h header.txt -c CHROM,POS,REF,ALT,FOO repro.vcf; Date=Fri May 31 10:13:26 2024
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT repro
chr20 33791101 . GC G . . FOO=ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47 GT 0/1
This is not accepted by bcftools view, because it parses the part after the semicolon as another INFO field and tries to create a dummy header line for it, which fails due to the comma embedded in it:
$ bcftools view out.vcf
[W::vcf_parse_info] INFO 'ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##INFO=<ID=ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47,Number=1,Type=String,Description=\"Dummy\">"
[E::vcf_parse_info] Could not add dummy header for INFO 'ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47' at chr20:33791101
Error: VCF parse error
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr20>
##reference=hg38
##INFO=<ID=FOO,Number=1,Type=String,Description="Yet another header line">
##bcftools_annotateVersion=1.20+htslib-1.20
##bcftools_annotateCommand=annotate -a annots.txt.gz -h header.txt -c CHROM,POS,REF,ALT,FOO repro.vcf; Date=Fri May 31 10:13:26 2024
##bcftools_viewVersion=1.20+htslib-1.20
##bcftools_viewCommand=view out.vcf; Date=Fri May 31 10:15:56 2024
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT repro
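To illustrate why the record is misread, here is a minimal Python sketch of INFO splitting. It is a simplified model written for this report, not htslib's actual parser; the helper name split_info is hypothetical.

```python
from __future__ import annotations


def split_info(info: str) -> list[tuple[str, str | None]]:
    """Split a VCF INFO column into (key, value) pairs.

    Simplified model (an assumption, not htslib code): entries are
    separated by ';' and a key is separated from its value by the
    first '='. Entries without '=' are treated as flags (value None).
    """
    pairs = []
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        pairs.append((key, value if sep else None))
    return pairs


# The INFO column written by bcftools annotate in out.vcf
info = (
    "FOO=ENST00000342427.6:c.2129delC,ENST00000342427.6:p.K711Rfs*47;"
    "ENST00000375200.6:c.2150delC,ENST00000375200.6:p.K718Rfs*47"
)

for key, value in split_info(info):
    print(repr(key), repr(value))
```

In this model the second entry has no value, and its key is the entire string after the semicolon, comma included. That matches the ID that appears in the dummy header line in the error message above.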
Some characters have a special meaning when they appear (such as field delimiters ‘;’ in INFO or ‘:’ FORMAT fields), and for any other meaning they must be represented with the capitalized percent encoding; [...]
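As a possible workaround until this is handled by bcftools, the values in the annotations file could be percent-encoded before running bcftools annotate. The sketch below is an assumption about how that might look, not something bcftools provides: the function name encode_info_value and the choice of which characters to encode by default are mine. The encodings themselves are the capitalized ones listed in the VCF v4.3 spec.

```python
# Capitalized percent encodings as listed in the VCF v4.3 spec.
# ':' is left out on purpose: it is only a delimiter in FORMAT fields,
# and this sketch targets INFO values (an assumption of this example).
VCF_INFO_ESCAPES = {
    "%": "%25",
    ";": "%3B",
    "=": "%3D",
    "\r": "%0D",
    "\n": "%0A",
    "\t": "%09",
}


def encode_info_value(value: str, encode_comma: bool = False) -> str:
    """Percent-encode characters that would break an INFO value.

    Each character is mapped exactly once, so a literal '%' becomes
    '%25' and is never double-encoded. Commas are kept by default
    because they delimit list values; pass encode_comma=True for a
    Number=1 field whose value contains literal commas.
    """
    escapes = dict(VCF_INFO_ESCAPES)
    if encode_comma:
        escapes[","] = "%2C"
    return "".join(escapes.get(ch, ch) for ch in value)


raw = "ENST00000342427.6:p.K711Rfs*47;ENST00000375200.6:c.2150delC"
print(encode_info_value(raw))
# ENST00000342427.6:p.K711Rfs*47%3BENST00000375200.6:c.2150delC
```

Anything that reads the annotated file afterwards would have to decode the value again, so this only moves the problem out of the VCF syntax; it does not replace a fix in bcftools annotate itself.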
bcftools version
$ bcftools version
bcftools 1.20
Using htslib 1.20
Copyright (C) 2024 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Additional information
The passage beginning "Some characters have a special meaning" above is quoted from the VCF v4.3 spec, Section 1.2.
Files used in the steps to reproduce
repro.zip