Unparseable VEP output #1351

pd3 · 2023-02-09T14:36:59Z

Describe the issue

When VEP adds the LoF_info annotation, it does not sanitize its output and allows commas in the LoF_info subfield. For example:

[...]|PERCENTILE:0.7,GERP_DIST:-366.3[...]MAX_ORF:-698.745,GA|frameshift_variant[...]

This makes it impossible to split the consequences by transcript and variant, programs that are designed to extract and query VEP annotations fail (such as bcftools +split-vep).

Additional information

Example of such file and site is chr21:5032064 in https://gnomad-public-us-east-1.s3.amazonaws.com/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz

System

VEP version: 101 (possibly more recent versions as well)

An example of such VEP annotation

header:
##INFO=<ID=vep,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info">
##vep_version=v101

data line:
vep=GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000612610|protein_coding|5/7||ENST00000612610.4:c.718_719dup|ENSP00000483732.1:p.Asp240GlufsTer35|896-897|709-710|237|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene|||1|A2||ENSP00000483732|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.773118279569892,GERP_DIST:-366.377766615897,BP_DIST:218,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000620481|protein_coding|4/6||ENST00000620481.4:c.367_368dup|ENSP00000484302.1:p.Asp123GlufsTer35|545-546|358-359|120|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene|||5|||ENSP00000484302|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.635578583765112,GERP_DIST:-366.377766615897,BP_DIST:218,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000623795|protein_coding|4/6||ENST00000623795.1:c.367_368dup|ENSP00000485649.1:p.Asp123GlufsTer35|505-506|358-359|120|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene|||2|||ENSP00000485649|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.659498207885305,GERP_DIST:-372.525567065179,BP_DIST:197,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|3_prime_UTR_variant&NMD_transcript_variant|MODIFIER|FP565260.3|ENSG00000277117|Transcript|ENST00000623903|nonsense_mediated_decay|5/7||ENST00000623903.3:c.*332_*333dup||706-707|||||1||1|insertion||Clone_based_ensembl_gene|||2|||ENSP00000485557||||||||||||||||,GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000623960|protein_coding|5/7||ENST00000623960.4:c.718_719dup|ENSP00000485129.1:p.Asp240GlufsTer35|858-859|709-710|237|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene||YES|1|P2|CCDS86973.1|ENSP00000485129|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.790979097909791,GERP_DIST:-372.525567065179,BP_DIST:197,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|NM_001363770.2|protein_coding|5/7||NM_001363770.2:c.718_719dup|NP_001350699.1:p.Asp240GlufsTer35|858-859|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||YES||||NP_001350699.1||||||||10|||||HC|||PERCENTILE:0.790979097909791,GERP_DIST:-372.525567065179,BP_DIST:197,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|XM_006723899.2|protein_coding|5/6||XM_006723899.2:c.718_719dup|XP_006723962.1:p.Asp240GlufsTer35|1345-1346|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||||||XP_006723962.1||||||||10|||||HC|||PERCENTILE:0.463571889103804,GERP_DIST:-1141.14512844086,BP_DIST:840,DIST_FROM_LAST_EXON:152,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|XM_011546078.2|protein_coding|5/7||XM_011546078.2:c.718_719dup|XP_011544380.1:p.Asp240GlufsTer35|1345-1346|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||||||XP_011544380.1||||||||10|||||HC|||PERCENTILE:0.662062615101289,GERP_DIST:-354.294564935565,BP_DIST:374,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|XM_011546079.1|protein_coding|5/7||XM_011546079.1:c.718_719dup|XP_011544381.1:p.Asp240GlufsTer35|1345-1346|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||||||XP_011544381.1||||||||10|||||HC|||PERCENTILE:0.785792349726776,GERP_DIST:-391.862466733158,BP_DIST:203,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT

Proposed solution

Replace commas and other special characters in plugins' outputs with the corresponding percent encoded characters, as recommended by the VCF specification v4.3 in section 1.2 (http://samtools.github.io/hts-specs/VCFv4.3.pdf).

The text was updated successfully, but these errors were encountered:

dglemos · 2023-02-09T15:01:20Z

Hi @pd3,
You are correct, the output field has to be refactored to not include the commas.
Can you please confirm you are getting this results from the loftee plugin? If so the issue should be raised in the loftee repo as it is developed there.

Best wishes,
Diana

pd3 · 2023-02-09T15:04:04Z

I would assume so, but the file was not produced by me, it's gnomAD VCFs.

As for the responsibility, I would argue that both Loftee and VEP should do something about it: Loftee should not produce such output in the first place and VEP should sanitize outputs from all its plugins to prevent problems like this.

It's unfortunate that this went unnoticed and a major resource got affected.

The LoF_info subfield contains commas which, in general, makes it impossible to parse the VEP subfields in automated way. The +split-vep plugin can now work with such files, replacing the offending commas with slash (/) characters. Note that this makes two assumptions: 1) the number of subfields delimited by the pipe characters (|) are consistent with the header definition 2) the first subfield never contains a comma, otherwise it woud be impossible to distinguish between A|A,A,B,B|B and A|A,A,A,B|B See also Ensembl/ensembl-vep#1351

dglemos · 2023-02-15T14:50:19Z

At the moment VEP does not sanitize the data returned by any of the plugins. We ask anyone developing a plugin to test how the output is displayed for each format and ensure the data is parsable.
We understand this is not the ideal solution however an extra parsing of the output can have a significant impact on VEP performance for some of the plugins.

pd3 · 2023-02-15T14:59:52Z

We understand this is not the ideal solution however an extra parsing of the output can have a significant impact on VEP performance for some of the plugins.

I understand why it's a nuisance. But I don't believe that sanitizing plugins output can have a noticeable performance effect, if done well. Have you done any benchmarking to support that claim?

dglemos · 2024-10-16T10:50:27Z

Hi @pd3,
Thank you for your feedback, and I completely understand your perspective.
While we haven’t conducted specific benchmarking for this particular scenario, our experience with similar cases has shown that any additional parsing, even when optimized, tends to increase runtime.
We acknowledge the importance of ensuring output data quality and are committed to being more vigilant with plugin outputs in future integrations.

Best wishes,
Diana

dglemos self-assigned this Feb 9, 2023

pd3 mentioned this issue Feb 9, 2023

Loftee VEP plugin creates incorrectly formatted VCFs konradjk/loftee#93

Open

dglemos closed this as completed Oct 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unparseable VEP output #1351

Unparseable VEP output #1351

pd3 commented Feb 9, 2023 •

edited

Loading

dglemos commented Feb 9, 2023

pd3 commented Feb 9, 2023 •

edited

Loading

dglemos commented Feb 15, 2023

pd3 commented Feb 15, 2023

dglemos commented Oct 16, 2024

Unparseable VEP output #1351

Unparseable VEP output #1351

Comments

pd3 commented Feb 9, 2023 • edited Loading

Describe the issue

Additional information

System

An example of such VEP annotation

Proposed solution

dglemos commented Feb 9, 2023

pd3 commented Feb 9, 2023 • edited Loading

dglemos commented Feb 15, 2023

pd3 commented Feb 15, 2023

dglemos commented Oct 16, 2024

pd3 commented Feb 9, 2023 •

edited

Loading

pd3 commented Feb 9, 2023 •

edited

Loading