Unparseable VEP output #1351
Comments
Hi @pd3, Best wishes,
I would assume so, but the file was not produced by me; it is one of the gnomAD VCFs. As for the responsibility, I would argue that both Loftee and VEP should do something about it: Loftee should not produce such output in the first place, and VEP should sanitize outputs from all its plugins to prevent problems like this. It's unfortunate that this went unnoticed and a major resource got affected.
The LoF_info subfield contains commas, which in general makes it impossible to parse the VEP subfields in an automated way. The +split-vep plugin can now work with such files, replacing the offending commas with slash (/) characters. Note that this makes two assumptions: 1) the number of subfields delimited by the pipe character (|) is consistent with the header definition; 2) the first subfield never contains a comma, otherwise it would be impossible to distinguish between A|A,A,B,B|B and A|A,A,A,B|B. See also Ensembl/ensembl-vep#1351
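To illustrate how those two assumptions make the record boundaries recoverable, here is a minimal Python sketch. It is not the plugin's actual implementation; the function name split_csq and the n_fields parameter (the number of subfields declared in the header) are hypothetical names chosen for this example. With n_fields subfields per record, every (n_fields - 1)-th pipe-delimited piece straddles two records, and because the first subfield never contains a comma, the last comma in that piece must be the record separator.

```python
def split_csq(value: str, n_fields: int) -> list[list[str]]:
    """Split a VEP-style annotation string into records of n_fields subfields.

    Assumes (1) every record has exactly n_fields pipe-delimited subfields
    and (2) the first subfield of a record never contains a comma.
    Commas found inside subfields are replaced with '/'.
    """
    if n_fields < 2:
        raise ValueError("this sketch needs at least two subfields per record")

    pieces = value.split("|")
    step = n_fields - 1  # number of pipes inside one record
    if (len(pieces) - 1) % step != 0:
        raise ValueError("pipe count is inconsistent with the header definition")

    records: list[list[str]] = []
    current: list[str] = []
    for i, piece in enumerate(pieces):
        # Pieces at multiples of `step` (except the first and last piece)
        # hold "<last subfield>,<first subfield of the next record>".
        straddles_records = i % step == 0 and 0 < i < len(pieces) - 1
        if straddles_records:
            last, sep, first = piece.rpartition(",")
            if not sep:
                raise ValueError("missing record separator")
            current.append(last.replace(",", "/"))
            records.append(current)
            current = [first]
        else:
            current.append(piece.replace(",", "/"))
    records.append(current)
    return records
```

For instance, with three subfields per record, split_csq("T|a|p,q,G|b|r", 3) treats only the last comma of the middle piece as a record separator and rewrites the other one as a slash.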
At the moment VEP does not sanitize the data returned by any of the plugins. We ask anyone developing a plugin to test how the output is displayed for each format and ensure the data is parsable.
I understand why it's a nuisance. But I don't believe that sanitizing plugin output can have a noticeable performance effect, if done well. Have you done any benchmarking to support that claim?
Hi @pd3, Best wishes,
Describe the issue
When VEP adds the LoF_info annotation, it does not sanitize its output and allows commas in the LoF_info subfield. For example:
This makes it impossible to split the consequences by transcript and variant, so programs designed to extract and query VEP annotations (such as bcftools +split-vep) fail.
Additional information
An example of such a file and site is chr21:5032064 in https://gnomad-public-us-east-1.s3.amazonaws.com/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz
System
VEP version: 101 (possibly more recent versions as well)
An example of such VEP annotation
Proposed solution
Replace commas and other special characters in plugins' outputs with the corresponding percent-encoded characters, as recommended by the VCF specification v4.3 in section 1.2 (http://samtools.github.io/hts-specs/VCFv4.3.pdf).
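As a rough illustration of the proposed encoding, the Python sketch below percent-encodes the characters that section 1.2 of the VCF v4.3 specification lists as having special meaning. The names VCF_ESCAPES, encode_vcf_value and decode_vcf_value are hypothetical and exist only for this example; VEP itself is written in Perl, and this is not code from VEP or any of its plugins.

```python
from urllib.parse import unquote

# Characters with special meaning in VCF v4.3 (section 1.2) and their
# percent-encoded forms.
VCF_ESCAPES = {
    "%": "%25",
    ":": "%3A",
    ";": "%3B",
    "=": "%3D",
    ",": "%2C",
    "\r": "%0D",
    "\n": "%0A",
    "\t": "%09",
}


def encode_vcf_value(text: str) -> str:
    """Percent-encode special characters one at a time.

    Working character by character means a literal '%' is encoded exactly
    once and already-produced escapes are never re-encoded.
    """
    return "".join(VCF_ESCAPES.get(ch, ch) for ch in text)


def decode_vcf_value(text: str) -> str:
    """Reverse the encoding; unquote() decodes every %XX sequence."""
    return unquote(text)
```

A plugin value passed through encode_vcf_value before being written would no longer contain raw commas, so a downstream tool could split records on commas and decode each subfield afterwards. The pipe character is not in the specification's list; whether it should also be encoded inside VEP annotations is a separate decision that this sketch does not make.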