Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Funcotator - Problems with variant alleles that span exon, coding sequence start/end boundaries #4307

Open
jonn-smith opened this issue Jan 30, 2018 · 8 comments
Assignees
Milestone

Comments

@jonn-smith
Copy link
Collaborator

jonn-smith commented Jan 30, 2018

The way start positions are calculated causes issues when alleles span the boundaries of exons and the coding sequence itself.

For exon boundaries:

The error stems from FuncotatorUtils::getStartPositionInCodingSequence and how the results of that method call are used in the GencodeFuncotationFactory.

In addition:
Funcotator must be able to handle indels that span exon start boundaries.

For example, in hg19 the following variant is not handled properly:

chr5:71622537-71622538 CA->C

The current workaround is to throw a FuncotatorUtils.TranscriptCodingSequenceException for the transcript causing this problem in Gencode.

For Coding sequence boundaries, the following variant in hg38 causes a problem:

chr17 80090386  rs71163918  CAGCACGTGCATGAACAACACAGGACACACACAGCACGTGCATGAACAACACAGGACACACACA  C . PASS  .
@jonn-smith jonn-smith added this to the Engine-4.1 milestone Jan 30, 2018
@jonn-smith jonn-smith self-assigned this Jan 30, 2018
@jonn-smith
Copy link
Collaborator Author

A temporary fix is in that will only annotate variants as in a CDS if they are contained within that CDS.

@jonn-smith
Copy link
Collaborator Author

jonn-smith commented May 31, 2018

This is related to #4804

@jonn-smith
Copy link
Collaborator Author

Can oncotator handle this?

@droazen droazen removed this from the Engine-2Q2018 milestone Oct 4, 2018
@jonn-smith jonn-smith changed the title Funcotator - Problems with variant alleles that span exon start/end boundaries Funcotator - Problems with variant alleles that span exon, coding sequence start/end boundaries Oct 11, 2018
@jonn-smith jonn-smith added this to the Engine-Q42018 milestone Oct 15, 2018
@jonn-smith
Copy link
Collaborator Author

Funcotator and Oncotator both produce the same results for chr5:71622537-71622538 CA->C. They both seem to ignore the transcript that runs off the end of the boundary.

@jonn-smith
Copy link
Collaborator Author

GencodeFuncotationFactory::getContainingGtfSubfeature contains a "fix" to get around this issue.

@jonn-smith
Copy link
Collaborator Author

This is related to #3749

@jonn-smith
Copy link
Collaborator Author

jonn-smith commented Oct 30, 2018

For the following variants:

chr1 819955 . TTCACGAATT T . PASS .
chr1 819979 . CATGAGCAT C . PASS .

VEP seems to produce incorrect data:

##fileformat=VCFv4.1
##VEP="v94" time="2018-10-30 19:13:50" cache="/nfs/public/release/ensweb-data/latest/tools/grch37/e94/vep/cache/homo_sapiens/94_GRCh37" db="homo_sapiens_core_94_37@hh-mysql-ens-grch37-web" 1000genomes="phase3" COSMIC="81" ClinVar="201706" ESP="20141103" HGMD-PUBLIC="20164" assembly="GRCh37.p13" dbSNP="150" gencode="GENCODE 19" genebuild="2011-04" gnomAD="170228" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|TSL|APPRIS|ENSP|SIFT|PolyPhen|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	819955	.	TTCACGAATT	T	.	PASS	CSQ=-|splice_acceptor_variant&coding_sequence_variant&intron_variant|HIGH|AL645608.2|ENSG00000269308|Transcript|ENST00000594233|protein_coding|3/3|2/2|||?-38|?-38|?-13|||||1||Clone_based_ensembl_gene||||ENSP00000470877|||||||||||||||||||||||||||
chr1	819979	.	CATGAGCAT	C	.	PASS	CSQ=-|coding_sequence_variant&3_prime_UTR_variant|MODIFIER|AL645608.2|ENSG00000269308|Transcript|ENST00000594233|protein_coding|3/3||||54-?|54-?|18-?|||||1||Clone_based_ensembl_gene||||ENSP00000470877|||||||||||||||||||||||||||

There is no cDNA string / amino acid change and the positions have ? characters in them (not 100% sure the question marks are wrong - there is no real spec for these fields).

Oncotator seems to produce incorrect data:

##fileformat=VCFv4.1
##fileDate=Thu Sept 6 15:53:20 UTC 2018
##reference=/cromwell_root/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta
##oncotator_version=Oncotator_v1.10.0.0dev_|_Flat_File_Reference_hg19_|_GENCODE_v19_CANONICAL_
##INFO=<ID=annotation_transcript,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gencode_transcript_status,Number=.,Type=String,Description="Unknown">
##INFO=<ID=HGVS_protein_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=genome_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=ccds_id,Number=.,Type=String,Description="Unknown">
##INFO=<ID=codon_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=HGVS_coding_DNA_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=transcript_position,Number=.,Type=String,Description="Unknown">
##INFO=<ID=transcript_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=other_transcripts,Number=.,Type=String,Description="Unknown">
##INFO=<ID=HGVS_genomic_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gencode_transcript_tags,Number=.,Type=String,Description="Unknown">
##INFO=<ID=variant_classification,Number=.,Type=String,Description="Unknown">
##INFO=<ID=secondary_variant_classification,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gene_id,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gencode_transcript_name,Number=.,Type=String,Description="Unknown">
##INFO=<ID=transcript_exon,Number=.,Type=String,Description="Unknown">
##INFO=<ID=havana_transcript,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gene_type,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gencode_transcript_type,Number=.,Type=String,Description="Unknown">
##INFO=<ID=protein_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=ref_context,Number=.,Type=String,Description="Unknown">
##INFO=<ID=variant_type,Number=.,Type=String,Description="Unknown">
##INFO=<ID=transcript_id,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gc_content,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gene,Number=.,Type=String,Description="Unknown">
##INFO=<ID=transcript_strand,Number=.,Type=String,Description="Unknown">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##contig=<ID=NC_007605,length=171823>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	819955	.	TTCACGAATT	T	.	PASS	annotation_transcript=ENST00000594233.1;gencode_transcript_status=NOVEL;genome_change=g.chr1:819956_819964delTCACGAATT;codon_change=c.(34-39)tcacga>a;HGVS_coding_DNA_change=ENST00000594233.1:c.35-1TCACGAATT>-;transcript_position=34_38;transcript_change=c.34_38delTCACGAATT;HGVS_genomic_change=chr1.hg19:g.819956_819964delTCACGAATT;gencode_transcript_tags=basic|appris_principal;variant_classification=Splice_Site;secondary_variant_classification=In_Frame_Del;gencode_transcript_name=AL645608.2-201;transcript_exon=3;gene_type=protein_coding;gencode_transcript_type=protein_coding;protein_change=p.SR12del;ref_context=atggctattttcacgaattaattcttccg;variant_type=DEL;transcript_id=ENST00000594233.1;gc_content=0.349;gene=AL645608.2;transcript_strand=+
1	819979	.	CATGAGCAT	C	.	PASS	annotation_transcript=ENST00000594233.1;gencode_transcript_status=NOVEL;HGVS_protein_change=Exception_encountered;genome_change=g.chr1:819980_819987delATGAGCAT;HGVS_coding_DNA_change=ENST00000594233.1:c.54_*4delATGAGCAT;transcript_position=54_57;HGVS_genomic_change=chr1.hg19:g.819980_819987delATGAGCAT;gencode_transcript_tags=basic|appris_principal;variant_classification=Stop_Codon_Del;gencode_transcript_name=AL645608.2-201;transcript_exon=0;gene_type=protein_coding;gencode_transcript_type=protein_coding;ref_context=ttccgtatccatgagcatggaatgcttc;variant_type=DEL;transcript_id=ENST00000594233.1;gc_content=0.370;gene=AL645608.2;transcript_strand=+

The protein change / cdna string / coding string fields are all incorrect. The positions in the transcript may also be incorrect.

@droazen
Copy link
Contributor

droazen commented Oct 31, 2018

Note that this ticket also applies to variants that span between the UTR and flanking regions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants