Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HaplotypeCaller outputs incorrect genotype at spanning deletion/SNP site #2960

Closed
droazen opened this issue Jun 5, 2017 · 3 comments
Closed

Comments

@droazen
Copy link
Contributor

droazen commented Jun 5, 2017

@vdauwera commented on Wed Mar 22 2017

@chandrans commented on Mon Oct 24 2016

Bug Report

Affected tool(s)

HaplotypeCaller

Affected version(s)

3.6

Description

At this position:
screen shot 2016-10-24 at 3 55 16 pm
HaplotypeCaller outputs this to the VCF:
9 418269 . TTTTG T 366.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.359;ClippingRankSum=0.000;DP=36;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.90;MQRankSum=2.404;QD=11.11;ReadPosRankSum=0.041;SOR=0.914 GT:AD:DP:GQ:PL 0/1:21,12:33:99:404,0,760

9 418272 . T C 509.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.075;ClippingRankSum=0.000;DP=36;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.90;MQRankSum=-2.152;QD=14.56;ReadPosRankSum=-0.236;SOR=0.784 GT:AD:DP:GQ:PL 0/1:14,21:35:99:538,0,505

The SNP/spanning deletion site is at position 418272. Notice HaplotypeCaller assigns a genotype of 0/1 (T/C). The genotype should really be 1/2 (C/*) with the * allele as the other alternate allele.

Steps to reproduce

Commands and files below.

This Issue was generated from your forums


@chandrans commented on Mon Oct 24 2016

Test Files here:
/humgen/gsa-scr1/schandra/bgrenier_MixingAndMatchingGVCFAndBPRES

Command for HaplotypeCaller:
java -jar /humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /humgen/gsa-hpprojects/GATK/bundle/2.8/b37/human_g1k_v37.fasta -I /humgen/gsa-scr1/schandra/bgrenier_MixingAndMatchingGVCFAndBPRES/ind2.bam -o Sheila.HaplotypeCaller.vcf -L 9 -bamout Sheila.HaplotypeCaller.bam

HaplotypeCaller in GVCF mode:
java -jar /humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /humgen/gsa-hpprojects/GATK/bundle/2.8/b37/human_g1k_v37.fasta -I /humgen/gsa-scr1/schandra/bgrenier_MixingAndMatchingGVCFAndBPRES/ind2.bam -L 9 -o Sheila.HaplotypeCallerGVCF.g.vcf -ERC GVCF

GenotypeGVCFs:
java -jar /humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /humgen/gsa-hpprojects/GATK/bundle/2.8/b37/human_g1k_v37.fasta -V Sheila.HaplotypeCallerGVCF.g.vcf -o Sheila.GenotypeGVCFsGVCF.vcf


@chandrans commented on Sat Dec 03 2016

User asked about this. Probably won't get to it anytime soon, but I told him/her I would check in.


@vdauwera commented on Mon Dec 05 2016

I responded that we're waiting on the tie-outs.


@ldgauthier commented on Tue Dec 06 2016

Tie-outs?

I started on this, but I definitely won't finish it before the release.


@vdauwera commented on Tue Dec 06 2016

Functional equivalence of GATK4 ports. That's my party line for now. Didn't realize you had actually started on this... but yeah, 3.7 is going out today.


@ronlevine commented on Mon Jan 23 2017

@davidbenjamin Any thoughts on this? Laura thought she implemented a fix but it have any effect. I am becoming familiar with the HC code but it's slow going. While stepping though the code, I noticed it correctly identifies the events for the deletion and SNP.


@davidbenjamin commented on Mon Jan 23 2017

@ronlevine I would put in a few breakpoints to see where the spanning deletion allele gets lost from the SNP site VariantContext. I just ran things with the new qual out of curiosity and the bug persists, so it's probably not an issue of previous bugs with multiallelic sites.


@ronlevine commented on Mon Jan 23 2017

@davidbenjamin That's what I have been doing. I could not find where the spanning deletion alleles was ever added. I agree with Laura's code that the spanning deletion alleles should be added to the events in HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods.


@ronlevine commented on Tue Jan 24 2017

@davidbenjamin It looks like the issue is with ReferenceConfidenceModel.getOverlappingVariantContext(final GenomeLoc curPos, final Collection<VariantContext> maybeOverlapping), with the stack trace:

ReferenceConfidenceModel.calculateRefConfidence
ReferenceConfidenceModel.getOverlappingVariantContext
HaplotypeCaller.map

For curPos=9:418272, there are 2 variants, 9:418269-418273 TTTTG*,<NON_REF>,T and 9:418272 T*,<NON_REF>,C. This method returns the variant with the right-most start , so the variant with the deletion is ignored. This logic should be changed so that the variants that overlap curPos are merged and returned. A utility similar to GATKVariantContextUtils.simpleMerge might work.


@ldgauthier commented on Wed Mar 01 2017

I thought the state I left that branch was that it would output a merged
representation of the two events starting at the deletion position- so the
reference is some string of bases, allele 1 is the deletion, and allele 2
matches the reference for the length of the deletion with the exception of
the SNP. (Allele 2 is not the minimal representation yet.) That's the first
step to get the genotype right. After that we need to break up the events,
clean up the representation, and assign the genotype from the combined
event to both of them.

Hopefully that helps. (And hopefully I actually committed the version of
the branch that does what I said.)

On Jan 24, 2017 12:06 PM, "Ron Levine" [email protected] wrote:

@davidbenjamin https://github.com/davidbenjamin It looks like the issue
is with ReferenceConfidenceModel.getOverlappingVariantContext(final
GenomeLoc curPos, final Collection maybeOverlapping),
with the stack trace:

ReferenceConfidenceModel.calculateRefConfidence
ReferenceConfidenceModel.getOverlappingVariantContext
HaplotypeCaller.map

For curPos=9:418272, there are 2 variants, 9:418269-418273
TTTTG*,<NON_REF>,T and 9:418272 T*,<NON_REF>,C. This method returns the
variant with the right-most start , so the variant with the deletion is
ignored. This logic should be changed so that the variants that overlap
curPos are merged and returned. A utility such as GATKVariantContextUtils.
simpleMerge might work.


You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
https://github.com/broadinstitute/gsa-unstable/issues/1499#issuecomment-274868889,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGRhdCAl7601U3RRbph6rOi0E7dqNawRks5rVi-fgaJpZM4KfOWm
.

@droazen
Copy link
Contributor Author

droazen commented Apr 13, 2018

@cwhelan mentioned that he plans to take a stab at fixing this, so I've added him as a tentative co-assignee. He'll send it back to us if the fix proves too hairy.

@cwhelan
Copy link
Member

cwhelan commented Apr 13, 2018

It would be nice if the fix also worked for genotype-given-alleles mode in the case where you are given a spanning deletion and a SNP to genotype in the input alleles. I will try to make sure that works.

HC GGA mode will, I guess, not really return the right result if the input alleles include only the SNP but not its spanning deletion.

@cwhelan
Copy link
Member

cwhelan commented Sep 6, 2018

Fixed via #4963

@cwhelan cwhelan closed this as completed Sep 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants