HaplotypeCaller makes different variant calls depending on input padding #3697

chandrans · 2017-10-16T16:48:24Z

Bug Report

Affected tool(s)

HaplotypeCaller

Affected version(s)

GATK4.beta5

Description

HaplotypeCaller does not make some calls depending on the padding size around the interval of interest. The variant calls should not be dependent on the interval size. For example,with -ip 50, I get 7 variant calls. But with -ip 150, I get only 2 variant calls. It seems to be an issue with the graph assembly (perhaps due to repeat regions), but adding --allowNonUniqueKmersInRef does not help. In the IGV screenshots below, the top is the original BAM file; the second is the bamout with -ip 50; the third is with -ip 100; the fourth is with -ip 150; the fifth is with -ip 200.

Notice the difference in calls between -ip 50 and -ip 150. The call should be made regardless of -ip.

Steps to reproduce

Files are here:
/humgen/gsa-scr1/schandra/SkyWarrior_HCMissingCalls/GATK_Bugsubmit_10448_haplotypecaller-missing-snp-calls

Commands:
gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.vcf

gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.50.vcf -ip 50

gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.100.vcf -ip 100

gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.150.vcf -ip 150

gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.200.vcf -ip 200

This Issue was generated from your [forums]
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/10448/haplotypecaller-missing-snp-calls/p1

The text was updated successfully, but these errors were encountered:

droazen · 2018-02-05T15:58:03Z

@chandrans Can you try again with the latest 4.x release and report whether this is still an issue (I suspect it is, as it was a problem in GATK3 as well)

chandrans · 2018-02-09T16:47:57Z

Yes, I can confirm it is still an issue.

gokalpcelik · 2018-04-09T10:42:33Z

Hi this is SkyWarrior from GATK forum.
Are there any updates on this issue? (I wish to drop using locus based callers to overcome this problem)

sooheelee · 2018-04-09T14:45:38Z

Hello, @kelepiradam (SkyWarrior).

First, thanks for all your contributions to the GATK forum. The Communications team appreciates you.

When you say:

I wish to drop using locus based callers to overcome this problem

in regards to HaplotypeCaller/Mutect2, do you mean that the issue above is a reason to stop using these tools?

gokalpcelik · 2018-04-09T15:55:13Z

Sorry that sentence was not clear at all. I meant I am using an additional locus based caller in order not to miss those variants missed by haplotypecaller padding issue. I want to drop that duct tape solution from my workflows completely.

sooheelee · 2018-04-09T18:38:26Z

Ah, so locus based caller = pileup caller that calls SNVs. Thanks for the clarification. What I am hearing is that you want a more holistic solution from GATK assembly callers so that you do not need to additionally call with a SNP/SNV pileup caller. Those of us on the Communications Team are keen to ensure our developers hear about your needs, given the developers make the decisions on tool features.

If I am understanding your issue correctly, then one solution is not to use an intervals list at all and to subset variants within desired intervals to after calling. Omission of an exome intervals list is a solution I use in the GATK4 Mutect2 tutorial, because I want to ensure reproducibility for the tutorial calls. Does this solution work for you? Or, is there still some additional value-add that locus-based calling provides? Can you please be more specific?

gokalpcelik · 2018-04-10T05:23:56Z

There is really not much of a value other than a few potential variants being caught pileup callers and this was probably due to my pickiness. Regardless omitting intervals from the HC step to remove the restrain on entrophy calculations may be the ultimate solution as you indicated (Why didn't I try that before? - Probably I was too focussed on getting it done one way). I will benchmark this and let you know. Actually this may be a really interesting finding to share with others in GATK forums regarding best practices workflows.

Thanks @sooheelee

sooheelee · 2018-04-10T14:55:21Z

If you will benchmark this @kelepiradam, that would be awesome.

munrosa · 2018-06-22T19:44:14Z

Has there been any progress on this bug? I have a similar issue that is not identical, but may be related. I'm running two different tissue samples from the same individual and a snp is missed in one, but found in the other. This screen grab shows the original bams that went into HaplotypeCaller and the bamout results for each sample from HaplotypeCaller. I tried dropping the -L from my call and that didn't pick up the variant. I also tried -ip 100 and the SNP still wasn't picked up. I'm using GATK3.8.

droazen · 2018-06-22T19:48:43Z

@ldgauthier / @davidbenjamin Would either of you like to comment on this one? This is a long-standing issue with our assembly-based callers, and it's not clear to me that there's an obvious solution.

davidbenjamin · 2018-06-24T03:40:12Z

In @meganshand's work on mitochondria we found that adjusting -min-pruning to be commensurate with the expected error rate per locus (eg minPruning = depth / 500 or something like that) rescued some false negatives and fixed the issue of interval- and padding-dependent calls. In her case, the sheer complexity of the assembly graph caused two undesirable things: 1) the true variant was missing from the best haplotypes and 2) the evidence for the true variant was diluted beyond recognition among many spurious haplotypes.

This is not always the answer but for high depths or high error rates it is a strong possibility.

I would be curious to see what turning on kmer error correction does, and in the long run I hope that #4868 will help.

munrosa · 2018-06-25T17:03:44Z

I tried several things, the only thing that has made the call was -forceActive.
For -minPruning HaplotypeCaller would only accept integer values, I tested 0,1,2,3, and none of these worked.
I also tried setting -dontTrimActiveRegions and --allowNonUniqueKmersInRef (separately, not simultaneously) neither helped with grabbing the het call.

Based on the fact that forceActive allows the call to be made, it seems like the problem is in the definition of active regions...

munrosa · 2018-06-29T19:36:18Z

I'm going to add to my previous comment, when I used -forceActive alone I picked up that het call I was missing BUT I lost another obvious het call that was getting called when I wasn't using -forceActive. If I use -forceActive and -dontTrimActiveRegions I can pick up both calls. It's still very troubling that these obvious calls are getting missed or not depending on these options...

davidbenjamin · 2018-07-01T01:50:54Z

@munrosa I wouldn't mind taking a look. Could you provide your bam restricted to 1000 bases on each side of the variant?

munrosa · 2018-07-03T18:48:45Z

@davidbenjamin I've got a munrosa_bams_bugreport.tar.gz (2.1 MB) ready for you -- I'm trying to upload to the ftp side via the instructions here, but I haven't been able to get access this morning due to the 20 user limit. Is there any other way I can send it over to you? I'd prefer not to post here.

munrosa · 2018-07-03T19:23:46Z

Nevermind, I was just able to get through and upload the tar.gz.

davidbenjamin · 2018-07-03T19:38:03Z

@munrosa I logged into that ftp server and couldn't see anything. Those instructions are really old and who knows if they still work. It's only 2 MB, so could you just email to [email protected]?

davidbenjamin · 2018-07-03T20:20:52Z

Thanks for the data @munrosa! The first thing that jumps out is the enormous amount of soft clipping in all of your bams. Do you have any idea what it's from? By default, HaplotypeCaller keeps soft-clipped bases because they could be evidence for large indels, but here it seems they are just noise. You could try the --dont-use-soft-clipped-bases argument, which excludes soft clips from assembly and I believe also realignment.

In some cases there is a pattern to the soft clips and perhaps there is something subtler one could do.

munrosa · 2018-07-05T11:20:28Z

Thanks for looking into this @davidbenjamin. I followed the best practices using bwa mem, mark duplicates etc., to create these input bams for HaplotypeCaller. This is Novaseq 2 x 150 data, I ran Fastqc on the reads and everything looks really good, the only thing I can find that might explain the soft-clipping is that there's some Nextera adapter read through on a small percentage of the reads. I haven't been using -Y with bwa (I see it's used in GATK 4 wdls), so it seems like there should be less soft-clipping than normal.

I'll admit these are definitely messy regions we're dealing with, but we really need to make the F5 calls for our clinical pipeline. I just tried --dont-use-soft-clipped-bases and I wasn't able to pick the SNP up in the 55-55003_F5_region.bam, but using forceActive/dontTrimActiveRegions does work on this call.

NandiniBN · 2018-10-08T07:32:10Z

Hello,
I have the same issue reported on the GATK forum last week
https://gatkforums.broadinstitute.org/gatk/discussion/13171/variant-not-being-called-by-hc-gatk-v3-7-0-gcfedb67

Though the above issue is with GATK 3.7, I have also run the same pipeline with GATK4 and the variant is still not called.

is there a solution for this ?

munrosa · 2018-10-22T21:03:14Z

This is an update of where I am at with this issue. It turns out that using --forceActive and --dontTrimActiveRegions only worked for picking up some of the het SNP calls with HaplotypeCaller. A fun side effect was that some calls that were made with the 'vanilla' best practices HC options were now being missed with the forceActive/dontTrim options. So our clinical team decided to use samtools/bcftools for a pileup approach in combination with HC. We call variants with samtools/bcftools then filter the 'samtools' vcf for VAF > 0.15 and pass that vcf to HC with the -L flag to force HC to make these calls. This is working, all of the calls we are trying to pick up are now being found with our combined method. We also run the vanilla best practices HC on our data and merge the vanilla and samtools vcfs after they go through HC for downstream hard filtering and annotation. Part of this hybrid vanilla/samtools method is for continuity, we're been running 'vanilla' HC for awhile now and didn't want to completely drop it for our new samtools/HC calling approach, so we are combining both to be extra conservative. We decided to keep HC around for 2 reasons, 1) it's not going to give us as many false positives as a 'pileup' method and 2) our downstream annotation software has been set up for dealing with HC vcf files and switching to another vcf INFO format would be painful. But it certainly has causes some alarm about the 'unknown unknowns' that we could be missing in a clinical context. All of these troublesome variants checked out with Sanger sequencing, so this is definitely a real issue and the problem is occurring in clinically-relevant genes, such as F5. I'm happy to provide additional info to help the GATK development team figure out why these variants are missed with HC in the 'vanilla' best practices mode.

ldgauthier · 2018-10-23T12:38:55Z

@davidbenjamin did you ever get data to reproduce this issue?

davidbenjamin · 2018-10-30T02:22:44Z

I have the data, just need to find time to steal from Mutect!

davidbenjamin · 2018-12-12T05:22:16Z

@munrosa @ldgauthier Possible breakthrough.

First, what's definitely true about the het at 169510380 in 55_55003_F5region.bam when I reproduce the bug with -L chr1:169510380 -ip 100:

The variant is considered active and triggers assembly, as it should.
For every kmer size there are non-unique kmers in the reference, so it increases up to k = 85, the last attempt at which the engine relaxes the unique kmers requirement. (See ReadThreadingAssembler line 425).
Once it reaches this kmer size, there are cycles in the graph and so no assembly is returned. (See ReadThreadingAssembler line 464). Thus no alt haplotype is discovered and the variant is missed.

I believe there are two possible solutions.

The assembly engine looks for cycles before pruning, but this order could be switched with no ill effects. In the case of this het there are no cycles after pruning because the apparent cycle was a poorly-supported path due to sequencing error. Here regular pruning works but the new --adaptive-pruning option would give a bit more security against false cycles.
We don't actually have to check for cycles, especially in the last, desperate kmer attempt. Well, we do with the current recursive implementation of KBestHaplotypeFinder, but we don't in the Dijkstra's algorithm implementation currently under review: Simplified KBestHaplotypeFinder by replacing recursion with Dijkstra's algorithm #5462. (Technical note: @ldgauthier I know I promised that this PR gives entirely equivalent results to the existing implementation, but technically this is only true if the existing implementation finishes in finite time. Due to the greedy -- but optimal -- nature of Dijkstra's algorithm, cycles do not cause issues).

Personally, I am in favor of both solutions -- looking for cycles after pruning, and waiving the no-cycle requirement on the last attempt. They are complementary.

davidbenjamin · 2018-12-12T18:24:53Z

@kelepiradam @sooheelee @droazen I have diagnosed the problem in the original issue at the top of this page. When we increase interval padding we introduce additional downstream reference sequence that is homologous to kmers with the variants that get dropped. Thus the kmers with the variant never actually get threaded into the graph because we only start threading at unique kmers. When you change the code to start threading from the beginning of the read, you get the variants back.

There is no way to fix this on the command line, although there is a ticket (#4942) to consider doing something about this as @vruano has proposed. At the very least we should add an argument to allow threading to start at non-unique kmers. After some investigation we might want to make it the default behavior. @ldgauthier would you support creating a command-line argument to start threading from the beginning of each sequence?

gokalpcelik · 2018-12-13T05:33:11Z

Awesome news. Looking forward to test it.

ldgauthier · 2018-12-13T15:36:51Z

These all sound like positive improvements. Provided they don't affect performance by dramatically increasing the number of discovered haplotypes, I'm on board. Hopefully this will go a long way towards removing the dependence of calling on the active region boundaries.

droazen · 2018-12-13T15:57:22Z

That's great news @davidbenjamin -- I think this issue has been around as long as the HaplotypeCaller has been in existence, so I imagine that the GATK community will be very appreciative of a solution after all these years! @jamesemery @jonn-smith take note (#3697 (comment))

davidbenjamin · 2018-12-17T16:24:05Z

Looping in @kachulis.

davidbenjamin · 2019-01-06T03:33:42Z

@munrosa Is there any chance we could use part of the data you shared as an integration test within the gatk repository? The repo is public, but we would only need a few hundred bases of your data and could anonymize the sample name and anything else in the header.

We fully understand if this is not possible, and are very grateful for the help you have given already.

munrosa · 2019-01-13T04:07:31Z

@davidbenjamin I checked with our diagnostic lab director about which data can be put on the public repo (anonymized of course). The only file that cannot be used is the one labeled "Exome_NBPF16_SNP.bam", the other bam files I shared with you are from control samples and can be used in the integration test.

davidbenjamin · 2019-01-13T05:43:33Z

@munrosa Wonderful, thank you!!

davidbenjamin · 2019-02-27T21:00:26Z

We just merged PR #5562, which addresses one of @munrosa's missed calls.

I am investigating the harder fix of threading in both directions from the first unique kmer. It seems that there is nothing fundamentally wrong with this change, but it exposes mapping artifacts that we have never had to handle before. I think I know how to address these but it will take a while. Maybe two months, though it's hard to guess.

droazen · 2019-05-06T19:44:33Z

Re-assigning to @jamesemery, as he is working on this problem this quarter.

chandrans added the Vanilla label Oct 16, 2017

chandrans changed the title ~~HaplotypeCaller missing SNP calls with different -ip values~~ HaplotypeCaller makes different variant calls depending on padding Oct 16, 2017

chandrans changed the title ~~HaplotypeCaller makes different variant calls depending on padding~~ HaplotypeCaller makes different variant calls depending on input padding Oct 16, 2017

lbergelson assigned droazen Oct 25, 2017

lbergelson added bug HaplotypeCaller labels Oct 25, 2017

droazen assigned ldgauthier Jun 22, 2018

TomJKono mentioned this issue Jul 2, 2018

Options to Help With GATK HaplotypeCaller Making False Negative Errors MorrellLAB/sequence_handling#37

Closed

ldgauthier assigned davidbenjamin and unassigned ldgauthier Oct 5, 2018

droazen assigned ldgauthier Oct 23, 2018

davidbenjamin mentioned this issue Dec 2, 2018

Adaptive pruning option for local assembly #5473

Merged

HagenC mentioned this issue Jan 3, 2019

Deepvariant and GATK – GT inconsistency. google/deepvariant#135

Closed

droazen added this to the Engine-Q12019 milestone Feb 4, 2019

droazen assigned jamesemery and unassigned droazen, ldgauthier and davidbenjamin May 6, 2019

droazen modified the milestones: Engine-Q12019, Engine-Q2-2019 May 6, 2019

cgpu mentioned this issue Dec 30, 2019

HapCall: Consider adding -forceActive and -dontTrimActiveRegions simultaneously to reduce false negatives cgpu/sarek#10

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HaplotypeCaller makes different variant calls depending on input padding #3697

HaplotypeCaller makes different variant calls depending on input padding #3697

chandrans commented Oct 16, 2017 •

edited

Loading

droazen commented Feb 5, 2018

chandrans commented Feb 9, 2018

gokalpcelik commented Apr 9, 2018 •

edited

Loading

sooheelee commented Apr 9, 2018

gokalpcelik commented Apr 9, 2018

sooheelee commented Apr 9, 2018

gokalpcelik commented Apr 10, 2018

sooheelee commented Apr 10, 2018

munrosa commented Jun 22, 2018

droazen commented Jun 22, 2018 •

edited

Loading

davidbenjamin commented Jun 24, 2018

munrosa commented Jun 25, 2018

munrosa commented Jun 29, 2018

davidbenjamin commented Jul 1, 2018

munrosa commented Jul 3, 2018

munrosa commented Jul 3, 2018

davidbenjamin commented Jul 3, 2018

davidbenjamin commented Jul 3, 2018

munrosa commented Jul 5, 2018

NandiniBN commented Oct 8, 2018 •

edited

Loading

munrosa commented Oct 22, 2018

ldgauthier commented Oct 23, 2018

davidbenjamin commented Oct 30, 2018

davidbenjamin commented Dec 12, 2018

davidbenjamin commented Dec 12, 2018

gokalpcelik commented Dec 13, 2018

ldgauthier commented Dec 13, 2018

droazen commented Dec 13, 2018

davidbenjamin commented Dec 17, 2018

davidbenjamin commented Jan 6, 2019

munrosa commented Jan 13, 2019

davidbenjamin commented Jan 13, 2019

davidbenjamin commented Feb 27, 2019

droazen commented May 6, 2019

HaplotypeCaller makes different variant calls depending on input padding #3697

HaplotypeCaller makes different variant calls depending on input padding #3697

Comments

chandrans commented Oct 16, 2017 • edited Loading

Bug Report

Affected tool(s)

Affected version(s)

Description

Steps to reproduce

droazen commented Feb 5, 2018

chandrans commented Feb 9, 2018

gokalpcelik commented Apr 9, 2018 • edited Loading

sooheelee commented Apr 9, 2018

gokalpcelik commented Apr 9, 2018

sooheelee commented Apr 9, 2018

gokalpcelik commented Apr 10, 2018

sooheelee commented Apr 10, 2018

munrosa commented Jun 22, 2018

droazen commented Jun 22, 2018 • edited Loading

davidbenjamin commented Jun 24, 2018

munrosa commented Jun 25, 2018

munrosa commented Jun 29, 2018

davidbenjamin commented Jul 1, 2018

munrosa commented Jul 3, 2018

munrosa commented Jul 3, 2018

davidbenjamin commented Jul 3, 2018

davidbenjamin commented Jul 3, 2018

munrosa commented Jul 5, 2018

NandiniBN commented Oct 8, 2018 • edited Loading

munrosa commented Oct 22, 2018

ldgauthier commented Oct 23, 2018

davidbenjamin commented Oct 30, 2018

davidbenjamin commented Dec 12, 2018

davidbenjamin commented Dec 12, 2018

gokalpcelik commented Dec 13, 2018

ldgauthier commented Dec 13, 2018

droazen commented Dec 13, 2018

davidbenjamin commented Dec 17, 2018

davidbenjamin commented Jan 6, 2019

munrosa commented Jan 13, 2019

davidbenjamin commented Jan 13, 2019

davidbenjamin commented Feb 27, 2019

droazen commented May 6, 2019

chandrans commented Oct 16, 2017 •

edited

Loading

gokalpcelik commented Apr 9, 2018 •

edited

Loading

droazen commented Jun 22, 2018 •

edited

Loading

NandiniBN commented Oct 8, 2018 •

edited

Loading