Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HaplotypeCaller makes different variant calls depending on input padding #3697

Open
chandrans opened this issue Oct 16, 2017 · 34 comments
Open

Comments

@chandrans
Copy link
Contributor

chandrans commented Oct 16, 2017

Bug Report

Affected tool(s)

HaplotypeCaller

Affected version(s)

GATK4.beta5

Description

HaplotypeCaller does not make some calls depending on the padding size around the interval of interest. The variant calls should not be dependent on the interval size. For example,with -ip 50, I get 7 variant calls. But with -ip 150, I get only 2 variant calls. It seems to be an issue with the graph assembly (perhaps due to repeat regions), but adding --allowNonUniqueKmersInRef does not help. In the IGV screenshots below, the top is the original BAM file; the second is the bamout with -ip 50; the third is with -ip 100; the fourth is with -ip 150; the fifth is with -ip 200.

screen shot 2017-10-16 at 12 36 32 pm

Notice the difference in calls between -ip 50 and -ip 150. The call should be made regardless of -ip.

Steps to reproduce

Files are here:
/humgen/gsa-scr1/schandra/SkyWarrior_HCMissingCalls/GATK_Bugsubmit_10448_haplotypecaller-missing-snp-calls

Commands:
gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.vcf

gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.50.vcf -ip 50

gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.100.vcf -ip 100

gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.150.vcf -ip 150

gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.200.vcf -ip 200

This Issue was generated from your [forums]
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/10448/haplotypecaller-missing-snp-calls/p1

@chandrans chandrans changed the title HaplotypeCaller missing SNP calls with different -ip values HaplotypeCaller makes different variant calls depending on padding Oct 16, 2017
@chandrans chandrans changed the title HaplotypeCaller makes different variant calls depending on padding HaplotypeCaller makes different variant calls depending on input padding Oct 16, 2017
@droazen
Copy link
Contributor

droazen commented Feb 5, 2018

@chandrans Can you try again with the latest 4.x release and report whether this is still an issue (I suspect it is, as it was a problem in GATK3 as well)

@chandrans
Copy link
Contributor Author

Yes, I can confirm it is still an issue.

@gokalpcelik
Copy link
Contributor

gokalpcelik commented Apr 9, 2018

Hi this is SkyWarrior from GATK forum.
Are there any updates on this issue? (I wish to drop using locus based callers to overcome this problem)

@sooheelee
Copy link
Contributor

Hello, @kelepiradam (SkyWarrior).

First, thanks for all your contributions to the GATK forum. The Communications team appreciates you.

When you say:

I wish to drop using locus based callers to overcome this problem

in regards to HaplotypeCaller/Mutect2, do you mean that the issue above is a reason to stop using these tools?

@gokalpcelik
Copy link
Contributor

Sorry that sentence was not clear at all. I meant I am using an additional locus based caller in order not to miss those variants missed by haplotypecaller padding issue. I want to drop that duct tape solution from my workflows completely.

@sooheelee
Copy link
Contributor

Ah, so locus based caller = pileup caller that calls SNVs. Thanks for the clarification. What I am hearing is that you want a more holistic solution from GATK assembly callers so that you do not need to additionally call with a SNP/SNV pileup caller. Those of us on the Communications Team are keen to ensure our developers hear about your needs, given the developers make the decisions on tool features.

If I am understanding your issue correctly, then one solution is not to use an intervals list at all and to subset variants within desired intervals to after calling. Omission of an exome intervals list is a solution I use in the GATK4 Mutect2 tutorial, because I want to ensure reproducibility for the tutorial calls. Does this solution work for you? Or, is there still some additional value-add that locus-based calling provides? Can you please be more specific?

@gokalpcelik
Copy link
Contributor

There is really not much of a value other than a few potential variants being caught pileup callers and this was probably due to my pickiness. Regardless omitting intervals from the HC step to remove the restrain on entrophy calculations may be the ultimate solution as you indicated (Why didn't I try that before? - Probably I was too focussed on getting it done one way). I will benchmark this and let you know. Actually this may be a really interesting finding to share with others in GATK forums regarding best practices workflows.

Thanks @sooheelee

@sooheelee
Copy link
Contributor

If you will benchmark this @kelepiradam, that would be awesome.

@munrosa
Copy link

munrosa commented Jun 22, 2018

Has there been any progress on this bug? I have a similar issue that is not identical, but may be related. I'm running two different tissue samples from the same individual and a snp is missed in one, but found in the other. This screen grab shows the original bams that went into HaplotypeCaller and the bamout results for each sample from HaplotypeCaller. I tried dropping the -L from my call and that didn't pick up the variant. I also tried -ip 100 and the SNP still wasn't picked up. I'm using GATK3.8.
screen shot 2018-06-21 at 3 50 03 pm

@droazen
Copy link
Contributor

droazen commented Jun 22, 2018

@ldgauthier / @davidbenjamin Would either of you like to comment on this one? This is a long-standing issue with our assembly-based callers, and it's not clear to me that there's an obvious solution.

@davidbenjamin
Copy link
Contributor

In @meganshand's work on mitochondria we found that adjusting -min-pruning to be commensurate with the expected error rate per locus (eg minPruning = depth / 500 or something like that) rescued some false negatives and fixed the issue of interval- and padding-dependent calls. In her case, the sheer complexity of the assembly graph caused two undesirable things: 1) the true variant was missing from the best haplotypes and 2) the evidence for the true variant was diluted beyond recognition among many spurious haplotypes.

This is not always the answer but for high depths or high error rates it is a strong possibility.

I would be curious to see what turning on kmer error correction does, and in the long run I hope that #4868 will help.

@munrosa
Copy link

munrosa commented Jun 25, 2018

I tried several things, the only thing that has made the call was -forceActive.
For -minPruning HaplotypeCaller would only accept integer values, I tested 0,1,2,3, and none of these worked.
I also tried setting -dontTrimActiveRegions and --allowNonUniqueKmersInRef (separately, not simultaneously) neither helped with grabbing the het call.

Based on the fact that forceActive allows the call to be made, it seems like the problem is in the definition of active regions...

@munrosa
Copy link

munrosa commented Jun 29, 2018

I'm going to add to my previous comment, when I used -forceActive alone I picked up that het call I was missing BUT I lost another obvious het call that was getting called when I wasn't using -forceActive. If I use -forceActive and -dontTrimActiveRegions I can pick up both calls. It's still very troubling that these obvious calls are getting missed or not depending on these options...

@davidbenjamin
Copy link
Contributor

@munrosa I wouldn't mind taking a look. Could you provide your bam restricted to 1000 bases on each side of the variant?

@munrosa
Copy link

munrosa commented Jul 3, 2018

@davidbenjamin I've got a munrosa_bams_bugreport.tar.gz (2.1 MB) ready for you -- I'm trying to upload to the ftp side via the instructions here, but I haven't been able to get access this morning due to the 20 user limit. Is there any other way I can send it over to you? I'd prefer not to post here.

@munrosa
Copy link

munrosa commented Jul 3, 2018

Nevermind, I was just able to get through and upload the tar.gz.

@davidbenjamin
Copy link
Contributor

@munrosa I logged into that ftp server and couldn't see anything. Those instructions are really old and who knows if they still work. It's only 2 MB, so could you just email to [email protected]?

@davidbenjamin
Copy link
Contributor

Thanks for the data @munrosa! The first thing that jumps out is the enormous amount of soft clipping in all of your bams. Do you have any idea what it's from? By default, HaplotypeCaller keeps soft-clipped bases because they could be evidence for large indels, but here it seems they are just noise. You could try the --dont-use-soft-clipped-bases argument, which excludes soft clips from assembly and I believe also realignment.

In some cases there is a pattern to the soft clips and perhaps there is something subtler one could do.

@munrosa
Copy link

munrosa commented Jul 5, 2018

Thanks for looking into this @davidbenjamin. I followed the best practices using bwa mem, mark duplicates etc., to create these input bams for HaplotypeCaller. This is Novaseq 2 x 150 data, I ran Fastqc on the reads and everything looks really good, the only thing I can find that might explain the soft-clipping is that there's some Nextera adapter read through on a small percentage of the reads. I haven't been using -Y with bwa (I see it's used in GATK 4 wdls), so it seems like there should be less soft-clipping than normal.

I'll admit these are definitely messy regions we're dealing with, but we really need to make the F5 calls for our clinical pipeline. I just tried --dont-use-soft-clipped-bases and I wasn't able to pick the SNP up in the 55-55003_F5_region.bam, but using forceActive/dontTrimActiveRegions does work on this call.

@NandiniBN
Copy link

NandiniBN commented Oct 8, 2018

Hello,
I have the same issue reported on the GATK forum last week
https://gatkforums.broadinstitute.org/gatk/discussion/13171/variant-not-being-called-by-hc-gatk-v3-7-0-gcfedb67

Though the above issue is with GATK 3.7, I have also run the same pipeline with GATK4 and the variant is still not called.

is there a solution for this ?

@munrosa
Copy link

munrosa commented Oct 22, 2018

This is an update of where I am at with this issue. It turns out that using --forceActive and --dontTrimActiveRegions only worked for picking up some of the het SNP calls with HaplotypeCaller. A fun side effect was that some calls that were made with the 'vanilla' best practices HC options were now being missed with the forceActive/dontTrim options. So our clinical team decided to use samtools/bcftools for a pileup approach in combination with HC. We call variants with samtools/bcftools then filter the 'samtools' vcf for VAF > 0.15 and pass that vcf to HC with the -L flag to force HC to make these calls. This is working, all of the calls we are trying to pick up are now being found with our combined method. We also run the vanilla best practices HC on our data and merge the vanilla and samtools vcfs after they go through HC for downstream hard filtering and annotation. Part of this hybrid vanilla/samtools method is for continuity, we're been running 'vanilla' HC for awhile now and didn't want to completely drop it for our new samtools/HC calling approach, so we are combining both to be extra conservative. We decided to keep HC around for 2 reasons, 1) it's not going to give us as many false positives as a 'pileup' method and 2) our downstream annotation software has been set up for dealing with HC vcf files and switching to another vcf INFO format would be painful. But it certainly has causes some alarm about the 'unknown unknowns' that we could be missing in a clinical context. All of these troublesome variants checked out with Sanger sequencing, so this is definitely a real issue and the problem is occurring in clinically-relevant genes, such as F5. I'm happy to provide additional info to help the GATK development team figure out why these variants are missed with HC in the 'vanilla' best practices mode.

@ldgauthier
Copy link
Contributor

@davidbenjamin did you ever get data to reproduce this issue?

@davidbenjamin
Copy link
Contributor

I have the data, just need to find time to steal from Mutect!

@davidbenjamin
Copy link
Contributor

@munrosa @ldgauthier Possible breakthrough.

First, what's definitely true about the het at 169510380 in 55_55003_F5region.bam when I reproduce the bug with -L chr1:169510380 -ip 100:

  • The variant is considered active and triggers assembly, as it should.
  • For every kmer size there are non-unique kmers in the reference, so it increases up to k = 85, the last attempt at which the engine relaxes the unique kmers requirement. (See ReadThreadingAssembler line 425).
  • Once it reaches this kmer size, there are cycles in the graph and so no assembly is returned. (See ReadThreadingAssembler line 464). Thus no alt haplotype is discovered and the variant is missed.

I believe there are two possible solutions.

  • The assembly engine looks for cycles before pruning, but this order could be switched with no ill effects. In the case of this het there are no cycles after pruning because the apparent cycle was a poorly-supported path due to sequencing error. Here regular pruning works but the new --adaptive-pruning option would give a bit more security against false cycles.
  • We don't actually have to check for cycles, especially in the last, desperate kmer attempt. Well, we do with the current recursive implementation of KBestHaplotypeFinder, but we don't in the Dijkstra's algorithm implementation currently under review: Simplified KBestHaplotypeFinder by replacing recursion with Dijkstra's algorithm #5462. (Technical note: @ldgauthier I know I promised that this PR gives entirely equivalent results to the existing implementation, but technically this is only true if the existing implementation finishes in finite time. Due to the greedy -- but optimal -- nature of Dijkstra's algorithm, cycles do not cause issues).

Personally, I am in favor of both solutions -- looking for cycles after pruning, and waiving the no-cycle requirement on the last attempt. They are complementary.

@davidbenjamin
Copy link
Contributor

@kelepiradam @sooheelee @droazen I have diagnosed the problem in the original issue at the top of this page. When we increase interval padding we introduce additional downstream reference sequence that is homologous to kmers with the variants that get dropped. Thus the kmers with the variant never actually get threaded into the graph because we only start threading at unique kmers. When you change the code to start threading from the beginning of the read, you get the variants back.

There is no way to fix this on the command line, although there is a ticket (#4942) to consider doing something about this as @vruano has proposed. At the very least we should add an argument to allow threading to start at non-unique kmers. After some investigation we might want to make it the default behavior. @ldgauthier would you support creating a command-line argument to start threading from the beginning of each sequence?

@gokalpcelik
Copy link
Contributor

Awesome news. Looking forward to test it.

@ldgauthier
Copy link
Contributor

These all sound like positive improvements. Provided they don't affect performance by dramatically increasing the number of discovered haplotypes, I'm on board. Hopefully this will go a long way towards removing the dependence of calling on the active region boundaries.

@droazen
Copy link
Contributor

droazen commented Dec 13, 2018

That's great news @davidbenjamin -- I think this issue has been around as long as the HaplotypeCaller has been in existence, so I imagine that the GATK community will be very appreciative of a solution after all these years! @jamesemery @jonn-smith take note (#3697 (comment))

@davidbenjamin
Copy link
Contributor

Looping in @kachulis.

@davidbenjamin
Copy link
Contributor

@munrosa Is there any chance we could use part of the data you shared as an integration test within the gatk repository? The repo is public, but we would only need a few hundred bases of your data and could anonymize the sample name and anything else in the header.

We fully understand if this is not possible, and are very grateful for the help you have given already.

@munrosa
Copy link

munrosa commented Jan 13, 2019

@davidbenjamin I checked with our diagnostic lab director about which data can be put on the public repo (anonymized of course). The only file that cannot be used is the one labeled "Exome_NBPF16_SNP.bam", the other bam files I shared with you are from control samples and can be used in the integration test.

@davidbenjamin
Copy link
Contributor

@munrosa Wonderful, thank you!!

@droazen droazen added this to the Engine-Q12019 milestone Feb 4, 2019
@davidbenjamin
Copy link
Contributor

We just merged PR #5562, which addresses one of @munrosa's missed calls.

I am investigating the harder fix of threading in both directions from the first unique kmer. It seems that there is nothing fundamentally wrong with this change, but it exposes mapping artifacts that we have never had to handle before. I think I know how to address these but it will take a while. Maybe two months, though it's hard to guess.

@droazen
Copy link
Contributor

droazen commented May 6, 2019

Re-assigning to @jamesemery, as he is working on this problem this quarter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants