HaplotypeCaller makes different variant calls depending on input padding #3697
Comments
@chandrans Can you try again with the latest 4.x release and report whether this is still an issue? (I suspect it is, as it was a problem in GATK3 as well.) |
Yes, I can confirm it is still an issue. |
Hi, this is SkyWarrior from the GATK forum. |
Hello, @kelepiradam (SkyWarrior). First, thanks for all your contributions to the GATK forum. The Communications team appreciates you. When you say:
in regards to HaplotypeCaller/Mutect2, do you mean that the issue above is a reason to stop using these tools? |
Sorry, that sentence was not clear at all. I meant that I am using an additional locus-based caller so as not to miss the variants that HaplotypeCaller drops because of the padding issue. I want to drop that duct-tape solution from my workflows completely. |
Ah, so if I am understanding your issue correctly, then one solution is to not use an intervals list at all and to subset the variants to the desired intervals after calling. Omission of an exome intervals list is a solution I use in the GATK4 Mutect2 tutorial, because I want to ensure reproducibility for the tutorial calls. Does this solution work for you? Or is there still some additional value-add that locus-based calling provides? Can you please be more specific? |
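A minimal sketch of the "call everywhere, subset afterwards" idea described above, written as plain Python rather than as any GATK command. The function name `subset_to_intervals` and the tuple layouts are hypothetical, chosen only for illustration; the sketch assumes VCF-style 1-based positions and BED-style intervals (0-based start, exclusive end).

```python
from typing import Iterable, List, Tuple

Variant = Tuple[str, int]        # (contig, 1-based position), as in a VCF record
Interval = Tuple[str, int, int]  # (contig, 0-based start, exclusive end), as in BED


def subset_to_intervals(variants: Iterable[Variant],
                        intervals: Iterable[Interval]) -> List[Variant]:
    """Keep only the variants whose position falls inside at least one interval."""
    interval_list = list(intervals)
    kept = []
    for contig, pos in variants:
        for i_contig, start, end in interval_list:
            # A BED interval [start, end) covers 1-based positions start+1 .. end.
            if contig == i_contig and start < pos <= end:
                kept.append((contig, pos))
                break  # one matching interval is enough
    return kept
```

Because the filter runs after calling, the set of candidate calls no longer depends on how the intervals were padded; only the final reporting step does.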
There is really not much value other than a few potential variants being caught by pileup callers, and this was probably due to my pickiness. Regardless, omitting intervals from the HC step to remove the restraint on entropy calculations may be the ultimate solution, as you indicated (Why didn't I try that before? Probably I was too focussed on getting it done one way). I will benchmark this and let you know. Actually, this may be a really interesting finding to share with others in the GATK forums regarding best practices workflows. Thanks @sooheelee |
If you will benchmark this @kelepiradam, that would be awesome. |
@ldgauthier / @davidbenjamin Would either of you like to comment on this one? This is a long-standing issue with our assembly-based callers, and it's not clear to me that there's an obvious solution. |
In @meganshand's work on mitochondria we found that adjusting This is not always the answer but for high depths or high error rates it is a strong possibility. I would be curious to see what turning on kmer error correction does, and in the long run I hope that #4868 will help. |
I tried several things, the only thing that has made the call was -forceActive. Based on the fact that forceActive allows the call to be made, it seems like the problem is in the definition of active regions... |
I'm going to add to my previous comment, when I used -forceActive alone I picked up that het call I was missing BUT I lost another obvious het call that was getting called when I wasn't using -forceActive. If I use -forceActive and -dontTrimActiveRegions I can pick up both calls. It's still very troubling that these obvious calls are getting missed or not depending on these options... |
@munrosa I wouldn't mind taking a look. Could you provide your bam restricted to 1000 bases on each side of the variant? |
@davidbenjamin I've got a munrosa_bams_bugreport.tar.gz (2.1 MB) ready for you. I'm trying to upload it to the ftp site via the instructions here, but I haven't been able to get access this morning due to the 20 user limit. Is there any other way I can send it over to you? I'd prefer not to post here. |
Nevermind, I was just able to get through and upload the tar.gz. |
@munrosa I logged into that ftp server and couldn't see anything. Those instructions are really old and who knows if they still work. It's only 2 MB, so could you just email to [email protected]? |
Thanks for the data @munrosa! The first thing that jumps out is the enormous amount of soft clipping in all of your bams. Do you have any idea what it's from? By default, HaplotypeCaller keeps soft-clipped bases because they could be evidence for large indels, but here it seems they are just noise. You could try the In some cases there is a pattern to the soft clips and perhaps there is something subtler one could do. |
Thanks for looking into this @davidbenjamin. I followed the best practices using bwa mem, mark duplicates etc., to create these input bams for HaplotypeCaller. This is Novaseq 2 x 150 data, I ran Fastqc on the reads and everything looks really good, the only thing I can find that might explain the soft-clipping is that there's some Nextera adapter read through on a small percentage of the reads. I haven't been using -Y with bwa (I see it's used in GATK 4 wdls), so it seems like there should be less soft-clipping than normal. I'll admit these are definitely messy regions we're dealing with, but we really need to make the F5 calls for our clinical pipeline. I just tried --dont-use-soft-clipped-bases and I wasn't able to pick the SNP up in the 55-55003_F5_region.bam, but using forceActive/dontTrimActiveRegions does work on this call. |
Hello. Though the above issue is with GATK 3.7, I have also run the same pipeline with GATK4 and the variant is still not called. Is there a solution for this? |
This is an update on where I am with this issue. It turns out that using --forceActive and --dontTrimActiveRegions only worked for picking up some of the het SNP calls with HaplotypeCaller. A fun side effect was that some calls that were made with the 'vanilla' best practices HC options were now being missed with the forceActive/dontTrim options. So our clinical team decided to use samtools/bcftools for a pileup approach in combination with HC. We call variants with samtools/bcftools, then filter the 'samtools' vcf for VAF > 0.15 and pass that vcf to HC with the -L flag to force HC to make these calls. This is working: all of the calls we are trying to pick up are now being found with our combined method. We also run the vanilla best practices HC on our data and merge the vanilla and samtools vcfs after they go through HC for downstream hard filtering and annotation. Part of this hybrid vanilla/samtools method is for continuity; we've been running 'vanilla' HC for a while now and didn't want to completely drop it for our new samtools/HC calling approach, so we are combining both to be extra conservative. We decided to keep HC around for 2 reasons: 1) it's not going to give us as many false positives as a 'pileup' method and 2) our downstream annotation software has been set up for dealing with HC vcf files and switching to another vcf INFO format would be painful. But it certainly has caused some alarm about the 'unknown unknowns' that we could be missing in a clinical context. All of these troublesome variants checked out with Sanger sequencing, so this is definitely a real issue and the problem is occurring in clinically-relevant genes, such as F5. I'm happy to provide additional info to help the GATK development team figure out why these variants are missed with HC in the 'vanilla' best practices mode. |
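The VAF > 0.15 filter in the workaround above can be sketched as follows. This is only an illustration, not the clinical team's actual script: it assumes VAF is computed as alt reads divided by ref plus alt reads, and the function names and the `min_vaf` parameter are hypothetical.

```python
def variant_allele_fraction(ref_depth: int, alt_depth: int) -> float:
    """Fraction of reads supporting the alt allele (assumed definition of VAF)."""
    total = ref_depth + alt_depth
    # Sites with no coverage get a VAF of 0.0 so they never pass the filter.
    return alt_depth / total if total > 0 else 0.0


def passes_vaf_filter(ref_depth: int, alt_depth: int, min_vaf: float = 0.15) -> bool:
    """True when the site's VAF is strictly greater than the threshold."""
    return variant_allele_fraction(ref_depth, alt_depth) > min_vaf
```

Sites that pass such a filter would then be the ones handed to HC as the -L list, as described in the comment.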
@davidbenjamin did you ever get data to reproduce this issue? |
I have the data, just need to find time to steal from Mutect! |
@munrosa @ldgauthier Possible breakthrough. First, what's definitely true about the het at 169510380 in 55_55003_F5region.bam when I reproduce the bug with
I believe there are two possible solutions.
Personally, I am in favor of both solutions -- looking for cycles after pruning, and waiving the no-cycle requirement on the last attempt. They are complementary. |
@kelepiradam @sooheelee @droazen I have diagnosed the problem in the original issue at the top of this page. When we increase interval padding we introduce additional downstream reference sequence that is homologous to kmers with the variants that get dropped. Thus the kmers with the variant never actually get threaded into the graph because we only start threading at unique kmers. When you change the code to start threading from the beginning of the read, you get the variants back. There is no way to fix this on the command line, although there is a ticket (#4942) to consider doing something about this as @vruano has proposed. At the very least we should add an argument to allow threading to start at non-unique kmers. After some investigation we might want to make it the default behavior. @ldgauthier would you support creating a command-line argument to start threading from the beginning of each sequence? |
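The effect described in the diagnosis above can be illustrated with a toy model. This is not GATK's read-threading code. It assumes, purely for illustration, that a kmer counts as non-unique when it occurs more than once in the reference window, and that threading of a read begins at its first kmer outside that set. The function names and the sequences are hypothetical.

```python
from collections import Counter
from typing import Optional, Set


def repeated_kmers(seq: str, k: int) -> Set[str]:
    """Kmers that occur more than once in seq (toy definition of 'non-unique')."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}


def first_threadable_offset(read: str, non_unique: Set[str], k: int) -> Optional[int]:
    """Offset of the first kmer in the read that is not in the non-unique set."""
    for i in range(len(read) - k + 1):
        if read[i:i + k] not in non_unique:
            return i
    return None  # every kmer in the read is non-unique


K = 3
SHORT_REF = "AAGCCGA"                  # hypothetical window with little padding
LONG_REF = SHORT_REF + "TATGCATGCT"    # more padding adds a repeat downstream
READ = "ATGCCGA"                       # hypothetical read, SNP at index 1 (A>T)

start_short = first_threadable_offset(READ, repeated_kmers(SHORT_REF, K), K)
start_long = first_threadable_offset(READ, repeated_kmers(LONG_REF, K), K)
```

In this toy, `start_short` is 0, so the SNP at index 1 is threaded, while `start_long` is 2, so threading starts after the SNP and the variant bases never enter the graph. Starting from the beginning of the read regardless of uniqueness, as proposed in the comment, would bring them back.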
Awesome news. Looking forward to testing it. |
These all sound like positive improvements. Provided they don't affect performance by dramatically increasing the number of discovered haplotypes, I'm on board. Hopefully this will go a long way towards removing the dependence of calling on the active region boundaries. |
That's great news @davidbenjamin -- I think this issue has been around as long as the |
Looping in @kachulis. |
@munrosa Is there any chance we could use part of the data you shared as an integration test within the gatk repository? The repo is public, but we would only need a few hundred bases of your data and could anonymize the sample name and anything else in the header. We fully understand if this is not possible, and are very grateful for the help you have given already. |
@davidbenjamin I checked with our diagnostic lab director about which data can be put on the public repo (anonymized of course). The only file that cannot be used is the one labeled "Exome_NBPF16_SNP.bam", the other bam files I shared with you are from control samples and can be used in the integration test. |
@munrosa Wonderful, thank you!! |
We just merged PR #5562, which addresses one of @munrosa's missed calls. I am investigating the harder fix of threading in both directions from the first unique kmer. It seems that there is nothing fundamentally wrong with this change, but it exposes mapping artifacts that we have never had to handle before. I think I know how to address these but it will take a while. Maybe two months, though it's hard to guess. |
Re-assigning to @jamesemery, as he is working on this problem this quarter. |
Bug Report
Affected tool(s)
HaplotypeCaller
Affected version(s)
GATK4.beta5
Description
HaplotypeCaller does not make some calls depending on the padding size around the interval of interest. The variant calls should not be dependent on the interval size. For example, with -ip 50, I get 7 variant calls. But with -ip 150, I get only 2 variant calls. It seems to be an issue with the graph assembly (perhaps due to repeat regions), but adding --allowNonUniqueKmersInRef does not help. In the IGV screenshots below, the top is the original BAM file; the second is the bamout with -ip 50; the third is with -ip 100; the fourth is with -ip 150; the fifth is with -ip 200.
Notice the difference in calls between -ip 50 and -ip 150. The call should be made regardless of -ip.
Steps to reproduce
Files are here:
/humgen/gsa-scr1/schandra/SkyWarrior_HCMissingCalls/GATK_Bugsubmit_10448_haplotypecaller-missing-snp-calls
Commands:
gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.vcf
gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.50.vcf -ip 50
gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.100.vcf -ip 100
gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.150.vcf -ip 150
gatk-4.beta.5/gatk-launch HaplotypeCaller -R reference/hg19_ref-ym.fa -I MLC1_Exome_Depth208.bam -L region.bed -O Sheila.HaplotypeCaller.200.vcf -ip 200
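To see which calls change between runs like the ones above, the output VCFs can be compared site by site. The sketch below is a hedged illustration, not part of the original report: `read_sites` and `padding_dependent_sites` are hypothetical helper names, and the parser assumes plain tab-separated VCF text with the contig and position in the first two columns.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]  # (contig, 1-based position)


def read_sites(vcf_text: str) -> Set[Site]:
    """Collect (contig, position) for every record in uncompressed VCF text."""
    sites = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.split("\t")
        sites.add((fields[0], int(fields[1])))
    return sites


def padding_dependent_sites(calls_by_padding: Dict[int, Set[Site]]) -> Set[Site]:
    """Sites called in at least one run but not in every run."""
    call_sets = list(calls_by_padding.values())
    if not call_sets:
        return set()
    return set.union(*call_sets) - set.intersection(*call_sets)
```

Given the text of each output file keyed by its -ip value, an empty result from `padding_dependent_sites` would mean the calls do not depend on padding, which is the behavior the report expects.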
This Issue was generated from your [forums]
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/10448/haplotypecaller-missing-snp-calls/p1