-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unclear criteria for when merge sets genotypes missing vs ref homoz #1891
Comments
There is no hidden filtering, so probably you hit a rare bug. Merging of 1379 gVCF files is complex, hard to tell without seeing the data. Can you provide a small reproducible test case please? |
Petr, You have probably already thought this through but I would say the situations that arise are: I also think this is sufficiently complex that a section in the more detailed bcftools web documentation would be useful. I looked there for an explanation and there is nothing explaining the merge logic and algorithm. However I do realize that documentation is not always top of the list ;-) |
Petr,
I have a test case but would like to transfer it separately. Please email me directly when you are ready.
On Mar 24, 2023, at 10:14 AM, Petr Danecek ***@***.***> wrote:
There is no hidden filtering, so probably you hit a rare bug. Merging of 1379 gVCF files is complex, hard to tell without seeing the data. Can you provide a small reproducible test case please?
—
Reply to this email directly, view it on GitHub<#1891 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AKO2SFIAOLIFQR5OUR3IYLTW5WT53ANCNFSM6AAAAAAWFXPSJU>.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
@jcm6t Your github profile does not show an email address, I listed mine on the profile page or you check the README in the bcftools distribution. Thanks |
Thank you for the test case, it was very helpful. The problem should be now fixed, please try it out. As this was a bigger change, please check the outputs carefully and let me know if you spot anything odd. |
Thanks for supporting these tools.
bcftools 1.16, file formats are VCF 4.2.
I have merged 1379 gcvf files using the following pipeline:
These are the top few variants in the chr22 run and first 2 samples:
if we focus on the 2nd sample 77900324, we see that first 4 GTs are missing ./. while the 5th at 10510341 is called 0/0. I'm trying to rationalize why. One of the key issues in multiplex gvcf merging is to utilize confidence in deciding between ./. and 0/0 for rare variants at positions that were not variable in the individual gvcfs, rather than assuming they are all 0/0 because they weren't variable and therefore are more likely to ref homozyg. Note also that I have chosen this example carefully to remove the multi-allelic issues - the positions I am referring to were unique and were SNV biallelic (but I note the split alleles at 10510352).
The block for the first merged variant at 10510105 in the individual 77900324 gvcf is:
ie, this PASSED in the individual gcvf with a GQ of 12, 0/0. Similarly the next 3.
The first called homozygous REF variant in this sample was at 10510341 corresponding to this block:
This PASSED in the gvcf with a GQ of 57, 0/0.
I would have expected all of the top 5 variants to be called 0/0 in the merged bcf file. Is merge applying hidden quality/depth threshold ?
Thanks for your insights.
(I have reproduced the first lines of the gvcf for the sample below for reference:)
The text was updated successfully, but these errors were encountered: