Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gtcheck error #1444

Closed
NstVasileva opened this issue Mar 15, 2021 · 4 comments
Closed

Gtcheck error #1444

NstVasileva opened this issue Mar 15, 2021 · 4 comments

Comments

@NstVasileva
Copy link

NstVasileva commented Mar 15, 2021

Hello!
Comparison ChIP-seq vcf file with breaked regions of gvcf file of same sample results in error output:

#DC     [2]Query Sample [3]Genotyped Sample     [4]Discordance  [5]-log P(HWE)  [6]Number of sites compared
DC      LV3000651846_S12        204182100154_R01C01     0.000000e+00    0.000000e+00    0

Though i expect to get some value in number of sites. I tried to do bcftools norm for both files (genotype and query) but it resulted in too little number of sites compared:

#DC     [2]Query Sample [3]Genotyped Sample     [4]Discordance  [5]-log P(HWE)  [6]Number of sites compared
DC      LV3000651846_S12        204182100154_R01C01     2.873160e+02    3.541719e+05    39623

At the same time when i compare g file against itself I get error result as well:

#DC     [2]Query Sample [3]Genotyped Sample     [4]Discordance  [5]-log P(HWE)  [6]Number of sites compared
DC      LV3000651846_S12        LV3000651846_S12        5.293628e+04    1.326684e+07    673538

When I did the same using bcftools 1.9 the result was pretty good (79 discordance and 79800 number of sites compared).

What i did wrong?

My command is bcftools gtcheck -g file1.vcf.gz file2.vcf.gz for gtcheck.

Also i checked if bcftools 1.9 gives right result for comparison of file against itself. And output was:

[1]CN [2]Discordance with LV3000651846_S12 (total)    [3]Discordance (avg score per site)     [4]Number of sites compared     [5]Sample       [6]Sample ID
CN      0.000000e+00    0.000000e+00    515052  LV3000651846_S12        0

which is correct I think.

@pd3
Copy link
Member

pd3 commented Mar 23, 2021

Ah, I probably know what is happening. Can you please take a look what FORMAT tags are available in the two files, but not just in the header, but in the actual data lines?

@NstVasileva
Copy link
Author

Hello, Petr,
These are slices of two compared VCFs:

chr1	3775163	rs1175549	T	A,C	.	PASS	.	GT:GQ	1/1:10
chr1	3781099	exm7644	T	C	.	PASS	.	GT:GQ	1/1:5
chr1	3781105	exm7645	G	C	.	PASS	.	GT:GQ	1/1:2
chr1	3781111	exm7647	G	C	.	PASS	.	GT:GQ	0/0:3
chr1	3781602	var_1_3698166	A	T,C	.	PASS	.	GT:GQ	2/2:3
chr1	3782656	variant.434	A	T,C	.	PASS	.	GT:GQ	2/2:3
chr1	3775163	.	A	<NON_REF>	.	PASS	.	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:77,0:77:81:55:0,81,1800:0,81,255:40,0
chr1	3781099	.	C	<NON_REF>	.	PASS	.	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:65,0:65:99:55:0,120,1800:0,142,255:40,0
chr1	3781105	.	C	<NON_REF>	.	PASS	.	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:65,0:65:99:55:0,120,1800:0,142,255:40,0
chr1	3781111	.	G	<NON_REF>	.	PASS	.	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:65,0:65:99:55:0,120,1800:0,142,255:40,0
chr1	3781602	.	C	<NON_REF>	.	PASS	.	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:62,0:62:99:46:0,120,1800:0,126,255:40,0
chr1	3782656	.	C	<NON_REF>	.	PASS	.	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:73,0:73:99:55:0,120,1800:0,134,255:40,1
  1. There is option “-G” in bcftools 1.9 gtcheck, that excludes PL field from handling.
  2. As you see, there is no value in REF field in one vcf, but I think it’s not a problem, as they mapped on the same reference.
  3. The main problem is that comparing even a single file with itself doesn’t give a discordance value equal to zero in bcftools 1.11 gtcheck.

Thank you!

@pd3
Copy link
Member

pd3 commented Mar 30, 2021

Thank you for the data. I see several problems, some in the input data, some in bcftools. Let's take a look at the data first: why is the reference base different even though the positions are identical? Either one of them is an invalid VCF (the REF allele must match the reference), or maybe they are both on a different genome build? The files cannot be compared as they are.

@pd3 pd3 closed this as completed in 7173313 Apr 1, 2021
@pd3
Copy link
Member

pd3 commented Apr 1, 2021

Hi, I added a bunch of diagnostic messages that should help to diagnose such problems in the future:

  • the program works only for biallelic sites, so when providing a gVCF, all sites with ALT=<NON_REF> are skipped
  • also sites with multiple ALT alleles are skipped, please use bcftools norm -m - to break into biallelic sites

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants