Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nan values in the result #184

Open
voichek opened this issue Nov 6, 2018 · 15 comments
Open

nan values in the result #184

voichek opened this issue Nov 6, 2018 · 15 comments
Assignees
Milestone

Comments

@voichek
Copy link

voichek commented Nov 6, 2018

Hi,

When using GEMMA version 0.98 but not 0.96 I get only nan values in the result file:

$ gemma -bfile snps.plink -lmm 2 -k K -o results.0_96
Reading Files ...
## number of total individuals = 1135
## number of analyzed individuals = 50
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs = 11769920
## number of analyzed SNPs = 1414417
Start Eigen-Decomposition...
pve estimate =0.724026
se(pve) =0.816976
Reading SNPs  ==================================================100.00%
$ head output/results.0_96.assoc.txt -n3
chr     rs      ps      n_miss  allele1 allele0 af      l_mle   p_lrt
1       .       540     1       A       G       0.020   1.000000e+05    1.073966e-01
1       .       603     2       A       G       0.083   1.000000e+05    1.380079e-01
$ cat output/results.0_96.assoc.txt | awk '$9 == "-nan" ' | wc -l
0


$ gemma-0.98-linux-static -bfile snps.plink -lmm 2 -k K -o results.0_98
GEMMA 0.98 (2018-09-28) by Xiang Zhou and team (C) 2012-2018
Reading Files ...
## number of total individuals = 1135
## number of analyzed individuals = 50
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        = 11769920
## number of analyzed SNPs         =  1414417
Start Eigen-Decomposition...
pve estimate =0.724026
se(pve) =0.816976
================================================== 100%
$ head output/results.0_98.assoc.txt -n3
chr     rs      ps      n_miss  allele1 allele0 af      logl_H1 l_mle   p_lrt
1       .       540     1       A       G       0.020   -nan    1.000000e+05    -nan
1       .       603     2       A       G       0.083   -nan    1.000000e+05    -nan
$ cat output/results.0_98.assoc.txt | awk '$10 == "-nan" ' | wc -l
1414417
$ cat output/results.0_98.assoc.txt | awk '$10 != "-nan" ' | wc -l
1

I have uploaded this example, if you want to take a look.

Hope you can help me,
Yoav

@pjotrp
Copy link
Member

pjotrp commented Nov 6, 2018

Hi @voichek,

Thanks for reporting. I'll take a look. It may take up to a week because I am on the road.

@pjotrp pjotrp self-assigned this Nov 6, 2018
@pjotrp
Copy link
Member

pjotrp commented Nov 21, 2018

When I run with debug and check switches I get:

~/gemma-0.98-linux-static -bfile snps.plink -lmm 2 -k K -o results.0_98 -debug -check
GEMMA 0.98 (2018-09-28) by Xiang Zhou and team (C) 2012-2018
Reading Files ... 
**** DEBUG: entered in src/gemma_io.cpp at line 511 in ReadFile_bim
**** DEBUG: entered in src/gemma_io.cpp at line 558 in ReadFile_fam
**** DEBUG: entered in src/gemma_io.cpp at line 857 in ReadFile_bed

FATAL ERROR: GEMMA caused a floating point error which suggests machine boundaries were reached.

You can disable floating point tests with the -no-check switch (use at your own risk!)

Floating point exception

@pjotrp
Copy link
Member

pjotrp commented Nov 21, 2018

A debug stack trace shows

#0  0x0000000000432405 in ReadFile_bed (file_bed=..., setSnps=..., W=W@entry=0x1a5a540, indicator_idv=..., indicator_snp=..., snpInfo=..., 
    maf_level=@0x7fffffffdec0: 0.01, miss_level=@0x7fffffffdeb8: 0.050000000000000003, hwe_level=@0x7fffffffdec8: 0, 
    r2_level=@0x7fffffffded0: 0.99990000000000001, ns_test=@0x7fffffffe2f8: 0) at src/gemma_io.cpp:973
#1  0x00000000004187d4 in PARAM::ReadFiles (this=this@entry=0x7fffffffd9d0) at src/param.cpp:277
#2  0x0000000000454112 in GEMMA::BatchRun (this=this@entry=0x7fffffffe790, cPar=...) at src/gemma.cpp:1645
#3  0x000000000048d6ad in main (argc=11, argv=0x7fffffffe928) at src/main.cpp:86

shows a division by zero

maf /= 2.0 * (double)(ni_test - n_miss);

where the number of test individuals exactly matches the number of NA individuals. I'll add a fix for that. The behaviour of gemma 0.96 is rather unspecified here and 0.98 should have taken the right action (i.e., drop the SNP or halt).

pjotrp added a commit to genenetwork/GEMMA that referenced this issue Nov 21, 2018
@voichek
Copy link
Author

voichek commented Nov 21, 2018

Dear @pjotrp,

I am not sure I understand, I do have individuals with phenotypic data (no nan) in this dataset.
How come all were thrown?

Thanks,
Yoav

@pjotrp
Copy link
Member

pjotrp commented Nov 21, 2018

The problem is in the genotypes. Gemma selects a subset of individuals and for one SNP all were missing.

@pjotrp
Copy link
Member

pjotrp commented Nov 21, 2018

Btw you also have missing values in your .fam file. That is why gemma only selects 50 individuals.

@voichek
Copy link
Author

voichek commented Nov 21, 2018

I know, it is supposed to be this way.

Thank you for the help!
I will have to wait for the next release for this to be fixed?

@pjotrp
Copy link
Member

pjotrp commented Nov 21, 2018

I am looking into it. If there is a fix I can make it available.

@pjotrp
Copy link
Member

pjotrp commented Nov 21, 2018

I spent some time on this and it is actually quite tricky.

When a SNP only contains NAs it is still included in the computation for Plink. I am working on a rewrite of GEMMA which should allow fixing such problems much easier. For now, all I can suggest is to use 0.96 for this particular edge case.

If you want to make sure the answer is correct you can also convert the plink files to BIMBAM and drop all SNPs with genotypes that have only missing values for the individuals you are testing.

@pjotrp pjotrp added this to the Later milestone Nov 21, 2018
@voichek
Copy link
Author

voichek commented Nov 21, 2018

Thank you!
I guess condensing the plink file only to the individuals I use will also work. Though, due to performance consideration I would prefer not to add more steps.

@pjotrp
Copy link
Member

pjotrp commented Nov 22, 2018

Are you creating a pipeline? Is that public information? Coming year we want to create some pipelines using Galaxy/CWL.

@voichek
Copy link
Author

voichek commented Nov 22, 2018

I am writing some specific pipeline to my project. I don't think there are any competing interests.

@pjotrp
Copy link
Member

pjotrp commented Nov 22, 2018

That is not what I am worried about ;). I am merely interested in sharing ideas.

@voichek
Copy link
Author

voichek commented Nov 22, 2018

Soon I hope :)

@voichek
Copy link
Author

voichek commented Jun 18, 2020

Regarding the pipeline I was building using GEMMA, it is now public here. I am calculating associations of k-mers presence/absence calculated directly from sequencing reads, without looking at the genome. I have my own code for the initial approximation of associations as there are a lot of k-mers (hundred of millions or more) and then I use GEMMA for calculating the exact score on a subset of filter variants.

(Just remembered that you asked me previously)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants