-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAF filters and p-value distribution. #83
Comments
The U-shape is totally expected since we do genomic control by default, which makes p-values conversative when there is some true signal (cf. https://doi.org/10.1038%2Fejhg.2011.39). BTW, I would use K=5 from the scree plot you presented, but it would be also good to look at PC scores to see which PCs capture some pop structure. |
Dear @privefl , Thank you so much for such a prompt reply. The plots below are with K = 5. I looked at the raw p-values but am unsure if the p-value distributions are good enough for to apply FDR correction to them? I am getting a lot of outliers, especially with lower MAF filters. With the original dataset, about 21% of all the SNPs are identified as outliers with FDR < 0.05. Original dataset (approx. MAF > 0.006), 11704 total SNPs. FDR < 0.05 = 2486 outliers, FDR < 0.1 = 2939 outliers. MAF > 0.01, 10242 total SNPs. FDR < 0.05 = 1131 outliers, FDR < 0.1 = 1510 outliers. MAF > 0.05, 5648 total SNPs. FDR < 0.05 = 503 outliers, FDR < 0.1 = 671 outliers. You suggested that the scree plot suggest K = 5, but the PC plots actually suggest some population structure captured up to PC10 or so (see PC10 vs PC11 below). Population structure analyses using snmf (from LEA package) seems to suggest K = ~10. However, when I perform pcadapt with K = 10, but the p-value distribution becomes even more u-shaped (image below), even without GIF correction. I have also tried using MAF instead of individual genotype but that also resulted in an extremely u-shaped distribution, even without GIF correction (see image below). I used K = 3 here, as suggested from the screeplot but the u-shape is present for other K values as well. Yours sincerely, |
If your pops are perfectly seperated, then it is normal to get that many outliers; it is just too easy for some species. Other have reported the same kind of results here. |
Would you be able to send me your data? |
Dear @privefl , Thank you so much for such prompt reply. I have sent you my data and the codes I have used to [email protected]. I sent you both the genotype data (.bed files) and the MAF data (.lfmm file). Yours sincerely, |
I do not see any problem then. I guess it is just that your pops are too easily being differentiated. What I tried: |
Dear @privefl , Thank you so much for spending time on this and also providing the codes of what you have tried. Could I confirm my understanding of what PCAdapt does? I am understanding that PCAdapt calculates how much each variant is associated with population structure, and variants which are very highly associated to the population structure will appear as outliers (in the chi-squared distribution of the Mahalanobis distances). These outliers are assumed to be indicative of local adaptation. Is that correct? I wanted to confirm this to try to understand the large number of outliers. My dataset is highly structured and multiple (sub)populations have experienced very strong drift, and this would explain the large number of outliers. Would this warrant performing PCAdapt on these (sub)populations separately? Yours sincerely, |
Your understanding is right. You can always run pcadapt on two populations at once to see what variants differentiate them, but I don't think I've seen pcadapt used like that before. |
Dear all,
I was wondering how MAF filters affect the p-value distribution. I have RADseq SNP data of a species with very high population structure and I have used a rather low MAF filter (minor allele counts > 10 in a dataset of 886 diploid individuals, approx. MAF > 0.006) to retain as many SNPs as possible. In doing so, I retained 11704 SNPs.
I ran the initial PCA and decided to keep K = 6 (see scree plot, I could also argue for K = 10-12 from previous population structure studies).
I ran pcadapt with K = 6 and found the histogram of p-values to be U-shaped. I have previously observed this before as well.
Rerunning pcadapt with higher min.maf thresholds seem to fix this but reduced the number of SNPs greatly. I also do not want to remove SNPs with MAF <= 0.05 that may potentially be true outliers.
MAF > 0.01: 10242 SNPs
MAF > 0.05: 5648 SNPs
I tried adding NULL SNPs as suggested (https://github.com/bcm-uga/pcadapt/issues/56) but but most of my real SNPs have p-values close to zero.
50000 NULL SNPs added to original dataset (11704 SNPs):
50000 NULL SNPs added to dataset with only MAF > 0.05 (5646 SNPs):
What is the reason for this? I have also found a similar pattern when using a matrix of minor allele frequencies (type = "pool") (i.e., u-shaped p-value distribution when including low MAF, and a uniform distribution with a peak near zero when filtering for MAF > 0.05).
I have also tried different K-values, but it appears that the min.maf threshold is what determines the shape of the p-value distribution. Should the "null" SNPs be generated differently?
A
The text was updated successfully, but these errors were encountered: