Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAF filters and p-value distribution. #83

Open
akamolphat opened this issue Oct 19, 2023 · 8 comments
Open

MAF filters and p-value distribution. #83

akamolphat opened this issue Oct 19, 2023 · 8 comments

Comments

@akamolphat
Copy link

Dear all,

I was wondering how MAF filters affect the p-value distribution. I have RADseq SNP data of a species with very high population structure and I have used a rather low MAF filter (minor allele counts > 10 in a dataset of 886 diploid individuals, approx. MAF > 0.006) to retain as many SNPs as possible. In doing so, I retained 11704 SNPs.

I ran the initial PCA and decided to keep K = 6 (see scree plot, I could also argue for K = 10-12 from previous population structure studies).

9c14650e-6558-4325-b687-48417d0a7ca4

I ran pcadapt with K = 6 and found the histogram of p-values to be U-shaped. I have previously observed this before as well.

cb3b8a27-e605-4969-9103-3de3713e29b4

Rerunning pcadapt with higher min.maf thresholds seem to fix this but reduced the number of SNPs greatly. I also do not want to remove SNPs with MAF <= 0.05 that may potentially be true outliers.

MAF > 0.01: 10242 SNPs
fef2e93a-f0f5-43a2-a3a8-e480181a1872

MAF > 0.05: 5648 SNPs
d3d752a7-8efe-4c9f-95b0-85b399b3591a

I tried adding NULL SNPs as suggested (https://github.com/bcm-uga/pcadapt/issues/56) but but most of my real SNPs have p-values close to zero.

50000 NULL SNPs added to original dataset (11704 SNPs):
f7cc611e-1cdd-40f5-b78c-28899482f232

50000 NULL SNPs added to dataset with only MAF > 0.05 (5646 SNPs):
7ac47643-2ef1-4c68-95b2-986d3c9297a0

What is the reason for this? I have also found a similar pattern when using a matrix of minor allele frequencies (type = "pool") (i.e., u-shaped p-value distribution when including low MAF, and a uniform distribution with a peak near zero when filtering for MAF > 0.05).

I have also tried different K-values, but it appears that the min.maf threshold is what determines the shape of the p-value distribution. Should the "null" SNPs be generated differently?

A

@privefl
Copy link
Member

privefl commented Oct 19, 2023

The U-shape is totally expected since we do genomic control by default, which makes p-values conversative when there is some true signal (cf. https://doi.org/10.1038%2Fejhg.2011.39).
You can try to look at the raw p-values pchisq(res$stat, df = K, lower.tail = FALSE).

BTW, I would use K=5 from the scree plot you presented, but it would be also good to look at PC scores to see which PCs capture some pop structure.

@akamolphat
Copy link
Author

Dear @privefl ,

Thank you so much for such a prompt reply. The plots below are with K = 5.

I looked at the raw p-values but am unsure if the p-value distributions are good enough for to apply FDR correction to them? I am getting a lot of outliers, especially with lower MAF filters. With the original dataset, about 21% of all the SNPs are identified as outliers with FDR < 0.05.

Original dataset (approx. MAF > 0.006), 11704 total SNPs. FDR < 0.05 = 2486 outliers, FDR < 0.1 = 2939 outliers.
5eb8686d-88d6-4f4f-9734-cf7cd3d5e850

MAF > 0.01, 10242 total SNPs. FDR < 0.05 = 1131 outliers, FDR < 0.1 = 1510 outliers.
1fa03611-4e89-4d84-8c29-a3b0cda17306-1

MAF > 0.05, 5648 total SNPs. FDR < 0.05 = 503 outliers, FDR < 0.1 = 671 outliers.
f93e6367-5cfc-423b-955c-f1591e9c5957

You suggested that the scree plot suggest K = 5, but the PC plots actually suggest some population structure captured up to PC10 or so (see PC10 vs PC11 below). Population structure analyses using snmf (from LEA package) seems to suggest K = ~10.
PC10 vs PC11
ecc69d3b-9ae6-47b5-a4c3-704613362b65

However, when I perform pcadapt with K = 10, but the p-value distribution becomes even more u-shaped (image below), even without GIF correction.
a137cd93-d8c9-4cf3-98cd-b52f9a9c11af

I have also tried using MAF instead of individual genotype but that also resulted in an extremely u-shaped distribution, even without GIF correction (see image below). I used K = 3 here, as suggested from the screeplot but the u-shape is present for other K values as well.
9aaf4a70-48c0-4a1c-8c7b-19a83d7f5110-1

Yours sincerely,
A

@privefl
Copy link
Member

privefl commented Oct 20, 2023

If your pops are perfectly seperated, then it is normal to get that many outliers; it is just too easy for some species. Other have reported the same kind of results here.
But if you don't have many variants, and lots of them are outliers, I think it is always good to add e.g. 50K null variants.

@privefl
Copy link
Member

privefl commented Oct 20, 2023

Would you be able to send me your data?
I would like to try something.

@akamolphat
Copy link
Author

Dear @privefl ,

Thank you so much for such prompt reply.

I have sent you my data and the codes I have used to [email protected]. I sent you both the genotype data (.bed files) and the MAF data (.lfmm file).

Yours sincerely,
A

@privefl
Copy link
Member

privefl commented Oct 30, 2023

  • Adding fake variants actually increases the number of variants detected.
  • Strong clumping does not reduce the number of outlier variants detected by much.

I do not see any problem then. I guess it is just that your pops are too easily being differentiated.


What I tried:

@akamolphat
Copy link
Author

Dear @privefl ,

Thank you so much for spending time on this and also providing the codes of what you have tried. Could I confirm my understanding of what PCAdapt does?

I am understanding that PCAdapt calculates how much each variant is associated with population structure, and variants which are very highly associated to the population structure will appear as outliers (in the chi-squared distribution of the Mahalanobis distances). These outliers are assumed to be indicative of local adaptation. Is that correct?

I wanted to confirm this to try to understand the large number of outliers. My dataset is highly structured and multiple (sub)populations have experienced very strong drift, and this would explain the large number of outliers.

Would this warrant performing PCAdapt on these (sub)populations separately?

Yours sincerely,
A

@privefl
Copy link
Member

privefl commented Oct 31, 2023

Your understanding is right.

You can always run pcadapt on two populations at once to see what variants differentiate them, but I don't think I've seen pcadapt used like that before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants