MAF filters and p-value distribution. #83

akamolphat · 2023-10-19T02:40:24Z

Dear all,

I was wondering how MAF filters affect the p-value distribution. I have RADseq SNP data of a species with very high population structure and I have used a rather low MAF filter (minor allele counts > 10 in a dataset of 886 diploid individuals, approx. MAF > 0.006) to retain as many SNPs as possible. In doing so, I retained 11704 SNPs.
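As a quick check of the arithmetic above, the MAF implied by a minor allele count threshold can be worked out as below. This is only a sketch: the helper name `maf_threshold` is made up for illustration, and it assumes every individual is genotyped at the site (no missing data), so that 886 diploid individuals contribute 1772 allele copies.

```python
def maf_threshold(min_minor_allele_count: int, n_diploid: int) -> float:
    """MAF implied by a minor allele count threshold in diploid samples.

    Assumes no missing genotypes, so each individual contributes two allele copies.
    """
    n_allele_copies = 2 * n_diploid
    return min_minor_allele_count / n_allele_copies


threshold = maf_threshold(10, 886)
print(f"{threshold:.4f}")  # 10 / 1772, i.e. roughly the 0.006 quoted above
```

With missing genotypes the denominator shrinks per SNP, so the effective MAF cutoff would be slightly higher at those sites.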

I ran the initial PCA and decided to keep K = 6 (see scree plot, I could also argue for K = 10-12 from previous population structure studies).

I ran pcadapt with K = 6 and found the histogram of p-values to be U-shaped. I have observed this before as well.

Rerunning pcadapt with higher min.maf thresholds seems to fix this but greatly reduces the number of SNPs. I also do not want to remove SNPs with MAF <= 0.05 that may potentially be true outliers.

MAF > 0.01: 10242 SNPs

MAF > 0.05: 5648 SNPs

I tried adding NULL SNPs as suggested (https://github.com/bcm-uga/pcadapt/issues/56) but most of my real SNPs have p-values close to zero.

50000 NULL SNPs added to original dataset (11704 SNPs):

50000 NULL SNPs added to dataset with only MAF > 0.05 (5646 SNPs):

What is the reason for this? I have also found a similar pattern when using a matrix of minor allele frequencies (type = "pool") (i.e., u-shaped p-value distribution when including low MAF, and a uniform distribution with a peak near zero when filtering for MAF > 0.05).

I have also tried different K-values, but it appears that the min.maf threshold is what determines the shape of the p-value distribution. Should the "null" SNPs be generated differently?

A

privefl · 2023-10-19T06:39:41Z

The U-shape is totally expected since we do genomic control by default, which makes p-values conservative when there is some true signal (cf. https://doi.org/10.1038%2Fejhg.2011.39).
You can try to look at the raw p-values: pchisq(res$stat, df = K, lower.tail = FALSE).
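The effect described here can be illustrated with a small simulation. This is a toy sketch in Python, not pcadapt's code: it assumes the genomic inflation factor (GIF) is the median of the test statistics divided by the median of a chi-squared distribution with K degrees of freedom, that null statistics follow that chi-squared distribution exactly, and that "true signal" can be mimicked by scaling 10% of the statistics by an arbitrary factor of 5. The closed-form tail probability only works for even K, which is why K = 6 is used.

```python
import math
import random
import statistics


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-squared variable (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def chi2_median_even_df(df: int) -> float:
    """Median of the chi-squared distribution, by bisection on the tail probability."""
    lo, hi = 0.0, 10.0 * df
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf_even_df(mid, df) > 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


random.seed(1)
K = 6
n_null, n_signal = 9000, 1000


def null_stat() -> float:
    """One null statistic: a sum of K squared standard normals (chi-squared, K df)."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(K))


null_stats = [null_stat() for _ in range(n_null)]
signal_stats = [5.0 * null_stat() for _ in range(n_signal)]  # toy "true signal"
stats = null_stats + signal_stats

# Genomic control: rescale every statistic by the inflation factor.
gif = statistics.median(stats) / chi2_median_even_df(K)
raw_p_null = [chi2_sf_even_df(s, K) for s in null_stats]
gc_p_null = [chi2_sf_even_df(s / gif, K) for s in null_stats]

print(f"GIF = {gif:.2f}")
print(f"null p > 0.9, raw: {sum(p > 0.9 for p in raw_p_null) / n_null:.3f}")
print(f"null p > 0.9, GC:  {sum(p > 0.9 for p in gc_p_null) / n_null:.3f}")
```

Because the signal pulls the median up, the GIF exceeds 1 and every null p-value is pushed toward 1 after correction. Together with the true signal piling up near 0, that gives a histogram with two peaks, which is one way a U-shape can arise.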

BTW, I would use K=5 from the scree plot you presented, but it would be also good to look at PC scores to see which PCs capture some pop structure.

akamolphat · 2023-10-19T23:54:54Z

Dear @privefl ,

Thank you so much for such a prompt reply. The plots below are with K = 5.

I looked at the raw p-values but am unsure whether the p-value distributions are good enough to apply FDR correction to them. I am getting a lot of outliers, especially with lower MAF filters. With the original dataset, about 21% of all the SNPs are identified as outliers with FDR < 0.05.

Original dataset (approx. MAF > 0.006), 11704 total SNPs. FDR < 0.05 = 2486 outliers, FDR < 0.1 = 2939 outliers.

MAF > 0.01, 10242 total SNPs. FDR < 0.05 = 1131 outliers, FDR < 0.1 = 1510 outliers.

MAF > 0.05, 5648 total SNPs. FDR < 0.05 = 503 outliers, FDR < 0.1 = 671 outliers.
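For reference, the outlier fractions behind these counts, and one common way of turning p-values into an FDR-controlled outlier list, can be sketched as follows. The thread does not say which FDR method was used, so the Benjamini-Hochberg procedure here is an assumption, and the function name and toy p-values are purely illustrative.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices of tests rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    n_reject = 0
    # Step-up rule: find the largest rank whose p-value is under its threshold,
    # then reject everything up to and including that rank.
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            n_reject = rank
    return sorted(order[:n_reject])


# Toy p-values, not taken from the dataset discussed here.
toy = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
print(benjamini_hochberg(toy, alpha=0.05))  # [0, 1]

# Outlier fractions implied by the counts reported above (FDR < 0.05).
reported = {
    "MAF > 0.006": (2486, 11704),
    "MAF > 0.01": (1131, 10242),
    "MAF > 0.05": (503, 5648),
}
for label, (n_outliers, n_snps) in reported.items():
    print(f"{label}: {100 * n_outliers / n_snps:.1f}% outliers")
```

The reported counts correspond to roughly 21.2%, 11.0% and 8.9% of SNPs, so the outlier fraction drops steadily as the MAF filter becomes stricter.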

You suggested that the scree plot suggests K = 5, but the PC plots actually suggest some population structure captured up to PC10 or so (see PC10 vs PC11 below). Population structure analyses using snmf (from the LEA package) seem to suggest K = ~10.
PC10 vs PC11

However, when I run pcadapt with K = 10, the p-value distribution becomes even more U-shaped (image below), even without GIF correction.

I have also tried using MAF instead of individual genotypes, but that also resulted in an extremely U-shaped distribution, even without GIF correction (see image below). I used K = 3 here, as suggested by the scree plot, but the U-shape is present for other K values as well.

Yours sincerely,
A

privefl · 2023-10-20T07:06:42Z

If your pops are perfectly separated, then it is normal to get that many outliers; it is just too easy for some species. Others have reported the same kind of results here.
But if you don't have many variants, and lots of them are outliers, I think it is always good to add e.g. 50K null variants.

privefl · 2023-10-20T21:54:32Z

Would you be able to send me your data?
I would like to try something.

akamolphat · 2023-10-24T04:40:57Z

Dear @privefl ,

Thank you so much for such a prompt reply.

I have sent my data and the code I used to [email protected]. I sent you both the genotype data (.bed files) and the MAF data (.lfmm file).

Yours sincerely,
A

privefl · 2023-10-30T09:13:49Z

Adding fake variants actually increases the number of variants detected.
Strong clumping does not reduce the number of outlier variants detected by much.

I do not see any problem then. I guess it is just that your pops are too easily differentiated.

What I tried:

akamolphat · 2023-10-30T22:58:23Z

Dear @privefl ,

Thank you so much for spending time on this and also for providing the code for what you tried. Could I confirm my understanding of what PCAdapt does?

My understanding is that PCAdapt calculates how strongly each variant is associated with population structure, and variants that are very highly associated with the population structure appear as outliers (in the chi-squared distribution of the Mahalanobis distances). These outliers are assumed to be indicative of local adaptation. Is that correct?

I wanted to confirm this to try to understand the large number of outliers. My dataset is highly structured and multiple (sub)populations have experienced very strong drift, and this would explain the large number of outliers.

Would this warrant performing PCAdapt on these (sub)populations separately?

Yours sincerely,
A

privefl · 2023-10-31T06:55:08Z

Your understanding is right.

You can always run pcadapt on two populations at once to see what variants differentiate them, but I don't think I've seen pcadapt used like that before.
