PCA from the genealogical PoV, and LD pruning #2775

hyanwong · 2023-07-05T15:06:33Z

hyanwong
Jul 5, 2023
Collaborator

Background: @brieuclehmann has been working on getting efficient PCA calculation into tskit, including a branch-length version (e.g. see #1743). Moreover, there might be an even more efficient approach that doesn't require explicit calculation of the GRM (#275). This is another really nice case illustrating the duality between branch-length stats and site stats.

The two classic papers in the field are Patterson, Price, & Reich and McVean. The first of these recommends "LD pruning", with the following explanation:

Correcting for LD
The theory above works well if the markers are independent (that is have no LD), but in practice, and especially with the large genotype arrays that are beginning to be available, this is difficult to ensure. In extreme cases uncorrected LD will seriously distort the eigenvector/eigenvalue structure, making results difficult to interpret. Suppose, for example, that there is a large “block” [35,36] in which markers are in complete LD, and we have genotyped many markers in the block. A large eigenvector of our Wishart matrix X will tend to correlate with the genotype pattern in the block (all markers producing the same pattern). This will distort the eigenvector structure and also the distribution of eigenvalues.

Their solution is to remove some of the sites in LD (and also, I gather from @astheeggeggs, to make sure that e.g. the chosen samples are not close relatives, which will share extensive LD). See e.g. http://alimanfoo.github.io/2015/09/28/fast-pca.html. This is somewhat equivalent to skipping along the genome and picking "independent" trees.
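For reference, the naive site-based pruning being discussed can be sketched roughly as follows. This is a minimal pure-Python illustration, not the scikit-allel implementation from the linked post: the function names `r_squared` and `ld_prune`, the window of 50 sites, and the 0.2 threshold are all assumptions chosen for the toy example.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic site: treat as uncorrelated
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, window=50, threshold=0.2):
    """Greedy pruning over a sites x samples matrix (hypothetical helper).

    A site is kept only if its r^2 with every already-kept site among
    the previous `window` sites is below `threshold`.
    """
    kept = []
    for j, site in enumerate(genotypes):
        recent = [k for k in kept if j - k <= window]
        if all(r_squared(site, genotypes[k]) < threshold for k in recent):
            kept.append(j)
    return kept


# Toy data: rows are sites, columns are diploid samples (0/1/2 calls).
genotypes = [
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 1, 2, 2],  # complete LD with site 0
    [2, 0, 1, 0, 2, 1],
    [0, 0, 1, 1, 2, 2],  # complete LD with site 0
]
kept = ld_prune(genotypes)
```

On this toy matrix, sites 1 and 3 duplicate site 0 and are dropped, leaving sites 0 and 2. Note that the pruning only sees the genotype patterns; it has no notion of which mutations sit on the same edge of a tree.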

However, when calculating the LD from the genealogy (rather than from the sites), we know the source of the non-independence between sites: LD exists where mutations share an edge. So if we have the trees, we should be able to do better than a naive LD pruning approach. I think this comes down to weighting. In particular, I suspect the LD pruning approach is roughly equivalent to reweighting the contributions of edges to the pairwise distances, so that an edge carrying many linked mutations is not counted many times over.

It should be easy to demonstrate the effect of different weighting schemes, I guess. There is also the suggestion that each marker's contribution to the covariance should be weighted by $1/(p(1-p))$, i.e. that centred genotypes are divided by $\sqrt{p(1-p)}$ (Patterson, Price, & Reich, eq 3). I think there should be some genealogical justification of this too (weighting by some function of the time, or the depth of the branch), but I haven't thought this through particularly.
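To make the allele-frequency weighting concrete, here is a small sketch of a site-based covariance matrix with and without that normalisation. It follows my reading of the Patterson et al. normalisation (centre each marker, divide by $\sqrt{p(1-p)}$, average over markers); the function name `site_grm` and the toy genotypes are assumptions for illustration, and this is not how tskit computes relatedness.

```python
def site_grm(genotypes, normalise=True):
    """Sample x sample covariance from a sites x samples diploid matrix.

    Each site is centred by its mean. If `normalise` is True, the centred
    values are also divided by sqrt(p(1-p)), so the site's contribution
    to the covariance is weighted by 1/(p(1-p)).
    """
    n = len(genotypes[0])
    grm = [[0.0] * n for _ in range(n)]
    used = 0
    for site in genotypes:
        mu = sum(site) / n
        p = mu / 2  # allele frequency for diploid 0/1/2 calls
        if p <= 0 or p >= 1:
            continue  # skip monomorphic sites
        scale = (p * (1 - p)) ** 0.5 if normalise else 1.0
        z = [(g - mu) / scale for g in site]
        for a in range(n):
            for b in range(n):
                grm[a][b] += z[a] * z[b]
        used += 1
    if used == 0:
        raise ValueError("no polymorphic sites")
    return [[v / used for v in row] for row in grm]


# One rare variant (site 0) and one common variant (site 1), 4 samples.
genotypes = [
    [1, 0, 0, 0],
    [2, 2, 0, 0],
]
raw = site_grm(genotypes, normalise=False)
weighted = site_grm(genotypes, normalise=True)
```

In the unnormalised matrix the common variant dominates the self-covariance of sample 0; after normalisation the rare variant contributes more than the common one. Since rare variants tend to sit on recent branches, this is one place where a genealogical interpretation in terms of branch age might come from.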

I think there is considerable scope for exploring what these different PCA adjustments mean in terms of the genealogy, which could lead to more principled PCA analyses, given a simulated or inferred genealogy. This discussion thread is to get the ball rolling.

hyanwong · 2023-07-05T18:35:37Z

hyanwong
Jul 5, 2023
Collaborator Author

Another thing that I just realised: when PCA is normally calculated it doesn't use absolute genetic distance between samples, but relative distance within each marginal tree (i.e. normalised by the total height of the tree). Similarly to the LD pruning rationale, this means that very deep branches, which collect many linked mutations, are downweighted relative to neighbouring trees that might have more recent MRCAs. In other words, the standard PCA approach deliberately downweights long branches.
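A toy example of what normalising by tree height would do, under the framing above. Each marginal tree is represented by hand as a genomic span plus a matrix of pairwise TMRCAs; the function name `pairwise_distance` and the numbers are made up for illustration, and this is not a claim about what any particular PCA implementation computes.

```python
def pairwise_distance(trees, normalise=False):
    """Span-weighted mean pairwise TMRCA across marginal trees.

    `trees` is a list of (span, tmrca) pairs, where tmrca[a][b] is the
    time to the MRCA of samples a and b in that tree. If `normalise`
    is True, each tree's times are divided by that tree's height.
    """
    n = len(trees[0][1])
    total_span = sum(span for span, _ in trees)
    dist = [[0.0] * n for _ in range(n)]
    for span, tmrca in trees:
        height = max(max(row) for row in tmrca)  # root age of this tree
        scale = height if normalise else 1.0
        for a in range(n):
            for b in range(n):
                dist[a][b] += span * tmrca[a][b] / scale
    return [[d / total_span for d in row] for row in dist]


# Tree A: shallow, topology ((0,1),2). Tree B: 10x deeper, topology ((1,2),0).
tree_a = (1.0, [[0, 1, 2], [1, 0, 2], [2, 2, 0]])
tree_b = (1.0, [[0, 20, 20], [20, 0, 10], [20, 10, 0]])

absolute = pairwise_distance([tree_a, tree_b])
relative = pairwise_distance([tree_a, tree_b], normalise=True)
```

With absolute times the deep tree dominates, so samples 1 and 2 look closest (6.0, against 10.5 for samples 0 and 1). After normalising by tree height the two topologies contribute equally, and the pairs (0,1) and (1,2) come out at the same distance of 0.75.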

I surmise that the standard PCA plots therefore are not designed to reflect actual genetic distance, but something more akin to the "topological distance" between samples. There are probably a whole host of branch weighting schemes that would produce slightly different PCA components. I'm not sure which would provide the greatest separation between distinct populations. If people haven't looked at this, it would make a great paper, I think. @astheeggeggs suggested that Alex Blumenthal might be a good person to chat to some time.

3 replies

astheeggeggs Jul 13, 2023
Collaborator

Here's a useful paper to take a look at:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2912642/

hyanwong Jul 13, 2023
Collaborator Author

Thanks Duncan. Understanding what LD pruning is doing from a tree-sequence ("branch length") PoV would be useful, I think.

petrelharp Jul 19, 2023
Maintainer

One thing LD pruning does is something like measuring distance along the genome in units of map length (i.e., recombination distance) instead of physical distance.
