Another thing that I just realised: when PCA is normally calculated, it doesn't use the absolute genetic distance between samples, but the relative distance within each marginal tree (i.e. normalised by the total height of the tree). Similarly to the LD pruning rationale, this means that very deep branches, which collect many linked mutations, are downweighted relative to branches in neighbouring trees that might have more recent MRCAs. In other words, the standard PCA approach deliberately downweights long branches. I surmise that standard PCA plots are therefore not designed to reflect actual genetic distance, but something more akin to the "topological distance" between samples. There is probably a whole host of branch weighting schemes that would produce slightly different principal components. I'm not sure which would provide the greatest separation between distinct populations. If people haven't looked at this, it would make a great paper, I think. @astheeggeggs suggested that Alex Blumenthal might be a good person to chat to some time.
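To make the absolute-versus-relative distinction concrete, here is a minimal sketch. It assumes a made-up toy genealogy of two marginal trees over three samples, each described only by its genomic span and a matrix of pairwise TMRCAs. The `trees` data and the `branch_distance` helper are hypothetical illustrations, not tskit API.

```python
import numpy as np

# Hypothetical toy genealogy: two marginal trees over 3 samples.
# Each entry is (genomic span, matrix of pairwise TMRCAs).
trees = [
    # Deep tree: samples 0 and 1 coalesce at time 1, sample 2 joins at time 10.
    (1.0, np.array([[0, 1, 10], [1, 0, 10], [10, 10, 0]], dtype=float)),
    # Shallow tree: samples 0 and 2 coalesce at time 1, sample 1 joins at time 2.
    (1.0, np.array([[0, 2, 1], [2, 0, 2], [1, 2, 0]], dtype=float)),
]

def branch_distance(trees, normalise=False):
    """Span-weighted average branch distance between every pair of samples.

    With normalise=True, each tree's distances are divided by that tree's
    height (the largest pairwise TMRCA, i.e. the age of the root).
    """
    n = trees[0][1].shape[0]
    D = np.zeros((n, n))
    total_span = 0.0
    for span, tmrca in trees:
        scale = tmrca.max() if normalise else 1.0
        # The path between two samples goes up to their MRCA and back down.
        D += span * 2 * tmrca / scale
        total_span += span
    return D / total_span

D_abs = branch_distance(trees)                  # absolute branch distance
D_rel = branch_distance(trees, normalise=True)  # relative to tree height

print(D_abs[0, 1], D_abs[0, 2])
print(D_rel[0, 1], D_rel[0, 2])
```

In this toy case the absolute distances put sample 0 much closer to sample 1 than to sample 2, because the deep tree dominates. After normalising by tree height the two distances are nearly equal, because the deep tree no longer outweighs the shallow one.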
Background: @brieuclehmann has been working on getting efficient PCA calculation into tskit, including a branch-length version (e.g. see #1743). Moreover, there might be an even more efficient approach that doesn't require explicit calculation of the genetic relatedness matrix, or GRM (#275). This is another really nice case illustrating the duality between branch-length stats and site stats.
The two classic papers in the field are Patterson, Price, & Reich and McVean. The first of these recommends "LD pruning", with the following explanation
Their solution is to remove some of the sites in LD (and also, I gather from @astheeggeggs, to make sure that e.g. the chosen samples are not close relatives, which will share extensive LD). See e.g. http://alimanfoo.github.io/2015/09/28/fast-pca.html. This is somewhat equivalent to skipping along the genome and picking "independent" trees.
However, when calculating the LD from the genealogy (rather than from the sites), we know exactly where the non-independence between sites comes from: LD exists where mutations share an edge. So if we have the trees, we should be able to do better than a naive LD pruning approach. I think this comes down to weighting. In particular, I think the LD pruning approach is equivalent to reweighting the contributions of edges to the pairwise distances, so that an edge carrying many linked mutations no longer counts once per mutation.
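As a rough illustration of the weighting idea, the sketch below builds pairwise distances from a made-up list of edges, each recorded as the set of samples beneath it and the number of mutations it carries. The `edges` data, the `pairwise_distance` helper, and the choice of "one site per edge" as a stand-in for idealised LD pruning are all assumptions made for illustration, not an established method.

```python
import numpy as np

# Hypothetical toy data: (set of samples below the edge, mutations on the edge).
edges = [
    ({0, 1}, 12),  # deep edge carrying many mutations in perfect LD
    ({2, 3}, 1),
    ({0}, 2),
    ({3}, 1),
]
n_samples = 4

def pairwise_distance(edges, n, weight):
    """Sum weight(n_muts) over every edge that separates a pair of samples."""
    D = np.zeros((n, n))
    for below, n_muts in edges:
        x = np.array([i in below for i in range(n)], dtype=float)
        # 1 where exactly one of the two samples sits below this edge.
        separates = np.abs(x[:, None] - x[None, :])
        D += weight(n_muts) * separates
    return D

# Site-based distance: every mutation counts, so linked mutations pile up.
D_sites = pairwise_distance(edges, n_samples, weight=lambda m: m)
# Idealised pruning: keep at most one site per edge.
D_pruned = pairwise_distance(edges, n_samples, weight=lambda m: min(m, 1))

print(D_sites[0, 2], D_pruned[0, 2])
```

Any other function of the edge (its length, its age, its genomic span) could be passed as `weight` in the same way, which is one way to compare schemes side by side.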
It should be easy to demonstrate the effect of different weighting schemes, I guess. There is also the suggestion that each site's contribution should be weighted by $1/(p(1-p))$, where $p$ is the allele frequency at that site (Patterson, Price, & Reich, eq 3). I think there should be some genealogical justification of this too (weighting by some function of the time, or the depth of the branch), but I haven't thought this through carefully.
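For reference, the site-based version of this weighting can be sketched as follows: centre each site, divide by $\sqrt{p(1-p)}$ so that its contribution to the sample covariance is scaled by $1/(p(1-p))$, then eigendecompose. The `normalised_pca` function, its `ploidy` argument, and the simulated two-population genotypes are assumptions made for this example, not code from the paper or from tskit.

```python
import numpy as np

def normalised_pca(G, ploidy=1, n_components=2):
    """PCA on a (samples x sites) genotype matrix with frequency scaling."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / ploidy           # allele frequency at each site
    keep = (p > 0) & (p < 1)              # monomorphic sites carry no signal
    G, p = G[:, keep], p[keep]
    M = (G - G.mean(axis=0)) / np.sqrt(p * (1 - p))
    X = M @ M.T / M.shape[1]              # sample-by-sample covariance
    evals, evecs = np.linalg.eigh(X)      # eigh returns ascending order
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], evecs[:, order]

# Made-up haploid data: 10 samples from each of two strongly diverged populations.
rng = np.random.default_rng(1)
freqs = np.repeat([[0.1], [0.9]], 10, axis=0)   # per-sample allele frequency
G = rng.binomial(1, freqs, size=(20, 200))

evals, pcs = normalised_pca(G)
print(pcs[:, 0])  # first principal component for the 20 samples
```

With frequencies this far apart, the first component simply splits the two groups. The more interesting comparison would be to swap the $\sqrt{p(1-p)}$ scaling for a weight derived from branch age or depth.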
I think there is considerable scope for exploring what these different PCA adjustments mean in terms of the genealogy, which could lead to more principled PCA analyses, given a simulated or inferred genealogy. This discussion thread is to get the ball rolling.