PC Relate #133
Conversation
return a.T.dot(a)

def pc_relate(ds: xr.Dataset, maf: float = 0.01) -> xr.Dataset:
@eric-czech please ignore the test code for now; the current test is there to validate results against the reference implementation (but I need to make it smaller, etc.). I'm curious to hear your opinion about the implementation itself, though: do you have any suggestions about the xr/dask usage?
Had a few minor suggestions, but looks good to me!
Some array shape checks would definitely help in thinking about the scalability though.
sgkit/stats/pc_relate.py
Outdated
# impute with variant mean
variant_mean = (
    call_g.where(~call_g_mask)
    .mean(dim="samples")
I'm not sure why it isn't documented, but keepdims is part of the reduce function API (pydata/xarray#3033), so you could just do .mean("samples", keepdims=True).
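A minimal sketch of that keepdims behaviour (the array values here are made up for illustration):

```python
import numpy as np
import xarray as xr

# A 2-D genotype-like array: 2 variants x 3 samples.
call_g = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("variants", "samples"),
)
# keepdims=True retains the reduced "samples" dimension with length 1,
# so no follow-up expand_dims is needed before broadcasting.
variant_mean = call_g.mean(dim="samples", keepdims=True)
print(variant_mean.shape)  # (2, 1)
```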
@eric-czech thanks for pointing this out. I can actually make this even simpler with keepdims plus xr's where (it turns out xr/da broadcasting differs a bit in some cases, for example in where; specifically: pydata/xarray#2171).
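The broadcasting difference referenced above comes down to xr.where aligning arrays by dimension name rather than by position; a small sketch with made-up values:

```python
import numpy as np
import xarray as xr

g = xr.DataArray(
    [[0.0, 1.0, np.nan], [2.0, np.nan, 2.0]],
    dims=("variants", "samples"),
)
# A 1-D per-variant mean (dims: "variants" only).
variant_mean = g.mean(dim="samples", skipna=True)
# xr.where broadcasts by dimension *name*, so the 1-D mean lines up with
# the 2-D array without manual expand_dims/transpose; positional numpy/dask
# broadcasting (as in da.where) would instead try to align trailing axes
# and fail here.
imputed = xr.where(g.isnull(), variant_mean, g)
```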
sgkit/stats/pc_relate.py
Outdated
    .mean(dim="samples")
    .expand_dims(dim="samples", axis=1)
)
imputed_g = da.where(call_g_mask, variant_mean, call_g)
Not that it matters, but I'm curious if there was any reason not to use xr.where here?
Actually, I think it could matter, since getting an imputed genotype count should probably be its own function. I've been assuming this would be applied externally to these kinds of methods.
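Such a standalone imputation helper might look like the following (the function name and signature are hypothetical, not part of the PR):

```python
import xarray as xr

def impute_with_variant_mean(
    call_g: xr.DataArray, call_g_mask: xr.DataArray
) -> xr.DataArray:
    # Mask out missing entries, average the rest per variant, then fill
    # the missing entries with that per-variant mean.
    variant_mean = call_g.where(~call_g_mask).mean(dim="samples")
    return xr.where(call_g_mask, variant_mean, call_g)

g = xr.DataArray([[0.0, 2.0], [1.0, -1.0]], dims=("variants", "samples"))
mask = g < 0  # -1 encodes a missing call
imputed = impute_with_variant_mean(g, mask)
```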
sgkit/stats/pc_relate.py
Outdated
# phi, eq: 4
mask = (mu <= maf) | (mu >= 1.0 - maf) | call_g_mask
mu_mask = da.ma.masked_array(mu, mask=mask)
variance = mu_mask.map_blocks(lambda i: i * (1.0 - i))
Can this not be mu_mask * (1.0 - mu_mask)?
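The suggestion works because masked-array arithmetic is already elementwise and mask-preserving; a sketch with numpy masked arrays (dask's da.ma wrappers are assumed to behave analogously):

```python
import numpy as np

# Allele frequencies with one entry masked out (e.g. failing the MAF filter).
mu_mask = np.ma.masked_array([0.1, 0.5, 0.9], mask=[False, True, False])
# Elementwise arithmetic on masked arrays preserves the mask, so the
# variance term needs no map_blocks indirection.
variance = mu_mask * (1.0 - mu_mask)
```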
stddev = da.sqrt(variance)
centered_af = call_g / 2 - mu_mask
centered_af = da.ma.filled(centered_af, fill_value=0.0)
# NOTE: gramian could be a performance bottleneck, and we could explore
I think some shape assertions along the way could be really helpful.
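A sketch of what such shape assertions could look like, using plain numpy arrays with made-up dimensions:

```python
import numpy as np

n_variants, n_samples = 100, 30
call_g = np.zeros((n_variants, n_samples))
variant_mean = call_g.mean(axis=1, keepdims=True)
# Cheap shape assertions document the expected (variants, samples) layout
# and fail fast if a reduction or broadcast flips the orientation.
assert call_g.shape == (n_variants, n_samples)
assert variant_mean.shape == (n_variants, 1)
centered = call_g / 2 - variant_mean
assert centered.shape == (n_variants, n_samples)
```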
sgkit/testing.py
Outdated
    replace=False,
    size=int(call_genotype.size * missing_pct),
)
call_genotype[np.unravel_index(indices, call_genotype.shape)] = -1
choice always seems very slow compared to the other functions, and it's probably a bit easier to follow as:
np.where(rs.rand(*call_genotype.shape) < missing_pct, -1, call_genotype)
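A runnable sketch of the suggested vectorised approach (sizes are illustrative):

```python
import numpy as np

rs = np.random.RandomState(0)
call_genotype = rs.randint(0, 3, size=(50, 20))
missing_pct = 0.1
# Each entry is independently set to -1 (missing) with probability
# missing_pct; note this hits the target fraction only in expectation,
# unlike rs.choice(..., replace=False), which gives an exact count.
masked = np.where(rs.rand(*call_genotype.shape) < missing_pct, -1, call_genotype)
```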
@@ -14,6 +14,7 @@ def simulate_genotype_call_dataset(
    n_allele: int = 2,
    n_contig: int = 1,
    seed: Optional[int] = None,
    missing_pct: Optional[float] = None,
😁
# Note: dask qr decomp requires no chunking in one dimension, and because number of
# components should be smaller than number of samples in most cases, we disable
# chunking on number components
pcsi = pcsi.T.rechunk((None, -1))
👍
# chunking on number components
pcsi = pcsi.T.rechunk((None, -1))

q, r = da.linalg.qr(pcsi)
Remind me again how many PCs are typically provided? If we need to, there's probably a way to do this without the unit-chunking requirement, but I don't remember if the number of PCs is ever big enough to matter.
I believe the number of PCs should be relatively low, from UKBB:
Principal components should ideally be computed using a subset of high quality, unrelated samples. However, the metrics used to find related samples, as well as poorer quality samples themselves require information about population structure (see Sections S 3.5 and S 3.7). We therefore conducted an initial round of PCA, computing just the top 8 PCs, using a set of unrelated samples based on an initial round of kinship estimation. We used the results of this analysis to compute PC adjusted heterozygosity as well as refine the relatedness inference. Having then identified a set of high quality, unrelated samples, we conducted a second round of PCA, computing the first 40 PCs. Results of the second round are made available to researchers and visualised in Figure 3A, Figure S6-S7, Extended Data Figure 3, and discussed in the main text.
Further, for PC-Relate, as far as I understand, the number can be limited to only the components that best capture the population structure; from Conomos, 2016, “Model-Free Estimation of Recent Genetic Relatedness”:
The appropriate number of PCs that should be used to adjust for population structure in a PC-Relate analysis will depend on the sample structure. A reasonable set of PCs can often be selected by examining scatter plots and parallel coordinates plots of the top PCs to identify which ones appear to reflect population structure, and by examining a scree plot of the eigenvalues to identify a point of separation between PCs that explain a significant proportion of the total variation in the data and those that explain little variation. However, making the appropriate choice can be challenging, so we investigated the sensitivity of relatedness inference with PC-Relate to the number of PCs used in the analysis. Consider population structure II, where there are only two dimensions of population structure (i.e., ancestry contributed by three subpopulations). Figure S12 displays the kinship coefficient estimates from PC-Relate using varying numbers of PCs. Relatedness estimates with PC-Relate were nearly identical when using the top 2, 5, 10, or 20 PCs. However, including the top 100 PCs, which is 50 times more PCs than are required to explain the population structure in the sample, resulted in a substantial increase in variability. In this particular setting, choosing the number of PCs to be within a factor of 10 of the appropriate number allowed for accurate relatedness inference with PC-Relate. These results suggest that PC-Relate is quite robust to the choice of PCs, provided that there are a sufficient number of PCs included in the relatedness analysis to fully capture the population structure in the sample.
Ah perfect, no need to worry about that then! Thanks for pulling those references.
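For reference, a small sketch of the tall-and-skinny constraint discussed above (the array sizes are made up):

```python
import dask.array as da
import numpy as np

# da.linalg.qr implements tall-and-skinny QR: the input may be chunked
# along rows but must have a single chunk spanning all columns.
pcs = da.random.random((1000, 8), chunks=(250, 8))
pcs = pcs.rechunk((None, -1))  # -1: one chunk across the column dimension
q, r = da.linalg.qr(pcs)
```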
Force-pushed from d47843f to c2bf1cd
Re: #24