
PC Relate #133 (Closed)

ravwojdyla wants to merge 1 commit.

Conversation

**ravwojdyla (Collaborator)** commented on Aug 22, 2020:

Re: #24

```python
    return a.T.dot(a)


def pc_relate(ds: xr.Dataset, maf: float = 0.01) -> xr.Dataset:
```
**ravwojdyla (Collaborator, Author):**

@eric-czech please ignore the test code for now; the current test is there to validate results against the reference implementation (I still need to make it smaller, etc.). But I'm curious to hear your opinion about the implementation itself: do you have any suggestions about the xr/dask use?

**Collaborator:**

Had a few minor suggestions, but looks good to me!

Some array shape checks would definitely help in thinking about the scalability though.

```python
# impute with variant mean
variant_mean = (
    call_g.where(~call_g_mask)
    .mean(dim="samples")
```
**Collaborator:**

I'm not sure why it isn't documented, but `keepdims` is part of the reduce function API (pydata/xarray#3033), so you could just do `.mean("samples", keepdims=True)`.
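As a quick sanity check of that equivalence, a toy sketch (the `call_g` here is a small stand-in, not the PR's array):

```python
import numpy as np
import xarray as xr

# toy stand-in for the PR's call_g: (variants, samples)
call_g = xr.DataArray(
    np.array([[0.0, 1.0, 2.0], [2.0, 2.0, 0.0]]),
    dims=("variants", "samples"),
)

# the expand_dims form used in the PR ...
a = call_g.mean(dim="samples").expand_dims(dim="samples", axis=1)
# ... and the keepdims form suggested above should agree:
b = call_g.mean("samples", keepdims=True)
assert a.dims == b.dims == ("variants", "samples")
assert bool((a == b).all())
```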

**ravwojdyla (Collaborator, Author)** replied on Aug 24, 2020:

@eric-czech thanks for pointing this out. I can actually make this even simpler with xr's `where` (it turns out xr/da broadcasting differs a bit in some cases, for example in `where`; specifically: pydata/xarray#2171).
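A minimal sketch of the broadcasting difference being referenced (toy shapes, hypothetical names): `xr.where` aligns its arguments by dimension name, so a per-variant array broadcasts correctly against a (variants, samples) array without an explicit `expand_dims`.

```python
import numpy as np
import xarray as xr

mask = xr.DataArray(
    np.array([[True, False], [False, True]]), dims=("variants", "samples")
)
data = xr.DataArray(np.zeros((2, 2)), dims=("variants", "samples"))
# per-variant fill values, 1-d over "variants" only
fill = xr.DataArray(np.array([0.5, 1.5]), dims=("variants",))

# xr.where broadcasts by dimension name, so no expand_dims is needed:
out = xr.where(mask, fill, data)
assert out.dims == ("variants", "samples")

# plain np.where would instead broadcast the (2,) array against the
# *last* axis ("samples"), which is the wrong alignment here.
```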

```python
    .mean(dim="samples")
    .expand_dims(dim="samples", axis=1)
)
imputed_g = da.where(call_g_mask, variant_mean, call_g)
```
**Collaborator:**

Not that it matters, but I'm curious if there was any reason not to use `xr.where` here?

**Collaborator:**

Actually, I think it could matter, since getting an imputed genotype count should probably be its own function. I've been assuming this would be applied externally to these kinds of methods.
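For illustration, a standalone version might look like this (hypothetical name and signature, using the `xr.where` variant discussed above):

```python
import xarray as xr

def impute_with_variant_mean(
    call_g: xr.DataArray, call_g_mask: xr.DataArray
) -> xr.DataArray:
    """Hypothetical helper: replace masked genotype calls with the
    per-variant mean over the unmasked samples."""
    variant_mean = call_g.where(~call_g_mask).mean(dim="samples")
    # xr.where broadcasts variant_mean against call_g by dimension name
    return xr.where(call_g_mask, variant_mean, call_g)
```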

```python
# phi, eq: 4
mask = (mu <= maf) | (mu >= 1.0 - maf) | call_g_mask
mu_mask = da.ma.masked_array(mu, mask=mask)
variance = mu_mask.map_blocks(lambda i: i * (1.0 - i))
```
**Collaborator:**

Can this not be `mu_mask * (1.0 - mu_mask)`?
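A quick check of that suggestion on toy data, assuming elementwise ops on dask masked arrays propagate the mask per block (which is how numpy.ma behaves within each block):

```python
import dask.array as da
import numpy as np

mu = da.from_array(np.array([0.1, 0.5, 0.9]), chunks=2)
bad = da.from_array(np.array([False, True, False]), chunks=2)
mu_mask = da.ma.masked_array(mu, mask=bad)

# elementwise arithmetic keeps the mask, matching the map_blocks version:
variance = mu_mask * (1.0 - mu_mask)
print(da.ma.filled(variance, fill_value=np.nan).compute())  # [0.09 nan 0.09]
```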

```python
stddev = da.sqrt(variance)
centered_af = call_g / 2 - mu_mask
centered_af = da.ma.filled(centered_af, fill_value=0.0)
# NOTE: gramian could be a performance bottleneck, and we could explore
```
**Collaborator:**

I think some shape assertions along the way could be really helpful to spot why this is.
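A self-contained sketch of the kind of check being suggested (toy sizes; the `gramian` helper here is reconstructed from the `a.T.dot(a)` snippet quoted at the top):

```python
import dask.array as da

def gramian(a):  # hypothetical helper matching the a.T.dot(a) shown earlier
    return a.T.dot(a)

n_variants, n_samples = 1_000, 50
centered_af = da.random.random((n_variants, n_samples), chunks=(200, 50))

# shape checks like these make the scaling visible at a glance:
assert centered_af.shape == (n_variants, n_samples)
# the gramian contracts over variants, leaving samples x samples,
# which is where the cost concentrates as n_samples grows:
assert gramian(centered_af).shape == (n_samples, n_samples)
```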

sgkit/testing.py (outdated):

```python
    replace=False,
    size=int(call_genotype.size * missing_pct),
)
call_genotype[np.unravel_index(indices, call_genotype.shape)] = -1
```
**Collaborator:**

`choice` always seems very slow compared to the other functions, and it's probably a bit easier to follow as:

```python
np.where(rs.rand(*call_genotype.shape) < missing_pct, -1, call_genotype)
```
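A self-contained version of that suggestion (toy shapes). One design note: the `choice`-based code masks an exact count of entries, while this form masks each entry independently with probability `missing_pct`, so the realized fraction varies slightly.

```python
import numpy as np

rs = np.random.RandomState(0)
call_genotype = rs.randint(0, 2, size=(5, 3, 2))  # toy (variants, samples, ploidy)
missing_pct = 0.3

# mark ~missing_pct of the calls as missing (-1):
call_genotype = np.where(
    rs.rand(*call_genotype.shape) < missing_pct, -1, call_genotype
)
```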

```diff
@@ -14,6 +14,7 @@ def simulate_genotype_call_dataset(
     n_allele: int = 2,
     n_contig: int = 1,
     seed: Optional[int] = None,
+    missing_pct: Optional[float] = None,
```
**Collaborator:**

😁

```python
# Note: dask qr decomposition requires no chunking in one dimension, and because
# the number of components should be smaller than the number of samples in most
# cases, we disable chunking on the components dimension
pcsi = pcsi.T.rechunk((None, -1))
```
**Collaborator:**

👍

```python
pcsi = pcsi.T.rechunk((None, -1))

q, r = da.linalg.qr(pcsi)
```
**Collaborator:**

Remind me again how many PCs are typically provided? If we need to, there's probably a way to do this without the unit chunking requirement but I don't remember if the number of PCs is ever big enough to matter.
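For reference, a minimal sketch of the constraint in question (toy sizes): `da.linalg.qr` expects a tall-and-skinny array with a single chunk along the second axis, which is what the `rechunk((None, -1))` above enforces.

```python
import dask.array as da

# toy stand-in for pcsi.T: many samples, few PCs
x = da.random.random((10_000, 10), chunks=(1_000, 10))

# a single chunk along axis 1 is required by da.linalg.qr;
# rechunk((None, -1)) enforces that without touching the row chunks
x = x.rechunk((None, -1))
q, r = da.linalg.qr(x)
assert q.shape == (10_000, 10) and r.shape == (10, 10)
```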

**ravwojdyla (Collaborator, Author)** replied on Aug 24, 2020:

I believe the number of PCs should be relatively low; from UKBB:

> Principal components should ideally be computed using a subset of high quality, unrelated samples. However, the metrics used to find related samples, as well as poorer quality samples themselves require information about population structure (see Sections S 3.5 and S 3.7). We therefore conducted an initial round of PCA, computing just the top 8 PCs, using a set of unrelated samples based on an initial round of kinship estimation. We used the results of this analysis to compute PC adjusted heterozygosity as well as refine the relatedness inference. Having then identified a set of high quality, unrelated samples, we conducted a second round of PCA, computing the first 40 PCs. Results of the second round are made available to researchers and visualised in Figure 3A, Figure S6-S7, Extended Data Figure 3, and discussed in the main text.

Further, for PC-Relate, AFAIU the number can be limited to only the components that best capture the populations; from Conomos et al., 2016, "Model-Free Estimation of Recent Genetic Relatedness":

> The appropriate number of PCs that should be used to adjust for population structure in a PC-Relate analysis will depend on the sample structure. A reasonable set of PCs can often be selected by examining scatter plots and parallel coordinates plots of the top PCs to identify which ones appear to reflect population structure, and by examining a scree plot of the eigenvalues to identify a point of separation between PCs that explain a significant proportion of the total variation in the data and those that explain little variation. However, making the appropriate choice can be challenging, so we investigated the sensitivity of relatedness inference with PC-Relate to the number of PCs used in the analysis. Consider population structure II, where there are only two dimensions of population structure (i.e., ancestry contributed by three subpopulations). Figure S12 displays the kinship coefficient estimates from PC-Relate using varying numbers of PCs. Relatedness estimates with PC-Relate were nearly identical when using the top 2, 5, 10, or 20 PCs. However, including the top 100 PCs, which is 50 times more PCs than are required to explain the population structure in the sample, resulted in a substantial increase in variability. In this particular setting, choosing the number of PCs to be within a factor of 10 of the appropriate number allowed for accurate relatedness inference with PC-Relate. These results suggest that PC-Relate is quite robust to the choice of PCs, provided that there are a sufficient number of PCs included in the relatedness analysis to fully capture the population structure in the sample.

**Collaborator:**

Ah perfect, no need to worry about that then! Thanks for pulling those references.
