PC Relate #133
Conversation
return a.T.dot(a)

def pc_relate(ds: xr.Dataset, maf: float = 0.01) -> xr.Dataset:
@eric-czech please ignore the test code for now; the current test is there to validate results against the reference implementation (but I need to make it smaller, etc.). I'm curious to hear your opinion about the implementation itself, though: do you have any suggestions about the xr/dask usage?
Had a few minor suggestions, but looks good to me!
Some array shape checks would definitely help in thinking about the scalability though.
sgkit/stats/pc_relate.py
Outdated
# impute with variant mean
variant_mean = (
    call_g.where(~call_g_mask)
    .mean(dim="samples")
I'm not sure why it isn't documented, but keepdims is part of the reduce function API (pydata/xarray#3033), so you could just do .mean("samples", keepdims=True).
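A minimal sketch of that keepdims behaviour (the array values here are made up for illustration):

```python
import numpy as np
import xarray as xr

# A 2-D genotype-like array: 2 variants x 3 samples.
call_g = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("variants", "samples"),
)
# keepdims=True retains the reduced "samples" dimension with length 1,
# so no follow-up expand_dims is needed before broadcasting.
variant_mean = call_g.mean(dim="samples", keepdims=True)
print(variant_mean.shape)  # (2, 1)
```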
@eric-czech thanks for pointing this out. I can actually make this even simpler with keepdims plus xr's where (it turns out xr/da broadcasting differs a bit in some cases, for example in where; specifically: pydata/xarray#2171).
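The broadcasting difference referenced above comes down to xr.where aligning arrays by dimension name rather than by position; a small sketch with made-up values:

```python
import numpy as np
import xarray as xr

g = xr.DataArray(
    [[0.0, 1.0, np.nan], [2.0, np.nan, 2.0]],
    dims=("variants", "samples"),
)
# A 1-D per-variant mean (dims: "variants" only).
variant_mean = g.mean(dim="samples", skipna=True)
# xr.where broadcasts by dimension *name*, so the 1-D mean lines up with
# the 2-D array without manual expand_dims/transpose; positional numpy/dask
# broadcasting (as in da.where) would instead try to align trailing axes
# and fail here.
imputed = xr.where(g.isnull(), variant_mean, g)
```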
sgkit/stats/pc_relate.py
Outdated
    .mean(dim="samples")
    .expand_dims(dim="samples", axis=1)
)
imputed_g = da.where(call_g_mask, variant_mean, call_g)
Not that it matters, but I'm curious if there was any reason not to use xr.where here?
Actually, I think it could matter, since getting an imputed genotype count should probably be its own function. I've been assuming this would be applied externally to these kinds of methods.
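Such a standalone imputation helper might look like the following (the function name and signature are hypothetical, not part of the PR):

```python
import xarray as xr

def impute_with_variant_mean(
    call_g: xr.DataArray, call_g_mask: xr.DataArray
) -> xr.DataArray:
    # Mask out missing entries, average the rest per variant, then fill
    # the missing entries with that per-variant mean.
    variant_mean = call_g.where(~call_g_mask).mean(dim="samples")
    return xr.where(call_g_mask, variant_mean, call_g)

g = xr.DataArray([[0.0, 2.0], [1.0, -1.0]], dims=("variants", "samples"))
mask = g < 0  # -1 encodes a missing call
imputed = impute_with_variant_mean(g, mask)
```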
sgkit/stats/pc_relate.py
Outdated
# phi, eq: 4
mask = (mu <= maf) | (mu >= 1.0 - maf) | call_g_mask
mu_mask = da.ma.masked_array(mu, mask=mask)
variance = mu_mask.map_blocks(lambda i: i * (1.0 - i))
Can this not be mu_mask * (1.0 - mu_mask)?
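The suggestion works because masked-array arithmetic is already elementwise and mask-preserving; a sketch with numpy masked arrays (dask's da.ma wrappers are assumed to behave analogously):

```python
import numpy as np

# Allele frequencies with one entry masked out (e.g. failing the MAF filter).
mu_mask = np.ma.masked_array([0.1, 0.5, 0.9], mask=[False, True, False])
# Elementwise arithmetic on masked arrays preserves the mask, so the
# variance term needs no map_blocks indirection.
variance = mu_mask * (1.0 - mu_mask)
```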
stddev = da.sqrt(variance)
centered_af = call_g / 2 - mu_mask
centered_af = da.ma.filled(centered_af, fill_value=0.0)
# NOTE: gramian could be a performance bottleneck, and we could explore
I think some shape assertions along the way could be really helpful.
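A sketch of what such shape assertions could look like, using plain numpy arrays with made-up dimensions:

```python
import numpy as np

n_variants, n_samples = 100, 30
call_g = np.zeros((n_variants, n_samples))
variant_mean = call_g.mean(axis=1, keepdims=True)
# Cheap shape assertions document the expected (variants, samples) layout
# and fail fast if a reduction or broadcast flips the orientation.
assert call_g.shape == (n_variants, n_samples)
assert variant_mean.shape == (n_variants, 1)
centered = call_g / 2 - variant_mean
assert centered.shape == (n_variants, n_samples)
```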
sgkit/testing.py
Outdated
    replace=False,
    size=int(call_genotype.size * missing_pct),
)
call_genotype[np.unravel_index(indices, call_genotype.shape)] = -1
choice always seems very slow compared to the other functions, and it's probably a bit easier to follow as:
np.where(rs.rand(*call_genotype.shape) < missing_pct, -1, call_genotype)
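A runnable sketch of the suggested vectorised approach (sizes are illustrative):

```python
import numpy as np

rs = np.random.RandomState(0)
call_genotype = rs.randint(0, 3, size=(50, 20))
missing_pct = 0.1
# Each entry is independently set to -1 (missing) with probability
# missing_pct; note this hits the target fraction only in expectation,
# unlike rs.choice(..., replace=False), which gives an exact count.
masked = np.where(rs.rand(*call_genotype.shape) < missing_pct, -1, call_genotype)
```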
@@ -14,6 +14,7 @@ def simulate_genotype_call_dataset(
    n_allele: int = 2,
    n_contig: int = 1,
    seed: Optional[int] = None,
    missing_pct: Optional[float] = None,
😁
# Note: dask qr decomp requires no chunking in one dimension, and because number of
# components should be smaller than number of samples in most cases, we disable
# chunking on number components
pcsi = pcsi.T.rechunk((None, -1))
👍
# chunking on number components
pcsi = pcsi.T.rechunk((None, -1))

q, r = da.linalg.qr(pcsi)
Remind me again how many PCs are typically provided? If we need to, there's probably a way to do this without the unit-chunking requirement, but I don't remember if the number of PCs is ever big enough to matter.
I believe the number of PCs should be relatively low, from UKBB:
Principal components should ideally be computed using a subset of high quality, unrelated samples. However, the metrics used to find related samples, as well as poorer quality samples themselves require information about population structure (see Sections S 3.5 and S 3.7). We therefore conducted an initial round of PCA, computing just the top 8 PCs, using a set of unrelated samples based on an initial round of kinship estimation. We used the results of this analysis to compute PC adjusted heterozygosity as well as refine the relatedness inference. Having then identified a set of high quality, unrelated samples, we conducted a second round of PCA, computing the first 40 PCs. Results of the second round are made available to researchers and visualised in Figure 3A, Figure S6-S7, Extended Data Figure 3, and discussed in the main text.
Further, for PC-Relate, as far as I understand, the number can be limited to only the components that best capture the population structure; from Conomos, 2016, “Model-Free Estimation of Recent Genetic Relatedness”:
The appropriate number of PCs that should be used to adjust for population structure in a PC-Relate analysis will depend on the sample structure. A reasonable set of PCs can often be selected by examining scatter plots and parallel coordinates plots of the top PCs to identify which ones appear to reflect population structure, and by examining a scree plot of the eigenvalues to identify a point of separation between PCs that explain a significant proportion of the total variation in the data and those that explain little variation. However, making the appropriate choice can be challenging, so we investigated the sensitivity of relatedness inference with PC-Relate to the number of PCs used in the analysis. Consider population structure II, where there are only two dimensions of population structure (i.e., ancestry contributed by three subpopulations). Figure S12 displays the kinship coefficient estimates from PC-Relate using varying numbers of PCs. Relatedness estimates with PC-Relate were nearly identical when using the top 2, 5, 10, or 20 PCs. However, including the top 100 PCs, which is 50 times more PCs than are required to explain the population structure in the sample, resulted in a substantial increase in variability. In this particular setting, choosing the number of PCs to be within a factor of 10 of the appropriate number allowed for accurate relatedness inference with PC-Relate. These results suggest that PC-Relate is quite robust to the choice of PCs, provided that there are a sufficient number of PCs included in the relatedness analysis to fully capture the population structure in the sample.
Ah perfect, no need to worry about that then! Thanks for pulling those references.
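For reference, a small sketch of the tall-and-skinny constraint discussed above (the array sizes are made up):

```python
import dask.array as da
import numpy as np

# da.linalg.qr implements tall-and-skinny QR: the input may be chunked
# along rows but must have a single chunk spanning all columns.
pcs = da.random.random((1000, 8), chunks=(250, 8))
pcs = pcs.rechunk((None, -1))  # -1: one chunk across the column dimension
q, r = da.linalg.qr(pcs)
```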
Force-pushed from d47843f to c2bf1cd
Re: #24