PC-Relate experiment #706

hammer · 2021-10-02T21:37:20Z

hammer
Oct 2, 2021
Maintainer

We have been experimenting with PC-Relate, and I would like to share some initial results/thoughts/questions.

Context

Genome-wide association studies (GWAS) associate loci with phenotypes/traits. Identifying and validating family structure is a critical step for both family-based and population-based studies: failure to account for both pedigree and population structure can lead to spurious associations [src]. Many methods exist to estimate recent relatedness. Prior to KING-robust, these methods assumed that the samples come from a homogeneous population [src]. KING-robust infers the relationship (up to 3rd degree) of any pair of individuals from their kinship coefficient, independently of sample composition or population structure, but it assumes that samples are drawn from ancestrally distinct subpopulations with no admixture. KING-robust essentially assumes that sharing common alleles is not informative about relatedness, whereas sharing rare alleles suggests recent relatedness (the rare allele must have been inherited by descent). In the presence of admixture these assumptions no longer hold, because a rare allele may simply be associated with one of the ancestry groups. KING-robust therefore gives biased relatedness estimates for pairs of individuals who have different ancestry, which can result in incorrect relationship inference for relatives with admixed ancestry [src]. PC-Relate is a model-free approach for estimating commonly used measures of recent genetic relatedness, such as kinship coefficients and IBD sharing probabilities, in the presence of unspecified structure (including admixture) [src].

GENESIS implementation

GENESIS is the reference R implementation of the PC-Relate paper. Its major downside is that it is single-threaded: in our tests on 1kGP data (629 samples and ~5.8e6 SNPs) it takes about 2.3 hours. The time complexity of PC-Relate is quadratic in the number of samples, so using GENESIS for larger studies (>=100k samples) would be impractical. GENESIS could be parallelized across multiple cores (this would require changes to the implementation), which would improve performance; nevertheless, for large studies we need to be able to distribute the computation across many nodes.
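To make the quadratic growth concrete, here is a small arithmetic sketch of how many unique sample pairs (and therefore kinship coefficients) have to be estimated for a few sample counts. The helper name `n_unique_pairs` is made up for this illustration.

```python
def n_unique_pairs(n_samples: int) -> int:
    """Number of unordered sample pairs, i.e. kinship coefficients to estimate."""
    return n_samples * (n_samples - 1) // 2


for n in (629, 1_000, 100_000):
    print(f"{n} samples -> {n_unique_pairs(n):,} pairs")

# 629 samples -> 197,506 pairs
# 1000 samples -> 499,500 pairs
# 100000 samples -> 4,999,950,000 pairs
```

Going from 629 to 100k samples multiplies the number of pairs by roughly 25,000, which is why a single-threaded run that already takes hours does not extend to biobank-scale cohorts.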

In the presence of population structure, GENESIS requires the user to provide ancestry-representative principal components (PCs). The GENESIS package provides PC-AiR, a PCA implementation that can compute ancestry PCs in the presence of cryptic family structure: it uses the KING-robust method to separate related and unrelated samples, computes PCA on the unrelated set, and then projects the related samples into the PC space. PC-AiR is the recommended way to provide PCs for PC-Relate [src].
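The "PCA on the unrelated set, then project the related samples" step can be sketched in NumPy as follows. This is a minimal illustration, not the GENESIS PC-AiR code: it assumes the related/unrelated partition is already known (the KING-robust step is skipped), that there are no missing calls, and the function name `pcs_from_unrelated` and the toy two-population data are made up for this example.

```python
import numpy as np


def pcs_from_unrelated(g, unrelated, k=2):
    """PCA on unrelated samples, then projection of all samples onto those PCs.

    g:         (variants, samples) alternate allele counts (0/1/2), no missing values
    unrelated: boolean mask over samples, assumed to be given (e.g. from KING-robust)
    """
    # Allele frequencies are estimated from the unrelated set only.
    p = g[:, unrelated].mean(axis=1) / 2.0
    keep = (p > 0.0) & (p < 1.0)  # drop variants that are monomorphic in that set
    p = p[keep, None]
    z = (g[keep] - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))  # standardized genotypes

    # Variant loadings come from the unrelated samples only.
    u, _, _ = np.linalg.svd(z[:, unrelated], full_matrices=False)
    loadings = u[:, :k]  # (variants, k)

    # Every sample, related or not, is projected onto the same axes.
    return z.T @ loadings  # (samples, k)


# Toy data: two populations of 20 samples; sample 1 duplicates sample 0,
# so it is treated as "related" and excluded from the PCA fit.
rng = np.random.default_rng(0)
n_variants = 2000
p_a = rng.uniform(0.2, 0.8, n_variants)
p_b = rng.uniform(0.2, 0.8, n_variants)
g = np.hstack([
    rng.binomial(2, p_a[:, None], size=(n_variants, 20)),
    rng.binomial(2, p_b[:, None], size=(n_variants, 20)),
]).astype(float)
g[:, 1] = g[:, 0]

unrelated = np.ones(40, dtype=bool)
unrelated[1] = False
pcs = pcs_from_unrelated(g, unrelated, k=2)
```

In this toy setup the first PC separates the two populations, and the excluded duplicate still lands with its own population because it is projected with loadings learned from the other samples.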

GENESIS handles missing SNPs by imputing the variant mean. Both the Hail and Dask implementations below handle missing data the same way.
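As an illustration of what imputation with the variant mean looks like, here is a minimal NumPy sketch. It assumes missing calls are encoded as NaN in a (variants, samples) array of alternate allele counts; the function name is hypothetical, and the code is not taken from GENESIS, Hail, or our implementation.

```python
import numpy as np


def impute_with_variant_mean(g):
    """Replace missing calls (NaN) with the mean of the observed calls of the same variant.

    g: (variants, samples) float array of alternate allele counts.
    """
    variant_mean = np.nanmean(g, axis=1, keepdims=True)  # one mean per variant (row)
    return np.where(np.isnan(g), variant_mean, g)


g = np.array([
    [0.0, 1.0, np.nan, 2.0],  # observed mean = 1.0
    [2.0, np.nan, 2.0, 2.0],  # observed mean = 2.0
])
imputed = impute_with_variant_mean(g)
```

A variant with no observed calls at all would stay NaN (and `np.nanmean` would warn), so such variants would need to be filtered out beforehand.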

Hail implementation

Hail is an open-source, general-purpose, Python-based data analysis library with additional data types and methods for working with genomic data. It provides an implementation of PC-Relate. In our tests it produces kinship estimates close to GENESIS, with absolute error on the order of 1e-7. Hail's advantage over GENESIS is scalability: being based on Spark, Hail can distribute the computation across many nodes. In Hail, the genetic data used for PC-Relate is stored in a BlockMatrix, with the block being the unit of parallelization.

Histogram of kinship estimation absolute error between GENESIS and Hail for 1000 samples (499,500 unique pairs).

Hail allows passing PCs to PC-Relate. When they are not provided, Hail computes PCA on the whole input [src], so it is up to the user to decide whether related samples should be included in the PCA. This has been somewhat unclear and might lead to confusing results (#3490). Hail doesn't provide a PC-AiR equivalent; nevertheless, it is possible to build one from existing functions.

Dask implementation

Our Dask-based implementation produces kinship estimates close to GENESIS, with absolute error on the order of 1e-16. Much like Hail, Dask provides the advantage of scalable computation. Our implementation uses Dask Array to perform PC-Relate, which is conceptually similar to Hail's BlockMatrix.

Histogram of kinship estimation absolute error between GENESIS and Dask for 1000 samples (499,500 unique pairs).

Much like GENESIS, our Dask implementation requires the user to provide ancestry-representative principal components. We currently do not have a PC-AiR implementation in Dask.
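For readers unfamiliar with the estimator, here is a minimal NumPy sketch of the PC-Relate kinship calculation as described in the paper: genotypes are regressed on the PCs to get individual-specific allele frequencies, and kinship is a ratio of residual cross-products to the corresponding standard deviations. This is not our Dask code; the function name, the in-memory arrays, the toy data and the known-ancestry "PC" are all assumptions made for illustration. The matrix operations used here map naturally onto blocked arrays.

```python
import numpy as np


def pc_relate_kinship(g, pcs, maf=0.01):
    """Sketch of the PC-Relate kinship estimator.

    g:   (variants, samples) alternate allele counts (0/1/2), no missing values
    pcs: (samples, k) ancestry-representative principal components
    """
    n_samples = g.shape[1]
    x = np.column_stack([np.ones(n_samples), pcs])  # intercept + PCs

    # Regress every variant's genotypes on the PCs; half the fitted value is the
    # individual-specific allele frequency (ISAF) mu[s, i].
    beta, *_ = np.linalg.lstsq(x, g.T, rcond=None)
    mu = (x @ beta).T / 2.0  # (variants, samples)

    # A variant contributes to pair (i, j) only if both ISAFs lie in (maf, 1 - maf).
    valid = (mu > maf) & (mu < 1.0 - maf)
    resid = np.where(valid, g - 2.0 * mu, 0.0)
    sd = np.sqrt(np.where(valid, mu * (1.0 - mu), 0.0))

    # kinship[i, j] = sum_s resid_is * resid_js / (4 * sum_s sd_is * sd_js)
    return (resid.T @ resid) / (4.0 * (sd.T @ sd))


# Toy data: two populations of 20 samples; sample 1 is a duplicate of sample 0.
rng = np.random.default_rng(0)
n_variants = 3000
p_a = rng.uniform(0.2, 0.8, n_variants)
p_b = rng.uniform(0.2, 0.8, n_variants)
g = np.hstack([
    rng.binomial(2, p_a[:, None], size=(n_variants, 20)),
    rng.binomial(2, p_b[:, None], size=(n_variants, 20)),
]).astype(float)
g[:, 1] = g[:, 0]

# Stand-in for a real PC: the (known) population label of each sample.
pcs = np.where(np.arange(40) < 20, -0.5, 0.5)[:, None]
kin = pc_relate_kinship(g, pcs)
```

On this toy input the duplicated pair gets a kinship estimate somewhat below 0.5 (the small sample pulls it down), while unrelated pairs, within or across populations, come out close to 0. Only the off-diagonal entries are pairwise estimates.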

Dask implementation validation

Our implementation of kinship estimation was validated against synthetic data, with and without missing values, and against real-world data including HapMap and 1kGP. In both cases it produces results with absolute error on the order of 1e-15 to 1e-16 (when compared to GENESIS). See the validation sections of this notebook.
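The kind of comparison behind these error figures can be sketched as below: take two kinship matrices for the same samples and compute the absolute difference over the unique pairs. The function name is hypothetical and the two matrices are synthetic stand-ins (a random symmetric matrix plus a small perturbation), not actual GENESIS or Dask output.

```python
import numpy as np


def pairwise_abs_error(kin_a, kin_b):
    """Absolute difference between two kinship matrices over unique pairs (i < j)."""
    i, j = np.triu_indices(kin_a.shape[0], k=1)
    return np.abs(kin_a[i, j] - kin_b[i, j])


# Synthetic stand-ins for two implementations' results on 1000 samples.
rng = np.random.default_rng(0)
ref = rng.uniform(-0.05, 0.5, size=(1000, 1000))
ref = (ref + ref.T) / 2.0  # kinship matrices are symmetric
other = ref + rng.uniform(-1e-9, 1e-9, size=ref.shape)  # artificial perturbation

err = pairwise_abs_error(ref, other)
print(err.size, err.max())  # number of unique pairs and the largest absolute error
```

The `err` vector is what would be fed into a histogram of absolute errors; the self-pairs on the diagonal are excluded.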

Performance comparison

Real-world example: our 1kGP dataset contains 629 samples and ~5.8e6 SNPs. The test VM has 64GB of memory and 8 CPUs. PC-Relate took ~2.3 hours via GENESIS, ~51 minutes via Hail and ~3 minutes via Dask (for both Hail and Dask we compute only the kinship coefficient, with default block sizes and MAF=0.01). More tests and benchmarks can be found in notebooks here and here.

PCA discussion

GENESIS and the PC-Relate paper recommend computing PCA on a set of unrelated samples (and projecting related samples into the PC space). Below we compare PC-Relate results computed from PCs with and without related samples:

Histograms of the kinship coefficient distribution based on PCs computed with and without related samples. The Y axis is on a log scale. The second histogram includes only kinship coefficients >= 0.05. The input is real 1kGP data.

As mentioned above, it is recommended not to include related samples in the PCA, as they might confound the estimate of population stratification, which is then used by PC-Relate [src]. In the graphs above we can see the inflation; nevertheless, PC-Relate was able to identify related samples (with some error). It is an interesting question whether it is practically necessary to separate related samples, and how much error including them introduces in both the population structure and relatedness estimates. This requires more tests and discussion.

Alternative methods

The UKBB pipeline used an alternative to PC-Relate to estimate both population structure and recent relatedness in the presence of admixture. UKBB uses two iterations of KING-robust + PCA. The first iteration finds a set of unrelated samples and performs PCA to identify rare/informative ancestry variants (those with large loadings). The second iteration computes kinship coefficients accounting for ancestry information (using variants with small loadings to minimize inflation due to recent admixture) and a final PCA. This pipeline includes some manual steps; nevertheless, it could be interesting to try to reproduce it to measure its performance and compare the results to pc_relate.

hammer · 2021-10-02T21:37:43Z

hammer
Oct 2, 2021
Maintainer Author

(Posted by @alimanfoo)

Hi @ravwojdyla, just to say (belatedly) this was a very cool and informative post, thanks a lot for sharing.
