Add tr gwas #26

Open · wants to merge 112 commits into base: main

Commits (112)
0db77ab
add plink2 docker
nicholema Oct 21, 2024
526b431
add wdl to run gwas genome wide
nicholema Oct 21, 2024
66097ff
add wdl to run gwas genome wide
nicholema Oct 21, 2024
19afd91
fix bugs
nicholema Oct 22, 2024
f37668d
add script to run tr_gwas.wdl
nicholema Oct 22, 2024
28050ac
change pgen path
nicholema Oct 22, 2024
39f4761
change phenotype and cohorts path
nicholema Oct 22, 2024
ab44141
fix minor bugs
nicholema Oct 22, 2024
b45192e
add gcs client
nicholema Oct 22, 2024
d564ca2
fix bug
nicholema Oct 22, 2024
8ed33b2
fix bug
nicholema Oct 22, 2024
61fb1ae
fix bugs: add gs prefix
nicholema Oct 22, 2024
ccf7bec
add project
nicholema Oct 22, 2024
65b0e70
fix minor bug: add python
nicholema Oct 22, 2024
00915e4
fix minor bug
nicholema Oct 22, 2024
6abd82a
change into python3 and add project and token
nicholema Oct 22, 2024
14c58d6
change into python3
nicholema Oct 22, 2024
7f34adf
add project
nicholema Oct 23, 2024
35fcdfc
add project
nicholema Oct 23, 2024
9df4a3c
delete check if ancestry_pred_path
nicholema Oct 23, 2024
44f64c3
fix download ancestry file bug
nicholema Oct 23, 2024
e3df252
fix download ancestry file bug
nicholema Oct 23, 2024
8278f92
update docker
nicholema Oct 23, 2024
57a8987
change gcloud config
nicholema Oct 23, 2024
b737514
add DownloadAncestry
nicholema Oct 24, 2024
af1b31f
add print statement to debug download ancestry
nicholema Oct 24, 2024
e3febb9
add ancestry argument
nicholema Oct 24, 2024
08a6a7f
download ancestry_pred
nicholema Oct 24, 2024
69b8059
add import gcsfs
nicholema Oct 24, 2024
3335b55
add token and project
nicholema Oct 24, 2024
2a1444e
add token and project
nicholema Oct 24, 2024
38ada1b
remove duplicated pd.read ancestry
nicholema Oct 24, 2024
55778c2
change cohort to plink format
nicholema Oct 24, 2024
7c10542
print project and token
nicholema Oct 24, 2024
8e97380
fix bugs
nicholema Oct 24, 2024
bed38dd
fix bugs
nicholema Oct 24, 2024
6932ced
fix bugs
nicholema Oct 24, 2024
52ab7ba
fix bugs
nicholema Oct 24, 2024
0bc38ab
fix bugs
nicholema Oct 24, 2024
9a3079e
fix bugs
nicholema Oct 24, 2024
3701644
fix bugs
nicholema Oct 24, 2024
90b79e6
fix bugs
nicholema Oct 24, 2024
d546331
fix bugs
nicholema Oct 24, 2024
7c008ea
fix bugs
nicholema Oct 24, 2024
137fa4f
delete gcsfs
nicholema Oct 24, 2024
a048330
fix bugs
nicholema Oct 24, 2024
1c9caed
change plink2 docker name
nicholema Oct 24, 2024
b0fa982
change plink2 docker name
nicholema Oct 24, 2024
cce4ba3
return dataframe
nicholema Oct 25, 2024
e1bdb2f
delete debug comment and set convert_phenotype to Array[File]
nicholema Oct 25, 2024
c45407b
change to Array[File]pheno, covar
nicholema Oct 25, 2024
a1ad60f
return dataframe
nicholema Oct 25, 2024
4ec124a
add return covar and phenotype output
nicholema Oct 25, 2024
df913ea
return dataframe,remove to_csv
nicholema Oct 25, 2024
5dd9048
change output name
nicholema Oct 25, 2024
190c90c
fix bugs
nicholema Oct 25, 2024
a155230
fix bugs
nicholema Oct 25, 2024
788e234
add return files
nicholema Oct 25, 2024
2407f0b
remove array
nicholema Oct 25, 2024
846ab21
remove extra main()
nicholema Oct 25, 2024
a376533
remove exit code
nicholema Oct 25, 2024
d81e1e1
fix bug
nicholema Oct 25, 2024
fdd99ff
fix bugs
nicholema Oct 25, 2024
6db4784
fix bugs
nicholema Oct 25, 2024
47758a1
fix bugs
nicholema Oct 25, 2024
294eeb4
fix bugs
nicholema Oct 25, 2024
a6cf960
fix bugs
nicholema Oct 25, 2024
273f977
change pheno name
nicholema Oct 25, 2024
e58c3c4
change pheno name
nicholema Oct 25, 2024
a1c6faa
check bucket
nicholema Oct 25, 2024
56ff63a
add workspace_bucket
nicholema Oct 25, 2024
141ddef
fix bugs
nicholema Oct 25, 2024
81cd528
change outfile_covar name
nicholema Oct 25, 2024
aabe6ba
increase disk and memory for run_tr_gwas
nicholema Oct 25, 2024
587f66f
change plink gwas outname
nicholema Oct 25, 2024
d4de44e
change gwas outname and add gwas_logs
nicholema Oct 25, 2024
ed522b8
change concat and debug psam
nicholema Oct 25, 2024
fa5128e
debug psam
nicholema Oct 25, 2024
0f4e1d7
fix bugs
nicholema Oct 26, 2024
cd6027f
fix bugs
nicholema Nov 3, 2024
dcad759
change PFILEARRARY
nicholema Nov 4, 2024
d63ded5
fix bugs
nicholema Nov 4, 2024
5e98362
fix bugs
nicholema Nov 4, 2024
7c3d239
add print gwas_outfiles to debug
nicholema Nov 4, 2024
aa148c7
add print gwas_outfiles to debug
nicholema Nov 4, 2024
12850e4
add space seperated list
nicholema Nov 4, 2024
4d345e5
use chrx_annotated as pfile name
nicholema Nov 4, 2024
f5030bd
add full path to chrom_outprefix
nicholema Nov 4, 2024
01179eb
add back chrom prefix
nicholema Nov 4, 2024
bf16c58
remove extra echo statements
nicholema Nov 5, 2024
b77689d
add options to run gwas to targeted phenotype
nicholema Nov 15, 2024
05c31ea
increase vif to 2000 to run gwas on EUR
nicholema Nov 15, 2024
f33c513
add target cohort
nicholema Nov 15, 2024
4378005
add readme
nicholema Nov 15, 2024
95a2eb7
add readme
nicholema Nov 15, 2024
1bd5ea5
add detailed usage
nicholema Nov 15, 2024
f6f5bca
debug line --vif 2000
nicholema Nov 15, 2024
3f9df7a
remove PC6 if EUR or non-AFR cohort
nicholema Nov 22, 2024
51952e7
delete non-AFR option
nicholema Nov 22, 2024
7591374
fix bugs
nicholema Nov 22, 2024
477c353
debug:print covar_file
nicholema Nov 23, 2024
fcd08d0
fix minor bug
nicholema Nov 23, 2024
d8c3e1c
fix minor bug
nicholema Nov 23, 2024
eff1a92
fix minor bug
nicholema Nov 24, 2024
af6e497
fix minor bug
nicholema Nov 24, 2024
249905a
add rm pc6 if covar file is eur or non-afr
nicholema Nov 25, 2024
b120c83
fix minor bug
nicholema Nov 25, 2024
0121879
made changes for pull request
nicholema Dec 2, 2024
4b17e00
use latest plink2 path
nicholema Dec 2, 2024
c428508
delete commented lines
nicholema Dec 2, 2024
bdff2a8
fix bugs
nicholema Dec 2, 2024
8cbe3e0
fix bugs
nicholema Dec 2, 2024
41 changes: 41 additions & 0 deletions docker/plink2/Dockerfile
@@ -0,0 +1,41 @@
FROM ubuntu:20.04

RUN apt-get update && DEBIAN_FRONTEND="noninteractive" apt-get install -qqy \
python3 \
python3-pip \
wget \
build-essential \
curl \
unzip \
&& rm -rf /var/lib/apt/lists/*


# Install Python dependencies
RUN pip3 install pandas

# Download and install gcloud
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

# Copy python script
COPY convert_phenotype_plink.py /usr/bin/
RUN chmod +x /usr/bin/convert_phenotype_plink.py

# Download and install PLINK2
RUN wget https://s3.amazonaws.com/plink2-assets/alpha6/plink2_linux_x86_64_20241124.zip \
&& unzip plink2_linux_x86_64_20241124.zip \
&& mv plink2 /usr/bin/ \
&& chmod +x /usr/bin/plink2 \
&& rm plink2_linux_x86_64_20241124.zip









18 changes: 18 additions & 0 deletions docker/plink2/Makefile
@@ -0,0 +1,18 @@
# Definitions
repository = gcr.io/ucsd-medicine-cast
identifier = plink2
version = 0.0.0
git_commit ?= $(shell git log --pretty=oneline -n 1 | cut -f1 -d " ")
name = ${repository}/${identifier}
tag = ${version}--${git_commit}

# Steps
build:
# do the docker build
docker build --no-cache -t ${name}:${tag} .
docker tag ${name}:${tag} ${name}:latest

push:
# Requires ~/.dockercfg
docker push ${name}:${tag}
docker push ${name}:latest
134 changes: 134 additions & 0 deletions docker/plink2/convert_phenotype_plink.py
@@ -0,0 +1,134 @@
#!/usr/bin/env python3



"""
Convert AoU phenotypes to plink friendly format.
Output cleaned phenotype and combined covariate files
for the specified phenotype

Usage:
./convert_phenotype_plink.py --phenotype <phenotype>

"""

import os
import pandas as pd
import subprocess
import sys
import csv
import argparse


# Get token and set up project
token_fetch_command = subprocess.run(['gcloud', 'auth', 'application-default', 'print-access-token'], \
capture_output=True, check=True, encoding='utf-8')
token = str.strip(token_fetch_command.stdout)
project = os.getenv("GCS_REQUESTER_PAYS_PROJECT")
bucket = os.getenv("WORKSPACE_BUCKET")
#print(f"Workspace Bucket: {bucket}")
#print all the environmental variables
#print(dict(os.environ))

def GetPTCovarPath(phenotype):
return os.path.join(bucket, \
"phenotypes", "%s_phenocovar.csv"%phenotype)

def DownloadPT(filename):
"""
Download phenotype_covar.csv locally

Arguments
---------
filename : str
GCP path
"""
cmd = "gsutil cp {filename} .".format(filename=filename)
output = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE).stdout.read()
print(output.decode("utf-8"))

return filename.split("/")[-1] # Return local filename


def GetFloatFromPC(x):
x = x.replace("[","").replace("]","")

return float(x)

def LoadAncestry(ancestry_pred_path,project):
"""
Download ancestry_pred.tsv locally

Arguments
---------
ancestry_pred_path : str
GCP path
project : str
google project for downloading
"""
if ancestry_pred_path.startswith("gs://"):
os.system(f"gsutil -u {project} cp {ancestry_pred_path} .")
ancestry_pred_path = "ancestry_preds.tsv"
ancestry = pd.read_csv(ancestry_pred_path, sep="\t")
ancestry.rename({"research_id": "IID"}, axis=1, inplace=True)
ancestry['IID'] = ancestry['IID'].astype(str)
num_pcs = len(ancestry["pca_features"].values[0].split(","))
pcols = ["PC_%s"%i for i in range(num_pcs)]
ancestry[pcols] = ancestry["pca_features"].str.split(",", expand=True)
for p in pcols:
ancestry[p] = ancestry[p].apply(lambda x: GetFloatFromPC(x), 1)
ancestry.insert(0, 'FID', 0)

return ancestry

def convert_csv_to_plink(ptfile):
df = pd.read_csv(ptfile)
df.insert(0, 'FID', 0)
df.rename(columns={"person_id": "IID"}, inplace=True)
df['IID'] = df['IID'].astype(str)

return df

def main():
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--phenotype", help="Comma-separated list of phenotypes to process. If not provided, process all available phenotypes", type=str, required=True)
parser.add_argument("--num-pcs", help="Number of PCs to use as covariates", type=int, default=10)
parser.add_argument("--ancestry-pred-path", help="Path to ancestry predictions",type=str, default="gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/ancestry/ancestry_preds.tsv")
parser.add_argument("--ptcovars", help="Comma-separated list of phenotype-specific covariates. Default: age", type=str, default="age")
parser.add_argument("--sharedcovars", help="Comma-separated list of shared covariates (besides PCs). Default: sex_at_birth_Male", type=str, default="sex_at_birth_Male")
parser.add_argument("--dryrun", help="Don't actually run the workflow. Just set up", action="store_true")
args = parser.parse_args()


# Set up paths
if args.phenotype.endswith(".csv"):
ptcovar_path = args.phenotype
else:
ptcovar_path = GetPTCovarPath(args.phenotype)


# Get covarlist
pcols = ["PC_%s"%i for i in range(1, args.num_pcs+1)]
shared_covars = [item for item in args.sharedcovars.split(",") if item != ""]
pt_covars = [item for item in args.ptcovars.split(",") if item != ""]
covars = pt_covars + shared_covars

# Set up data frame with phenotype and covars
ancestry = LoadAncestry(args.ancestry_pred_path,project)
plink = convert_csv_to_plink(DownloadPT(ptcovar_path))

# Extract phenotype and covars only
data = pd.merge(plink[["FID","IID"]+covars], ancestry[["IID"]+pcols],on=["IID"],how="inner")
plink_pheno = plink[["FID","IID","phenotype"]]

# Output files
pheno_file_path = f"{args.phenotype}_pheno_plink.txt"
covar_file_path = f"{args.phenotype}_covar_combined.txt"

plink_pheno.to_csv(pheno_file_path, sep="\t", index=False)
data.to_csv(covar_file_path, sep="\t", index=False)

sys.stderr.write(f"Done converting {args.phenotype} to plink format.\n")

if __name__ == "__main__":
    main()
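As a quick sanity check on the converter's outputs (a sketch, not part of the PR; it assumes the file names written by `main()` above and uses `platelet_count` as a hypothetical example phenotype):

```
#!/usr/bin/env python3
"""Illustrative sanity check for convert_phenotype_plink.py outputs."""
import pandas as pd

phenotype = "platelet_count"  # hypothetical example phenotype
pheno = pd.read_csv(f"{phenotype}_pheno_plink.txt", sep="\t", dtype={"IID": str})
covar = pd.read_csv(f"{phenotype}_covar_combined.txt", sep="\t", dtype={"IID": str})

# Both tables should start with the FID/IID columns plink2 expects
assert list(pheno.columns[:2]) == ["FID", "IID"]
assert list(covar.columns[:2]) == ["FID", "IID"]

# The covariate table is an inner merge with ancestry, so its samples
# should be a subset of the phenotype table's samples
missing = set(covar["IID"]) - set(pheno["IID"])
print(f"{len(missing)} covariate samples lack a phenotype entry (expected 0)")
```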
94 changes: 94 additions & 0 deletions tr-gwas/README.md
@@ -0,0 +1,94 @@
# Running TR-GWAS genome-wide on the All of Us workbench

The goal of this workflow is to run TR-GWAS genome-wide with plink2 across different phenotypes and cohorts.

For most of our use cases, users do not interact with the WDL described here directly, but rather call launcher scripts which handle setting up inputs and calling the WDL using cromshell. Sections below give additional WDL details which can be helpful for development/testing or debugging.


## Setup
In all cases you'll have to run the following steps in the AoU workbench before starting a workflow:

1. Start the cromwell environment (pink circle icon on the right).
2. Start a cloud environment (Jupyter icon) with:
* "General Analysis" environment
* 2 CPU/7.5GB RAM
* Compute type: Standard VM
* Automatically pause after: 30 minutes
* Storage disk options: standard disk, 120GB
3. Open a terminal (terminal icon) and run the commands below:

```
git clone https://github.com/cast-genomics/cast-workflows/
cd cast-workflows/tr-gwas
../utils/configure-cromshell.py
```

## Run TR-GWAS on targeted phenotypes and cohorts

Example test run of the TR-GWAS workflow:

```
./tr_gwas_aou.py \
--name test \
--phenotype platelet_count \
--cohort EUR,AFR
```

If successful, this will print out a friendly turtle along with the job ID. Use the command below to check the status of your job; the run takes around 10-20 minutes, and the status will eventually change to "Succeeded".

```
cromshell status $JOBID
```

If you see "Failed", you can look at the logs to see what happened:

```
cromshell logs -s ALL $JOBID
cromshell logs -s ALL -des $JOBID    # -des gives descriptive log info
cromshell logs -s Failed -des $JOBID
```

You can check the output:
```
cromshell list-outputs $JOBID
```

## Run a full job on all samples and all phenotypes

```
./tr_gwas_aou.py \
--name test_all
```
## Detailed usage

Required arguments:

* `--name <STR>`: name of the run

Optional arguments:

* `--phenotype <STR>`: names of targeted phenotypes, comma-separated. Default: None (run all phenotypes)
* `--cohort <STR>`: names of targeted cohorts, comma-separated. Options: AFR, EUR, NOT_AFR, ALL. Default: None (run all cohorts)

Warning: this script manually removes the PC6 covariate when running on the EUR and NOT_AFR cohorts.
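For reference, removing PC6 amounts to dropping one column from the combined covariate table before it is passed to plink2 via `--covar`; a minimal sketch, assuming the `PC_6` column name and file naming used by `convert_phenotype_plink.py` (the workflow may implement this differently):

```
import pandas as pd

# Hypothetical file names, following the converter's output naming
covar = pd.read_csv("platelet_count_covar_combined.txt", sep="\t")

# Drop PC_6 for EUR / NOT_AFR covariate files; errors="ignore" keeps this a
# no-op if the column is absent
covar.drop(columns=["PC_6"], errors="ignore").to_csv(
    "platelet_count_covar_combined_noPC6.txt", sep="\t", index=False)
```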


## Sample preprocessing

Samples are preprocessed in `cast-workflows/gwas/aou/sample_preprocessing`.

## Output files

PLINK-friendly phenotype and covariate files:
```
<phenotype>_covar_combined.txt
<phenotype>_pheno_plink.txt
```

TR-GWAS results:
```
<phenotype>_<cohort>_gwas.tab
```
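
For downstream inspection, the per-cohort results table can be loaded with pandas; a sketch, assuming the columns follow the usual plink2 `--glm` naming (e.g. `ID`, `P`):

```
import pandas as pd

# Hypothetical file name following the pattern above
results = pd.read_csv("platelet_count_EUR_gwas.tab", sep="\t")

# Sort by p-value and show the strongest associations
pcol = "P" if "P" in results.columns else results.columns[-1]
print(results.sort_values(pcol).head(10).to_string(index=False))
```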

## TODO

1. Look at pairwise correlations of all covariates plus the phenotype and remove highly correlated covariates (see the sketch below).
2. Filter any categorical covariates as we do in Margoliash et al.; statin usage and sex are examples of categorical covariates we have used.
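
A minimal sketch of item 1, assuming the phenotype and covariate files produced by `convert_phenotype_plink.py` and an arbitrary correlation cutoff of 0.8:

```
import pandas as pd

# Hypothetical input files for one phenotype
pheno = pd.read_csv("platelet_count_pheno_plink.txt", sep="\t")
covar = pd.read_csv("platelet_count_covar_combined.txt", sep="\t")

# Pairwise absolute correlation of the phenotype plus all covariates
merged = pheno.merge(covar, on=["FID", "IID"])
corr = merged.drop(columns=["FID", "IID"]).corr().abs()

# Flag pairs above the cutoff; one member of each pair could be dropped
cutoff = 0.8
cols = list(corr.columns)
flagged = [(a, b, round(corr.loc[a, b], 3))
           for i, a in enumerate(cols)
           for b in cols[i + 1:]
           if corr.loc[a, b] > cutoff]
print(flagged)
```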