Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

custom gene set? #2

Closed
pvtodorov opened this issue Sep 29, 2021 · 16 comments
Closed

custom gene set? #2

pvtodorov opened this issue Sep 29, 2021 · 16 comments

Comments

@pvtodorov
Copy link

Hi!

I'm trying to figure out how to generate a custom geneset file using MAGMA that could plug into scDRS. I found the following text within the manuscript

We use MAGMA20 502 to compute gene-level association p-values from disease
503 GWAS summary statistics (Box 1, step 1). We use a reference panel based on individuals of European ancestry in the 1000 Genomes Project97 504 . We use a 10-kb window around the gene body to map SNPs to genes. We select the top 1,000 genes based 505 on MAGMA p-values as putative disease genes.

Could you provide an example of the MAGMA inputs and commands that performs this?

@KangchengHou
Copy link
Collaborator

Thanks for your interest!

I attach a guide to compute MAGMA gene sets from GWAS summary statistics:

# Step 1: download MAGMA software, gene location file, and reference data from
# https://ctg.cncr.nl/software/magma after this step, one should have a folder <MAGMA_DIR>
# with the following files:
# 1) <MAGMA_DIR>/magma 2) <MAGMA_DIR>/g1000_eur.(bed|bim|fam) 3) <MAGMA_DIR>/NCBI37.3.gene.loc

magma_dir=<MAGMA_DIR>

# Step 2: make gene annotation file for MAGMA using the following command, this only needs to be done
# once for different GWAS summary statistics, and the file will be saved to out/step1.genes.annot
mkdir -p out/step1
${magma_dir}/magma \
    --annotate window=10,10 \
    --snp-loc ${magma_dir}/g1000_eur.bim \
    --gene-loc ${magma_dir}/NCBI37.3.gene.loc \
    --out out/step1

# Step 3: run MAGMA using the following command, this takes a GWAS file ${trait}.pval,
# which at least has the columns: SNP, P, N, which corresponds to the SNP id
# (matched to the ${magma_dir}/g1000_eur.bim), p-value, sample size. For example,
# <trait>.pval file looks like
#
# CHR     BP      SNP             P           N
# 1       717587  rs144155419     0.453345    279949
# 1       740284  rs61770167      0.921906    282079
# 1       769223  rs60320384      0.059349    281744
#
# After this step, one should obtain a file out/step2/${trait}.gene.out, and the top genes with
# largest Z-scores can be input to scDRS.

mkdir -p out/step2
${magma_dir}/magma \
    --bfile ${magma_dir}/g1000_eur \
    --pval ${trait}.pval use='SNP,P' ncol='N' \
    --gene-annot out/step1.genes.annot \
    --out out/step2/${trait}

We will update our document correspondingly.

@pvtodorov
Copy link
Author

Wow! Awesome & thank you for quick response 🚀

@jaredslosberg
Copy link

Hi Kangcheng. Do you have any documentation on your .sumstats files? Specifically, are the CHISQ and Z columns from the GWAS, or an output of a step within scDRS?

Here is the first few lines of PASS_IBD_deLange2017.sumstats, for example:

SNP A1 A2 N CHISQ Z
rs7899632 A G 59957 3.4299 -1.852
rs3750595 A C 59957 3.3124 1.82
rs10786405 T C 59957 3.4559 1.859
rs3793692 A G 59957 0.119 -0.345
rs737657 A G 59957 0.0912 -0.302
rs17524355 T C 59957 4.02 -2.005
rs7086391 T C 59957 3.7249 -1.93

@martinjzhang
Copy link
Owner

Hi @jaredslosberg

Those are from GWAS and follow the LDSC format https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format

Let us know if you have further questions/suggestions.

Best,
Martin

@twoneu
Copy link

twoneu commented Jul 27, 2022

Hi @martinjzhang! Thank you for this awesome package.

Do you have a recommended method for converting the Entrez IDs produced by MAGMA to gene symbols? It seems that the compute-score tool does not recognize the Entrez IDs (First 3 elements for 'TRAIT': [], []) and returns a segmentation fault.

This is what my geneset looks like after following the tutorial above:
image

@KangchengHou
Copy link
Collaborator

@twoneu MAGMA's NCBI37.3.gene.loc (e.g., see https://ctg.cncr.nl/software/MAGMA/aux_files/NCBI37.3.zip) has 1st column as entrez IDs and last column as gene symbol.

Also adding that the gene symbols in the .gs file should be consistent with var_names in AnnData.

@twoneu
Copy link

twoneu commented Jul 30, 2022

Hi Kangcheng. Do you have any documentation on your .sumstats files? Specifically, are the CHISQ and Z columns from the GWAS, or an output of a step within scDRS?

Here is the first few lines of PASS_IBD_deLange2017.sumstats, for example:

SNP A1 A2 N CHISQ Z rs7899632 A G 59957 3.4299 -1.852 rs3750595 A C 59957 3.3124 1.82 rs10786405 T C 59957 3.4559 1.859 rs3793692 A G 59957 0.119 -0.345 rs737657 A G 59957 0.0912 -0.302 rs17524355 T C 59957 4.02 -2.005 rs7086391 T C 59957 3.7249 -1.93

Sorry, another question - how would you recommend that we generate genesets using MAGMA for summary statistics like these with no P-value column?

Thank you so much for your time!

@KangchengHou
Copy link
Collaborator

@twoneu are there other columns containing information needed to derive a p-value? We can help if you provide us information about columns you have. Also see MungeSumstats which help standardize sumstats.

@twoneu
Copy link

twoneu commented Jul 30, 2022

The only columns are SNP, A1, A2, N, Z, and CHISQ (see PASS_IBD_deLange2017.sumstats as an example). Mungesumstats seems to require a P value to run as well.

@KangchengHou
Copy link
Collaborator

KangchengHou commented Jul 31, 2022

@twoneu You can use the following function to convert z-score to p-value.

import scipy.stats
import numpy as np
# zscore can be a numpy array
pval = 2 * (1 - scipy.stats.norm.cdf(np.abs(zscore)))

You need to manually add a column to the sumstats file.

UPDATE NOTE: My previous reply pval = scdrs.util.zsc2pval(zsc) is for computing one-sided p-value. But in GWAS we typically do two-sided p-value (as in the updated reply).

@KangchengHou
Copy link
Collaborator

@twoneu please see the updated reply above (i made a mistake earlier: we should compute two-sided p-value instead one-sided p-value to convert GWAS z-score to p-value)

@pariaaliour
Copy link

Thanks for your interest!

I attach a guide to compute MAGMA gene sets from GWAS summary statistics:

# Step 1: download MAGMA software, gene location file, and reference data from
# https://ctg.cncr.nl/software/magma after this step, one should have a folder <MAGMA_DIR>
# with the following files:
# 1) <MAGMA_DIR>/magma 2) <MAGMA_DIR>/g1000_eur.(bed|bim|fam) 3) <MAGMA_DIR>/NCBI37.3.gene.loc

magma_dir=<MAGMA_DIR>

# Step 2: make gene annotation file for MAGMA using the following command, this only needs to be done
# once for different GWAS summary statistics, and the file will be saved to out/step1.genes.annot
mkdir -p out/step1
${magma_dir}/magma \
    --annotate window=10,10 \
    --snp-loc ${magma_dir}/g1000_eur.bim \
    --gene-loc ${magma_dir}/NCBI37.3.gene.loc \
    --out out/step1

# Step 3: run MAGMA using the following command, this takes a GWAS file ${trait}.pval,
# which at least has the columns: SNP, P, N, which corresponds to the SNP id
# (matched to the ${magma_dir}/g1000_eur.bim), p-value, sample size. For example,
# <trait>.pval file looks like
#
# CHR     BP      SNP             P           N
# 1       717587  rs144155419     0.453345    279949
# 1       740284  rs61770167      0.921906    282079
# 1       769223  rs60320384      0.059349    281744
#
# After this step, one should obtain a file out/step2/${trait}.gene.out, and the top genes with
# largest Z-scores can be input to scDRS.

mkdir -p out/step2
${magma_dir}/magma \
    --bfile ${magma_dir}/g1000_eur \
    --pval ${trait}.pval use='SNP,P' ncol='N' \
    --gene-annot out/step1.genes.annot \
    --out out/step2/${trait}

We will update our document correspondingly.

Thank you so much for this post!
I got a question and appreciate it if you could help me with this; I'm wondering in the second script as I have three files (bim, fam, bed) separately how do I put it here --bfile ${magma_dir}/g1000_eur. I tried to combined them in one file using cat but I keep getting this error: ERROR - parsing arguments: file '/home/paria/scratch/als-project/analysis/ocu_med_sc/rpca/scDRS/MAGMA/g1000_eur.bed' specified in --bfile does not exist or is not a file.
I would like to know how to make the g1000_eur including all three .bed, .bim, .fam.
Many thanks,
Paria

@KangchengHou
Copy link
Collaborator

Hi,

You should keep the three files as is: g1000_eur.bed / g1000_eur.bim / g1000_eur.fam

--bfile corresponds to a prefix to the three files

@pariaaliour
Copy link

Hi,

You should keep the three files as is: g1000_eur.bed / g1000_eur.bim / g1000_eur.fam

--bfile corresponds to a prefix to the three files

Perfect, thanks for the swift response! I have one more questions if you don't mind.
First, based on step 2, the --snp-loc is ${magma_dir}/g1000_eur.bim \ but in MAGMA tutorial (https://ctg.cncr.nl/software/MAGMA/doc/manual_v1.10.pdf) it says the --snp-loc could be provided by user. I was wondering if I have a gwas should the --snp-loc in the step2 --bfile in step 3 be the gwas?
Thanks,
Paria

@KangchengHou
Copy link
Collaborator

Hi,

--snp-loc should be --snp-loc ${magma_dir}/g1000_eur.bim

GWAS should be --pval in step 3

@pariaaliour
Copy link

Perfect, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants