NEW VERSION AVAILABLE! Please utilise https://github.com/UoS-HGIG/GenePy-1.4
GenePy v1.2 a score for the analysis of next generation sequencing data.
BMC Bioinformatics publication: https://doi.org/10.1186/s12859-019-2877-3
The current (1.2) version of GenePy has the following improvements:
- GenePy can include variants from non-coding regions
- annotation no longer limited to non-synonymous single-nucleotide variants (dbnsfp33a)
- annotation is based on whole genome metrics (CADD 1.3, DANN, GWAVA, EIGEN and REVEL) for all missense variants
- population frequencies are now obtained from gnomAD instead of 1000 Genomes Prj.
- annotations are based on hg19 but can be switched to HG38 when available from ANNOVAR
For historic reference the older GenePy version is available, but support is no longer given.
- A (multi)sample VCF file (can accept compressed vcf.gz)
- List of genes for which generate GenePy scores. (gene.list)
- Vcftools
- Annovar
- Python 2.7.x
Before running GenePy, we need to annotate SNVs and generate a GenePy-ready file (ALL_genepy.meta)
The first required input to GenePy is a multi-sample VCF (GENOTYPED_ALL.vcf.gz in this example).
We recommend the following filters to improve score consistency based on Tom et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5525370/pdf/12859_2017_Article_1756.pdf ).
vcftools --gzvcf GENOTYPED_ALL.vcf.gz --minGQ 20 --recode --out GENO_FILT_GQ20 # Set GQ<20 as missing ( retain high confidence calls)
vcftools --vcf GENO_FILT_GQ20.vcf --max-missing 0.7 --out filtered # Remove SNVs with missing rate >30%
vcftools --vcf filtered.recode.vcf --min-alleles 2 --max-alleles 2 # Keep only Biallelic SNVs
./annovar/convert2annovar.pl \
-format vcf4 FINAL_GQ20M30_BIALL.recode.vcf.gz \
-outfile ALL_genepy.input \
-allsample \
-withfreq \
-include 2>annovar.log
./annovar/table_annovar.pl \
ALL_genepy.input \
./annovar/humandb/ \
-buildver hg19 \
-out ALL_genepy \
-remove \
-protocol refGene,gnomad_exome,cadd13,eigen,revel,gwava,dann \
-operation g,f,f,f,f,f,f \
--thread 40 \
--maxgenethread 40 \
-nastring . >>annovar.log
cut -f 18- ALL_genepy.input > a1
zgrep '^#CHR' FINAL_GQ20M30_BIALL.recode.vcf.gz | cut -f 10- > b1
cat b1 a1 > ALL_temp
paste ALL_genepy.hg19_multianno.txt ALL_temp > ALL_genepy.meta
rm a1 b1 ALL_temp
make new folders in your current directory to store raw GenePy score files
mkdir CADD13_RawScore Eigen GWAVA_region_score GWAVA_tss_score dann REVEL
Take the header from the ALL_genepy.meta file and stores it in a newly created header file
grep "^Chr" ALL_genepy.meta> header
Once the ALL_genepy.meta file is created, GenePy_1.2.sh can be run by simply iterating through the list of deisred genes. Be aware, the make_scores_mat_5.py file must be in the same directory of GenePy_1.2.sh.
while read gene:
do
sh GenePy_1.2.sh $gene ;
done< gene.list
We strongly reccomend to combine GenePy output files (one per gene) into a single matrix where each column is a gene and each row is a sample.
In order to compare/combine GenePy scores across genes, normalisation by length is required. Length depends on the BED file used to generate the VCF file. To correct by gene length please refer to extract_gene_size.py script (requires GenePy scores formatted in a matrix as previously explained).
We also strongly reccomend to normalise GenePy scores by GDI score (http://lab.rockefeller.edu/casanova/GDI) using scirpt gdi_scale.py