This repository contains Python and shell scripts that test Illumina/SpliceAI to check how well it actually predicts splice sites.
The project simply runs SpliceAI on all alternative isoforms from GENCODE v19.

It requires:
- a Python 2 interpreter satisfying the requirements specified in setup.py
- a Python 3 interpreter with minwoooj/lab-modules installed

I therefore recommend using conda virtual environments.

Original source code: https://basespace.illumina.com/s/5u6ThOblecrh

Additional source code and modifications to the original code are recorded in the commit history.
The following description is the original README.md of Illumina/SpliceAI.
This package annotates genetic variants with their predicted effect on splicing, as described in Jaganathan et al, Cell 2019 in press.
The simplest way to install SpliceAI is through pip:

    pip install spliceai

Alternatively, SpliceAI can be installed from the GitHub repository:

    git clone https://github.com/Illumina/SpliceAI.git
    cd SpliceAI
    python setup.py install

SpliceAI requires tensorflow>=1.2.0, which is best installed separately via pip: `pip install tensorflow`. See the TensorFlow website for other installation options.
SpliceAI can be run from the command line:

    spliceai -I input.vcf -O output.vcf -R genome.fa -A annotations.txt
    # or you can pipe the input and output VCFs
    cat input.vcf | spliceai -R genome.fa -A annotations.txt > output.vcf

Options:
- -I: Input VCF with variants of interest.
- -O: Output VCF with SpliceAI predictions `SpliceAI=ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL` included in the INFO column (see table below for details). Only SNVs and simple INDELs (ref or alt must be a single base) within genes are annotated. Variants in multiple genes have separate predictions for each gene.
- -R: Reference genome fasta file.
- -A: Gene annotation file. Can instead provide `grch37` or `grch38` to use GENCODE canonical annotation files included with the package. To create custom annotation files, use `spliceai/annotations/grch37.txt` in repository as template.
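
When SpliceAI is run over many VCF files (for example, one per isoform set), it can help to assemble the command in a script. The following is a minimal sketch that uses only the four flags documented above; `build_spliceai_command` is a hypothetical helper name, not part of the SpliceAI package, and the file names are placeholders.

```python
import shlex


def build_spliceai_command(input_vcf, output_vcf, reference_fasta, annotation="grch37"):
    """Return the argument list for one SpliceAI run (hypothetical helper).

    Only the -I, -O, -R and -A flags described in the options list are used.
    """
    return [
        "spliceai",
        "-I", input_vcf,
        "-O", output_vcf,
        "-R", reference_fasta,
        "-A", annotation,
    ]


command = build_spliceai_command("input.vcf", "output.vcf", "genome.fa")
# Print a shell-safe rendering of the command for logging.
print(" ".join(shlex.quote(part) for part in command))
```

The list can then be passed to `subprocess.run(command, check=True)` in an environment where `spliceai` is installed.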
Note: The annotations for all possible SNVs within genes are available here for download.
Details of SpliceAI INFO field:
ID | Description |
---|---|
ALLELE | Alternate allele |
SYMBOL | Gene symbol |
DS_AG | Delta score (acceptor gain) |
DS_AL | Delta score (acceptor loss) |
DS_DG | Delta score (donor gain) |
DS_DL | Delta score (donor loss) |
DP_AG | Delta position (acceptor gain) |
DP_AL | Delta position (acceptor loss) |
DP_DG | Delta position (donor gain) |
DP_DL | Delta position (donor loss) |
The delta score of a variant ranges from 0 to 1 and can be interpreted as the probability that the variant is splice-altering. In the paper, a detailed characterization is provided for the 0.2 (high recall/likely pathogenic), 0.5 (recommended/pathogenic), and 0.8 (high precision/pathogenic) cutoffs. Delta position gives the location where splicing changes relative to the variant position (positive values are downstream of the variant, negative values are upstream).
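
As an illustration of how the three cutoffs might be applied, the sketch below labels a variant by its largest delta score. Taking the maximum of the four scores and using `>=` at each boundary are assumptions made for this example, and `label_delta_scores` is a hypothetical name; only the cutoff values themselves come from the text above.

```python
# Cutoffs from the paragraph above, checked from strictest to most lenient.
CUTOFFS = [
    (0.8, "high precision"),
    (0.5, "recommended"),
    (0.2, "high recall"),
]


def label_delta_scores(ds_ag, ds_al, ds_dg, ds_dl):
    """Return (largest delta score, label of the strictest cutoff it passes)."""
    top = max(ds_ag, ds_al, ds_dg, ds_dl)
    for cutoff, label in CUTOFFS:
        if top >= cutoff:  # assumption: boundaries are inclusive
            return top, label
    return top, "below all cutoffs"


print(label_delta_scores(0.22, 0.00, 0.91, 0.70))
```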
A sample input file and the corresponding output file can be found at `examples/input.vcf` and `examples/output.vcf` respectively (`grch37` annotation). The output `SpliceAI=T|RYR1|0.22|0.00|0.91|0.70|-107|-46|-2|90` for the variant `19:38958362 C>T` can be interpreted as follows:
- The probability that the position 19:38958255 is used as a splice acceptor increases by 0.22.
- The probability that the position 19:38958360 is used as a splice donor increases by 0.91.
- The probability that the position 19:38958452 is used as a splice donor decreases by 0.70.
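
The interpretation above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the SpliceAI package: `parse_spliceai` and `affected_positions` are hypothetical helpers, and the sketch assumes a single prediction per variant (variants in multiple genes carry one prediction per gene, which this example does not handle). Each affected position is the variant position plus the corresponding delta position.

```python
FIELDS = ["ALLELE", "SYMBOL",
          "DS_AG", "DS_AL", "DS_DG", "DS_DL",
          "DP_AG", "DP_AL", "DP_DG", "DP_DL"]


def parse_spliceai(annotation):
    """Split one 'SpliceAI=...' value into a dict with typed fields."""
    value = annotation.split("=", 1)[-1]      # drop the 'SpliceAI=' prefix if present
    record = dict(zip(FIELDS, value.split("|")))
    for key in FIELDS[2:6]:                   # delta scores are probabilities
        record[key] = float(record[key])
    for key in FIELDS[6:]:                    # delta positions are signed offsets
        record[key] = int(record[key])
    return record


def affected_positions(variant_pos, record):
    """Map each event (AG, AL, DG, DL) to variant position + delta position."""
    return {event: variant_pos + record["DP_" + event]
            for event in ("AG", "AL", "DG", "DL")}


record = parse_spliceai("SpliceAI=T|RYR1|0.22|0.00|0.91|0.70|-107|-46|-2|90")
positions = affected_positions(38958362, record)
print(positions["AG"], positions["DG"], positions["DL"])
# prints: 38958255 38958360 38958452
```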
Kishore Jaganathan: [email protected]