- What is DeepArk?
- The DeepArk server
- Running DeepArk locally
- Frequently asked questions
- Citing DeepArk
- Related tools
DeepArk is a set of models of the worm, fish, fly, and mouse regulatory codes. For each of these organisms, we constructed a deep convolutional neural network that predicts regulatory activities (i.e. histone modifications, transcription factor binding, and chromatin state) directly from genomic sequences. Besides accurately predicting a sequence's regulatory activity, DeepArk can predict the effects of variants on regulatory function and profile a sequence's regulatory potential with in silico saturated mutagenesis. If you are a researcher with no programming experience or access to GPUs, please take a look at our free and user-friendly GPU-accelerated webserver. We also provide instructions for running DeepArk on your own computer.
Like most methods using deep learning, DeepArk is designed to run on graphics processing units (GPUs). However, we did not intend for DeepArk to only be used by researchers with access to high-end GPU clusters. To lower this barrier, we are publicly hosting DeepArk on a free GPU-accelerated server here. Documentation and guidelines for using the server may be found here. If you need to make a large number (e.g. hundreds of thousands) of predictions with DeepArk, we recommend doing so on your local machine. Instructions on how to do this are provided below.
This repository is all you should need to run DeepArk on your own computer. The following subsections describe how to install DeepArk locally and use it to predict the regulatory activity of genomic sequences, predict the regulatory effects of variants, and profile sequences with in silico saturated mutagenesis.
To install DeepArk locally you should first clone this repository as follows:
git clone https://github.com/FunctionLab/DeepArk.git
cd DeepArk
We recommend managing DeepArk's dependencies with a conda environment as follows:
conda env create -f environment.yml
conda activate DeepArk
The default conda environment uses PyTorch and CUDA.
If you do not have access to a CUDA-enabled GPU, we recommend using the GPU-accelerated DeepArk webserver.
However, if you cannot use the webserver either, we have also included a CPU-only conda environment in cpu_environment.yml.
After downloading DeepArk, you will need to download the weights for the network.
You can download all of the weights by running the download_weights.sh script as follows:
./download_weights.sh
Alternatively, you can download the weights for a subset of the species as follows:
./download_weights.sh caenorhabditis_elegans danio_rerio
To start using DeepArk, simply run the DeepArk.py script with python.
The model.py file contains code to build the DeepArk model in python.
The checkpoints to use with DeepArk are included in the data directory as *.pth.tar files.
The checkpoints for worm, fish, fly, and mouse are saved in caenorhabditis_elegans.pth.tar, danio_rerio.pth.tar, drosophila_melanogaster.pth.tar, and mus_musculus.pth.tar respectively.
It is worth noting that these checkpoint files are slightly different from the ones produced by training the models with Selene, since we have included the arguments required to construct each model object.
Information on each feature predicted by each model can be found in the *.tsv files in the data directory (e.g. mus_musculus.tsv and so on).
These feature information files are described further in this section of the FAQ.
We describe each use case for DeepArk in further detail in the sections below, but the general syntax for running a command with DeepArk is as follows:
python DeepArk.py [COMMAND] [OPTIONS]
You can find a listing of commands and some general usage information by using the following command:
python DeepArk.py --help
You can also find command-specific usage information like so:
python DeepArk.py [COMMAND] --help
Predicting the regulatory activity of a genomic sequence with a DeepArk model is the most straightforward way to use DeepArk. To do so, you only need a DeepArk model checkpoint and a FASTA file with the sequences you would like to make predictions for. Note that the sequences in the FASTA file should be 4095 bases long. Below is an example showing how to use DeepArk for prediction.
python DeepArk.py predict \
--checkpoint-file 'data/caenorhabditis_elegans.pth.tar' \
--input-file 'examples/caenorhabditis_elegans_prediction_example.fasta' \
--output-dir './' \
--output-format 'tsv' \
--batch-size '64'
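Because the input sequences are expected to be 4095 bases long, it can be useful to check a FASTA file before running a long prediction job. The following is a minimal sketch of such a check; it is not part of DeepArk, and the helper names `read_fasta` and `find_wrong_lengths` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

EXPECTED_LENGTH = 4095  # input length stated in this README


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_wrong_lengths(
    lines: Iterable[str], expected: int = EXPECTED_LENGTH
) -> Dict[str, int]:
    """Return {record name: length} for records whose length is not `expected`."""
    return {
        name: len(seq)
        for name, seq in read_fasta(lines)
        if len(seq) != expected
    }
```

To check a file, pass an open file handle, for example `find_wrong_lengths(open("my_sequences.fasta"))`, where the file name is a placeholder. An empty result means every record has the expected length.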
Instead of a FASTA file with 4095 base pair sequences, you can alternatively provide DeepArk with a BED file specifying regions in a reference genome. If using DeepArk with a BED file, you must include a FASTA file specifying the reference genome sequence to use with it. Additional information about where to find a FASTA file for a reference genome is included in this section below. We include an example of this usage below.
python DeepArk.py predict \
--checkpoint-file 'data/caenorhabditis_elegans.pth.tar' \
--input-file 'examples/caenorhabditis_elegans_prediction_example.bed' \
--genome-file 'ce11.fa' \
--output-dir './' \
--output-format 'tsv' \
--batch-size '64'
Because of its size, the reference FASTA ce11.fa must be downloaded separately from this repository.
More information about where to download this file can be found here.
Finally, further information about predict and its arguments may be found using the following invocation of DeepArk.py:
python DeepArk.py predict --help
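If you want to launch predict from a Python script (for example, to loop over several input files), one option is to assemble the command as an argument list. The sketch below uses only the flags shown in the examples above; the function name `build_predict_command` is hypothetical, and its default values simply mirror those examples rather than DeepArk's own defaults.

```python
import shlex
from typing import List, Optional


def build_predict_command(
    checkpoint_file: str,
    input_file: str,
    output_dir: str = "./",
    output_format: str = "tsv",
    batch_size: int = 64,
    genome_file: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `python DeepArk.py predict`."""
    command = [
        "python", "DeepArk.py", "predict",
        "--checkpoint-file", checkpoint_file,
        "--input-file", input_file,
        "--output-dir", output_dir,
        "--output-format", output_format,
        "--batch-size", str(batch_size),
    ]
    # --genome-file is only needed when the input is a BED file.
    if genome_file is not None:
        command += ["--genome-file", genome_file]
    return command


command = build_predict_command(
    "data/caenorhabditis_elegans.pth.tar",
    "examples/caenorhabditis_elegans_prediction_example.bed",
    genome_file="ce11.fa",
)
print(shlex.join(command))  # shell-quoted form of the command
# To run it from the repository root:
# import subprocess; subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.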
To predict the effects of variants with DeepArk, we simply compare the predicted probabilities for the reference sequence to those for the mutated sequence containing the variant. To make predictions for variants, you will need a DeepArk model checkpoint, a VCF file with your variants, and a FASTA file with the reference genome sequence. We show an example invocation below.
python DeepArk.py vep \
--checkpoint-file 'data/mus_musculus.pth.tar' \
--input-file 'examples/mus_musculus_vep_example.vcf' \
--genome-file 'mm10.fa' \
--output-dir './' \
--output-format 'tsv' \
--batch-size '64'
Because of its size, the reference FASTA mm10.fa must be downloaded separately from this repository.
More information about where to download this file can be found here.
Finally, additional information about each argument for vep can be found using the following command:
python DeepArk.py vep --help
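To illustrate the comparison between reference and mutated sequences, the sketch below computes a per-feature difference between two prediction vectors. It assumes the effect is summarized as the alternate probability minus the reference probability; the exact score that vep writes is not specified here, so treat this as one possible summary rather than DeepArk's output format. The function name and the numbers are made up.

```python
from typing import List, Sequence


def variant_effects(
    ref_probs: Sequence[float], alt_probs: Sequence[float]
) -> List[float]:
    """Return alt - ref per feature; positive means the variant raises the prediction."""
    if len(ref_probs) != len(alt_probs):
        raise ValueError("reference and alternate predictions must have the same length")
    return [alt - ref for ref, alt in zip(ref_probs, alt_probs)]


# Made-up probabilities for three features, for illustration only.
ref = [0.10, 0.80, 0.50]
alt = [0.40, 0.30, 0.50]
effects = variant_effects(ref, alt)

# Index of the feature whose prediction changes the most in either direction.
most_affected = max(range(len(effects)), key=lambda i: abs(effects[i]))
```

The index in `most_affected` can be looked up in the feature information files in the data directory to see which experiment it corresponds to.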
In silico saturated mutagenesis (ISSM) allows us to profile the regulatory potential of sequences by predicting the effects of all possible mutations in that sequence. Note that ISSM generates roughly 17400 predictions per sequence, so it is much slower than the other prediction methods. To profile sequences with ISSM, you will need a DeepArk model checkpoint and a FASTA file with at least one entry in it. Note that the sequences in the FASTA file should be 4095 bases long. We show an example invocation of the issm command below.
python DeepArk.py issm \
--checkpoint-file 'data/drosophila_melanogaster.pth.tar' \
--input-file 'examples/drosophila_melanogaster_issm_example.fasta' \
--output-dir './' \
--output-format 'tsv' \
--batch-size '64'
Additional information regarding issm and its arguments may be found with the following command:
python DeepArk.py issm --help
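To give a sense of what saturated mutagenesis involves, the sketch below enumerates every single-base substitution of a short sequence. It is an illustration of the general idea only, assuming an A/C/G/T alphabet. The exact set of mutations DeepArk evaluates is not spelled out here, so the number of mutants this sketch produces should not be read as DeepArk's prediction count. The function name `single_substitutions` is hypothetical.

```python
from typing import Iterator, Tuple

BASES = "ACGT"


def single_substitutions(sequence: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref_base, alt_base, mutated_sequence) for each substitution."""
    sequence = sequence.upper()
    for position, ref_base in enumerate(sequence):
        for alt_base in BASES:
            if alt_base == ref_base:
                continue  # skip the unmutated base
            mutated = sequence[:position] + alt_base + sequence[position + 1:]
            yield position, ref_base, alt_base, mutated


# A three-base toy sequence has three alternatives at each of its three positions.
mutants = list(single_substitutions("ACG"))
```

Each mutated sequence would then be scored by the model and compared to the prediction for the unmutated sequence, which is why the cost grows with sequence length.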
- What regulatory features are predicted by each DeepArk model?
- Where can I download reference genomes to use with DeepArk?
- How do I force DeepArk to use or ignore my GPU?
- How can I leverage multiple GPUs with DeepArk?
- How do I set the number of threads used when I run DeepArk without a GPU?
- How can I speed up in silico saturated mutagenesis?
- How did you train DeepArk?
- How accurate is DeepArk?
- How do I cite DeepArk?
- Why are DeepArk's checkpoints different from Selene's?
- How can I binarize the probability predictions from DeepArk?
The features predicted by each model are included in the *.tsv files in the data directory.
The information for worm, fish, fly, and mouse is stored in caenorhabditis_elegans.tsv, danio_rerio.tsv, drosophila_melanogaster.tsv, and mus_musculus.tsv respectively.
For a given row in these files, the index column specifies the corresponding entry in the DeepArk model output prediction vector.
These index values start at zero.
All of the information and metadata regarding the experiments was sourced from ChIP-atlas.
Additional information about data from ChIP-atlas can be found here.
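As a sketch of how the index column can be used, the code below loads a feature table into a dictionary keyed by output index and looks up the metadata for the highest-scoring output. It assumes the file is tab-separated with a header row containing index. The name column, its values, and the prediction vector are hypothetical stand-ins.

```python
import csv
import io
from typing import Dict, Iterable


def load_feature_table(lines: Iterable[str]) -> Dict[int, Dict[str, str]]:
    """Map each zero-based output index to its row of metadata."""
    reader = csv.DictReader(lines, delimiter="\t")
    return {int(row["index"]): row for row in reader}


# Tiny stand-in for a file such as data/mus_musculus.tsv; the `name` column
# and its values are hypothetical, only `index` is documented above.
example = io.StringIO("index\tname\n0\tfeature_a\n1\tfeature_b\n")
features = load_feature_table(example)

predictions = [0.12, 0.93]  # made-up model output vector
best = max(range(len(predictions)), key=predictions.__getitem__)
print(features[best]["name"])  # metadata for the highest-scoring output
```

For a real file, replace the `io.StringIO` object with `open("data/mus_musculus.tsv", newline="")`.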
There are many possible sources for reference genomes. For most cases, we recommend downloading genomes from RefSeq, ENSEMBL, or the UCSC genome browser.
In some situations you may be running DeepArk with CUDA-enabled PyTorch on a machine with a GPU, but not want DeepArk to use that GPU.
Conversely, you may want DeepArk to crash if it cannot use a GPU.
This behavior can be achieved by explicitly specifying whether DeepArk should use CUDA or not.
To force DeepArk to use or ignore the GPU, set the --cuda or --no-cuda flag during the invocation of any command.
To demonstrate this, we modify the vep example from above to not use the GPU as follows:
python DeepArk.py vep \
--checkpoint-file 'data/mus_musculus.pth.tar' \
--input-file 'examples/mus_musculus_vep_example.vcf' \
--genome-file 'mm10.fa' \
--output-dir './' \
--output-format 'tsv' \
--batch-size '64' \
--no-cuda
If you do not explicitly specify whether to use a GPU or not, DeepArk will use torch.cuda.is_available to decide.
If it returns True, then DeepArk will use the GPU.
Otherwise, DeepArk will not attempt to leverage a GPU.
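The decision rule described above can be sketched as a small function. This is an illustration of the documented behavior, not DeepArk's actual code: the function name `resolve_use_cuda` and the error message are hypothetical, and the availability check is passed in as a callable so that the sketch runs without PyTorch installed.

```python
from typing import Callable, Optional


def resolve_use_cuda(flag: Optional[bool], is_available: Callable[[], bool]) -> bool:
    """Decide whether to use the GPU.

    `flag` is True for --cuda, False for --no-cuda, and None when neither is given.
    """
    if flag is None:
        # No explicit choice: fall back to whatever is available.
        return is_available()
    if flag and not is_available():
        # An explicit --cuda request fails loudly instead of silently using the CPU.
        raise RuntimeError("--cuda was requested but no CUDA device is available")
    return flag
```

With PyTorch installed, the call would look like `resolve_use_cuda(None, torch.cuda.is_available)`.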
Running DeepArk on multiple GPUs in parallel is straightforward.
To toggle whether DeepArk should leverage multiple GPUs, simply specify the --data-parallel or --no-data-parallel flag.
This will toggle batch-level data parallelism on and off respectively.
We modify the issm example from above to use data parallelism as follows:
python DeepArk.py issm \
--checkpoint-file 'data/drosophila_melanogaster.pth.tar' \
--input-file 'examples/drosophila_melanogaster_issm_example.fasta' \
--output-dir './' \
--output-format 'tsv' \
--batch-size '64' \
--data-parallel
If you do not have more than one GPU available, then toggling data parallelism is unlikely to improve DeepArk's runtime performance.
If you are using DeepArk without a GPU, you may want to alter the number of threads being used by PyTorch.
To do so, simply set the --n-threads argument to DeepArk.py.
This sets the number of PyTorch threads in a call to torch.set_num_threads.
As a demonstration, we modify the predict example from above to use 16 threads as follows:
python DeepArk.py predict \
--checkpoint-file 'data/caenorhabditis_elegans.pth.tar' \
--input-file 'examples/caenorhabditis_elegans_prediction_example.fasta' \
--output-dir './' \
--output-format 'tsv' \
--batch-size '64' \
--n-threads '16'
In silico saturated mutagenesis (ISSM) is generally the slowest process for DeepArk, in part because it is making far more predictions (i.e. roughly 17400 predictions per input sequence) than the other methods. Consequently, ISSM will generally take longer to write its output to file than other methods. A simple way to speed up ISSM runtime is to write predictions to HDF5 files instead of TSV files. We also recommend using a GPU when running ISSM. If ISSM appears to be running slowly when using the GPU, make sure to force DeepArk to crash if it cannot access said GPU by explicitly specifying CUDA use. If ISSM is too slow on a single GPU, you may want to consider using multiple GPUs. If you do not have access to a GPU, you can use the GPU-accelerated DeepArk webserver to run your ISSM experiments.
DeepArk was trained using Selene, our PyTorch-based library for developing deep learning models of biological sequences. All training details, such as model hyperparameters, will be described in a forthcoming manuscript.
DeepArk is quite accurate, and we are currently quantifying performance on a rigorous benchmark. Details regarding performance will be thoroughly discussed in a forthcoming manuscript.
If you use the DeepArk webserver or run DeepArk locally, we ask that you cite DeepArk. Specific instructions on citing DeepArk can be found in this section.
To simplify DeepArk's use, we have included the constructor arguments for the model in the checkpoint files.
We also removed information that was not relevant to model inference (e.g. the minimum loss during training).
This allows us to distribute the model as two files: the model.py file and the weights file.
As a result, these checkpoints differ from the ones generated by Selene.
To convert a checkpoint from Selene for use with DeepArk, use the convert_checkpoint.py script.
Documentation for this script can be accessed via python scripts/convert_checkpoint.py --help.
DeepArk's output predictions are probabilities, rather than binary labels.
For an input genomic sequence, a higher predicted probability for a given regulatory feature means that DeepArk is more confident that the feature is present in that sequence.
Thus, a zero would indicate that DeepArk has no confidence that a regulatory feature occurs in the given input sequence, and a one would indicate that DeepArk has total confidence that the regulatory feature is there.
This is useful for most scenarios, because it gives you the sensitivity to compare even subtle differences in regulatory activity between sequences.
However, there are scenarios where DeepArk's outputs need to be binarized (i.e. the regulatory feature must be active or inactive).
There are also cases where the user is applying DeepArk at a large scale, and wants to get some guarantee on detection performance (e.g. recall above a certain threshold).
Importantly, one must use different probability cutoffs for each regulatory feature predicted by DeepArk.
To make this easy, we have compiled performance (i.e. recall, false positive rate) and the probability cutoffs for a number of useful recall thresholds, and stored them in data/probability_thresholds.tsv.
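Applying a separate cutoff to each feature can be sketched as follows. The code assumes a feature is called active when its probability is at or above its cutoff, which is a convention chosen for this example. The cutoffs and probabilities are made up, and in practice the cutoffs would be read from the thresholds file, whose column layout is not described here.

```python
from typing import List, Sequence


def binarize(probabilities: Sequence[float], cutoffs: Sequence[float]) -> List[bool]:
    """Call feature i active when its probability is at or above its own cutoff."""
    if len(probabilities) != len(cutoffs):
        raise ValueError("need exactly one cutoff per predicted feature")
    return [p >= c for p, c in zip(probabilities, cutoffs)]


# Made-up numbers: the same probability (0.30) is active for the first
# feature and inactive for the second, because their cutoffs differ.
probabilities = [0.30, 0.30, 0.90]
cutoffs = [0.25, 0.60, 0.95]
calls = binarize(probabilities, cutoffs)
```

The example shows why a single global cutoff would be misleading: identical probabilities can lead to different calls depending on the feature.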
If you use DeepArk in your publication, please cite it. We include a BibTeX citation below.
@article{DeepArk,
author = {Evan M Cofer and Jo{\~{a}}o Raimundo and Alicja Tadych and Yuji Yamazaki and Aaron K Wong and Chandra L Theesfeld and Michael S Levine and Olga G Troyanskaya},
title = {{DeepArk}: modeling \textit{cis}-regulatory codes of model species with deep learning},
doi = {10.1101/2020.04.23.058040},
url = {https://doi.org/10.1101/2020.04.23.058040},
year = {2020},
month = apr,
journal = {bioRxiv}
}
Please check out Selene, our library for developing sequence-based deep learning models in PyTorch. Our paper on Selene is available in Nature Methods or as a preprint here.