Drug Sensitivity Prediction From Cell Line-Based Pharmacogenomics Data: Guidelines for Developing Machine Learning Models
- Python 3
- Conda
To get the source files of PGx Guidelines, clone its repository:
git clone https://github.com/bhklab/PGx_Guidelines
All the packages required to run the PGx Guidelines experiments are specified in the environment subdirectory.
To install these packages run the following command:
conda env create -f PGx.yml
This command creates the PGxG environment.
After the packages have been installed into the environment, you can load them using conda activate.
All of the datasets used in the PGx Guidelines experiments are publicly available in the PSet format via the ORCESTRA platform:
https://www.orcestra.ca/pset/stats
After downloading the PSet objects, the molecular and pharmacological data can be extracted in R using the code provided in the Preprocess data subdirectory.
To load all datasets and the area above the dose-response curve (AAC) data, run LoadAllPSets.R.
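For readers unfamiliar with AAC, the following is a minimal Python sketch of the idea, not the computation performed by LoadAllPSets.R. It assumes viability is measured as a fraction in [0, 1], integrates the raw measurements with the trapezoidal rule on a log10 dose axis, and normalizes by the tested dose range. The function name and toy values are hypothetical; curve fitting, which pharmacogenomics tools often apply first, is omitted.

```python
import math


def area_above_curve(doses, viabilities):
    """Normalized area above a dose-response curve.

    doses: strictly increasing positive concentrations.
    viabilities: fraction of viable cells at each dose, in [0, 1].
    Returns a value in [0, 1]; larger means a more sensitive cell line.
    """
    if len(doses) != len(viabilities) or len(doses) < 2:
        raise ValueError("need at least two matching dose/viability points")
    x = [math.log10(d) for d in doses]
    # Trapezoidal area under the viability curve on the log10 dose axis.
    auc = sum(
        (x[i + 1] - x[i]) * (viabilities[i] + viabilities[i + 1]) / 2.0
        for i in range(len(x) - 1)
    )
    # Dividing by the dose range maps the area under the curve to [0, 1].
    return 1.0 - auc / (x[-1] - x[0])


# Hypothetical toy measurements (illustration only).
doses = [0.01, 0.1, 1.0, 10.0]
viabilities = [1.0, 0.9, 0.5, 0.1]
aac = area_above_curve(doses, viabilities)
```

A cell line whose viability stays at 1.0 across all doses gets an AAC of 0 under this definition.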
To load log-transformed and truncated IC50 values, run IC50Loading_logtruncated.R.
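As an illustration of what "log-transformed and truncated" can mean, here is a minimal Python sketch. The truncation bounds, the use of log10, and the order of operations (truncate first, then take the logarithm) are all assumptions made for this example; the actual choices are defined in IC50Loading_logtruncated.R.

```python
import math


def log_truncate_ic50(ic50, lower, upper):
    """Clamp an IC50 value to [lower, upper], then take log10.

    lower and upper are hypothetical bounds, for example the lowest and
    highest concentrations tested for the drug.
    """
    if not 0 < lower <= upper:
        raise ValueError("bounds must satisfy 0 < lower <= upper")
    clipped = min(max(ic50, lower), upper)
    return math.log10(clipped)


# Hypothetical raw IC50 values (illustration only).
raw_ic50 = [1e-9, 0.5, 1000.0]
transformed = [log_truncate_ic50(v, lower=0.001, upper=100.0) for v in raw_ic50]
```

Truncation keeps extrapolated IC50 values, which can fall far outside the tested dose range, from dominating the regression target.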
The tissueType_encoding.csv file is a one-hot encoding of tissue types, which is added to the molecular profiles to adjust for tissue type.
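The adjustment described above can be sketched in Python as follows. This is a dependency-free illustration, not the repository's code: the cell line names, tissue labels, feature values, and function names are all hypothetical, and the layout of tissueType_encoding.csv is not assumed.

```python
def one_hot_tissues(tissue_by_cell_line):
    """Return (sorted tissue names, {cell line: 0/1 indicator list})."""
    tissues = sorted(set(tissue_by_cell_line.values()))
    column = {tissue: i for i, tissue in enumerate(tissues)}
    encoded = {}
    for cell_line, tissue in tissue_by_cell_line.items():
        row = [0] * len(tissues)
        row[column[tissue]] = 1  # exactly one indicator is set per cell line
        encoded[cell_line] = row
    return tissues, encoded


def append_tissue_features(profiles, encoded):
    """Concatenate each molecular profile with its tissue indicator vector."""
    return {
        cell_line: list(features) + encoded[cell_line]
        for cell_line, features in profiles.items()
    }


# Hypothetical toy data (illustration only).
tissue_by_cell_line = {"CL1": "lung", "CL2": "breast", "CL3": "lung"}
profiles = {"CL1": [0.2, 1.5], "CL2": [0.7, 0.1], "CL3": [1.1, 0.4]}

tissues, encoded = one_hot_tissues(tissue_by_cell_line)
adjusted = append_tissue_features(profiles, encoded)
```

With the indicator columns appended, a model can absorb tissue-specific shifts in drug response instead of attributing them to molecular features.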
Running the R scripts generates the final datasets in .tsv format. Add them to a new subdirectory, Data_All:
mkdir Data_All
By creating this subdirectory and adding all the data files to it, you will be able to re-run PGx Guidelines experiments. Alternatively, we have also provided these preprocessed files on Zenodo.
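To check that the files landed in Data_All as expected, a short Python sketch like the one below can read them back. It uses only the standard library and assumes each .tsv file has a header row with one name per column; the function name is hypothetical, and files written from R with row names may need an extra leading header field.

```python
import csv
from pathlib import Path


def load_tsv_tables(data_dir="Data_All"):
    """Read every .tsv file in data_dir into a list of row dictionaries.

    Returns {file name without extension: [row dict, ...]}.
    Values are kept as strings; convert them as needed downstream.
    """
    tables = {}
    for path in sorted(Path(data_dir).glob("*.tsv")):
        with path.open(newline="") as handle:
            tables[path.stem] = list(csv.DictReader(handle, delimiter="\t"))
    return tables
```

Calling load_tsv_tables() from the repository root and printing the number of rows per table is a quick sanity check before launching the experiments.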
Each R script includes code to load the required libraries and datasets.
Simply run the following for:
- all [solid and non-solid] tissues:
Rscript biomarker_analysis_alltissues.R "$@"
- after excluding non-solid tissues:
Rscript biomarker_analysis_solidonly.R "$@"
- after excluding non-solid tissues, with log-transformed IC50 values:
Rscript biomarker_analysis_log.R "$@"
- after excluding non-solid tissues, with truncated IC50 values:
Rscript biomarker_analysis_truncated.R "$@"
- after excluding non-solid tissues, with truncated and log-transformed IC50 values:
Rscript biomarker_analysis_truncated_log.R "$@"
For this analysis, we have provided the following Python scripts:
- Ridge Regression: Within-Ridge-aac.py and Within-Ridge-ic50.py:
sbatch ridge-wjob-aac.bs
sbatch ridge-wjob-ic50.bs
- Elastic Net: Within-EN-aac.py and Within-EN-ic50.py:
sbatch en-wjob-aac.bs
sbatch en-wjob-ic50.bs
- Random Forest: Within-RF-aac.py and Within-RF-ic50.py:
sbatch rf-wjob-aac.bs
sbatch rf-wjob-ic50.bs
For this analysis, we have provided Jupyter notebooks to run Ridge Regression (Ridge.ipynb), Elastic Net (ElasticNet.ipynb), and Random Forest (RandomForest.ipynb). For the Deep Neural Network experiments, we have provided Python scripts in the DNN subdirectory. First, create directories to store logs, models, and results, then add your local paths to these directories to PGxGRun.bs:
mkdir logs
mkdir models
mkdir results
sbatch PGxGRun.bs
We have also provided randomly generated hyperparameter settings in filelistF10Uniquev1.
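For readers who want to generate their own settings, the sketch below shows one common way to draw unique random hyperparameter combinations in Python. It is not the procedure used to produce filelistF10Uniquev1, whose format is not described here; the search space, hyperparameter names, and function name are hypothetical.

```python
import random

# Hypothetical search space (illustration only).
SEARCH_SPACE = {
    "learning_rate": [0.1, 0.01, 0.001, 0.0001],
    "hidden_units": [64, 128, 256, 512],
    "dropout": [0.1, 0.3, 0.5],
    "batch_size": [32, 64, 128],
}


def sample_unique_settings(space, n_settings, seed=0):
    """Draw n_settings distinct random combinations from a discrete space."""
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    names = sorted(space)
    total = 1
    for name in names:
        total *= len(space[name])
    if n_settings > total:
        raise ValueError("search space has only %d distinct settings" % total)
    seen = set()
    settings = []
    while len(settings) < n_settings:
        candidate = tuple(rng.choice(space[name]) for name in names)
        if candidate not in seen:  # skip combinations that were already drawn
            seen.add(candidate)
            settings.append(dict(zip(names, candidate)))
    return settings


settings = sample_unique_settings(SEARCH_SPACE, n_settings=10)
```

Each resulting dictionary could then be written to one line of a settings file and passed to a training job.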
We have provided the model objects for the best settings of DNN experiments on Zenodo.
For this analysis, we have provided the Jupyter notebook GDSCv1.ipynb.
For this analysis, we have provided the Jupyter notebook SolidandnonSolid.ipynb. To run the random subset experiment, run the SNRidge-aac.py script:
python SNRidge-aac.py
author = {Sharifi-Noghabi, Hossein and Jahangiri-Tazehkand, Soheil and Smirnov, Petr and Hon, Casey and Mammoliti, Anthony and Nair, Sisira Kadambat and Mer, Arvind Singh and Ester, Martin and Haibe-Kains, Benjamin},
title = "{Drug sensitivity prediction from cell line-based pharmacogenomics data: guidelines for developing machine learning models}",
journal = {Briefings in Bioinformatics},
year = {2021},
month = {08},
issn = {1477-4054},
doi = {10.1093/bib/bbab294},
url = {https://doi.org/10.1093/bib/bbab294},
note = {bbab294},
eprint = {https://academic.oup.com/bib/advance-article-pdf/doi/10.1093/bib/bbab294/39679532/bbab294.pdf},
}