CarDEC (Count adapted regularized Deep Embedded Clustering) is a joint deep learning computational tool that is useful for analyses of single-cell RNA-seq data. The CarDEC method's repository can be found here.
This repository is dedicated to providing the code used to perform all evaluations in the CarDEC paper. It includes code used to generate results for CarDEC, and for every competing method:
- scVI
- DCA + ComBat
- MNN
- Scanorama
- scDeepCluster
It is recommended that the user proceed as follows.
- Clone this repository to a local machine.
- Download the data from Box.
- Install all necessary packages.
- Run all evaluations.
- Run R scripts to generate final plots.
Clone this repository to your local machine using the standard procedure.
Download the data from Box, and place the files into the currently empty data folder.
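Before moving on, it can help to confirm that the data folder is no longer empty. The following is a minimal sketch, not part of the repository: the helper name `data_folder_ready` is hypothetical, and it assumes the folder is named `data` and sits at the top level of the cloned repository. Hidden placeholder files (names starting with a dot) are ignored, which is also an assumption.

```python
from pathlib import Path


def data_folder_ready(repo_root, folder_name="data"):
    """Return True if the data folder exists and holds at least one visible entry.

    `folder_name` defaults to "data"; adjust it if your clone is laid out differently.
    """
    data_dir = Path(repo_root) / folder_name
    if not data_dir.is_dir():
        return False
    # Ignore hidden placeholder files such as ".gitkeep" (assumption).
    return any(not entry.name.startswith(".") for entry in data_dir.iterdir())
```

For example, `data_folder_ready("CarDEC_Codes")` should return `True` once the Box download has been copied into place. It does not verify that every expected dataset is present.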
The user will need to install several pieces of software: Anaconda, two conda environments containing many dependencies, and R version >= 4.0.
First, install Anaconda if you do not already have it, so that you can run conda commands in the terminal.
Next, use cardec.yml and cardec_alternatives.yml to set up the "cardec" and "cardec_alternatives" environments respectively.
To do this, cd into the cloned "CarDEC_Codes" repository. Once in this directory, run the following two commands.
$ conda env create -f cardec.yml
$ conda env create -f cardec_alternatives.yml
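To confirm that both environments were created, one option is to inspect the output of `conda env list --json`. The sketch below is an illustration only: the helper names `missing_envs` and `check_conda_envs` are hypothetical, and it assumes each environment's directory name matches its environment name ("cardec" and "cardec_alternatives").

```python
import json
import subprocess
from pathlib import Path

REQUIRED_ENVS = ("cardec", "cardec_alternatives")


def missing_envs(conda_env_list_json, required=REQUIRED_ENVS):
    """Return the required environment names absent from `conda env list --json` output."""
    env_paths = json.loads(conda_env_list_json).get("envs", [])
    # Assumption: the last path component of each entry is the environment name.
    found = {Path(p).name for p in env_paths}
    return [name for name in required if name not in found]


def check_conda_envs():
    """Query conda (which must be on PATH) and return any missing environments."""
    result = subprocess.run(
        ["conda", "env", "list", "--json"],
        capture_output=True, text=True, check=True,
    )
    return missing_envs(result.stdout)
```

An empty list from `check_conda_envs()` suggests both environments are available. Running `conda env list` in the terminal gives the same information in human-readable form.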
Lastly, install R. Version >= 4.0 is highly recommended. RStudio is also recommended, but not required.
Next, it is recommended that the user run all of the evaluation notebooks. The user should activate either the cardec or cardec_alternatives environment before opening Jupyter to run the Python notebooks. This is necessary because these two environments have "nb_conda_kernels" installed, which allows the user to switch between conda environments from within Jupyter. The following command will activate the cardec environment.
$ conda activate cardec
Then, open jupyter. The user can use either jupyter notebook or jupyter lab. The following command will open jupyter lab.
$ jupyter lab
It is recommended that the user first run the CarDEC notebooks. Open each of the following notebooks in Jupyter, set the active conda kernel to "cardec", and then run all cells. Repeat this for every notebook listed below.
- CarDEC Macaque.ipynb
- CarDEC Mouse Cortex.ipynb
- CarDEC Mouse Retina.ipynb
- CarDEC PBMC.ipynb
- CarDEC Pancreas.ipynb
- CarDEC Liver Runtime.ipynb
- CarDEC Mouse Cortex-SCT.ipynb
- CarDEC Pancreas Revisions.ipynb
- CarDEC_monocyte.ipynb
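As an alternative to opening each notebook by hand, the notebooks could be executed in batch with `jupyter nbconvert`. This is a sketch under stated assumptions, not the workflow used for the paper: the helper names are hypothetical, the notebooks are assumed to sit in the current working directory, and the kernel name `conda-env-cardec-py` is only a guess at how nb_conda_kernels registers the "cardec" environment. Check `jupyter kernelspec list` for the actual name on your machine.

```python
import subprocess

# Notebook names copied from the list above; their location is an assumption.
CARDEC_NOTEBOOKS = [
    "CarDEC Macaque.ipynb",
    "CarDEC Mouse Cortex.ipynb",
    "CarDEC Mouse Retina.ipynb",
    "CarDEC PBMC.ipynb",
    "CarDEC Pancreas.ipynb",
    "CarDEC Liver Runtime.ipynb",
    "CarDEC Mouse Cortex-SCT.ipynb",
    "CarDEC Pancreas Revisions.ipynb",
    "CarDEC_monocyte.ipynb",
]


def nbconvert_command(notebook, kernel_name):
    """Build the command that executes one notebook in place with the given kernel."""
    return [
        "jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace",
        "--ExecutePreprocessor.timeout=-1",  # disable the per-cell timeout
        f"--ExecutePreprocessor.kernel_name={kernel_name}",
        notebook,
    ]


def run_notebooks(notebooks, kernel_name):
    """Execute each notebook in order, stopping at the first failure."""
    for notebook in notebooks:
        subprocess.run(nbconvert_command(notebook, kernel_name), check=True)
```

A call such as `run_notebooks(CARDEC_NOTEBOOKS, "conda-env-cardec-py")` would then run the whole list. The same helper applies to the other notebook lists in this document by passing a different list and kernel name.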
Next, it is recommended that the user run all scripts to evaluate MNN. For each file in the list below, open R (or RStudio) and execute the script.
- MNN_Cortex.R
- MNN_Cortex_HVG.R
- MNN_Liver_Runtime.R
- MNN_PBMC.R
- MNN_PBMC_HVG.R
- MNN_Pancreas.R
- MNN_Pancreas_HVG.R
- MNN_Retina.R
- MNN_Retina_HVG.R
- MNN_sampleMacaque.R
- MNN_sampleMacaque_HVG.R
- MNN_monocyte.R
In the next step, the user should run the Python notebooks to evaluate all methods other than CarDEC and MNN. Open each of the following notebooks in Jupyter, set the active conda kernel to "cardec_alternatives", and then run all cells. Repeat this for every notebook listed below.
- Competing Methods Macaque.ipynb
- Competing Methods Mouse Cortex.ipynb
- Competing Methods Mouse Retina.ipynb
- Competing Methods PBMC.ipynb
- Competing Methods Pancreas.ipynb
- Competing Methods for monocyte
- DCA Liver Runtime.ipynb
- Scanorama Liver Runtime.ipynb
- scVI Liver Runtime.ipynb
Remark: "Competing Methods for monocyte" is a folder. All code for reproducing the results on the monocyte dataset can be found in this folder.
Lastly, the user should run the Python notebooks used to generate the coefficient of variation plots shown in many of the CarDEC paper's figures. Open each of the following notebooks in Jupyter, set the active conda kernel to "cardec", and then run all cells. Repeat this for every notebook listed below.
- Batch Calibration Tests Cortex.ipynb
- Batch Calibration Tests Macaque.ipynb
- Batch Calibration Tests Mouse Retina.ipynb
- Batch Calibration Tests PBMC.ipynb
- Batch Calibration Tests Pancreas.ipynb
This last step is purely optional. All analysis was completed in the previous steps. This final step uses R scripts to generate the final figures. These R scripts do not perform any actual analysis; they are used only to produce prettier plots for the paper than those made in Python. For example, all UMAP plots in the paper were generated by running all analysis in Python, exporting the computed UMAP coordinates to a csv file, and then reading this csv into R to build a prettier UMAP plot using ggplot2.
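The export step described above can be illustrated with a small sketch. It is not the code used in the notebooks: the function name `write_umap_csv` and the column names `UMAP1`, `UMAP2`, and `label` are hypothetical, and the coordinates are assumed to be a sequence of (x, y) pairs with one label per cell.

```python
import csv


def write_umap_csv(path, coords, labels):
    """Write one row per cell: two UMAP coordinates and a label column."""
    if len(coords) != len(labels):
        raise ValueError("coords and labels must have the same length")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["UMAP1", "UMAP2", "label"])  # hypothetical column names
        for (x, y), label in zip(coords, labels):
            writer.writerow([x, y, label])
```

A csv of this shape can be read in R with `read.csv` and passed to ggplot2, which matches the division of labor described above: computation in Python, plotting in R.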
If the user wishes to generate the final plots, they just need to open each folder and run any R scripts they find. Each script should run in under 30 seconds, since it only reads small csv files and generates UMAP plots. The scripts have names like "figure_make.R", "figure_make_HVGo.R", "figure_make_bybatch.R", et cetera. A few figure folders will not contain R scripts, which means that no R postprocessing was done to generate final figures.
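Finding and running these scripts can also be automated. The sketch below is one possible approach, with hypothetical helper names; it assumes `Rscript` is on the PATH, that `figures_root` points at a directory containing only the figure folders, and that each script expects to be run from its own folder so that relative csv paths resolve.

```python
import subprocess
from pathlib import Path


def find_r_scripts(figures_root):
    """Return every *.R file under figures_root, in sorted order."""
    return sorted(Path(figures_root).rglob("*.R"))


def run_r_scripts(figures_root):
    """Run each discovered script with Rscript from the script's own folder."""
    for script in find_r_scripts(figures_root):
        # Assumption: scripts read their csv inputs via paths relative to their folder.
        subprocess.run(["Rscript", script.name], cwd=script.parent, check=True)
```

Folders without R scripts are simply skipped, since nothing in them matches the `*.R` pattern.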