Merge pull request #4 from cantinilab/dev

Dev
cantinilab · Sep 19, 2024 · 098dcde
2 parents 2fcf5c7 + 3b3a074
Showing 52 changed files with 9,844 additions and 540 deletions.
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,29 @@
+# Use the specified base image
+FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-devel
+
+# Set environment variable to prevent interactive prompts
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Update the package list
+RUN apt-get update -y
+
+# Install git
+RUN apt-get install -y git
+
+# Install Python packages using pip
+RUN git clone https://github.com/jkobject/scprint .
+# no `cd` needed: the repo is cloned into the working directory, and a `RUN cd` does not persist across layers
+RUN pip install -e ".[flash,dev]"
+RUN lamin init --storage ./main --name main --schema bionty
+RUN python -c 'import bionty as bt; bt.base.reset_sources(); bt.core.sync_all_sources_to_latest()'
+RUN lamin load main
+RUN python -c 'from scdataloader.utils import populate_my_ontology; populate_my_ontology()'
+
+# Set the default command (can be overridden)
+CMD ["scprint", "--help"]
+
+# to install the nvidia-container-toolkit on the host (needed for `docker run --gpus`)
+# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
+# curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+# sudo apt-get update
+# sudo apt-get install -y nvidia-container-toolkit
diff --git a/HISTORY.md b/HISTORY.md
@@ -4,6 +4,24 @@ Changelog
 
 (unreleased)
 ------------
+- Adding docker. [jkobject]
+- Better reqs and lamin update. [jkobject]
+- Add ruff and uv. [jkobject]
+- Work in progress. [jkobject]
+- Work on multiple updates. [jkobject]
+- Merge remote-tracking branch 'origin/main' into dev. [jkobject]
+- Better tests. [jkobject]
+- Should be good now. [jkobject]
+- Error. [jkobject]
+- Trying to specify the attention type. [jkobject]
+- Better tests. [jkobject]
+- Merge branch 'main' into dev. [maestro-jk]
+- Merge branch 'main' into dev. [jkobject]
+
+
+1.1.3 (2024-09-04)
+------------------
+- Release: version 1.1.3 🚀 [jkobject]
 - Ready now. [jkobject]
 
 

diff --git a/Makefile b/Makefile
@@ -26,16 +26,16 @@ install:          ## Install the project in dev mode.
 
 .PHONY: fmt
 fmt:              ## Format code using black & isort.
-	$(ENV_PREFIX)isort scprint/
-	$(ENV_PREFIX)black -l 88 scprint/
-	$(ENV_PREFIX)black -l 88 tests/
+	$(ENV_PREFIX)ruff check --fix scprint/
+	$(ENV_PREFIX)ruff check --fix tests/
+	$(ENV_PREFIX)ruff format tests/
+	$(ENV_PREFIX)ruff format scprint/
 
 .PHONY: lint
 lint:             ## Run pep8, black, mypy linters.
 #most are due to flashattention...
-	$(ENV_PREFIX)flake8 --ignore=E501,E203,E266,E265,W503,F401,F403,F841,E731,E722,E402 scprint/
-	$(ENV_PREFIX)black -l 88 --check scprint/
-	$(ENV_PREFIX)black -l 88 --check tests/
+	$(ENV_PREFIX)ruff check --fix scprint/
+	$(ENV_PREFIX)ruff check --fix tests/
 
 .PHONY: test
 test: lint        ## Run tests and generate coverage report.

diff --git a/README.md b/README.md
@@ -56,6 +56,10 @@ scPRINT can be used to perform the following analyses:
     - [where to find the gene embeddings?](#where-to-find-the-gene-embeddings)
   - [Documentation](#documentation)
   - [Model Weights](#model-weights)
+  - [Docker](#docker)
+    - [Building the Docker Image](#building-the-docker-image)
+    - [Pulling the Docker Image from Docker Hub](#pulling-the-docker-image-from-docker-hub)
+    - [Running the Docker Container](#running-the-docker-container)
   - [Development](#development)
   - [Work in progress (PR welcomed):](#work-in-progress-pr-welcomed)
 
@@ -264,6 +268,41 @@ For more information on usage please see the documentation in [https://www.jkobj
 
 Model weights are available on [hugging face](https://huggingface.co/jkobject/scPRINT/).
 
+## Docker
+
+Using the scPRINT Docker image lets you skip manual package installation and gives you a consistent deployment environment. This repository includes a Dockerfile for the project; you can either build the image yourself or pull it from Docker Hub.
+
+Make sure that you have the `docker` command line interface installed on your system.
+
+A recommended way to install Docker with the correct NVIDIA drivers on Linux is to use this [script](https://gist.github.com/xueerchen1990/baad7baa545cb547e8633bc9e5b84786).
+
+### Building the Docker Image
+
+To build the Docker image from the provided `Dockerfile`, run the following command from the root directory of this repository:
+
+```bash
+docker build -t scprint:latest -f Dockerfile .
+```
+
+### Pulling the Docker Image from Docker Hub
+
+If you don't want to build the image yourself, you can pull it directly from Docker Hub:
+
+```bash
+docker pull jkobject/scprint:1.1.3
+docker tag jkobject/scprint:1.1.3 scprint:latest
+```
+
+### Running the Docker Container
+
+Once you have the image (either by building it or pulling it), you can start a container with:
+
+```bash
+docker run --gpus all --rm -it scprint:latest bash
+```
+
+Please note: when running the Docker container, mount any folders you need with the `-v` option so that they are accessible inside the container.
+
 ## Development
 
 Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
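
Review note on the README Docker section above: the closing remark about mounting folders with `-v` has no example. A minimal sketch of how the `docker run` line from that section could be extended with bind mounts is given here. The helper name `docker_run_command` and the container paths `/data` and `/workspace/out` are hypothetical, chosen only for illustration; the flags themselves (`--gpus all --rm -it`, `-v host:container`) are the ones used in the README and the standard Docker CLI.

```python
import shlex
from pathlib import Path


def docker_run_command(mounts, image="scprint:latest", gpus="all"):
    """Build a `docker run` argument list with one -v flag per mounted folder.

    `mounts` maps host folders to the paths they should appear at inside
    the container. This helper is illustrative and not part of scPRINT.
    """
    cmd = ["docker", "run", "--gpus", gpus, "--rm", "-it"]
    for host_dir, container_dir in mounts.items():
        # Use an absolute host path so Docker treats it as a bind mount
        # rather than as the name of a volume.
        host_abs = Path(host_dir).expanduser().resolve()
        cmd += ["-v", f"{host_abs}:{container_dir}"]
    cmd += [image, "bash"]
    return cmd


if __name__ == "__main__":
    # Hypothetical folders: local ./data and ./results exposed in the container.
    cmd = docker_run_command({"./data": "/data", "./results": "/workspace/out"})
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without Docker; the resulting line can be pasted into a shell.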

diff --git a/docs/index.md b/docs/index.md
@@ -1,14 +1,15 @@
 
 # scPRINT: Large Cell Model for scRNAseq data
 
+[![codecov](https://codecov.io/gh/jkobject/scPRINT/branch/main/graph/badge.svg?token=GRnnData_token_here)](https://codecov.io/gh/jkobject/scPRINT)
+[![CI](https://github.com/jkobject/scPRINT/actions/workflows/main.yml/badge.svg)](https://github.com/jkobject/scPRINT/actions/workflows/main.yml)
 [![PyPI version](https://badge.fury.io/py/scprint.svg)](https://badge.fury.io/py/scprint)
-[![Documentation Status](https://readthedocs.org/projects/scprint/badge/?version=latest)](https://scprint.readthedocs.io/en/latest/?badge=latest)
 [![Downloads](https://pepy.tech/badge/scprint)](https://pepy.tech/project/scprint)
 [![Downloads](https://pepy.tech/badge/scprint/month)](https://pepy.tech/project/scprint)
 [![Downloads](https://pepy.tech/badge/scprint/week)](https://pepy.tech/project/scprint)
 [![GitHub issues](https://img.shields.io/github/issues/jkobject/scPRINT)](https://img.shields.io/github/issues/jkobject/scPRINT)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
-[![DOI](https://zenodo.org/badge/391909874.svg)]()
+[![DOI](https://zenodo.org/badge/391909874.svg)](https://doi.org/10.1101/2024.07.29.605556)
 
 ![logo](logo.png)
 
@@ -23,39 +24,122 @@ scPRINT can be used to perform the following analyses:
 - __label prediction__: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
 - __gene network inference__: generate a gene network from any cell or cell cluster in your scRNAseq dataset
 
-[Read the paper!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT.
+[Read the manuscript!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT. Have a look at some of my [X-plainers](https://twitter.com/jkobject). 
 
 ![figure1](figure1.png)
 
+## Table of Contents
+
+- [scPRINT: Large Cell Model for scRNAseq data](#scprint-large-cell-model-for-scrnaseq-data)
+  - [Table of Contents](#table-of-contents)
+  - [Install `scPRINT`](#install-scprint)
+    - [lamin.ai](#laminai)
+    - [install](#install)
+    - [pytorch and GPUs](#pytorch-and-gpus)
+    - [dev install](#dev-install)
+  - [Usage](#usage)
+    - [scPRINT's basic commands](#scprints-basic-commands)
+    - [Notes on GPU/CPU usage with triton](#notes-on-gpucpu-usage-with-triton)
+    - [Simple tests:](#simple-tests)
+  - [FAQ](#faq)
+    - [I want to generate gene networks from scRNAseq data:](#i-want-to-generate-gene-networks-from-scrnaseq-data)
+    - [I want to generate cell embeddings and cell label predictions from scRNAseq data:](#i-want-to-generate-cell-embeddings-and-cell-label-predictions-from-scrnaseq-data)
+    - [I want to denoise my scRNAseq dataset:](#i-want-to-denoise-my-scrnaseq-dataset)
+    - [I want to generate an atlas-level embedding](#i-want-to-generate-an-atlas-level-embedding)
+    - [I need to generate gene tokens using pLLMs](#i-need-to-generate-gene-tokens-using-pllms)
+    - [I want to pre-train scPRINT from scratch on my own data](#i-want-to-pre-train-scprint-from-scratch-on-my-own-data)
+    - [how can I find if scPRINT was trained on my data?](#how-can-i-find-if-scprint-was-trained-on-my-data)
+    - [can I use scPRINT on other organisms rather than human?](#can-i-use-scprint-on-other-organisms-rather-than-human)
+    - [how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)](#how-long-does-scprint-takes-what-kind-of-resources-do-i-need-or-in-alternative-can-i-run-scprint-locally)
+    - [I have different scRNASeq batches. Should I integrate my data before running scPRINT?](#i-have-different-scrnaseq-batches-should-i-integrate-my-data-before-running-scprint)
+    - [where to find the gene embeddings?](#where-to-find-the-gene-embeddings)
+  - [Documentation](#documentation)
+  - [Model Weights](#model-weights)
+  - [Development](#development)
+  - [Work in progress (PR welcomed):](#work-in-progress-pr-welcomed)
+
 
 ## Install `scPRINT`
 
-For the moment scPRINT has been tested on MacOS and Linux (Ubuntu 20.04) with Python 3.10.
+For the moment scPRINT has been tested on MacOS and Linux (Ubuntu 20.04) with Python 3.10. Its installation takes on average 10 minutes.
 
 If you want to be using flashattention2, know that it only supports triton 2.0 MLIR's version and torch==2.0.0 for now.
 
-```python
-conda create -n "[whatever]" python==3.10
+### lamin.ai
+
+To use scPRINT, you need to use lamin.ai. It is required to load biological information such as genes, cell types, organisms, etc.
+
+To do so, you will need to connect with google or github to [lamin.ai](https://lamin.ai/login), then be sure to connect before running anything (or before starting a notebook): `lamin login <email> --key <API-key>`. Follow the instructions on [their website](https://docs.lamin.ai/guide).
+
+### install
+
+To start you will need to do:
+
+```bash
+conda create -n <env-name> python==3.10 #scprint might work with python >3.10, but it is not tested
 #one of
 pip install scprint # OR
-pip install scprint[dev] # for the dev dependencies (building etc..) AND/OR [dev,flash]
-pip install scprint[flash] && pip install -e "git+https:/
-/github.com/triton-lang/triton.git@legacy-backend
-#egg=triton&subdirectory=python" # to use flashattention2, you will need to install triton 2.0.0.dev20221202 specifically, working on removing this dependency # only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
+pip install scprint[dev] # for the dev dependencies (building etc..) OR
+pip install scprint[flash] # to use flashattention2 with triton: only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
+#OR pip install scPRINT[dev,flash]
+
+lamin login <email> --key <API-key>
+lamin init --storage <folder-name-where-lamin-data-will-be-stored> --schema bionty
+```
+
+If you are new to lamin and had to run `lamin init`, you will also need to populate your ontologies. This is because scPRINT uses ontologies to define its cell types, diseases, sexes, ethnicities, etc.
+
+You can do it manually or with our function:
+
+```python
+from scdataloader.utils import populate_my_ontology
+
+populate_my_ontology()  # to populate everything (recommended; can take 2-10 minutes)
+
+populate_my_ontology(  # the minimum for scPRINT to run some inferences (denoising, GRN inference)
+    organisms=["NCBITaxon:10090", "NCBITaxon:9606"],
+    sex=["PATO:0000384", "PATO:0000383"],
+    celltypes=None,
+    ethnicities=None,
+    assays=None,
+    tissues=None,
+    diseases=None,
+    dev_stages=None,
+)
+```
 
 We make use of some additional packages we developed alongside scPRint.
+
 Please refer to their documentation for more information:
 
 - [scDataLoader](https://github.com/jkobject/scDataLoader): a dataloader for training large cell models.
 - [GRnnData](https://github.com/cantinilab/GRnnData): a package to work with gene networks from single cell data.
 - [benGRN](https://github.com/jkobject/benGRN): a package to benchmark gene network inference methods from single cell data.
 
-### lamin.ai
+### pytorch and GPUs
+
+scPRINT can run on machines without GPUs, but it will be slow. It is highly recommended to use a GPU for inference.
+
+Once you have a GPU and have installed the required drivers, you might need to install a specific version of pytorch that is compatible with them. For example, nvidia 550 drivers will lead to an nvidia toolkit 11.7 or 11.8, which might mean you need to re-install a different flavor of pytorch for things to work. In my case, on linux, I used the command:
+
+`pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118`
+
+I was able to test it with nvidia toolkit versions 11.7, 11.8, and 12.2.
 
-⚠️ if you want to use the scDataloader's multi-dataset mode or if you want to preprocess datasets and other functions of the model, you will need to use lamin.ai.
+### dev install
 
-In that case, connect with google or github to [lamin.ai](https://lamin.ai/login), then be sure to connect before running anything (or before starting a notebook): `lamin login <email> --key <API-key>`. Follow the instructions on [their website](https://docs.lamin.ai/guide).
+If you want to use the latest version of scPRINT and work on the code yourself, use `git clone` and `pip install -e` instead of `pip install`.
+
+```bash
+git clone https://github.com/jkobject/scPRINT
+git clone https://github.com/jkobject/scDataLoader
+git clone https://github.com/cantinilab/GRnnData
+git clone https://github.com/jkobject/benGRN
+pip install -e scPRINT[dev]
+pip install -e scDataLoader[dev]
+pip install -e GRnnData[dev]
+pip install -e benGRN[dev]
+```
 
 ## Usage
 
@@ -88,7 +172,7 @@ $ scprint fit/train/predict/test/denoise/embed/gninfer --config config/[medium|l
 
 find out more about the commands by running `scprint --help` or `scprint [command] --help`.
 
-more examples of using the command line are available in the [docs](./docs/usage.md).
+more examples of using the command line are available in the [docs](usage.md).
 
 ### Notes on GPU/CPU usage with triton
 
@@ -102,6 +186,10 @@ model = scPrint.load_from_checkpoint(
     transformer="normal")
 ```
 
+### Simple tests:
+
+An installation of scPRINT and a simple test of the denoiser are performed on each commit to the main branch with a [Github action](https://github.com/jkobject/scPRINT/actions) and [pytest workflow](https://github.com/jkobject/scPRINT/blob/main/.github/workflows/main.yml). It also provides an expected runtime for the installation and run of scPRINT.
+
 We now explore the different usages of scPRINT:
 
 ## FAQ
@@ -110,27 +198,27 @@ We now explore the different usages of scPRINT:
 
 -> Refer to the section . gene network inference in [this notebook](./notebooks/cancer_usecase.ipynb#).
 
--> More examples in this notebook [notebooks/assessments/bench_omni.ipynb](../notebooks/bench_omni.ipynb).
+-> More examples in this notebook [./notebooks/assessments/bench_omni.ipynb](https://github.com/jkobject/scPRINT/blob/main/notebooks/bench_omni.ipynb).
 
 ### I want to generate cell embeddings and cell label predictions from scRNAseq data:
 
 -> Refer to the embeddings and cell annotations section in [this notebook](./notebooks/cancer_usecase.ipynb#).
 
-### I want to denoising my scRNAseq dataset:
+### I want to denoise my scRNAseq dataset:
 
 -> Refer to the Denoising of B-cell section in [this notebook](./notebooks/cancer_usecase.ipynb).
 
--> More example in our benchmark notebook [notebooks/assessments/bench_denoising.ipynb](../notebooks/bench_denoising.ipynb).
+-> More examples in our benchmark notebook [./notebooks/assessments/bench_denoising.ipynb](https://github.com/jkobject/scPRINT/blob/main/notebooks/bench_denoising.ipynb).
 
 ### I want to generate an atlas-level embedding
 
--> Refer to the notebook [figures/nice_umap.ipynb](../figures/nice_umap.ipynb).
+-> Refer to the notebook [nice_umap.ipynb](https://github.com/jkobject/scPRINT/blob/main/figures/nice_umap.ipynb).
 
 ### I need to generate gene tokens using pLLMs
 
 To run scPRINT, you can use the option to define the gene tokens using protein language model embeddings of genes. This is done by providing the path to a parquet file of the precomputed set of embeddings for each gene name to scPRINT via "precpt_gene_emb"
 
--> To generate this file please refer to the notebook [notebooks/generate_gene_embeddings.ipynb](../notebooks/generate_gene_embeddings.ipynb).
+-> To generate this file please refer to the notebook [generate_gene_embeddings](https://github.com/jkobject/scPRINT/blob/main/notebooks/generate_gene_embeddings.ipynb).
 
 ### I want to pre-train scPRINT from scratch on my own data
 
@@ -163,7 +251,7 @@ model = scPrint.load_from_checkpoint(
 )
 ```
 
-You can also recreate the gene embedding file through [this notebook](notebooks/generate_gene_embeddings.ipynb). Just call the functions, and it should recreate the file itself.
+You can also recreate the gene embedding file through [this notebook](https://github.com/jkobject/scPRINT/blob/main/notebooks/generate_gene_embeddings.ipynb). Just call the functions, and it should recreate the file itself.
 
 the file itself is also available on [hugging face](https://huggingface.co/jkobject/scPRINT/tree/main)
 
@@ -177,21 +265,23 @@ Model weights are available on [hugging face](https://huggingface.co/jkobject/sc
 
 ## Development
 
-Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
+Read the [CONTRIBUTING.md](https://github.com/jkobject/scPRINT/blob/main/CONTRIBUTING.md) file.
 
 Read the [training runs](https://wandb.ai/ml4ig/scprint_scale/reports/scPRINT-trainings--Vmlldzo4ODIxMjgx?accessToken=80metwx7b08hhourotpskdyaxiflq700xzmzymr6scvkp69agybt79l341tv68hp) document to know more about how pre-training was performed and the its behavior.
 
+Code coverage is not accurate, as I am using the command line interface for now. More than 50% of the code is covered by my current unit tests.
+
 Acknowledgement:
 [python template](https://github.com/rochacbruno/python-project-template)
 [laminDB](https://lamin.ai/)
 [lightning](https://lightning.ai/)
 
-## Work in progress:
+## Work in progress (PR welcomed):
 
 1. remove the triton dependencies
 2. add version with additional labels (tissues, age) and organisms (mouse, zebrafish) and more datasets from cellxgene
 3. version with separate transformer blocks for the encoding part of the bottleneck learning and for the cell embeddings
 4. improve classifier to output uncertainties and topK predictions when unsure
-5. 
+5. set up the latest lamindb version
 
 Awesome Large Cell Model created by Jeremie Kalfon.
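
Review note on the "pytorch and GPUs" section added in docs/index.md: the section gives one concrete `pip install` line for the `cu118` wheels. A small sketch of how that line generalizes to other toolkit versions is given below. It assumes the `cuXYZ` suffix convention visible in the URL from the docs (11.8 becomes `cu118`); the function names are hypothetical, and the default package versions are simply the ones quoted in the docs. PyTorch does not publish wheels for every CUDA version with every release, so the resulting URL should be checked before use.

```python
def pytorch_wheel_index(cuda_version: str) -> str:
    """Return the PyTorch wheel index URL for a CUDA version such as "11.8".

    Assumes the cu<major><minor> suffix convention, e.g. "11.8" -> "cu118".
    """
    major, minor = cuda_version.strip().split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{int(major)}{int(minor)}"


def pip_install_command(cuda_version, torch="2.2.0", torchvision="0.17.0",
                        torchaudio="2.2.0"):
    """Build the pip command from the docs for a given CUDA toolkit version."""
    return [
        "pip", "install",
        f"torch=={torch}",
        f"torchvision=={torchvision}",
        f"torchaudio=={torchaudio}",
        "--index-url", pytorch_wheel_index(cuda_version),
    ]


if __name__ == "__main__":
    # Reproduces the command quoted in the docs for the 11.8 toolkit.
    print(" ".join(pip_install_command("11.8")))
```

To find out which CUDA version an already installed pytorch was built against, `torch.version.cuda` and `torch.cuda.is_available()` can be inspected from a Python session.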