Merge pull request #4 from cantinilab/dev
Dev
jkobject authored Sep 19, 2024
2 parents 2fcf5c7 + 3b3a074 commit 098dcde
Showing 52 changed files with 9,844 additions and 540 deletions.
29 changes: 29 additions & 0 deletions Dockerfile
@@ -0,0 +1,29 @@
# Use the specified base image
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-devel

# Set environment variable to prevent interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# Update the package list
RUN apt-get update -y

# Install git
RUN apt-get install -y git

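# Clone the repository into the working directory and install it with pip
RUN git clone https://github.com/jkobject/scprint .
# A `cd` in its own RUN step does not persist to later steps, so the install runs from the repository root
RUN pip install -e ".[flash,dev]"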
RUN lamin init --storage ./main --name main --schema bionty
RUN python -c 'import bionty as bt; bt.base.reset_sources(); bt.core.sync_all_sources_to_latest()'
RUN lamin load main
RUN python -c 'from scdataloader.utils import populate_my_ontology; populate_my_ontology()'

# Set the default command (can be overridden)
CMD ["scprint", "--help"]

# to install the nvidia-container-toolkit on the host
# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# sudo apt-get update
# sudo apt-get install -y nvidia-container-toolkit
18 changes: 18 additions & 0 deletions HISTORY.md
@@ -4,6 +4,24 @@ Changelog

(unreleased)
------------
- Adding docker. [jkobject]
- Better reqs and lamin update. [jkobject]
- Add ruff and uv. [jkobject]
- Work in progress. [jkobject]
- Work on multiple updates. [jkobject]
- Merge remote-tracking branch 'origin/main' into dev. [jkobject]
- Better tests. [jkobject]
- Should be good now. [jkobject]
- Error. [jkobject]
- Trying precising the attention type. [jkobject]
- Better tests. [jkobject]
- Merge branch 'main' into dev. [maestro-jk]
- Merge branch 'main' into dev. [jkobject]


1.1.3 (2024-09-04)
------------------
- Release: version 1.1.3 🚀 [jkobject]
- Ready now. [jkobject]


12 changes: 6 additions & 6 deletions Makefile
@@ -26,16 +26,16 @@ install: ## Install the project in dev mode.

.PHONY: fmt
fmt: ## Format code using black & isort.
$(ENV_PREFIX)isort scprint/
$(ENV_PREFIX)black -l 88 scprint/
$(ENV_PREFIX)black -l 88 tests/
$(ENV_PREFIX)ruff check --fix scprint/
$(ENV_PREFIX)ruff check --fix tests/
$(ENV_PREFIX)ruff format tests/
$(ENV_PREFIX)ruff format scprint/

.PHONY: lint
lint: ## Run pep8, black, mypy linters.
#most are due to flashattention...
$(ENV_PREFIX)flake8 --ignore=E501,E203,E266,E265,W503,F401,F403,F841,E731,E722,E402 scprint/
$(ENV_PREFIX)black -l 88 --check scprint/
$(ENV_PREFIX)black -l 88 --check tests/
$(ENV_PREFIX)ruff check --fix scprint/
$(ENV_PREFIX)ruff check --fix tests/

.PHONY: test
test: lint ## Run tests and generate coverage report.
39 changes: 39 additions & 0 deletions README.md
@@ -56,6 +56,10 @@ scPRINT can be used to perform the following analyses:
- [where to find the gene embeddings?](#where-to-find-the-gene-embeddings)
- [Documentation](#documentation)
- [Model Weights](#model-weights)
- [Docker](#docker)
- [Building the Docker Image](#building-the-docker-image)
- [Pulling the Docker Image from Docker Hub](#pulling-the-docker-image-from-docker-hub)
- [Running the Docker Container](#running-the-docker-container)
- [Development](#development)
- [Work in progress (PR welcomed):](#work-in-progress-pr-welcomed)

@@ -264,6 +268,41 @@ For more information on usage please see the documentation in [https://www.jkobj

Model weights are available on [hugging face](https://huggingface.co/jkobject/scPRINT/).

## Docker

By using the scPRINT Docker image, you can bypass the complexities of manual package installation and get a consistent deployment environment. This repository includes a Dockerfile for building a container for the project; you can either build the image yourself or pull it from Docker Hub.

Make sure that you have the `docker` command line interface installed on your system.

A recommended way to install docker with the correct nvidia drivers on linux is to use this [script](https://gist.github.com/xueerchen1990/baad7baa545cb547e8633bc9e5b84786).

### Building the Docker Image

To build the Docker image from the provided `Dockerfile`, run the following command from the root directory of this repository:

```bash
docker build -t scprint:latest -f Dockerfile .
```

### Pulling the Docker Image from Docker Hub

If you don't want to build the image yourself, you can pull it directly from Docker Hub:

```bash
docker pull jkobject/scprint:1.1.3
docker tag jkobject/scprint:1.1.3 scprint:latest
```

### Running the Docker Container

Once you have the image (either by building it or pulling it), you can start a container with:

```bash
docker run --gpus all --rm -it scprint:latest bash
```

Please note: when running the Docker container, mount any folders you need with the `-v` option so that you can access them inside the container.

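As an illustration of the mounting note above, the following Python sketch assembles a `docker run` command that binds host folders into the container. The helper name `docker_run_command`, the `./data` host folder, and the `/workspace/data` target path are hypothetical choices for this example; only the `--gpus all`, `--rm`, `-it`, and `-v` options and the `scprint:latest` tag come from the commands in this section.

```python
import shlex
from pathlib import Path


def docker_run_command(image, mounts, command="bash"):
    """Build a `docker run` argument list with one -v option per mount.

    `mounts` maps host folders to paths inside the container.
    """
    args = ["docker", "run", "--gpus", "all", "--rm", "-it"]
    for host_dir, container_dir in mounts.items():
        # Bind mounts are given as absolute host paths here.
        host_abs = Path(host_dir).expanduser().resolve()
        args += ["-v", f"{host_abs}:{container_dir}"]
    args += [image, command]
    return args


if __name__ == "__main__":
    # Hypothetical example: expose ./data as /workspace/data in the container.
    cmd = docker_run_command("scprint:latest", {"./data": "/workspace/data"})
    print(shlex.join(cmd))
    # To actually start the container: subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to check the mounts before starting a GPU container.
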
## Development

Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
136 changes: 113 additions & 23 deletions docs/index.md
@@ -1,14 +1,15 @@

# scPRINT: Large Cell Model for scRNAseq data

[![codecov](https://codecov.io/gh/jkobject/scPRINT/branch/main/graph/badge.svg?token=GRnnData_token_here)](https://codecov.io/gh/jkobject/scPRINT)
[![CI](https://github.com/jkobject/scPRINT/actions/workflows/main.yml/badge.svg)](https://github.com/jkobject/scPRINT/actions/workflows/main.yml)
[![PyPI version](https://badge.fury.io/py/scprint.svg)](https://badge.fury.io/py/scprint)
[![Documentation Status](https://readthedocs.org/projects/scprint/badge/?version=latest)](https://scprint.readthedocs.io/en/latest/?badge=latest)
[![Downloads](https://pepy.tech/badge/scprint)](https://pepy.tech/project/scprint)
[![Downloads](https://pepy.tech/badge/scprint/month)](https://pepy.tech/project/scprint)
[![Downloads](https://pepy.tech/badge/scprint/week)](https://pepy.tech/project/scprint)
[![GitHub issues](https://img.shields.io/github/issues/jkobject/scPRINT)](https://img.shields.io/github/issues/jkobject/scPRINT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![DOI](https://zenodo.org/badge/391909874.svg)]()
[![DOI](https://zenodo.org/badge/391909874.svg)](https://doi.org/10.1101/2024.07.29.605556)

![logo](logo.png)

@@ -23,39 +24,122 @@ scPRINT can be used to perform the following analyses:
- __label prediction__: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
- __gene network inference__: generate a gene network from any cell or cell cluster in your scRNAseq dataset

[Read the paper!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT.
[Read the manuscript!](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT. Have a look at some of my [X-plainers](https://twitter.com/jkobject).

![figure1](figure1.png)

## Table of Contents

- [scPRINT: Large Cell Model for scRNAseq data](#scprint-large-cell-model-for-scrnaseq-data)
- [Table of Contents](#table-of-contents)
- [Install `scPRINT`](#install-scprint)
- [lamin.ai](#laminai)
- [install](#install)
- [pytorch and GPUs](#pytorch-and-gpus)
- [dev install](#dev-install)
- [Usage](#usage)
- [scPRINT's basic commands](#scprints-basic-commands)
- [Notes on GPU/CPU usage with triton](#notes-on-gpucpu-usage-with-triton)
- [Simple tests:](#simple-tests)
- [FAQ](#faq)
- [I want to generate gene networks from scRNAseq data:](#i-want-to-generate-gene-networks-from-scrnaseq-data)
- [I want to generate cell embeddings and cell label predictions from scRNAseq data:](#i-want-to-generate-cell-embeddings-and-cell-label-predictions-from-scrnaseq-data)
- [I want to denoise my scRNAseq dataset:](#i-want-to-denoise-my-scrnaseq-dataset)
- [I want to generate an atlas-level embedding](#i-want-to-generate-an-atlas-level-embedding)
- [I need to generate gene tokens using pLLMs](#i-need-to-generate-gene-tokens-using-pllms)
- [I want to pre-train scPRINT from scratch on my own data](#i-want-to-pre-train-scprint-from-scratch-on-my-own-data)
- [how can I find if scPRINT was trained on my data?](#how-can-i-find-if-scprint-was-trained-on-my-data)
- [can I use scPRINT on other organisms rather than human?](#can-i-use-scprint-on-other-organisms-rather-than-human)
- [how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)](#how-long-does-scprint-takes-what-kind-of-resources-do-i-need-or-in-alternative-can-i-run-scprint-locally)
- [I have different scRNASeq batches. Should I integrate my data before running scPRINT?](#i-have-different-scrnaseq-batches-should-i-integrate-my-data-before-running-scprint)
- [where to find the gene embeddings?](#where-to-find-the-gene-embeddings)
- [Documentation](#documentation)
- [Model Weights](#model-weights)
- [Development](#development)
- [Work in progress (PR welcomed):](#work-in-progress-pr-welcomed)


## Install `scPRINT`

For the moment scPRINT has been tested on MacOS and Linux (Ubuntu 20.04) with Python 3.10.
For the moment scPRINT has been tested on MacOS and Linux (Ubuntu 20.04) with Python 3.10. Its installation takes on average 10 minutes.

If you want to use flashattention2, note that it only supports triton 2.0 MLIR's version and torch==2.0.0 for now.

```python
conda create -n "[whatever]" python==3.10
### lamin.ai

To use scPRINT, you will need to use lamin.ai. It is required to load biological information such as genes, cell types, organisms, etc.

To do so, you will need to connect with google or github to [lamin.ai](https://lamin.ai/login), then be sure to connect before running anything (or before starting a notebook): `lamin login <email> --key <API-key>`. Follow the instructions on [their website](https://docs.lamin.ai/guide).

### install

To start you will need to do:

```bash
conda create -n <env-name> python==3.10 #scprint might work with python >3.10, but it is not tested
#one of
pip install scprint # OR
pip install scprint[dev] # for the dev dependencies (building etc..) AND/OR [dev,flash]
pip install scprint[flash] && pip install -e "git+https://github.com/triton-lang/triton.git@legacy-backend#egg=triton&subdirectory=python" # to use flashattention2, you will need to install triton 2.0.0.dev20221202 specifically, working on removing this dependency # only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
pip install scprint[dev] # for the dev dependencies (building etc..) OR
pip install scprint[flash] # to use flashattention2 with triton: only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
#OR pip install scPRINT[dev,flash]

lamin login <email> --key <API-key>
lamin init --storage <folder-name-where-lamin-data-will-be-stored> --schema bionty
```

If you are starting with lamin and had to do a `lamin init`, you will also need to populate your ontologies. This is because scPRINT uses ontologies to define its cell types, diseases, sexes, ethnicities, etc.

You can do it manually or with our function:

```python
from scdataloader.utils import populate_my_ontology

populate_my_ontology()  # to populate everything (recommended) (can take 2-10 minutes)

# the minimum for scprint to run some inferences (denoising, grn inference)
populate_my_ontology(
    organisms=["NCBITaxon:10090", "NCBITaxon:9606"],
    sex=["PATO:0000384", "PATO:0000383"],
    celltypes=None,
    ethnicities=None,
    assays=None,
    tissues=None,
    diseases=None,
    dev_stages=None,
)
```

We make use of some additional packages we developed alongside scPRINT.

Please refer to their documentation for more information:

- [scDataLoader](https://github.com/jkobject/scDataLoader): a dataloader for training large cell models.
- [GRnnData](https://github.com/cantinilab/GRnnData): a package to work with gene networks from single cell data.
- [benGRN](https://github.com/jkobject/benGRN): a package to benchmark gene network inference methods from single cell data.

### lamin.ai
### pytorch and GPUs

scPRINT can run on machines without GPUs, but it will be slow. It is highly recommended to use a GPU for inference.

Once you have a GPU and have installed the required drivers, you might need to install a specific version of pytorch that is compatible with them. For example, nvidia 550 drivers will lead to an nvidia toolkit 11.7 or 11.8, which might mean you need to re-install a different flavor of pytorch for things to work. In my case, on linux, this was done with the command:
`pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118`

I was able to test it with nvidia 11.7, 11.8, 12.2.
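
To check which pytorch build is installed and whether it can see your GPU, a minimal sketch such as the one below can help. The function name `describe_torch_env` is hypothetical and not part of scPRINT; the `torch` attributes it reads are standard PyTorch ones, and the function falls back to an empty report when pytorch is not installed.

```python
def describe_torch_env():
    """Return a small report on the installed PyTorch build and visible GPUs."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None, "cuda_build": None, "cuda_available": False, "gpus": []}

    available = torch.cuda.is_available()
    gpus = []
    if available:
        gpus = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": available,
        "gpus": gpus,
    }


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

If `cuda_available` is false while a GPU is present, the toolkit version that the pytorch build targets (`cuda_build`) may not match your drivers, which is the situation described above.
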

⚠️ if you want to use the scDataloader's multi-dataset mode or if you want to preprocess datasets and other functions of the model, you will need to use lamin.ai.
### dev install

In that case, connect with google or github to [lamin.ai](https://lamin.ai/login), then be sure to connect before running anything (or before starting a notebook): `lamin login <email> --key <API-key>`. Follow the instructions on [their website](https://docs.lamin.ai/guide).
If you want to use the latest version of scPRINT and work on the code yourself, use `git clone` and `pip install -e` instead of `pip install`.

```bash
git clone https://github.com/jkobject/scPRINT
git clone https://github.com/jkobject/scDataLoader
git clone https://github.com/cantinilab/GRnnData
git clone https://github.com/jkobject/benGRN
pip install -e scPRINT[dev]
pip install -e scDataLoader[dev]
pip install -e GRnnData[dev]
pip install -e benGRN[dev]
```
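
After the editable installs, a quick way to confirm that the packages are visible to your environment is to query their installed versions. This is only a sketch: the helper `installed_versions` is hypothetical, and the distribution names are assumed to be the lower-cased repository names (`scprint`, `scdataloader`, `grnndata`, `bengrn`).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    # Assumed distribution names; adjust them if the packages are published differently.
    packages = ["scprint", "scdataloader", "grnndata", "bengrn"]
    for name, ver in installed_versions(packages).items():
        print(f"{name}: {ver or 'not installed'}")
```
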

## Usage

@@ -88,7 +172,7 @@ $ scprint fit/train/predict/test/denoise/embed/gninfer --config config/[medium|l

Find out more about the commands by running `scprint --help` or `scprint [command] --help`.

more examples of using the command line are available in the [docs](./docs/usage.md).
More examples of using the command line are available in the [docs](usage.md).

### Notes on GPU/CPU usage with triton

@@ -102,6 +186,10 @@ model = scPrint.load_from_checkpoint(
transformer="normal")
```

### Simple tests:

An installation of scPRINT and a simple test of the denoiser are performed on each commit to the main branch with a [Github action](https://github.com/jkobject/scPRINT/actions) and [pytest workflow](https://github.com/jkobject/scPRINT/blob/main/.github/workflows/main.yml). It also provides an expected runtime for the installation and run of scPRINT.

We now explore the different usages of scPRINT:

## FAQ
@@ -110,27 +198,27 @@ We now explore the different usages of scPRINT:

-> Refer to the gene network inference section in [this notebook](./notebooks/cancer_usecase.ipynb#).

-> More examples in this notebook [notebooks/assessments/bench_omni.ipynb](../notebooks/bench_omni.ipynb).
-> More examples in this notebook [./notebooks/assessments/bench_omni.ipynb](https://github.com/jkobject/scPRINT/blob/main/notebooks/bench_omni.ipynb).

### I want to generate cell embeddings and cell label predictions from scRNAseq data:

-> Refer to the embeddings and cell annotations section in [this notebook](./notebooks/cancer_usecase.ipynb#).

### I want to denoising my scRNAseq dataset:
### I want to denoise my scRNAseq dataset:

-> Refer to the Denoising of B-cell section in [this notebook](./notebooks/cancer_usecase.ipynb).

-> More example in our benchmark notebook [notebooks/assessments/bench_denoising.ipynb](../notebooks/bench_denoising.ipynb).
-> More examples in our benchmark notebook [./notebooks/assessments/bench_denoising.ipynb](https://github.com/jkobject/scPRINT/blob/main/notebooks/bench_denoising.ipynb).

### I want to generate an atlas-level embedding

-> Refer to the notebook [figures/nice_umap.ipynb](../figures/nice_umap.ipynb).
-> Refer to the notebook [nice_umap.ipynb](https://github.com/jkobject/scPRINT/blob/main/figures/nice_umap.ipynb).

### I need to generate gene tokens using pLLMs

To run scPRINT, you can choose to define the gene tokens using protein language model embeddings of genes. This is done by providing scPRINT with the path to a parquet file containing the precomputed embeddings for each gene name, via `precpt_gene_emb`.

-> To generate this file please refer to the notebook [notebooks/generate_gene_embeddings.ipynb](../notebooks/generate_gene_embeddings.ipynb).
-> To generate this file please refer to the notebook [generate_gene_embeddings](https://github.com/jkobject/scPRINT/blob/main/notebooks/generate_gene_embeddings.ipynb).

### I want to pre-train scPRINT from scratch on my own data

@@ -163,7 +251,7 @@ model = scPrint.load_from_checkpoint(
)
```

You can also recreate the gene embedding file through [this notebook](notebooks/generate_gene_embeddings.ipynb). Just call the functions, and it should recreate the file itself.
You can also recreate the gene embedding file through [this notebook](https://github.com/jkobject/scPRINT/blob/main/notebooks/generate_gene_embeddings.ipynb). Just call the functions, and it should recreate the file itself.

The file itself is also available on [hugging face](https://huggingface.co/jkobject/scPRINT/tree/main).

@@ -177,21 +265,23 @@ Model weights are available on [hugging face](https://huggingface.co/jkobject/sc

## Development

Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
Read the [CONTRIBUTING.md](https://github.com/jkobject/scPRINT/blob/main/CONTRIBUTING.md) file.

Read the [training runs](https://wandb.ai/ml4ig/scprint_scale/reports/scPRINT-trainings--Vmlldzo4ODIxMjgx?accessToken=80metwx7b08hhourotpskdyaxiflq700xzmzymr6scvkp69agybt79l341tv68hp) document to know more about how pre-training was performed and its behavior.

Code coverage is not accurate, as I am using the command line interface for now. More than 50% of the code is covered by my current unit tests.

Acknowledgement:

- [python template](https://github.com/rochacbruno/python-project-template)
- [laminDB](https://lamin.ai/)
- [lightning](https://lightning.ai/)

## Work in progress:
## Work in progress (PR welcomed):

1. remove the triton dependencies
2. add version with additional labels (tissues, age) and organisms (mouse, zebrafish) and more datasets from cellxgene
3. version with separate transformer blocks for the encoding part of the bottleneck learning and for the cell embeddings
4. improve classifier to output uncertainties and topK predictions when unsure
5.
5. setup latest lamindb version

Awesome Large Cell Model created by Jeremie Kalfon.