Tinder for single-cell data

Overview

This package provides Python bindings to the C++ implementation of the SingleR method, originally developed by Aran et al. (2019). It is designed to annotate cell types by matching cells to known references based on their expression profiles. So kind of like Tinder, but for cells.

Quick start

Firstly, let's load in the famous PBMC 4k dataset from 10X Genomics:

import singlecellexperiment as sce
data = sce.read_tenx_h5("pbmc4k-tenx.h5", realize_assays=True)
mat = data.assay("counts")
features = [str(x) for x in data.row_data["name"]]

Or, if you are coming from the scverse ecosystem (i.e., AnnData), simply read the object as a SingleCellExperiment and extract the matrix and the features in the same way. See the SingleCellExperiment documentation for more details.

import singlecellexperiment as sce

# `adata` is assumed to be an existing AnnData object
sce_adata = sce.SingleCellExperiment.from_anndata(adata)

# or from an h5ad file
sce_h5ad = sce.read_h5ad("tests/data/adata.h5ad")

Now, we fetch the Blueprint/ENCODE reference:

import celldex

ref_data = celldex.fetch_reference("blueprint_encode", "2024-02-26", realize_assays=True)

We can annotate each cell in mat with the reference:

import singler
results = singler.annotate_single(
    test_data = mat,
    test_features = features,
    ref_data = ref_data,
    ref_labels = ref_data.get_column_data().column("label.main"),
)

The results data frame contains all of the assignments and the scores for each label:

results.column("best")
## ['Monocytes',
##  'Monocytes',
##  'Monocytes',
##  'CD8+ T-cells',
##  'CD4+ T-cells',
##  'CD8+ T-cells',
##  'Monocytes',
##  'Monocytes',
##  'B-cells',
##  ...
## ]

results.column("scores").column("Macrophages")
## array([0.35935275, 0.40833545, 0.37430726, ..., 0.32135929, 0.29728435,
##        0.40208581])
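
For a quick overview of the assignments, the labels can be tallied. The following is a minimal sketch using only the standard library. The helper name summarize_labels and the short best list are stand-ins for illustration; in practice the list would come from results.column("best").

```python
from collections import Counter


def summarize_labels(best):
    """Return (label, count, fraction) tuples, most frequent label first."""
    counts = Counter(best)
    total = len(best)
    return [(label, n, n / total) for label, n in counts.most_common()]


# Stand-in for results.column("best"); these values are illustrative only.
best = ["Monocytes", "Monocytes", "CD8+ T-cells", "B-cells", "Monocytes"]

for label, n, frac in summarize_labels(best):
    print(f"{label}: {n} ({frac:.0%})")
```

A tally like this makes it easy to spot labels that are unexpectedly rare or dominant before looking at the per-label scores.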

Calling low-level functions

The annotate_single() function is a convenient wrapper around a number of lower-level functions in singler. Advanced users may prefer to build the reference and run the classification separately. This allows us to re-use the same reference for multiple datasets without repeating the build step.

built = singler.train_single(
    ref_data = ref_data.assay("logcounts"),
    ref_labels = ref_data.get_column_data().column("label.main"),
    ref_features = ref_data.get_row_names(),
    test_features = features,
)
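
Because the built reference is tied to the features supplied in test_features=, it can be worth checking that a new dataset uses the same features before re-using it. The sketch below is a plain-Python sanity check, not part of singler: the function name compare_features and the gene names are hypothetical, and it assumes that feature order matters because matrix rows are matched to feature names by position.

```python
def compare_features(built_features, new_features):
    """Classify how a new feature list relates to the one used at build time.

    Returns "identical", "reordered" (same names, different order),
    or "different" (names differ).
    """
    built, new = list(built_features), list(new_features)
    if built == new:
        return "identical"
    if sorted(built) == sorted(new):
        return "reordered"
    return "different"


# Hypothetical gene names, for illustration only.
built_features = ["CD3E", "CD14", "MS4A1"]

print(compare_features(built_features, ["CD3E", "CD14", "MS4A1"]))
print(compare_features(built_features, ["CD14", "CD3E", "MS4A1"]))
print(compare_features(built_features, ["CD3E", "CD14", "NKG7"]))
```

If the result is anything other than "identical", the safer option is to rebuild the reference with the new dataset's features.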

Finally, we apply the pre-built reference to the test dataset to obtain our label assignments. This step can be repeated with other datasets, provided that their features are the same as those passed in test_features= when the reference was built.

output = singler.classify_single(mat, ref_prebuilt=built)

## output
BiocFrame with 4340 rows and 3 columns
            best                                   scores                delta
        <list>                              <BiocFrame>   <ndarray[float64]>
[0] Monocytes 0.33265560369962943:0.407117403330602...  0.40706830113982534
[1] Monocytes 0.4078771641637374:0.4783396310685646...  0.07000418564184802
[2] Monocytes 0.3517036021728629:0.4076971245524348...  0.30997293412307647
            ...                                      ...                  ...
[4337]  NK cells 0.3472631136865701:0.3937898240670208...  0.09640242155786138
[4338]   B-cells 0.26974632191999887:0.334862058137758... 0.061215905058676856
[4339] Monocytes 0.39390119034537324:0.468867490667427...  0.06678168346812047
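
The delta column can be used to flag ambiguous assignments. The sketch below assumes that a larger delta indicates a clearer separation between the best label and its competitors (check the singler documentation for the exact definition). The helper mask_low_confidence and the 0.065 cutoff are hypothetical choices for illustration; the labels and delta values are the six visible rows of the output above, rounded.

```python
import numpy as np


def mask_low_confidence(labels, delta, threshold):
    """Replace labels whose delta falls below the threshold with None."""
    delta = np.asarray(delta, dtype=float)
    if len(labels) != delta.shape[0]:
        raise ValueError("labels and delta must have the same length")
    return [lab if d >= threshold else None for lab, d in zip(labels, delta)]


# Rounded values from the first and last three rows of the output above.
labels = ["Monocytes", "Monocytes", "Monocytes", "NK cells", "B-cells", "Monocytes"]
delta = [0.407, 0.070, 0.310, 0.096, 0.061, 0.067]

# The cutoff is an arbitrary illustration, not a recommended value.
masked = mask_low_confidence(labels, delta, threshold=0.065)
print(masked)
```

With this cutoff only the B-cells row (delta of roughly 0.061) is masked. A suitable threshold depends on the dataset and reference, so it is best chosen after inspecting the distribution of delta.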

Integrating labels across references

We can use annotations from multiple references through the annotate_integrated() function:

import singler
import celldex

blueprint_ref = celldex.fetch_reference("blueprint_encode", "2024-02-26", realize_assays=True)
immune_cell_ref = celldex.fetch_reference("dice", "2024-02-26", realize_assays=True)

single_results, integrated = singler.annotate_integrated(
    mat,
    features,
    ref_data = [
        blueprint_ref,
        immune_cell_ref
    ],
    ref_labels = [
        blueprint_ref.get_column_data().column("label.main"),
        immune_cell_ref.get_column_data().column("label.main")
    ],
    num_threads = 6
)

This annotates the test dataset against each reference individually to obtain the best per-reference label, and then it compares across references to find the best label from all references.

integrated.column("best_label")
## ['Monocytes', 
##  'Monocytes',
##  'Monocytes',
##  'CD8+ T-cells',
##  'CD4+ T-cells',
##  'CD8+ T-cells',
##  'Monocytes',
##  'Monocytes',
##  ...
## ]

integrated.column("best_reference")
## [0,
##  0, 
##  0,
##  0,
##  0,
##  0,
##  0,
##  0,
##  ...
## ]
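
The best_reference values are easier to read once they are mapped back to reference names. The sketch below assumes that each value is a zero-based index into the ref_data list, in the order the references were supplied. The helper name_references, the ref_names list and the stand-in indices are illustrative and not part of singler.

```python
from collections import Counter


def name_references(best_reference, ref_names):
    """Map each reference index to the corresponding reference name."""
    return [ref_names[i] for i in best_reference]


# Same order as the ref_data list passed to annotate_integrated().
ref_names = ["blueprint_encode", "dice"]

# Stand-in for integrated.column("best_reference"); illustrative values only.
best_reference = [0, 0, 0, 0, 1, 0, 0, 0]

named = name_references(best_reference, ref_names)
print(Counter(named))
```

Counting the names shows how often each reference supplied the winning label, which is a quick way to see whether one reference dominates the integrated results.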
