diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
index 3534baf..e55d835 100644
--- a/.github/workflows/publish.yml
+++ b/.github/workflows/publish.yml
@@ -15,7 +15,7 @@ jobs:
       contents: write
     steps:
       - name: Check out repository
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
 
       - name: Set up Quarto
         uses: quarto-dev/quarto-actions/setup@v2
@@ -27,7 +27,9 @@ jobs:
         with:
           python-version: '3.9'
           cache: 'pip'
-      - run: pip install jupyter
+      # - run: pip install uv
+      # - run: uv venv
+      # - run: source .venv/bin/activate
       - run: pip install -r requirements.txt
 
       - name: Render
diff --git a/.gitignore b/.gitignore
index 2f15f58..a7e1cc8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,4 +2,6 @@
 /_site/
 docs
 _freeze
-.jupyter_cache/
\ No newline at end of file
+.jupyter_cache/
+
+chapters/zilinoislung_with_celltypist/
\ No newline at end of file
diff --git a/_quarto.yml b/_quarto.yml
index b8b3a3a..2f16973 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -43,6 +43,8 @@ book:
         - chapters/experiments/extending_se.qmd
         - chapters/experiments/multiassay_expt.qmd
         - chapters/interop.qmd
+        - chapters/language_agnostic.qmd
+        - chapters/workflow.qmd
     - part: chapters/extras/index.qmd
       chapters:
         - chapters/extras/iranges.qmd
diff --git a/assets/data/zilinois-lung-subset.rds b/assets/data/zilinois-lung-subset.rds
new file mode 100644
index 0000000..812b201
Binary files /dev/null and b/assets/data/zilinois-lung-subset.rds differ
diff --git a/chapters/interop.qmd b/chapters/interop.qmd
index bf54cc8..98397de 100644
--- a/chapters/interop.qmd
+++ b/chapters/interop.qmd
@@ -1,4 +1,4 @@
-# Interop with R
+# Interop with RDS files
 
 The [rds2py](https://github.com/BiocPy/rds2py) package serves as a Python interface to the [rds2cpp](https://github.com/LTLA/rds2cpp) library, enabling direct reading of RDS files within Python.
 This eliminates the need for additional data conversion tools or intermediate formats, streamlining the transition between Python and R for seamless analysis.
diff --git a/chapters/language_agnostic.qmd b/chapters/language_agnostic.qmd
new file mode 100644
index 0000000..9cc1968
--- /dev/null
+++ b/chapters/language_agnostic.qmd
@@ -0,0 +1,36 @@
+# Language-agnostic genomic data store
+
+In this section, we illustrate a workflow that uses language-agnostic representations for storing genomic data, so that datasets and analysis results can be accessed from multiple programming frameworks such as R and Python. The [ArtifactDB](https://github.com/artifactdb) framework provides this functionality.
+
+To begin, we will download the "zilionis lung" dataset from the [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) package. We then store this dataset in a language-agnostic format using the [alabaster suite](https://github.com/ArtifactDB/alabaster.base) of R packages.
+
+```r
+library(scRNAseq)
+library(alabaster)
+
+sce <- ZilionisLungData()
+saveObject(sce, path=paste(getwd(), "zilinoislung", sep="/"))
+```
+
+:::{.callout-note}
+Alternatively, you can save this dataset as an RDS object for access in Python. Refer to the [interop with RDS files](./interop.qmd) section for more details.
+:::
+
+We can now load this dataset in Python using the [dolomite suite](https://github.com/ArtifactDB/dolomite-base) of Python packages. Both dolomite and alabaster are parts of the ArtifactDB ecosystem, designed to read and write artifacts stored in language-agnostic formats.
+
+```python
+from dolomite_base import read_object
+
+data = read_object("./zilinoislung")
+print(data)
+```
+
+You can now convert this object to an `AnnData` representation for downstream analysis.
+
+```python
+adata, _ = data.to_anndata()
+```
+
+:::{.callout-note}
+Check out the [ArtifactDB](https://github.com/artifactdb) framework for more information.
+:::
\ No newline at end of file
diff --git a/chapters/workflow.qmd b/chapters/workflow.qmd
new file mode 100644
index 0000000..886285b
--- /dev/null
+++ b/chapters/workflow.qmd
@@ -0,0 +1,95 @@
+# Seamless analysis workflow
+
+In this section, we illustrate a workflow that either stores genomic data in language-agnostic representations or reads RDS files directly in Python, giving seamless access to datasets and analysis results.
+
+:::{.callout-note}
+Check out
+
+- the [interop with RDS files](./interop.qmd) section for reading RDS files directly in Python or
+- the [language-agnostic](./language_agnostic.qmd) section for storing genomic data in language-agnostic representations
+:::
+
+To begin, we will download the "zilionis lung" dataset from the [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) package. We then save a subset of the first 2,000 cells as an RDS file so that it can be read directly in Python.
+
+```r
+library(scRNAseq)
+
+sce <- ZilionisLungData()
+sub <- sce[,1:2000]
+saveRDS(sub, "../assets/data/zilinois-lung-subset.rds")
+```
+
+To demonstrate this workflow, we will use the [CellTypist](https://github.com/Teichlab/celltypist) model to annotate cell types for this dataset. CellTypist operates on an `AnnData` representation.
+
+```{python}
+from rds2py import read_rds, as_summarized_experiment
+import numpy as np
+
+r_object = read_rds("../assets/data/zilinois-lung-subset.rds")
+sce = as_summarized_experiment(r_object)
+adata, _ = sce.to_anndata()
+adata.X = np.log1p(adata.layers["counts"])
+adata.var.index = adata.var["genes"].tolist()
+print(adata)
+```
+
+Before annotation, let's download the "human lung atlas" model from CellTypist.
+
+```{python}
+import celltypist
+from celltypist import models
+
+models.download_models()
+model_name = "Human_Lung_Atlas.pkl"
+model = models.Model.load(model = model_name)
+print(model)
+```
+
+Now, let's annotate our dataset.
+
+```{python}
+predictions = celltypist.annotate(adata, model = model_name, majority_voting = True)
+print(predictions.predicted_labels)
+```
+
+:::{.callout-note}
+The CellTypist workflow is based on the tutorial described [here](https://colab.research.google.com/github/Teichlab/celltypist/blob/main/docs/notebook/celltypist_tutorial.ipynb#scrollTo=postal-chicken).
+:::
+
+Next, let's retrieve the `AnnData` object with the predicted labels embedded into the `obs` dataframe.
+
+```{python}
+adata = predictions.to_adata()
+adata
+```
+
+We can now reverse the workflow and save this object in an ArtifactDB format from Python. However, the object first needs to be converted to a `SingleCellExperiment` class. Read more about our experiment representations [here](./experiments/singlecell_expt.qmd).
+
+```{python}
+from singlecellexperiment import SingleCellExperiment
+
+sce = SingleCellExperiment.from_anndata(adata)
+print(sce)
+```
+
+We use the dolomite packages to save it in a language-agnostic format.
+```{python}
+import dolomite_base
+import dolomite_sce
+
+dolomite_base.save_object(sce, "./zilinoislung_with_celltypist")
+```
+
+Finally, read the object back in R.
+```r
+sce_with_celltypist <- readObject(path=paste(getwd(), "zilinoislung_with_celltypist", sep="/"))
+sce_with_celltypist
+```
+
+And that concludes the workflow. Using the generic **read** functions `readObject` (R) and `read_object` (Python), along with the **save** functions `saveObject` (R) and `save_object` (Python), you can store most Bioconductor objects in language-agnostic formats and move them between the two languages.
+
+----
+
+## Further reading
+
+- ArtifactDB GitHub organization - [https://github.com/ArtifactDB](https://github.com/ArtifactDB).
\ No newline at end of file
diff --git a/requirements.txt b/requirements.txt
index 1896194..2d870c7 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -10,6 +10,7 @@ singler
 numpy
 scipy
 pandas
+jupyter
 jupyter-cache
 rich
 jupyterlab
@@ -20,5 +21,7 @@ anndata
 mudata
 delayedarray[dask]
 joblib
-dolomite
-hdf5array
\ No newline at end of file
+dolomite_mae
+dolomite_sce
+hdf5array
+celltypist
\ No newline at end of file