Add section on language-agnostic representations (#7)

BiocPy · Feb 24, 2024 · 5ab8303
1 parent 5a8b887
Showing 8 changed files with 146 additions and 6 deletions.
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -15,7 +15,7 @@ jobs:
       contents: write
     steps:
       - name: Check out repository
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
 
       - name: Set up Quarto
         uses: quarto-dev/quarto-actions/setup@v2
@@ -27,7 +27,9 @@ jobs:
         with:
           python-version: '3.9'
           cache: 'pip'
-      - run: pip install jupyter
+      # - run: pip install uv
+      # - run: uv venv
+      # - run: source .venv/bin/activate
       - run: pip install -r requirements.txt
 
       - name: Render

diff --git a/.gitignore b/.gitignore
@@ -2,4 +2,6 @@
 /_site/
 docs
 _freeze
-.jupyter_cache/
+.jupyter_cache/
+
+chapters/zilinoislung_with_celltypist/
diff --git a/_quarto.yml b/_quarto.yml
@@ -43,6 +43,8 @@ book:
         - chapters/experiments/extending_se.qmd
         - chapters/experiments/multiassay_expt.qmd
     - chapters/interop.qmd
+    - chapters/language_agnostic.qmd
+    - chapters/workflow.qmd
     - part: chapters/extras/index.qmd
       chapters:
         - chapters/extras/iranges.qmd

diff --git a/assets/data/zilinois-lung-subset.rds b/assets/data/zilinois-lung-subset.rds
diff --git a/chapters/interop.qmd b/chapters/interop.qmd
@@ -1,4 +1,4 @@
-# Interop with R
+# Interop with RDS files
 
 The [rds2py](https://github.com/BiocPy/rds2py) package serves as a Python interface to the [rds2cpp](https://github.com/LTLA/rds2cpp) library, enabling direct reading of RDS files within Python. This eliminates the need for additional data conversion tools or intermediate formats, streamlining the transition between Python and R for seamless analysis.
 

diff --git a/chapters/language_agnostic.qmd b/chapters/language_agnostic.qmd
@@ -0,0 +1,36 @@
+# Language-agnostic genomic data store
+
+In this section, we will illustrate a workflow that utilizes language-agnostic representations for storing genomic data, facilitating seamless access to datasets and analysis results across multiple programming frameworks such as R and Python. The [ArtifactDB](https://github.com/artifactdb) framework provides this functionality.
+
+To begin, we will download the "zilionis lung" dataset from the [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) package. Subsequently, we will store this dataset in a language-agnostic format using the [alabaster suite](https://github.com/ArtifactDB/alabaster.base) of R packages.
+
+```r
+library(scRNAseq)
+library(alabaster)
+
+sce <- ZilionisLungData()
+saveObject(sce, path=paste(getwd(), "zilinoislung", sep="/"))
+```
+
+:::{.callout-note}
+Additionally, you can save this dataset as an RDS object for access in Python. Refer to the [interop with RDS files](./interop.qmd) section for more details.
+:::
+
+We can now load this dataset in Python using the [dolomite suite](https://github.com/ArtifactDB/dolomite-base) of Python packages. Both dolomite and alabaster are integral parts of the ArtifactDB ecosystem designed to read artifacts stored in language-agnostic formats.
+
+```python
+from dolomite_base import read_object
+
+data = read_object("./zilinoislung")
+print(data)
+```
+
+You can now convert this object to an `AnnData` representation for downstream analysis.
+
+```python
+adata, _ = data.to_anndata()
+```
+
+:::{.callout-note}
+Check out the [ArtifactDB](https://github.com/artifactdb) framework for more information.
+:::
diff --git a/chapters/workflow.qmd b/chapters/workflow.qmd
@@ -0,0 +1,95 @@
+# Seamless analysis workflow
+
+In this section, we will illustrate a workflow that uses either language-agnostic representations for storing genomic data or direct reading of RDS files in Python, to facilitate seamless access to datasets and analysis results.
+
+:::{.callout-note}
+Check out 
+
+- the [interop with RDS files](./interop.qmd) section for reading RDS files directly in Python or
+- the [language-agnostic representations](./language_agnostic.qmd) section for storing genomic data
+:::
+
+To begin, we will download the "zilionis lung" dataset from the [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) package. Subsequently, we will subset it to the first 2,000 cells and save the subset as an RDS file.
+
+```r
+library(scRNAseq)
+
+sce <- ZilionisLungData()
+sub <- sce[,1:2000]
+saveRDS(sub, "../assets/data/zilinois-lung-subset.rds")
+```
+
+To demonstrate this workflow, we will employ the [CellTypist](https://github.com/Teichlab/celltypist) model to annotate cell types for this dataset. CellTypist operates on an AnnData representation.
+
+```{python}
+from rds2py import read_rds, as_summarized_experiment
+import numpy as np
+
+r_object = read_rds("../assets/data/zilinois-lung-subset.rds")
+sce = as_summarized_experiment(r_object)
+adata, _ = sce.to_anndata()
+adata.X = np.log1p(adata.layers["counts"])
+adata.var.index = adata.var["genes"].tolist()
+print(adata)
+```
+
+Before annotation, let's download the "human lung atlas" model from CellTypist.
+
+```{python}
+import celltypist
+from celltypist import models
+
+models.download_models()
+model_name = "Human_Lung_Atlas.pkl"
+model = models.Model.load(model = model_name)
+print(model)
+```
+
+Now, let's annotate our dataset.
+
+```{python}
+predictions = celltypist.annotate(adata, model = model_name, majority_voting = True)
+print(predictions.predicted_labels)
+```
+
+:::{.callout-note}
+The CellTypist workflow is based on the tutorial described [here](https://colab.research.google.com/github/Teichlab/celltypist/blob/main/docs/notebook/celltypist_tutorial.ipynb#scrollTo=postal-chicken).
+:::
+
+Next, let's retrieve the `AnnData` object with the predicted labels embedded into the `obs` dataframe.
+
+```{python}
+adata = predictions.to_adata()
+adata
+```
+
+We can now reverse the workflow and save this object in an ArtifactDB format from Python. However, the object first needs to be converted to a `SingleCellExperiment`. Read more about our experiment representations [here](./experiments/singlecell_expt.qmd).
+
+```{python}
+from singlecellexperiment import SingleCellExperiment
+
+sce = SingleCellExperiment.from_anndata(adata)
+print(sce)
+```
+
+We use the dolomite package to save it into a language-agnostic format.
+```{python}
+import dolomite_base
+import dolomite_sce
+
+dolomite_base.save_object(sce, "./zilinoislung_with_celltypist")
+```
+
+Finally, read the object back in R.
+```r
+sce_with_celltypist = readObject(path=paste(getwd(), "zilinoislung_with_celltypist", sep="/"))
+sce_with_celltypist
+```
+
+And that concludes the workflow. Leveraging the generic **read** functions `readObject` (R) and `read_object` (Python), along with the **save** functions `saveObject` (R) and `save_object` (Python), you can seamlessly store most Bioconductor objects in language-agnostic formats.
+
+----
+
+## Further reading
+
+- ArtifactDB GitHub organization - [https://github.com/ArtifactDB](https://github.com/ArtifactDB).
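Note on the normalization step in `chapters/workflow.qmd`: the chunk sets `adata.X = np.log1p(adata.layers["counts"])`, which is log1p of the raw counts. As far as I know, CellTypist expects expression that is log1p-normalized to 10,000 counts per cell, so scaling each cell by its library size before `log1p` may be worth considering. Below is a minimal NumPy sketch under the assumption of a dense cells-by-genes count matrix; `normalize_log1p`, `target_sum`, and the toy matrix are hypothetical names used for illustration and are not part of this commit.

```python
import numpy as np


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p.

    Assumes a dense cells-by-genes array; a sparse layer would need to be
    converted (or handled with a sparse-aware routine) first.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for empty cells; their rows stay all-zero.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * target_sum)


# Toy cells-by-genes matrix standing in for adata.layers["counts"].
toy_counts = np.array([[1, 3, 0], [0, 0, 0], [10, 10, 20]])
normalized = normalize_log1p(toy_counts)
```

If the layer is dense, the chapter could then use `adata.X = normalize_log1p(adata.layers["counts"])`. Scanpy's `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)` is an alternative, at the cost of an extra dependency.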
diff --git a/requirements.txt b/requirements.txt
@@ -10,6 +10,7 @@ singler
 numpy
 scipy
 pandas
+jupyter
 jupyter-cache
 rich
 jupyterlab
@@ -20,5 +21,7 @@ anndata
 mudata
 delayedarray[dask]
 joblib
-dolomite
-hdf5array
+dolomite_mae
+dolomite_sce
+hdf5array
+celltypist