analysis workflows #7

Merged
merged 4 commits into from
Feb 24, 2024
6 changes: 4 additions & 2 deletions .github/workflows/publish.yml
Original file line number Diff line number Diff line change
@@ -15,7 +15,7 @@ jobs:
contents: write
steps:
- name: Check out repository
uses: actions/checkout@v3
uses: actions/checkout@v4

- name: Set up Quarto
uses: quarto-dev/quarto-actions/setup@v2
@@ -27,7 +27,9 @@ jobs:
with:
python-version: '3.9'
cache: 'pip'
- run: pip install jupyter
# - run: pip install uv
# - run: uv venv
# - run: source .venv/bin/activate
- run: pip install -r requirements.txt

- name: Render
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,4 +2,6 @@
/_site/
docs
_freeze
.jupyter_cache/
.jupyter_cache/

chapters/zilinoislung_with_celltypist/
2 changes: 2 additions & 0 deletions _quarto.yml
@@ -43,6 +43,8 @@ book:
- chapters/experiments/extending_se.qmd
- chapters/experiments/multiassay_expt.qmd
- chapters/interop.qmd
- chapters/language_agnostic.qmd
- chapters/workflow.qmd
- part: chapters/extras/index.qmd
chapters:
- chapters/extras/iranges.qmd
Binary file added assets/data/zilinois-lung-subset.rds
Binary file not shown.
2 changes: 1 addition & 1 deletion chapters/interop.qmd
@@ -1,4 +1,4 @@
# Interop with R
# Interop with RDS files

The [rds2py](https://github.com/BiocPy/rds2py) package serves as a Python interface to the [rds2cpp](https://github.com/LTLA/rds2cpp) library, enabling direct reading of RDS files within Python. This eliminates the need for additional data conversion tools or intermediate formats, streamlining the transition between Python and R for seamless analysis.

36 changes: 36 additions & 0 deletions chapters/language_agnostic.qmd
@@ -0,0 +1,36 @@
# Language-agnostic genomic data store

In this section, we will illustrate a workflow that utilizes language-agnostic representations for storing genomic data, facilitating seamless access to datasets and analysis results across multiple programming frameworks such as R and Python. The [ArtifactDB](https://github.com/artifactdb) framework provides this functionality.

To begin, we will download the "zilionis lung" dataset from the [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) package. Subsequently, we will store this dataset in a language-agnostic format using the [alabaster suite](https://github.com/ArtifactDB/alabaster.base) of R packages.

```r
library(scRNAseq)
library(alabaster)

sce <- ZilionisLungData()
saveObject(sce, path=paste(getwd(), "zilinoislung", sep="/"))
```

:::{.callout-note}
Alternatively, you can save this dataset as an RDS object and read it directly in Python. Refer to the [interop with R](./interop.qmd) section for more details.
:::

We can now load this dataset in Python using the [dolomite suite](https://github.com/ArtifactDB/dolomite-base) of Python packages. Both dolomite and alabaster are integral parts of the ArtifactDB ecosystem designed to read artifacts stored in language-agnostic formats.

```python
from dolomite_base import read_object

data = read_object("./zilinoislung")
print(data)
```
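
To get a feel for what was written to disk, it can help to list the files inside the saved directory. The snippet below is a minimal sketch that uses only the Python standard library; `list_artifact_files` is a hypothetical helper (not part of dolomite or alabaster), and it makes no assumption about which files alabaster creates.

```python
from pathlib import Path


def list_artifact_files(root):
    """Return (relative_path, size_in_bytes) for every file under root, sorted by path."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file()]
    return sorted((p.relative_to(root).as_posix(), p.stat().st_size) for p in files)


# "./zilinoislung" is the directory written by saveObject() in the R snippet.
target = Path("./zilinoislung")
if target.is_dir():
    for rel_path, size in list_artifact_files(target):
        print(f"{size:>12d}  {rel_path}")
else:
    print(f"{target} not found; run the R snippet first")
```

The same listing should work on a directory written by `save_object` from Python, since it only walks the file system.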

You can now convert this to an `AnnData` representation for downstream analysis.

```python
# to_anndata() is unpacked as a tuple here, matching its use in the workflow chapter
adata, _ = data.to_anndata()
```

:::{.callout-note}
Check out [ArtifactDB](https://github.com/artifactdb) framework for more information.
:::
95 changes: 95 additions & 0 deletions chapters/workflow.qmd
@@ -0,0 +1,95 @@
# Seamless analysis workflow

In this section, we will illustrate a workflow that either uses language-agnostic representations for storing genomic data or reads RDS files directly in Python, to facilitate seamless access to datasets and analysis results.

:::{.callout-note}
Check out

- the [interop with R](./interop.qmd) section for reading RDS files directly in Python or
- the [language agnostic](./language_agnostic.qmd) representations for storing genomic data
:::

To begin, we will download the "zilionis lung" dataset from the [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) package. Subsequently, we will save the first 2,000 cells of this dataset as an RDS file, which we then read directly in Python.

```r
library(scRNAseq)

sce <- ZilionisLungData()
sub <- sce[,1:2000]
saveRDS(sub, "../assets/data/zilinois-lung-subset.rds")
```

To demonstrate this workflow, we will employ the [CellTypist](https://github.com/Teichlab/celltypist) model to annotate cell types for this dataset. CellTypist operates on an AnnData representation.

```{python}
from rds2py import read_rds, as_summarized_experiment
import numpy as np

r_object = read_rds("../assets/data/zilinois-lung-subset.rds")
sce = as_summarized_experiment(r_object)
adata, _ = sce.to_anndata()
adata.X = np.log1p(adata.layers["counts"])
adata.var.index = adata.var["genes"].tolist()
print(adata)
```
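
The block above applies `np.log1p` directly to the raw counts. CellTypist's documentation describes its expected input as log1p-normalized expression scaled to 10,000 counts per cell, so depending on the model you may want to scale each cell before the log transform. The following is a minimal NumPy sketch under that assumption; `normalize_log1p` is a hypothetical helper, it assumes a dense cells-by-genes array, and it is not part of rds2py or CellTypist.

```python
import numpy as np


def normalize_log1p(counts, target_sum=10_000.0):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for cells with no counts; those rows stay all-zero.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * target_sum)


# Toy matrix: 3 cells (rows) x 4 genes (columns).
toy_counts = np.array([[1, 0, 3, 6], [0, 0, 0, 0], [5, 5, 0, 0]])
toy_norm = normalize_log1p(toy_counts)
print(toy_norm.round(3))
```

If the `counts` layer is stored as a sparse matrix, it would need to be converted (or the scaling done with sparse operations) before a helper like this applies.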

Before annotation, let's download the "Human Lung Atlas" model from CellTypist.

```{python}
import celltypist
from celltypist import models

models.download_models()
model_name = "Human_Lung_Atlas.pkl"
model = models.Model.load(model = model_name)
print(model)
```

Now, let's annotate our dataset.

```{python}
predictions = celltypist.annotate(adata, model = model_name, majority_voting = True)
print(predictions.predicted_labels)
```
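
The `majority_voting = True` option refines per-cell predictions by letting cells in the same cluster agree on a label. The sketch below illustrates only that voting idea with the standard library; the cluster assignments are made up, `majority_vote` is a hypothetical helper, and CellTypist's own procedure (how it clusters cells and any thresholds it applies) is not reproduced here.

```python
from collections import Counter, defaultdict


def majority_vote(labels, clusters):
    """Replace each cell's label with the most common label in its cluster."""
    by_cluster = defaultdict(list)
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster].append(label)
    # most_common(1) returns [(label, count)]; ties go to the label seen first.
    winner = {c: Counter(ls).most_common(1)[0][0] for c, ls in by_cluster.items()}
    return [winner[c] for c in clusters]


# Hypothetical per-cell predictions and cluster ids for six cells.
per_cell = ["T cell", "T cell", "B cell", "B cell", "B cell", "T cell"]
cluster_ids = [0, 0, 0, 1, 1, 1]
print(majority_vote(per_cell, cluster_ids))
```

In this toy input, one cell in each cluster disagrees with its neighbours and is relabelled to match the cluster's dominant label.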

:::{.callout-note}
The celltypist workflow is based on the tutorial described [here](https://colab.research.google.com/github/Teichlab/celltypist/blob/main/docs/notebook/celltypist_tutorial.ipynb#scrollTo=postal-chicken).
:::

Next, let's retrieve the `AnnData` object with the predicted labels embedded into the `obs` dataframe.

```{python}
adata = predictions.to_adata()
adata
```

We can now reverse the workflow and save this object into an ArtifactDB format from Python. However, the object needs to be converted to a `SingleCellExperiment` class first. Read more about our experiment representations [here](./experiments/singlecell_expt.qmd).

```{python}
from singlecellexperiment import SingleCellExperiment

sce = SingleCellExperiment.from_anndata(adata)
print(sce)
```

We use the dolomite package to save it into a language-agnostic format.

```{python}
import dolomite_base
import dolomite_sce

dolomite_base.save_object(sce, "./zilinoislung_with_celltypist")
```

Finally, read the object back in R.

```r
library(alabaster)  # provides readObject()
sce_with_celltypist <- readObject(path=paste(getwd(), "zilinoislung_with_celltypist", sep="/"))
sce_with_celltypist
```

And that concludes the workflow. Leveraging the generic **read** functions `readObject` (R) and `read_object` (Python), along with the **save** functions `saveObject` (R) and `save_object` (Python), you can seamlessly store most Bioconductor objects in language-agnostic formats.

----

## Further reading

- ArtifactDB GitHub organization - [https://github.com/ArtifactDB](https://github.com/ArtifactDB).
7 changes: 5 additions & 2 deletions requirements.txt
@@ -10,6 +10,7 @@ singler
numpy
scipy
pandas
jupyter
jupyter-cache
rich
jupyterlab
@@ -20,5 +21,7 @@ anndata
mudata
delayedarray[dask]
joblib
dolomite
hdf5array
dolomite_mae
dolomite_sce
hdf5array
celltypist