Add section on tiledb-backed arrays, more edits (#9)

BiocPy · Feb 29, 2024 · 8cecd78
1 parent 64f87cb
commit 8cecd78
Showing 7 changed files with 95 additions and 18 deletions.
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,4 @@ _freeze
 
 chapters/zilinois_lung_with_celltypist/
 *whee.h5
+*.tiledb
diff --git a/chapters/language_agnostic.qmd b/chapters/language_agnostic.qmd
@@ -31,6 +31,10 @@ You can now convert this to `AnnData` representations for downstream analysis.
 adata = data.to_anndata()
 ```
 
+:::{.callout-important}
+Using the generic **read** functions `readObject` (R) and `read_object` (Python), together with the **save** functions `saveObject` (R) and `save_object` (Python), you can store most Bioconductor objects in language-agnostic formats and load them back in either language.
+:::
+
 ## Further reading
 
 - Check out [ArtifactDB](https://github.com/artifactdb) framework for more information.
diff --git a/chapters/representations/biocframe.qmd b/chapters/representations/biocframe.qmd
@@ -18,7 +18,7 @@ pip install biocframe
 
 ## Advantages of `BiocFrame`
 
-One of the core principles guiding the implementation of the `BiocFrame` class is "what you put is what you get." Unlike Pandas `DataFrame`, `BiocFrame` makes no assumptions about the types of the columns provided as input. Some key differences to highlight the advantages of using `BiocFrame` are especially in terms of modifications to column types and handling nested dataframes.
+One of the core principles guiding the implementation of the `BiocFrame` class is "**_what you put is what you get_**". Unlike Pandas `DataFrame`, `BiocFrame` makes no assumptions about the types of the columns provided as input. Some key differences to highlight the advantages of using `BiocFrame` are especially in terms of modifications to column types and handling nested dataframes.
 
 ### Inadvertent modification of types
 

diff --git a/chapters/representations/file_backed_arrays.qmd b/chapters/representations/file_backed_arrays.qmd
@@ -14,10 +14,10 @@ These classes follow a functional paradigm for accessing or setting properties,
 This package is published to [PyPI](https://pypi.org/project/delayedarray/) and can be installed via the usual methods:
 
 ```bash
-pip install hdf5array
+pip install hdf5array tiledbarray
 ```
 
-## Quick start
+## HDF5-backed arrays
 
 Let's mock up a dense array:
 
@@ -51,7 +51,7 @@ transformed
 Check out the [documentation](https://biocpy.github.io/hdf5array/) for more details.
 :::
 
-## Handling sparse matrices
+### Handling sparse matrices
 
 We support a variety of compressed sparse formats where the non-zero elements are held inside three separate datasets -
 usually `data`, `indices` and `indptr`, based on the [10X Genomics sparse HDF5 format](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/h5_matrices).
@@ -86,6 +86,71 @@ arr = hdf5array.Hdf5CompressedSparseMatrix(
 arr
 ```
 
+
+## TileDB-backed arrays
+
+Let's mock up a dense array:
+
+```{python}
+import numpy
+import tiledb
+
+data = numpy.random.rand(40, 50)
+tiledb.from_numpy("dense.tiledb", data)
+```
+
+We can now represent it as a `TileDbArray`:
+
+```{python}
+import tiledbarray
+arr = tiledbarray.TileDbArray("dense.tiledb", attribute_name="")
+```
+
+This is just a subclass of a `DelayedArray` and can be used anywhere in the BiocPy framework.
+Parts of the NumPy API are also supported - for example, we could apply a variety of delayed operations:
+
+```{python}
+scaling = numpy.random.rand(50)  # one factor per column of the 40 x 50 array
+transformed = numpy.log1p(arr / scaling)
+transformed
+```
+
+Check out the [documentation](https://biocpy.github.io/tiledbarray/) for more details.
+
+### Sparse matrices
+
+We can perform similar operations on a sparse matrix as well. Let's mock up a sparse matrix and store it as a TileDB array.
+
+```{python}
+dir_path = "sparse_array.tiledb"
+dom = tiledb.Domain(
+     tiledb.Dim(name="rows", domain=(0, 4), tile=4, dtype=numpy.int32),
+     tiledb.Dim(name="cols", domain=(0, 4), tile=4, dtype=numpy.int32),
+)
+schema = tiledb.ArraySchema(
+     domain=dom, sparse=True, attrs=[tiledb.Attr(name="", dtype=numpy.int32)]
+)
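+tiledb.SparseArray.create(dir_path, schema)
+
+# Write three non-zero values at coordinates (1, 1), (2, 4) and (2, 3).
+i, j = [1, 2, 2], [1, 4, 3]
+data = numpy.array([1, 2, 3], dtype=numpy.int32)  # match the attribute dtype
+with tiledb.SparseArray(dir_path, mode="w") as tdb:
+    tdb[i, j] = data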
+```
+
+We can now represent this as a `TileDbArray`:
+
+```{python}
+import tiledbarray
+arr = tiledbarray.TileDbArray(dir_path, attribute_name="")
+
+slices = (slice(0, 2), [2, 3])
+
+import delayedarray
+subset = delayedarray.extract_sparse_array(arr, (*slices,))
+subset
+```
+
 ----
 
 ## Further reading
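
Note on the delayed operations in the TileDB section of `file_backed_arrays.qmd`: the sketch below illustrates the general idea of deferring element-wise work until a block of values is requested. It uses only the Python standard library. `LazyMatrix`, `divide_columns` and `extract` are hypothetical names invented for this note, and nothing here should be read as how `delayedarray` or `tiledbarray` are actually implemented.

```python
import math


class LazyMatrix:
    """Hypothetical stand-in for a delayed array: a seed plus pending functions."""

    def __init__(self, seed, pending=()):
        self.seed = seed                # nested lists standing in for on-disk data
        self.pending = tuple(pending)   # functions (value, column) -> value

    @property
    def shape(self):
        return (len(self.seed), len(self.seed[0]))

    def divide_columns(self, scaling):
        # One factor per column, in the spirit of `arr / scaling` in the chapter.
        if len(scaling) != self.shape[1]:
            raise ValueError("scaling length must match the number of columns")
        factors = list(scaling)
        return LazyMatrix(self.seed, self.pending + (lambda v, c: v / factors[c],))

    def log1p(self):
        return LazyMatrix(self.seed, self.pending + (lambda v, c: math.log1p(v),))

    def extract(self, rows, cols):
        # Only the requested block is read and transformed.
        out = []
        for r in rows:
            row = []
            for c in cols:
                value = self.seed[r][c]
                for fn in self.pending:
                    value = fn(value, c)
                row.append(value)
            out.append(row)
        return out


seed = [[1.0, 4.0, 9.0], [2.0, 8.0, 18.0]]
lazy = LazyMatrix(seed).divide_columns([1.0, 2.0, 3.0]).log1p()  # nothing computed yet
block = lazy.extract(rows=[1], cols=[0, 2])                      # work happens here
```

In this sketch the scaling vector must have one entry per column, which is the same constraint NumPy broadcasting places on dividing a two-dimensional array by a one-dimensional one.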

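Note on the sparse example in `file_backed_arrays.qmd`: the standard-library sketch below shows what subsetting coordinate-format data by `(slice(0, 2), [2, 3])` amounts to, assuming ordinary Python slice semantics and the 5 x 5 domain from the schema. `subset_coo` is a hypothetical helper written for this note; it is not part of `delayedarray` or `tiledbarray`.

```python
def subset_coo(rows, cols, values, row_sel, col_sel, shape):
    """Return a list of (new_row, new_col, value) triplets for the selected block."""

    def positions(sel, extent):
        # Map each selected original index to its position in the subset.
        idx = range(*sel.indices(extent)) if isinstance(sel, slice) else sel
        return {original: new for new, original in enumerate(idx)}

    row_pos = positions(row_sel, shape[0])
    col_pos = positions(col_sel, shape[1])
    return [
        (row_pos[r], col_pos[c], v)
        for r, c, v in zip(rows, cols, values)
        if r in row_pos and c in col_pos
    ]


# Same coordinates and values as the chapter's sparse example.
i, j, data = [1, 2, 2], [1, 4, 3], [1, 2, 3]

# Rows 0-1 and columns 2-3, as in the chapter: no stored value falls inside.
chapter_block = subset_coo(i, j, data, slice(0, 2), [2, 3], shape=(5, 5))

# Rows 1-2 and columns 3-4 would capture two of the three values.
other_block = subset_coo(i, j, data, slice(1, 3), [3, 4], shape=(5, 5))
```

Under these assumptions the selection used in the chapter contains only zeros, so a selection such as rows 1-2 and columns 3-4 may make for a more informative demonstration.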
diff --git a/chapters/workflow.qmd b/chapters/workflow.qmd
@@ -2,6 +2,8 @@
 
 In this section, we will illustrate a workflow that utilizes either language-agnostic representations for storing genomic data or reading RDS files directly in Python, to facilitate seamless access to datasets and analysis results.
 
+## Load dataset
+
 :::{.callout-note}
 Check out 
 
@@ -33,6 +35,8 @@ adata.var.index = adata.var["genes"].tolist()
 print(adata)
 ```
 
+## Download ML models
+
 Before inferring cell types, let's download the "human lung atlas" model from CellTypist.
 
 ```{python}
@@ -63,6 +67,8 @@ adata = predictions.to_adata()
 adata
 ```
 
+## Save results
+
 We can now reverse the workflow and save this object into an Artifactdb format from Python. However, the object needs to be converted to a `SingleCellExperiment` class first. Read more about our experiment representations [here](./experiments/single_cell_experiment.qmd).
 
 ```{python}
@@ -82,7 +88,7 @@ dolomite_base.save_object(sce, "./zilinois_lung_with_celltypist")
 
 Finally, read the object back in R.
 ```r
-sce_with_celltypist = readObject(path=paste(getwd(), "zilinois_lung_with_celltypist", sep="/"))
+sce_with_celltypist <- readObject(path=paste(getwd(), "zilinois_lung_with_celltypist", sep="/"))
 sce_with_celltypist
 ```
 

diff --git a/index.qmd b/index.qmd
@@ -1,12 +1,12 @@
-# Welcome {.unnumbered}
+# Welcome
 
 [Bioconductor](https://www.bioconductor.org) is an open-source software project 
 that provides tools for the analysis and comprehension of genomic data. 
 One of the main advantages of Bioconductor is the availability of 
-standard data representations and large number of analysis tools for genomic 
-experiments. 
-These tools allow researchers to efficiently store, manipulate, and analyze
-their data across multiple tools and workflows.
+standard data representations and a large number of analysis tools tailored 
+for genomic experiments. 
+These tools allow researchers to seamlessly store, manipulate, and analyze 
+data across multiple tools and workflows in R.
 
 Inspired by Bioconductor, [BiocPy](https://github.com/BiocPy) aims to facilitate 
 bioinformatics workflows in Python. 
@@ -23,15 +23,12 @@ like [SummarizedExperiment](https://github.com/BiocPy/SummarizedExperiment),
 [MultiAssayExperiment](https://github.com/BiocPy/MultiAssayExperiment) represent 
 single or multi-omic experimental data and metadata.
 
-Moreover, BiocPy introduces a range of data type classes designed to support the 
-representation of atomic entities, including float, string, int lists, and named lists. 
+Moreover, BiocPy introduces a diverse range of data type classes designed to support the 
+representation of atomic entities, including *float*, *string*, *int* lists, and named lists. 
 These generics and utilities are provided through [BiocUtils](https://github.com/BiocPy/BiocUtils) 
 package and the delayed and file-backed array operations in the
-[DelayedArray](https://github.com/BiocPy/DelayedArray) package. 
-While there have been previous efforts to port bioconductor representations into 
-Python e.g. [AnnData](https://github.com/scverse/anndata), 
-[PyRanges](https://github.com/pyranges/pyranges), these efforts are fragmented and 
-have limited interoperability.
+[DelayedArray](https://github.com/BiocPy/DelayedArray) package and its derivatives 
+([HDF5Array](https://github.com/BiocPy/HDF5array), [TileDbArray](https://github.com/BiocPy/tiledbarray)). 
 To our knowledge, BiocPy is the first Python framework to provide seamless, well-integrated data 
 structures and representations for genomic data analysis.
 
@@ -43,6 +40,8 @@ the requirement for additional data conversion tools or intermediate formats.
 The package's functionality streamlines the transition between Python and R, 
 facilitating seamless analysis.
 
+Although not covered by this tutorial, BiocPy provides bindings to [libscran](https://github.com/LTLA/libscran) and various other single-cell analysis methods incorporated into the [scranpy](https://github.com/BiocPy/scranpy) package to support analysis of multi-modal single-cell datasets. It also features integration with the [singleR](https://github.com/BiocPy/singler) algorithm to annotate cell types by matching cells to known references based on their expression profiles.
+
 All packages within the BiocPy ecosystem are published to 
 [Python's Package Index (PyPI)](https://pypi.org/).
 
@@ -61,6 +60,7 @@ For complete list of all packages, please visit the
 - `MultiAssayExperiment` ([GitHub](https://github.com/BiocPy/MultiAssayExperiment), [Docs](https://biocpy.github.io/MultiAssayExperiment/)): Container class to represent multiple experiments and assays performed over a set of samples. follows Bioconductor's [MAE R/Bioc Package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html).
 
 #### Analysis packages
+
 - `scranpy`([GitHub](https://github.com/BiocPy/scranpy), [Docs](https://biocpy.github.io/scranpy/)): Python bindings to the single-cell analysis methods from **libscran** and related C++ libraries.
 - `singler`([GitHub](https://github.com/BiocPy/singler), [Docs](https://biocpy.github.io/singler/)): Python bindings to the **singleR** algorithm to annotate cell types from known references.
 
@@ -78,7 +78,7 @@ For complete list of all packages, please visit the
 
 ## Further reading
 
-Many online resources offer detailed information on these data structures, namely:
+Many online resources offer detailed information on Bioconductor data structures, namely:
 
 - [https://compgenomr.github.io/book/](https://compgenomr.github.io/book/)
 - [https://www.nature.com/articles/nmeth.3252](https://www.nature.com/articles/nmeth.3252)

diff --git a/requirements.txt b/requirements.txt
@@ -24,4 +24,5 @@ joblib
 dolomite_mae
 dolomite_sce
 hdf5array
+tiledbarray
 celltypist