Add section on tiledb-backed arrays, more edits (#9)
jkanche authored Feb 29, 2024
1 parent 64f87cb commit 8cecd78
Showing 7 changed files with 95 additions and 18 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -6,3 +6,4 @@ _freeze

chapters/zilinois_lung_with_celltypist/
*whee.h5
*.tiledb
4 changes: 4 additions & 0 deletions chapters/language_agnostic.qmd
@@ -31,6 +31,10 @@ You can now convert this to `AnnData` representations for downstream analysis.
adata = data.to_anndata()
```

:::{.callout-important}
Leveraging the generic **read** functions `readObject` (R) and `read_object` (Python), along with the **save** functions `saveObject` (R) and `save_object` (Python), you can seamlessly store most Bioconductor objects in language-agnostic formats.
:::

## Further reading

- Check out the [ArtifactDB](https://github.com/artifactdb) framework for more information.
2 changes: 1 addition & 1 deletion chapters/representations/biocframe.qmd
@@ -18,7 +18,7 @@ pip install biocframe

## Advantages of `BiocFrame`

One of the core principles guiding the implementation of the `BiocFrame` class is "what you put is what you get." Unlike Pandas `DataFrame`, `BiocFrame` makes no assumptions about the types of the columns provided as input. Some key differences to highlight the advantages of using `BiocFrame` are especially in terms of modifications to column types and handling nested dataframes.
One of the core principles guiding the implementation of the `BiocFrame` class is "**_what you put is what you get_**". Unlike Pandas `DataFrame`, `BiocFrame` makes no assumptions about the types of the columns provided as input. Its main advantages over Pandas concern two areas: inadvertent modification of column types and the handling of nested dataframes.

### Inadvertent modification of types

71 changes: 68 additions & 3 deletions chapters/representations/file_backed_arrays.qmd
@@ -14,10 +14,10 @@ These classes follow a functional paradigm for accessing or setting properties,
This package is published to [PyPI](https://pypi.org/project/delayedarray/) and can be installed via the usual methods:

```bash
pip install hdf5array
pip install hdf5array tiledbarray
```

## Quick start
## HDF5-backed arrays

Let's mock up a dense array:

@@ -51,7 +51,7 @@ transformed
Check out the [documentation](https://biocpy.github.io/hdf5array/) for more details.
:::

## Handling sparse matrices
### Handling sparse matrices

We support a variety of compressed sparse formats where the non-zero elements are held inside three separate datasets -
usually `data`, `indices` and `indptr`, based on the [10X Genomics sparse HDF5 format](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/h5_matrices).
@@ -86,6 +86,71 @@ arr = hdf5array.Hdf5CompressedSparseMatrix(
arr
```


## TileDB-backed arrays

Let's mock up a dense array:

```{python}
import numpy
import tiledb
data = numpy.random.rand(40, 50)
tiledb.from_numpy("dense.tiledb", data)
```

We can now represent it as a `TileDbArray`:

```{python}
import tiledbarray
arr = tiledbarray.TileDbArray("dense.tiledb", attribute_name="")
```

This is just a subclass of a `DelayedArray` and can be used anywhere in the BiocPy framework.
Parts of the NumPy API are also supported - for example, we could apply a variety of delayed operations:

```{python}
scaling = numpy.random.rand(50)  # one factor per column of the (40, 50) array
transformed = numpy.log1p(arr / scaling)
transformed
```

Check out the [documentation](https://biocpy.github.io/tiledbarray/) for more details.
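
To make the idea of a "delayed operation" concrete, here is a toy sketch in plain NumPy. It is not how `DelayedArray` or `TileDbArray` is implemented; the `ToyDelayed` class and its `apply` and `extract` methods are hypothetical names invented for this illustration. The point is only that operations are recorded up front and replayed on whichever block is eventually requested.

```python
import numpy


class ToyDelayed:
    """Toy stand-in for a delayed array (hypothetical, for illustration only)."""

    def __init__(self, seed, ops=()):
        self.seed = seed       # anything supporting NumPy-style indexing
        self.ops = tuple(ops)  # functions replayed only when data is extracted

    @property
    def shape(self):
        return self.seed.shape

    def apply(self, func):
        # Record the operation; no data is read or computed here.
        return ToyDelayed(self.seed, self.ops + (func,))

    def extract(self, rows, cols):
        # Load only the requested block, then replay the recorded operations.
        block = numpy.asarray(self.seed[numpy.ix_(rows, cols)])
        for func in self.ops:
            block = func(block, rows, cols)
        return block


data = numpy.random.rand(40, 50)
scaling = numpy.random.rand(50) + 0.5  # one factor per column

delayed = (
    ToyDelayed(data)
    .apply(lambda block, rows, cols: block / scaling[cols])
    .apply(lambda block, rows, cols: numpy.log1p(block))
)

rows, cols = [0, 1, 2], [10, 20]
result = delayed.extract(rows, cols)

# Eager computation on the full array, for comparison.
expected = numpy.log1p(data / scaling)[numpy.ix_(rows, cols)]
```

In this sketch the per-column scaling vector must have one entry per column of the array, which is the same shape requirement NumPy broadcasting imposes on the eager version.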

### Sparse Matrices

We can perform similar operations on a sparse matrix as well. Let's mock up a sparse matrix and store it as a TileDB array.

```{python}
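dir_path = "sparse_array.tiledb"

# 5 x 5 domain (bounds are inclusive) with a single anonymous int32 attribute.
dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(0, 4), tile=4, dtype=numpy.int32),
    tiledb.Dim(name="cols", domain=(0, 4), tile=4, dtype=numpy.int32),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=True, attrs=[tiledb.Attr(name="", dtype=numpy.int32)]
)
tiledb.SparseArray.create(dir_path, schema)

# Write three non-zero values at (row, col) coordinates; the context manager
# closes the handle so the data is flushed before the array is read back.
i, j = [1, 2, 2], [1, 4, 3]
data = numpy.array([1, 2, 3], dtype=numpy.int32)
with tiledb.open(dir_path, mode="w") as tdb:
    tdb[i, j] = data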
```

We can now represent this as a `TileDbArray`:

```{python}
import tiledbarray
arr = tiledbarray.TileDbArray(dir_path, attribute_name="")
slices = (slice(0, 2), [2, 3])
import delayedarray
subset = delayedarray.extract_sparse_array(arr, (*slices,))
subset
```
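
As a sanity check on what the extracted block should hold, the same coordinates can be laid out in a plain NumPy array. This is only a reference computation under the assumption that the TileDB domain `(0, 4)` is inclusive (giving a 5 x 5 matrix); it does not call `tiledbarray` or `delayedarray` at all, and the variable names are chosen for this sketch.

```python
import numpy

# Same coordinates and values as the sparse example above.
i, j = [1, 2, 2], [1, 4, 3]
values = numpy.array([1, 2, 3], dtype=numpy.int32)

# Dense 5 x 5 reference matrix with the three non-zero entries filled in.
dense = numpy.zeros((5, 5), dtype=numpy.int32)
dense[i, j] = values

# Rows 0-1 and columns 2-3, mirroring `slices = (slice(0, 2), [2, 3])`.
expected_subset = dense[numpy.ix_([0, 1], [2, 3])]
print(expected_subset)
```

Under these assumptions none of the three stored values falls inside the requested rows and columns, so the reference block has shape 2 x 2 and contains only zeros.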

----

## Further reading
8 changes: 7 additions & 1 deletion chapters/workflow.qmd
@@ -2,6 +2,8 @@

In this section, we will illustrate a workflow that utilizes either language-agnostic representations for storing genomic data or reading RDS files directly in Python, to facilitate seamless access to datasets and analysis results.

## Load dataset

:::{.callout-note}
Check out

@@ -33,6 +35,8 @@ adata.var.index = adata.var["genes"].tolist()
print(adata)
```

## Download ML models

Before inferring cell types, let's download the "human lung atlas" model from CellTypist.

```{python}
@@ -63,6 +67,8 @@ adata = predictions.to_adata()
adata
```

## Save results

We can now reverse the workflow and save this object in an ArtifactDB format from Python. However, the object needs to be converted to a `SingleCellExperiment` class first. Read more about our experiment representations [here](./experiments/single_cell_experiment.qmd).

```{python}
@@ -82,7 +88,7 @@ dolomite_base.save_object(sce, "./zilinois_lung_with_celltypist")

Finally, read the object back in R.
```r
sce_with_celltypist = readObject(path=paste(getwd(), "zilinois_lung_with_celltypist", sep="/"))
sce_with_celltypist <- readObject(path=paste(getwd(), "zilinois_lung_with_celltypist", sep="/"))
sce_with_celltypist
```

26 changes: 13 additions & 13 deletions index.qmd
@@ -1,12 +1,12 @@
# Welcome {.unnumbered}
# Welcome

[Bioconductor](https://www.bioconductor.org) is an open-source software project
that provides tools for the analysis and comprehension of genomic data.
One of the main advantages of Bioconductor is the availability of
standard data representations and large number of analysis tools for genomic
experiments.
These tools allow researchers to efficiently store, manipulate, and analyze
their data across multiple tools and workflows.
standard data representations and a large number of analysis tools tailored
for genomic experiments.
These tools allow researchers to seamlessly store, manipulate, and analyze
data across multiple tools and workflows in R.

Inspired by Bioconductor, [BiocPy](https://github.com/BiocPy) aims to facilitate
bioinformatics workflows in Python.
@@ -23,15 +23,12 @@ like [SummarizedExperiment](https://github.com/BiocPy/SummarizedExperiment),
[MultiAssayExperiment](https://github.com/BiocPy/MultiAssayExperiment) represent
single or multi-omic experimental data and metadata.

Moreover, BiocPy introduces a range of data type classes designed to support the
representation of atomic entities, including float, string, int lists, and named lists.
Moreover, BiocPy introduces a diverse range of data type classes designed to support the
representation of atomic entities, including *float*, *string*, *int* lists, and named lists.
These generics and utilities are provided through the [BiocUtils](https://github.com/BiocPy/BiocUtils)
package and the delayed and file-backed array operations in the
[DelayedArray](https://github.com/BiocPy/DelayedArray) package.
While there have been previous efforts to port bioconductor representations into
Python e.g. [AnnData](https://github.com/scverse/anndata),
[PyRanges](https://github.com/pyranges/pyranges), these efforts are fragmented and
have limited interoperability.
[DelayedArray](https://github.com/BiocPy/DelayedArray) package and its derivatives
([HDF5Array](https://github.com/BiocPy/HDF5array), [TileDbArray](https://github.com/BiocPy/tiledbarray)).
To our knowledge, BiocPy is the first Python framework to provide seamless, well-integrated data
structures and representations for genomic data analysis.

@@ -43,6 +40,8 @@ the requirement for additional data conversion tools or intermediate formats.
The package's functionality streamlines the transition between Python and R,
facilitating seamless analysis.

Although not covered by this tutorial, BiocPy provides bindings to [libscran](https://github.com/LTLA/libscran) and various other single-cell analysis methods incorporated into the [scranpy](https://github.com/BiocPy/scranpy) package to support analysis of multi-modal single-cell datasets. It also features integration with the [singleR](https://github.com/BiocPy/singler) algorithm to annotate cell types by matching cells to known references based on their expression profiles.

All packages within the BiocPy ecosystem are published to
[Python's Package Index (PyPI)](https://pypi.org/).

@@ -61,6 +60,7 @@ For complete list of all packages, please visit the
- `MultiAssayExperiment` ([GitHub](https://github.com/BiocPy/MultiAssayExperiment), [Docs](https://biocpy.github.io/MultiAssayExperiment/)): Container class to represent multiple experiments and assays performed over a set of samples. Follows Bioconductor's [MAE R/Bioc Package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html).

#### Analysis packages

- `scranpy` ([GitHub](https://github.com/BiocPy/scranpy), [Docs](https://biocpy.github.io/scranpy/)): Python bindings to the single-cell analysis methods from **libscran** and related C++ libraries.
- `singler` ([GitHub](https://github.com/BiocPy/singler), [Docs](https://biocpy.github.io/singler/)): Python bindings to the **singleR** algorithm to annotate cell types from known references.

@@ -78,7 +78,7 @@ For complete list of all packages, please visit the

## Further reading

Many online resources offer detailed information on these data structures, namely:
Many online resources offer detailed information on Bioconductor data structures, namely:

- [https://compgenomr.github.io/book/](https://compgenomr.github.io/book/)
- [https://www.nature.com/articles/nmeth.3252](https://www.nature.com/articles/nmeth.3252)
1 change: 1 addition & 0 deletions requirements.txt
@@ -24,4 +24,5 @@ joblib
dolomite_mae
dolomite_sce
hdf5array
tiledbarray
celltypist
