Make example docs more notebook-centric, part 1 of 3 #205

Merged 1 commit on Jun 29, 2022
6 changes: 3 additions & 3 deletions _quarto.yml
@@ -39,7 +39,7 @@ website:

sidebar:
- style: "floating"
collapse-level: 2
collapse-level: 3
align: left
contents:
- href: "overview.md"
@@ -56,15 +56,14 @@ website:
contents:

- section: "Python examples"
href: "apis/python/examples/examples-overview.md"
contents:
- href: "apis/python/examples/obtaining-data-files.md"
text: "Obtaining data files"
- href: "apis/python/examples/ingesting-data-files.md"
text: "Ingesting data files"
- href: "apis/python/examples/anndata-and-tiledb.md"
text: "Comparing AnnData and TileDB files"
- href: "apis/python/examples/inspecting-schema.md"
text: "Inspecting SOMA schemas"
- href: "apis/python/examples/uniform-collection.md"
text: "Uniformizing a collection"
- href: "apis/python/examples/soco-reconnaissance.md"
@@ -75,6 +74,7 @@ website:
text: "SOMA-collection batch query"

- section: "Python API"
href: "apis/python/doc/api-overview.md"
contents:
- href: "apis/python/doc/overview.md"

6 changes: 6 additions & 0 deletions apis/python/doc/api-overview.md
@@ -0,0 +1,6 @@
Documentation in this section is generated directly from the [Python API implementation](https://github.com/single-cell-data/TileDB-SingleCell/tree/main/apis/python).

See also for the R package:

* [R repo](https://github.com/TileDB-Inc/tiledbsc)
* [R docs](https://tiledb-inc.github.io/tiledbsc)
23 changes: 19 additions & 4 deletions apis/python/examples/anndata-and-tiledb.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,11 @@
If you're familiar with AnnData, you'll recognize much about the SOMA data model.

## Single AnnData file

```
>>> import anndata

>>> ann = anndata.read_h5ad('anndata/pbmc3k_processed.h5ad')
>>> ann = anndata.read_h5ad('pbmc3k_processed.h5ad')

>>> ann.obs.keys()
Index(['n_genes', 'percent_mito', 'n_counts', 'louvain'], dtype='object')
@@ -75,12 +77,25 @@ array([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,

## Single TileDB SOMA

After `./tools/ingestor ./anndata/pbmc3k_processed.h5ad ./tiledb-data/pbmc3k_processed`:

```
>>> import tiledbsc

>>> soma = tiledbsc.SOMA('tiledb-data/pbmc3k_processed')
>>> soma = tiledbsc.SOMA('tiledb://johnkerl-tiledb/pbmc3k_processed')

>>> soma
Name: pbmc3k_processed
URI: tiledb://johnkerl-tiledb/pbmc3k_processed
(n_obs, n_var): (2638, 1838)
X: 'data'
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
var: 'n_cells'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
varm: 'PCs'
obsp: 'distances', 'connectivities'
varp:
raw/X: 'data'
raw/var: 'n_cells'
uns: draw_graph, louvain, louvain_colors, neighbors, pca

>>> soma.obs.keys()
['n_genes', 'percent_mito', 'n_counts', 'louvain']
33 changes: 33 additions & 0 deletions apis/python/examples/examples-overview.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# What is SOMA?

SOMA -- for _stack of matrices, annotated_ -- is a unified data model and API for single-cell data.

If you know about `obs`, `var`, and `X`, you'll recognize what you're seeing.

The data model and API -- here as implemented using the TileDB storage engine -- allow you to persist and share single-cell data

* at scale
* with auditable data-sharing
* using the same storage across multiple high-level languages (currently Python and R)
* allowing interop with multiple tools including AnnData, Scanpy, Seurat, and Bioconductor.

See also [the schema specification](https://github.com/single-cell-data/SOMA/blob/main/README.md).

# Examples overview

In these examples we offer how-tos for the [TileDB SingleCell Python package](https://github.com/single-cell-data/TileDB-SingleCell/tree/main/apis/python):

* How to install the software
* How to retrieve sample H5AD inputs from a public S3 bucket
* How to ingest these into SOMA storage on local disk, S3, and TileDB Cloud for increasing levels of scalability and shareability
* How to slice and query SOMA and SOMA-collection objects in new and empowering ways

See also for the R package:

* [R repo](https://github.com/TileDB-Inc/tiledbsc)
* [R docs](https://tiledb-inc.github.io/tiledbsc)

# Notebook

Examples with screenshots and copy/pasteable, reusable samples are shown here. You can also use
the [public TileDB Cloud notebook](https://cloud.tiledb.com/notebooks/details/johnkerl-tiledb/d3d7ff44-dc65-4cd9-b574-98312c4cbdbd/preview), from which most of these screenshots are taken.
Binary file added apis/python/examples/images/local-inspect-1.png
Binary file added apis/python/examples/images/local-inspect-2.png
Binary file added apis/python/examples/images/public-bucket.png
Binary file added apis/python/examples/images/s3-inspect.png
Binary file added apis/python/examples/images/soco-inspect.png
137 changes: 91 additions & 46 deletions apis/python/examples/ingesting-data-files.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,87 @@
## Ingesting into SOMAs
# Ingesting into SOMAs

A **SOMA** (Stack of Matrices, Annotated) is a [unified single-cell data model and API](https://github.com/single-cell-data/SOMA). A SOMA contains the kinds of data that belong in a single-cell dataset: `obs` and `var` matrices, `X` layers, and so on, offering **write-once, read-many semantics** via Python and R toolchains, with I/O to AnnData and Seurat formats, and interoperability with Scanpy and Seurat toolchains.

Use the [ingestor script](../tools/ingestor) to ingest [files you've downloaded](obtaining-data-files.md) into SOMAs:
In the next few sections we'll ingest some AnnData files into SOMA storage three ways:

* To notebook-local disk
* To S3 storage
* To TileDB Cloud storage

Notebook-level storage is great for quick kick-the-tires exploration.

Cloud-level storage is crucial for at-scale, beyond-core analysis.

## Local ingestion

If you have an in-memory `AnnData` object, you can ingest it into a SOMA using `tiledbsc.io.from_anndata()`:

```
import scanpy
import tiledbsc.io  # also makes the top-level tiledbsc package available

pbmc3k = scanpy.datasets.pbmc3k_processed()
local_soma = tiledbsc.SOMA('pbmc3k')
tiledbsc.io.from_anndata(local_soma, pbmc3k)
```

We'll focus from here on out mainly on ingesting from H5AD disk files. Given local
`pbmc3k_processed.h5ad`, as in the previous section, we can populate a SOMA.

```
local_soma = tiledbsc.SOMA('pbmc3k')
tiledbsc.io.from_h5ad(local_soma, './pbmc3k_processed.h5ad')
```

Now we can examine the data, using things like the following:

```
local_soma
local_soma.obs.keys()
local_soma.obs.df()
local_soma.obs.df().groupby('cell_type').size()  # only for datasets whose obs has a 'cell_type' column
local_soma.obs.df().groupby('louvain').size()
local_soma.uns['louvain_colors'].to_matrix()
```

![](images/local-inspect-1.png)
![](images/local-inspect-2.png)

## S3 ingestion

To ingest into S3, simply provide an S3 URI as the destination. The simplest way to get S3
credentials set up is to export the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and
`AWS_DEFAULT_REGION` environment variables -- please see the
[public TileDB Cloud notebook](https://cloud.tiledb.com/notebooks/details/johnkerl-tiledb/d3d7ff44-dc65-4cd9-b574-98312c4cbdbd/preview) for examples.

```
s3_soma = tiledbsc.SOMA('s3://mybucket/scratch/puck-001', ctx=ctx)
tiledbsc.io.from_h5ad(s3_soma, './Puck_200903_10.h5ad')
```

![](images/s3-inspect.png)
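
Before pointing a SOMA at an S3 URI, it can help to confirm that the three environment variables named above are actually set. The following is a minimal sketch using only the Python standard library; the helper name `missing_aws_vars` is hypothetical and is not part of `tiledbsc`.

```python
import os

# The variable names are the ones listed in the text above.
REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def missing_aws_vars(environ=None):
    """Return the required AWS variable names that are unset or empty.

    Hypothetical helper for illustration only.
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_AWS_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Please export: " + ", ".join(missing))
    else:
        print("S3 credentials appear to be configured.")
```

Running this in a notebook cell before the ingest reports any variable that still needs to be exported.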

## TileDB Cloud ingestion

Again we simply vary the destination URI, this time to `tiledb://...`. The simplest way to configure
TileDB Cloud access is to export the `TILEDB_REST_TOKEN` environment variable (which is done already for you
within TileDB Cloud notebooks). See
[here](https://docs.tiledb.com/cloud/how-to/account/create-api-tokens) for how to create an API
token.

For upload to TileDB Cloud, we use _creation URIs_. Here you indicate your target namespace along with S3 storage. The SOMA name will be taken from the last component of the creation URI.

* Example creation URI: `tiledb://mynamespace/s3://mybucket/path/to/somaname`
* Example post-upload URI: `tiledb://mynamespace/somaname`

```
hscs = tiledbsc.SOMA('tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/cloud/001/HSCs', ctx=ctx)
tiledbsc.io.from_h5ad(hscs, 'HSCs.h5ad')
hscs = tiledbsc.SOMA('tiledb://johnkerl-tiledb/HSCs', ctx=ctx)
```

![](images/tiledb-cloud-inspect.png)
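
The relationship between a creation URI and its post-upload URI, as described above, can be sketched as plain string handling. This is an illustration of the naming convention only, assuming the URI shapes shown in the two example bullets; the function name `post_upload_uri` is hypothetical and is not part of `tiledbsc`.

```python
def post_upload_uri(creation_uri: str) -> str:
    """Derive tiledb://<namespace>/<name> from a creation URI.

    Hypothetical helper: assumes the creation URI has the form
    tiledb://<namespace>/<storage URI>, and that the SOMA name is the
    last path component of the storage URI.
    """
    prefix = "tiledb://"
    if not creation_uri.startswith(prefix):
        raise ValueError(f"not a tiledb:// URI: {creation_uri!r}")

    # Split off the namespace at the first slash after the scheme.
    namespace, sep, storage_uri = creation_uri[len(prefix):].partition("/")
    if not sep or not storage_uri:
        raise ValueError(f"no storage URI found in: {creation_uri!r}")

    # The SOMA name is the last component of the storage path.
    name = storage_uri.rstrip("/").rsplit("/", 1)[-1]
    return f"{prefix}{namespace}/{name}"


print(post_upload_uri("tiledb://mynamespace/s3://mybucket/path/to/somaname"))
```

Applied to the creation URI used for the `HSCs` upload above, the same function yields `tiledb://johnkerl-tiledb/HSCs`, which is the URI used to reopen the SOMA.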

## Scripted ingestion

Alternatively, especially for bulk/batch/scripted jobs, you may wish to use the [ingestor script](../tools/ingestor) to ingest [files you've downloaded](obtaining-data-files.md) into SOMAs:

```
tools/ingestor -o /mini-corpus/tiledb-data -n /mini-corpus/anndata/0cfab2d4-1b79-444e-8cbe-2ca9671ca85e.h5ad
@@ -20,9 +99,17 @@ package) or [`tiledbsc-r`](https://github.com/TileDB-Inc/tiledbsc) you can read
language, regardless of which language was used to store them. This lets you use
best-in-class/state-of-the-art analysis algorithms, whichever language they're implemented in.

## Populate a SOMA collection
# Populate a SOMA collection

Once you have SOMAs, you can optionally add them to a _SOMA collection_ -- a list-of-SOMAs object,
storable on local disk, S3, or the cloud, designed for multi-SOMA slicing and querying, which can be
permissioned and shared like any other TileDB Cloud object.

Use the [populator script](../tools/populate-soco) to mark these as members of a SOMA collection:
![](images/soco-inspect.png)

## Scripted population

Alternatively, especially for bulk/batch/scripted jobs, you may wish to use the [populator script](../tools/populate-soco) to mark SOMAs as members of a SOMA collection:

```
populate-soco -o /mini-corpus/soco --relative false -a /mini-corpus/tiledb-data/0cfab2d4-1b79-444e-8cbe-2ca9671ca85e
@@ -45,45 +132,3 @@ collection at ingest time, so you don't even have to run `populate-soco` as an a
tools/ingestor -o /mini-corpus/tiledb-data --soco -n /mini-corpus/anndata/0cfab2d4-1b79-444e-8cbe-2ca9671ca85e.h5ad
tools/ingestor -o /mini-corpus/tiledb-data --soco -n /mini-corpus/anndata/10x_pbmc68k_reduced.h5ad
```
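
For batch jobs over many H5AD files, the ingestor invocations above can be generated rather than typed by hand. The sketch below only builds and prints the command lines; it reuses the `-o`, `-n`, and `--soco` flags exactly as they appear in the commands above and assumes nothing else about the script's options. The function name `ingestor_commands` is hypothetical.

```python
import shlex


def ingestor_commands(output_dir, h5ad_paths, soco=False):
    """Build one tools/ingestor argument list per input H5AD file.

    Hypothetical helper: the flags mirror the invocations shown in this
    document and are passed through unchanged.
    """
    commands = []
    for path in h5ad_paths:
        argv = ["tools/ingestor", "-o", output_dir]
        if soco:
            argv.append("--soco")
        argv += ["-n", path]
        commands.append(argv)
    return commands


inputs = [
    "/mini-corpus/anndata/0cfab2d4-1b79-444e-8cbe-2ca9671ca85e.h5ad",
    "/mini-corpus/anndata/10x_pbmc68k_reduced.h5ad",
]
for argv in ingestor_commands("/mini-corpus/tiledb-data", inputs, soco=True):
    # Print a shell-safe form; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Keeping each command as an argument list avoids shell-quoting problems with unusual file names if the commands are later run with `subprocess.run(argv, check=True)`.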

## Names and URIs

Next let's start taking a look across the collection.

```
for soma in soco:
print("%-40s %s" % (soma.name, soma.uri))
```

```
NAMES AND URIS
tabula-sapiens-immune file:///mini-corpus/tiledb-data/tabula-sapiens-immune
tabula-sapiens-epithelial file:///mini-corpus/tiledb-data/tabula-sapiens-epithelial
integrated-human-lung-cell-atlas file:///mini-corpus/tiledb-data/integrated-human-lung-cell-atlas
af9d8c03-696c-4997-bde8-8ef00844881b file:///mini-corpus/tiledb-data/af9d8c03-696c-4997-bde8-8ef00844881b
subset_100_100 file:///mini-corpus/tiledb-data/subset_100_100
pbmc3k_processed file:///mini-corpus/tiledb-data/pbmc3k_processed
pbmc3k-krilow file:///mini-corpus/tiledb-data/pbmc3k-krilow
issue-74 file:///mini-corpus/tiledb-data/issue-74
local2 file:///mini-corpus/tiledb-data/local2
developmental-single-cell-atlas-of-the-murine-lung file:///mini-corpus/tiledb-data/developmental-single-cell-atlas-of-the-murine-lung
single-cell-transcriptomes file:///mini-corpus/tiledb-data/single-cell-transcriptomes
tabula-sapiens-stromal file:///mini-corpus/tiledb-data/tabula-sapiens-stromal
azimuth-meta-analysis file:///mini-corpus/tiledb-data/azimuth-meta-analysis
vieira19_Alveoli_and_parenchyma_anonymised.processed file:///mini-corpus/tiledb-data/vieira19_Alveoli_and_parenchyma_anonymised.processed
local3 file:///mini-corpus/tiledb-data/local3
Single_cell_atlas_of_peripheral_immune_response_to_SARS_CoV_2_infection file:///mini-corpus/tiledb-data/Single_cell_atlas_of_peripheral_immune_response_to_SARS_CoV_2_infection
4056cbab-2a32-4c9e-a55f-c930bc793fb6 file:///mini-corpus/tiledb-data/4056cbab-2a32-4c9e-a55f-c930bc793fb6
longitudinal-profiling-49 file:///mini-corpus/tiledb-data/longitudinal-profiling-49
human-kidney-tumors-wilms file:///mini-corpus/tiledb-data/human-kidney-tumors-wilms
issue-69 file:///mini-corpus/tiledb-data/issue-69
autoimmunity-pbmcs file:///mini-corpus/tiledb-data/autoimmunity-pbmcs
brown-adipose-tissue-mouse file:///mini-corpus/tiledb-data/brown-adipose-tissue-mouse
d4db74ad-a129-4b1a-b9da-1b30db86bbe4-issue-74 file:///mini-corpus/tiledb-data/d4db74ad-a129-4b1a-b9da-1b30db86bbe4-issue-74
pbmc-small file:///mini-corpus/tiledb-data/pbmc-small
10x_pbmc68k_reduced file:///mini-corpus/tiledb-data/10x_pbmc68k_reduced
Puck_200903_10 file:///mini-corpus/tiledb-data/Puck_200903_10
0cfab2d4-1b79-444e-8cbe-2ca9671ca85e file:///mini-corpus/tiledb-data/0cfab2d4-1b79-444e-8cbe-2ca9671ca85e
acute-covid19-cohort file:///mini-corpus/tiledb-data/acute-covid19-cohort
adult-mouse-cortical-cell-taxonomy file:///mini-corpus/tiledb-data/adult-mouse-cortical-cell-taxonomy
```