diff --git a/_quarto.yml b/_quarto.yml index faf54c9628..95340069c3 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -39,7 +39,7 @@ website: sidebar: - style: "floating" - collapse-level: 2 + collapse-level: 3 align: left contents: - href: "overview.md" @@ -56,6 +56,7 @@ website: contents: - section: "Python examples" + href: "apis/python/examples/examples-overview.md" contents: - href: "apis/python/examples/obtaining-data-files.md" text: "Obtaining data files" @@ -63,8 +64,6 @@ website: text: "Ingesting data files" - href: "apis/python/examples/anndata-and-tiledb.md" text: "Comparing AnnData and TileDB files" - - href: "apis/python/examples/inspecting-schema.md" - text: "Inspecting SOMA schemas" - href: "apis/python/examples/uniform-collection.md" text: "Uniformizing a collection" - href: "apis/python/examples/soco-reconnaissance.md" @@ -75,6 +74,7 @@ website: text: "SOMA-collection batch query" - section: "Python API" + href: "apis/python/doc/api-overview.md" contents: - href: "apis/python/doc/overview.md" diff --git a/apis/python/doc/api-overview.md b/apis/python/doc/api-overview.md new file mode 100644 index 0000000000..dd16e4da58 --- /dev/null +++ b/apis/python/doc/api-overview.md @@ -0,0 +1,6 @@ +Documentation in this section is generated directly from the [Python API implementation](https://github.com/single-cell-data/TileDB-SingleCell/tree/main/apis/python). + +See also, for the R package: + +* [R repo](https://github.com/TileDB-Inc/tiledbsc) +* [R docs](https://tiledb-inc.github.io/tiledbsc) diff --git a/apis/python/examples/anndata-and-tiledb.md b/apis/python/examples/anndata-and-tiledb.md index fc1d7a0f80..6914355193 100644 --- a/apis/python/examples/anndata-and-tiledb.md +++ b/apis/python/examples/anndata-and-tiledb.md @@ -1,9 +1,11 @@ +If you're familiar with AnnData, you'll recognize much about the SOMA data model.
+ ## Single AnnData file ``` >>> import anndata ->>> ann = anndata.read_h5ad('anndata/pbmc3k_processed.h5ad') +>>> ann = anndata.read_h5ad('pbmc3k_processed.h5ad') >>> ann.obs.keys() Index(['n_genes', 'percent_mito', 'n_counts', 'louvain'], dtype='object') @@ -75,12 +77,25 @@ array([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884, ## Single TileDB SOMA -After `./tools/ingestor ./anndata/pbmc3k_processed.h5ad ./tiledb-data/pbmc3k_processed`: - ``` >>> import tiledbsc ->>> soma = tiledbsc.SOMA('tiledb-data/pbmc3k_processed') +>>> soma = tiledbsc.SOMA('tiledb://johnkerl-tiledb/pbmc3k_processed') + +>>> soma +Name: pbmc3k_processed +URI: tiledb://johnkerl-tiledb/pbmc3k_processed +(n_obs, n_var): (2638, 1838) +X: 'data' +obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain' +var: 'n_cells' +obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr' +varm: 'PCs' +obsp: 'distances', 'connectivities' +varp: +raw/X: 'data' +raw/var: 'n_cells' +uns: draw_graph, louvain, louvain_colors, neighbors, pca >>> soma.obs.keys() ['n_genes', 'percent_mito', 'n_counts', 'louvain'] diff --git a/apis/python/examples/examples-overview.md b/apis/python/examples/examples-overview.md new file mode 100644 index 0000000000..6ae0dbc378 --- /dev/null +++ b/apis/python/examples/examples-overview.md @@ -0,0 +1,33 @@ +# What is SOMA? + +SOMA -- for _stack of matrices, annotated_ -- is a unified data model and API for single-cell data. + +If you know about `obs`, `var`, and `X`, you'll recognize what you're seeing. + +The data model and API -- here as implemented using the TileDB storage engine -- allow you to persist and share single-cell data + +* at scale +* with auditable data-sharing +* using the same storage across multiple high-level languages (currently Python and R) +* allowing interop with multiple tools including AnnData, Scanpy, Seurat, and Bioconductor. + +See also [the schema specification](https://github.com/single-cell-data/SOMA/blob/main/README.md).
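To make the `obs`/`var`/`X` relationship concrete, here is a plain-Python sketch of the shapes involved. This is illustrative only -- it is not the `tiledbsc` API, and the toy cell types, gene names, and values are made up:

```python
# Toy sketch of the SOMA data model -- not the tiledbsc API.
# A SOMA pairs an X matrix of shape (n_obs, n_var) with a per-cell
# annotation table (obs) and a per-gene annotation table (var).
soma = {
    "obs": {"cell_type": ["B cell", "T cell", "B cell"]},  # n_obs = 3 cells
    "var": {"feature_name": ["TSPAN6", "DPM1"]},           # n_var = 2 genes
    "X": [[0.1, 2.0],                                      # one row per cell,
          [0.0, 1.5],                                      # one column per gene
          [3.2, 0.0]],
}

n_obs = len(soma["obs"]["cell_type"])
n_var = len(soma["var"]["feature_name"])

# The shape invariant every SOMA maintains:
assert len(soma["X"]) == n_obs
assert all(len(row) == n_var for row in soma["X"])
print((n_obs, n_var))  # -> (3, 2)
```

The same `(n_obs, n_var)` pairing is what the `tiledbsc.SOMA` summary printout above reports for the real pbmc3k dataset.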
+ +# Examples overview + +In these examples we offer how-to's on the [TileDB SingleCell Python package](https://github.com/single-cell-data/TileDB-SingleCell/tree/main/apis/python): + +* How to install the software +* How to retrieve sample H5AD inputs from a public S3 bucket +* How to ingest these into SOMA storage on local disk, S3, and TileDB Cloud for increasing levels of scalability and shareability +* How to slice and query SOMA and SOMA-collection objects in new and empowering ways + +See also, for the R package: + +* [R repo](https://github.com/TileDB-Inc/tiledbsc) +* [R docs](https://tiledb-inc.github.io/tiledbsc) + +# Notebook + +Examples with screenshots and copy/pasteable reusable samples are shown here. You can also use +the [public TileDB Cloud notebook](https://cloud.tiledb.com/notebooks/details/johnkerl-tiledb/d3d7ff44-dc65-4cd9-b574-98312c4cbdbd/preview), from which most of these screenshots are taken. diff --git a/apis/python/examples/images/local-inspect-1.png b/apis/python/examples/images/local-inspect-1.png new file mode 100644 index 0000000000..a2dbf49394 Binary files /dev/null and b/apis/python/examples/images/local-inspect-1.png differ diff --git a/apis/python/examples/images/local-inspect-2.png b/apis/python/examples/images/local-inspect-2.png new file mode 100644 index 0000000000..29719d6d98 Binary files /dev/null and b/apis/python/examples/images/local-inspect-2.png differ diff --git a/apis/python/examples/images/public-bucket.png b/apis/python/examples/images/public-bucket.png new file mode 100644 index 0000000000..7b91159c5d Binary files /dev/null and b/apis/python/examples/images/public-bucket.png differ diff --git a/apis/python/examples/images/s3-inspect.png b/apis/python/examples/images/s3-inspect.png new file mode 100644 index 0000000000..8bf757842e Binary files /dev/null and b/apis/python/examples/images/s3-inspect.png differ diff --git a/apis/python/examples/images/soco-inspect.png
b/apis/python/examples/images/soco-inspect.png new file mode 100644 index 0000000000..bc46d61b21 Binary files /dev/null and b/apis/python/examples/images/soco-inspect.png differ diff --git a/apis/python/examples/images/soco-reconnaissance.png b/apis/python/examples/images/soco-reconnaissance.png new file mode 100644 index 0000000000..aab93297bd Binary files /dev/null and b/apis/python/examples/images/soco-reconnaissance.png differ diff --git a/apis/python/examples/images/tiledb-cloud-inspect.png b/apis/python/examples/images/tiledb-cloud-inspect.png new file mode 100644 index 0000000000..a1f70f4ee9 Binary files /dev/null and b/apis/python/examples/images/tiledb-cloud-inspect.png differ diff --git a/apis/python/examples/ingesting-data-files.md b/apis/python/examples/ingesting-data-files.md index 51c303c119..79625dbc0a 100644 --- a/apis/python/examples/ingesting-data-files.md +++ b/apis/python/examples/ingesting-data-files.md @@ -1,8 +1,87 @@ -## Ingesting into SOMAs +# Ingesting into SOMAs A **SOMA** (Stack of Matrices, Annotated) is a [unified single-cell data model and API](https://github.com/single-cell-data/SOMA). A SOMA contains the kinds of data that belong in a single-cell dataset: `obs` and `var` matrices, `X` layers, and so on, offering **write-once, read-many semantics** via Python and R toolchains, with I/O to AnnData and Seurat formats, and interoperability with Scanpy and Seurat toolchains. -Use the [ingestor script](../tools/ingestor) to ingest [files you've downloaded](obtaining-data-files.md) into SOMAs: +In the next few sections we'll ingest some AnnData files into SOMA storage three ways: + +* To notebook-local disk +* To S3 storage +* To TileDB Cloud storage + +Notebook-level storage is great for quick kick-the-tires exploration. + +Cloud-level storage is crucial for at-scale, beyond-core analysis. 
+ +## Local ingestion + +If you have an in-memory `AnnData` object, you can ingest it into a SOMA using `tiledbsc.io.from_anndata()`: + +``` +pbmc3k = scanpy.datasets.pbmc3k_processed() +local_soma = tiledbsc.SOMA('pbmc3k') +tiledbsc.io.from_anndata(local_soma, pbmc3k) +``` + +We'll focus from here on out mainly on ingesting from H5AD disk files. Given local +`pbmc3k_processed.h5ad`, as in the previous section, we can populate a SOMA. + +``` +local_soma = tiledbsc.SOMA('pbmc3k') +tiledbsc.io.from_h5ad(local_soma, './pbmc3k_processed.h5ad') +``` + +Now we can examine the data, using things like the following: + +``` +local_soma +local_soma.obs.keys() +local_soma.obs.df() +local_soma.obs.df().groupby('cell_type').size() +local_soma.obs.df().groupby('louvain').size() +local_soma.uns['louvain_colors'].to_matrix() +``` + +![](images/local-inspect-1.png) +![](images/local-inspect-2.png) + +## S3 ingestion + +To ingest into S3, simply provide an S3 URI as the destination. The simplest way to get S3 +credentials set up is to export the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and +`AWS_DEFAULT_REGION` environment variables -- please see the +[public TileDB Cloud notebook](https://cloud.tiledb.com/notebooks/details/johnkerl-tiledb/d3d7ff44-dc65-4cd9-b574-98312c4cbdbd/preview) for examples. + +``` +s3_soma = tiledbsc.SOMA('s3://mybucket/scratch/puck-001', ctx=ctx) +tiledbsc.io.from_h5ad(s3_soma, './Puck_200903_10.h5ad') +``` + +![](images/s3-inspect.png) + +## TileDB Cloud ingestion + +Again we simply vary the destination URI, this time to `tiledb://...`. The simplest way to configure +TileDB Cloud access is to export the `TILEDB_REST_TOKEN` environment variable (which is done already for you +within TileDB Cloud notebooks). See +[here](https://docs.tiledb.com/cloud/how-to/account/create-api-tokens) for how to create an API +token. + +For upload to TileDB Cloud, we use _creation URIs_. Here you indicate your target namespace along with S3 storage.
The SOMA name will be taken from the last component of the creation URI. + +* Example creation URI: `tiledb://mynamespace/s3://mybucket/path/to/somaname` +* Example post-upload URI: `tiledb://mynamespace/somaname` + +``` +hscs = tiledbsc.SOMA('tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/cloud/001/HSCs', ctx=ctx) +tiledbsc.io.from_h5ad(hscs, 'HSCs.h5ad') +hscs = tiledbsc.SOMA('tiledb://johnkerl-tiledb/HSCs', ctx=ctx) +``` + +![](images/tiledb-cloud-inspect.png) + +## Scripted ingestion + +Alternatively, especially for bulk/batch/scripted jobs, you may wish to use the [ingestor script](../tools/ingestor) to ingest [files you've downloaded](obtaining-data-files.md) into SOMAs: ``` tools/ingestor -o /mini-corpus/tiledb-data -n /mini-corpus/anndata/0cfab2d4-1b79-444e-8cbe-2ca9671ca85e.h5ad @@ -20,9 +99,17 @@ package) or [`tiledbsc-r`](https://github.com/TileDB-Inc/tiledbsc) you can read language, regardless of which language was used to store them. This lets you use best-in-class/state-of-the-art analysis algorithms, whichever language they're implemented in. -## Populate a SOMA collection +# Populate a SOMA collection + +Once you have SOMAs, you can optionally add them to a _SOMA collection_ -- a list-of-SOMAs object, +storable on local disk, S3, or the cloud, designed for multi-SOMA slicing and querying, which can be +permissioned and shared like any other TileDB Cloud object.
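The membership bookkeeping a collection does is simple; here is a stdlib-only stand-in (toy class and names, not the `tiledbsc` API -- with `tiledbsc` you would create a `tiledbsc.SOMACollection(path)`, call `soco.add(soma)` per member, and then iterate with `for soma in soco:`):

```python
# Toy stand-in for SOMA-collection membership -- illustrative only,
# not the tiledbsc API.  A collection maps member names to member URIs
# and is iterable over its members.
class TinyCollection:
    def __init__(self, uri):
        self.uri = uri
        self._members = {}  # member name -> member URI

    def add(self, name):
        # Members here live under the collection URI; tiledbsc also
        # supports non-relative member URIs.
        self._members[name] = f"{self.uri}/{name}"

    def __iter__(self):
        return iter(self._members.items())

soco = TinyCollection("file:///mini-corpus/soco")
soco.add("pbmc3k_processed")
soco.add("acute-covid19-cohort")

for name, uri in soco:
    print("%-25s %s" % (name, uri))
```

The iterate-and-print pattern at the end mirrors the names-and-URIs listings used throughout these examples.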
-Use the [populator script](../tools/populate-soco) to mark these as members of a SOMA collection: +![](images/soco-inspect.png) + +## Scripted population + +Alternatively, especially for bulk/batch/scripted jobs, you may wish to use the [populator script](../tools/populate-soco) to mark SOMAs as members of a SOMA collection: ``` populate-soco -o /mini-corpus/soco --relative false -a /mini-corpus/tiledb-data/0cfab2d4-1b79-444e-8cbe-2ca9671ca85e @@ -45,45 +132,3 @@ collection at ingest time, so you don't even have to run `populate-soco` as an a tools/ingestor -o /mini-corpus/tiledb-data --soco -n /mini-corpus/anndata/0cfab2d4-1b79-444e-8cbe-2ca9671ca85e.h5ad tools/ingestor -o /mini-corpus/tiledb-data --soco -n /mini-corpus/anndata/10x_pbmc68k_reduced.h5ad ``` - -## Names and URIs - -Next let's start taking a look across the collection. - -``` -for soma in soco: - print("%-40s %s" % (soma.name, soma.uri)) -``` - -``` -NAMES AND URIS -tabula-sapiens-immune file:///mini-corpus/tiledb-data/tabula-sapiens-immune -tabula-sapiens-epithelial file:///mini-corpus/tiledb-data/tabula-sapiens-epithelial -integrated-human-lung-cell-atlas file:///mini-corpus/tiledb-data/integrated-human-lung-cell-atlas -af9d8c03-696c-4997-bde8-8ef00844881b file:///mini-corpus/tiledb-data/af9d8c03-696c-4997-bde8-8ef00844881b -subset_100_100 file:///mini-corpus/tiledb-data/subset_100_100 -pbmc3k_processed file:///mini-corpus/tiledb-data/pbmc3k_processed -pbmc3k-krilow file:///mini-corpus/tiledb-data/pbmc3k-krilow -issue-74 file:///mini-corpus/tiledb-data/issue-74 -local2 file:///mini-corpus/tiledb-data/local2 -developmental-single-cell-atlas-of-the-murine-lung file:///mini-corpus/tiledb-data/developmental-single-cell-atlas-of-the-murine-lung -single-cell-transcriptomes file:///mini-corpus/tiledb-data/single-cell-transcriptomes -tabula-sapiens-stromal file:///mini-corpus/tiledb-data/tabula-sapiens-stromal -azimuth-meta-analysis file:///mini-corpus/tiledb-data/azimuth-meta-analysis
-vieira19_Alveoli_and_parenchyma_anonymised.processed file:///mini-corpus/tiledb-data/vieira19_Alveoli_and_parenchyma_anonymised.processed -local3 file:///mini-corpus/tiledb-data/local3 -Single_cell_atlas_of_peripheral_immune_response_to_SARS_CoV_2_infection file:///mini-corpus/tiledb-data/Single_cell_atlas_of_peripheral_immune_response_to_SARS_CoV_2_infection -4056cbab-2a32-4c9e-a55f-c930bc793fb6 file:///mini-corpus/tiledb-data/4056cbab-2a32-4c9e-a55f-c930bc793fb6 -longitudinal-profiling-49 file:///mini-corpus/tiledb-data/longitudinal-profiling-49 -human-kidney-tumors-wilms file:///mini-corpus/tiledb-data/human-kidney-tumors-wilms -issue-69 file:///mini-corpus/tiledb-data/issue-69 -autoimmunity-pbmcs file:///mini-corpus/tiledb-data/autoimmunity-pbmcs -brown-adipose-tissue-mouse file:///mini-corpus/tiledb-data/brown-adipose-tissue-mouse -d4db74ad-a129-4b1a-b9da-1b30db86bbe4-issue-74 file:///mini-corpus/tiledb-data/d4db74ad-a129-4b1a-b9da-1b30db86bbe4-issue-74 -pbmc-small file:///mini-corpus/tiledb-data/pbmc-small -10x_pbmc68k_reduced file:///mini-corpus/tiledb-data/10x_pbmc68k_reduced -Puck_200903_10 file:///mini-corpus/tiledb-data/Puck_200903_10 -0cfab2d4-1b79-444e-8cbe-2ca9671ca85e file:///mini-corpus/tiledb-data/0cfab2d4-1b79-444e-8cbe-2ca9671ca85e -acute-covid19-cohort file:///mini-corpus/tiledb-data/acute-covid19-cohort -adult-mouse-cortical-cell-taxonomy file:///mini-corpus/tiledb-data/adult-mouse-cortical-cell-taxonomy -``` diff --git a/apis/python/examples/inspecting-schema.md b/apis/python/examples/inspecting-schema.md deleted file mode 100644 index f819ca7c16..0000000000 --- a/apis/python/examples/inspecting-schema.md +++ /dev/null @@ -1,212 +0,0 @@ -## Inspecting single SOMA schemas - -Before we do full-collection traversals, let's first take a look at a single SOMA. 
- -``` -import tiledbsc -soma = tiledbsc.SOMA('/mini-corpus/tiledb-data/tabula-sapiens-epithelial') -``` - -``` ->>> soma.obs.shape() -(104148, 26) - ->>> soma.var.shape() -(58559, 12) -``` - -``` ->>> soma.obs.df() - tissue_in_publication ... development_stage -obs_id ... -AAACCCAAGAACTCCT_TSP14_Lung_Distal_10X_1_1 Lung ... 59-year-old human stage -AAACCCAAGAGGGTAA_TSP8_Prostate_NA_10X_1_1 Prostate ... 56-year-old human stage -AAACCCAAGCCACTCG_TSP14_Prostate_NA_10X_1_2 Prostate ... 59-year-old human stage -AAACCCAAGCCGGAAT_TSP14_Liver_NA_10X_1_1 Liver ... 59-year-old human stage -AAACCCAAGCCTTGAT_TSP7_Tongue_Posterior_10X_1_1 Tongue ... 69-year-old human stage -... ... ... ... -TTTGTTGTCTACGGTA_TSP5_Eye_NA_10X_1_2 Eye ... 40-year-old human stage -TTTGTTGTCTATCGGA_TSP2_Lung_proxmedialdistal_10X... Lung ... 61-year-old human stage -TTTGTTGTCTCTCAAT_TSP2_Kidney_NA_10X_1_2 Kidney ... 61-year-old human stage -TTTGTTGTCTGCCTGT_TSP4_Mammary_NA_10X_1_2 Mammary ... 38-year-old human stage -TTTGTTGTCTGTAACG_TSP14_Prostate_NA_10X_1_1 Prostate ... 59-year-old human stage - -[104148 rows x 26 columns] - ->>> soma.var.df() - feature_type ensemblid ... feature_name feature_reference -var_id ... -ENSG00000000003 Gene Expression ENSG00000000003.14 ... b'TSPAN6' NCBITaxon:9606 -ENSG00000000005 Gene Expression ENSG00000000005.6 ... b'TNMD' NCBITaxon:9606 -ENSG00000000419 Gene Expression ENSG00000000419.12 ... b'DPM1' NCBITaxon:9606 -ENSG00000000457 Gene Expression ENSG00000000457.14 ... b'SCYL3' NCBITaxon:9606 -ENSG00000000460 Gene Expression ENSG00000000460.17 ... b'C1orf112' NCBITaxon:9606 -... ... ... ... ... ... -ENSG00000286268 Gene Expression ENSG00000286268.1 ... b'LL0XNC01-30I4.1' NCBITaxon:9606 -ENSG00000286269 Gene Expression ENSG00000286269.1 ... b'RP11-510D4.1' NCBITaxon:9606 -ENSG00000286270 Gene Expression ENSG00000286270.1 ... b'XXyac-YX60D10.3' NCBITaxon:9606 -ENSG00000286271 Gene Expression ENSG00000286271.1 ... 
b'CTD-2201E18.6' NCBITaxon:9606 -ENSG00000286272 Gene Expression ENSG00000286272.1 ... b'RP11-444B5.1' NCBITaxon:9606 - -[58559 rows x 12 columns] -``` - -``` ->>> soma.obs.keys() -['tissue_in_publication', 'assay_ontology_term_id', 'donor', 'anatomical_information', -'n_counts_UMIs', 'n_genes', 'cell_ontology_class', 'free_annotation', 'manually_annotated', -'compartment', 'sex_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', -'disease_ontology_term_id', 'ethnicity_ontology_term_id', 'development_stage_ontology_term_id', -'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'assay', 'disease', -'organism', 'sex', 'tissue', 'ethnicity', 'development_stage'] - ->>> soma.var.attr_names_to_types() -{'feature_type': dtype('>> -``` - -## Names of obs and var columns - -``` -print("OBS NAMES") -for soma in soco: - print(soma.name) - for attr_name in soma.obs.keys(): - print(" obs", attr_name) - -print("VAR NAMES") -for soma in soco: - print(soma.name) - for attr_name in soma.var.keys(): - print(" var", attr_name) -``` - -``` -OBS NAMES -tabula-sapiens-immune - obs tissue_in_publication - obs assay_ontology_term_id - obs donor - obs anatomical_information - obs n_counts_UMIs - obs n_genes - obs cell_ontology_class - obs free_annotation - obs manually_annotated - obs compartment - obs sex_ontology_term_id - obs is_primary_data - obs organism_ontology_term_id - obs disease_ontology_term_id - obs ethnicity_ontology_term_id - obs development_stage_ontology_term_id - obs cell_type_ontology_term_id - obs tissue_ontology_term_id - obs cell_type - obs assay - obs disease - obs organism - obs sex - obs tissue - obs ethnicity - obs development_stage -integrated-human-lung-cell-atlas - obs is_primary_data - obs assay_ontology_term_id - obs cell_type_ontology_term_id - obs development_stage_ontology_term_id - obs disease_ontology_term_id - obs ethnicity_ontology_term_id - obs tissue_ontology_term_id - obs organism_ontology_term_id - obs 
sex_ontology_term_id - obs sample - obs study - obs subject_ID - obs smoking_status - obs BMI - obs condition - obs subject_type - obs sample_type - obs 3'_or_5' - obs sequencing_platform - obs cell_ranger_version - obs fresh_or_frozen - obs dataset - obs anatomical_region_level_2 - obs anatomical_region_level_3 - obs anatomical_region_highest_res - obs age - obs ann_highest_res - obs n_genes - obs size_factors - obs log10_total_counts - obs mito_frac - obs ribo_frac - obs original_ann_level_1 - obs original_ann_level_2 - obs original_ann_level_3 - obs original_ann_level_4 - obs original_ann_level_5 - obs original_ann_nonharmonized - obs scanvi_label - obs leiden_1 - obs leiden_2 - obs leiden_3 - obs anatomical_region_ccf_score - obs entropy_study_leiden_3 - obs entropy_dataset_leiden_3 - obs entropy_subject_ID_leiden_3 - obs entropy_original_ann_level_1_leiden_3 - obs entropy_original_ann_level_2_clean_leiden_3 - obs entropy_original_ann_level_3_clean_leiden_3 - obs entropy_original_ann_level_4_clean_leiden_3 - obs entropy_original_ann_level_5_clean_leiden_3 - obs leiden_4 - obs reannotation_type - obs leiden_5 - obs ann_finest_level - obs ann_level_1 - obs ann_level_2 - obs ann_level_3 - obs ann_level_4 - obs ann_level_5 - obs ann_coarse_for_GWAS_and_modeling - obs cell_type - obs assay - obs disease - obs organism - obs sex - obs tissue - obs ethnicity - obs development_stage -... - -VAR NAMES -tabula-sapiens-immune - var feature_type - var ensemblid - var highly_variable - var means - var dispersions - var dispersions_norm - var mean - var std - var feature_biotype - var feature_is_filtered - var feature_name - var feature_reference -integrated-human-lung-cell-atlas - var n_cells - var highly_variable - var means - var dispersions - var feature_biotype - var feature_is_filtered - var feature_name - var feature_reference -... 
-``` diff --git a/apis/python/examples/obtaining-data-files.md b/apis/python/examples/obtaining-data-files.md index c3760451f7..98ca8f1792 100644 --- a/apis/python/examples/obtaining-data-files.md +++ b/apis/python/examples/obtaining-data-files.md @@ -5,45 +5,17 @@ This Python package supports import of H5AD and 10X data files. For example, you can visit [https://cellxgene.cziscience.com](https://cellxgene.cziscience.com) and select from among various choices there and download. -Files used for this example: +## Public bucket -``` -$ ls -Shlr /mini-corpus/anndata -total 58076592 --rw-r--r-- 1 testuser staff 34K Apr 25 11:13 subset_100_100.h5ad --rw-r--r-- 1 testuser staff 230K May 11 08:08 pbmc-small.h5ad --rw-r--r-- 1 testuser staff 4.3M May 10 18:12 10x_pbmc68k_reduced.h5ad --rw-r--r--@ 1 testuser staff 27M May 13 22:34 af9d8c03-696c-4997-bde8-8ef00844881b.h5ad --rw-r--r--@ 1 testuser staff 30M May 13 23:25 issue-74.h5ad --rw-r--r--@ 1 testuser staff 30M May 13 22:45 d4db74ad-a129-4b1a-b9da-1b30db86bbe4-issue-74.h5ad --rw-r--r--@ 1 testuser staff 32M May 13 22:17 Puck_200903_10.h5ad --rw-r--r--@ 1 testuser staff 36M Apr 25 11:13 local3.h5ad --rw-r--r-- 1 testuser staff 38M May 11 08:08 pbmc3k_processed.h5ad --rw-r--r--@ 1 testuser staff 40M May 13 22:21 brown-adipose-tissue-mouse.h5ad --rw-r--r-- 1 testuser staff 47M Apr 28 15:15 pbmc3k-krilow.h5ad --rw-r--r--@ 1 testuser staff 51M May 13 22:34 0cfab2d4-1b79-444e-8cbe-2ca9671ca85e.h5ad --rw-r--r--@ 1 testuser staff 55M May 13 22:33 4056cbab-2a32-4c9e-a55f-c930bc793fb6.h5ad --rw-r--r--@ 1 testuser staff 70M Apr 25 22:45 human-kidney-tumors-wilms.h5ad --rw-r--r--@ 1 testuser staff 99M May 11 09:57 issue-71.h5ad --rw-r--r--@ 1 testuser staff 117M May 13 22:20 adult-mouse-cortical-cell-taxonomy.h5ad --rw-r--r--@ 1 testuser staff 119M Apr 25 22:47 longitudinal-profiling-49.h5ad --rw-r--r--@ 1 testuser staff 221M Apr 25 22:46 single-cell-transcriptomes.h5ad --rw-r--r--@ 1 testuser staff 230M Apr 29 17:13 
Single_cell_atlas_of_peripheral_immune_response_to_SARS_CoV_2_infection.h5ad --rw-r--r--@ 1 testuser staff 231M Apr 25 11:13 vieira19_Alveoli_and_parenchyma_anonymised.processed.h5ad --rw-r--r--@ 1 testuser staff 312M May 11 08:12 not --rw-r--r--@ 1 testuser staff 329M Apr 25 11:13 local2.h5ad --rw-r--r--@ 1 testuser staff 357M May 10 20:08 issue-69.h5ad --rw-r--r--@ 1 testuser staff 376M Apr 25 22:48 acute-covid19-cohort.h5ad --rw-r--r--@ 1 testuser staff 686M Apr 25 22:50 autoimmunity-pbmcs.h5ad --rw-r--r--@ 1 testuser staff 712M May 13 22:22 developmental-single-cell-atlas-of-the-murine-lung.h5ad --rw-r--r--@ 1 testuser staff 2.5G May 13 22:35 tabula-sapiens-stromal.h5ad --rw-r--r--@ 1 testuser staff 3.2G May 13 22:30 tabula-sapiens-epithelial.h5ad --rw-r--r--@ 1 testuser staff 5.6G Apr 25 23:04 integrated-human-lung-cell-atlas.h5ad --rw-r--r--@ 1 testuser staff 5.7G May 13 22:38 tabula-sapiens-immune.h5ad --rw-r--r--@ 1 testuser staff 6.6G May 13 22:40 azimuth-meta-analysis.h5ad -``` +The public S3 bucket `s3://tiledb-singlecell-data/anndata` contains several H5AD files you can use. +See also the [public TileDB Cloud +notebook](https://cloud.tiledb.com/notebooks/details/johnkerl-tiledb/d3d7ff44-dc65-4cd9-b574-98312c4cbdbd/preview) +which reads these. + +You can copy one of these sample files to your current directory as follows: ``` -$ du -hs /mini-corpus/anndata - 27G /mini-corpus/anndata +aws s3 cp s3://tiledb-singlecell-data/anndata/pbmc3k_processed.h5ad . ``` + +![](images/public-bucket.png) diff --git a/apis/python/examples/soco-reconnaissance.md b/apis/python/examples/soco-reconnaissance.md index 0bd9610fd8..6e2602558b 100644 --- a/apis/python/examples/soco-reconnaissance.md +++ b/apis/python/examples/soco-reconnaissance.md @@ -1,128 +1,76 @@ -Next, let's do some cross-cutting queries over schemas of all SOMAs in the collection. 
The goal is --- in preparation for a collection-level query -- to find out which `obs` columns, and which values -in those columns, are most likely to be promising in terms of yielding results given our -mini-corpus. +Next, let's do some cross-cutting queries over schemas of all SOMAs in the collection, querying +annotation data (`obs` and/or `var`) to see what we have, in preparation for slice and batch queries +afterward. ## Total cell-counts -The mini-corpus we prepared is 29 SOMAs, 26GB total: +As noted in the [public TileDB Cloud notebook](https://cloud.tiledb.com/notebooks/details/johnkerl-tiledb/d3d7ff44-dc65-4cd9-b574-98312c4cbdbd/preview), we prepared an example SOMA collection in the S3 public bucket: ``` -$ du -hs /mini-corpus/tiledb-data - 26G /mini-corpus/tiledb-data - -$ ls /mini-corpus/tiledb-data | wc -l - 29 +soco = tiledbsc.SOMACollection('s3://tiledb-singlecell-data/soco/soco6') +for soma in soco: + print("%-20s %s" % (soma.name, soma.uri)) ``` -This collection includes data on about 2.4 million cells: - ``` -import tiledbsc -soco = tiledbsc.SOMACollection('/mini-corpus/soco') +acute-covid19-cohort s3://tiledb-singlecell-data/soco/soco6/acute-covid19-cohort +human-kidney-tumors-wilms s3://tiledb-singlecell-data/soco/soco6/human-kidney-tumors-wilms +autoimmunity-pbmcs s3://tiledb-singlecell-data/soco/soco6/autoimmunity-pbmcs +ileum s3://tiledb-singlecell-data/soco/soco6/ileum +brown-adipose-tissue-mouse s3://tiledb-singlecell-data/soco/soco6/brown-adipose-tissue-mouse +Puck_200903_10 s3://tiledb-singlecell-data/soco/soco6/Puck_200903_10 +``` ->>> sum(len(soma.obs) for soma in soco) -2464363 +This collection includes data on about 230K cells: ->>> [len(soma.obs) for soma in soco] -[264824, 4636, 6288, 2223, 59506, 100, 2638, 982538, 385, 67794, 2638, 104148, 44721, 3799, 11574, 1679, 3589, 700, 584884, 16245, 4603, 3726, 4636, 7348, 3589, 40268, 12971, 4232, 80, 82478, 97499, 38024] ``` +``` +>>> cell_counts = [len(soma.obs) for soma in soco] ``` ->>>
for soma in soco: -... print(len(soma.obs), soma.name) -... -264824 tabula-sapiens-immune -4636 wilms-tumors-seurat -6288 issue-69 -2223 brown-adipose-tissue-mouse -59506 acute-covid19-cohort -100 subset_100_100 -2638 pbmc3k_processed -982538 azimuth-meta-analysis -385 local3 -67794 developmental-single-cell-atlas-of-the-murine-lung -2638 pbmc3k-krilow -104148 tabula-sapiens-epithelial -44721 Single_cell_atlas_of_peripheral_immune_response_to_SARS_CoV_2_infection -3799 adipocytes-seurat -11574 longitudinal-profiling-49 -1679 adult-mouse-cortical-cell-taxonomy -3589 issue-74 -700 10x_pbmc68k_reduced -584884 integrated-human-lung-cell-atlas -16245 issue-71 -4603 4056cbab-2a32-4c9e-a55f-c930bc793fb6 -3726 0cfab2d4-1b79-444e-8cbe-2ca9671ca85e -4636 human-kidney-tumors-wilms -7348 local2 -3589 d4db74ad-a129-4b1a-b9da-1b30db86bbe4-issue-74 -40268 single-cell-transcriptomes -12971 vieira19_Alveoli_and_parenchyma_anonymised.processed -4232 af9d8c03-696c-4997-bde8-8ef00844881b -80 pbmc-small -82478 tabula-sapiens-stromal -97499 autoimmunity-pbmcs -38024 Puck_200903_10 +``` +>>> cell_counts +[59506, 4636, 97499, 32458, 2223, 38024] + +>>> sum(cell_counts) +234346 ``` -## Cell-counts before running a query +## Query cell counts -Before running a query, we may wish to know how many cells will be involved in the result: +Before running a query against the full `X` data, let's find out -- solely by looking at the smaller `obs` data -- how many cells would be involved if we were to query for, say, `"B cell"`: ``` ->>> [soma.obs.query('cell_type == "B cell"').size for soma in soco if 'cell_type' in soma.obs.keys()] -[514982, 0, 0, 14283, 245240, 391446, 0, 0, 0, 125154, 0, 29060, 0, 0, 311259, 176, 6120, 2480, 0, 0, 0, 26325, 5220, 0, 12750, 0] +>>> query_cell_counts = [len(soma.obs.query('cell_type == "B cell"').index) for soma in soco if 'cell_type' in soma.obs.keys()] ->>> sum([soma.obs.query('cell_type == "B cell"').size for soma in soco if 'cell_type' in soma.obs.keys()]) -1684495
-``` +>>> query_cell_counts +[6131, 0, 510, 3183, 529, 0] +>>> sum(query_cell_counts) +10353 ``` ->>> [soma.obs.query('cell_type == "leukocyte"').size for soma in soco if 'cell_type' in soma.obs.keys()] -[59436, 5616, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5472, 0, 0, 0, 0, 0, 0, 3753] ->>> sum([soma.obs.query('cell_type == "leukocyte"').size for soma in soco if 'cell_type' in soma.obs.keys()]) -74277 +## Counts by metadata values + ``` +soco = tiledbsc.SOMACollection('s3://tiledb-singlecell-data/soco/soco3a') +soco -## Datasets having all three of obs.cell_type, obs.tissue, and obs.feature_name +soco.keys() -``` -names = sorted([ - soma.name for soma in soco - if 'cell_type' in soma.obs.keys() and 'tissue' in soma.obs.keys() and 'feature_name' in soma.var.keys() -]) -for name in names: print(name) -``` +for soma in soco: + print("%6d %-20s %s" % (len(soma.obs), soma.name, soma.uri)) +for soma in soco: + print() + print(f"--- {soma.name}") + print(soma.obs.df().groupby('cell_type').size()) ``` -0cfab2d4-1b79-444e-8cbe-2ca9671ca85e -4056cbab-2a32-4c9e-a55f-c930bc793fb6 -Puck_200903_10 -acute-covid19-cohort -adult-mouse-cortical-cell-taxonomy -af9d8c03-696c-4997-bde8-8ef00844881b -autoimmunity-pbmcs -azimuth-meta-analysis -brown-adipose-tissue-mouse -developmental-single-cell-atlas-of-the-murine-lung -human-kidney-tumors-wilms -integrated-human-lung-cell-atlas -local2 -local3 -longitudinal-profiling-49 -single-cell-transcriptomes -tabula-sapiens-epithelial -tabula-sapiens-immune -tabula-sapiens-stromal -``` -## Show counts of obs_ids and var_ids across the collection +![](images/soco-reconnaissance.png) + +## More reconnaissance + +See also [collection-counts.py](collection-counts.py) for some additional material. 
-Using [./collection-counts.py](collection-counts.py) we can answer questions such as _How many cells will -be involved if I do a query?_ Since these pre-counts operate on the smaller `obs` arrays, they run -faster than going ahead and doing full queries (as shown below) on the larger `X` arrays. +For example: ``` ---------------------------------------------------------------- @@ -246,69 +194,7 @@ gamma-delta T cell 216 leukocyte 144 eukaryotic cell 28 -TOTAL count 181544 -dtype: int64 -Collection-wide counts of values of tissue - -obs_label tissue - count -name -blood 176908 -kidney 4636 - -TOTAL count 181544 -dtype: int64 -Collection-wide counts of values of cell_type_ontology_term_id - -obs_label cell_type_ontology_term_id - count -name -CL:0000576 29878 -CL:0000895 26887 -CL:0001054 23648 -CL:0000763 10261 -CL:0000788 8679 -CL:0000625 8658 -CL:0000236 8524 -CL:0000814 7755 -CL:0000939 6948 -CL:0000624 6726 -CL:0000909 6224 -CL:0000232 3918 -CL:0000938 3638 -CL:0000623 3474 -CL:0000897 3276 -CL:0000233 2926 -CL:0000134 2811 -CL:0000900 2387 -CL:0002396 1923 -CL:0000980 1825 -CL:0000084 1697 -CL:0000789 1659 -CL:1000449 1216 -CL:0000451 1061 -CL:0000786 1025 -CL:0000548 661 -CL:0000784 650 -CL:0000990 543 -CL:0000003 465 -CL:0000787 457 -CL:0000815 306 -CL:0000775 302 -CL:0000037 270 -CL:0000816 255 -CL:0000940 223 -CL:0000798 216 -CL:0000738 144 -CL:0000255 28 - TOTAL count 181544 dtype: int64 ... ``` - -## Conclusion - -From these we conclude that `obs.cell_type == "B cell"` and `obs.tissue == "blood"`, and -`var.feature_name == "MT-CO3"` (acquired similarly but not shown here) are likeliest to produce the -largest result set, given our local-disk mini-corpus. 
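The pre-count idiom used throughout the reconnaissance examples -- tally sizes over the small per-SOMA `obs` tables before touching the much larger `X` arrays -- reduces to very little code. A stdlib-only toy version (the collection contents and cell-type values below are made up; real code would use `soma.obs.query(...)` over a `tiledbsc.SOMACollection`):

```python
# Toy version of the pre-count idiom: estimate a query's result size
# from the small per-SOMA obs tables before running the full X query.
# SOMA names echo the examples; the cell-type lists are invented.
collection = {
    "acute-covid19-cohort": ["B cell", "B cell", "T cell", "leukocyte"],
    "human-kidney-tumors-wilms": ["leukocyte", "leukocyte"],
    "Puck_200903_10": ["T cell"],
}

def precount(cell_type):
    """Per-SOMA count of cells matching obs.cell_type == cell_type."""
    return {name: obs.count(cell_type) for name, obs in collection.items()}

by_soma = precount("B cell")
total = sum(by_soma.values())
print(by_soma)
print(total)  # -> 2
```

Since only annotation columns are scanned, this runs quickly even when the matching `X` slices would be large.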
diff --git a/apis/python/examples/soco-slice-query.md b/apis/python/examples/soco-slice-query.md index 066d5a431c..ce9d25dd62 100644 --- a/apis/python/examples/soco-slice-query.md +++ b/apis/python/examples/soco-slice-query.md @@ -8,36 +8,6 @@ can _query_ large datasets without having to first _download_ large datasets. Another key point is that the _out-of-core processing_ showing here allows you to slice data out of a collection which is far larger than fits in RAM. -## Populate the collection - -Here we use a few small sample files included in this repository. - -``` -import tiledbsc -import tiledbsc.io -import os -import shutil - -soco_path = './soco-attribute-filter' -if os.path.exists(soco_path): - shutil.rmtree(soco_path) - -soco = tiledbsc.SOMACollection(soco_path) -if not soco.exists(): - soco._create() - -for name, h5ad in [ - ('subset-soma-01', './anndata/subset-soma-01.h5ad'), - ('subset-soma-02', './anndata/subset-soma-02.h5ad'), - ('subset-soma-03', './anndata/subset-soma-03.h5ad'), - ('subset-soma-04', './anndata/subset-soma-04.h5ad'), -]: - soma_path = os.path.join(soco_path, name) - soma = tiledbsc.SOMA(soma_path) - tiledbsc.io.from_h5ad(soma, h5ad) - soco.add(soma) -``` - ## Do the slice query Using [soco-slice-query.py](soco-slice-query.py) diff --git a/apis/python/examples/uniform-collection.md b/apis/python/examples/uniform-collection.md index 735d50447e..f658f9df5d 100644 --- a/apis/python/examples/uniform-collection.md +++ b/apis/python/examples/uniform-collection.md @@ -1,11 +1,15 @@ -## Creating a SOMA collection +## Uniformizing a SOMA collection The [uniformizer script](../examples/uniformizer.py) shows an example of how to take a collection of H5AD files -- and/or already-ingested SOMAs -- and make them into a uniform collection. -This is an alternative to using the [ingestor](../tools/ingestor) script -- the ingestor script -pulls in data as-is, while this uniformizer is more strongly opinionated. 
You might think of this -script as a template for your own organization-specific opinionated uniformization. +This isn't necessary for data exploration, but you may find it a helpful +guide if at some point your organization needs to construct an atlas. + +This is intended for bulk/batch/scripted jobs, as an alternative to using the +[ingestor](../tools/ingestor) script -- the ingestor script pulls in data as-is, while this +uniformizer is more strongly opinionated. You might think of this script as a template for your own +organization-specific opinionated uniformization. ``` examples/uniformizer.py -v /Users/testuser/mini-corpus/atlas add-h5ad file-01.h5ad diff --git a/apis/python/src/tiledbsc/soma.py b/apis/python/src/tiledbsc/soma.py index cdc32581f7..e18c4fb228 100644 --- a/apis/python/src/tiledbsc/soma.py +++ b/apis/python/src/tiledbsc/soma.py @@ -307,6 +307,8 @@ def query( if slice_obs_df is None: return None obs_ids = list(slice_obs_df.index) + if len(obs_ids) == 0: + return None slice_var_df = self.var.query(query_string=var_query_string, ids=var_ids) # E.g.
querying for 'feature_name == "MT-CO3"' and this SOMA does have a feature_name column @@ -314,6 +316,8 @@ def query( if slice_var_df is None: return None var_ids = list(slice_var_df.index) + if len(var_ids) == 0: + return None # TODO: # do this here: diff --git a/apis/python/src/tiledbsc/soma_collection.py b/apis/python/src/tiledbsc/soma_collection.py index f0edc07b48..2428e8b4ba 100644 --- a/apis/python/src/tiledbsc/soma_collection.py +++ b/apis/python/src/tiledbsc/soma_collection.py @@ -183,6 +183,12 @@ def query( # print("Slice SOMA from", soma.name, soma.X.data.shape(), "to", soma_slice.ann.X.shape) soma_slices.append(soma_slice) + print("SLICES", len(soma_slices)) + for soma_slice in soma_slices: + print(soma_slice) + print() + for soma_slice in soma_slices: + print(soma_slice.obs) return SOMASlice.concat(soma_slices) # ----------------------------------------------------------------
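The two guards added to `soma.py` short-circuit empty slices: if either the `obs` query or the `var` query matches nothing, the method returns `None` rather than going on to slice `X` with an empty id list. A simplified model of that control flow (toy predicate-based query over plain dicts -- not the actual `SOMA.query` implementation):

```python
# Simplified model of the early-outs added to SOMA.query(): an empty
# obs slice or var slice returns None instead of driving an X slice
# with an empty id list.
def query(obs, var, obs_pred, var_pred):
    obs_ids = [k for k, row in obs.items() if obs_pred(row)]
    if len(obs_ids) == 0:
        return None                      # new early-out
    var_ids = [k for k, row in var.items() if var_pred(row)]
    if len(var_ids) == 0:
        return None                      # new early-out
    return obs_ids, var_ids              # would drive the X slice

obs = {"c1": {"cell_type": "B cell"}, "c2": {"cell_type": "T cell"}}
var = {"g1": {"feature_name": "MT-CO3"}}

hit = query(obs, var,
            lambda r: r["cell_type"] == "B cell",
            lambda r: r["feature_name"] == "MT-CO3")
miss = query(obs, var,
             lambda r: r["cell_type"] == "neuron",
             lambda r: True)
print(hit, miss)  # -> (['c1'], ['g1']) None
```

The collection-level `query` in `soma_collection.py` then simply skips SOMAs whose per-SOMA slice came back `None` before concatenating the rest.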