Collection-slice with heterogeneous data (#207)
* temp

* doc material
johnkerl authored Jun 29, 2022
1 parent f6cddf6 commit ac63388
Showing 10 changed files with 106 additions and 79 deletions.
4 changes: 2 additions & 2 deletions apis/python/doc/soma_collection.md
@@ -109,10 +109,10 @@ member exists. Overloads the `[...]` operator.
#### query

```python
def query(obs_attr_names: Optional[List[str]] = None,
def query(obs_attrs: Optional[List[str]] = None,
obs_query_string: str = None,
obs_ids: List[str] = None,
var_attr_names: Optional[List[str]] = None,
var_attrs: Optional[List[str]] = None,
var_query_string: str = None,
var_ids: List[str] = None) -> Optional[SOMASlice]
```
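
As a rough illustration of how these arguments combine (the `SOMACollection.query` docstring later in this commit says `obs_ids`/`var_ids` are effectively ANDed into the query, and that omitting the attribute list takes all attributes), here is a minimal sketch on plain Python rows. The `select_rows` helper, the `rows` data, and the use of a callable in place of a query string are all assumptions made for illustration; none of this is tiledbsc API.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


def select_rows(
    rows: Dict[str, Row],
    attrs: Optional[List[str]] = None,
    predicate: Optional[Callable[[Row], bool]] = None,
    ids: Optional[List[str]] = None,
) -> Optional[Dict[str, Row]]:
    """Hypothetical stand-in for an obs/var query: filter by ids AND predicate,
    then keep only the requested attributes. Returns None for an empty result."""
    selected = {}
    for row_id, row in rows.items():
        if ids is not None and row_id not in ids:
            continue  # ids restrict the candidate rows first
        if predicate is not None and not predicate(row):
            continue  # the predicate further restricts them (logical AND)
        if attrs is None:
            selected[row_id] = dict(row)  # no attrs given: take every attribute
        else:
            selected[row_id] = {name: row[name] for name in attrs}
    return selected or None


# Made-up example data, not taken from any real SOMA.
rows = {
    "cell1": {"cell_type": "B cell", "tissue": "blood"},
    "cell2": {"cell_type": "T cell", "tissue": "blood"},
    "cell3": {"cell_type": "B cell", "tissue": "kidney"},
}

result = select_rows(
    rows,
    attrs=["cell_type"],
    predicate=lambda row: row["cell_type"] == "B cell",
    ids=["cell1", "cell2"],
)
print(result)
```

Only `cell1` satisfies both the id list and the predicate, and only its `cell_type` attribute is kept.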
2 changes: 1 addition & 1 deletion apis/python/examples/soco-batch-query.md
@@ -34,7 +34,7 @@ n = len(ctot_ids)
print("cell_type_ontology_term_id count =", n)
for i, ctot_id in enumerate(ctot_ids):
soma_slice = soco.query(
obs_attr_names=["cell_type_ontology_term_id"],
obs_attrs=["cell_type_ontology_term_id"],
obs_query_string=f'cell_type_ontology_term_id == "{ctot_id}"',
)
if soma_slice is None:
2 changes: 1 addition & 1 deletion apis/python/examples/soco-batch-query.py
@@ -32,7 +32,7 @@
print("cell_type_ontology_term_id count =", n)
for i, ctot_id in enumerate(ctot_ids):
soma_slice = soco.query(
obs_attr_names=["cell_type_ontology_term_id"],
obs_attrs=["cell_type_ontology_term_id"],
obs_query_string=f'cell_type_ontology_term_id == "{ctot_id}"',
)
if soma_slice is None:
78 changes: 38 additions & 40 deletions apis/python/examples/soco-slice-query.md
@@ -1,68 +1,66 @@
Here we show an example of doing a _slice query_ across a `SOMACollection` -- we extract a
relatively small subset out of the full collection for analysis.

:::{.callout-tip}
A key point is that these data (shown here on local disk) can likewise be stored on object stores
like S3. This means you can _query_ large datasets without having to first _download_ large
datasets.
:::

:::{.callout-tip}
Another key point is that the _out-of-core processing_ shown here allows you to slice data out of
a collection which is far larger than fits in RAM.
:::

:::{.callout-tip}
Best S3 read performance is obtained by querying from an in-region EC2 instance, or a TileDB Cloud
notebook -- this is preferred to laptop-to-S3 reads.
:::

## Do the slice query
## Prepare the inputs

Using [soco-slice-query.py](soco-slice-query.py)
As shown in the [public TileDB Cloud notebook](https://cloud.tiledb.com/notebooks/details/johnkerl-tiledb/d3d7ff44-dc65-4cd9-b574-98312c4cbdbd/preview):

```
TWO-SIDED QUERY
Wrote mini-atlas-two-sided.h5ad (8524, 1)
Wrote mini-atlas-two-sided (8524, 1)
OBS-ONLY QUERY
Wrote mini-atlas-obs-sided.h5ad (8524, 21648)
Wrote mini-atlas-obs-sided (8524, 21648)
ctx = tiledb.Ctx({"py.init_buffer_bytes": 4 * 1024**3})
soco = tiledbsc.SOMACollection("s3://tiledb-singlecell-data/soco/soco3", ctx)
```

VAR-ONLY QUERY
Wrote mini-atlas-var-sided.h5ad (181544, 1)
Wrote mini-atlas-var-sided (181544, 1)
Slices to be concatenated must all have the same attributes for their `obs` and `var`. If the input SOMAs were all normalized (see also [Uniformizing a Collection](uniform-collection.md)), we wouldn't need to specify `obs_attrs` and `var_attrs`. Since the input data here is heterogeneous, though, we find which `obs`/`var` attributes they all have in common.

OBS-ONLY QUERY
Wrote cell-ontology-236.h5ad (8524, 21648)
Wrote cell-ontology-236 (8524, 21648)
```
obs_attrs_set = None
var_attrs_set = None
for soma in soco:
if obs_attrs_set is None:
obs_attrs_set = set(soma.obs.keys())
var_attrs_set = set(soma.var.keys())
else:
obs_attrs_set = set(soma.obs.keys()).intersection(obs_attrs_set)
var_attrs_set = set(soma.var.keys()).intersection(var_attrs_set)
obs_attrs = sorted(list(obs_attrs_set))
var_attrs = sorted(list(var_attrs_set))
```
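
The loop above can also be sketched on plain Python data, independent of tiledbsc. This is a minimal sketch, assuming each SOMA's attribute names are available as a list of strings; the `common_attrs` helper and the `per_soma_obs_attrs` sample data are hypothetical and exist only for illustration.

```python
from typing import Iterable, List


def common_attrs(attr_lists: Iterable[Iterable[str]]) -> List[str]:
    """Return the sorted attribute names present in every input list."""
    sets = [set(attrs) for attrs in attr_lists]
    if not sets:
        return []  # no inputs: nothing in common
    # set.intersection accepts any number of additional sets.
    return sorted(set.intersection(*sets))


# Made-up attribute names for three heterogeneous inputs.
per_soma_obs_attrs = {
    "soma_a": ["cell_type", "tissue", "sex"],
    "soma_b": ["tissue", "cell_type", "assay"],
    "soma_c": ["cell_type", "tissue"],
}

obs_attrs = common_attrs(per_soma_obs_attrs.values())
print(obs_attrs)
```

Sorting the result gives a stable attribute order regardless of the order in which the inputs are visited.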

## Examine the results
## Do the query

```
$ peek-soma mini-atlas-two-sided
johnkerl@Kerl-MBP[prod][python]$ peek-soma mini-atlas-obs-sided
>>> soma.obs.df()
assay_ontology_term_id cell_type_ontology_term_id ... sex tissue
obs_id ...
AAACCCACACCCAATA EFO:0009922 CL:0000236 ... male blood
AAACCCAGTTCCACAA EFO:0009922 CL:0000236 ... male blood
AAACCCATCCCTCATG EFO:0009922 CL:0000236 ... male blood
AAACCCATCGAAGAAT EFO:0009922 CL:0000236 ... male blood
AAACGAAAGAATTTGG EFO:0009922 CL:0000236 ... male blood
... ... ... ... ... ...
batch4_5p_rna|TTTGTCAAGACTGTAA-1 EFO:0011025 CL:0000236 ... unknown blood
batch4_5p_rna|TTTGTCAAGGATGGTC-1 EFO:0011025 CL:0000236 ... unknown blood
batch4_5p_rna|TTTGTCACATCGATTG-1 EFO:0011025 CL:0000236 ... unknown blood
batch4_5p_rna|TTTGTCAGTATGAAAC-1 EFO:0011025 CL:0000236 ... unknown blood
batch4_5p_rna|TTTGTCAGTCGCATAT-1 EFO:0011025 CL:0000236 ... unknown blood
slice = soco.query(
obs_query_string='cell_type == "pericyte cell"',
var_query_string='feature_name == "DPM1"',
obs_attrs=obs_attrs,
var_attrs=var_attrs,
)
ann = slice.to_anndata()
```
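
The query passes `obs_attrs` and `var_attrs` because, as noted earlier, slices can only be concatenated when their attributes agree. A minimal sketch of that kind of precondition follows, assuming each slice's attribute names are given as a list; the `check_same_attrs` helper is hypothetical, and it raises `ValueError` where the library code in this commit raises a plain `Exception`.

```python
from typing import Iterable, List


def check_same_attrs(kind: str, attr_lists: List[Iterable[str]]) -> None:
    """Raise ValueError unless every list holds the same names, in any order."""
    if not attr_lists:
        return
    first = sorted(attr_lists[0])
    for attrs in attr_lists[1:]:
        # Sorting before comparing makes the check insensitive to column order.
        if sorted(attrs) != first:
            raise ValueError(
                f"slices to be concatenated must have all the same {kind} attributes"
            )


# Same names in a different order: accepted.
check_same_attrs("obs", [["cell_type", "tissue"], ["tissue", "cell_type"]])

# A missing attribute: rejected.
try:
    check_same_attrs("var", [["feature_name", "feature_id"], ["feature_name"]])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```

Comparing sorted lists rather than raw lists is what lets inputs with differently ordered columns pass the check.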

[8524 rows x 16 columns]
## Persist the output

>>> soma.var.df()
Empty DataFrame
Columns: []
Index: [ENSG00000000003, ENSG00000000419, ENSG00000000457, ...]
```
slice_soma = tiledbsc.SOMA('slice-query-output')
tiledbsc.io.from_anndata(slice_soma, ann)
```

[21648 rows x 0 columns]
## Examine the results

```
![](images/slice-query-output.png)
26 changes: 13 additions & 13 deletions apis/python/examples/soco-slice-query.py
@@ -13,16 +13,16 @@ def soco_query_and_store(
soco: tiledbsc.SOMACollection,
output_h5ad_path: str,
output_soma_path: str,
obs_attr_names: Optional[List[str]] = None,
obs_attrs: Optional[List[str]] = None,
obs_query_string: str = None,
var_attr_names: Optional[List[str]] = None,
var_attrs: Optional[List[str]] = None,
var_query_string: str = None,
) -> None:

result_soma_slice = soco.query(
obs_attr_names=obs_attr_names,
obs_attrs=obs_attrs,
obs_query_string=obs_query_string,
var_attr_names=var_attr_names,
var_attrs=var_attrs,
var_query_string=var_query_string,
)

@@ -53,9 +53,9 @@ def soco_query_and_store(
soco=tiledbsc.SOMACollection("/Users/johnkerl/mini-corpus/atlas"),
output_h5ad_path="mini-atlas-two-sided.h5ad",
output_soma_path="mini-atlas-two-sided",
obs_attr_names=["cell_type"],
obs_attrs=["cell_type"],
obs_query_string='cell_type == "B cell"',
var_attr_names=["feature_name"],
var_attrs=["feature_name"],
var_query_string='feature_name == "MT-CO3"',
)

@@ -66,7 +66,7 @@ def soco_query_and_store(
soco=tiledbsc.SOMACollection("/Users/johnkerl/mini-corpus/atlas"),
output_h5ad_path="mini-atlas-obs-sided.h5ad",
output_soma_path="mini-atlas-obs-sided",
obs_attr_names=["cell_type"],
obs_attrs=["cell_type"],
obs_query_string='cell_type == "B cell"',
)

@@ -77,7 +77,7 @@ def soco_query_and_store(
soco=tiledbsc.SOMACollection("/Users/johnkerl/mini-corpus/atlas"),
output_h5ad_path="mini-atlas-var-sided.h5ad",
output_soma_path="mini-atlas-var-sided",
var_attr_names=["feature_name"],
var_attrs=["feature_name"],
var_query_string='feature_name == "MT-CO3"',
)

@@ -88,7 +88,7 @@ def soco_query_and_store(
soco=tiledbsc.SOMACollection("/Users/johnkerl/mini-corpus/atlas"),
output_h5ad_path="cell-ontology-236.h5ad",
output_soma_path="cell-ontology-236",
obs_attr_names=["cell_type_ontology_term_id"],
obs_attrs=["cell_type_ontology_term_id"],
obs_query_string='cell_type_ontology_term_id == "CL:0000236"',
)

@@ -99,7 +99,7 @@ def soco_query_and_store(
soco=tiledbsc.SOMACollection("/Users/johnkerl/mini-corpus/atlas"),
output_h5ad_path="kidney.h5ad",
output_soma_path="kidney",
obs_attr_names=["tissue"],
obs_attrs=["tissue"],
obs_query_string='tissue == "kidney"',
)

@@ -110,7 +110,7 @@ def soco_query_and_store(
soco=tiledbsc.SOMACollection("/Users/johnkerl/mini-corpus/atlas"),
output_h5ad_path="platelet.h5ad",
output_soma_path="platelet",
obs_attr_names=["cell_type"],
obs_attrs=["cell_type"],
obs_query_string='cell_type == "platelet"',
)

@@ -121,7 +121,7 @@ def soco_query_and_store(
soco=tiledbsc.SOMACollection("/Users/johnkerl/mini-corpus/atlas"),
output_h5ad_path="platelet.h5ad",
output_soma_path="platelet",
obs_attr_names=["cell_type", "tissue"],
obs_attrs=["cell_type", "tissue"],
obs_query_string='cell_type == "B cell" and tissue == "blood"',
)

@@ -132,6 +132,6 @@ def soco_query_and_store(
soco=tiledbsc.SOMACollection("/Users/johnkerl/mini-corpus/atlas"),
output_h5ad_path="platelet.h5ad",
output_soma_path="platelet",
obs_attr_names=["cell_type", "tissue"],
obs_attrs=["cell_type", "tissue"],
obs_query_string='cell_type == "B cell" or cell_type == "T cell"',
)
21 changes: 16 additions & 5 deletions apis/python/src/tiledbsc/soma.py
@@ -290,18 +290,27 @@ def dim_slice(self, obs_ids, var_ids) -> Dict:
# ----------------------------------------------------------------
def query(
self,
obs_attrs: Optional[List[str]] = None,
obs_query_string: Optional[str] = None,
var_query_string: Optional[str] = None,
obs_ids: Optional[List[str]] = None,
var_attrs: Optional[List[str]] = None,
var_query_string: Optional[str] = None,
var_ids: Optional[List[str]] = None,
) -> SOMASlice:
"""
Subselects the SOMA's obs, var, and X/data using the specified queries on obs and var.
Queries use the TileDB-Py `QueryCondition` API. If `obs_query_string` is `None`,
the `obs` dimension is not filtered and all of `obs` is used; similarly for `var`.
Queries use the TileDB-Py `QueryCondition` API.
If `obs_query_string` is `None`, the `obs` dimension is not filtered and all of `obs` is
used; similarly for `var`.
If `obs_attrs` or `var_attrs` are unspecified, the slice will take all `obs`/`var` attributes
from the source SOMAs; if they are specified, the slice will take the specified `obs`/`var`
attributes.
"""

slice_obs_df = self.obs.query(query_string=obs_query_string, ids=obs_ids)
slice_obs_df = self.obs.query(
query_string=obs_query_string, ids=obs_ids, attrs=obs_attrs
)
# E.g. querying for 'cell_type == "blood"' and this SOMA does have a cell_type column in its
# obs, but no rows with cell_type == "blood".
if slice_obs_df is None:
@@ -310,7 +319,9 @@ def query(
if len(obs_ids) == 0:
return None

slice_var_df = self.var.query(query_string=var_query_string, ids=var_ids)
slice_var_df = self.var.query(
query_string=var_query_string, ids=var_ids, attrs=var_attrs
)
# E.g. querying for 'feature_name == "MT-CO3"' and this SOMA does have a feature_name column
# in its var, but no rows with feature_name == "MT-CO3".
if slice_var_df is None:
27 changes: 18 additions & 9 deletions apis/python/src/tiledbsc/soma_collection.py
@@ -141,39 +141,48 @@ def __getitem__(self, name) -> SOMA:
# ----------------------------------------------------------------
def query(
self,
obs_attr_names: Optional[List[str]] = None,
obs_attrs: Optional[List[str]] = None,
obs_query_string: str = None,
obs_ids: List[str] = None,
var_attr_names: Optional[List[str]] = None,
var_attrs: Optional[List[str]] = None,
var_query_string: str = None,
var_ids: List[str] = None,
) -> Optional[SOMASlice]:
"""
Subselects the obs, var, and X/data using the specified queries on obs and var,
concatenating across SOMAs in the collection. Queries use the TileDB-Py `QueryCondition`
API. If `obs_query_string` is `None`, the `obs` dimension is not filtered and all of `obs`
is used; similarly for `var`. Return value of `None` indicates an empty slice.
If `obs_ids` or `var_ids` are not `None`, they are effectively ANDed into the query.
For example, you can pass in a known list of `obs_ids`, then use `obs_query_string`
to further restrict the query.
API.
If `obs_query_string` is `None`, the `obs` dimension is not filtered and all of `obs` is
used; similarly for `var`. Return value of `None` indicates an empty slice. If `obs_ids`
or `var_ids` are not `None`, they are effectively ANDed into the query. For example, you
can pass in a known list of `obs_ids`, then use `obs_query_string` to further restrict the
query.
If `obs_attrs` or `var_attrs` are unspecified, slices will take all `obs`/`var` attributes
from their source SOMAs; if they are specified, slices will take the specified `obs`/`var`
attributes. If all SOMAs in the collection have the same `obs`/`var` attributes, then you
needn't specify these; if they don't, you must.
"""

soma_slices = []
for soma in self:
# E.g. querying for 'cell_type == "blood"' but this SOMA doesn't have a cell_type column in
# its obs at all.
if obs_query_string is not None and not soma.obs.has_attr_names(
obs_attr_names or []
obs_attrs or []
):
continue
# E.g. querying for 'feature_name == "MT-CO3"' but this SOMA doesn't have a feature_name
# column in its var at all.
if var_query_string is not None and not soma.var.has_attr_names(
var_attr_names or []
var_attrs or []
):
continue

soma_slice = soma.query(
obs_attrs=obs_attrs,
var_attrs=var_attrs,
obs_query_string=obs_query_string,
var_query_string=var_query_string,
obs_ids=obs_ids,
17 changes: 13 additions & 4 deletions apis/python/src/tiledbsc/soma_slice.py
@@ -165,10 +165,19 @@ def concat(cls, soma_slices):
for i, slicei in enumerate(soma_slices):
if i == 0:
continue
# This works in Python -- not just a reference/pointer compare but a contents-compare :)
assert list(slicei.X.keys()) == list(slice0.X.keys())
assert list(slicei.obs.keys()) == list(slice0.obs.keys())
assert list(slicei.var.keys()) == list(slice0.var.keys())
# This list-equals works in Python -- not just a reference/pointer compare but a contents-compare :)
if sorted(list(slicei.X.keys())) != sorted(list(slice0.X.keys())):
raise Exception(
"SOMA slices to be concatenated must have all the same X attributes"
)
if sorted(list(slicei.obs.keys())) != sorted(list(slice0.obs.keys())):
raise Exception(
"SOMA slices to be concatenated must have all the same obs attributes"
)
if sorted(list(slicei.var.keys())) != sorted(list(slice0.var.keys())):
raise Exception(
"SOMA slices to be concatenated must have all the same var attributes"
)

# Use AnnData.concat.
# TODO: try to remove this dependency.
8 changes: 4 additions & 4 deletions apis/python/tests/test_soco_slice_query.py
@@ -27,20 +27,20 @@ def test_soco_slice_query(tmp_path):
soco.add(soma)

# Do the slice query
obs_attr_names = ["tissue"]
obs_attrs = ["tissue"]
obs_query_string = 'tissue == "blood"'
var_attr_names = ["feature_name"]
var_attrs = ["feature_name"]
var_query_string = 'feature_name == "MT-CO3"'

soma_slices = []
for soma in soco:
# E.g. querying for 'cell_type == "blood"' but this SOMA doesn't have a cell_type column in
# its obs at all.
if not soma.obs.has_attr_names(obs_attr_names):
if not soma.obs.has_attr_names(obs_attrs):
continue
# E.g. querying for 'feature_name == "MT-CO3"' but this SOMA doesn't have a feature_name
# column in its var at all.
if not soma.var.has_attr_names(var_attr_names):
if not soma.var.has_attr_names(var_attrs):
continue

soma_slice = soma.query(