Optimize queries for allow_duplicates=False #208

johnkerl · 2022-06-30T19:02:18Z

Following on #199.

Summary

Setting allow_duplicates=False is the path of least surprise for users. If a matrix is ever rewritten or updated, people who query it expecting a matrix of numbers will get exactly that, rather than a matrix of lists in which each cell contains historical values.

However, this comes at some read-performance cost. The get-all-the-data timing queries done on #199 were insufficient; better queries are shown in the details below.

Using TileDB Core 2.10, which is released, we see about a 2x slowdown going to the 'safe' mode `allow_duplicates=False`. Using a dev version of core with mods to be released in TileDB Core 2.11 (in a couple of weeks), we see a slowdown closer to 1.2x.

Timing results

We took the same 59K cell (393M total) dataset, stored in four ways:

dupes-noconsol -- allow_duplicates=True, X/data not consolidated (in a half-dozen fragments)
dupes-consol -- a copy of that, but with X/data consolidated into a single fragment.
nodupes-noconsol -- allow_duplicates=False, X/data not consolidated (in a half-dozen fragments)
nodupes-consol -- a copy of that, but with X/data consolidated into a single fragment.

The cell-types available in that dataset are as follows:

>>> soma.obs.df().groupby('cell_type').size().sort_values()
cell_type
plasmacytoid dendritic cell          575
plasmablast                          586
platelet                            1007
dendritic cell                      1038
alpha-beta T cell                   1659
natural killer cell                 3248
B cell                              6131
CD4-positive, alpha-beta T cell     6726
CD8-positive, alpha-beta T cell     8658
monocyte                           29878
dtype: int64

Running the query code below on all four, we get the following timings using core 2.11:

nobs	dupes-noconsol_seconds	dupes-consol_seconds	nodupes-noconsol_seconds	nodupes-consol_seconds
575	1.215	1.201	1.228	1.243
586	1.186	1.198	1.174	1.197
1007	1.678	1.763	1.708	1.786
1038	2.321	2.431	2.160	2.218
1659	2.252	2.823	2.558	2.598
3248	4.046	3.444	3.849	3.587
6131	4.982	4.561	5.101	4.720
6726	4.670	4.820	4.955	4.504
8658	5.284	5.743	6.253	6.049
29878	17.523	15.439	18.501	16.221
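
One way to read the table above is as per-row slowdown ratios (nodupes seconds divided by dupes seconds). The following is only a sketch for summarizing the numbers already listed; the function name `slowdown_ratios` and the tuple layout are illustrative choices, not part of tiledbsc.

```python
# Timings copied from the table above, in seconds:
# (nobs, dupes-noconsol, dupes-consol, nodupes-noconsol, nodupes-consol)
TIMINGS = [
    (575, 1.215, 1.201, 1.228, 1.243),
    (586, 1.186, 1.198, 1.174, 1.197),
    (1007, 1.678, 1.763, 1.708, 1.786),
    (1038, 2.321, 2.431, 2.160, 2.218),
    (1659, 2.252, 2.823, 2.558, 2.598),
    (3248, 4.046, 3.444, 3.849, 3.587),
    (6131, 4.982, 4.561, 5.101, 4.720),
    (6726, 4.670, 4.820, 4.955, 4.504),
    (8658, 5.284, 5.743, 6.253, 6.049),
    (29878, 17.523, 15.439, 18.501, 16.221),
]


def slowdown_ratios(rows):
    """Return {nobs: (noconsol_ratio, consol_ratio)}, each nodupes/dupes."""
    return {
        nobs: (nd_nc / d_nc, nd_c / d_c)
        for nobs, d_nc, d_c, nd_nc, nd_c in rows
    }


ratios = slowdown_ratios(TIMINGS)
for nobs, (noconsol, consol) in ratios.items():
    print(f"nobs={nobs:>6} noconsol_ratio={noconsol:.3f} consol_ratio={consol:.3f}")
```

A ratio above 1 means the allow_duplicates=False copy was slower for that query; single runs like these are noisy, so small differences in either direction should be read with caution.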

Query code

#!/usr/bin/env python

import tiledb
import tiledbsc
import sys
import time

ctx = tiledb.Ctx({"py.init_buffer_bytes": 8 * 1024**3})

# ----------------------------------------------------------------
# Public bucket:
# s3://tiledb-singlecell-data/profile-data/dupes-consol
# s3://tiledb-singlecell-data/profile-data/dupes-noconsol
# s3://tiledb-singlecell-data/profile-data/nodupes-consol
# s3://tiledb-singlecell-data/profile-data/nodupes-noconsol
# uri = 's3://tiledb-singlecell-data/profile-data/nodupes-consol'
uri = '/tmp/nodupes-consol'
if len(sys.argv) >= 2:
    uri = sys.argv[1]
soma = tiledbsc.SOMA(uri, ctx=ctx)

# ----------------------------------------------------------------
# Choices for cell_type:
#
# >>> soma.obs.df().groupby('cell_type').size().sort_values()
# cell_type
# plasmacytoid dendritic cell          575
# plasmablast                          586
# platelet                            1007
# dendritic cell                      1038
# alpha-beta T cell                   1659
# natural killer cell                 3248
# B cell                              6131
# CD4-positive, alpha-beta T cell     6726
# CD8-positive, alpha-beta T cell     8658
# monocyte                           29878
# dtype: int64
# ----------------------------------------------------------------

which_attr = 'cell_type'
which_value = 'B cell'
if len(sys.argv) == 3:
    which_value = sys.argv[2]
if len(sys.argv) == 4:
    which_attr  = sys.argv[2]
    which_value = sys.argv[3]
query_string = f'{which_attr} == "{which_value}"'

print('URI:', uri)
print('query_string:', query_string)

# Time only the query itself; the result is named soma_slice so that it
# does not shadow the Python builtin `slice`.
o1 = time.time()
soma_slice = soma.query(
    obs_query_string=query_string,
)
o2 = time.time()
print(f"got shapes obs:{soma_slice.obs.shape} var:{soma_slice.var.shape} X/data:{soma_slice.X['data'].shape}")
fsec = "%.3f" % (o2 - o1)
print(f"SECONDS_TOTAL={fsec} NOBS={soma_slice.obs.shape[0]} QUERY='{query_string}' URI={uri}")

johnkerl · 2022-06-30T19:11:48Z

apis/python/src/tiledbsc/soma.py

@@ -64,6 +64,20 @@ def __init__(
        :param uri: URI of the TileDB group
        """

+        # People can (and should) call by name. However, it's easy to forget. For example,


Found during debug. soma = tiledbsc.SOMA(uri, ctx) is incorrect while soma = tiledbsc.SOMA(uri, ctx=ctx) is correct. This needs to be surfaced proactively to the user.

gsakkis · 2022-07-01

You may want to enforce keyword-only arguments to avoid such incorrect usage:
def __init__(self, uri, *, name=None, ...)

Manual runtime type checking is generally considered unidiomatic and doesn't scale well with the complexity of annotations (think of Dict[str, List[Optional[int]]]). For the cases that really need it, there are third-party libraries such as Typeguard or Pydantic that use and enforce annotations at runtime transparently.
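
To illustrate the keyword-only suggestion, here is a minimal sketch. The class `Example` and its parameters are hypothetical stand-ins, not the actual tiledbsc.SOMA signature; the point is only that a bare `*` makes Python itself reject the positional call.

```python
class Example:
    """Hypothetical stand-in for a class like SOMA; not the tiledbsc API."""

    def __init__(self, uri, *, name=None, ctx=None):
        # Everything after the bare `*` must be passed by keyword.
        self.uri = uri
        self.name = name
        self.ctx = ctx


# Correct usage: ctx is passed by name.
ok = Example("myuri", ctx={"some": "config"})

# Incorrect usage: a second positional argument raises TypeError at call
# time, instead of being silently bound to `name`.
try:
    Example("myuri", {"some": "config"})
    positional_rejected = False
except TypeError:
    positional_rejected = True

print("positional call rejected:", positional_rejected)
```

With this shape, no manual isinstance check is needed to catch the SOMA(uri, ctx) mistake.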

johnkerl · 2022-06-30T19:13:21Z

apis/python/src/tiledbsc/soma.py

+        # we'll do `A.df[obs_ids, :]`. We can't pass a `:` down the callstack to get there,
+        # but we pass `None` instead.
+        #
+        # It's important to do this. Say for example the X matrix is nobs=1000 by nvar=2000,


For example, given a 1000x2000 matrix, we were doing the following:

80 obs IDs -> 80 obs IDs

None (for :) -> 2000 var IDs

This behavior is not new. However, it performed acceptably with allow_duplicates=True but poorly with allow_duplicates=False.
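
The idea of passing `None` down the call stack and turning it into `:` only at the point of indexing can be sketched as follows. This is an illustration under assumptions: the helper name `to_selector` and the ID strings are hypothetical, and the real tiledbsc code may be structured differently.

```python
from typing import List, Optional, Union

Selector = Union[slice, List[str]]


def to_selector(ids: Optional[List[str]]) -> Selector:
    """Map None to a full slice (`:`); pass explicit ID lists through."""
    return slice(None) if ids is None else list(ids)


# Explicit obs IDs stay an explicit list.
obs_sel = to_selector(["cell_1", "cell_2"])

# None stays a single full-range selector rather than being expanded
# into a list of every var ID.
var_sel = to_selector(None)

print(obs_sel, var_sel)
```

The resulting selectors would then be used in an indexing expression of the form `A.df[obs_sel, var_sel]`, so the full dimension is requested as one range.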

johnkerl · 2022-06-30T19:13:42Z

apis/python/src/tiledbsc/soma_collection.py

+        # People can (and should) call by name. However, it's easy to forget. For example,
+        # if someone does 'tiledbsc.SOMACollection("myuri", ctx)' instead of 'tiledbsc.SOMACollection("myuri", ctx=ctx)',
+        # behavior will not be what they expect, and we should let them know sooner than later.
+        if name is not None:


Arg-checking as above, for SOMACollection as well as SOMA.

johnkerl · 2022-06-30T19:14:07Z

apis/python/src/tiledbsc/soma_slice.py

@@ -57,37 +57,6 @@ def __init__(
        # self.raw_var = raw_var
        assert "data" in X

-        # Find the dtype.


This code is vestigial and was adding unnecessary time to SOMA slice queries.

Optimize queries for allow_duplicates=False

15d6ede

johnkerl mentioned this pull request Jun 30, 2022

Disallow duplicates #199

Merged

johnkerl commented Jun 30, 2022

johnkerl requested review from aaronwolen, ihnorton and Shelnutt2 June 30, 2022 19:14

johnkerl marked this pull request as ready for review June 30, 2022 19:15

johnkerl requested a review from gspowley June 30, 2022 19:23

ihnorton approved these changes Jun 30, 2022

johnkerl merged commit da84a48 into main Jun 30, 2022

johnkerl deleted the kerl/dup-test branch June 30, 2022 21:44

johnkerl mentioned this pull request Jul 1, 2022

Use keyword-only args #209

Merged
