Optimize queries for allow_duplicates=False #208

Merged
merged 1 commit into from
Jun 30, 2022

Conversation

johnkerl
Member

@johnkerl johnkerl commented Jun 30, 2022

Following on #199.

Summary

Setting allow_duplicates=False is the path of least surprise for users. If a matrix is ever rewritten or updated, people who query expecting a matrix of numbers will get exactly that, rather than a matrix of lists in which each cell holds its historical values.

However, this comes at some read-performance cost. The get-all-the-data timing queries done on #199 were insufficient; better queries are shown in the details below.

Using TileDB Core 2.10, which is released, we see about a 2x slowdown going to the 'safe' mode `allow_duplicates=False`. Using a dev version of core, with changes to be released in TileDB Core 2.11 (in a couple of weeks), the slowdown is more like 1.2x.
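
To make the "matrix of numbers vs. matrix of lists" distinction concrete, here is a toy model in plain Python. It is only a sketch and does not use TileDB: the `fragments` list and the `read_cells` helper are hypothetical names, and "the later write wins" is an assumption of the toy model, not a description of the core implementation.

```python
from collections import defaultdict

# Toy model only: each "fragment" is one write of (obs_id, var_id) -> value.
fragments = [
    {("cell1", "geneA"): 1.0, ("cell1", "geneB"): 2.0},  # initial write
    {("cell1", "geneA"): 5.0},                            # later update
]


def read_cells(fragments, allow_duplicates):
    """Return what a reader of this toy store sees for each cell coordinate."""
    if allow_duplicates:
        # Every write is kept, so each cell maps to its list of historical values.
        history = defaultdict(list)
        for fragment in fragments:
            for coords, value in fragment.items():
                history[coords].append(value)
        return dict(history)
    # Assumption of the toy model: later fragments overwrite earlier ones.
    latest = {}
    for fragment in fragments:
        latest.update(fragment)
    return latest


with_dupes = read_cells(fragments, allow_duplicates=True)
without_dupes = read_cells(fragments, allow_duplicates=False)
print(with_dupes[("cell1", "geneA")])     # [1.0, 5.0]
print(without_dupes[("cell1", "geneA")])  # 5.0
```

The deduplicating read has to reconcile overlapping fragments, which is one plausible intuition for why the 'safe' mode costs some read time.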

Timing results

We took the same 59K cell (393M total) dataset, stored in four ways:

  • dupes-noconsol -- allow_duplicates=True, X/data not consolidated (in a half-dozen fragments)
  • dupes-consol -- a copy of that, but with X/data consolidated into a single fragment.
  • nodupes-noconsol -- allow_duplicates=False, X/data not consolidated (in a half-dozen fragments)
  • nodupes-consol -- a copy of that, but with X/data consolidated into a single fragment.

The cell-types available in that dataset are as follows:

>>> soma.obs.df().groupby('cell_type').size().sort_values()
cell_type
plasmacytoid dendritic cell          575
plasmablast                          586
platelet                            1007
dendritic cell                      1038
alpha-beta T cell                   1659
natural killer cell                 3248
B cell                              6131
CD4-positive, alpha-beta T cell     6726
CD8-positive, alpha-beta T cell     8658
monocyte                           29878
dtype: int64

Running the query code below on all four datasets, we get the following timings (in seconds) using core 2.11:

| nobs | dupes-noconsol (s) | dupes-consol (s) | nodupes-noconsol (s) | nodupes-consol (s) |
|---|---|---|---|---|
| 575 | 1.215 | 1.201 | 1.228 | 1.243 |
| 586 | 1.186 | 1.198 | 1.174 | 1.197 |
| 1007 | 1.678 | 1.763 | 1.708 | 1.786 |
| 1038 | 2.321 | 2.431 | 2.160 | 2.218 |
| 1659 | 2.252 | 2.823 | 2.558 | 2.598 |
| 3248 | 4.046 | 3.444 | 3.849 | 3.587 |
| 6131 | 4.982 | 4.561 | 5.101 | 4.720 |
| 6726 | 4.670 | 4.820 | 4.955 | 4.504 |
| 8658 | 5.284 | 5.743 | 6.253 | 6.049 |
| 29878 | 17.523 | 15.439 | 18.501 | 16.221 |
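
For anyone who wants per-row slowdown factors from these timings, the arithmetic can be sketched as follows. The numbers are copied from the table above; `slowdown_ratios` is a hypothetical helper name, and the ratio is simply nodupes time divided by dupes time for the same consolidation state.

```python
# Timings in seconds, copied from the table above (core 2.11).
# Columns: nobs, dupes-noconsol, dupes-consol, nodupes-noconsol, nodupes-consol
timings = [
    (575, 1.215, 1.201, 1.228, 1.243),
    (586, 1.186, 1.198, 1.174, 1.197),
    (1007, 1.678, 1.763, 1.708, 1.786),
    (1038, 2.321, 2.431, 2.160, 2.218),
    (1659, 2.252, 2.823, 2.558, 2.598),
    (3248, 4.046, 3.444, 3.849, 3.587),
    (6131, 4.982, 4.561, 5.101, 4.720),
    (6726, 4.670, 4.820, 4.955, 4.504),
    (8658, 5.284, 5.743, 6.253, 6.049),
    (29878, 17.523, 15.439, 18.501, 16.221),
]


def slowdown_ratios(rows):
    """Return (nobs, noconsol ratio, consol ratio), each ratio = nodupes / dupes."""
    result = []
    for nobs, dupes_nc, dupes_c, nodupes_nc, nodupes_c in rows:
        result.append((nobs, nodupes_nc / dupes_nc, nodupes_c / dupes_c))
    return result


ratios = slowdown_ratios(timings)
for nobs, r_noconsol, r_consol in ratios:
    print(f"{nobs:>6} noconsol={r_noconsol:.2f}x consol={r_consol:.2f}x")
```

These are single-run timings, so small differences between columns may well be noise.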

Query code

#!/usr/bin/env python

import tiledb
import tiledbsc
import sys
import time

ctx = tiledb.Ctx({"py.init_buffer_bytes": 8 * 1024**3})

# ----------------------------------------------------------------
# Public bucket:
# s3://tiledb-singlecell-data/profile-data/dupes-consol
# s3://tiledb-singlecell-data/profile-data/dupes-noconsol
# s3://tiledb-singlecell-data/profile-data/nodupes-consol
# s3://tiledb-singlecell-data/profile-data/nodupes-noconsol
# uri = 's3://tiledb-singlecell-data/profile-data/nodupes-consol'
uri = '/tmp/nodupes-consol'
if len(sys.argv) >= 2:
    uri = sys.argv[1]
soma = tiledbsc.SOMA(uri, ctx=ctx)

# ----------------------------------------------------------------
# Choices for cell_type:
#
# >>> soma.obs.df().groupby('cell_type').size().sort_values()
# cell_type
# plasmacytoid dendritic cell          575
# plasmablast                          586
# platelet                            1007
# dendritic cell                      1038
# alpha-beta T cell                   1659
# natural killer cell                 3248
# B cell                              6131
# CD4-positive, alpha-beta T cell     6726
# CD8-positive, alpha-beta T cell     8658
# monocyte                           29878
# dtype: int64
# ----------------------------------------------------------------

which_attr = 'cell_type'
which_value = 'B cell'
if len(sys.argv) == 3:
    which_value = sys.argv[2]
if len(sys.argv) == 4:
    which_attr  = sys.argv[2]
    which_value = sys.argv[3]
query_string = f'{which_attr} == "{which_value}"'

print('URI:', uri)
print('query_string:', query_string)

o1 = time.time()
slice = soma.query(
    obs_query_string=query_string,
)
o2 = time.time()
print(f"got shapes obs:{slice.obs.shape} var:{slice.var.shape} X/data:{slice.X['data'].shape}")
fsec = "%.3f" % (o2 - o1)
print(f"SECONDS_TOTAL={fsec} NOBS={slice.obs.shape[0]} QUERY='{query_string}' URI={uri}")
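
The script's last line is meant to be grep-friendly. As a minimal sketch of collecting those lines into records, something like the following could work. `parse_timing_line` and `LINE_RE` are hypothetical names, and the sample line is assembled from the "B cell" row of the timing table rather than captured from a real run.

```python
import re

# Matches the final print of the query script:
# SECONDS_TOTAL=<float> NOBS=<int> QUERY='<query string>' URI=<uri>
LINE_RE = re.compile(
    r"SECONDS_TOTAL=(?P<seconds>[\d.]+) "
    r"NOBS=(?P<nobs>\d+) "
    r"QUERY='(?P<query>.*)' "
    r"URI=(?P<uri>\S+)"
)


def parse_timing_line(line):
    """Parse one SECONDS_TOTAL line into a dict, or return None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "seconds": float(match["seconds"]),
        "nobs": int(match["nobs"]),
        "query": match["query"],
        "uri": match["uri"],
    }


# Sample line built by hand from the table values, not from an actual run.
sample = (
    "SECONDS_TOTAL=4.720 NOBS=6131 "
    "QUERY='cell_type == \"B cell\"' URI=/tmp/nodupes-consol"
)
record = parse_timing_line(sample)
print(record)
```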

@johnkerl johnkerl mentioned this pull request Jun 30, 2022
@@ -64,6 +64,20 @@ def __init__(
:param uri: URI of the TileDB group
"""

# People can (and should) call by name. However, it's easy to forget. For example,
Member Author

Found during debug. soma = tiledbsc.SOMA(uri, ctx) is incorrect while soma = tiledbsc.SOMA(uri, ctx=ctx) is correct. This needs to be surfaced proactively to the user.

Contributor

@gsakkis gsakkis Jul 1, 2022

You may want to enforce keyword-only arguments to avoid such incorrect usage:
def __init__(self, uri, *, name=None, ...)

Manual runtime type checking is generally considered unidiomatic, and it doesn't scale well with the complexity of annotations (think of Dict[str, List[Optional[int]]]). For the cases that really need it, there are third-party libraries such as Typeguard or Pydantic that use annotations and enforce them at runtime transparently.
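
To illustrate the keyword-only suggestion, here is a minimal sketch. `ExampleSOMA` is a hypothetical stand-in, not the real `tiledbsc.SOMA`, and its parameter list is an assumption made for illustration.

```python
class ExampleSOMA:
    """Hypothetical stand-in, not the real tiledbsc.SOMA."""

    def __init__(self, uri, *, name=None, ctx=None):
        # Everything after the bare `*` can only be passed by keyword.
        self.uri = uri
        self.name = name
        self.ctx = ctx


fake_ctx = object()  # placeholder for a TileDB context

# Passing by keyword works as intended.
ok = ExampleSOMA("myuri", ctx=fake_ctx)

# Passing the context positionally is rejected by Python itself.
try:
    ExampleSOMA("myuri", fake_ctx)
    positional_rejected = False
except TypeError:
    positional_rejected = True

print(positional_rejected)  # True
```

With this signature the mistake fails at the call site with a TypeError, so no manual isinstance check is needed. The trade-off is that any existing callers passing these arguments positionally would break.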

# we'll do `A.df[obs_ids, :]`. We can't pass a `:` down the callstack to get there,
# but we pass `None` instead.
#
# It's important to do this. Say for example the X matrix is nobs=1000 by nvar=2000,
Member Author

@johnkerl johnkerl Jun 30, 2022

Previously, for a 1000x2000 matrix (for example), we expanded the selectors as follows:

  • 80 obs IDs -> 80 obs IDs
  • None (for :) -> 2000 var IDs

This expansion is not new. It performed acceptably with allow_duplicates=True but poorly with allow_duplicates=False.
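
The idea of passing None down the call stack to mean "all columns", rather than materializing every var ID, can be sketched with a toy dict-of-dicts matrix. `select_cells` and the sample data are hypothetical; the real code indexes a TileDB array.

```python
def select_cells(matrix, obs_ids, var_ids=None):
    """Select rows by obs ID; var_ids=None means all columns, like ':'."""
    selected = {}
    for obs_id in obs_ids:
        row = matrix[obs_id]
        if var_ids is None:
            # Take the whole row without enumerating every var ID.
            selected[obs_id] = dict(row)
        else:
            selected[obs_id] = {v: row[v] for v in var_ids if v in row}
    return selected


# Toy sparse matrix: obs_id -> {var_id: value}
matrix = {
    "cell1": {"geneA": 1.0, "geneB": 2.0},
    "cell2": {"geneA": 3.0},
}

full = select_cells(matrix, ["cell1"])                        # like [obs_ids, :]
partial = select_cells(matrix, ["cell1"], var_ids=["geneB"])  # explicit var IDs
print(full)
print(partial)
```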

# People can (and should) call by name. However, it's easy to forget. For example,
# if someone does 'tiledbsc.SOMACollection("myuri", ctx)' instead of 'tiledbsc.SOMACollection("myuri", ctx=ctx)',
# behavior will not be what they expect, and we should let them know sooner rather than later.
if name is not None:
Member Author

Arg-checking as above, for SOMACollection as well as SOMA.
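
As a rough sketch of what this kind of constructor arg-check can look like: the diff only shows `if name is not None:`, so the isinstance test and the error message below are assumptions, and `ExampleCollection` is a hypothetical stand-in for the real class.

```python
class ExampleCollection:
    """Hypothetical stand-in, not the real tiledbsc.SOMACollection."""

    def __init__(self, uri, name=None, ctx=None):
        # Assumed check: catch a context (or anything else that is not a
        # string) passed positionally into the `name` slot.
        if name is not None and not isinstance(name, str):
            raise TypeError(
                f"name must be a str or None; got {type(name).__name__}. "
                "Did you mean ctx=...?"
            )
        self.uri = uri
        self.name = name
        self.ctx = ctx


fake_ctx = object()  # placeholder for a TileDB context

good = ExampleCollection("myuri", ctx=fake_ctx)

try:
    ExampleCollection("myuri", fake_ctx)  # ctx lands in the `name` slot
    caught = None
except TypeError as exc:
    caught = str(exc)

print(caught)
```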

@@ -57,37 +57,6 @@ def __init__(
# self.raw_var = raw_var
assert "data" in X

# Find the dtype.
Member Author

This code is vestigial and was adding unnecessary time to SOMA slice queries.

@johnkerl johnkerl marked this pull request as ready for review June 30, 2022 19:15
@johnkerl johnkerl requested a review from gspowley June 30, 2022 19:23
@johnkerl johnkerl merged commit da84a48 into main Jun 30, 2022
@johnkerl johnkerl deleted the kerl/dup-test branch June 30, 2022 21:44
@johnkerl johnkerl mentioned this pull request Jul 1, 2022

3 participants