-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize queries for allow_duplicates=False #208
Conversation
@@ -64,6 +64,20 @@ def __init__( | |||
:param uri: URI of the TileDB group | |||
""" | |||
|
|||
# People can (and should) call by name. However, it's easy to forget. For example, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found during debug. soma = tiledbsc.SOMA(uri, ctx)
is incorrect while soma = tiledbsc.SOMA(uri, ctx=ctx)
is correct. This needs to be surfaced proactively to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to enforce keyword-only arguments to avoid such incorrect usage:
def __init__(self, uri, *, name=None, ...)
Manual runtime type checking is considered unidiomatic in general and doesn't scale well with the complexity of annotations (think of Dict[str, List[Optional[int]]
). For the cases that really need it, there are 3rd party libraries such as Typeguard or Pydantic that use and enforce annotations at runtime transparently.
# we'll do `A.df[obs_ids, :]`. We can't pass a `:` down the callstack to get there, | ||
# but we pass `None` instead. | ||
# | ||
# It's important to do this. Say for example the X matrix is nobs=1000 by nvar=2000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were doing the following:
- For example a 1000x2000 matrix
- 80 obs IDs -> 80 obs IDs
None
(for:
) -> 2000 var IDs
which is not new. However, it performed OK with allow_duplicates=True
but performed poorly with allow_duplicates=False
.
# People can (and should) call by name. However, it's easy to forget. For example, | ||
# if someone does 'tiledbsc.SOMACollection("myuri", ctx)' instead of 'tiledbsc.SOMA("myury", ctx)', | ||
# behavior will not be what they expect, and we should let them know sooner than later. | ||
if name is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arg-checking as above, for SOMACollection
as well as SOMA
.
@@ -57,37 +57,6 @@ def __init__( | |||
# self.raw_var = raw_var | |||
assert "data" in X | |||
|
|||
# Find the dtype. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is vestigial and was adding unnecessary time to SOMA slice queries.
Following on #199.
Summary
Setting
allow_duplicates=False
is the path of least surprise for users. This ensures that, if a matrix is ever re-written/updated, when people query, expecting a matrix of numbers, they'll get that -- vs a matrix of lists where each cell contains historical values.However, this comes at some read-performance cost. The get-all-the-data timing queries done on #199 were insufficient; better queries are shown in the details below.
Using TileDB Core 2.10, which is released, we see about a 2x slowdown going to the 'safe' mode `allow_duplicates=False'. Using dev version of core with mods to be released in TileDB Core 2.11 (in a couple weeks) we see the slowdown more like 1.2x.
Timing results
We took the same 59K cell (393M total) dataset, stored in four ways:
dupes-noconsol
--allow_duplicates=True
,X/data
not consolidated (in a half-dozen fragments)dupes-consol
-- a copy of that, but withX/data
consolidated into a single fragment.nodupes-noconsol
--allow_duplicates=False
,X/data
not consolidated (in a half-dozen fragments)nodupes-consol
-- a copy of that, but withX/data
consolidated into a single fragment.The cell-types available in that dataset are as follows:
Running the below query code on all four, we get the following timings using core 2.11:
Query code