
Disallow duplicates #199

Merged
merged 2 commits into main from kerl/disallow-duplicates on Jun 28, 2022

Conversation

johnkerl
Member

@johnkerl johnkerl requested review from aaronwolen and Shelnutt2 June 26, 2022 18:31
@bkmartinjr
Member

I agree with the semantics of this change. Any thoughts on how to deal with the performance impact?

@johnkerl
Member Author

@bkmartinjr I'm on this today. I need to run some experiments and quantify the cost for some small/medium/large sample data. Also, I'll propose we keep false as the default and provide an 'expert-mode' flag in SOMAOptions for those who wish to override the safer/slower default.
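For concreteness, a minimal sketch of what that options flag could look like. This is hypothetical, not the actual tiledbsc API: the class name SOMAOptions is from the comment above, but the field name and defaults here are my assumptions.

```python
from dataclasses import dataclass


# Hypothetical sketch of the proposed SOMAOptions flag; the field name
# `allows_duplicates` and its default are assumptions, not the real API.
@dataclass
class SOMAOptions:
    # Safer/slower default: duplicate coordinates are rejected at the
    # schema level. Expert users can opt back in to duplicates.
    allows_duplicates: bool = False


opts = SOMAOptions()  # safe default: duplicates disallowed
expert = SOMAOptions(allows_duplicates=True)  # expert-mode override
```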

@johnkerl johnkerl force-pushed the kerl/disallow-duplicates branch from d8a7510 to e9e8fc2 on June 27, 2022 19:38
@johnkerl
Member Author

johnkerl commented Jun 28, 2022

@Shelnutt2 @ihnorton @bkmartinjr here is what I found:

Ingest and scale

ingestor $mca/acute-covid19-cohort.h5ad /tmp/dupes
ingestor $mca/acute-covid19-cohort.h5ad /tmp/nodupes
$ du -hs /tmp/dupes/X/data /tmp/nodupes/X/data
192M	/tmp/dupes/X/data
192M	/tmp/nodupes/X/data
>>> len(soma.obs)
59506
>>> len(soma.var)
24004

So about 60K cells; small-to-medium dataset size.

Codemod under test

Between the 1st & 2nd uploads:

$ git diff
diff --git a/apis/python/src/tiledbsc/annotation_matrix.py b/apis/python/src/tiledbsc/annotation_matrix.py
index fdee87d..6f8b5bc 100644
--- a/apis/python/src/tiledbsc/annotation_matrix.py
+++ b/apis/python/src/tiledbsc/annotation_matrix.py
@@ -174,7 +174,7 @@ class AnnotationMatrix(TileDBArray):
-            allows_duplicates=True,
+            allows_duplicates=False,
diff --git a/apis/python/src/tiledbsc/assay_matrix.py b/apis/python/src/tiledbsc/assay_matrix.py
index eccdace..260f6ea 100644
--- a/apis/python/src/tiledbsc/assay_matrix.py
+++ b/apis/python/src/tiledbsc/assay_matrix.py
@@ -212,7 +212,7 @@ class AssayMatrix(TileDBArray):
-            allows_duplicates=True,
+            allows_duplicates=False,
diff --git a/apis/python/src/tiledbsc/uns_array.py b/apis/python/src/tiledbsc/uns_array.py
index 838f071..961d501 100644
--- a/apis/python/src/tiledbsc/uns_array.py
+++ b/apis/python/src/tiledbsc/uns_array.py
@@ -169,7 +169,7 @@ class UnsArray(TileDBArray):
-            allows_duplicates=True,
+            allows_duplicates=False,

Timings

>>> soma.uri
'/tmp/dupes/'

>>> soma.X.data.tiledb_array_schema()
ArraySchema(
  domain=Domain(*[
    Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([RleFilter(), ])),
    Dim(name='var_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=22), ])),
  ]),
  attrs=[
    Attr(name='value', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=True,
)

>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
27.456372022628784
>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
27.91620707511902
>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
31.489280939102173
>>> soma.uri
'/tmp/nodupes/'

>>> soma.X.data.tiledb_array_schema()
ArraySchema(
  domain=Domain(*[
    Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([RleFilter(), ])),
    Dim(name='var_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=22), ])),
  ]),
  attrs=[
    Attr(name='value', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)

>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
36.322843074798584
>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
33.002532720565796
>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
38.652859926223755
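The ad-hoc one-liners above can be wrapped in a small stdlib-only helper; `time.perf_counter` is a better short-interval clock than `time.time`. (`soma.X.data.df()` from the session above would be the `fn` in practice.)

```python
import time


def time_runs(fn, n=3):
    """Call fn() n times; return the list of per-run wall-clock seconds."""
    durations = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - t0)
    return durations


# e.g.: durations = time_runs(lambda: soma.X.data.df())
```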

Analysis

Dividing the mean of the 3 'after' (no-dupes) experiments by the mean of the 3 'before' (dupes) experiments I get

35.992 / 28.953
1.243103

i.e. about a 24% read-time slowdown with duplicates disallowed.
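Checking that arithmetic against the raw timings above:

```python
from statistics import mean

# Raw per-run timings (seconds) copied from the sessions above.
dupes = [27.456372022628784, 27.91620707511902, 31.489280939102173]
nodupes = [36.322843074798584, 33.002532720565796, 38.652859926223755]

ratio = mean(nodupes) / mean(dupes)
# mean(dupes) ~= 28.954, mean(nodupes) ~= 35.993, ratio ~= 1.243103
```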

This is not an order-of-magnitude change, and I am OK with it: if allows_duplicates=True and data are re-written, unsuspecting users will get things like [3,4] in matrix elements they thought were scalars.

@johnkerl johnkerl force-pushed the kerl/disallow-duplicates branch from e8d0221 to befc996 on June 28, 2022 03:05
@bkmartinjr
Member

These timings are not consistent with what I have previously found (on more complex and much larger datasets stored on S3). Not sure if the underlying system has improved (yay!) or if the tests are different.

Either way, I agree with the PR change, for now. But I think it remains an open question if this is really a solved problem for the longer term.

@johnkerl
Member Author

But I think it remains an open question if this is really a solved problem for the longer term.

100%! :) This is an area for ongoing work.

@johnkerl johnkerl force-pushed the kerl/disallow-duplicates branch from befc996 to c49fcaf on June 28, 2022 14:37
@johnkerl johnkerl merged commit cd0a73c into main Jun 28, 2022
@johnkerl johnkerl deleted the kerl/disallow-duplicates branch June 28, 2022 14:45
@johnkerl
Member Author


Indeed, my get-all-the-data query

>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1

is not the best indicator of query performance. I'm working with Luc on this.

@johnkerl
Member Author

@bkmartinjr more findings on #208.
