
Disallow duplicates #199

Merged
merged 2 commits into main from kerl/disallow-duplicates on Jun 28, 2022

Conversation

johnkerl
Member

@johnkerl johnkerl requested review from aaronwolen and Shelnutt2 June 26, 2022 18:31
@bkmartinjr
Member

I agree with the semantics of this change. Any thoughts on how to deal with the performance impact?

@johnkerl
Member Author

@bkmartinjr I'm on this today. I need to run some experiments and quantify the cost for some small/medium/large sample data. Also, I'll propose we keep false as the default and provide an 'expert-mode' flag in SOMAOptions for those who wish to override the safer/slower default.
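For concreteness, a minimal sketch of what that options flag could look like. This is hypothetical, not the actual tiledbsc API: the class name SOMAOptions is from the comment above, but the field name and defaults here are my assumptions.

```python
from dataclasses import dataclass


# Hypothetical sketch of the proposed SOMAOptions flag; the field name
# `allows_duplicates` and its default are assumptions, not the real API.
@dataclass
class SOMAOptions:
    # Safer/slower default: duplicate coordinates are rejected at the
    # schema level. Expert users can opt back in to duplicates.
    allows_duplicates: bool = False


opts = SOMAOptions()  # safe default: duplicates disallowed
expert = SOMAOptions(allows_duplicates=True)  # expert-mode override
```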

@johnkerl johnkerl force-pushed the kerl/disallow-duplicates branch from d8a7510 to e9e8fc2 on June 27, 2022 19:38
@johnkerl
Member Author

johnkerl commented Jun 28, 2022

@Shelnutt2 @ihnorton @bkmartinjr here is what I found:

Ingest and scale

ingestor $mca/acute-covid19-cohort.h5ad /tmp/dupes
ingestor $mca/acute-covid19-cohort.h5ad /tmp/nodupes
$ du -hs /tmp/dupes/X/data /tmp/nodupes/X/data
192M	/tmp/dupes/X/data
192M	/tmp/nodupes/X/data
>>> len(soma.obs)
59506
>>> len(soma.var)
24004

So about 60K cells; small-to-medium dataset size.

Codemod under test

Between the 1st & 2nd uploads:

$ git diff
diff --git a/apis/python/src/tiledbsc/annotation_matrix.py b/apis/python/src/tiledbsc/annotation_matrix.py
index fdee87d..6f8b5bc 100644
--- a/apis/python/src/tiledbsc/annotation_matrix.py
+++ b/apis/python/src/tiledbsc/annotation_matrix.py
@@ -174,7 +174,7 @@ class AnnotationMatrix(TileDBArray):
-            allows_duplicates=True,
+            allows_duplicates=False,
diff --git a/apis/python/src/tiledbsc/assay_matrix.py b/apis/python/src/tiledbsc/assay_matrix.py
index eccdace..260f6ea 100644
--- a/apis/python/src/tiledbsc/assay_matrix.py
+++ b/apis/python/src/tiledbsc/assay_matrix.py
@@ -212,7 +212,7 @@ class AssayMatrix(TileDBArray):
-            allows_duplicates=True,
+            allows_duplicates=False,
diff --git a/apis/python/src/tiledbsc/uns_array.py b/apis/python/src/tiledbsc/uns_array.py
index 838f071..961d501 100644
--- a/apis/python/src/tiledbsc/uns_array.py
+++ b/apis/python/src/tiledbsc/uns_array.py
@@ -169,7 +169,7 @@ class UnsArray(TileDBArray):
-            allows_duplicates=True,
+            allows_duplicates=False,

Timings

>>> soma.uri
'/tmp/dupes/'

>>> soma.X.data.tiledb_array_schema()
ArraySchema(
  domain=Domain(*[
    Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([RleFilter(), ])),
    Dim(name='var_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=22), ])),
  ]),
  attrs=[
    Attr(name='value', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=True,
)

>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
27.456372022628784
>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
27.91620707511902
>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
31.489280939102173
>>> soma.uri
'/tmp/nodupes/'

>>> soma.X.data.tiledb_array_schema()
ArraySchema(
  domain=Domain(*[
    Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([RleFilter(), ])),
    Dim(name='var_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=22), ])),
  ]),
  attrs=[
    Attr(name='value', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)

>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
36.322843074798584
>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
33.002532720565796
>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1
38.652859926223755
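The ad-hoc one-liners above can be wrapped in a small stdlib-only helper; `time.perf_counter` is a better short-interval clock than `time.time`. (`soma.X.data.df()` from the session above would be the `fn` in practice.)

```python
import time


def time_runs(fn, n=3):
    """Call fn() n times; return the list of per-run wall-clock seconds."""
    durations = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - t0)
    return durations


# e.g.: durations = time_runs(lambda: soma.X.data.df())
```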

Analysis

Dividing the mean of the 3 'after' (no-dupes) experiments by the mean of the 3 'before' (dupes) experiments I get

35.992 / 28.953
1.243103

i.e. about a 24% read-time slowdown with duplicates disallowed.
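Checking that arithmetic against the raw timings above:

```python
from statistics import mean

# Raw per-run timings (seconds) copied from the sessions above.
dupes = [27.456372022628784, 27.91620707511902, 31.489280939102173]
nodupes = [36.322843074798584, 33.002532720565796, 38.652859926223755]

ratio = mean(nodupes) / mean(dupes)
# mean(dupes) ~= 28.954, mean(nodupes) ~= 35.993, ratio ~= 1.243103
```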

This is not an order-of-magnitude change, and I am OK with it: if allows_duplicates=True and data are re-written, unsuspecting users will get things like [3,4] in matrix elements they thought were scalars.

@johnkerl johnkerl force-pushed the kerl/disallow-duplicates branch from e8d0221 to befc996 on June 28, 2022 03:05
@bkmartinjr
Member

These timings are not consistent with what I have previously found (on more complex and much larger datasets stored on S3). Not sure if the underlying system has improved (yay!) or if the tests are different.

Either way, I agree with the PR change, for now. But I think it remains an open question if this is really a solved problem for the longer term.

@johnkerl
Member Author

But I think it remains an open question if this is really a solved problem for the longer term.

100%! :) This is an area for ongoing work.

@johnkerl johnkerl force-pushed the kerl/disallow-duplicates branch from befc996 to c49fcaf on June 28, 2022 14:37
@johnkerl johnkerl merged commit cd0a73c into main Jun 28, 2022
@johnkerl johnkerl deleted the kerl/disallow-duplicates branch June 28, 2022 14:45
@johnkerl
Member Author


Indeed, my get-all-the-data query

>>> t1=time.time(); x=soma.X.data.df(); t2=time.time(); t2-t1

is not the best indicator of query performance. I'm working with Luc on this.

@johnkerl
Member Author

@bkmartinjr more findings on #208.
