Disallow duplicates #199
I agree with the semantics of this change. Any thoughts on how to deal with the performance impact?
@bkmartinjr I'm on this today. I need to run some experiments & quantify the cost for some small/medium/large sample data. Also I'll propose we keep false as default and provide an 'expert-mode' flag in
d8a7510 to e9e8fc2
@Shelnutt2 @ihnorton @bkmartinjr here is what I found:
Ingest and scale
So about 60K cells; small-to-medium dataset size.
Codemod under test
In between 1st & 2nd upload:
Timings
Analysis
Dividing the mean of the 3 'before' experiments by the mean of the 3 'after' experiments I get
i.e. 25% perf change. This is not an order-of-magnitude change. I am OK with this -- especially as if
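For anyone wanting to repeat the before/after comparison described above, here is a minimal sketch of the ratio-of-means calculation. The function name `mean_ratio` and the timing values are placeholders for illustration only; they are not the measurements from this PR.

```python
from statistics import mean


def mean_ratio(before, after):
    """Return mean(before) / mean(after) for two lists of wall-clock timings."""
    if not before or not after:
        raise ValueError("both timing lists must be non-empty")
    return mean(before) / mean(after)


# Placeholder timings in seconds (hypothetical, NOT the numbers measured in this PR).
before_s = [10.0, 10.0, 10.0]
after_s = [8.0, 8.0, 8.0]

ratio = mean_ratio(before_s, after_s)
print(f"before/after = {ratio:.2f}")  # prints "before/after = 1.25"
```

With only three runs per side the ratio is a rough indicator, so it is worth checking the spread of the individual runs before reading much into it.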
e8d0221 to befc996
These timings are not consistent with what I have previously found (on more complex and much larger datasets stored on S3). Not sure if the underlying system has improved (yay!) or if the tests are different. Either way, I agree with the PR change, for now. But I think it remains an open question whether this is really a solved problem for the longer term.
100%! :) This is an area for ongoing work.
befc996 to c49fcaf
Indeed, my get-all-the-data query is not the best indicator of queryability performance. I'm working with Luc on this.
@bkmartinjr more findings on #208.
Following R impl at https://github.com/TileDB-Inc/tiledbsc/pull/69