-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make hashrows_col! not depend on CategoricalArrays.jl #2518
Conversation
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Here are the benchmarks of the change:
@nalimilan - So it seems that the 90% threshold is OK (probably it could be even a bit lower, but it is hard to tune it optimally). @quinnj - do you still disable creation of a I will add a test to make sure that all these cases produce the same hashes. |
99bf7bf
to
8c9f8cd
Compare
Thanks for benchmarking! So as I suspected (I was going to comment) the pooled hashing is a bit slower at 90%. I think I'd go with a lower threshold, e.g. 50% or even 10%. |
I also suspected this, but the previous code did not use this optimization. 10% is too low (copying data is faster than calculating hashes). I will change it to 50% then (it will not be optimal if we hash very long strings though - in that case something closer to 90% is better, but this is probably rare). |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Thank you! |
@nalimilan - this follows your suggestion in #2506 (comment).
It is not fully in line with DataAPI.jl API (but I propose - as already mentioned to make that API stricter and require
DataAPI.refpool
to beAbstractVector
).If we agree on the proposal I will add more tests.