make hashrows_col! not depend on CategoricalArrays.jl #2518

Merged
merged 7 commits into from
Nov 7, 2020


bkamins
Member

@bkamins bkamins commented Nov 6, 2020

@nalimilan - this follows your suggestion in #2506 (comment).

It is not fully in line with the DataAPI.jl API, but, as already mentioned, I propose making that API stricter and requiring DataAPI.refpool to return an AbstractVector.

If we agree on the proposal I will add more tests.
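
For readers less familiar with the DataAPI.jl functions mentioned here, the following is a minimal sketch of what `DataAPI.refpool` and `DataAPI.refarray` return for a pooled and a plain vector. It assumes DataAPI.jl and PooledArrays.jl are installed; the variable names are illustrative only and do not come from the PR.

```julia
using DataAPI, PooledArrays

v = ["a", "b", "a", "c", "a"]
pv = PooledArray(v)

# For a PooledArray, refpool is the vector of distinct values and
# refarray holds the integer codes that index into that pool.
pool = DataAPI.refpool(pv)
refs = DataAPI.refarray(pv)

# Indexing the pool with the codes recovers the original values.
reconstructed = [pool[r] for r in refs]

# A plain Vector has no pool, so the fallback method returns `nothing`.
plain_pool = DataAPI.refpool(v)
```

Requiring the pool to be an `AbstractVector` is what lets generic code index it with the reference codes, as in the comprehension above, without knowing the concrete array type.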

@bkamins bkamins requested a review from nalimilan November 6, 2020 14:44
@bkamins bkamins added the non-breaking, ecosystem, and performance labels Nov 6, 2020
@bkamins bkamins added this to the 1.0 milestone Nov 6, 2020
@bkamins
Member Author

bkamins commented Nov 6, 2020

Here are the benchmarks of the change:

using DataFrames, PooledArrays, CategoricalArrays, DataAPI

x = [1:9000000; fill(1, 1000001)];
x1 = PooledArray(x);
x2 = categorical(x);
y = [1:9000002; fill(1, 1000000-1)];
y1 = PooledArray(y);
y2 = categorical(y);
length(DataAPI.refpool(x1))/length(x)
length(DataAPI.refpool(y1))/length(y)
for (i,n) in enumerate((x, x1, x2, y, y1, y2))
    @info i
    GC.gc()
    @time DataFrames.hashrows((n,), false)
end


julia> using DataFrames, PooledArrays, CategoricalArrays, DataAPI

julia> x = [1:9000000; fill(1, 1000001)];

julia> x1 = PooledArray(x);

julia> x2 = categorical(x);

julia> y = [1:9000002; fill(1, 1000000-1)];

julia> y1 = PooledArray(y);

julia> y2 = categorical(y);

julia> length(DataAPI.refpool(x1))/length(x)
0.899999910000009

julia> length(DataAPI.refpool(y1))/length(y)
0.900000109999989

julia> for (i,n) in enumerate((x, x1, x2, y, y1, y2))
           @info i
           GC.gc()
           @time DataFrames.hashrows((n,), false)
       end
[ Info: 1
  0.051783 seconds (5 allocations: 76.294 MiB, 5.32% gc time)
[ Info: 2
  0.101282 seconds (7 allocations: 144.959 MiB, 2.91% gc time)
[ Info: 3
  0.158293 seconds (7 allocations: 144.959 MiB, 2.19% gc time)
[ Info: 4
  0.061556 seconds (5 allocations: 76.294 MiB, 2.48% gc time)
[ Info: 5
  0.087488 seconds (5 allocations: 76.294 MiB, 1.94% gc time)
[ Info: 6
  0.112878 seconds (5 allocations: 76.294 MiB, 1.38% gc time)

julia> for (i,n) in enumerate((x, x1, x2, y, y1, y2))
           @info i
           GC.gc()
           @time DataFrames.hashrows((n,), false)
       end
[ Info: 1
  0.055569 seconds (5 allocations: 76.294 MiB, 2.99% gc time)
[ Info: 2
  0.112733 seconds (7 allocations: 144.959 MiB, 2.90% gc time)
[ Info: 3
  0.166712 seconds (7 allocations: 144.959 MiB, 2.05% gc time)
[ Info: 4
  0.062511 seconds (5 allocations: 76.294 MiB, 2.40% gc time)
[ Info: 5
  0.081028 seconds (5 allocations: 76.294 MiB, 1.94% gc time)
[ Info: 6
  0.117411 seconds (5 allocations: 76.294 MiB, 1.64% gc time)

@nalimilan - So it seems that the 90% threshold is OK (probably it could be even a bit lower, but it is hard to tune it optimally).
Also, we can see that with this many levels it is better not to pool at all: the plain vectors (cases 1 and 4) hash faster than both pooled representations.

@quinnj - do you still disable creation of a PooledArray in CSV.jl if there are too many levels in a categorical column or not?

I will add a test to make sure that all these cases produce the same hashes.
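
The idea being benchmarked, and the consistency check promised above, can be sketched in plain Julia. This is only an illustration under stated assumptions: `hash_direct` and `hash_pooled` are hypothetical helpers, not the DataFrames.jl internals, and the real `hashrows_col!` may differ in detail.

```julia
# Hypothetical helper: hash every row value directly.
function hash_direct(col::AbstractVector, seed::UInt=zero(UInt))
    return [hash(v, seed) for v in col]
end

# Hypothetical helper: hash each distinct pool value once,
# then look the hash up through the integer reference codes.
function hash_pooled(pool::AbstractVector, refs::AbstractVector{<:Integer},
                     seed::UInt=zero(UInt))
    pool_hashes = [hash(p, seed) for p in pool]  # one hash per distinct value
    return [pool_hashes[r] for r in refs]        # one lookup per row
end

pool = ["x", "y", "z"]
refs = [1, 2, 1, 3, 1, 1]
col = pool[refs]  # the expanded column the pool and codes represent

# Both paths must produce identical per-row hashes.
same = hash_direct(col) == hash_pooled(pool, refs)
```

The pooled path saves work only when the pool is much shorter than the column; when the pool is almost as long as the column, it does nearly as many hash computations plus an extra lookup per row, which is consistent with the timings above.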

@nalimilan
Member

Thanks for benchmarking! So as I suspected (I was going to comment) the pooled hashing is a bit slower at 90%. I think I'd go with a lower threshold, e.g. 50% or even 10%.

@bkamins
Member Author

bkamins commented Nov 7, 2020

> So as I suspected

I also suspected this, but the previous code did not use this optimization. 10% is too low (copying data is faster than calculating hashes). I will change it to 50% then (it will not be optimal if we hash very long strings though - in that case something closer to 90% is better, but this is probably rare).
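
As a rough sketch of the threshold rule discussed here, the decision can be written as a predicate on the pool length and the number of rows. The function name `should_hash_pool` and the strict `<` comparison are assumptions made for illustration; the thread does not show the exact condition used in DataFrames.jl.

```julia
# Hypothetical predicate: use the pooled hashing path only when the pool
# is shorter than `threshold` times the number of rows.
should_hash_pool(npool::Integer, nrows::Integer; threshold::Real=0.5) =
    npool < threshold * nrows

# Sizes taken from the benchmark above: 9_000_000 distinct values in 10_000_001 rows.
a = should_hash_pool(9_000_000, 10_000_001)                  # false: pool is about 90% of rows
b = should_hash_pool(9_000_000, 10_000_001; threshold=0.95)  # true under a looser threshold
c = should_hash_pool(1_000, 10_000_001)                      # true: few levels, pooling pays off
```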

bkamins and others added 2 commits November 7, 2020 12:51
Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins bkamins mentioned this pull request Nov 7, 2020
@bkamins bkamins merged commit b9e47e6 into JuliaData:master Nov 7, 2020
@bkamins bkamins deleted the hashrows_generic branch November 7, 2020 21:21
@bkamins
Member Author

bkamins commented Nov 7, 2020

Thank you!
