fix PooledArray performance bottleneck #2733

bkamins · 2021-04-23T16:08:15Z

avoid using allunique for PooledArray.

bkamins · 2021-04-23T16:39:23Z

Performance. This PR:

julia> using DataFrames, PooledArrays

julia> df = DataFrame(x=PooledArray(rand(1:10^5, 10^7)));

julia> @time groupby(df, :x);
  0.696850 seconds (891.33 k allocations: 127.369 MiB, 10.75% gc time, 93.26% compilation time)

julia> @time groupby(df, :x);
  0.047454 seconds (45 allocations: 76.392 MiB, 8.49% gc time)

julia> @time groupby(df, :x);
  0.048627 seconds (45 allocations: 76.392 MiB, 10.21% gc time)

julia> @time groupby(df, :x);
  0.040838 seconds (45 allocations: 76.392 MiB)

julia> @time groupby(df, :x);
  0.068660 seconds (45 allocations: 76.392 MiB, 26.89% gc time)

julia> df = DataFrame(x=PooledArray(rand(1:10^6, 10^7)));

julia> @time groupby(df, :x);
  0.138823 seconds (45 allocations: 77.250 MiB, 54.25% gc time)

julia> @time groupby(df, :x);
  0.052131 seconds (45 allocations: 77.250 MiB)

julia> @time groupby(df, :x);
  0.060330 seconds (45 allocations: 77.250 MiB, 10.07% gc time)

julia> @time groupby(df, :x);
  0.052009 seconds (45 allocations: 77.250 MiB)

julia> df = DataFrame(x=PooledArray(rand(1:5*10^6, 10^7)));

julia> @time groupby(df, :x);
  0.053141 seconds (45 allocations: 80.420 MiB)

julia> @time groupby(df, :x);
  0.128106 seconds (45 allocations: 80.420 MiB, 56.27% gc time)

julia> @time groupby(df, :x);
  0.047089 seconds (45 allocations: 80.420 MiB)

On main:

julia> using DataFrames, PooledArrays

julia> df = DataFrame(x=PooledArray(rand(1:10^5, 10^7)));

julia> @time groupby(df, :x);
  0.736720 seconds (923.81 k allocations: 131.504 MiB, 10.68% gc time, 93.11% compilation time)

julia> @time groupby(df, :x);
  0.064057 seconds (67 allocations: 78.691 MiB, 10.66% gc time)

julia> @time groupby(df, :x);
  0.064729 seconds (67 allocations: 78.691 MiB, 13.23% gc time)

julia> @time groupby(df, :x);
  0.056717 seconds (67 allocations: 78.691 MiB)

julia> @time groupby(df, :x);
  0.082923 seconds (67 allocations: 78.691 MiB, 27.73% gc time)

julia> df = DataFrame(x=PooledArray(rand(1:10^6, 10^7)));

julia> @time groupby(df, :x);
  0.182038 seconds (67 allocations: 95.299 MiB, 46.42% gc time)

julia> @time groupby(df, :x);
  0.100379 seconds (67 allocations: 95.299 MiB)

julia> @time groupby(df, :x);
  0.110230 seconds (67 allocations: 95.299 MiB, 9.35% gc time)

julia> @time groupby(df, :x);
  0.100721 seconds (67 allocations: 95.299 MiB)

julia> df = DataFrame(x=PooledArray(rand(1:5*10^6, 10^7)));

julia> @time groupby(df, :x);
  0.434018 seconds (67 allocations: 152.468 MiB, 17.97% gc time)

julia> @time groupby(df, :x);
  0.366571 seconds (67 allocations: 152.468 MiB)

julia> @time groupby(df, :x);
  0.381720 seconds (67 allocations: 152.468 MiB, 3.65% gc time)
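To put the two listings side by side, here is a small sketch that computes the main/PR ratio from one run per pool size in which no GC time was reported. The numbers are copied from the timings above; which run to pick for each case is our own choice, so treat the ratios as rough.

```julia
# One run without reported GC time per pool size, copied from the listings
# above. Picking these particular runs is a judgment call.
timings = [
    (pool = "1:10^5",   pr = 0.040838, main = 0.056717),
    (pool = "1:10^6",   pr = 0.052131, main = 0.100379),
    (pool = "1:5*10^6", pr = 0.047089, main = 0.366571),
]

# Ratio of time on main to time with this PR, rounded for display.
speedups = [round(t.main / t.pr; digits = 2) for t in timings]

for (t, s) in zip(timings, speedups)
    println(rpad(t.pool, 10), " main/PR = ", s)
end
```

With these runs the ratio is roughly 1.4x, 1.9x, and 7.8x, so the gain grows with the pool size. That is what one would expect if the removed allunique call scales with the number of pool entries.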

bkamins · 2021-04-23T16:40:34Z

src/groupeddataframe/utils.jl

+        if x isa PooledVector || allunique(refpool)
+            return refpool, refarray
+        else
+            return nothing, nothing


bkamins · 2021-04-23

@quinnj - can Arrow.jl give us an example of a pooled vector with duplicates in refpool?

quinnj · 2021-04-23

It's technically allowed to have duplicates in the dictionary pool; but we could also call allunique in the DataAPI.refpool method in Arrow.jl to avoid this. Perhaps we should make it formally part of the refpool contract?

quinnj · 2021-04-23

That way we ensure other refpool-enabled array types get the same optimization w/o needing to hard-code PooledVector

nalimilan · 2021-04-23

Yes that's a possibility. But are we sure no existing implementations have duplicates and that it couldn't be useful? Another solution would be to add a new function to DataAPI to check whether entries are unique. That may be overkill if having duplicates is not useful.

quinnj · 2021-04-23

I don't know, I can't see duplicate values in a pool being useful at all. It's our API, so I say we enforce it. All people would have to do if they allow duplicates is call allunique on their refpool. I think it just comes back to the fact that all our current uses/designs for refpool revolve around it being a unique pool, so we might as well enforce that.
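For readers wondering why duplicates in the pool matter for groupby at all, here is a minimal sketch in plain Julia. The `pool_dup` and `refs` vectors are hypothetical stand-ins for what refpool and refarray would return; no DataAPI or Arrow.jl code is involved.

```julia
# Hypothetical pooled column: refs[i] is an index into the pool.
pool_dup = ["a", "b", "a"]      # "a" is stored twice in the pool
refs     = [1, 2, 3, 1, 3]      # decodes to a, b, a, a, a

decoded = pool_dup[refs]

# Grouping on the reference codes sees codes 1 and 3 as different groups ...
groups_by_ref = length(unique(refs))
# ... while grouping on the decoded values finds fewer groups.
groups_by_value = length(unique(decoded))

println("groups by ref code: ", groups_by_ref)
println("groups by value:    ", groups_by_value)
println("pool is unique:     ", allunique(pool_dup))
```

The two counts disagree, which is why code that groups on reference codes has to know, or check, that the pool is unique.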

nalimilan

Looks good at least as a short-term measure until we find a general solution.

bkamins · 2021-04-23T19:49:39Z

I think we have a general solution, but @quinnj would have to confirm that it is OK with Arrow.jl.

DataAPI.jl does not require refpool to be free of duplicates.
However, if a type also defines invrefpool, then refpool must have unique values, because we require:

• for any valid index x into refpool(A), invrefpool(A)[refpool(A)[x]] is equal to x (according to isequal) and of the same type as x;
• for any valid index ix into invrefpool(A), refpool(A)[invrefpool(A)[ix]] is equal to ix (according to isequal) and of the same type as ix.

This implies that refpool must be unique in that case. So it is enough to check whether invrefpool returns something other than nothing: if it does, we know refpool must be unique; otherwise we call allunique.
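The argument above can be checked with a small sketch in plain Julia. The helpers `satisfies_roundtrip`, `build_inverse`, and `pool_is_unique` are hypothetical names introduced here for illustration; they are not part of DataAPI.jl or DataFrames.jl, and a plain Dict stands in for whatever invrefpool would return.

```julia
# Hypothetical check of the two round-trip requirements quoted above.
# Assumes every pool entry is a key of `inv`.
function satisfies_roundtrip(pool::AbstractVector, inv::AbstractDict)
    forward  = all(i -> isequal(inv[pool[i]], i), eachindex(pool))
    backward = all(v -> isequal(pool[inv[v]], v), keys(inv))
    return forward && backward
end

# Inverse mapping value => index; a later duplicate overwrites an earlier one.
build_inverse(pool) = Dict(v => i for (i, v) in pairs(pool))

unique_pool = ["a", "b", "c"]
dup_pool    = ["a", "b", "a"]

ok_unique = satisfies_roundtrip(unique_pool, build_inverse(unique_pool))
ok_dup    = satisfies_roundtrip(dup_pool, build_inverse(dup_pool))

# Decision rule described above: trust an existing inverse mapping,
# otherwise fall back to scanning the pool. `inv === nothing` stands for
# a type that provides no inverse mapping.
pool_is_unique(pool, inv) = inv !== nothing || allunique(pool)

println((ok_unique, ok_dup))
```

With a duplicated entry, no mapping from values to indices can send both positions of "a" back to themselves, so the first requirement fails. A type that honours the contract therefore cannot have duplicates in its pool.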

The questions to @quinnj are:

• Arrow.jl does not currently provide invrefpool, AFAICT. Is this correct? (I did not find it, but maybe I was not looking hard enough.)
• Do you think you can efficiently provide invrefpool in Arrow.jl? (This is one of the things I wanted to discuss with you regarding efficient integration of Arrow.jl and DataFrames.jl: for joins to be fast we need invrefpool available.)

bkamins · 2021-04-23T19:51:26Z

@nalimilan - just to be 100% sure - for CategoricalArrays.jl both refpool and invrefpool are cheap because they are always computed anyway. Right?

src/groupeddataframe/utils.jl

nalimilan · 2021-04-23T20:27:41Z

Good idea! Then we already have everything we need. CategoricalArray indeed just wraps its internal dict, so it's very cheap. And types for which computing invrefpool is costly should probably not implement that function.

bkamins · 2021-04-23T20:38:03Z

> And types for which computing invrefpool is costly should probably not implement that function.

They can do it lazily and decide dynamically whether it is worth computing. Let me wait for @quinnj to comment on whether he is OK with this; if yes, I will merge and make a patch release.

quinnj · 2021-04-23T20:41:52Z

Yes, I'm ok w/ this approach. I think we can figure things out on the Arrow.jl side; I don't think it would be bad even to enforce the uniqueness in Arrow.jl specifically, since the benefit would be great. We could do a uniqueness check/scrub when we do the initial record batch processing. I'll need to think a bit more about invrefpool; I'll ping @dmbates as well since he was the one who implemented the DataAPI methods in Arrow.jl, though I'm not signing him up for any additional work if he doesn't want it 😛
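As an illustration of what such a uniqueness scrub could look like, here is a sketch in plain Julia. It is not Arrow.jl code: `dedup_pool` is a hypothetical helper, and it assumes 1-based integer reference codes held in ordinary vectors.

```julia
# Hypothetical scrub: collapse duplicate pool entries and remap the codes.
# Assumes refs holds 1-based indices into pool.
function dedup_pool(pool::AbstractVector, refs::AbstractVector{<:Integer})
    newpool  = unique(pool)                                  # first occurrences, in order
    index_of = Dict(v => i for (i, v) in pairs(newpool))     # value => new code
    remap    = [index_of[v] for v in pool]                   # old code => new code
    newrefs  = [remap[r] for r in refs]
    return newpool, newrefs
end

pool = ["a", "b", "a"]
refs = [1, 2, 3, 1, 3]

newpool, newrefs = dedup_pool(pool, refs)

println(newpool)
println(newrefs)
```

The decoded column is unchanged (`newpool[newrefs] == pool[refs]`), but the new pool passes allunique, so reference codes can be used directly as group identifiers. The cost is one pass over the pool and one over the codes.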

Co-authored-by: Milan Bouchet-Valat <[email protected]>

bkamins · 2021-04-23T23:15:14Z

Thank you!

fix PooledArray performance bottleneck

12d25ed

bkamins added performance grouping labels Apr 23, 2021

bkamins requested a review from nalimilan April 23, 2021 16:08

bkamins commented Apr 23, 2021

nalimilan approved these changes Apr 23, 2021

use invrefpool

899f312

bkamins commented Apr 23, 2021

nalimilan approved these changes Apr 23, 2021

Update src/groupeddataframe/utils.jl

20636fb

Co-authored-by: Milan Bouchet-Valat <[email protected]>

bkamins merged commit 276bbc2 into main Apr 23, 2021

bkamins deleted the improve_groupby_performance branch April 23, 2021 23:15
