fix PooledArray performance bottleneck #2733

Merged: bkamins merged 3 commits into main from improve_groupby_performance on Apr 23, 2021

@bkamins bkamins commented Apr 23, 2021

Avoid calling allunique on the refpool of a PooledArray.

bkamins commented Apr 23, 2021

Performance. This PR:

julia> using DataFrames, PooledArrays

julia> df = DataFrame(x=PooledArray(rand(1:10^5, 10^7)));

julia> @time groupby(df, :x);
  0.696850 seconds (891.33 k allocations: 127.369 MiB, 10.75% gc time, 93.26% compilation time)

julia> @time groupby(df, :x);
  0.047454 seconds (45 allocations: 76.392 MiB, 8.49% gc time)

julia> @time groupby(df, :x);
  0.048627 seconds (45 allocations: 76.392 MiB, 10.21% gc time)

julia> @time groupby(df, :x);
  0.040838 seconds (45 allocations: 76.392 MiB)

julia> @time groupby(df, :x);
  0.068660 seconds (45 allocations: 76.392 MiB, 26.89% gc time)

julia> df = DataFrame(x=PooledArray(rand(1:10^6, 10^7)));

julia> @time groupby(df, :x);
  0.138823 seconds (45 allocations: 77.250 MiB, 54.25% gc time)

julia> @time groupby(df, :x);
  0.052131 seconds (45 allocations: 77.250 MiB)

julia> @time groupby(df, :x);
  0.060330 seconds (45 allocations: 77.250 MiB, 10.07% gc time)

julia> @time groupby(df, :x);
  0.052009 seconds (45 allocations: 77.250 MiB)

julia> df = DataFrame(x=PooledArray(rand(1:5*10^6, 10^7)));

julia> @time groupby(df, :x);
  0.053141 seconds (45 allocations: 80.420 MiB)

julia> @time groupby(df, :x);
  0.128106 seconds (45 allocations: 80.420 MiB, 56.27% gc time)

julia> @time groupby(df, :x);
  0.047089 seconds (45 allocations: 80.420 MiB)

On main:

julia> using DataFrames, PooledArrays

julia> df = DataFrame(x=PooledArray(rand(1:10^5, 10^7)));

julia> @time groupby(df, :x);
  0.736720 seconds (923.81 k allocations: 131.504 MiB, 10.68% gc time, 93.11% compilation time)

julia> @time groupby(df, :x);
  0.064057 seconds (67 allocations: 78.691 MiB, 10.66% gc time)

julia> @time groupby(df, :x);
  0.064729 seconds (67 allocations: 78.691 MiB, 13.23% gc time)

julia> @time groupby(df, :x);
  0.056717 seconds (67 allocations: 78.691 MiB)

julia> @time groupby(df, :x);
  0.082923 seconds (67 allocations: 78.691 MiB, 27.73% gc time)

julia> df = DataFrame(x=PooledArray(rand(1:10^6, 10^7)));

julia> @time groupby(df, :x);
  0.182038 seconds (67 allocations: 95.299 MiB, 46.42% gc time)

julia> @time groupby(df, :x);
  0.100379 seconds (67 allocations: 95.299 MiB)

julia> @time groupby(df, :x);
  0.110230 seconds (67 allocations: 95.299 MiB, 9.35% gc time)

julia> @time groupby(df, :x);
  0.100721 seconds (67 allocations: 95.299 MiB)

julia> df = DataFrame(x=PooledArray(rand(1:5*10^6, 10^7)));

julia> @time groupby(df, :x);
  0.434018 seconds (67 allocations: 152.468 MiB, 17.97% gc time)

julia> @time groupby(df, :x);
  0.366571 seconds (67 allocations: 152.468 MiB)

julia> @time groupby(df, :x);
  0.381720 seconds (67 allocations: 152.468 MiB, 3.65% gc time)

if x isa PooledVector || allunique(refpool)
    return refpool, refarray
else
    return nothing, nothing
Member Author

@quinnj - can Arrow.jl give us an example of pooled vector with duplicates in refpool?

Member

It's technically allowed to have duplicates in the dictionary pool; but we could also call allunique in the DataAPI.refpool method in Arrow.jl to avoid this. Perhaps we should make it formally part of the refpool contract?
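To illustrate why duplicates in the pool matter for grouping, here is a minimal sketch using plain vectors rather than Arrow.jl or PooledArrays.jl types. The names `pool`, `refs`, and `decoded` are hypothetical stand-ins for a reference pool, the integer codes into it, and the decoded values.

```julia
# Hypothetical pooled data: `pool` holds the values, `refs` indexes into it (1-based).
pool = ["a", "b", "a"]        # "a" appears twice in the pool
refs = [1, 2, 3, 1]

# Decoding the codes gives the actual column values.
decoded = pool[refs]          # ["a", "b", "a", "a"]

# Grouping directly on the integer codes treats codes 1 and 3 as different groups.
groups_by_ref = length(unique(refs))

# Grouping on the decoded values finds fewer groups.
groups_by_value = length(unique(decoded))

# The guard from the snippet above: only trust the codes if the pool is unique.
can_group_by_refs = allunique(pool)

println((groups_by_ref, groups_by_value, can_group_by_refs))
```

With a duplicated pool entry the two group counts disagree, which is why the fast path must either know the pool is unique or check it.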

Member

That way we ensure other refpool-enabled array types get the same optimization w/o needing to hard-code PooledVector

Member

Yes that's a possibility. But are we sure no existing implementations have duplicates and that it couldn't be useful? Another solution would be to add a new function to DataAPI to check whether entries are unique. That may be overkill if having duplicates is not useful.

Member

I don't know, I can't see duplicate values in a pool being useful at all. It's our API, so I say we enforce it. All people would have to do if they allow duplicates is call allunique on their refpool. I think it just comes back to the fact that all our current uses/designs for refpool revolve around it being a unique pool, so we might as well enforce that.

Member

@nalimilan nalimilan left a comment

Looks good at least as a short-term measure until we find a general solution.

bkamins commented Apr 23, 2021

I think we have a general solution, but @quinnj would have to confirm that it is OK with Arrow.jl.

  1. DataAPI.jl does not require refpool to be free of duplicates.
  2. However, if a type also defines invrefpool, then refpool must have unique values, because we require:

• for any valid index x into refpool(A), invrefpool(A)[refpool(A)[x]] is equal to x (according to isequal) and of the same type as x;
• for any valid index ix into invrefpool(A) , refpool(A)[invrefpool(A)[ix]] is equal to ix (according to isequal) and of the same type as ix.

These requirements imply that in this case refpool must be unique. It is therefore enough to check whether invrefpool is not nothing: if it is not nothing, we know that refpool must be unique; otherwise we call allunique.
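A rough sketch of that check, written against plain Base types rather than the real DataAPI.jl functions. `pool_is_unique` is a hypothetical helper, and `invpool` stands in for whatever `invrefpool(A)` would return (or `nothing` when a type does not define it).

```julia
# `pool_is_unique` is a hypothetical helper, not a DataAPI.jl or DataFrames.jl function.
# `invpool` stands in for the result of invrefpool(A), or `nothing` when unavailable.
function pool_is_unique(pool, invpool)
    # A valid inverse mapping implies uniqueness, so the O(n) scan is skipped.
    invpool !== nothing && return true
    return allunique(pool)
end

# Round-trip requirement on a toy pool: invpool[pool[i]] must give back i.
pool = ["x", "y", "z"]
invpool = Dict(v => i for (i, v) in enumerate(pool))
roundtrip_ok = all(invpool[pool[i]] == i for i in eachindex(pool))

# With duplicates, no dictionary can satisfy the round trip,
# because the later entry for "x" overwrites the earlier one.
dup_pool = ["x", "y", "x"]
dup_inv = Dict(v => i for (i, v) in enumerate(dup_pool))
dup_roundtrip_ok = all(dup_inv[dup_pool[i]] == i for i in eachindex(dup_pool))

println((roundtrip_ok, dup_roundtrip_ok))
```

The second half shows why the contract forces uniqueness: a duplicated value can map back to only one index, so the round trip fails for the other.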

The question to @quinnj is:

  • Arrow.jl does not provide invrefpool AFAICT currently. Is this correct? (I did not find it, but maybe I was not looking hard enough)
  • Do you think you can efficiently provide invrefpool in Arrow.jl? (this is one of the things I wanted to discuss with you regarding efficient integration of Arrow.jl and DataFrames.jl - for joins to be fast we need invrefpool available)

bkamins commented Apr 23, 2021

@nalimilan - just to be 100% sure - for CategoricalArrays.jl, both refpool and invrefpool are cheap, as they are always computed anyway. Right?

@nalimilan
Member

Good idea! Then we already have everything we need. CategoricalArray indeed just wraps its internal dict, so it's very cheap. And types for which computing invrefpool is costly should probably not implement that function.

bkamins commented Apr 23, 2021

And types for which computing invrefpool is costly should probably not implement that function.

They can do it lazily and dynamically decide whether it is worth computing. Let me wait for @quinnj to comment on whether he is OK with this; if he is, I will merge and make a patch release.
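One way the lazy approach might look, as a sketch only. `LazyPool`, `lazy_invpool`, and the `maxsize` threshold are hypothetical names chosen for illustration; no package discussed here is known to implement it this way.

```julia
# Hypothetical container; field and function names are illustrative only.
mutable struct LazyPool{T}
    pool::Vector{T}
    index::Union{Nothing, Dict{T, Int}}   # inverse mapping, built on first request
end
LazyPool(pool::Vector{T}) where {T} = LazyPool{T}(pool, nothing)

# Returns the cached inverse mapping, building it at most once. Returns `nothing`
# when the pool is larger than `maxsize` (assumed too costly to index) or when it
# contains duplicates (no valid inverse exists).
function lazy_invpool(p::LazyPool{T}; maxsize::Int = 1_000_000) where {T}
    p.index !== nothing && return p.index
    length(p.pool) > maxsize && return nothing
    index = Dict{T, Int}()
    for (i, v) in enumerate(p.pool)
        haskey(index, v) && return nothing
        index[v] = i
    end
    p.index = index
    return index
end

p = LazyPool(["a", "b", "c"])
first_call = lazy_invpool(p)    # builds the dictionary
second_call = lazy_invpool(p)   # returns the cached dictionary
```

A caller can then treat a `nothing` result exactly like a type that does not define invrefpool and fall back to the slower path.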

quinnj commented Apr 23, 2021

Yes, I'm ok w/ this approach. I think we can figure things out on the Arrow.jl side; I don't think it would be bad even to enforce the uniqueness in Arrow.jl specifically, since the benefit would be great. We could do a uniqueness check/scrub when we do the initial record batch processing. I'll need to think a bit more about invrefpool; I'll ping @dmbates as well since he was the one who implemented the DataAPI methods in Arrow.jl, though I'm not signing him up for any additional work if he doesn't want it 😛
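A uniqueness scrub of this kind could look roughly like the following. This is a sketch on plain vectors, assuming 1-based integer codes; `scrub_pool` is a hypothetical helper and does not reflect how Arrow.jl stores or processes dictionary-encoded record batches.

```julia
# Hypothetical helper: returns a duplicate-free pool plus refs remapped to it.
# Assumes `refs` are 1-based indices into `pool`.
function scrub_pool(pool::AbstractVector{T}, refs::AbstractVector{<:Integer}) where {T}
    newpool = T[]
    first_index = Dict{T, Int}()               # value => position in newpool
    remap = Vector{Int}(undef, length(pool))   # old code => new code
    for (i, v) in enumerate(pool)
        # Reuse the position of the first occurrence, or append a new entry.
        remap[i] = get!(first_index, v) do
            push!(newpool, v)
            length(newpool)
        end
    end
    return newpool, [remap[r] for r in refs]
end

pool = ["a", "b", "a"]
refs = [1, 2, 3, 1]
newpool, newrefs = scrub_pool(pool, refs)

# The decoded values are unchanged, but the pool is now unique.
same_values = newpool[newrefs] == pool[refs]
```

After the scrub the integer codes can be used directly for grouping, at the cost of one pass over the pool and one pass over the codes.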

Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins bkamins merged commit 276bbc2 into main Apr 23, 2021
@bkamins bkamins deleted the improve_groupby_performance branch April 23, 2021 23:15
bkamins commented Apr 23, 2021

Thank you!

3 participants