Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix performance issue in multirow split-apply-combine #2749

Merged
merged 14 commits into from
May 7, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented May 5, 2021

@bkamins bkamins requested a review from nalimilan May 5, 2021 17:14
@bkamins bkamins added this to the patch milestone May 5, 2021
NEWS.md Outdated Show resolved Hide resolved
@@ -90,7 +90,7 @@ end
function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer)
idx = idx[starts[i]:ends[i]]
idx = view(idx, starts[i]:ends[i])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

saves allocations. I left allocating code for more passed columns as I was unsure what is faster - allocate indices or have a doubly nested view.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some more benchmarking and I change the code to use view in all cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think that views based on a range of indices are quite efficient, but only benchmarking can tell.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance difference is small, but it will be more GC friendly (with a large number of groups it is a bit better to use view with a small number of groups it is a bit better to materialize the idx).

PR after the change:

julia> df = DataFrame(x=rand(1:10^6, 10^7), v1=rand(10^7), v2=rand(10^7));

julia> gdf = groupby(df, :x);

julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)

julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
  932.859 ms (1000270 allocations: 179.30 MiB)

julia> df = DataFrame(x=rand(1:10, 10^7), v1=rand(10^7), v2=rand(10^7));

julia> gdf = groupby(df, :x);

julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)

julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
  287.085 ms (321 allocations: 76.31 MiB)

current release:

julia> df = DataFrame(x=rand(1:10^6, 10^7), v1=rand(10^7), v2=rand(10^7));

julia> gdf = groupby(df, :x);

julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)

julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
  810.070 ms (315 allocations: 22.91 MiB)

julia> df = DataFrame(x=rand(1:10, 10^7), v1=rand(10^7), v2=rand(10^7));

julia> gdf = groupby(df, :x);

julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)

julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
  269.337 ms (302 allocations: 18.81 KiB)

@bkamins
Copy link
Member Author

bkamins commented May 5, 2021

I was unable to go lower than 8 allocations per group 😭 (which is bad for many groups).

NEWS.md Show resolved Hide resolved
@@ -543,7 +543,7 @@ function _combine(gd::GroupedDataFrame,
end
idx_keeprows = prepare_idx_keeprows(gd.idx, gd.starts, gd.ends, nrow(parent(gd)))
else
idx_keeprows = nothing
idx_keeprows = Int[]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this reduces specialization; if keeprows is false we never use idx_keeprows in processing.

NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented May 7, 2021

do not merge this PR before #2750 is merged (as I update NEWS.md in this PR to avoid having to resolve merge conflicts later)

src/groupeddataframe/complextransforms.jl Outdated Show resolved Hide resolved
src/groupeddataframe/complextransforms.jl Outdated Show resolved Hide resolved
src/groupeddataframe/splitapplycombine.jl Show resolved Hide resolved
eltys = map(typeof, first)
if any(x -> x <: AbstractVector, eltys)
throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed"))
end
end
idx = idx_agg === NOTHING_IDX_AGG ? Vector{Int}(undef, n) : idx_agg
sizehint!(idx, lgd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that there's going to be at least one row per group, right? Maybe add a comment explaining this? That's not necessarily the case if groups are dropped, though that's probably quite common.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. What you write is exactly the thinking. sizehint! is cheap and dropping groups is uncommon (actually in the past it was even super hard to do 😄):

julia> x = Int[];

julia> @time sizehint!(x, 10^7);
  0.000018 seconds (2 allocations: 76.294 MiB)

julia> x = Int[];

julia> @time sizehint!(x, 10^7);
  0.000021 seconds (2 allocations: 76.294 MiB)

@bkamins bkamins merged commit 32b86d4 into main May 7, 2021
@bkamins bkamins deleted the fix_multirow_aggregation_performance branch May 7, 2021 09:55
@bkamins
Copy link
Member Author

bkamins commented May 7, 2021

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants