-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix performance issue in multirow split-apply-combine #2749
Conversation
…rmance' into fix_multirow_aggregation_performance
@@ -90,7 +90,7 @@ end | |||
function do_call(f::Base.Callable, idx::AbstractVector{<:Integer}, | |||
starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, | |||
gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer) | |||
idx = idx[starts[i]:ends[i]] | |||
idx = view(idx, starts[i]:ends[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
saves allocations. I left allocating code for more passed columns as I was unsure what is faster - allocate indices or have a doubly nested view
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some more benchmarking and I change the code to use view
in all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd think that views based on a range of indices are quite efficient, but only benchmarking can tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance difference is small, but it will be more GC friendly (with a large number of groups it is a bit better to use view
with a small number of groups it is a bit better to materialize the idx
).
PR after the change:
julia> df = DataFrame(x=rand(1:10^6, 10^7), v1=rand(10^7), v2=rand(10^7));
julia> gdf = groupby(df, :x);
julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)
julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
932.859 ms (1000270 allocations: 179.30 MiB)
julia> df = DataFrame(x=rand(1:10, 10^7), v1=rand(10^7), v2=rand(10^7));
julia> gdf = groupby(df, :x);
julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)
julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
287.085 ms (321 allocations: 76.31 MiB)
current release:
julia> df = DataFrame(x=rand(1:10^6, 10^7), v1=rand(10^7), v2=rand(10^7));
julia> gdf = groupby(df, :x);
julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)
julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
810.070 ms (315 allocations: 22.91 MiB)
julia> df = DataFrame(x=rand(1:10, 10^7), v1=rand(10^7), v2=rand(10^7));
julia> gdf = groupby(df, :x);
julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)
julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
269.337 ms (302 allocations: 18.81 KiB)
I was unable to go lower than 8 allocations per group 😭 (which is bad for many groups). |
@@ -543,7 +543,7 @@ function _combine(gd::GroupedDataFrame, | |||
end | |||
idx_keeprows = prepare_idx_keeprows(gd.idx, gd.starts, gd.ends, nrow(parent(gd))) | |||
else | |||
idx_keeprows = nothing | |||
idx_keeprows = Int[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this reduces specialization; if keeprows
is false
we never use idx_keeprows
in processing.
do not merge this PR before #2750 is merged (as I update NEWS.md in this PR to avoid having to resolve merge conflicts later) |
…rmance' into fix_multirow_aggregation_performance
eltys = map(typeof, first) | ||
if any(x -> x <: AbstractVector, eltys) | ||
throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed")) | ||
end | ||
end | ||
idx = idx_agg === NOTHING_IDX_AGG ? Vector{Int}(undef, n) : idx_agg | ||
sizehint!(idx, lgd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that there's going to be at least one row per group, right? Maybe add a comment explaining this? That's not necessarily the case if groups are dropped, though that's probably quite common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. What you write is exactly the thinking. sizehint!
is cheap and dropping groups is uncommon (actually in the past it was even super hard to do 😄):
julia> x = Int[];
julia> @time sizehint!(x, 10^7);
0.000018 seconds (2 allocations: 76.294 MiB)
julia> x = Int[];
julia> @time sizehint!(x, 10^7);
0.000021 seconds (2 allocations: 76.294 MiB)
Thank you! |
Resolves https://discourse.julialang.org/t/allocations-and-slow-perf-for-transform-on-groupeddataframes/60594