fix performance issue in multirow split-apply-combine #2749

bkamins · 2021-05-05T17:14:07Z

Resolves https://discourse.julialang.org/t/allocations-and-slow-perf-for-transform-on-groupeddataframes/60594

src/groupeddataframe/splitapplycombine.jl

NEWS.md

src/groupeddataframe/complextransforms.jl

bkamins · 2021-05-05T18:57:04Z

src/groupeddataframe/callprocessing.jl

@@ -90,7 +90,7 @@ end
 function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
                 starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                 gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer)
-    idx = idx[starts[i]:ends[i]]
+    idx = view(idx, starts[i]:ends[i])


This saves allocations. I left the allocating code in place for the case where more than one column is passed, as I was unsure which is faster: allocating the indices or having a doubly nested view.

I did some more benchmarking and changed the code to use view in all cases.

I'd think that views based on a range of indices are quite efficient, but only benchmarking can tell.

The performance difference is small, but using view will be more GC friendly (with a large number of groups it is a bit better to use view; with a small number of groups it is a bit better to materialize the idx).
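To make the trade-off above concrete, here is a minimal sketch of the two ways to select the rows of one group. It is not the DataFrames.jl implementation: the names idx, starts and ends mirror the arguments of do_call in the diff, while the data and the helpers group_copy and group_view are hypothetical, and the sketch assumes the column itself is then accessed through a view (which is what "doubly nested view" suggests).

```julia
# Hypothetical data: `idx` lists row numbers ordered by group, and
# group i occupies positions starts[i]:ends[i] of `idx`.
col    = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
idx    = [2, 5, 1, 3, 6, 4]
starts = [1, 3]
ends   = [2, 6]

# Materializing: a new index vector is allocated for every group.
group_copy(col, idx, starts, ends, i) = view(col, idx[starts[i]:ends[i]])

# Nested view: no index vector is allocated, but every element access
# goes through two levels of indirection.
group_view(col, idx, starts, ends, i) = view(col, view(idx, starts[i]:ends[i]))

a = group_copy(col, idx, starts, ends, 2)
b = group_view(col, idx, starts, ends, 2)
```

Both calls select the same rows; which variant is faster depends on the number and size of groups, so it has to be benchmarked as described above.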

PR after the change:

julia> df = DataFrame(x=rand(1:10^6, 10^7), v1=rand(10^7), v2=rand(10^7));

julia> gdf = groupby(df, :x);

julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)

julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
  932.859 ms (1000270 allocations: 179.30 MiB)

julia> df = DataFrame(x=rand(1:10, 10^7), v1=rand(10^7), v2=rand(10^7));

julia> gdf = groupby(df, :x);

julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)

julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
  287.085 ms (321 allocations: 76.31 MiB)

Current release:

julia> df = DataFrame(x=rand(1:10^6, 10^7), v1=rand(10^7), v2=rand(10^7));

julia> gdf = groupby(df, :x);

julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)

julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
  810.070 ms (315 allocations: 22.91 MiB)

julia> df = DataFrame(x=rand(1:10, 10^7), v1=rand(10^7), v2=rand(10^7));

julia> gdf = groupby(df, :x);

julia> f(v1,v2) = cor(v1, v2)^2
f (generic function with 1 method)

julia> @btime combine($gdf, [:v1, :v2] => f => :r2);
  269.337 ms (302 allocations: 18.81 KiB)

bkamins · 2021-05-05T18:57:40Z

I was unable to go lower than 8 allocations per group 😭 (which is bad for many groups).

NEWS.md

bkamins · 2021-05-07T06:44:43Z

src/groupeddataframe/splitapplycombine.jl

@@ -543,7 +543,7 @@ function _combine(gd::GroupedDataFrame,
        end
        idx_keeprows = prepare_idx_keeprows(gd.idx, gd.starts, gd.ends, nrow(parent(gd)))
    else
-        idx_keeprows = nothing
+        idx_keeprows = Int[]


This reduces specialization; if keeprows is false, idx_keeprows is never used in processing.
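One way to read this comment: with nothing in the else branch the variable is either a Vector{Int} or nothing, so code that receives it may be compiled for both types, whereas an empty Int[] keeps a single concrete type. The sketch below illustrates that idea with two hypothetical stand-in functions (keeprows_idx_union and keeprows_idx_stable); it is not the code of _combine.

```julia
# Hypothetical stand-ins for the branch in `_combine`; not the library code.
function keeprows_idx_union(keeprows::Bool, n::Int)
    # Returns either a vector or `nothing`: Union{Nothing, Vector{Int}}
    return keeprows ? collect(1:n) : nothing
end

function keeprows_idx_stable(keeprows::Bool, n::Int)
    # Always returns a Vector{Int}; it is simply empty when unused.
    return keeprows ? collect(1:n) : Int[]
end

# Inferred return types for (Bool, Int) arguments.
t_union  = first(Base.return_types(keeprows_idx_union, (Bool, Int)))
t_stable = first(Base.return_types(keeprows_idx_stable, (Bool, Int)))
```

Here t_stable is a concrete type while t_union is a Union, so anything downstream that takes the value needs only one specialization in the second variant.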

NEWS.md

bkamins · 2021-05-07T07:14:41Z

Do not merge this PR before #2750 is merged (I update NEWS.md in this PR, and merging in this order avoids having to resolve merge conflicts later).

src/groupeddataframe/complextransforms.jl

src/groupeddataframe/splitapplycombine.jl

nalimilan · 2021-05-07T08:22:19Z

src/groupeddataframe/complextransforms.jl

        eltys = map(typeof, first)
        if any(x -> x <: AbstractVector, eltys)
            throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed"))
        end
    end
    idx = idx_agg === NOTHING_IDX_AGG ? Vector{Int}(undef, n) : idx_agg
+    sizehint!(idx, lgd)


This assumes that there's going to be at least one row per group, right? Maybe add a comment explaining this? That's not necessarily the case if groups are dropped, though having at least one row per group is probably the common case.

Added. What you write is exactly my reasoning: sizehint! is cheap and dropping groups is uncommon (in the past it was actually very hard to do 😄):

julia> x = Int[];

julia> @time sizehint!(x, 10^7);
  0.000018 seconds (2 allocations: 76.294 MiB)

julia> x = Int[];

julia> @time sizehint!(x, 10^7);
  0.000021 seconds (2 allocations: 76.294 MiB)
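As a minimal sketch of the reasoning above (not the library code): sizehint! only reserves capacity and does not change the length of the vector, so a hint of one entry per group is a lower bound that avoids repeated regrowth while rows are pushed. The value of lgd and the loop are hypothetical.

```julia
lgd = 1_000                     # hypothetical number of groups
idx = Int[]
sizehint!(idx, lgd)             # reserve capacity for at least lgd entries
len_after_hint = length(idx)    # the length itself is unchanged

for g in 1:lgd
    push!(idx, g)               # one row per group fits in the reserved capacity
end
```

If some groups are dropped, the hint merely over-reserves; if groups produce more than one row, the vector grows as usual.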

bkamins · 2021-05-07T09:55:26Z

Thank you!

fix performance issue in multirow split-apply-combine

29b11b5

bkamins requested a review from nalimilan May 5, 2021 17:14

bkamins added the performance label May 5, 2021

bkamins added this to the patch milestone May 5, 2021

bkamins added the grouping label May 5, 2021

nalimilan reviewed May 5, 2021

add signatures

34c2ae2

bkamins commented May 5, 2021

Update NEWS.md

373d39c

bkamins commented May 5, 2021

bkamins added 2 commits May 5, 2021 20:47

use view in single column case

ba6529e

Merge remote-tracking branch 'upstream/fix_multirow_aggregation_performance' into fix_multirow_aggregation_performance

948690f

bkamins commented May 5, 2021

bkamins added 2 commits May 5, 2021 21:01

we do not use idx_keeprows otherwise

519e9af

remove sizehint! as it might be undefined

7825b73

bkamins commented May 7, 2021

bkamins commented May 7, 2021

Update NEWS.md

17eeefc

bkamins commented May 7, 2021

Update NEWS.md

5c20885

bkamins commented May 7, 2021

Update NEWS.md

bfa0cbf

bkamins added 2 commits May 7, 2021 09:31

always use view in do_call indexing

7be1e5c

Merge remote-tracking branch 'upstream/fix_multirow_aggregation_performance' into fix_multirow_aggregation_performance

1846762

nalimilan reviewed May 7, 2021

fixes after code review

23350bf

nalimilan reviewed May 7, 2021

nalimilan approved these changes May 7, 2021

add comment about sizehint!

000559a

bkamins merged commit 32b86d4 into main May 7, 2021

bkamins deleted the fix_multirow_aggregation_performance branch May 7, 2021 09:55

bkamins mentioned this pull request May 17, 2021

Understanding timing of DataFrames.jl joins h2oai/db-benchmark#210

Closed
