Investigate performance of aggregations #2440

bkamins · 2020-09-18T15:21:59Z

The case that should be improved (or checked why it is problematic):

julia> df = DataFrame(rand(10^5, 40));

julia> @time combine(df, AsTable(1:40) => ByRow(sum));
  0.031775 seconds (144 allocations: 31.294 MiB, 15.38% gc time)

julia> @time combine(df, 1:40 => ByRow(+));
  2.719783 seconds (69.70 M allocations: 2.410 GiB, 9.69% gc time)

The text was updated successfully, but these errors were encountered:

bkamins · 2020-09-18T18:58:46Z

This is the reason:

julia> f(a) = +(a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a);

julia> @code_warntype f(a)
Variables
  #self#::Core.Compiler.Const(f, false)
  a::Array{Float64,1}

Body::Array{Float64,1}
1 ─ %1 = (a + a + a + a + a + a + a + a + a + a + a + a + a + a + a + a + a)::Array{Float64,1}
└──      return %1

julia> f(a) = +(a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a);

julia> @code_warntype f(a)
Variables
  #self#::Core.Compiler.Const(f, false)
  a::Array{Float64,1}

Body::Any
1 ─ %1 = (a + a + a + a + a + a + a + a + a + a + a + a + a + a + a + a + a + a)::Any
└──      return %1

bkamins · 2020-09-18T18:59:38Z

@nalimilan - so it seems we have an inherent limitation to splatting as at some point compiler gives up with optimization.

bkamins · 2020-09-18T22:59:00Z

So the conclusion is:

we should add an information in the documentation that having more than 16 columns in the default (splatting) syntax is likely to be slow;
we should add AsVector or AsTuple wrapper for cases when you want to aggregate across e.g. 1000 columns (most likely of a homogenous type). This is tracked in Provide a syntax to perform row aggregations fast #2439. The reason is that currently there is no way to perform it fast without huge run-time or compile-time cost. I will investigate both options and give some recommendation what should be added.

bkamins · 2020-09-20T13:27:00Z

The solution is the following type:

"""
    VectorIterator(::AbstractDataFrame)

Return a `Vector` iterator over rows of a passed.
The returned vector is reused between iterations for efficiency.
It can be modified by the function that accesses it,
but it is not allowed to resize it. This iterator is not thread safe.
"""
struct VectorIterator{S, T}
    v::Vector{S}
    x::Vector{T}

    function VectorIterator(df::AbstractDataFrame)
        ncol(df) == 0 && return new{Union{}, Vector{Any}}([], Union{}[])

        S = reduce(typejoin, eltype(x) for x in eachcol(df))
        T = reduce(typejoin, typeof(x) for x in eachcol(df))
        x = T[x for x in eachcol(df)]
        new{S, T}(Vector{S}(undef, ncol(df)), x)
    end
end

Base.IteratorEltype(::Type{<:VectorIterator}) = Base.HasEltype()
Base.eltype(::Type{VectorIterator{S, T}}) where {S, T} = Vector{S}
Base.IteratorSize(::Type{<:VectorIterator}) where {sch, T} = Base.HasShape{1}()
Base.length(vi::VectorIterator) = isempty(vi.x) ? 0 : length(vi.x[1])
Base.size(vi::VectorIterator) = (length(vi),)

@inline function Base.iterate(vi::VectorIterator{S, T}, i=1) where {S, T}
    i > length(vi) && return nothing
    v, x = vi.v, vi.x
    @assert length(v) == length(x)
    @inbounds for j in eachindex(v, x)
        v[j] = x[j][i]
    end
    return (v, i + 1)
end

Base.broadcastable(::VectorIterator) =
    throw(ArgumentError("broadcasting over `VectorIterator`s is not supported"))

we could use AsVector wrapper for it. For homogenous types it has the following timing on the same data:

julia> f(df) = [sum(row) for row in VectorIterator(df)]
f (generic function with 1 method)

julia> g(m) = [sum(@view m[i, :]) for i in axes(m, 1)]
g (generic function with 1 method)

julia> @time f(df);
  0.009579 seconds (762 allocations: 825.502 KiB)

julia> @time f(df);
  0.006256 seconds (8 allocations: 782.219 KiB)

julia> m = Matrix(df); @time g(m);
  0.029520 seconds (60.47 k allocations: 3.685 MiB)

julia> m = Matrix(df); @time g(m);
  0.004464 seconds (2 allocations: 781.328 KiB)

julia> m = Matrix(df); @time sum(m, dims=2);
  0.004098 seconds (6 allocations: 781.406 KiB)

julia> m = Matrix(df); @time sum(m, dims=2);
  0.002393 seconds (6 allocations: 781.406 KiB)

(and it is not @generated as Tables.namedtupleiterator so compilation time is negligible).
Thus I would say it is not bad

I use typejoin which is a safe approach (and for homogenous types just enough so I would not go over fancy here).

I used Vector not Tuple here, as Vectors are supported by many more functions, so this is a more flexible design.

bkamins added the performance label Sep 18, 2020

bkamins added this to the 1.x milestone Sep 18, 2020

bkamins mentioned this issue Sep 18, 2020

Provide a syntax to perform row aggregations fast #2439

Closed

bkamins mentioned this issue May 10, 2021

Slow row aggregation in presence of missings #2757

Closed

sl-solution mentioned this issue May 19, 2021

Fast row aggregation in DataFrames.jl #2768

Closed

bkamins modified the milestones: 1.x, 1.3 Sep 9, 2021

bkamins mentioned this issue Sep 9, 2021

Add fast reductions #2869

Merged

bkamins linked a pull request Sep 9, 2021 that will close this issue

Add fast reductions #2869

Merged

bkamins closed this as completed in #2869 Oct 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate performance of aggregations #2440

Investigate performance of aggregations #2440

bkamins commented Sep 18, 2020

bkamins commented Sep 18, 2020

bkamins commented Sep 18, 2020

bkamins commented Sep 18, 2020

bkamins commented Sep 20, 2020 •

edited

Loading

Investigate performance of aggregations #2440

Investigate performance of aggregations #2440

Comments

bkamins commented Sep 18, 2020

bkamins commented Sep 18, 2020

bkamins commented Sep 18, 2020

bkamins commented Sep 18, 2020

bkamins commented Sep 20, 2020 • edited Loading

bkamins commented Sep 20, 2020 •

edited

Loading