Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate performance of aggregations #2440

Closed
bkamins opened this issue Sep 18, 2020 · 4 comments · Fixed by #2869
Closed

Investigate performance of aggregations #2440

bkamins opened this issue Sep 18, 2020 · 4 comments · Fixed by #2869
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Sep 18, 2020

The case that should be improved (or checked why it is problematic):

julia> df = DataFrame(rand(10^5, 40));

julia> @time combine(df, AsTable(1:40) => ByRow(sum));
  0.031775 seconds (144 allocations: 31.294 MiB, 15.38% gc time)

julia> @time combine(df, 1:40 => ByRow(+));
  2.719783 seconds (69.70 M allocations: 2.410 GiB, 9.69% gc time)
@bkamins bkamins added this to the 1.x milestone Sep 18, 2020
@bkamins
Copy link
Member Author

bkamins commented Sep 18, 2020

This is the reason:

julia> f(a) = +(a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a);

julia> @code_warntype f(a)
Variables
  #self#::Core.Compiler.Const(f, false)
  a::Array{Float64,1}

Body::Array{Float64,1}
1 ─ %1 = (a + a + a + a + a + a + a + a + a + a + a + a + a + a + a + a + a)::Array{Float64,1}
└──      return %1

julia> f(a) = +(a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a);

julia> @code_warntype f(a)
Variables
  #self#::Core.Compiler.Const(f, false)
  a::Array{Float64,1}

Body::Any
1 ─ %1 = (a + a + a + a + a + a + a + a + a + a + a + a + a + a + a + a + a + a)::Any
└──      return %1

@bkamins
Copy link
Member Author

bkamins commented Sep 18, 2020

@nalimilan - so it seems we have an inherent limitation to splatting as at some point compiler gives up with optimization.

@bkamins
Copy link
Member Author

bkamins commented Sep 18, 2020

So the conclusion is:

  1. we should add an information in the documentation that having more than 16 columns in the default (splatting) syntax is likely to be slow;
  2. we should add AsVector or AsTuple wrapper for cases when you want to aggregate across e.g. 1000 columns (most likely of a homogenous type). This is tracked in Provide a syntax to perform row aggregations fast #2439. The reason is that currently there is no way to perform it fast without huge run-time or compile-time cost. I will investigate both options and give some recommendation what should be added.

@bkamins
Copy link
Member Author

bkamins commented Sep 20, 2020

The solution is the following type:

"""
    VectorIterator(::AbstractDataFrame)

Return a `Vector` iterator over rows of a passed.
The returned vector is reused between iterations for efficiency.
It can be modified by the function that accesses it,
but it is not allowed to resize it. This iterator is not thread safe.
"""
struct VectorIterator{S, T}
    v::Vector{S}
    x::Vector{T}

    function VectorIterator(df::AbstractDataFrame)
        ncol(df) == 0 && return new{Union{}, Vector{Any}}([], Union{}[])

        S = reduce(typejoin, eltype(x) for x in eachcol(df))
        T = reduce(typejoin, typeof(x) for x in eachcol(df))
        x = T[x for x in eachcol(df)]
        new{S, T}(Vector{S}(undef, ncol(df)), x)
    end
end

Base.IteratorEltype(::Type{<:VectorIterator}) = Base.HasEltype()
Base.eltype(::Type{VectorIterator{S, T}}) where {S, T} = Vector{S}
Base.IteratorSize(::Type{<:VectorIterator}) where {sch, T} = Base.HasShape{1}()
Base.length(vi::VectorIterator) = isempty(vi.x) ? 0 : length(vi.x[1])
Base.size(vi::VectorIterator) = (length(vi),)

@inline function Base.iterate(vi::VectorIterator{S, T}, i=1) where {S, T}
    i > length(vi) && return nothing
    v, x = vi.v, vi.x
    @assert length(v) == length(x)
    @inbounds for j in eachindex(v, x)
        v[j] = x[j][i]
    end
    return (v, i + 1)
end

Base.broadcastable(::VectorIterator) =
    throw(ArgumentError("broadcasting over `VectorIterator`s is not supported"))

we could use AsVector wrapper for it. For homogenous types it has the following timing on the same data:

julia> f(df) = [sum(row) for row in VectorIterator(df)]
f (generic function with 1 method)

julia> g(m) = [sum(@view m[i, :]) for i in axes(m, 1)]
g (generic function with 1 method)

julia> @time f(df);
  0.009579 seconds (762 allocations: 825.502 KiB)

julia> @time f(df);
  0.006256 seconds (8 allocations: 782.219 KiB)

julia> m = Matrix(df); @time g(m);
  0.029520 seconds (60.47 k allocations: 3.685 MiB)

julia> m = Matrix(df); @time g(m);
  0.004464 seconds (2 allocations: 781.328 KiB)

julia> m = Matrix(df); @time sum(m, dims=2);
  0.004098 seconds (6 allocations: 781.406 KiB)

julia> m = Matrix(df); @time sum(m, dims=2);
  0.002393 seconds (6 allocations: 781.406 KiB)

(and it is not @generated as Tables.namedtupleiterator so compilation time is negligible).
Thus I would say it is not bad

I use typejoin which is a safe approach (and for homogenous types just enough so I would not go over fancy here).

I used Vector not Tuple here, as Vectors are supported by many more functions, so this is a more flexible design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant