Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update @by/@based_on macros to avoid constructing DataFrames #127

Closed
wants to merge 2 commits into from

Conversation

jacobadenbaum
Copy link

Currently, both the @by and the @based_on macros construct DataFrames on each anonymous function call. When aggregating over a large number of groups, this can become very costly. This PR changes the helper functions to return NamedTuples instead.

Before this PR:

julia> using DataFrames, Random, DataFramesMeta
julia> Random.seed!(1234);
julia> df = DataFrame(id = repeat(1:1_000_000, 10), X = randn(10_000_000), Y = randn(10_000_000));
julia> @btime @by($df, :id, X = mean(:X), Y = mean(:Y));
  8.706 s (73999190 allocations: 4.39 GiB)

After:

julia> using DataFrames, Random, DataFramesMeta
julia> Random.seed!(1234);
julia> df = DataFrame(id = repeat(1:1_000_000, 10), X = randn(10_000_000), Y = randn(10_000_000));
julia> @btime @by($df, :id, X = mean(:X), Y = mean(:Y));
  952.237 ms (16999142 allocations: 936.71 MiB)

Avoid constructing unnecessary DataFrame.
Forgot them in the last commit
@tshort
Copy link
Contributor

tshort commented Mar 24, 2019

Good idea.

@nalimilan
Copy link
Member

Thanks, but I think it would be even faster to use the new by API from JuliaData/DataFrames.jl#1601, which ensures full type stability (instead of passing a SubDataFrame). Maybe you can take inspiration from transform_helper?

(I also have a branch locally to transform single-argument calls like sum(x) into just sum so that optimized algorithm for reductions can be used, but I initially wanted to use them with transform, which doesn't support them yet. I can push the branch if you want to experiment with reusing that code for @by.)

@@ -536,7 +536,7 @@ end

function based_on_helper(x, args...)
with_args =
with_anonymous(:($DataFrame($(map(replace_equals_with_kw, args)...))))
with_anonymous(Expr(:tuple, map(replace_kw_with_equals, args)...))
:( DataFrames.DataFrame(map($with_args, $x)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT it would be more efficient to use combine instead of DataFrame(map(...)).

@jacobadenbaum
Copy link
Author

If you wouldn't mind pushing the branch, that would be helpful.

@nalimilan
Copy link
Member

I've pushed the nl/transform branch. It's not correct (since it uses combine inside transform despite these two functions having different behaviors in some cases), but the helpers machinery worked IIRC.

@pdeffebach
Copy link
Collaborator

Closing this as all the @by calls are based on DataFrames.combine now.

@pdeffebach pdeffebach closed this Nov 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants