Transform performance #2479

Closed · tk3369 opened this issue Oct 11, 2020 · 12 comments

tk3369 (Contributor) commented Oct 11, 2020

It seems that there's still a lot of room for improvement on a table like this?

using DataFrames, BenchmarkTools

df = DataFrame(rand(2000, 5))

function calc1(df)
    return transform(df, [:x1, :x2, :x3] => ((x,y,z) -> x .+ y .* z) => :y)
end

function calc5(df)
    df = copy(df)
    df.y = df.x1 .+ df.x2 .* df.x3 
    return df
end

Test:

julia> calc1(df) == calc5(df)
true

julia> @btime calc1($df);
  23.192 μs (92 allocations: 99.42 KiB)

julia> @btime calc5($df);
  16.263 μs (32 allocations: 96.48 KiB)
bkamins (Member) commented Oct 11, 2020

This is a constant overhead that we cannot eliminate, because transform is more complex (i.e. it does more work). If you go for a bigger table you get:

julia> df = DataFrame(rand(5*10^7, 5));

julia> @btime calc1($df);
  1.010 s (125 allocations: 2.24 GiB)

julia> @btime calc5($df);
  1.126 s (38 allocations: 2.24 GiB)

(actually, for large data frames transform is consistently a bit faster)

I will keep it open, but I do not think it will be easy to do something about this.

bkamins added this to the 1.x milestone Oct 11, 2020
bkamins (Member) commented Oct 11, 2020

If we implement what I proposed in #2476, the situation in the original post by @tk3369 will become worse (but it will improve for the case where a new anonymous function is created each time).

pdeffebach (Contributor) commented

But after #2476, cases where the calculation time is very long will still be comparable, right? transform will still scale?

nalimilan (Member) commented

Yes.

After investigating this a bit with @bkamins, it appears that the problem notably comes from the fact that e.g. :x => f forces specialization of all functions that take this Pair{Symbol, typeof(f)} argument on the particular f, even if they don't call f themselves. This differs from what the compiler does when you pass just f. We could work around this by converting such pairs to Pair{Symbol, Function} and so on, so that we only specialize on the general structure of the operation, and only specialize on the particular function in the inner loop.
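
A minimal sketch of that idea (widen_op, apply_op and _kernel are made-up names for illustration, not the actual DataFrames.jl internals): widen the pair so the outer machinery is compiled only once, and let only a small inner kernel specialize on the concrete function.

using DataFrames

# widen_op erases the concrete function type from the pair, so apply_op is
# compiled once for Pair{Symbol, Function} instead of once per f.
widen_op(p::Pair{Symbol, <:Function}) = Pair{Symbol, Function}(first(p), last(p))

function apply_op(df::AbstractDataFrame, p::Pair{Symbol, Function})
    col, f = p
    return _kernel(f, df[!, col])   # only this call specializes on the concrete f
end

_kernel(f, x) = f(x)                # tiny function, so compiling it per f is cheap

d = DataFrame(a = [1, 2, 3])
apply_op(d, widen_op(:a => (x -> x .* 2)))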

bkamins (Member) commented Oct 11, 2020

> we only specialize on the general structure of the operation, and only specialize on the particular function in the inner loop.

and #2476 does exactly this (but not in the most elegant way, i.e. I just add @nospecialize where needed). In the future it can probably be cleaned up, but I do not expect that to improve compilation performance in comparison to what we have now (only the code will potentially be cleaner). But I would wait for the next LTS (be it 1.6 or 1.7) to do this clean-up, as then we will do it against the functionality the compiler offers at that time (as things in the compilation-latency area keep improving).

pdeffebach (Contributor) commented

Thank you for your work on this. I will benchmark #2476 with DataFramesMeta

bkamins (Member) commented Oct 11, 2020

That would be most welcome!

pdeffebach (Contributor) commented Oct 11, 2020

Update:

I don't think things are really that much better. The performance difference between

transform(df, :a => identity)

and

transform(df, :a => identity => :c)

is gone with this update, but calls that use anonymous functions are still very slow:

julia> df = DataFrame(a = [1, 2, 3, 4], b = [5, 6, 7, 8]);

julia> @time transform(df, :a => (t -> t) => :c);
  0.103999 seconds (184.47 k allocations: 9.794 MiB)

This is a comparable time on both #2476 and master, unfortunately.
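
As an aside, the reason anonymous functions hurt here is that every t -> t literal in the source is a distinct closure type, so anything specialized on it gets recompiled for each new literal. A quick illustrative check:

julia> typeof(t -> t) == typeof(t -> t)
false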

pdeffebach (Contributor) commented

Here's an update on the DataFramesMeta side

I can get rid of this problem entirely by:

  1. Keeping a global Set{Symbol} called FUNSET.
  2. When I see an expression e of the form :(y = :a + :b), first checking whether Symbol(hash(e)) is in FUNSET. If it is not, I define a function in DataFramesMeta's namespace named Symbol(hash(e)) and add that name to FUNSET (see the sketch at the end of this comment).
  3. If the hash of the expression is already in FUNSET, simply reusing Symbol(hash(e)).

The final call is

transform(df, [:a, :b] => getproperty(DataFramesMeta, Symbol(hash(:(y = :a + :b)))) => :c)

This solves the "anonymous function" problem, but it doesn't solve the core problem that we over-specialize on the type of f more than we need to.
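
A rough sketch of the caching step described above (FunCache and get_or_define! are illustrative stand-ins, not DataFramesMeta's actual implementation):

module FunCache   # stand-in for DataFramesMeta's namespace

const FUNSET = Set{Symbol}()

# Return the name of the generated function for expression ex, defining it on
# first use. args are the argument names and body the rewritten expression,
# e.g. for :(y = :a + :b) one would pass args = [:a, :b] and body = :(a + b).
function get_or_define!(ex::Expr, args::Vector{Symbol}, body::Expr)
    name = Symbol(hash(ex))
    if !(name in FUNSET)
        @eval $name($(args...)) = $body
        push!(FUNSET, name)
    end
    return name
end

end # module

# The macro would then emit something like:
# transform(df, [:a, :b] => getproperty(FunCache, name) => :y)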

tk3369 (Contributor, Author) commented Oct 12, 2020

Thanks! Interesting discussions. It's nice to see that transform scales well.

I don't think there's anything to do here. When the data frame is small I don't care much about the little performance difference, and when it gets big transform works well. Feel free to close the ticket.

bkamins (Member) commented Oct 12, 2020

I can get the compilation time down to 0.04 seconds by wrapping the whole selections.jl file in @nospecialize, but doing so seems premature (especially as it is still only a 50% gain, it does not disable compilation, and it might have side effects I am not aware of). I will merge #2476 as is and we can continue performance tuning in a separate discussion.

> This solves the "anonymous function" problem

This was exactly what I was thinking of. One optimization to consider would be not to hash :(y = :a + :b) but to hash a canonical function signature (i.e. something like (x1, x2) -> x1 + x2 in this case), so that we are not affected by changes in input column names (i.e. a function defining the addition of two columns would be defined only once, no matter what the column names are).
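
For illustration, a hypothetical canonicalization pass along those lines (canonicalize is a made-up helper, not existing code): rewrite column references to positional argument names before hashing, so :(:a + :b) and :(:c + :d) map to the same generated function.

# Replace column references (:a, :b, ...) with positional names (x1, x2, ...)
# and record the column order, so the same computation on different columns
# hashes to the same canonical body.
function canonicalize(ex, cols::Vector{Symbol} = Symbol[])
    if ex isa QuoteNode && ex.value isa Symbol      # a column reference such as :a
        i = findfirst(==(ex.value), cols)
        if i === nothing
            push!(cols, ex.value)
            i = length(cols)
        end
        return Symbol(:x, i), cols
    elseif ex isa Expr
        new_args = [first(canonicalize(a, cols)) for a in ex.args]
        return Expr(ex.head, new_args...), cols
    else
        return ex, cols
    end
end

canonicalize(:( :a + :b ))   # (:(x1 + x2), [:a, :b])
canonicalize(:( :c + :d ))   # (:(x1 + x2), [:c, :d]) -- same canonical body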

bkamins (Member) commented Oct 12, 2020

I am closing this issue then.

bkamins closed this as completed Oct 12, 2020