precompilation for 1.4 release #3182

Merged
merged 5 commits into from
Sep 29, 2022


@bkamins bkamins commented Sep 26, 2022

Fixes #3080

A comparison of timings is below. The conclusion (pretty obvious) is that what we precompile is faster, and what we do not precompile is comparable or slower (note that DataFrames.jl 1.3.6 also did precompilation, but using a different mechanism).

So:

  • we should include in precompilation all that requires a lot of compilation
  • however, we should limit ourselves only to common methods (as otherwise we will increase package load time by having many precompile statements that are usually not needed)

Conclusion: we need to think carefully about the list of things we put in for precompilation. Please comment on what you think we should put there.
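As an illustration of what a single entry in such a list can look like, here is a minimal sketch using the `precompile` function from Julia Base. It is not necessarily the mechanism this PR uses; the module `PrecompileSketch` and the function `colsum` are hypothetical names chosen for the example.

```julia
module PrecompileSketch  # hypothetical module, not part of DataFrames.jl

# A small function standing in for a "common method" worth precompiling.
colsum(v::Vector{Float64}) = sum(v)

# Ask Julia to compile `colsum` for this exact argument signature ahead of
# the first call. `precompile` returns `true` when compilation succeeded.
const COMPILED = precompile(colsum, (Vector{Float64},))

end # module

println(PrecompileSketch.COMPILED)
println(PrecompileSketch.colsum([1.0, 2.0, 3.0]))
```

Each such directive covers one method signature only, which is why the trade-off described above matters: every extra entry adds work at precompile and load time, so rarely used signatures are poor candidates.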

Here are timings for operations I have chosen that are not included in the precompilation statements for the 1.4 release:

DataFrames.jl 1.3.6

julia> @time using DataFrames
  1.887191 seconds (3.37 M allocations: 232.080 MiB, 4.77% gc time)

julia> df = DataFrame(rand(10, 10), :auto);

julia> @time using DataFrames
  1.637969 seconds (3.37 M allocations: 232.055 MiB, 5.40% gc time)

julia> @time df = DataFrame(rand(10, 10), :auto);
  0.008176 seconds (2.24 k allocations: 131.895 KiB, 98.99% compilation time)

julia> @time select(df, :x2, Not(:x2));
  0.341841 seconds (182.34 k allocations: 9.533 MiB, 99.75% compilation time)

julia> @time combine(df, identity);
  0.189355 seconds (743.18 k allocations: 39.859 MiB, 4.95% gc time, 99.82% compilation time)

julia> @time leftjoin(df, df, on=:x3, makeunique=true);
  2.166899 seconds (2.12 M allocations: 103.064 MiB, 0.94% gc time, 99.96% compilation time)

julia> @time outerjoin(df, df, on=:x3, makeunique=true);
  0.040364 seconds (70.89 k allocations: 3.688 MiB, 99.30% compilation time)

julia> @time transform(df, :x1 => sum);
  0.186119 seconds (531.35 k allocations: 30.123 MiB, 99.68% compilation time)

julia> @time combine(groupby(df, :x4), :x1 => sum);
  1.430754 seconds (3.33 M allocations: 179.667 MiB, 3.18% gc time, 99.90% compilation time)

julia> @time select!(df, Not(r"x"));
  0.037020 seconds (92.96 k allocations: 4.885 MiB, 99.79% compilation time)

DataFrames.jl main

julia> @time using DataFrames
  1.997317 seconds (3.66 M allocations: 232.418 MiB, 3.50% gc time, 32.39% compilation time: 100% of which was recompilation)

julia> df = DataFrame(rand(10, 10), :auto);

julia> @time select(df, :x2, Not(:x2));
  0.479201 seconds (107.14 k allocations: 5.558 MiB, 99.92% compilation time)

julia> @time combine(df, identity);
  0.109145 seconds (294.37 k allocations: 15.965 MiB, 99.74% compilation time)

julia> @time leftjoin(df, df, on=:x3, makeunique=true);
  2.263144 seconds (1.75 M allocations: 83.771 MiB, 7.21% gc time, 99.96% compilation time)

julia> @time outerjoin(df, df, on=:x3, makeunique=true);
  0.053238 seconds (26.19 k allocations: 1.419 MiB, 99.40% compilation time: 20% of which was recompilation)

julia> @time transform(df, :x1 => sum);
  0.102747 seconds (93.63 k allocations: 5.107 MiB, 99.68% compilation time)

julia> @time combine(groupby(df, :x4), :x1 => sum);
  1.187194 seconds (1.20 M allocations: 64.261 MiB, 1.52% gc time, 99.94% compilation time)

julia> @time select!(df, Not(r"x"));
  0.028480 seconds (69.57 k allocations: 3.642 MiB, 99.80% compilation time)

Here are timings for operations I have chosen that are included in the precompilation statements for the 1.4 release:

DataFrames.jl 1.3.6

julia> using DataFrames, PooledArrays, Statistics

julia> @time begin
           df = DataFrame(a=[2, 5, 3, 1, 0], b=["a", "b", "c", "a", "b"], c=1:5,
                          p=PooledArray(["a", "b", "c", "a", "b"]),
                          q=[true, false, true, false, true],
                          f=Float64[2, 5, 3, 1, 0])
           describe(df)
           names(df[1, 1:2])
           sort(df, :a)
           combine(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           transform(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           groupby(df, :a)
           groupby(df, :q)
           groupby(df, :p)
           gdf = groupby(df, :b)
           combine(gdf, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           transform(gdf, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           innerjoin(df, df, on=:a, makeunique=true)
           innerjoin(df, df, on=:b, makeunique=true)
           innerjoin(df, df, on=:c, makeunique=true)
           outerjoin(df, df, on=:a, makeunique=true)
           outerjoin(df, df, on=:b, makeunique=true)
           outerjoin(df, df, on=:c, makeunique=true)
           semijoin(df, df, on=:a)
           semijoin(df, df, on=:b)
           semijoin(df, df, on=:c)
           leftjoin!(df, DataFrame(a=[2, 5, 3, 1, 0]), on=:a)
           leftjoin!(df, DataFrame(b=["a", "b", "c", "d", "e"]), on=:b)
           leftjoin!(df, DataFrame(c=1:5), on=:c)
           reduce(vcat, [df, df])
           show(IOBuffer(), df)
           subset(df, :q)
           @view df[1:3, :]
           @view df[:, 1:2]
           select!(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
       end;
 13.120280 seconds (28.63 M allocations: 1.457 GiB, 2.40% gc time, 99.84% compilation time)

DataFrames.jl main

julia> using DataFrames, PooledArrays, Statistics

julia> @time begin
           df = DataFrame(a=[2, 5, 3, 1, 0], b=["a", "b", "c", "a", "b"], c=1:5,
                          p=PooledArray(["a", "b", "c", "a", "b"]),
                          q=[true, false, true, false, true],
                          f=Float64[2, 5, 3, 1, 0])
           describe(df)
           names(df[1, 1:2])
           sort(df, :a)
           combine(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           transform(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           groupby(df, :a)
           groupby(df, :q)
           groupby(df, :p)
           gdf = groupby(df, :b)
           combine(gdf, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           transform(gdf, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           innerjoin(df, df, on=:a, makeunique=true)
           innerjoin(df, df, on=:b, makeunique=true)
           innerjoin(df, df, on=:c, makeunique=true)
           outerjoin(df, df, on=:a, makeunique=true)
           outerjoin(df, df, on=:b, makeunique=true)
           outerjoin(df, df, on=:c, makeunique=true)
           semijoin(df, df, on=:a)
           semijoin(df, df, on=:b)
           semijoin(df, df, on=:c)
           leftjoin!(df, DataFrame(a=[2, 5, 3, 1, 0]), on=:a)
           leftjoin!(df, DataFrame(b=["a", "b", "c", "d", "e"]), on=:b)
           leftjoin!(df, DataFrame(c=1:5), on=:c)
           reduce(vcat, [df, df])
           show(IOBuffer(), df)
           subset(df, :q)
           @view df[1:3, :]
           @view df[:, 1:2]
           select!(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
       end;
  9.799926 seconds (2.06 M allocations: 96.399 MiB, 0.25% gc time, 99.82% compilation time: 15% of which was recompilation)
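The timings in this comparison are first-call timings, so they are dominated by compilation rather than by the work itself. A minimal sketch of how the two can be separated in a fresh session is shown here; `square_sum` and `data` are hypothetical names, and the example deliberately uses only Base so that it runs without DataFrames.jl installed.

```julia
# Hypothetical stand-in for a DataFrames.jl operation.
square_sum(v) = sum(abs2, v)

data = rand(1_000)

# Timed at top level on purpose: the first call has to compile `square_sum`
# for Vector{Float64}, so this measurement includes compilation time.
first_call = @elapsed square_sum(data)

# The compiled code is now cached in the session, so this measures run time only.
second_call = @elapsed square_sum(data)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

The measurements are taken at top level rather than inside a helper function, because compiling a helper can also compile the call it wraps before the timer starts, which would hide the cost being measured.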

@bkamins bkamins added the ecosystem Issues in DataFrames.jl ecosystem label Sep 26, 2022
@bkamins bkamins added this to the 1.4 milestone Sep 26, 2022
@bkamins bkamins requested a review from nalimilan September 26, 2022 14:49

@nalimilan nalimilan left a comment

That's a much simpler way of handling precompilation!

How much does this add to the loading time compared with not precompiling anything?

subset(df, :q)
@view df[1:3, :]
@view df[:, 1:2]
select!(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)

Why select! rather than select or transform!?

@bkamins replied:

transform was called above, and it calls select internally. I could use transform! - it should not be that different. I will change it.

bkamins commented Sep 26, 2022

No precompilation

First call

julia> @time using DataFrames
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
  4.837268 seconds (2.82 M allocations: 163.831 MiB, 0.71% gc time, 18.74% compilation time: 95% of which was recompilation)

Next calls

julia> @time using DataFrames
  1.218950 seconds (2.18 M allocations: 131.630 MiB, 1.96% gc time, 57.11% compilation time: 100% of which was recompilation)

With precompilation

First call

julia> @time using DataFrames
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
 18.193496 seconds (4.29 M allocations: 264.117 MiB, 0.32% gc time, 4.30% compilation time: 95% of which was recompilation)

Next calls

julia> @time using DataFrames
  1.991196 seconds (3.65 M allocations: 231.916 MiB, 3.51% gc time, 32.72% compilation time: 100% of which was recompilation)
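The load-time cost asked about in the review can be read off these four measurements with a little arithmetic. The sketch below only restates the numbers reported in this comment; the variable names are made up for the example.

```julia
# Timings in seconds, copied from the measurements in this comment.
first_load_without = 4.837268   # first `using`, no precompilation statements
first_load_with    = 18.193496  # first `using`, with precompilation statements
later_load_without = 1.218950   # later `using`, no precompilation statements
later_load_with    = 1.991196   # later `using`, with precompilation statements

# One-off cost paid when the package itself is precompiled.
extra_first_load = first_load_with - first_load_without

# Recurring cost paid on every later `using DataFrames`.
extra_later_load = later_load_with - later_load_without
later_load_ratio = later_load_with / later_load_without

println("extra on first load:  ", round(extra_first_load; digits=2), " s")
println("extra on later loads: ", round(extra_later_load; digits=2), " s")
println("later load ratio:     ", round(later_load_ratio; digits=2))
```

In other words, the precompilation statements cost about 13.4 s once, when the package is precompiled, and about 0.77 s on each later load.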

bkamins commented Sep 26, 2022

@nalimilan - I have also checked that adding more statements to the precompilation code does not add much to the first load time, but does improve things later.
So, if you have anything more in mind to add, please comment and I will benchmark it and add it if it is beneficial.

bkamins commented Sep 28, 2022

@nalimilan - given that there are no suggestions, I would merge this. We can always change the list of precompiled methods later, since doing so is non-breaking.

bkamins commented Sep 29, 2022

Thank you!

@bkamins bkamins merged commit 7c1a888 into main Sep 29, 2022
@bkamins bkamins deleted the bk/precompilation branch September 29, 2022 07:56
Linked issue: update precompilation for 1.4 release