fix deleteat! and subset! performance #3249

Merged
merged 3 commits into from
Dec 18, 2022

bkamins
Member

@bkamins bkamins commented Dec 15, 2022

Explanation in https://discourse.julialang.org/t/learning-to-benchmark-and-find-the-best-function-to-select-a-subset-of-a-dataframe/91704/12.

Benchmarks

This PR

julia> using DataFrames, Random

julia> Random.seed!(1234)
TaskLocalRNG()

julia> df = DataFrame(rand(10^6, 100), :auto);

julia> df.id = rand(1:100, 10^6);

julia> inds = rand(Bool, 10^6);

julia> x = copy(df); @time deleteat!(x, inds);
  0.084989 seconds (29.05 k allocations: 1.471 MiB, 11.83% compilation time)

julia> x = copy(df); @time deleteat!(x, inds);
  0.074356 seconds (305 allocations: 4.766 KiB)

julia> x = copy(df); @time subset!(x, :x1 => Returns(inds));
  0.091337 seconds (485 allocations: 262.422 KiB)

julia> x = copy(df); @time subset!(x, :x1 => Returns(inds));
  0.148983 seconds (485 allocations: 262.422 KiB)

julia> x = groupby(copy(df), :id); @time subset!(x, :x1 => x -> rand(Bool, length(x)));     
  0.207012 seconds (1.15 M allocations: 63.651 MiB, 31.11% compilation time)

julia> x = groupby(copy(df), :id); @time subset!(x, :x1 => x -> rand(Bool, length(x)));     
  0.216079 seconds (1.15 M allocations: 63.632 MiB, 30.41% compilation time)

1.4.4 release

julia> using DataFrames, Random

julia> Random.seed!(1234)
TaskLocalRNG()

julia> df = DataFrame(rand(10^6, 100), :auto);

julia> df.id = rand(1:100, 10^6);

julia> inds = rand(Bool, 10^6);

julia> x = copy(df); @time deleteat!(x, inds);
  0.300327 seconds (307 allocations: 3.817 MiB)

julia> x = copy(df); @time deleteat!(x, inds);
  0.303013 seconds (307 allocations: 3.817 MiB)

julia> x = copy(df); @time subset!(x, :x1 => Returns(inds));
  0.307057 seconds (483 allocations: 4.074 MiB)

julia> x = copy(df); @time subset!(x, :x1 => Returns(inds));
  0.300665 seconds (483 allocations: 4.074 MiB)

julia> x = groupby(copy(df), :id); @time subset!(x, :x1 => x -> rand(Bool, length(x)));
  0.437468 seconds (1.15 M allocations: 67.427 MiB, 15.85% compilation time)

julia> x = groupby(copy(df), :id); @time subset!(x, :x1 => x -> rand(Bool, length(x)));
  0.417350 seconds (1.15 M allocations: 67.425 MiB, 15.70% compilation time)
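To make the comparison easier to read, the second-run timings from the two transcripts above can be turned into speedup ratios. The snippet below is only a convenience sketch: the numbers are copied from the transcripts, while the tuple and variable names are made up for illustration. Single `@time` runs are noisy (the two `subset!` runs on this PR differ noticeably), and the grouped timings include compilation time, so the ratios are rough.

```julia
# Second-run timings in seconds, copied from the transcripts above.
# Field names are illustrative, not part of any API.
pr      = (deleteat = 0.074356, subset = 0.148983, grouped_subset = 0.216079)
release = (deleteat = 0.303013, subset = 0.300665, grouped_subset = 0.417350)

# Element-wise ratio release / PR; values above 1 mean this PR is faster.
speedup = map(/, release, pr)

for (name, ratio) in pairs(speedup)
    println(rpad(string(name), 16), round(ratio; digits=2), "x")
end
```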

@bkamins bkamins requested a review from nalimilan December 15, 2022 19:27
@bkamins bkamins added this to the patch milestone Dec 15, 2022
@nalimilan
Member

Interesting. Have you checked with more columns and with a lower percentage of dropped rows? I would expect the findall approach to be less slow (and maybe faster) in these cases.

@bkamins
Member Author

bkamins commented Dec 17, 2022

> I would expect the findall approach to be less slow (and maybe faster) in these cases.

You are right! 🧠

Empirically, the findall approach is faster when fewer than 5% of observations are dropped (the exact threshold probably also depends on the number of columns, but I wanted something relatively simple). I have proposed an adaptive algorithm that switches between the two approaches as needed.
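For illustration, the switching idea can be sketched on a single column as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `adaptive_deleteat!` and its `threshold` keyword are hypothetical names, and the 5% default simply mirrors the empirical value mentioned above.

```julia
# Hypothetical helper, NOT the DataFrames.jl implementation.
# Deletes the entries of `v` flagged `true` in `mask`, choosing the strategy
# based on the fraction of entries to drop.
function adaptive_deleteat!(v::Vector, mask::AbstractVector{Bool};
                            threshold::Real=0.05)
    length(v) == length(mask) ||
        throw(ArgumentError("mask must have the same length as the vector"))
    if count(mask) < threshold * length(mask)
        # Few entries dropped: materialize the (short) list of integer indices.
        deleteat!(v, findall(mask))
    else
        # Many entries dropped: pass the Bool mask directly and avoid
        # allocating a long index vector.
        deleteat!(v, mask)
    end
    return v
end

v = collect(1:10)
adaptive_deleteat!(v, isodd.(1:10))   # drops 50% of entries, takes the mask branch
```

For a data frame the same decision would be made once and then applied to every column, which is why the number of columns may also affect where the break-even point lies.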

@nalimilan
Member

Cool. Can you check when there are many columns? That's a use case that we care about.

@bkamins
Member Author

bkamins commented Dec 17, 2022

Here is an example: 100 columns, 10^6 rows, tested with 5.5% of rows dropped (so slightly above the 5% threshold).

Setup:

using DataFrames, Random

df = DataFrame(rand(10^6, 100), :auto)

This PR:

julia> t = 0.055;

julia> Random.seed!(1234);

julia> idx = rand(10^6) .< t;

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.095975 seconds (303 allocations: 4.734 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.096654 seconds (303 allocations: 4.734 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.094744 seconds (303 allocations: 4.734 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.094374 seconds (303 allocations: 4.734 KiB)

Current release:

julia> t = 0.055;

julia> Random.seed!(1234);

julia> idx = rand(10^6) .< t;

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.092951 seconds (304 allocations: 432.203 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.087335 seconds (304 allocations: 432.203 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.091471 seconds (304 allocations: 432.203 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.092145 seconds (304 allocations: 432.203 KiB)

I did some more tests on even wider tables and it seems that a more precise threshold is 6% on my laptop, so I changed it to that value.
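A rough way to probe such a threshold is to time both strategies over a range of drop fractions. The sketch below is an assumption-laden illustration rather than the benchmark actually used: `compare_strategies` is a hypothetical helper, it works on plain vectors standing in for data frame columns, and the resulting timings depend on the machine and on the table shape.

```julia
using Random

# Hypothetical helper: times index-based and mask-based deletion on
# `ncols` independent columns of length `nrows`, dropping roughly `frac` of rows.
function compare_strategies(nrows::Int, ncols::Int, frac::Real;
                            rng=Random.default_rng())
    mask = rand(rng, nrows) .< frac
    cols = [rand(rng, nrows) for _ in 1:ncols]

    a = deepcopy(cols)
    t_findall = @elapsed begin
        idx = findall(mask)                 # computed once, reused per column
        foreach(c -> deleteat!(c, idx), a)
    end

    b = deepcopy(cols)
    t_mask = @elapsed foreach(c -> deleteat!(c, mask), b)

    @assert a == b                          # both strategies must agree
    return (; frac, t_findall, t_mask)
end

compare_strategies(10^3, 2, 0.5)            # warm-up call to exclude compilation
for frac in (0.01, 0.05, 0.06, 0.1, 0.5)
    println(compare_strategies(10^5, 10, frac))
end
```

Increasing `nrows` and `ncols` toward the sizes used in the transcripts gives more stable numbers at the cost of memory.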

@nalimilan
Member

OK, great!

@bkamins bkamins merged commit b240458 into main Dec 18, 2022
@bkamins bkamins deleted the bk/deleteat branch December 18, 2022 14:13
@bkamins
Member Author

bkamins commented Dec 18, 2022

Thank you!

2 participants