add _findall for AbstractVector{Bool} and use it in internal functions #2769

Merged: 8 commits, May 28, 2021

@bkamins
Member

bkamins commented May 21, 2021

Fixes #2765

@bkamins
Member Author

bkamins commented May 21, 2021

Timings look acceptable:

julia> BD = OrderedDict(
           "T Big" => trues(100000),
           "F Big" => falses(100000),
           "T64 F64" => [trues(64); falses(64)],
           "F64 T64" => [falses(64); trues(64)],
           "F80 T100" => [falses(85); trues(100)],
           "F256 T32" => [falses(256); trues(32)],
           "F260 T32" => [falses(260); trues(32)],
           "TF Big" => [trues(100000); falses(100000)],
           "FT Big" => [falses(100000);trues(100000)] ,

           # some edge cases
           "TFT small" => [trues(85); falses(100); trues(85)],
           "FTFFT small" => [falses(64 + 32); trues(32); falses(128); trues(32)],
           "TFTF small" => [falses(64); trues(64); falses(64); trues(64)],
           "TFT small" => [trues(64); falses(10); trues(100)],

           "FTF Big" => [falses(8500); trues(100000); falses(65000)],
           "TFT Big" => [trues(8500); falses(100000); trues(65000)],
           "FTFTFTF Big" => [falses(65000); trues(65000); falses(65000); trues(65000); falses(65000); trues(65000); falses(65000)],

           "FTFR small" => [falses(85); trues(100); falses(65); rand([true, false], 20)],
           "R Big" => BitVector(rand([true, false], 200000)),
           "RF Big" => [BitVector(rand([true, false], 100000)) ; falses(100000)],
           "RT Big" => [BitVector(rand([true, false], 100000)) ; trues(100000)],
           "FTFR Big" => [falses(65000);  trues(65000);  falses(65000); rand([true, false], 20000)],
           "T256 R100" => [trues(256);  rand([true, false], 100)],
           "F256 R100" => [falses(256); rand([true, false], 100)],
       );

julia> for (l, B) in BD
           println("Processing $l")
           Bv = Vector{Bool}(B)
           @show Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv)
           @btime Base.findall($B)
           @btime DataFrames._findall($B)
           @btime Base.findall($Bv)
           @btime DataFrames._findall($Bv)
       end
Processing T Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  35.020 μs (2 allocations: 781.33 KiB)
  305.137 ns (0 allocations: 0 bytes)
  103.260 μs (2 allocations: 781.33 KiB)
  1.587 μs (0 allocations: 0 bytes)
Processing F Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  331.556 ns (1 allocation: 80 bytes)
  305.000 ns (0 allocations: 0 bytes)
  76.873 μs (1 allocation: 80 bytes)
  1.588 μs (0 allocations: 0 bytes)
Processing T64 F64
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  82.086 ns (1 allocation: 624 bytes)
  11.036 ns (0 allocations: 0 bytes)
  148.570 ns (1 allocation: 624 bytes)
  51.768 ns (0 allocations: 0 bytes)
Processing F64 T64
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  81.473 ns (1 allocation: 624 bytes)
  10.787 ns (0 allocations: 0 bytes)
  148.589 ns (1 allocation: 624 bytes)
  73.973 ns (0 allocations: 0 bytes)
Processing F80 T100
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  112.127 ns (1 allocation: 896 bytes)
  15.300 ns (0 allocations: 0 bytes)
  209.357 ns (1 allocation: 896 bytes)
  98.547 ns (0 allocations: 0 bytes)
Processing F256 T32
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  55.576 ns (1 allocation: 336 bytes)
  14.545 ns (0 allocations: 0 bytes)
  260.235 ns (1 allocation: 336 bytes)
  275.608 ns (0 allocations: 0 bytes)
Processing F260 T32
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  55.171 ns (1 allocation: 336 bytes)
  14.550 ns (0 allocations: 0 bytes)
  265.828 ns (1 allocation: 336 bytes)
  281.676 ns (0 allocations: 0 bytes)
Processing TF Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  79.269 μs (2 allocations: 781.33 KiB)
  1.399 μs (0 allocations: 0 bytes)
  180.256 μs (2 allocations: 781.33 KiB)
  53.601 μs (0 allocations: 0 bytes)
Processing FT Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  116.253 μs (2 allocations: 781.33 KiB)
  2.001 μs (0 allocations: 0 bytes)
  200.459 μs (2 allocations: 781.33 KiB)
  112.174 μs (0 allocations: 0 bytes)
Processing TFT small
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  176.293 ns (1 allocation: 1.45 KiB)
  138.606 ns (3 allocations: 1.53 KiB)
  233.015 ns (1 allocation: 1.45 KiB)
  192.385 ns (1 allocation: 1.45 KiB)
Processing FTFFT small
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  83.331 ns (1 allocation: 624 bytes)
  89.369 ns (3 allocations: 704 bytes)
  299.215 ns (1 allocation: 624 bytes)
  261.375 ns (1 allocation: 624 bytes)
Processing TFTF small
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  139.438 ns (1 allocation: 1.14 KiB)
  124.876 ns (5 allocations: 1.30 KiB)
  288.127 ns (1 allocation: 1.14 KiB)
  234.467 ns (1 allocation: 1.14 KiB)
Processing FTF Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  85.502 μs (2 allocations: 781.33 KiB)
  1.439 μs (0 allocations: 0 bytes)
  182.573 μs (2 allocations: 781.33 KiB)
  66.555 μs (0 allocations: 0 bytes)
Processing TFT Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  67.038 μs (2 allocations: 574.33 KiB)
  55.845 μs (2032 allocations: 653.62 KiB)
  170.520 μs (2 allocations: 574.33 KiB)
  126.508 μs (2 allocations: 574.33 KiB)
Processing FTFTFTF Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  176.434 μs (2 allocations: 1.49 MiB)
  129.894 μs (4062 allocations: 1.64 MiB)
  459.241 μs (2 allocations: 1.49 MiB)
  360.406 μs (2 allocations: 1.49 MiB)
Processing FTFR small
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  123.609 ns (1 allocation: 1008 bytes)
  97.650 ns (3 allocations: 1.06 KiB)
  307.919 ns (1 allocation: 1008 bytes)
  251.489 ns (1 allocation: 1008 bytes)
Processing R Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  108.669 μs (2 allocations: 780.33 KiB)
  85.125 μs (2 allocations: 780.33 KiB)
  913.936 μs (2 allocations: 780.33 KiB)
  771.246 μs (2 allocations: 780.33 KiB)
Processing RF Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  57.315 μs (2 allocations: 390.83 KiB)
  42.894 μs (2 allocations: 390.83 KiB)
  563.630 μs (2 allocations: 390.83 KiB)
  439.797 μs (2 allocations: 390.83 KiB)
Processing RT Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  149.267 μs (2 allocations: 1.14 MiB)
  113.986 μs (3126 allocations: 1.26 MiB)
  564.277 μs (2 allocations: 1.14 MiB)
  460.027 μs (2 allocations: 1.14 MiB)
Processing FTFR Big
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  70.646 μs (2 allocations: 586.58 KiB)
  44.032 μs (4 allocations: 586.66 KiB)
  273.804 μs (2 allocations: 586.58 KiB)
  252.519 μs (2 allocations: 586.58 KiB)
Processing T256 R100
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  446.406 ns (1 allocation: 2.62 KiB)
  364.292 ns (3 allocations: 2.70 KiB)
  581.854 ns (1 allocation: 2.62 KiB)
  487.621 ns (1 allocation: 2.62 KiB)
Processing F256 R100
Base.findall(B) == DataFrames._findall(B) == DataFrames._findall(Bv) = true
  76.712 ns (1 allocation: 544 bytes)
  69.048 ns (1 allocation: 544 bytes)
  332.305 ns (1 allocation: 544 bytes)
  381.655 ns (1 allocation: 544 bytes)
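
The zero-allocation rows above are the cases where all `true` values form one contiguous block, so the result can be returned as a range instead of a freshly allocated vector. A minimal sketch of that idea, assuming 1-based indexing; the name `findall_sketch` is hypothetical and this is not the code merged in this PR:

```julia
# Hypothetical illustration, not DataFrames._findall itself.
# Returns a UnitRange when the trues are contiguous (or absent),
# and a Vector{Int} otherwise.
function findall_sketch(B::AbstractVector{Bool})::Union{UnitRange{Int}, Vector{Int}}
    first_true = 0   # index of the first true, 0 while none has been seen
    last_true = 0    # index of the last true seen so far
    ntrue = 0        # total number of trues
    for i in eachindex(B)
        if B[i]
            ntrue += 1
            first_true == 0 && (first_true = i)
            last_true = i
        end
    end
    # fast paths: no allocation of an index vector
    ntrue == 0 && return 1:0
    last_true - first_true + 1 == ntrue && return first_true:last_true
    # slow path: collect the indices into a Vector{Int}
    I = Vector{Int}(undef, ntrue)
    k = 0
    for i in first_true:last_true
        if B[i]
            k += 1
            I[k] = i
        end
    end
    return I
end

findall_sketch([false, true, true, false])  # 2:3
findall_sketch([true, false, true])         # [1, 3]
```

The sketch makes two passes in the slow path for simplicity; the benchmarks above suggest the real implementation is tuned more carefully, in particular for `BitVector`.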

@pstorozenko
Contributor

Nice, I had no time to look at it during the week.
I wanted to suggest checking where _findall could be used internally; good that you looked into it.
Maybe add this information to the pull request name?

@bkamins
Member Author

bkamins commented May 22, 2021

> Maybe add this information to the pull request name?

Yes. I just have not had time to finish these changes (as you can see, the PR is failing and I have not run the benchmarks yet). I am working on it right now.

@bkamins bkamins changed the title from "add _findall for AbstractVector{Bool}" to "add _findall for AbstractVector{Bool} and use it in internal functions" May 22, 2021
@bkamins
Member Author

bkamins commented May 22, 2021

Benchmarks of the changed code paths (all timings are after compilation; I mostly focus on checking that we do not regress).
In general things look good (except for one case, on which I comment).

This PR:

julia> f(row) = row.x > 0.5; df = DataFrame(x = 0:0.0000001:1);

julia> @time filter!(AsTable(:x) => f, df);
  0.055787 seconds (42 allocations: 77.493 MiB)

julia> df = DataFrame(x = rand(10^6));

julia> @time filter!(AsTable(:x) => f, df);
  0.010720 seconds (43 allocations: 11.571 MiB)

julia> @time unique!(df);
  0.027130 seconds (20 allocations: 12.290 MiB)

julia> df = DataFrame(x = rand(1:10^4, 10^6));

julia> @time unique!(df);
  0.025073 seconds (41 allocations: 32.137 MiB, 5.88% gc time)

julia> df = DataFrame(rand(4, 1000), :auto);

julia> @time df[[true, true, true, true], 1:end]; # this and the case below are the only regressions
  0.000262 seconds (2.99 k allocations: 206.812 KiB)

julia> @time df[[true, true, true, true], :]; # this is the second regression - I comment on it in the code
  0.000244 seconds (2.99 k allocations: 206.469 KiB)

julia> df = DataFrame(rand(2*10^6, 64), :auto);

julia> @time df[trues(nrow(df)), 1:end];
  0.178173 seconds (674 allocations: 976.846 MiB)

julia> @time df[trues(nrow(df)), :];
  0.168878 seconds (670 allocations: 976.845 MiB)

julia> df = DataFrame(x = rand(10^6));

julia> @time delete!(df, trues(10^6));
  0.000069 seconds (7 allocations: 122.328 KiB)

julia> df1 = DataFrame(x = 1:10^6);

julia> df2 = DataFrame(x = 10^5:10^6+10^5);

julia> @btime outerjoin($df1, $df2, on=:x);
  20.150 ms (259 allocations: 23.91 MiB)

julia> @btime outerjoin($df1, $df2, on=:x);
  20.352 ms (259 allocations: 23.91 MiB)

main:

julia> f(row) = row.x > 0.5; df = DataFrame(x = 0:0.0000001:1);

julia> @time filter!(AsTable(:x) => f, df);
  0.093865 seconds (48 allocations: 115.640 MiB, 7.79% gc time)

julia> df = DataFrame(x = rand(10^6));

julia> @time filter!(AsTable(:x) => f, df);
  0.012052 seconds (48 allocations: 11.567 MiB)

julia> @time unique!(df);
  0.023995 seconds (18 allocations: 12.296 MiB)

julia> df = DataFrame(x = rand(1:10^4, 10^6));

julia> @time unique!(df);
  0.025348 seconds (39 allocations: 32.137 MiB)

julia> df = DataFrame(rand(4, 1000), :auto);

julia> @time df[[true, true, true, true], 1:end];
  0.000268 seconds (1.99 k allocations: 175.672 KiB)

julia> @time df[[true, true, true, true], :];
  0.000250 seconds (1.99 k allocations: 175.328 KiB)

julia> df = DataFrame(rand(2*10^6, 64), :auto);

julia> @time df[trues(nrow(df)), 1:end];
  0.195522 seconds (614 allocations: 992.104 MiB)

julia> @time df[trues(nrow(df)), :];
  0.180493 seconds (609 allocations: 992.102 MiB)

julia> df = DataFrame(x = rand(10^6));

julia> @time delete!(df, trues(10^6));
  0.002477 seconds (8 allocations: 7.749 MiB)

julia> df1 = DataFrame(x = 1:10^6);

julia> df2 = DataFrame(x = 10^5:10^6+10^5);

julia> @btime outerjoin($df1, $df2, on=:x);
  20.936 ms (270 allocations: 25.43 MiB)

julia> @btime outerjoin($df1, $df2, on=:x);
  20.892 ms (270 allocations: 25.43 MiB)

@bkamins bkamins requested a review from nalimilan May 22, 2021 14:44
@@ -510,6 +510,28 @@ function Base.getindex(df::DataFrame, ::typeof(!), col_ind::SymbolOrString)
end

# df[MultiRowIndex, MultiColumnIndex] => DataFrame

function _threaded_getindex(selected_rows::AbstractVector,
Member Author

This is the change that leads to the regression.
The regression is minimal (microsecond level) and shows up only when there are very many columns. Essentially, I have removed the code duplication we previously had for the special case where : is used for column selection (now : falls back to the general column selector). This leads to a slight regression, but since it is very small I think it is OK to accept it.

src/join/composer.jl (outdated, resolved)
Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins
Member Author

bkamins commented May 23, 2021

@pstorozenko - OK to merge, or do you want to have another look at it?

@pstorozenko
Contributor

Sure, merge it, I'm just looking around and trying to learn something :)

@bkamins
Member Author

bkamins commented May 23, 2021

@pstorozenko - thank you. Actually, before merging your PR I checked whether findfirst and findnext could be used in it, but using them led to performance regressions compared to your code (using findfirst and findnext would have reduced the complexity of your PR a lot).
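
For illustration, a variant built on `findfirst` and `findnext` could look roughly like the sketch below. The function name is hypothetical and this is an assumption about what such a variant might look like, not the code that was actually benchmarked:

```julia
# Hypothetical sketch of a findfirst/findnext-based variant.
# Returns a range when the trues are contiguous, otherwise falls back to findall.
function findall_findnext_sketch(B::AbstractVector{Bool})
    start = findfirst(B)                    # index of the first true
    start === nothing && return 1:0         # no trues at all
    next_false = findnext(!, B, start)      # first false at or after start
    next_false === nothing && return start:lastindex(B)
    # the run of trues ends just before next_false;
    # it is the whole answer only if no true follows it
    findnext(B, next_false) === nothing && return start:(next_false - 1)
    return findall(B)                       # general case
end

findall_findnext_sketch([false, true, true, false])  # 2:3
```

The code is short, which matches the remark about reduced complexity; the sketch says nothing about its speed relative to the hand-written version.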

@pstorozenko
Contributor

I didn't know that such functions existed, but after seeing them here I thought about using them in the BitVector version. Good to know that you've already checked it.

src/other/utils.jl (outdated, resolved)
src/other/utils.jl (outdated, resolved)

# slow path returning Vector{Int}
I = Vector{Int}(undef, nnzB)
I[1:stop - start + 1] .= start:stop
Member Author

I will hold off on merging until @mbauman comments on #2771 about the allocations induced by broadcasting in such scenarios.

Member Author

I have also changed this to use a loop, for consistency with #2771.
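
To make the two forms concrete, here is a minimal sketch of filling the head of an index vector with `start:stop`, once with the broadcast assignment quoted above and once with an explicit loop. The helper names are hypothetical, and the sketch only shows that both forms produce the same result; it makes no claim about their allocations:

```julia
# Hypothetical helpers; both write start:stop into the first entries of I.
function fill_broadcast!(I::Vector{Int}, start::Int, stop::Int)
    I[1:stop - start + 1] .= start:stop   # broadcast assignment, as in the quoted line
    return I
end

function fill_loop!(I::Vector{Int}, start::Int, stop::Int)
    k = 0
    for v in start:stop                   # explicit loop writing one element at a time
        k += 1
        I[k] = v
    end
    return I
end

a = fill_broadcast!(zeros(Int, 8), 3, 7)
b = fill_loop!(zeros(Int, 8), 3, 7)
a == b  # true
```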

src/other/utils.jl (outdated, resolved)
@bkamins
Member Author

bkamins commented May 26, 2021

I have added a news entry. It is not super precise, but I think we do not need to go into all details of when the performance is improved.

@pstorozenko
Contributor

I think the discussion in these two PRs is descriptive enough to understand the source of the performance gain.

@@ -246,7 +248,7 @@ function _findall(B::BitVector)::Union{UnitRange{Int}, Vector{Int}}
 end
 if c == ~UInt64(0)
 if stop != -1
-I = Vector{Int}(undef, nnzB)
+I = Vector{Int}(undef,nnzB)
 I[1:i-1] .= start:stop
Member

Broadcasting is still used here. Is it really a problem to use it in general, though? This function should be compiled only once for Vector{Bool} and precompilation should work well, so latency probably doesn't matter a lot.

Contributor

It's already fixed in main, #2771.
The problem is not with latency but with allocations.

Member Author

@nalimilan - in particular #2771 (comment) shows that something bad is happening with the Julia compiler in this case. Since the loop is not much longer and is a "safe choice", I accepted #2771.

Anyway - are you OK with merging this PR? (as the issue you asked about is unrelated) Thank you!
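
For context on the `c == ~UInt64(0)` test in the hunk above: a `BitVector` stores its bits in 64-bit chunks, and a chunk equal to `~UInt64(0)` has all 64 bits set. The sketch below uses that to detect a single contiguous run of trues chunk by chunk. It is a simplified illustration, not the merged `_findall`: the function name is hypothetical, it relies on the internal `chunks` field of `BitVector`, and it returns `nothing` instead of building a `Vector{Int}` when the trues are not contiguous.

```julia
# Hypothetical sketch: return the range of trues if they form one contiguous
# run (1:0 if there are none), or nothing if they are scattered.
function contiguous_trues(B::BitVector)
    chunks = B.chunks                 # internal field: Vector{UInt64}
    total = count(B)                  # total number of trues
    total == 0 && return 1:0
    k = findfirst(!iszero, chunks)    # first chunk containing a true
    c = chunks[k]
    tz = trailing_zeros(c)            # falses before the first true in this chunk
    start = 64 * (k - 1) + tz + 1
    run = trailing_ones(c >> tz)      # length of the run inside this chunk
    if tz + run == 64                 # run reaches the end of the chunk
        k += 1
        while k <= length(chunks) && chunks[k] == ~UInt64(0)
            run += 64                 # a fully set chunk extends the run by 64
            k += 1
        end
        if k <= length(chunks)
            run += trailing_ones(chunks[k])  # partial chunk ending the run
        end
    end
    return run == total ? (start:start + run - 1) : nothing
end

contiguous_trues([falses(70); trues(100); falses(30)])  # 71:170
```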

@bkamins bkamins merged commit 0b4f458 into main May 28, 2021
@bkamins bkamins deleted the _findall-for-AbstractVector{Bool} branch May 28, 2021 07:07
@bkamins
Member Author

bkamins commented May 28, 2021

Thank you!

Successfully merging this pull request may close these issues.

Improve SubDataFrame creation for AbstractVector{Bool}
3 participants