Significant regression of groupby when threading #2735

Closed
bkamins opened this issue Apr 25, 2021 · 5 comments · Fixed by #2736

@bkamins
Member

bkamins commented Apr 25, 2021

Here are timings:

$ julia -e "using DataFrames, BenchmarkTools; n=100_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)"
  178.478 ms (80 allocations: 762.94 MiB)

$ julia -t 2 -e "using DataFrames, BenchmarkTools; n=100_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)"
  450.070 ms (681 allocations: 763.00 MiB)

$ julia -t 4 -e "using DataFrames, BenchmarkTools; n=100_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)"
  525.876 ms (682 allocations: 763.00 MiB)

$ julia -t 8 -e "using DataFrames, BenchmarkTools; n=100_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)"
  458.003 ms (681 allocations: 763.00 MiB)

I think we have a problem with your new macro, @nalimilan. Maybe it splits the data into chunks that are too small? (A rough illustration of that suspicion is sketched below.) This is hard 😞. Can you please look into it, since you implemented this part? Otherwise I can check - please let me know.
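
A rough, hypothetical illustration of the "too small chunks" suspicion (a standalone sketch, not the DataFrames.jl code): when each spawned task does very little work, task-scheduling overhead dominates the runtime.

using Base.Threads: @spawn

# Sum `x` by spawning one task per chunk of `chunksize` elements.
function sum_chunked(x, chunksize)
    tasks = [@spawn(sum(view(x, idx)))
             for idx in Iterators.partition(eachindex(x), chunksize)]
    return sum(fetch, tasks)
end

# sum_chunked(rand(10^7), 1_000)      # many tiny tasks: mostly overhead
# sum_chunked(rand(10^7), 2_500_000)  # a few large tasks: overhead is amortized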

@bkamins bkamins added this to the patch milestone Apr 25, 2021
@bkamins
Member Author

bkamins commented Apr 25, 2021

The same with Int64:

$ julia -e "using DataFrames, BenchmarkTools; n=100_000_000; df=DataFrame(passband=Int64.(rand(Int8,n))); @btime groupby($df,:passband)"
  270.989 ms (80 allocations: 762.94 MiB)

$ julia -t 2 -e "using DataFrames, BenchmarkTools; n=100_000_000; df=DataFrame(passband=Int64.(rand(Int8,n))); @btime groupby($df,:passband)"
  374.933 ms (682 allocations: 763.00 MiB)

$ julia -t 4 -e "using DataFrames, BenchmarkTools; n=100_000_000; df=DataFrame(passband=Int64.(rand(Int8,n))); @btime groupby($df,:passband)"
  593.278 ms (682 allocations: 763.00 MiB)

$ julia -t 8 -e "using DataFrames, BenchmarkTools; n=100_000_000; df=DataFrame(passband=Int64.(rand(Int8,n))); @btime groupby($df,:passband)"
  450.173 ms (682 allocations: 763.00 MiB)

@bkamins
Member Author

bkamins commented Apr 25, 2021

I have benchmarked hashrows first. The conclusions are:

  1. for small bits types (like Bool) threading leads to a slowdown (not a very big one, ~20%; changing the chunk size does not affect this)
  2. for things like String we get a speedup

So we could either leave things as they are, or add a check like isbitstype(eltype(v)) && sizeof(eltype(v)) <= 2 (whether the threshold should be 1, 2, 4, or 8 would need benchmarking) and skip threading in that case. A sketch of such a guard is below.
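
A minimal sketch of the guard suggested above (the helper name use_threading is hypothetical, and the 2-byte threshold is exactly the placeholder that would need benchmarking):

# Skip threading for narrow bits types, where hashing each element is too
# cheap to amortize the task-spawning overhead.
use_threading(v::AbstractVector) =
    Threads.nthreads() > 1 &&
    !(isbitstype(eltype(v)) && sizeof(eltype(v)) <= 2)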

@bkamins
Member Author

bkamins commented Apr 25, 2021

For row_group_slots the only scenario in which I could get a speedup from threading was a column containing missing combined with skipmissing=true, and even that speedup was small. In general, I currently cannot find a case where threading convincingly improves speed. Let us discuss what to do about it (maybe I am missing something).
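
The kind of case referred to above, as a hypothetical reproduction (not a measurement): a column containing missing values grouped with skipmissing=true.

using DataFrames, BenchmarkTools
n = 100_000_000
df = DataFrame(passband = ifelse.(rand(n) .< 0.1, missing, rand(Int8, n)))
@btime groupby($df, :passband; skipmissing=true)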

@bkamins
Member Author

bkamins commented Apr 26, 2021

First, a general problem: @inbounds annotations in our code do not propagate, because @spawn_for_chunks introduces a function barrier. This does not explain the regression by itself, but it slows down the code in general. A minimal illustration is below.
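
A minimal standalone sketch (unrelated to the DataFrames.jl internals) of why the function barrier matters: @inbounds at the call site does not propagate into a called function, so the inner loop keeps its own bounds checks.

function inner!(x)
    for i in eachindex(x)
        x[i] = i          # the caller's @inbounds has no effect on this access
    end
end

function outer!(x)
    @inbounds inner!(x)   # @inbounds stops at the function boundary
end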

Second, it seems that we do not distribute work among threads correctly. Here is an example with four threads (I have enabled printing of the thread id for each spawned chunk):

julia> n=20_000_000; df=DataFrame(passband=Int64.(rand(Int8,n)));

julia> groupby(df,:passband);
Threads.threadid() = 2
Threads.threadid() = 1
Threads.threadid() = 3
Threads.threadid() = 4
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2

julia> groupby(df,:passband);
Threads.threadid() = 1
Threads.threadid() = 2
Threads.threadid() = 3
Threads.threadid() = 4
Threads.threadid() = 1
Threads.threadid() = 3
Threads.threadid() = 4
Threads.threadid() = 1
Threads.threadid() = 2
Threads.threadid() = 1
Threads.threadid() = 3
Threads.threadid() = 4
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2

julia> groupby(df,:passband);
Threads.threadid() = 1
Threads.threadid() = 2
Threads.threadid() = 1
Threads.threadid() = 4
Threads.threadid() = 3
Threads.threadid() = 1
Threads.threadid() = 2
Threads.threadid() = 1
Threads.threadid() = 4
Threads.threadid() = 1
Threads.threadid() = 2
Threads.threadid() = 3
Threads.threadid() = 4
Threads.threadid() = 3
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2
Threads.threadid() = 2

As you can see, one thread (in this case thread 2) is overloaded, so the distribution of work among threads is not balanced.

Here is another example (printing is now disabled) showing the skew in the load distribution. Four threads:

julia> function f()
           x = zeros(Int8, 100_000_000)
           DataFrames.@spawn_for_chunks 1_000 for i in eachindex(x)
               @inbounds x[i] = Threads.threadid()
           end
           combine(groupby(DataFrame(x=x), :x), nrow)
       end
f (generic function with 1 method)

julia> f()
3×2 DataFrame
 Row │ x     nrow     
     │ Int8  Int64    
─────┼────────────────
   1 │    2  36481000
   2 │    3  31765000
   3 │    4  31754000

julia> f()
3×2 DataFrame
 Row │ x     nrow     
     │ Int8  Int64    
─────┼────────────────
   1 │    2  36296000
   2 │    3  31747000
   3 │    4  31957000

The same with 2 threads:

julia> f()
2×2 DataFrame
 Row │ x     nrow     
     │ Int8  Int64    
─────┼────────────────
   1 │    1  34189000
   2 │    2  65811000

julia> f()
2×2 DataFrame
 Row │ x     nrow     
     │ Int8  Int64    
─────┼────────────────
   1 │    1  27055000
   2 │    2  72945000

julia> f()
2×2 DataFrame
 Row │ x     nrow     
     │ Int8  Int64    
─────┼────────────────
   1 │    1  24047000
   2 │    2  75953000
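
For comparison, here is a minimal standalone sketch (not the @spawn_for_chunks implementation) that splits the index range into one roughly equal chunk per thread; it avoids the skew seen above at the cost of less dynamic scheduling:

using Base.Threads: @spawn, nthreads

# Fill `x` with the id of the thread that wrote each element,
# spawning exactly one task per roughly equal-sized chunk.
function fill_threadid!(x)
    chunksize = cld(length(x), nthreads())
    chunks = collect(Iterators.partition(eachindex(x), chunksize))
    tasks = map(chunks) do idx
        @spawn for i in idx
            @inbounds x[i] = Threads.threadid()
        end
    end
    foreach(wait, tasks)
    return x
end

# Usage mirroring f() above:
# combine(groupby(DataFrame(x = fill_threadid!(zeros(Int8, 100_000_000))), :x), nrow)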

@bkamins
Member Author

bkamins commented Apr 26, 2021

One more example of very strange behavior - note that repeated identical invocations give quite different timings:

~$ julia -e 'using DataFrames, BenchmarkTools; n=2_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)'
  2.029 ms (68 allocations: 15.26 MiB)
~$ julia -t 2 -e 'using DataFrames, BenchmarkTools; n=2_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)'
  7.255 ms (81 allocations: 15.26 MiB)
~$ julia -t 4 -e 'using DataFrames, BenchmarkTools; n=2_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)'
  6.597 ms (81 allocations: 15.26 MiB)
~$ julia -t 8 -e 'using DataFrames, BenchmarkTools; n=2_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)'
  2.813 ms (81 allocations: 15.26 MiB)
~$ julia -e 'using DataFrames, BenchmarkTools; n=2_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)'
  2.591 ms (68 allocations: 15.26 MiB)
~$ julia -t 2 -e 'using DataFrames, BenchmarkTools; n=2_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)'
  7.998 ms (81 allocations: 15.26 MiB)
~$ julia -t 4 -e 'using DataFrames, BenchmarkTools; n=2_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)'
  3.098 ms (81 allocations: 15.26 MiB)
~$ julia -t 8 -e 'using DataFrames, BenchmarkTools; n=2_000_000; df=DataFrame(passband=rand(Int8,n)); @btime groupby($df,:passband)'
  2.934 ms (81 allocations: 15.26 MiB)
