Threaded OpenBLAS is occasionally very slow #262

Closed
simonster opened this issue Sep 28, 2015 · 9 comments
Labels
performance Must go faster upstream The issue is with an upstream dependency, e.g. LLVM

Comments

@simonster (Member) commented Sep 28, 2015

If I keep calling X'X (which calls OpenBLAS dsyrk), it occasionally takes an exceptionally long time to complete:

julia> X = randn(100000, 1000);

julia> @time X'X;
 56.971655 seconds (632.63 k allocations: 37.585 MB, 0.01% gc time)

julia> @time X'X;
  1.510004 seconds (9 allocations: 7.630 MB)

julia> @time X'X;
  2.670606 seconds (9 allocations: 7.630 MB)

julia> @time X'X;
  1.956606 seconds (9 allocations: 7.630 MB, 0.10% gc time)

julia> @time X'X;
  2.269821 seconds (9 allocations: 7.630 MB)

julia> @time X'X;
  1.673705 seconds (9 allocations: 7.630 MB)

julia> @time X'X;
  1.843355 seconds (9 allocations: 7.630 MB, 0.04% gc time)

julia> @time X'X;
241.520904 seconds (9 allocations: 7.630 MB)

This doesn't seem to happen if I call Base.disable_threaded_libs() beforehand. versioninfo() is:

Julia Version 0.4.0-rc2+31
Commit 09f248f* (2015-09-23 01:56 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

cc @xianyi
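
For anyone trying to reproduce this, a minimal sketch of the single-threaded check described above (Julia 0.4 API, as used in this thread; on current Julia the rough equivalent is LinearAlgebra.BLAS.set_num_threads(1)):

# Disable threading in OpenBLAS (and other threaded libraries) before timing;
# per the report above, the multi-minute outliers do not appear in this mode.
Base.disable_threaded_libs()

X = randn(100000, 1000)
for i in 1:20
    @time X'X          # expect consistent ~2 s runs, no 50-240 s outliers
end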

@musm (Contributor) commented Sep 28, 2015

I can't reproduce:

julia>  X = randn(100000, 1000);

julia> for i in 1:10
           @time X'X;
       end
  2.326412 seconds (620.56 k allocations: 37.009 MB, 0.64% gc time)
  1.947887 seconds (5 allocations: 7.630 MB)
  1.967060 seconds (5 allocations: 7.630 MB)
  2.087392 seconds (5 allocations: 7.630 MB)
  2.272428 seconds (5 allocations: 7.630 MB)
  2.257038 seconds (5 allocations: 7.630 MB)
  2.127397 seconds (5 allocations: 7.630 MB)
  2.172600 seconds (5 allocations: 7.630 MB)
  2.198271 seconds (5 allocations: 7.630 MB)
  2.184945 seconds (5 allocations: 7.630 MB)


julia> versioninfo()
Julia Version 0.5.0-dev+321
Commit b939592 (2015-09-23 01:53 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@kshyatt kshyatt added the performance Must go faster label Sep 29, 2015
@simonster (Member Author)

Here's the output with #define TIMING 1 in level3_syrk_threaded.c:

julia> @time X'X;
GEMM   [ 1] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.00 Wait2 :  99.62 Wait3 :   0.00 Kernel : 145.06
GEMM   [ 6] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.02 Wait2 :  99.63 Wait3 :   0.00 Kernel : 155.24
GEMM   [ 3] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.01 Wait2 :  99.68 Wait3 :   0.00 Kernel : 162.53
GEMM   [ 2] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.00 Wait2 :  99.06 Wait3 :   0.00 Kernel :  54.54
GEMM   [ 5] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.02 Wait2 :  99.65 Wait3 :   0.00 Kernel : 159.28
GEMM   [ 4] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.01 Wait2 :  99.62 Wait3 :   0.00 Kernel : 148.61
GEMM   [ 8] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   5.65 Wait2 :  77.41 Wait3 :   0.00 Kernel :   3.15
GEMM   [ 9] Copy_A :   0.01  Copy_B :   0.01  Wait1 :   7.12 Wait2 :  92.51 Wait3 :   0.00 Kernel : 158.34
GEMM   [ 0] Copy_A :   0.00  Copy_B :   0.00  Wait1 :   0.00 Wait2 :  99.81 Wait3 :   0.00 Kernel : 145.80
GEMM   [10] Copy_A :   0.01  Copy_B :   0.01  Wait1 :   7.03 Wait2 :  92.62 Wait3 :   0.00 Kernel : 168.69
GEMM   [ 7] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   1.36 Wait2 :  98.26 Wait3 :   0.00 Kernel : 147.33
GEMM   [11] Copy_A :   0.03  Copy_B :   0.02  Wait1 :  99.65 Wait2 :   0.00 Wait3 :   0.00 Kernel : 319.75
136.467091 seconds (9 allocations: 7.630 MB)

In other words, nearly all of the threads spend nearly all of their time waiting. If I set OPENBLAS_NUM_THREADS=6 (since I only have 6 physical cores before hyperthreading), extremely slow runs are less common, but they still happen occasionally.
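
A hedged repro sketch for the outliers described above: timing many repetitions and reporting the spread catches the intermittent slow runs more reliably than individual @time calls (standard Julia 0.4 names; nothing OpenBLAS-specific is assumed):

X = randn(100000, 1000)
times = Float64[]
for i in 1:50
    push!(times, @elapsed X'X)   # @elapsed returns wall time in seconds
end
# A max far above the min (e.g. 240 s vs 1.5 s in the transcript above)
# flags the pathology.
println("min = ", minimum(times), " s, max = ", maximum(times), " s")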

@xianyi commented Sep 30, 2015

Interesting. I hadn't looked at syrk threading before.

@ViralBShah ViralBShah added the upstream The issue is with an upstream dependency, e.g. LLVM label Oct 1, 2015
@damiendr

Same problem here with Julia 0.4.0 on a Core i7-4790K.
I use gemv! with a Float32 matrix (57x192).

With blas_set_num_threads(8) I see a similar amount of time spent waiting.

Calling blas_set_num_threads(1) improved the overall performance of my code by 2x. I don't have precise numbers for the gemv call itself.

Maybe a change of defaults or a warning in the LinAlg section of the manual would be called for?

EDIT: this seems to be related to JuliaLang/julia#3239.
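
A sketch of the small-matrix case in this comment (matrix size taken from the comment; Julia 0.4 spellings). For a 57x192 Float32 gemv! each call does only ~11k multiply-adds, so per-call thread synchronization can plausibly dominate the kernel itself:

A = rand(Float32, 57, 192)
x = rand(Float32, 192)
y = zeros(Float32, 57)

blas_set_num_threads(8)
@time for i in 1:100000
    Base.LinAlg.BLAS.gemv!('N', 1f0, A, x, 0f0, y)   # y = A*x
end

blas_set_num_threads(1)   # reported ~2x faster overall in the comment above
@time for i in 1:100000
    Base.LinAlg.BLAS.gemv!('N', 1f0, A, x, 0f0, y)
end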

@simonster (Member Author)

Since updating to Ubuntu 15.10 (from 15.04) I can no longer reproduce this. I'm not sure what that means, but FWIW even under 15.04 I didn't have this issue with parallel MKL BLAS.

@ViralBShah (Member)

We had an OpenBLAS upgrade recently. Did that change things, or are you using the same setup as before?

@simonster (Member Author)

I'm still on OpenBLAS 0.2.14, so I don't think that's the issue.

I think the way OpenBLAS handles threading could be better, although I'm not exactly qualified to offer an opinion on such things. Right now it uses sched_yield when waiting for other threads to release their locks, but @poulson suggested that this is essentially a no-op on newer kernels. A locking primitive that uses futex under the hood (e.g. pthread locks) might work better, since it would avoid waking threads that are waiting on locks until the threads holding them have finished.
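
An illustrative sketch of the two waiting strategies contrasted above, written in modern Julia rather than OpenBLAS's actual C (the names and structure here are assumptions for illustration only, not OpenBLAS code):

using Base.Threads

const ready = Atomic{Bool}(false)
const done_cond = Threads.Condition()

# Strategy 1: spin + yield, analogous to OpenBLAS's sched_yield() loop.
# The waiter stays runnable and keeps consuming scheduler slots, which is
# wasted work when the machine is oversubscribed.
function spin_wait()
    while !ready[]
        yield()              # remain runnable; ask the scheduler to run someone else
    end
end

# Strategy 2: park on a condition variable (futex-backed via pthreads on Linux).
# The waiter sleeps in the kernel and is only woken by an explicit notify.
function sleeping_wait()
    lock(done_cond)
    try
        while !ready[]
            wait(done_cond)  # atomically releases the lock and parks the thread
        end
    finally
        unlock(done_cond)
    end
end

function finish()
    ready[] = true
    lock(done_cond)
    try
        notify(done_cond)    # wake the parked waiters exactly when work is done
    finally
        unlock(done_cond)
    end
end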

@simonster (Member Author)

I can no longer reproduce this, but I'm not sure if it's actually fixed or something changed with my system configuration. I believe this issue is related to OpenMathLib/OpenBLAS#660.

@ViralBShah (Member)

Can't reproduce on OpenBLAS 0.3.0. Can you reopen if you still see this on master?

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024