Threaded OpenBLAS is occasionally very slow #262

Closed
simonster opened this issue Sep 28, 2015 · 9 comments
Labels
performance Must go faster upstream The issue is with an upstream dependency, e.g. LLVM

Comments

@simonster (Member) commented Sep 28, 2015

If I keep calling X'X (which calls OpenBLAS dsyrk), it occasionally takes an exceptionally long time to complete:

julia> X = randn(100000, 1000);

julia> @time X'X;
 56.971655 seconds (632.63 k allocations: 37.585 MB, 0.01% gc time)

julia> @time X'X;
  1.510004 seconds (9 allocations: 7.630 MB)

julia> @time X'X;
  2.670606 seconds (9 allocations: 7.630 MB)

julia> @time X'X;
  1.956606 seconds (9 allocations: 7.630 MB, 0.10% gc time)

julia> @time X'X;
  2.269821 seconds (9 allocations: 7.630 MB)

julia> @time X'X;
  1.673705 seconds (9 allocations: 7.630 MB)

julia> @time X'X;
  1.843355 seconds (9 allocations: 7.630 MB, 0.04% gc time)

julia> @time X'X;
241.520904 seconds (9 allocations: 7.630 MB)

This doesn't seem to happen if I call Base.disable_threaded_libs() beforehand. versioninfo() is:

Julia Version 0.4.0-rc2+31
Commit 09f248f* (2015-09-23 01:56 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

cc @xianyi
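
For anyone trying to reproduce this, a minimal sketch of the single-threaded check described above (Julia 0.4 API, as used in this thread; on current Julia the rough equivalent is LinearAlgebra.BLAS.set_num_threads(1)):

# Disable threading in OpenBLAS (and other threaded libraries) before timing;
# per the report above, the multi-minute outliers do not appear in this mode.
Base.disable_threaded_libs()

X = randn(100000, 1000)
for i in 1:20
    @time X'X          # expect consistent ~2 s runs, no 50-240 s outliers
end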

@musm (Contributor) commented Sep 28, 2015

I can't reproduce:

julia>  X = randn(100000, 1000);

julia> for i in 1:10
           @time X'X;
       end
  2.326412 seconds (620.56 k allocations: 37.009 MB, 0.64% gc time)
  1.947887 seconds (5 allocations: 7.630 MB)
  1.967060 seconds (5 allocations: 7.630 MB)
  2.087392 seconds (5 allocations: 7.630 MB)
  2.272428 seconds (5 allocations: 7.630 MB)
  2.257038 seconds (5 allocations: 7.630 MB)
  2.127397 seconds (5 allocations: 7.630 MB)
  2.172600 seconds (5 allocations: 7.630 MB)
  2.198271 seconds (5 allocations: 7.630 MB)
  2.184945 seconds (5 allocations: 7.630 MB)


julia> versioninfo()
Julia Version 0.5.0-dev+321
Commit b939592 (2015-09-23 01:53 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@kshyatt kshyatt added the performance Must go faster label Sep 29, 2015
@simonster (Member Author)

Here's the output with #define TIMING 1 in level3_syrk_threaded.c:

julia> @time X'X;
GEMM   [ 1] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.00 Wait2 :  99.62 Wait3 :   0.00 Kernel : 145.06
GEMM   [ 6] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.02 Wait2 :  99.63 Wait3 :   0.00 Kernel : 155.24
GEMM   [ 3] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.01 Wait2 :  99.68 Wait3 :   0.00 Kernel : 162.53
GEMM   [ 2] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.00 Wait2 :  99.06 Wait3 :   0.00 Kernel :  54.54
GEMM   [ 5] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.02 Wait2 :  99.65 Wait3 :   0.00 Kernel : 159.28
GEMM   [ 4] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   0.01 Wait2 :  99.62 Wait3 :   0.00 Kernel : 148.61
GEMM   [ 8] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   5.65 Wait2 :  77.41 Wait3 :   0.00 Kernel :   3.15
GEMM   [ 9] Copy_A :   0.01  Copy_B :   0.01  Wait1 :   7.12 Wait2 :  92.51 Wait3 :   0.00 Kernel : 158.34
GEMM   [ 0] Copy_A :   0.00  Copy_B :   0.00  Wait1 :   0.00 Wait2 :  99.81 Wait3 :   0.00 Kernel : 145.80
GEMM   [10] Copy_A :   0.01  Copy_B :   0.01  Wait1 :   7.03 Wait2 :  92.62 Wait3 :   0.00 Kernel : 168.69
GEMM   [ 7] Copy_A :   0.01  Copy_B :   0.00  Wait1 :   1.36 Wait2 :  98.26 Wait3 :   0.00 Kernel : 147.33
GEMM   [11] Copy_A :   0.03  Copy_B :   0.02  Wait1 :  99.65 Wait2 :   0.00 Wait3 :   0.00 Kernel : 319.75
136.467091 seconds (9 allocations: 7.630 MB)

In other words, nearly all of the threads spend nearly all of their time waiting. If I set OPENBLAS_NUM_THREADS=6 (since I only have 6 physical cores before hyperthreading), extremely slow runs are less common, but they still happen occasionally.
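
A hedged repro sketch for the outliers described above: timing many repetitions and reporting the spread catches the intermittent slow runs more reliably than individual @time calls (standard Julia 0.4 names; nothing OpenBLAS-specific is assumed):

X = randn(100000, 1000)
times = Float64[]
for i in 1:50
    push!(times, @elapsed X'X)   # @elapsed returns wall time in seconds
end
# A max far above the min (e.g. 240 s vs 1.5 s in the transcript above)
# flags the pathology.
println("min = ", minimum(times), " s, max = ", maximum(times), " s")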

@xianyi commented Sep 30, 2015

Interesting. I hadn't looked at syrk threading before.

@ViralBShah ViralBShah added the upstream The issue is with an upstream dependency, e.g. LLVM label Oct 1, 2015
@damiendr

Same problem here with Julia 0.4.0 on a Core i7-4790K.
I use gemv! with a Float32 matrix (57x192).

With blas_set_num_threads(8) I see a similar amount of time spent waiting.

Calling blas_set_num_threads(1) improved the overall performance of my code by 2x. I don't have precise numbers for the gemv call itself.

Maybe a change of defaults or a warning in the LinAlg section of the manual would be called for?

EDIT: this seems to be related to JuliaLang/julia#3239.
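
A sketch of the small-matrix case in this comment (matrix size taken from the comment; Julia 0.4 spellings). For a 57x192 Float32 gemv! each call does only ~11k multiply-adds, so per-call thread synchronization can plausibly dominate the kernel itself:

A = rand(Float32, 57, 192)
x = rand(Float32, 192)
y = zeros(Float32, 57)

blas_set_num_threads(8)
@time for i in 1:100000
    Base.LinAlg.BLAS.gemv!('N', 1f0, A, x, 0f0, y)   # y = A*x
end

blas_set_num_threads(1)   # reported ~2x faster overall in the comment above
@time for i in 1:100000
    Base.LinAlg.BLAS.gemv!('N', 1f0, A, x, 0f0, y)
end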

@simonster (Member Author)

Since updating to Ubuntu 15.10 (from 15.04) I can no longer reproduce this. I'm not sure what that means, but FWIW even under 15.04 I didn't have this issue with parallel MKL BLAS.

@ViralBShah (Member)

We had an OpenBLAS upgrade recently. Did that change things, or are you using the same setup as before?

@simonster (Member Author)

I'm still on OpenBLAS 0.2.14, so I don't think that's the issue.

I think the way OpenBLAS handles threading could be better, although I'm not exactly qualified to offer an opinion on such things. Right now it uses sched_yield when waiting for other threads to release their locks, but @poulson suggested that this is essentially a no-op on newer kernels. A locking primitive that uses futex under the hood (e.g. pthread locks) might work better, since it would avoid waking threads that are waiting on locks until the threads holding them have finished.
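
An illustrative sketch of the two waiting strategies contrasted above, written in modern Julia rather than OpenBLAS's actual C (the names and structure here are assumptions for illustration only, not OpenBLAS code):

using Base.Threads

const ready = Atomic{Bool}(false)
const done_cond = Threads.Condition()

# Strategy 1: spin + yield, analogous to OpenBLAS's sched_yield() loop.
# The waiter stays runnable and keeps consuming scheduler slots, which is
# wasted work when the machine is oversubscribed.
function spin_wait()
    while !ready[]
        yield()              # remain runnable; ask the scheduler to run someone else
    end
end

# Strategy 2: park on a condition variable (futex-backed via pthreads on Linux).
# The waiter sleeps in the kernel and is only woken by an explicit notify.
function sleeping_wait()
    lock(done_cond)
    try
        while !ready[]
            wait(done_cond)  # atomically releases the lock and parks the thread
        end
    finally
        unlock(done_cond)
    end
end

function finish()
    ready[] = true
    lock(done_cond)
    try
        notify(done_cond)    # wake the parked waiters exactly when work is done
    finally
        unlock(done_cond)
    end
end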

@simonster (Member Author)

I can no longer reproduce this, but I'm not sure if it's actually fixed or something changed with my system configuration. I believe this issue is related to OpenMathLib/OpenBLAS#660.

@ViralBShah (Member)

Can't reproduce on OpenBLAS 0.3.0. Can you reopen if you still see this on master?

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024