Tuners segfault #243

Closed
ysh329 opened this issue Jan 14, 2018 · 18 comments

@ysh329

ysh329 commented Jan 14, 2018

Hi, I want to tune GEMM performance on my AMD GPU, following these commands from the README:

mkdir build
cd build
cmake -DTUNERS=ON ..
make
make alltuners
python ../scripts/database/database.py . ..
make

However, I ran into the following problem:

gpu@gpu-FP4:~/yuanshuai/code/CLBlast$ cd build/
gpu@gpu-FP4:~/yuanshuai/code/CLBlast/build$ ls
aticonfig.log            clblast_tuner_copy_pad       clblast_tuner_routine_xtrsv   clblast_tuner_xaxpy  clblast_tuner_xgemm_direct  CMakeCache.txt       install_manifest.txt
clblast.pc               clblast_tuner_invert         clblast_tuner_transpose_fast  clblast_tuner_xdot   clblast_tuner_xgemv         CMakeFiles           libclblast.so
clblast_tuner_copy_fast  clblast_tuner_routine_xgemm  clblast_tuner_transpose_pad   clblast_tuner_xgemm  clblast_tuner_xger          cmake_install.cmake  Makefile
gpu@gpu-FP4:~/yuanshuai/code/CLBlast/build$ cmake -DTUNERS=ON ..
-- Building CLBlast with OpenCL API (default)
-- Configuring done
-- Generating done
-- Build files have been written to: /home/gpu/yuanshuai/code/CLBlast/build
gpu@gpu-FP4:~/yuanshuai/code/CLBlast/build$ make
[ 78%] Built target clblast
[ 86%] Built target tuners_common_library
[ 87%] Built target clblast_tuner_copy_fast
[ 88%] Built target clblast_tuner_copy_pad
[ 89%] Built target clblast_tuner_invert
[ 90%] Built target clblast_tuner_routine_xgemm
[ 91%] Built target clblast_tuner_routine_xtrsv
[ 92%] Built target clblast_tuner_transpose_fast
[ 93%] Built target clblast_tuner_transpose_pad
[ 94%] Built target clblast_tuner_xaxpy
[ 95%] Built target clblast_tuner_xdot
[ 96%] Built target clblast_tuner_xgemm
[ 97%] Built target clblast_tuner_xgemm_direct
[ 98%] Built target clblast_tuner_xgemv
[100%] Built target clblast_tuner_xger
gpu@gpu-FP4:~/yuanshuai/code/CLBlast/build$ make alltuners
[ 38%] Built target tuners_common_library
[ 44%] Built target clblast_tuner_invert
[ 50%] Built target clblast_tuner_copy_fast
[ 55%] Built target clblast_tuner_transpose_pad
[ 61%] Built target clblast_tuner_copy_pad
[ 66%] Built target clblast_tuner_transpose_fast
[ 72%] Built target clblast_tuner_xaxpy
[ 77%] Built target clblast_tuner_xdot
[ 83%] Built target clblast_tuner_xger
[ 88%] Built target clblast_tuner_xgemm
[ 94%] Built target clblast_tuner_xgemm_direct
[100%] Built target clblast_tuner_xgemv
Scanning dependencies of target alltuners
* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -precision 32 (single) [=default]
    -m 1024 [=default]
    -n 1024 [=default]
    -alpha 2.00 [=default]
    -fraction 1.00 [=default]
    -runs 10 [=default]
    -max_l2_norm 0.00 [=default]

* Found 144 configuration(s)
* Parameters explored: COPY_DIMX COPY_DIMY COPY_WPT COPY_VW 

|   ID | total |               param |       compiles |         time |   GB/s |            status |
x------x-------x---------------------x----------------x--------------x--------x-------------------x
CMakeFiles/alltuners.dir/build.make:49: recipe for target 'CMakeFiles/alltuners' failed
make[3]: *** [CMakeFiles/alltuners] Segmentation fault (core dumped)
CMakeFiles/Makefile2:70: recipe for target 'CMakeFiles/alltuners.dir/all' failed
make[2]: *** [CMakeFiles/alltuners.dir/all] Error 2
CMakeFiles/Makefile2:78: recipe for target 'CMakeFiles/alltuners.dir/rule' failed
make[1]: *** [CMakeFiles/alltuners.dir/rule] Error 2
Makefile:150: recipe for target 'alltuners' failed
make: *** [alltuners] Error 2
gpu@gpu-FP4:~/yuanshuai/code/CLBlast/build$

CNugteren changed the title from "recipe for target 'alltuners' failed when make alltuners" to "Tuners segfault" on Jan 14, 2018

@CNugteren
Owner

I've just renamed this because it has nothing to do with "alltuners": it seems that whenever you run a tuner (e.g. ./clblast_tuner_copy_fast) it segfaults. This is also what you reported in the other issue you opened, where I asked whether you have run any OpenCL program successfully on your set-up/device. From another post of yours I also learned that you have two devices in your system. What happens if you run it on the other one instead (e.g. ./clblast_tuner_copy_fast --platform 0 --device 1)?

@ysh329
Author

ysh329 commented Jan 15, 2018

Thanks a lot! Strangely, both ./clblast_tuner_copy_fast --platform 0 --device 1 and ./clblast_tuner_copy_fast --platform 0 --device 0 segfault, as shown below:

gpu@gpu-FP4:~/yuanshuai/code/CLBlast/build$ ./clblast_tuner_copy_fast --platform 0 --device 1
* Options given/available:
    -platform 0 [=default]
    -device 1 
    -precision 32 (single) [=default]
    -m 1024 [=default]
    -n 1024 [=default]
    -alpha 2.00 [=default]
    -fraction 1.00 [=default]
    -runs 10 [=default]
    -max_l2_norm 0.00 [=default]

* Found 144 configuration(s)
* Parameters explored: COPY_DIMX COPY_DIMY COPY_WPT COPY_VW 

|   ID | total |               param |       compiles |         time |   GB/s |            status |
x------x-------x---------------------x----------------x--------------x--------x-------------------x
Segmentation fault (core dumped)
gpu@gpu-FP4:~/yuanshuai/code/CLBlast/build$ ./clblast_tuner_copy_fast --platform 0 --device 0
* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -precision 32 (single) [=default]
    -m 1024 [=default]
    -n 1024 [=default]
    -alpha 2.00 [=default]
    -fraction 1.00 [=default]
    -runs 10 [=default]
    -max_l2_norm 0.00 [=default]

* Found 144 configuration(s)
* Parameters explored: COPY_DIMX COPY_DIMY COPY_WPT COPY_VW 

|   ID | total |               param |       compiles |         time |   GB/s |            status |
x------x-------x---------------------x----------------x--------------x--------x-------------------x
Segmentation fault (core dumped)

I can successfully run OpenCL programs (including OpenCL kernel files, of course) using our company's internal OpenCL framework. Strangely, though, clpeak also segfaults, as shown below:

./clpeak 

Platform: AMD Accelerated Parallel Processing
  Device: Carrizo
    Driver version  : 1912.5 (VM) (Linux x64)
    Compute units   : 6
    Clock frequency : 576 MHz
Segmentation fault (core dumped)

So I think some OpenCL-related configuration in the host code may be incorrect. Next, I will check how our company's framework configures OpenCL.

@CNugteren
Owner

Yes, this looks like something is wrong with your system rather than with CLBlast. To verify you can also try the clblast_test_diagnostics binary (if you compiled with -DTESTS=ON), which will probably also segfault on your system...

@ysh329
Author

ysh329 commented Jan 17, 2018

Thanks, I'll have a try 🙇

@CNugteren
Owner

You could also run in a debugger (e.g. gdb) and post a back-trace here. Optionally also compile the library with debug symbols to get more info. Perhaps your issue is also similar to what is reported here, here or here?
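
A minimal sketch of the debugger suggestion above, assuming gdb is installed and the commands are run from the build directory. The helper names rebuild_with_debug_symbols and backtrace_of are hypothetical, and CMAKE_BUILD_TYPE=Debug is only the generic CMake way to request debug symbols; whether CLBlast needs anything beyond that is not confirmed in this thread.

```shell
# Hypothetical helper: reconfigure and rebuild with debug symbols.
# Meant to be run from inside the build directory.
rebuild_with_debug_symbols() {
    cmake -DCMAKE_BUILD_TYPE=Debug -DTUNERS=ON .. && make
}

# Hypothetical helper: run a program under gdb non-interactively and
# print a back-trace once it stops (e.g. on a segmentation fault).
backtrace_of() {
    gdb -batch -ex run -ex bt --args "$@"
}

# Example usage (not executed here):
#   rebuild_with_debug_symbols
#   backtrace_of ./clblast_tuner_copy_fast --platform 0 --device 0
```

The batch form is convenient for pasting a back-trace into an issue; an interactive gdb session with run followed by bt gives the same information.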

@ghost

ghost commented Feb 15, 2018

I am experiencing similar problems to the ones described above. When running the invert tuner on an Intel platform it segfaults with one of several back-traces, for example:

#0  std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >::_M_erase (this=this@entry=0x1c91ee0, __x=0x22621f38a23966fc) at /usr/include/c++/5/bits/stl_tree.h:1612
#1  0x00000000004397a8 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >::~_Rb_tree (this=0x1c91ee0, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/stl_tree.h:858
#2  std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >::~map (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/stl_map.h:96
#3  std::_Destroy<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > > > (__pointer=<optimized out>) at /usr/include/c++/5/bits/stl_construct.h:93
#4  std::_Destroy_aux<false>::__destroy<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >*> (__last=<optimized out>, __first=0x1c91ee0) at /usr/include/c++/5/bits/stl_construct.h:103
#5  std::_Destroy<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >*> (__last=<optimized out>, __first=<optimized out>) at /usr/include/c++/5/bits/stl_construct.h:126
#6  std::_Destroy<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >*, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > > > (__last=0x1c91f10, __first=<optimized out>) at /usr/include/c++/5/bits/stl_construct.h:151
#7  std::vector<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >, std::allocator<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > > > >::~vector (this=0x7ffd0b058820, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/stl_vector.h:424
#8  0x000000000044ce67 in clblast::Tuner<float>(int, char**, int, std::function<clblast::TunerDefaults (int)>, std::function<clblast::TunerSettings (int, clblast::Arguments<float> const&)>, std::function<void (int, clblast::Arguments<float> const&)>, std::function<std::vector<clblast::Constraint, std::allocator<clblast::Constraint> > (int)>, std::function<void (int, clblast::Kernel&, clblast::Arguments<float> const&, std::vector<clblast::Buffer<float>, std::allocator<clblast::Buffer<float> > >&)>) (argc=argc@entry=3, argv=argv@entry=0x7ffd0b05ad28, V=V@entry=0, GetTunerDefaults=..., GetTunerSettings=..., TestValidArguments=..., SetConstraints=..., SetArguments=...) at /home/richard/CLionProjects/ma-richard-schulze/libraries/CLBlast/src/tuning/tuning.cpp:335
#9  0x0000000000418392 in main (argc=3, argv=0x7ffd0b05ad28) at /home/richard/CLionProjects/ma-richard-schulze/libraries/CLBlast/src/tuning/kernels/invert.cpp:118
#10 0x00007f93df276830 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000419359 in _start ()

or

#0  clblast::Tuner<float>(int, char**, int, std::function<clblast::TunerDefaults (int)>, std::function<clblast::TunerSettings (int, clblast::Arguments<float> const&)>, std::function<void (int, clblast::Arguments<float> const&)>, std::function<std::vector<clblast::Constraint, std::allocator<clblast::Constraint> > (int)>, std::function<void (int, clblast::Kernel&, clblast::Arguments<float> const&, std::vector<clblast::Buffer<float>, std::allocator<clblast::Buffer<float> > >&)>) (argc=argc@entry=3, argv=argv@entry=0x7ffe3a72c8d8, V=V@entry=0, GetTunerDefaults=..., GetTunerSettings=..., TestValidArguments=..., SetConstraints=..., SetArguments=...) at /home/richard/CLionProjects/ma-richard-schulze/libraries/CLBlast/src/tuning/tuning.cpp:229
#1  0x0000000000418392 in main (argc=3, argv=0x7ffe3a72c8d8) at /home/richard/CLionProjects/ma-richard-schulze/libraries/CLBlast/src/tuning/kernels/invert.cpp:118
#2  0x00007fc282fb3830 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x0000000000419359 in _start ()

I also got a segmentation fault with the following back-trace that looks like it originated in the reference kernel itself, because the thread running the tuner stopped at a line that times the reference kernel:

#0  0x00007f01026360b0 in TripleMatMul16Part1Lower ()
#1  0x00007f00fe18226f in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so
#2  0x00007f00ff5aae79 in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libcpu_device.so
#3  0x00007f00ffa4342a in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libtask_executor.so
#4  0x00007f00ffa43d5e in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libtask_executor.so
#5  0x00007f00ffa44559 in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libtask_executor.so
#6  0x00007f01004bc0c5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f00ff557e80, parent=..., child=0xc) at ../../src/tbb/custom_scheduler.h:474
#7  0x00007f01004ba4f2 in tbb::internal::custom_scheduler<tbb::internal::DefaultSchedulerTraits>::local_wait_for_all (this=0x0, parent=..., child=<optimized out>, $�7=<optimized out>, $�8=..., $�9=<optimized out>) at ../../src/tbb/scheduler.cpp:680
#8  tbb::internal::custom_scheduler<tbb::internal::DefaultSchedulerTraits>::wait_for_all (this=0x0, parent=..., child=0xc) at ../../src/tbb/custom_scheduler.h:81
#9  0x00007f00ffa36c96 in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libtask_executor.so
#10 0x00007f00ffa36d5a in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libtask_executor.so
#11 0x00007f00ffa34c1e in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libtask_executor.so
#12 0x00007f00ffa2f593 in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libtask_executor.so
#13 0x00007f00ffa2f854 in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libtask_executor.so
#14 0x00007f00ffa2c18d in ?? () from /opt/intel/opencl-1.2-6.4.0.25/lib64/libtask_executor.so
#15 0x00007f01004bc0c5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f00ff557e80, parent=..., child=0xc) at ../../src/tbb/custom_scheduler.h:474
#16 0x00007f01004b78f2 in tbb::internal::arena::process (this=0x0, s=...) at ../../src/tbb/arena.cpp:96
#17 0x00007f01004b5c48 in tbb::internal::market::process (this=0x0, j=...) at ../../src/tbb/market.cpp:495
#18 0x00007f01004b1949 in tbb::internal::rml::private_server::remove_server_ref (this=<optimized out>, $`6=<optimized out>) at ../../src/tbb/private_server.cpp:275
#19 tbb::internal::rml::private_server::request_close_connection (this=0x0) at ../../src/tbb/private_server.cpp:192
#20 0x00007f01004b18d6 in tbb::internal::rml::private_worker::thread_routine (arg=0x0) at ../../src/tbb/private_server.cpp:228
#21 0x00007f01012eb6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#22 0x00007f010180c41d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Maybe this is the actual source of the problem?

I tried on two different machines with Intel platforms, with no luck. I was able to run all tuners on an NVIDIA platform just fine.

Your help would be very much appreciated, many thanks in advance!

@CNugteren
Owner

I doubt that this is related... the original question made all tuners crash at the very first kernel invocation, whereas you have something specific for the invert tuner (all others work?). Also you report that this only happens on Intel platforms, whereas the original question was on an AMD platform.

Nevertheless, it should be fixed :-) Can you give details of your system, e.g. the OS, the Intel OpenCL version, and the device you are testing on? Perhaps I can then try to reproduce it.

@ghost

ghost commented Feb 18, 2018

Sorry, you are right, it is not directly related. I can open a new issue if you'd prefer that.

I am testing on CentOS 7.4.1708 on the following platform and device (clinfo output):

  Platform Name                                   Intel(R) OpenCL
  Platform Vendor                                 Intel(R) Corporation
  Platform Version                                OpenCL 1.2
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64
  Platform Extensions function suffix             INTEL
  Device Name                                     Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 1.2 (Build 400)
  Driver Version                                  1.2.0.400
  Device OpenCL C Version                         OpenCL C 1.2
  Device Type                                     CPU
  Device Available                                Yes
  Device Profile                                  FULL_PROFILE
  Max compute units                               32
  Max clock frequency                             2000MHz
  Device Partition                                (core)
    Max number of sub-devices                     32
    Supported partition types                     by counts, equally, by names (Intel)
  Max work item dimensions                        3
  Max work item sizes                             8192x8192x8192
  Max work group size                             8192
  Compiler Available                              Yes
  Linker Available                                Yes
  Preferred work group size multiple              128
  Preferred / native vector sizes
    char                                                 1 / 16
    short                                                1 / 8
    int                                                  1 / 4
    long                                                 1 / 2
    half                                                 0 / 0        (n/a)
    float                                                1 / 8
    double                                               1 / 4        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Address bits                                    64, Little-Endian
  Global memory size                              134991880192 (125.7GiB)
  Error Correction support                        No
  Max memory allocation                           33747970048 (31.43GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        262144 (256KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             480
    Max size for 1D images from buffer            2109248128 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 480
    Max number of write image args                480
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max constant buffer size                        131072 (128KiB)
  Max number of constant args                     480
  Max size of kernel argument                     3840 (3.75KiB)
  Queue properties
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Local thread execution (Intel)                Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    SPIR versions                                 1.2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels
  Device Extensions                               cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64

All other tuners are running correctly. I was also able to tune CLBlast version 0.11.0 a few months ago on that same machine.

@CNugteren
Owner

OK, thanks for the info. With that I actually managed to reproduce it on a Debian system with an Intel CPU and Intel OpenCL. I'll investigate.

@CNugteren
Owner

I fixed quite a few issues in the tuner. The tuner now works successfully on my test system, could you also have a go with the latest master?

Background info: The invert tuner is new and isn't really used yet; it is more like a stub. But it is still good to not make it crash, of course :-)

@kpot
Contributor

kpot commented Feb 21, 2018

Hi guys! I'm not sure how relevant it is, but a couple of months ago I ran into similar problems with my project (I used AMD APP SDK 3.0 on Ubuntu 14.04.3 LTS).

In my case both the drivers and the SDK were installed properly, clinfo and other pre-built OpenCL software were working correctly and showing some meaningful information.
Yet any time I tried to build any OpenCL-based code myself and launch it, it would fail with a segfault similar to what you've shown.
I also tried to download and build clpeak, and it failed too.
In desperation, I even installed a newer CMake (3.10.2) and clang5 to make sure it wasn't the toolchain's fault. No luck.

Eventually I solved the issue. Here's what I suggest you check/try:

  1. Make sure you use -fPIC or its analogue when you compile the code. In CMake
    you can do this by simply adding set(CMAKE_POSITION_INDEPENDENT_CODE ON)
    to the project config.
  2. Prevent CMake from adding RPATH entries to binaries you link against the AMD SDK.
    Something like set_target_properties(my_executable_file PROPERTIES SKIP_BUILD_RPATH ON)
    will do the job.
    Or you can do this project-wide with set(CMAKE_SKIP_BUILD_RPATH ON).

After these changes both clpeak and all my code began working like a charm. So try it, at least if you're using the same configuration.
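
As a rough sketch, the two settings above can also be passed on the cmake command line instead of being added to the project config. configure_pic_no_rpath is a hypothetical helper name, and this assumes the project's own CMakeLists.txt does not set these variables itself; it is not a confirmed fix for CLBlast.

```shell
# Hypothetical helper: configure a CMake project with position-independent
# code enabled and without RPATH entries in the build tree.
# Any extra arguments (other -D options, the source directory) are passed through.
configure_pic_no_rpath() {
    cmake -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
          -DCMAKE_SKIP_BUILD_RPATH=ON \
          "$@"
}

# Example usage from an empty build directory (not executed here):
#   configure_pic_no_rpath -DTUNERS=ON ..
#   make
```

Command-line -D values are stored in the CMake cache, so they only take effect reliably on a fresh build directory.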

@ghost

ghost commented Feb 21, 2018

Thanks for the quick fix! I can now successfully run the invert tuner.

@CNugteren
Owner

CNugteren commented Feb 21, 2018

@richardschulze: Good to hear your issue is solved 👍

@kpot: Thanks for the info. I think I'll add it to the CLBlast documentation under frequently asked questions. I don't think this is something that should go into CLBlast itself, right? Or do you think this applies to every platform?

@kpot
Contributor

kpot commented Feb 21, 2018

@CNugteren: I don't think this should be included in CLBlast's config. Such things can and should be controlled externally, by the developers who use CLBlast. But I agree that adding this to the FAQ might be helpful, because it definitely applies at least to Linux.

With -fPIC the behavior can vary between distros and patches they apply to their packages. For example, some distros of Ubuntu will enable -fPIC by default. But during my investigation I've seen dozens of reports from people about the same issue appearing in many software projects. And some eventually include CMAKE_POSITION_INDEPENDENT_CODE into their CMake configs. Without linking, or until you start using the foreign function, your code works perfectly well.

The RPATH problem is trickier and seems to appear only with recent versions of CMake, where RPATH may be turned on by default. It looks like this: you build your code, launch it, and it constantly segfaults somewhere deep in the OpenCL guts, leaving you clueless about what might be wrong. But objdump -x <executable> shows a dependency on libOpenCL.so and an RPATH of /opt/AMDAPPSDK-3.0/lib/x86_64, and that is bad. Even though you link the code against a particular SDK, you normally don't want it to be a "hard link". The software should be able to pick up and use whatever libOpenCL.so environment the user has on the machine. Otherwise it may not work at all; in my case it failed even though the SDK and the graphics driver both came from the same manufacturer and were installed on the same machine. I've also spotted the same problem in some other software projects, though not as often as with -fPIC. In my CL code I've set PROPERTIES SKIP_BUILD_RPATH ON and it works both on my Mac and Linux machines with Intel and AMD SDKs (haven't tested with NVidia yet).
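
The objdump check described above could be sketched roughly as follows. show_rpath is a hypothetical helper name, and matching RUNPATH as well as RPATH is an added assumption (newer linkers may emit RUNPATH instead); the binary name in the usage comment is just one of the tuners from this thread.

```shell
# Hypothetical helper: print any RPATH/RUNPATH entries embedded in a binary.
# Returns non-zero when no such entry is found (or the file cannot be read).
show_rpath() {
    objdump -x "$1" 2>/dev/null | grep -E 'RPATH|RUNPATH'
}

# Example usage from the build directory (not executed here):
#   show_rpath ./clblast_tuner_copy_fast || echo "no RPATH/RUNPATH entry found"
```

If an SDK directory shows up in the output, the binary is pinned to that SDK's libOpenCL.so rather than the one the system would otherwise pick.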

@CNugteren
Owner

@kpot: OK, thanks for the explanation. I've added a note in the installation guide.

Since @richardschulze's issue is confirmed to be solved and @ysh329's issue seems to be more of a general OpenCL issue (and can perhaps be solved by the above suggestion), I'm closing this issue.

@kodonnell

Sorry, novice cmaker (etc.) here ... can someone confirm where the 'project config' is? I'm also using Intel OpenCL like @richardschulze - and just dumping

set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_SKIP_BUILD_RPATH ON)

at the start of CMakeLists.txt doesn't help, nor adding it in a *.cmake file inside the cmake directory.

@CNugteren
Owner

I think CMakeLists.txt should be the right place. Are you sure you then removed all intermediate build files and started from scratch? CMake keeps a cache of previous settings.
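
A small sketch of starting from scratch without deleting the whole build directory, assuming the cached state lives in CMakeCache.txt and CMakeFiles (both appear in the ls output at the top of this issue). drop_cmake_cache is a hypothetical helper name.

```shell
# Hypothetical helper: remove CMake's cached settings and intermediate
# state from a build directory so the next cmake run starts fresh.
drop_cmake_cache() {
    dir=${1:?build directory required}
    rm -rf "$dir/CMakeCache.txt" "$dir/CMakeFiles"
}

# Example usage from the source directory (not executed here):
#   drop_cmake_cache build
#   cd build && cmake -DTUNERS=ON .. && make
```

Removing the entire build directory and recreating it achieves the same thing and is the more thorough option.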

@kodonnell

I think so - I just removed the entire build directory and rebuilt, and it still segfaults. Actually, the first time it ran through without any segfault (though with heaps of failed tests), but other times it segfaults. The device is an Intel i5-5250U.
