Getting a SIGBUS on the cluster #4

Closed
msoos opened this issue Nov 30, 2020 · 2 comments

msoos commented Nov 30, 2020

Hi Nicolas,

I'm getting a SIGBUS on the cluster. Apparently: "The BUS signal is sent to a process when it causes a bus error, such as an incorrect memory access alignment or non-existent physical address."

Output:

c
c This is glucose-gpu 1.0 --  based on MiniSAT (Many thanks to MiniSAT team)
c
c Setting block count guideline to 30 (twice the number of multiprocessors)
c running elimination
c |  Eliminated clauses:           0.02 Mb                                                                |
c finished running elimination, 70 variables were eliminated
c |  all clones generated. Memory = 44386.14Mb.                                                             |
c ========================================================================================================|
All solvers launched

That's it, it exits there. I run everything under /usr/bin/time -v, so I get some kernel info about the process. For this run it is:

Command terminated by signal 7
        Command being timed: "./glucose-gpu -thread-count=12 mp1-22.1.cnf"
        User time (seconds): 0.62
        System time (seconds): 0.30
        Percent of CPU this job got: 91%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.01
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 135404
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 35
        Minor (reclaiming a frame) page faults: 17013
        Voluntary context switches: 196
        Involuntary context switches: 133
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

I guess I'd need access to an interactive shell to be able to debug? Do you think we could increase the verbosity somehow, so we could get some info on where it's dying? I think that might help with #1 as well. In the meantime, I'm filing a support request with the cluster developers to learn how I can get an interactive GPU-enabled shell :)
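One way to see where a batch job dies without an interactive shell is to have the process print a host backtrace when it receives SIGBUS. The sketch below is only an illustration under stated assumptions: `installBusErrorHandler` and `busErrorHandler` are hypothetical names that are not part of glucose-gpu, and the code relies on glibc's `backtrace` and `backtrace_symbols_fd`, so it assumes a Linux/glibc host.

```cuda
#include <csignal>
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc-specific)
#include <unistd.h>    // write, STDERR_FILENO

// Hypothetical helper, not part of glucose-gpu: print the host call stack
// when SIGBUS arrives, then let the default action terminate the process.
static void busErrorHandler(int sig) {
    static const char msg[] = "c caught SIGBUS, host backtrace follows\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;

    void *frames[64];
    int depth = backtrace(frames, 64);
    // Writes the frames straight to the file descriptor.
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);

    // Restore the default action and re-raise, so the process still ends
    // with "terminated by signal 7" as far as the parent is concerned.
    signal(sig, SIG_DFL);
    raise(sig);
}

// Installs the handler; returns true on success.
bool installBusErrorHandler() {
    // Call backtrace once up front so that any lazy loading it needs
    // happens here rather than inside the signal handler.
    void *warmup[1];
    backtrace(warmup, 1);

    struct sigaction sa = {};
    sa.sa_handler = busErrorHandler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGBUS, &sa, nullptr) == 0;
}
```

Calling `installBusErrorHandler()` early in `main` would be enough. Function names only show up in the output if the binary is linked with `-rdynamic`; otherwise the printed addresses have to be resolved afterwards, for example with `addr2line`.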


msoos commented Nov 30, 2020

OK, I managed to get an interactive shell on the cluster with a Tesla K40, yay! I have compiled this on the node itself, which is nice. It compiles and runs to a point, but then:

[matesoos@gpu1701 gpu]$ ./glucose-gpu -thread-count=2 mizh-md5-47-3.cnf.gz 
c
c This is glucose-gpu 1.0 --  based on MiniSAT (Many thanks to MiniSAT team)
c
c Setting block count guideline to 30 (twice the number of multiprocessors)
c running elimination
c |  Eliminated clauses:           1.68 Mb                                                                |
c finished running elimination, 34016 variables were eliminated
c |  all clones generated. Memory = 151983.09Mb.                                                             |
c ========================================================================================================|
All solvers launched
Bus error
[matesoos@gpu1701 gpu]$ cuda-gdb ./glucose-gpu 
NVIDIA (R) CUDA Debugger
10.1 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./glucose-gpu...done.
(cuda-gdb) r  -thread-count=2 mizh-md5-47-3.cnf.gz
Starting program: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/glucose-gpu -thread-count=2 mizh-md5-47-3.cnf.gz
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/app/gcc/4.9.3/lib64/libstdc++.so.6.0.20-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /app/gcc/4.9.3/lib64/libstdc++.so.6.0.20-gdb.py
line to your configuration file "/home/users/industry/iitk/matesoos/.cuda-gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/users/industry/iitk/matesoos/.cuda-gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
c
c This is glucose-gpu 1.0 --  based on MiniSAT (Many thanks to MiniSAT team)
c
[New Thread 0x2aaaadc1c700 (LWP 15474)]
[New Thread 0x2aaaade1d700 (LWP 15475)]
c Setting block count guideline to 30 (twice the number of multiprocessors)
c running elimination
c |  Eliminated clauses:           1.68 Mb                                                                |
c finished running elimination, 34016 variables were eliminated
c |  all clones generated. Memory = 151983.08Mb.                                                             |
c ========================================================================================================|
[New Thread 0x2aaaac036700 (LWP 15478)]
[New Thread 0x2aaaac237700 (LWP 15479)]
All solvers launched

Thread 1 "glucose-gpu" received signal SIGBUS, Bus error.
0x0000000000438d43 in Glucose::DestrCheckPointer::DestrCheckPointer (this=0x7fffffefc730, destrCheck=...)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cu:60
60	    val = *destrCheck.ptr;
(cuda-gdb) bt
#0  0x0000000000438d43 in Glucose::DestrCheckPointer::DestrCheckPointer (this=0x7fffffefc730, destrCheck=...)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cu:60
#1  0x000000000043553f in Glucose::ArrAllocator<char>::getDArr (this=0x102b8e0)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:531
#2  Glucose::CorrespArr<char>::tryGetDArr (careAboutCurrentDeviceValues=true, dArr=..., this=0x102b8a0)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:687
#3  Glucose::CorrespArr<char>::getDArr (careAboutCurrentDeviceValues=true, this=0x102b8a0)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:696
#4  Glucose::ArrPair<Glucose::DClauseUpdate>::getDArr (this=0x7fffffefc9f0)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/ContigCopy.cuh:173
#5  Glucose::ClUpdateSet::getDClauseUpdates (this=0x7fffffefc9f0)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/ClauseUpdates.cu:24
#6  0x000000000043fa18 in Glucose::GpuRunner::<lambda(int, int)>::operator() (threadsPerBlock=512, blockCount=30, 
    __closure=0x102eed0) at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/GpuRunner.cu:317
#7  std::_Function_handler<void(int, int), Glucose::GpuRunner::startGpuRunAsync(CUstream_st*&, Glucose::vec<Glucose::AssigIdsPerSolver>&, std::unique_ptr<Glucose::Reporter<Glucose::ReportedClause> >&)::<lambda(int, int)> >::_M_invoke(const std::_Any_data &, int, int) (__functor=..., __args#0=30, __args#1=512) at /app/gcc/4.9.3/include/c++/4.9.3/functional:2039
#8  0x00000000004453e3 in std::function<void (int, int)>::operator()(int, int) const (__args#1=<optimized out>, 
    __args#0=<optimized out>, this=0x7fffffefc910) at /app/gcc/4.9.3/include/c++/4.9.3/functional:2439
#9  Glucose::runGpuAdjustingDims(int&, int, std::function<void (int, int)>) (warpsPerBlockGuideline=@0x102b990: 16, 
    totalWarps=480, func=...) at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/Helper.cu:96
#10 0x0000000000443249 in Glucose::GpuRunner::startGpuRunAsync (this=this@entry=0x102b8a0, stream=@0x7fffffefcdd0: 0xe51c90, 
    assigIdsPerSolver=..., reporter=...) at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/GpuRunner.cu:318
#11 0x0000000000443c55 in Glucose::GpuRunner::wholeRun (canStart=true, this=0x102b8a0)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/GpuRunner.cu:230
#12 Glucose::GpuRunner::execute (this=0x102b8a0)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/GpuRunner.cu:246
#13 0x000000000043f0ae in Glucose::GpuMultiSolver::solve (this=this@entry=0x10310f0, _cpuThreadCount=_cpuThreadCount@entry=2)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/GpuMultiSolver.cu:119
#14 0x00000000004455ba in runGpuSolver (compRoot=..., gpuOptions=..., commonOpts=..., memUsedOneSolver=32.1171875)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/Main.cu:56
#15 0x0000000000409d57 in main (argc=2, argv=<optimized out>)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/Main.cu:106
(cuda-gdb) f 0
#0  0x0000000000438d43 in Glucose::DestrCheckPointer::DestrCheckPointer (this=0x7fffffefc730, destrCheck=...)
    at /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cu:60
60	    val = *destrCheck.ptr;
(cuda-gdb) p destrCheck 
$1 = (const Glucose::DestrCheck &) @0x102b8f8: {ptr = 0x2300e97000}
(cuda-gdb) p destrCheck.ptr
$2 = (@managed int *) 0x2300e97000
(cuda-gdb) p *destrCheck.ptr
$3 = 711645630 // Resident on GPU

Note that this is suspiciously close to what I'm getting on my GTX940MX:

soos@vvv-dejavu:gpu$ cuda-gdb ./glucose-gpu 
NVIDIA (R) CUDA Debugger
11.1 release
Portions Copyright (C) 2007-2020 NVIDIA Corporation
GNU gdb (GDB) 8.3.1
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./glucose-gpu...
(cuda-gdb) r mizh-md5-47-3.cnf.gz 
Starting program: /home/soos/development/sat_solvers/gpu-share-sat/gpu/glucose-gpu mizh-md5-47-3.cnf.gz
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
c
c This is glucose-gpu 1.0 --  based on MiniSAT (Many thanks to MiniSAT team)
c
[Detaching after fork from child process 99882]
[New Thread 0x7ffff63fa000 (LWP 99898)]
[New Thread 0x7ffff5bf9000 (LWP 99899)]
[New Thread 0x7ffff53f8000 (LWP 99900)]
c Setting block count guideline to 6 (twice the number of multiprocessors)
c running elimination
c |  Eliminated clauses:           1.68 Mb                                                                |
c finished running elimination, 34016 variables were eliminated
c |  Automatic Adjustement of the number of solvers. MaxMemory= 8000, MaxCores=  7.                       |
c |  One Solver is taking 31.27Mb... Let's take 7 solvers for this run (max 40% of the maxMemory).       |
c |  all clones generated. Memory = 45688.56Mb.                                                             |
c ========================================================================================================|
[New Thread 0x7ffff3d99000 (LWP 99914)]
[New Thread 0x7ffff3598000 (LWP 99915)]
[New Thread 0x7ffff2d97000 (LWP 99916)]
[New Thread 0x7ffff2596000 (LWP 99917)]
[New Thread 0x7ffff1d95000 (LWP 99918)]
[New Thread 0x7ffff1594000 (LWP 99919)]
[New Thread 0x7ffff0d93000 (LWP 99920)]
All solvers launched
warning: Cuda API error detected: cudaLaunchKernel returned (0x2bd)

warning: Cuda API error detected: cudaGetLastError returned (0x2bd)

Got error too many resources requested for launch when launching the GPU, decreasing the number of warps per block from 16 to 13. Total warps was 96
warning: Cuda API error detected: cudaLaunchKernel returned (0x2bd)

warning: Cuda API error detected: cudaGetLastError returned (0x2bd)

Got error too many resources requested for launch when launching the GPU, decreasing the number of warps per block from 13 to 10. Total warps was 96

Thread 1 "glucose-gpu" received signal CUDA_EXCEPTION_15, Invalid Managed Memory Access.
0x0000555555594973 in Glucose::DestrCheckPointer::DestrCheckPointer (this=0x7fffffefcbc0, destrCheck=...)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/CorrespArr.cu:60
60	    val = *destrCheck.ptr;
(cuda-gdb) bt
#0  0x0000555555594973 in Glucose::DestrCheckPointer::DestrCheckPointer (this=0x7fffffefcbc0, destrCheck=...)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/CorrespArr.cu:60
#1  0x00005555555917c0 in Glucose::ArrAllocator<char>::getDArr (this=0x555555e8cdb0)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/CorrespArr.cuh:530
#2  Glucose::CorrespArr<char>::tryGetDArr (careAboutCurrentDeviceValues=true, dArr=..., this=0x555555e8cd70)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/CorrespArr.cuh:687
#3  Glucose::CorrespArr<char>::getDArr (careAboutCurrentDeviceValues=true, this=0x555555e8cd70)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/CorrespArr.cuh:696
#4  Glucose::ArrPair<Glucose::DClauseUpdate>::getDArr (this=this@entry=0x7fffffefcee0)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/ContigCopy.cuh:173
#5  0x00005555555915a2 in Glucose::ClUpdateSet::getDClauseUpdates (this=0x7fffffefcee0)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/ClauseUpdates.cu:24
#6  0x000055555559c0ea in operator() (threadsPerBlock=320, blockCount=6, __closure=0x555555d8f540)
    at /opt/cuda/targets/x86_64-linux/include/vector_types.h:421
#7  std::__invoke_impl<void, Glucose::GpuRunner::startGpuRunAsync(CUstream_st*&, Glucose::vec<Glucose::AssigIdsPerSolver>&, std::unique_ptr<Glucose::Reporter<Glucose::ReportedClause> >&)::<lambda(int, int)>&, int, int> (__f=...)
    at /usr/include/c++/10.2.0/bits/invoke.h:60
#8  std::__invoke_r<void, Glucose::GpuRunner::startGpuRunAsync(CUstream_st*&, Glucose::vec<Glucose::AssigIdsPerSolver>&, std::unique_ptr<Glucose::Reporter<Glucose::ReportedClause> >&)::<lambda(int, int)>&, int, int> (__fn=...)
    at /usr/include/c++/10.2.0/bits/invoke.h:153
#9  std::_Function_handler<void(int, int), Glucose::GpuRunner::startGpuRunAsync(CUstream_st*&, Glucose::vec<Glucose::AssigIdsPerSolver>&, std::unique_ptr<Glucose::Reporter<Glucose::ReportedClause> >&)::<lambda(int, int)> >::_M_invoke(const std::_Any_data &, int &, int &) (__functor=..., __args#0=<optimized out>, __args#1=<optimized out>)
    at /usr/include/c++/10.2.0/bits/std_function.h:291
#10 0x00005555555a1a1e in std::function<void (int, int)>::operator()(int, int) const (__args#1=<optimized out>, 
    __args#0=<optimized out>, this=0x7fffffefd0d0) at /usr/include/c++/10.2.0/bits/std_function.h:618
#11 Glucose::runGpuAdjustingDims(int&, int, std::function<void (int, int)>) (warpsPerBlockGuideline=@0x555555e8ce60: 10, 
    totalWarps=60, func=...) at /home/soos/development/sat_solvers/gpu-share-sat/gpu/Helper.cu:96
#12 0x000055555559e6f3 in Glucose::GpuRunner::startGpuRunAsync (this=<optimized out>, 
    stream=@0x7fffffefd380: 0x555555deb680, assigIdsPerSolver=..., reporter=...)
    at /usr/include/c++/10.2.0/bits/std_function.h:87
#13 0x000055555559f5a1 in Glucose::GpuRunner::wholeRun (this=0x555555e8cd70, canStart=true)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/GpuRunner.cu:230
#14 0x000055555559f712 in Glucose::GpuRunner::execute (this=0x555555e8cd70)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/GpuRunner.cu:246
#15 0x000055555559b229 in Glucose::GpuMultiSolver::solve (this=this@entry=0x555555cd5170, 
    _cpuThreadCount=_cpuThreadCount@entry=7) at /home/soos/development/sat_solvers/gpu-share-sat/gpu/GpuMultiSolver.cu:119
#16 0x00005555555a1cba in runGpuSolver (compRoot=..., gpuOptions=..., commonOpts=..., memUsedOneSolver=31.2734375)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/Main.cu:56
#17 0x0000555555561b21 in main (argc=<optimized out>, argv=<optimized out>)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/GpuMultiSolver.cuh:78
(cuda-gdb) f 0
#0  0x0000555555594973 in Glucose::DestrCheckPointer::DestrCheckPointer (this=0x7fffffefcbc0, destrCheck=...)
    at /home/soos/development/sat_solvers/gpu-share-sat/gpu/CorrespArr.cu:60
60	    val = *destrCheck.ptr;
(cuda-gdb) p destrCheck
$1 = (const Glucose::DestrCheck &) @0x555555e8cdc8: {ptr = 0xb00e17000}
(cuda-gdb) p *destrCheck.ptr
$2 = 1414647625 // Resident on GPU

So I am guessing these are similar/close? In my case, I get to run a bit and then bump into signal CUDA_EXCEPTION_15, Invalid Managed Memory Access, while the cluster run gets SIGBUS, Bus error earlier, at the exact same spot. Note that the SIGBUS could well be the same error as the CUDA_EXCEPTION_15, just reported less precisely -- it seems to me that e.g. an unaligned memory read/write could cause both, but one report is more "precise" than the other. BTW, I think we are making a lot of progress; maybe this is a single bug somewhere that, once fixed, will solve both issues :)

I hope this helps! Thanks in advance for helping debug this,

Mate

PS: Note to self. To create an interactive GPU shell on the cluster, use: qsub -I -q gpu -l walltime=1:00:00 -P 11000744 then: module unload gcc/4.9.3, module load cuda/10.1, module load gcc/4.9.3.


msoos commented Nov 30, 2020

Yaaaaay! It's working now!

Ah, I think this was the one that was causing issues, in particular: Since we are launching several NAND gates concurrently on a single device, while one NAND gate is running a kernel that accesses some unified memory, another NAND gate accesses some other unified memory from the host. This is not allowed on devices with compute capability < 6.x (see "Unified memory coherency and concurrency").
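The rule above can be sketched in a few lines of CUDA. This is a minimal illustration, not necessarily how the actual fix was implemented: `bumpCounter` and `readManagedAfterSync` are hypothetical names, and the toy kernel stands in for a long-running solver kernel. Both GPUs in this thread are below compute capability 6.x (the Tesla K40 is 3.5, the 940MX is Maxwell-class 5.0), so on them the host must not touch any managed allocation while a kernel is active, even one the kernel never uses. The `cudaDevAttrConcurrentManagedAccess` attribute reports whether the current device lifts that restriction.

```cuda
#include <cuda_runtime.h>

// Toy kernel standing in for a long-running solver kernel.
__global__ void bumpCounter(int *counter) {
    atomicAdd(counter, 1);
}

// Hypothetical demo: returns the counter value read on the host,
// or -1 if any CUDA call failed (for example, no usable GPU).
int readManagedAfterSync() {
    int device = 0;
    int concurrent = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    // Non-zero only where host and device may access managed memory concurrently.
    if (cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess,
                               device) != cudaSuccess) return -1;

    int *counter = nullptr;  // managed memory used by the kernel
    int *flag = nullptr;     // a different managed allocation, used by the host
    if (cudaMallocManaged(&counter, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&flag, sizeof(int)) != cudaSuccess) {
        cudaFree(counter);
        return -1;
    }
    *counter = 41;  // fine: no kernel is running yet
    *flag = 0;

    bumpCounter<<<1, 1>>>(counter);
    bool ok = (cudaGetLastError() == cudaSuccess);

    if (ok && !concurrent) {
        // Without concurrent managed access, wait for the GPU to go idle
        // before the host touches ANY managed memory, including `flag`.
        ok = (cudaDeviceSynchronize() == cudaSuccess);
    }
    if (ok) *flag = 1;  // the kind of host access that would otherwise fault

    // The kernel must have finished before the host reads `counter`.
    ok = ok && (cudaDeviceSynchronize() == cudaSuccess);
    int result = ok ? *counter : -1;

    cudaFree(counter);
    cudaFree(flag);
    return result;
}
```

Dropping the first `cudaDeviceSynchronize()` on a device without concurrent managed access is the pattern that matches the crashes above: a plain host dereference of a managed pointer while a kernel is still running.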

So that would explain it, maybe? Anyway, it's all good now! And it just finished solving an instance. Amazing! So we'll be able to use the cluster to run tons of experiments with 24 real cores! This is fantastic. Thank you so much for fixing this issue. I'm about to schedule a run, but I see the output is too verbose by default -- IO is slow on the cluster, and the output takes up disk space, which is scarce. I'm opening a new issue about that and about the configs you'd like me to run. Then we'll be good to go :)

Looking forward to running this on the cluster,

Mate

msoos closed this as completed Nov 30, 2020