This repository has been archived by the owner on Dec 30, 2019. It is now read-only.

cublas tests fail #2

Open
drahnr opened this issue May 24, 2017 · 45 comments


drahnr commented May 24, 2017

They are not failing all the time; sometimes the first few pass, sometimes they all pass.


Anton-4 commented Oct 24, 2017

I've narrowed down this issue a bit. The dot and nrm2 operations seem to be the core of the problem.
When ignoring these and running the tests single-threaded, all other tests always succeed.
When one operation fails, the CUDA context is broken and all following CUDA tests fail.
Due to parallel testing, this resulted in a varying number of tests failing across runs.


drahnr commented Oct 24, 2017

Awesome! I would love to know what is going wrong there exactly :)

Note that @hobofan suspected that the allocated memory does not satisfy some required constraints - I have not verified this yet.


Anton-4 commented Oct 24, 2017

Can you assign me here?
The underlying CUDA error code is "unknown error", so this won't be easy to figure out.
I'll check if it is caused by a problem with memory allocation.


drahnr commented Oct 24, 2017

@Anton-4 right now it seems I can only assign team members, so I am going to do some magic to make this work and add you to a subteam named @spearow/contributor - as soon as you accept that, I can assign issues to you :)

I am aware that this is not easy. I'd recommend starting with the API docs NVIDIA provides; maybe "unknown" is not exactly correct and is only reported because not all error codes are differentiated properly.

Memory alignment can be checked with assert_eq!(mem::align_of::<Foo>(), 16);
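For illustration, a minimal sketch of such an alignment check in a test; the Foo type and the 16-byte requirement are hypothetical, not taken from coaster:

use std::mem;

// Hypothetical stand-in for a buffer descriptor; the 16-byte alignment
// requirement is only an assumption for this example.
#[repr(C, align(16))]
struct Foo {
    data: [f32; 4],
}

#[test]
fn foo_is_16_byte_aligned() {
    // Static alignment of the type itself.
    assert_eq!(mem::align_of::<Foo>(), 16);

    // The alignment of a concrete value can be checked via its address.
    let x = Foo { data: [0.0; 4] };
    assert_eq!((&x as *const Foo as usize) % 16, 0);
}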


Anton-4 commented Oct 25, 2017

Some updates:

  • the issue is definitely memory related and most likely happens in coaster itself, not coaster-blas
  • running the dot test by itself succeeds; when running another test before dot, the dot test fails,
    so my current guess is that some memory is not correctly cleaned up after writing.
  • some cuda-memcheck errors:
========= Program hit cudaErrorInvalidDevice (error 10) due to "invalid device ordinal" on CUDA API call to cudaEventQuery. 

========= Program hit cudaErrorInvalidDevice (error 10) due to "invalid device ordinal" on CUDA API call to cudaGetLastError. 

========= Invalid __global__ write of size 4
=========     at 0x000004d0 in void dot_kernel<float, float, float, int=128, int=0, int=0>(cublasDotParams<float, float>)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x705c40000 is out of bounds

Getting a little bit closer every day :)


drahnr commented Oct 25, 2017

Did you try cuda-memcheck --leak-check full to get the origin of the allocations?

This sounds awfully similar to coaster-nn/#11, which seems to point even more towards coaster ... specifically ./src/frameworks/cuda/api/driver/memory.rs and the FFI in there.
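For reference, a minimal sketch (not the actual coaster code; names and error handling are simplified) of the kind of driver-API FFI wrapper that memory.rs contains:

use std::os::raw::{c_int, c_ulonglong};

type CUresult = c_int;          // 0 == CUDA_SUCCESS
type CUdeviceptr = c_ulonglong; // a device pointer is a 64-bit integer handle

extern "C" {
    fn cuMemAlloc_v2(dptr: *mut CUdeviceptr, bytesize: usize) -> CUresult;
    fn cuMemFree_v2(dptr: CUdeviceptr) -> CUresult;
}

/// Allocate `bytes` of device memory and return the raw handle on success.
unsafe fn device_alloc(bytes: usize) -> Result<CUdeviceptr, CUresult> {
    let mut ptr: CUdeviceptr = 0;
    match cuMemAlloc_v2(&mut ptr, bytes) {
        0 => Ok(ptr),
        err => Err(err),
    }
}

/// Free a previously allocated device pointer.
unsafe fn device_free(ptr: CUdeviceptr) -> Result<(), CUresult> {
    match cuMemFree_v2(ptr) {
        0 => Ok(()),
        err => Err(err),
    }
}

Every cuMemAlloc_v2 should eventually be paired with a cuMemFree_v2; the leak check should show where that pairing breaks down.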

Thanks for taking care of this. Much appreciated!


Anton-4 commented Oct 27, 2017

Relevant output for cuda-memcheck --leak-check full:

test cuda_f32::it_computes_correct_asum ... ========= Leaked 32768 bytes at 0x705c40000
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuMemAlloc_v2 + 0x17f) [0x22076f]
=========
========= Leaked 1024 bytes at 0x705a40600
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuMemAlloc_v2 + 0x17f) [0x22076f]
=========
========= Leaked 112 bytes at 0x705a40400
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuMemAlloc_v2 + 0x17f) [0x22076f]

========= LEAK SUMMARY: 33904 bytes leaked in 3 allocations

0x705c40000 points to an assembly instruction (lea); any suggestions on how to find the corresponding Rust code?


Anton-4 commented Oct 28, 2017

updates:

  • I checked whether all the memory allocated via
    coaster/src/frameworks/cuda/api/driver/memory.rs is freed, and it is. The total memory allocated there is 48 bytes, which is a big difference
    from the 32 kB leaked.
  • I also ran the memcheck with an extra option --report-api-errors all, which revealed another error:
    Program hit CUDA_ERROR_NO_BINARY_FOR_GPU (error 209) due to "no kernel image is    available for execution on the device" on CUDA API call to load binary via CUDA runtime.
    
    I haven't been able to find anything useful with this error yet.
  • I also set up debugging with lldb in VS Code to try and trace the dot operation as deeply as possible.
    I still need to replace some macros so I can trace deeper.


drahnr commented Oct 29, 2017

  • The internal stuff here might be caused by some static allocations done by the library itself?
  • a similar issue, essentially lacking the PTX JIT compiler and therefore hitting CUDA_ERROR_NO_BINARY_FOR_GPU: Theano/libgpuarray#289
  • Which macros are you talking about? The Rust ones or some cuBLAS header macros?


Anton-4 commented Oct 29, 2017

  • Indeed, I've been thinking the same thing
  • will check it out
  • Rust macros


drahnr commented Oct 29, 2017

On Fedora, the PTX JIT compiler is provided by the following package (from the negativo17.org nvidia repo):

dnf provides '/usr/lib64/libnvidia-ptxjitcompiler.so.384.90'
Last metadata expiration check: 0:09:27 ago on Sun 29 Oct 2017 10:54:39 PM CET.

nvidia-driver-cuda-libs-2:384.90-1.fc26.x86_64 : Libraries for nvidia-driver-cuda
Repo        : @System
Matched from:
Filename    : /usr/lib64/libnvidia-ptxjitcompiler.so.384.90

nvidia-driver-cuda-libs-2:384.90-1.fc26.x86_64 : Libraries for nvidia-driver-cuda
Repo        : fedora-nvidia
Matched from:
Filename    : /usr/lib64/libnvidia-ptxjitcompiler.so.384.90

So this file is present on my test machine, yet this happens too.


Did you figure out a way to map the instruction back to a Rust function?

If there are only a few suspects, #[no_mangle] might be enough.
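For example (hypothetical function name, just to illustrate the attribute):

// Hypothetical: keep the symbol name readable in cuda-memcheck backtraces.
#[no_mangle]
pub extern "C" fn suspect_alloc_helper(len: usize) -> usize {
    // Body elided; the attribute only affects how the symbol is named.
    len
}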

I am currently tied to a GTX 460, but I can test later this week on a GTX 1050 to see if it makes any difference.


Anton-4 commented Oct 30, 2017

From what I've read NO_BINARY_FOR_GPU can have many causes.
I will give #[no_mangle] a try.


drahnr commented Oct 30, 2017

The cuBLAS doc also states that CUDA_FORCE_PTX_JIT=1 uses a fallback path based on PTX, so that might be a useful cross-check too.


Anton-4 commented Oct 30, 2017

Running the dot test by itself normally succeeds, but it fails when setting CUDA_FORCE_PTX_JIT=1.
The memcheck also shows cudaErrorInvalidDeviceFunction errors that didn't occur before.
The CUDA_ERROR_NO_BINARY_FOR_GPU error is repeated a lot more often: 203 times vs 34 times before.
The number of bytes leaked is still the same.

(repeated 203 times)========= Program hit CUDA_ERROR_NO_BINARY_FOR_GPU (error 209) due to "no kernel image is available for execution on the device" on CUDA API call to load binary via CUDA runtime. 

========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaFuncGetAttributes. 

========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaGetLastError. 

========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaLaunch. 

========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaGetLastError. 


Anton-4 commented Oct 30, 2017

I've seen some cases where CUDA_ERROR_NO_BINARY_FOR_GPU happens when no compute capability is specified and a default value is used.
Usually this is solved by passing the compute capability as a flag to nvcc, but I don't know where we could set it in our case.
From what I can see the compute capability also isn't queried anywhere, so I think it's quite likely this is the root cause.
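As a side note, a hedged sketch (not existing coaster code) of how the compute capability could be queried through the driver API, so unsupported devices could be detected up front:

use std::os::raw::c_int;

extern "C" {
    fn cuDeviceGetAttribute(value: *mut c_int, attrib: c_int, dev: c_int) -> c_int;
}

// Driver API enum values for the compute-capability attributes (from cuda.h).
const CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR: c_int = 75;
const CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR: c_int = 76;

/// Query the (major, minor) compute capability for a device handle.
unsafe fn compute_capability(device: c_int) -> Result<(c_int, c_int), c_int> {
    let (mut major, mut minor) = (0, 0);
    let r = cuDeviceGetAttribute(&mut major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, device);
    if r != 0 { return Err(r); }
    let r = cuDeviceGetAttribute(&mut minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, device);
    if r != 0 { return Err(r); }
    Ok((major, minor))
}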


drahnr commented Oct 31, 2017

In our case there is nothing we can specify: there is no nvcc pass, we only use the compiled binary. So the next step would be to figure out which targets are compiled into cublas, and whether there is:

  • a way to specify the used target via an API call
  • a way to specify the used target via ENV vars

So far I did not find anything related in the cublas doc.
I also checked the tensorflow integration of cublas, but I did not see anything that stands out.


Anton-4 commented Oct 31, 2017

The minimum compute capability for cuBLAS is 3.5 (I have 3.5). I also ran the simpleDevLibCUBLAS.cpp sample (in cuda/samples) successfully.


drahnr commented Oct 31, 2017

Alright, that also explains why the CI fails - the GTX 460 has compute capability 2.1.

I will get rid of the GTX 460 in the CI system and replace it with a GTX 1050 Ti (given that it works there).


drahnr commented Nov 2, 2017

Does this explain all cases of failing tests though?


Anton-4 commented Nov 3, 2017

I think the NO_BINARY_FOR_GPU error occurs for each individual test; I will double-check this evening.


Anton-4 commented Nov 3, 2017

It does indeed occur for every test. It's so strange that the dot product can be calculated while the error reports that there is no binary.


drahnr commented Nov 3, 2017

Tests with the GTX 1050:

4x rcublas tests without any errors.
4x rcudnn tests without any errors.
But coaster-blas still fails:

test cuda_f32::it_computes_correct_copy ... ok
test cuda_f32::it_computes_correct_nrm2 ... FAILED
test cuda_f32::it_computes_correct_scal ... ok
test cuda_f32::it_computes_correct_axpy ... ok
test cuda_f32::it_computes_correct_swap ... FAILED
test cuda_f32::it_computes_correct_dot ... FAILED
test cuda_f32::it_computes_correct_gemm ... FAILED
test cuda_f32::it_computes_correct_asum ... FAILED

failures:

---- cuda_f32::it_computes_correct_nrm2 stdout ----
	thread 'cuda_f32::it_computes_correct_nrm2' panicked at 'called `Result::unwrap()` on an `Err` value: Plugin(Operation("Unable to execute operation nrm2"))', /checkout/src/libcore/result.rs:906:4
note: Run with `RUST_BACKTRACE=1` for a backtrace.

---- cuda_f32::it_computes_correct_swap stdout ----
	thread 'cuda_f32::it_computes_correct_swap' panicked at 'called `Result::unwrap()` on an `Err` value: Plugin(Operation("Unable to execute operation swap"))', /checkout/src/libcore/result.rs:906:4

---- cuda_f32::it_computes_correct_dot stdout ----
	thread 'cuda_f32::it_computes_correct_dot' panicked at 'called `Result::unwrap()` on an `Err` value: Cuda(Unknown("Unable to synchronize CUDA context."))', /checkout/src/libcore/result.rs:906:4

---- cuda_f32::it_computes_correct_gemm stdout ----
	thread 'cuda_f32::it_computes_correct_gemm' panicked at 'called `Result::unwrap()` on an `Err` value: Plugin(Operation("Unable to execute operation gemm"))', /checkout/src/libcore/result.rs:906:4

---- cuda_f32::it_computes_correct_asum stdout ----
	thread 'cuda_f32::it_computes_correct_asum' panicked at 'called `Result::unwrap()` on an `Err` value: Cuda(Unknown("Unable to synchronize CUDA context."))', /checkout/src/libcore/result.rs:906:4


failures:
    cuda_f32::it_computes_correct_asum
    cuda_f32::it_computes_correct_dot
    cuda_f32::it_computes_correct_gemm
    cuda_f32::it_computes_correct_nrm2
    cuda_f32::it_computes_correct_swap

test result: FAILED. 19 passed; 5 failed; 0 ignored; 0 measured; 0 filtered out


Anton-4 commented Nov 4, 2017

As expected - that's how it is for me too.


drahnr commented Nov 4, 2017

I wasn't aware that the raw, coaster-free version succeeded reproducibly. In essence, the above findings are nice and good moving forward, but they are only indirectly related. Right now I think the most sensible next step is to figure out the difference in memory allocation between the tests. This must give a hint, since that is the only difference I can see right now between those tests.


Anton-4 commented Nov 5, 2017

I used the NVIDIA profiler nvprof to show all API calls when running just the gemm test (which has the memory leak):

==8463== Profiling application: ./target/debug/blas_specs-fcffebea5c727b82
==8463== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 36.66%  10.112us         1  10.112us  10.112us  10.112us  sgemm_sm35_ldg_nn_64x16x64x16x16
 36.43%  10.048us         1  10.048us  10.048us  10.048us  sgemm_sm35_ldg_tt_32x16x64x8x16
 13.46%  3.7120us         2  1.8560us  1.7600us  1.9520us  [CUDA memcpy DtoH]
 13.46%  3.7120us         5     742ns     640ns  1.0880us  [CUDA memcpy HtoD]

==8463== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 47.38%  152.09ms         1  152.09ms  152.09ms  152.09ms  cudaFree
 28.72%  92.173ms         1  92.173ms  92.173ms  92.173ms  cuCtxDestroy
 23.59%  75.730ms         1  75.730ms  75.730ms  75.730ms  cuCtxCreate
  0.09%  275.69us        93  2.9640us     100ns  108.88us  cuDeviceGetAttribute
  0.05%  159.40us         3  53.132us  7.2020us  141.04us  cudaMalloc
  0.04%  139.33us         6  23.222us  4.2070us  113.24us  cuMemAlloc
  0.03%  111.82us         3  37.273us  28.873us  42.365us  cuDeviceGetName
  0.02%  61.118us         2  30.559us  28.246us  32.872us  cuMemcpyDtoH
  0.02%  59.542us         2  29.771us  28.549us  30.993us  cuCtxSynchronize
  0.02%  48.699us         1  48.699us  48.699us  48.699us  cuDeviceTotalMem
  0.01%  30.451us         6  5.0750us  4.3540us  6.4760us  cuMemFree
  0.01%  29.671us         2  14.835us  10.136us  19.535us  cudaLaunch
  0.01%  19.915us         4  4.9780us  3.4100us  8.5930us  cuMemcpyHtoD
  0.00%  8.3340us         1  8.3340us  8.3340us  8.3340us  cudaMemcpy
  0.00%  7.6730us        16     479ns     372ns  1.2070us  cudaEventCreateWithFlags
  0.00%  6.0230us        28     215ns     110ns  2.5490us  cudaSetupArgument
  0.00%  5.8080us         2  2.9040us     490ns  5.3180us  cudaConfigureCall
  0.00%  3.4660us        11     315ns     211ns     857ns  cudaDeviceGetAttribute
  0.00%  2.1220us         5     424ns     148ns  1.1020us  cuDeviceGetCount
  0.00%  1.3860us         5     277ns     160ns     429ns  cuDeviceGet
  0.00%  1.0540us         2     527ns     428ns     626ns  cuInit
  0.00%     974ns         1     974ns     974ns     974ns  cudaGetDevice
  0.00%     551ns         2     275ns     234ns     317ns  cudaGetLastError
  0.00%     315ns         1     315ns     315ns     315ns  cuDriverGetVersion

I have also done the same for a working C++ binary performing the sgemm operation:

==9113== Profiling application: ./Debug/cublaslib
==9113== Profiling result:
Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 40.26%  164.19us             1               0  164.19us  164.19us  164.19us  invokeDeviceCublasSgemm()
 25.12%  102.43us             6               -  17.072us     960ns  25.472us  [CUDA memcpy HtoD]
 21.96%  89.536us             1               1  44.768us  39.904us  49.632us  sgemm_sm35_ldg_nn_64x16x64x16x16
 12.66%  51.648us             3               -  17.216us  2.1760us  24.768us  [CUDA memcpy DtoH]

==9113== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 52.81%  266.27ms         8  33.284ms  5.2240us  265.97ms  cudaMalloc
 43.04%  217.02ms         9  24.113ms  5.7130us  216.67ms  cudaFree
  3.84%  19.371ms         2  9.6853ms  27.948us  19.343ms  cudaLaunch
  0.14%  707.35us       178  3.9730us     101ns  197.12us  cuDeviceGetAttribute
  0.13%  642.55us         9  71.394us  5.6600us  203.29us  cudaMemcpy
  0.02%  107.64us         2  53.821us  48.422us  59.220us  cuDeviceTotalMem
  0.01%  65.841us         2  32.920us  29.260us  36.581us  cuDeviceGetName
  0.00%  8.1910us        21     390ns     112ns  3.3750us  cudaSetupArgument
  0.00%  7.2520us        16     453ns     353ns  1.1850us  cudaEventCreateWithFlags
  0.00%  6.4340us        16     402ns     314ns  1.2740us  cudaEventDestroy
  0.00%  5.1700us         2  2.5850us  2.5320us  2.6380us  cudaThreadSynchronize
  0.00%  3.4650us         2  1.7320us     965ns  2.5000us  cudaConfigureCall
  0.00%  3.2740us        11     297ns     212ns     799ns  cudaDeviceGetAttribute
  0.00%  2.0050us         4     501ns     139ns  1.1960us  cuDeviceGetCount
  0.00%  1.0880us         4     272ns     194ns     383ns  cuDeviceGet
  0.00%     971ns         1     971ns     971ns     971ns  cudaGetDevice
  0.00%     774ns         2     387ns     303ns     471ns  cudaGetLastError
  0.00%     500ns         1     500ns     500ns     500ns  cuInit
  0.00%     416ns         1     416ns     416ns     416ns  cuDriverGetVersion

nvprof can also provide detailed info about the memory allocations.
I think next I'll change the C++ code to use the asum operation, since gemm is quite complicated.
If we then compare all API calls and memory allocations for both binaries, it shouldn't be so hard to
figure out what goes wrong.

Glad I found this :)


drahnr commented Nov 6, 2017

That's nice, I did not know about this tool yet!


Anton-4 commented Nov 9, 2017

I wrote some C++ in which I call asum with the same arguments as in our test, then ran the profiler, yielding:

==6284== Profiling application: ./cublaslib
==6284== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 67.42%  6.6880us         2  3.3440us  2.4960us  4.1920us  void asum_kernel<float, float, int=0>(cublasAsumParams<float, float>)
 17.10%  1.6960us         2     848ns     640ns  1.0560us  [CUDA memcpy HtoD]
 15.48%  1.5360us         1  1.5360us  1.5360us  1.5360us  [CUDA memcpy DtoH]

==6284== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 86.33%  1.12172s         4  280.43ms  8.2760us  1.12157s  cudaMalloc
 13.61%  176.79ms         5  35.358ms  6.8350us  176.58ms  cudaFree
  0.05%  606.13us       178  3.4050us     100ns  159.32us  cuDeviceGetAttribute
  0.01%  129.47us         2  64.733us  58.526us  70.940us  cuDeviceTotalMem
  0.01%  66.624us         2  33.312us  28.107us  38.517us  cuDeviceGetName
  0.00%  24.177us         2  12.088us  5.6080us  18.569us  cudaLaunch
  0.00%  17.985us         2  8.9920us  8.1330us  9.8520us  cudaMemcpy
  0.00%  12.037us         1  12.037us  12.037us  12.037us  cudaMemcpyAsync
  0.00%  7.5020us        16     468ns     351ns  1.2500us  cudaEventCreateWithFlags
  0.00%  5.9210us        16     370ns     307ns     797ns  cudaEventDestroy
  0.00%  4.6400us         2  2.3200us  2.0900us  2.5500us  cudaThreadSynchronize
  0.00%  4.5450us         4  1.1360us     277ns  3.2820us  cuDeviceGetCount
  0.00%  3.3310us        11     302ns     210ns     846ns  cudaDeviceGetAttribute
  0.00%  3.2420us         1  3.2420us  3.2420us  3.2420us  cudaFuncGetAttributes
  0.00%  2.1970us         1  2.1970us  2.1970us  2.1970us  cudaStreamSynchronize
  0.00%  1.8710us         2     935ns     211ns  1.6600us  cudaSetupArgument
  0.00%  1.4160us         4     354ns     177ns     733ns  cuDeviceGet
  0.00%  1.4020us         1  1.4020us  1.4020us  1.4020us  cudaEventQuery
  0.00%  1.2490us         1  1.2490us  1.2490us  1.2490us  cudaGetDevice
  0.00%  1.0260us         1  1.0260us  1.0260us  1.0260us  cudaEventRecord
  0.00%     768ns         4     192ns     111ns     374ns  cudaGetLastError
  0.00%     731ns         1     731ns     731ns     731ns  cuInit
  0.00%     702ns         2     351ns     160ns     542ns  cudaConfigureCall
  0.00%     498ns         1     498ns     498ns     498ns  cuDriverGetVersion

I also ran only the asum test from coaster-blas:

==6234== Profiling application: ./target/debug/blas_specs-fcffebea5c727b82
==6234== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 61.36%  6.8160us         2  3.4080us  2.5280us  4.2880us  void asum_kernel<float, float, int=0>(cublasAsumParams<float, float>)
 18.44%  2.0480us         2  1.0240us     928ns  1.1200us  [CUDA memcpy HtoD]
 14.69%  1.6320us         1  1.6320us  1.6320us  1.6320us  [CUDA memcpy DtoH]
  5.52%     613ns         1     613ns     613ns     613ns  [CUDA memset]

==6234== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 50.52%  206.55ms         1  206.55ms  206.55ms  206.55ms  cudaFree
 28.98%  118.48ms         1  118.48ms  118.48ms  118.48ms  cuCtxDestroy
 20.31%  83.034ms         1  83.034ms  83.034ms  83.034ms  cuCtxCreate
  0.06%  250.46us        93  2.6930us     105ns  111.92us  cuDeviceGetAttribute
  0.04%  155.68us         3  51.894us  7.2610us  137.21us  cudaMalloc
  0.03%  104.98us         2  52.489us  5.1890us  99.790us  cuMemAlloc
  0.02%  77.925us         3  25.975us  23.796us  27.635us  cuDeviceGetName
  0.01%  60.167us         1  60.167us  60.167us  60.167us  cuDeviceTotalMem
  0.01%  30.249us         1  30.249us  30.249us  30.249us  cuMemcpyDtoH
  0.01%  25.167us         1  25.167us  25.167us  25.167us  cuCtxSynchronize
  0.01%  24.114us         2  12.057us  5.4050us  18.709us  cudaLaunch
  0.00%  14.169us         2  7.0840us  6.0380us  8.1310us  cuMemFree
  0.00%  8.6320us         1  8.6320us  8.6320us  8.6320us  cuMemcpyHtoD
  0.00%  8.3400us         1  8.3400us  8.3400us  8.3400us  cudaMemcpy
  0.00%  7.0170us        16     438ns     324ns  1.2780us  cudaEventCreateWithFlags
  0.00%  5.9830us         1  5.9830us  5.9830us  5.9830us  cudaMemsetAsync
  0.00%  5.8500us         4  1.4620us     115ns  5.2940us  cudaGetLastError
  0.00%  3.5370us         1  3.5370us  3.5370us  3.5370us  cudaFuncGetAttributes
  0.00%  3.3540us        11     304ns     216ns     757ns  cudaDeviceGetAttribute
  0.00%  2.5470us         2  1.2730us     186ns  2.3610us  cudaSetupArgument
  0.00%  1.7590us         1  1.7590us  1.7590us  1.7590us  cudaEventQuery
  0.00%  1.6560us         5     331ns     120ns  1.0230us  cuDeviceGetCount
  0.00%  1.4340us         1  1.4340us  1.4340us  1.4340us  cudaEventRecord
  0.00%  1.0920us         1  1.0920us  1.0920us  1.0920us  cudaGetDevice
  0.00%     991ns         5     198ns     148ns     310ns  cuDeviceGet
  0.00%     922ns         2     461ns     454ns     468ns  cuInit
  0.00%     612ns         2     306ns     168ns     444ns  cudaConfigureCall
  0.00%     351ns         1     351ns     351ns     351ns  cuDriverGetVersion

Looking at the difference in API calls, I think cuMemAlloc, cuMemFree and cudaMemsetAsync look
suspicious. I will continue investigating tomorrow.


drahnr commented Nov 9, 2017

Essentially, what we should try is to use cudaMalloc instead of the cuMem* driver calls.
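A minimal sketch of what the runtime-API alternative could look like, assuming we link against libcudart (not how coaster is currently written):

use std::os::raw::{c_int, c_void};

extern "C" {
    // CUDA runtime API counterparts of cuMemAlloc/cuMemFree.
    fn cudaMalloc(dev_ptr: *mut *mut c_void, size: usize) -> c_int;
    fn cudaFree(dev_ptr: *mut c_void) -> c_int;
}

/// Allocate device memory through the runtime API; 0 means cudaSuccess.
unsafe fn runtime_alloc(bytes: usize) -> Result<*mut c_void, c_int> {
    let mut ptr: *mut c_void = std::ptr::null_mut();
    match cudaMalloc(&mut ptr, bytes) {
        0 => Ok(ptr),
        err => Err(err),
    }
}

/// Free a runtime-API allocation.
unsafe fn runtime_free(ptr: *mut c_void) -> Result<(), c_int> {
    match cudaFree(ptr) {
        0 => Ok(()),
        err => Err(err),
    }
}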


Anton-4 commented Nov 10, 2017

I ran a full trace of all API calls with nvprof --print-api-trace and marked the differences (----->).
For asum.cpp:

109.40ms  1.1290us  cuDeviceGetCount
109.40ms     170ns  cuDeviceGetCount
109.40ms     314ns  cuDeviceGet
109.40ms     277ns  cuDeviceGetAttribute
109.58ms     348ns  cuDeviceGet
109.58ms     189ns  cuDeviceGetAttribute
109.58ms     129ns  cuDeviceGetAttribute
109.58ms     157ns  cuDeviceGetAttribute
109.61ms     187ns  cuDeviceGetCount
109.61ms     144ns  cuDeviceGet
109.61ms  28.775us  cuDeviceGetName
----->109.64ms  71.380us  cuDeviceTotalMem
109.71ms     168ns  cuDeviceGetAttributex30+
----->109.99ms  1.12856s  cudaMalloc
1.23860s     385ns  cuDriverGetVersion
1.23860s     539ns  cuInit
1.23862s     241ns  cuDeviceGetCount
1.23862s     232ns  cuDeviceGet
1.23862s  30.694us  cuDeviceGetName
----->1.23865s  62.987us  cuDeviceTotalMem
1.23871s     290ns  cuDeviceGetAttributex30+
1.23899s  173.62ms  cudaFree
1.41261s     949ns  cudaGetDevice
1.41262s     816ns  cudaDeviceGetAttributex11
1.41262s  13.617us  cudaMalloc
1.41263s  8.3810us  cudaMemcpy
1.41264s  7.8840us  cudaMalloc
1.41265s  1.2690us  cudaEventCreateWithFlagsx9
1.41265s  116.49us  cudaMalloc
1.41277s  1.0130us  cudaEventCreateWithFlagsx9
1.41278s  8.2660us  cudaMemcpy
1.41279s  3.1920us  cudaFuncGetAttributes
1.41280s  1.4350us  cudaEventQuery
1.41280s     452ns  cudaGetLastError
1.41280s     498ns  cudaConfigureCall
1.41280s  1.8740us  cudaSetupArgument
1.41280s  17.917us  cudaLaunch (void asum_kernel<float, float, int=0>(cublasAsumParams<float, float>) [233])
1.41282s     138ns  cudaGetLastError
1.41282s     113ns  cudaGetLastError
1.41282s     206ns  cudaConfigureCall
1.41282s     204ns  cudaSetupArgument
1.41282s  5.3510us  cudaLaunch (void asum_kernel<float, float, int=0>(cublasAsumParams<float, float>) [238])
1.41283s     171ns  cudaGetLastError
----->1.41283s  11.561us  cudaMemcpyAsync
----->1.41284s  2.1880us  cudaStreamSynchronize
1.41284s  1.0630us  cudaEventRecord
----->1.41285s  11.823us  cudaFree
----->1.41287s  6.7440us  cudaFree
----->1.41287s  2.1190us  cudaThreadSynchronize
----->1.41288s     844ns  cudaEventDestroy x9
----->1.41288s  81.004us  cudaFree
----->1.41296s  2.5300us  cudaThreadSynchronize
----->1.41296s     560ns  cudaEventDestroy x9
----->1.41297s  110.32us  cudaFree

For the coaster-blas asum spec:

112.34ms  1.2900us  cuDeviceGetCount
112.34ms     131ns  cuDeviceGetCount
112.34ms     226ns  cuDeviceGet
112.34ms     419ns  cuDeviceGetAttribute
112.47ms     458ns  cuDeviceGet
112.47ms     289ns  cuDeviceGetAttribute
112.47ms     196ns  cuDeviceGetAttribute
112.47ms     208ns  cuDeviceGetAttribute
112.49ms     215ns  cuDeviceGetCount
112.50ms     189ns  cuDeviceGet
112.50ms  28.942us  cuDeviceGetName
112.59ms     206ns  cuDeviceGetAttribute
112.60ms     546ns  cuInit
112.60ms     139ns  cuDeviceGetCount
112.60ms     160ns  cuDeviceGet
112.60ms  24.736us  cuDeviceGetName
112.69ms     186ns  cuDeviceGetAttribute
112.70ms  84.444ms  cuCtxCreate
----->197.18ms  98.817us  cuMemAlloc
----->197.28ms  11.487us  cuMemcpyHtoD
----->197.30ms  4.9560us  cuMemAlloc
197.35ms     323ns  cuDriverGetVersion
----->197.35ms     439ns  cuInit ==> used twice instead of once
197.36ms     189ns  cuDeviceGetCount
197.36ms     235ns  cuDeviceGet
197.36ms  27.945us  cuDeviceGetName
197.39ms  61.523us  cuDeviceTotalMem
197.45ms     265ns  cuDeviceGetAttributex30+
197.71ms  178.42ms  cudaFree
376.13ms  1.0900us  cudaGetDevice
376.13ms     825ns  cudaDeviceGetAttributex11
376.14ms  12.347us  cudaMalloc
376.15ms  8.9220us  cudaMemcpy
376.16ms  7.6060us  cudaMalloc
376.17ms  1.2770us  cudaEventCreateWithFlagsx9
376.17ms  116.68us  cudaMalloc
376.29ms  1.0360us  cudaEventCreateWithFlagsx9
----->376.30ms  5.5050us  cudaMemsetAsync
376.31ms  3.1270us  cudaFuncGetAttributes
376.31ms  1.8690us  cudaEventQuery
376.31ms  5.0240us  cudaGetLastError
376.32ms     551ns  cudaConfigureCall
376.32ms  2.1220us  cudaSetupArgument
376.32ms  17.465us  cudaLaunch (void asum_kernel<float, float, int=0>(cublasAsumParams<float, float>) [154])
376.34ms     217ns  cudaGetLastError
376.34ms     117ns  cudaGetLastError
376.34ms     202ns  cudaConfigureCall
376.34ms     180ns  cudaSetupArgument
376.34ms  5.1890us  cudaLaunch (void asum_kernel<float, float, int=0>(cublasAsumParams<float, float>) [159])
376.35ms     178ns  cudaGetLastError
376.35ms  1.3250us  cudaEventRecord
----->376.35ms  3.7370us  cuCtxSynchronize
----->376.38ms  21.177us  cuMemcpyDtoH
----->376.41ms  7.6910us  cuMemFree
----->376.42ms  5.8090us  cuMemFree
----->376.43ms  113.21ms  cuCtxDestroy

In coaster-blas a new context (Ctx) is created for every sequence of operations, which is not efficient;
cuInit is also called twice, and the cuMem* methods are probably best avoided.
I think I'll start changing the code so that the same API calls are used as in the asum C++ version.


drahnr commented Nov 10, 2017

The thing is that each test runs in a fork AFAIK, and I am not sure how the profiler handles that. Can you try using a filter like test::asum (or so) to run only a single test?


drahnr commented Nov 10, 2017

The context is, to the best of my memory, wrapped in a lazy_static! block, so it should really be created only once. A counter-test would be to perform an operation multiple times within one test; cuInit should then still appear only once.
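A possible counter-test, sketched with the helpers that already exist in the coaster-blas specs (write_to_tensor, tensor_assert_eq and get_cuda_backend are assumed from there):

// If the context really is created once via lazy_static!, the profiler trace
// of this test should show only a single cuInit/cuCtxCreate for both iterations.
#[test]
fn asum_twice_shares_one_context() {
    let backend = get_cuda_backend();
    for _ in 0..2 {
        let mut x = SharedTensor::<f32>::new(&[3]);
        let mut result = SharedTensor::<f32>::new(&[1]);
        write_to_tensor(&mut x, &[1., -2., 3.]);
        backend.asum(&x, &mut result).unwrap();
        backend.synchronize().unwrap();
        tensor_assert_eq(&result, &[6.0], 0.);
    }
}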


Anton-4 commented Nov 10, 2017

All other tests were already filtered out. I think I misinterpreted the context creation; it is indeed only
created once. Perhaps the C++ version uses a default context and does not create its own - I'll look into it.


Anton-4 commented Nov 10, 2017

I think cuInit is done automatically while fetching the attributes.
In coaster we also explicitly call cuInit, hence it appears twice.


drahnr commented Nov 10, 2017

lazy_static! {
    static ref CONTEXT: cublas::Context = {
        let mut context = cublas::Context::new().unwrap();
        context.set_pointer_mode(cublas::api::PointerMode::Device).unwrap();
        context
    };
}

which is explained in more detail in the cuBLAS specification.

This might be unrelated, but this implies that sync needs to be called.

This is what fails in coaster-blas

pub fn test_asum<T, F>(backend: Backend<F>)
    where T: Float + fmt::Debug,
            F: IFramework,
            Backend<F>: Asum<T> + IBackend {
    let mut x = SharedTensor::<T>::new(&[3]);
    let mut result = SharedTensor::<T>::new(&[1]);

    write_to_tensor(&mut x, &[1., -2., 3.]);
    backend.asum(&x, &mut result).unwrap();
    backend.synchronize().unwrap();
    tensor_assert_eq(&result, &[6.0], 0.);
}

vs.

rust-cublas

    #[test]
    fn use_cuda_memory_for_asum() {
        let native = get_native_backend();
        let cuda = get_cuda_backend();

        // set up input
        let n = 20i32;
        let val = 2f32;
        let mut x = filled_tensor(&native, n as usize, val);



        // set up result
        let mut result = SharedTensor::<f32>::new(&vec![1]);



        {
            let cuda_mem = x.read(cuda.device()).unwrap();
            let cuda_mem_result = result.write_only(cuda.device()).unwrap();
            let mut ctx = Context::new().unwrap();
            ctx.set_pointer_mode(PointerMode::Device).unwrap();
            unsafe {
                let x_addr = ::std::mem::transmute::<u64, *mut f32>(*cuda_mem.id_c());
                let res_addr = ::std::mem::transmute::<u64, *mut f32>(*cuda_mem_result.id_c());
                API::ffi_sasum(*ctx.id_c(), n, x_addr, 1, res_addr).unwrap();
            }
        }

        let native_res = result.read(native.device()).unwrap();
        assert_eq!(&[40f32], native_res.as_slice::<f32>());
    }

which works reliably.

So the difference here is the synchronize, which ends up in cuCtxSynchronize (coaster-blas), vs backend.read(...), which in this case ends up in sync_out and boils down to cuMemcpyDtoH_v2.

Copying the memory to the host seems to be safe (I've never seen this fail).

Things to try:

  • change the synchronize() into a backend.read() in coaster-blas (see the sketch after this list)
  • change the backend.read() into a synchronize() in rcublas
  • change the pointer mode from Device to Host
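A hedged sketch of the first item, reusing the helpers from the existing test_asum spec (the native-backend parameter is an assumption made for this example):

// Variant of test_asum: instead of backend.synchronize(), force a
// device-to-host copy by reading the result, mirroring what rust-cublas does.
pub fn test_asum_via_read<T, F>(backend: Backend<F>, native: Backend<Native>)
    where T: Float + fmt::Debug,
            F: IFramework,
            Backend<F>: Asum<T> + IBackend {
    let mut x = SharedTensor::<T>::new(&[3]);
    let mut result = SharedTensor::<T>::new(&[1]);

    write_to_tensor(&mut x, &[1., -2., 3.]);
    backend.asum(&x, &mut result).unwrap();

    // read() goes through sync_out -> cuMemcpyDtoH instead of cuCtxSynchronize.
    let _ = result.read(native.device()).unwrap();
    tensor_assert_eq(&result, &[6.0], 0.);
}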


drahnr commented Nov 12, 2017

According to the docs, to synchronize the memory from/to the device, cudaDeviceSynchronize is supposed to be called, but it is actually never used.


Anton-4 commented Nov 12, 2017

I've also taken a look at the API calls for rust-cublas vs coaster-blas; the only difference is that rust-cublas
has some extra instructions at the end:

436.70ms  1.3100us  cudaEventRecord
--->436.70ms  9.2470us  cudaFree
--->436.71ms  2.2990us  cudaThreadSynchronize
--->436.71ms  2.3120us  cudaEventDestroyx10
--->436.72ms  5.5930us  cudaFree
--->436.72ms  1.4220us  cudaThreadSynchronize
--->436.73ms     363ns  cudaEventDestroyx10
--->436.73ms  75.397us  cudaFree
436.82ms  30.162us  cuMemcpyDtoH

vs for coaster-blas:

376.35ms  1.3250us  cudaEventRecord
--->376.35ms  3.7370us  cuCtxSynchronize
376.38ms  21.177us  cuMemcpyDtoH

In rust-cublas you can see the 3 cudaFree calls; their omission in coaster-blas causes the 3 memory leaks 💡.
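One way to make such leaks structurally impossible (a sketch, not how coaster is currently written) would be an RAII wrapper whose Drop frees the device allocation:

use std::os::raw::{c_int, c_ulonglong};

extern "C" {
    fn cuMemFree_v2(dptr: c_ulonglong) -> c_int;
}

/// Owns a device allocation; the memory is released when the value is dropped,
/// so a missing explicit free can no longer leak it.
struct DeviceBuffer {
    ptr: c_ulonglong, // CUdeviceptr handle
}

impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        // Ignore the return code: freeing during unwinding must not panic.
        unsafe { let _ = cuMemFree_v2(self.ptr); }
    }
}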


drahnr commented Nov 12, 2017

Side note: cudaFree does an implicit memory sync, so this might actually make sense with regard to the sync status if cuMemcpyDtoH is called after cudaFree (but that does not seem to make much sense at all - accessing freed memory on the device).

I am not 100% sure I understand what you are implying. The memory leaks won't explain the behaviour of a context-corrupting asum operation, right?


Anton-4 commented Nov 13, 2017

I'm just happy we're getting close to the solution 😃


Anton-4 commented Nov 14, 2017

  • Changing the backend.synchronize() into a result.read(...) in coaster-blas does not result in different behavior.
  • Adding backend.synchronize() to rust-cublas does not result in different behavior.
  • Running cuda-memcheck --report-api-errors all also reveals NO_BINARY_FOR_GPU
    for rust-cublas, both in the version with synchronize and without.
  • Changing the pointer mode from Device to Host in rust-cublas results in Rust reporting the error: An unknown error occurred


drahnr commented Nov 14, 2017

Not the results I hoped for, but results after all. This leaves pretty much only the hope that cudaDeviceSynchronize fixes it all. I am still puzzled why it would throw NO_BINARY_FOR_GPU.


drahnr commented Nov 15, 2017

I think I understood a little more about why this is failing.

E.g. gemm is an async operation. In combination with CUBLAS_POINTER_MODE_DEVICE it is expected that the parameters a & b passed to e.g. gemm live until the operation is done, but our test scope drops them at some point. The *Sync ops report errors of previous async ops, thus we get an error (don't ask me why it is the NO_BINARY one).

The cuCtxSynchronize driver API apparently (assumption!) does not include a cudaDeviceSynchronize.
The API docs are not very specific on that (CUDA Driver API documentation):

Description
Blocks until the device has completed all preceding requested tasks. cuCtxSynchronize()
returns an error if one of the preceding tasks failed. If the context was created with the
CU_CTX_SCHED_BLOCKING_SYNC flag, the CPU thread will block until the GPU
context has finished its work.

There are benchmarks using cudaDeviceSynchronize to synchronize: https://github.com/MatthieuCourbariaux/BinaryNet/blob/master/Run-time/benchmark-cublas.cu#L85-L86

cudart provides the sync function:

# nm -D /usr/lib64/libcudart.so| rg cudaDeviceSynchronize
0000000000033340 T cudaDeviceSynchronize

So I guess this is the next thing to do: create FFI bindings for cudart and try to use them. Note that there is already a wrapper for this function in rcudnn-sys!
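A minimal sketch of such a binding (the link attribute and error handling are simplified assumptions):

use std::os::raw::c_int;

#[link(name = "cudart")]
extern "C" {
    // Returns 0 (cudaSuccess) once the device has finished all preceding work.
    fn cudaDeviceSynchronize() -> c_int;
}

/// Block the calling thread until the device is idle.
pub fn device_synchronize() -> Result<(), c_int> {
    match unsafe { cudaDeviceSynchronize() } {
        0 => Ok(()),
        err => Err(err),
    }
}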


drahnr commented Nov 22, 2017

@Anton-4 did you start digging into it any further?


Anton-4 commented Nov 22, 2017

No sorry, I didn't have the time. I'll see if I can find some spare time this weekend.


Anton-4 commented Nov 26, 2017

I added a call to cudaDeviceSynchronize and there is no change in behavior.
I also double-checked with the profiler to make sure the call to cudaDeviceSynchronize was made,
and it was.

Do you know where the cudaThreadSynchronize and cudaFree calls at the end of the rust-cublas trace
originate from?


drahnr commented Nov 27, 2017

No, unfortunately I don't know right this second; I will check tonight. I guess the next step would be to call cudaThreadSynchronize explicitly. Other than that, I am running out of ideas on how to nail this down, which means I need to create an NVIDIA ticket ASAP.
