Openpose with latest drivers based on "NVCaffe Release 18.11" #1002

chrigima · 2019-01-04T23:35:27Z

Hi

I'm currently trying to get OpenPose working with the latest CUDA, cuDNN and NVIDIA driver.

System:
Docker Image based on NVCaffe Release 18.11:
(https://docs.nvidia.com/deeplearning/dgx/caffe-release-notes/rel_18.11.html)
Ubuntu 16.04 including Python 2.7
NVIDIA CUDA 10.0.130 including CUDA® Basic Linear Algebra Subroutines library™ (cuBLAS) 10.0.130
NVIDIA CUDA® Deep Neural Network library™ (cuDNN) 7.4.1
NCCL 2.3.7 (optimized for NVLink™ )
OpenMPI 2.0
TensorRT 5.0.2
NVCaffe 0.17.1 container image version 18.11 is based on Caffe Deep Learning Framework by the BVLC.
Latest version of NCCL 2.3.7.
Latest version of NVIDIA cuDNN 7.4.1.
Latest version of TensorRT 5.0.2
Ubuntu 16.04 with October 2018 updates

NVIDIA Driver Version: 410.78

I modified the OpenPose source code to be compatible with NVCaffe.

So far, the binaries are generated.
When I run the application with
build/examples/openpose/openpose.bin -video examples/media/video.avi --net_resolution 160x80 --output_resolution 960x640 --hand --face
the OpenCV window opens and the video plays, but no heatmaps and no skeleton are drawn.

Any ideas where to look for the issue? Are there logs I should check?

CUDA appears to be working (nvprof output):
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 66.73% 2.49482s 14350 173.86us 63.074us 725.02us maxwell_scudnn_128x128_relu_small_nn
18.98% 709.44ms 6150 115.36us 70.658us 431.15us maxwell_scudnn_128x32_relu_small_nn
2.93% 109.43ms 410 266.90us 11.392us 543.38us [CUDA memcpy DtoH]
1.52% 56.844ms 23370 2.4320us 1.3120us 24.641us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const , float, float, float, float, dimArray, reducedDivisorArray)
1.44% 53.766ms 1230 43.712us 34.593us 246.35us maxwell_scudnn_128x128_relu_interior_nn
1.17% 43.733ms 535 81.743us 704ns 2.9667ms [CUDA memcpy HtoD]
1.14% 42.706ms 820 52.080us 32.769us 210.50us maxwell_scudnn_128x64_relu_interior_nn
0.98% 36.774ms 20295 1.8110us 1.3760us 13.056us void caffe::PReLUForward(int, int, int, float const , caffe::PReLUForward, float const , int)
0.70% 26.158ms 410 63.800us 18.272us 325.36us maxwell_scudnn_128x64_relu_small_nn
0.68% 25.267ms 21115 1.1960us 896ns 13.184us void caffe::Concat(int, float const , bool, int, int, int, int, int, caffe::Concat)
0.65% 24.413ms 23370 1.0440us 864ns 13.217us cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
0.61% 22.809ms 15990 1.4260us 1.3120us 5.6960us void op::resizeKernel(float, op::resizeKernel const , int, int, int, int)
0.57% 21.463ms 20910 1.0260us 768ns 7.5530us [CUDA memcpy DtoD]
0.42% 15.884ms 1845 8.6090us 2.6880us 22.240us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float, cudnnTensorStruct, float const *, cudnnTensorStruct, float const , float, float, float, float, dimArray, reducedDivisorArray)
0.41% 15.373ms 410 37.494us 26.625us 160.55us maxwell_scudnn_128x32_relu_interior_nn
0.35% 12.900ms 5125 2.5170us 1.9840us 5.4080us void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__scan::ScanAgent<thrust::device_ptr, thrust::device_ptr, thrust::plus, int, int, thrust::detail::integral_constant<bool, bool=0>>, thrust::device_ptr, thrust::device_ptr, thrust::plus, int, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, thrust::cuda_cub::__scan::AddInitToExclusiveScan<int, thrust::plus>>(thrust::device_ptr, thrust::device_ptr, int, thrust::plus, int, int)
0.17% 6.4752ms 5125 1.2630us 1.0240us 7.3600us void op::writeResultKernel(float, int, int const *, op::writeResultKernel const , int, int, int, op::writeResultKernel, op::writeResultKernel)
0.16% 5.8985ms 205 28.773us 28.129us 36.257us void op::pafScoreKernel(float, op::pafScoreKernel const *, op::pafScoreKernel const , unsigned int const , unsigned int const , unsigned int, int, int, int, op::pafScoreKernel, op::pafScoreKernel)
0.14% 5.2613ms 5125 1.0260us 864ns 2.5280us void op::nmsRegisterKernel(int, float const , int, int, int)
0.14% 5.1110ms 615 8.3100us 3.1360us 16.832us void cudnn::detail::pooling_fw_4d_kernel<float, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, int=0, bool=0>(cudnnTensorStruct, float const , cudnn::detail::pooling_fw_4d_kernel<float, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, int=0, bool=0>, cudnnTensorStruct, cudnnPoolingStruct, float, cudnnPoolingStruct, int, cudnn::reduced_divisor, float)
0.11% 4.1062ms 5125 801ns 640ns 7.0080us void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__scan::InitAgent<thrust::cuda_cub::cub::ScanTileState<int, bool=1>, int>, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, int>(bool=1, thrust::cuda_cub::cub::ScanTileState<int, bool=1>)
0.00% 186.92us 264 708ns 608ns 2.4640us [CUDA memset]
API calls: 52.16% 4.17146s 92691 45.003us 929ns 8.8293ms cudaStreamSynchronize
12.38% 990.40ms 8 123.80ms 8.5580us 990.33ms cudaStreamCreateWithFlags
10.97% 877.71ms 150675 5.8250us 4.2130us 387.61us cudaLaunchKernel
6.38% 509.88ms 2 254.94ms 23.702us 509.86ms cudaStreamCreate
4.49% 359.19ms 317 1.1331ms 2.6010us 34.062ms cuModuleUnload
2.38% 190.39ms 21442 8.8790us 4.3500us 36.830us cudaMemcpyAsync
2.03% 162.08ms 4860 33.350us 5.6880us 2.2706ms cudaMallocHost
1.77% 141.67ms 413 343.02us 5.5690us 717.47us cudaMemcpy
1.44% 115.43ms 5128 22.509us 1.4570us 366.89us cudaDeviceSynchronize
1.37% 109.50ms 5729 19.113us 2.3110us 3.7621ms cudaMalloc
1.29% 102.78ms 4861 21.143us 4.7700us 1.8110ms cudaFreeHost
1.05% 83.989ms 238286 352ns 197ns 353.94us cudaGetDevice
0.86% 68.732ms 4 17.183ms 425ns 68.731ms cuDevicePrimaryCtxRelease
0.51% 40.652ms 5151 7.8920us 272ns 12.843ms cudaFree
0.29% 23.015ms 10250 2.2450us 1.9460us 102.61us cudaFuncGetAttributes
0.21% 17.138ms 101659 168ns 77ns 340.04us cudaGetLastError
0.14% 11.066ms 14 790.42us 1.6490us 11.038ms cudaStreamDestroy
0.09% 6.8833ms 42864 160ns 82ns 337.78us cudaPeekAtLastError
0.07% 5.7059ms 20527 277ns 214ns 4.6960us cudaDeviceGetAttribute
0.03% 2.7199ms 9 302.22us 269.40us 358.23us cudaGetDeviceProperties
0.03% 2.2528ms 264 8.5330us 2.9590us 33.710us cudaMemsetAsync
0.02% 1.4868ms 375 3.9640us 112ns 219.00us cuDeviceGetAttribute
0.01% 871.83us 8 108.98us 98.094us 115.93us cudaMemGetInfo
0.01% 553.10us 4 138.28us 97.003us 178.85us cuDeviceTotalMem
0.01% 508.71us 592 859ns 559ns 4.0490us cudaEventRecord
0.01% 472.61us 1188 397ns 237ns 30.544us cudaSetDevice
0.01% 435.24us 619 703ns 384ns 2.0740us cudaEventCreateWithFlags
0.01% 407.06us 4 101.77us 33.413us 268.54us cuDeviceGetName
0.00% 148.14us 1271 116ns 82ns 777ns cudaGetDeviceCount
0.00% 34.977us 4 8.7440us 8.4330us 8.9670us cudaStreamCreateWithPriority
0.00% 19.093us 31 615ns 341ns 1.9600us cudaEventDestroy
0.00% 16.488us 1 16.488us 16.488us 16.488us cudaHostAlloc
0.00% 3.3750us 1 3.3750us 3.3750us 3.3750us cuDeviceGetPCIBusId
0.00% 2.9150us 3 971ns 631ns 1.2480us cuInit
0.00% 2.5410us 6 423ns 196ns 1.2900us cuDeviceGetCount
0.00% 2.4130us 3 804ns 409ns 1.1950us cuDriverGetVersion
0.00% 1.9640us 5 392ns 243ns 798ns cuDeviceGet
0.00% 1.3460us 1 1.3460us 1.3460us 1.3460us cudaDeviceGetStreamPriorityRange
0.00% 1.0640us 4 266ns 221ns 388ns cuDeviceGetUuid
0.00% 960ns 1 960ns 960ns 960ns cudaHostGetDevicePointer

chrigima · 2019-01-06T15:46:48Z

I managed to get it working. The mistake happened while updating the code from the Blob class to the TBlob (template) class. In netCaffe.cpp I copied the data of the output blob during initialization, which of course didn't produce any output at runtime.
The boost::dynamic_pointer_cast returned NULL. Using boost::static_pointer_cast instead, it worked.
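
For readers hitting the same symptom, here is a minimal sketch of the two pitfalls described above. It uses std::shared_ptr in place of boost::shared_ptr, and `BlobBase`, `FloatBlob`, `checkedCast`, `uncheckedCast`, `snapshotAtInit` and `readAtRuntime` are hypothetical names invented for illustration, not the real NVCaffe or OpenPose classes. It assumes the NULL result came from the object's dynamic type not matching the cast target, which the thread does not confirm.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for an untyped base blob (NOT the NVCaffe class).
struct BlobBase {
    virtual ~BlobBase() = default;
    std::vector<float> data;
};

// Hypothetical stand-in for a typed blob derived from the base.
struct FloatBlob : BlobBase {
    const float* typedData() const { return data.data(); }
};

// Checked downcast: returns nullptr when the object behind `blob`
// is not really a FloatBlob. Always test the result before use.
inline std::shared_ptr<FloatBlob> checkedCast(const std::shared_ptr<BlobBase>& blob) {
    return std::dynamic_pointer_cast<FloatBlob>(blob);
}

// Unchecked downcast: never returns nullptr for a non-null input, but it is
// only well-defined when the object really is a FloatBlob.
inline std::shared_ptr<FloatBlob> uncheckedCast(const std::shared_ptr<BlobBase>& blob) {
    return std::static_pointer_cast<FloatBlob>(blob);
}

// Pitfall: a copy taken once during initialization never sees later results.
inline std::vector<float> snapshotAtInit(const std::shared_ptr<BlobBase>& blob) {
    return blob->data;
}

// Safer: keep the blob handle and read its contents after each forward pass.
inline const std::vector<float>& readAtRuntime(const std::shared_ptr<BlobBase>& blob) {
    return blob->data;
}
```

A quick way to narrow down a "runs but draws nothing" failure is to check the result of the checked cast right after fetching the output blob and to confirm that the output is read after each forward pass rather than copied once at startup.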

gineshidalgo99 · 2019-03-07T16:37:29Z

Hi @chrigima, your finding is really useful. Would you mind opening a PR with your working version? That way other people can benefit from NVCaffe compatibility :)

gineshidalgo99 · 2019-04-15T22:48:43Z

@HBadertscher fixes this issue with #1169

Feel free to use that PR temporarily until we merge it. Once I merge it, I will post a notification on that PR. Thanks.

stale bot added the stale/old label Mar 7, 2019

CMU-Perceptual-Computing-Lab deleted a comment from stale bot Mar 7, 2019

stale bot removed the stale/old label Mar 7, 2019

gineshidalgo99 added the enhancement New feature or request label Mar 7, 2019

HBadertscher mentioned this issue Apr 2, 2019

Add support for Nvidia NVCaffe #1169

Merged

gineshidalgo99 closed this as completed Apr 15, 2019
