[Speed up compiling]: reduce the NVCC compiling (some .cu operators can be compiled by G++) #5491
The experiment seems great!
How can we tell which ops should be compiled by G++ and which by NVCC? Is it the .cu files that don't contain CUDA keywords?
The compile times above seem to be in Debug mode. Do you have compile times in Release mode?
@chengduoZH The compiling uses the default flags.
Even though CUDA keywords such as __device__, __global__, and dim3 may not appear in the .cu file itself, they could be used in the header files from Eigen. It seems that many of our operators are in this category (e.g., elementwise_mul_op.cu). Is there any easier way to figure this out?
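As a rough first pass, one could grep each .cu source for CUDA-specific tokens; this is only a heuristic sketch (the keyword list is an assumption), and as noted above it cannot see tokens pulled in from included headers such as Eigen's:

```shell
# classify_cu FILE -> prints "nvcc" or "g++" depending on whether the
# source text itself contains CUDA-specific tokens.
# Heuristic only: tokens that come from included headers (e.g. Eigen's)
# are invisible to this check, which is exactly the problem above.
classify_cu() {
    keywords='__global__|__device__|__shared__|<<<|cudaMalloc|dim3'
    if grep -Eq "$keywords" "$1"; then
        echo "nvcc"
    else
        echo "g++"
    fi
}
```

A file flagged "g++" here is only a candidate; it would still need a trial compile with G++ to confirm.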
Compiling time comparison between NVCC and G++

Conclusion: the more gencode targets (sm_xx) that are specified, the slower NVCC compiles.

Experiment 1: elementwise_mul_op, this op uses Eigen to compute.
(Timing table not recovered: G++ vs. NVCC, with gencode: only sm_35.)
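For context, each -gencode pair passed to NVCC adds another full device-code compilation pass, which is why restricting the build to a single architecture (as in the "only sm_35" setting above) compiles faster. The exact architecture list below is illustrative, not taken from Paddle's build:

```shell
# One target: fastest NVCC compile (matches "gencode: only sm_35").
nvcc -c elementwise_mul_op.cu \
     -gencode arch=compute_35,code=sm_35

# Several targets: each -gencode adds a device compilation pass,
# so compile time grows roughly with the number of architectures.
nvcc -c elementwise_mul_op.cu \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_61,code=sm_61
```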
Experiment 2: mul_op, this op uses math::matmul to compute.

The .cu operators which can be compiled by G++
The following .cu operators can be compiled by G++, since their dependent CUDA kernels have already been compiled into the math libraries (the paddle/operator/math/ files). The cuDNN operators can also be compiled by G++. However, having different compiling rules for different operators is a little confusing for developers.
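One way to route such a .cu file to the host compiler is to override its source language in CMake; this is a sketch of the general CMake mechanism, not necessarily what Paddle's build rules actually do:

```cmake
# Treat this .cu file as ordinary C++ so it is built by g++ instead of
# nvcc; its CUDA kernels already live in the prebuilt math library, so
# the translation unit itself contains no device code.
set_source_files_properties(elementwise_mul_op.cu
                            PROPERTIES LANGUAGE CXX)
# g++ does not infer C++ from the .cu extension, so force it:
set_source_files_properties(elementwise_mul_op.cu
                            PROPERTIES COMPILE_FLAGS "-x c++")
```

This keeps a single source file per operator while letting the build system decide, per file, which compiler to invoke.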
Also related to #5413