Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

不同编译环境编译出来的paddle_trainer性能差别较大 #282

Closed
linrongyi opened this issue Oct 28, 2016 · 12 comments · Fixed by #392
Closed

不同编译环境编译出来的paddle_trainer性能差别较大 #282

linrongyi opened this issue Oct 28, 2016 · 12 comments · Fixed by #392
Assignees
Milestone

Comments

@linrongyi
Copy link

linrongyi commented Oct 28, 2016

用wangyanfei同学编译的出来的paddle_trainer 和我cmake出来的paddle_trainer训练性能差别很大

相同的数据, 一个pass:

wangyanfei: 22s
me: 35s

另外, 编译出来的paddle_trainer的大小也差别很大

me: 22M
wangyanfei: 130MB

trainer的参数为

${TRAINER_BIN} --config=trainer_config.conf --save_dir=output --trainer_count=11 --parallel_thread_num=1 --use_old_updater=1 --use_gpu=0 --save_dir=./output --enable_grad_share=0 --dot_period=200 --log_period=2000 --num_passes=1

我编译的版本, 已经开启了release选项, 使用mkl lib

@backyes backyes added this to the 0.8.1 milestone Oct 28, 2016
@backyes backyes self-assigned this Oct 28, 2016
@backyes
Copy link
Contributor

backyes commented Oct 28, 2016

That's probably the compiler option related issue. Later we will provide static bin and full auto-compile scripts to handle it.

performance issues:

  • -O3 enable.
  • libraries linked
  • specical instruction.
  • etc.

@reyoung be aware of this issues.
related:
#259

@reyoung
Copy link
Collaborator

reyoung commented Oct 28, 2016

@linrongyi see https://cmake.org/cmake/help/v3.0/variable/CMAKE_BUILD_TYPE.html

Default, Paddle use -O2 -g, RelWithDebInfo.

Set -D CMAKE_BUILD_TYPE=Release will set CFlags to -O3

@linrongyi
Copy link
Author

@reyoung
"-D CMAKE_BUILD_TYPE" was set to 'Release' when i was compiling.

@backyes
Copy link
Contributor

backyes commented Oct 28, 2016

I remenber that the performance gap comes from more than the -O3 option, even with others that are not clear. @reyoung

@reyoung
Copy link
Collaborator

reyoung commented Oct 28, 2016

@backyes 所以究竟是为啥有这么大的性能差距呢?如果两周之内你有时间搞定,就关联到 0.8.1上吧。没有的话就关联到 0.8.2上。如何?

因为其实现在0.8.1的很多事情还没做完。

@wangkuiyi
Copy link
Collaborator

wangkuiyi commented Oct 30, 2016

解决问题的一个基础是“问题可以被复现”。 @linrongyi@backyes ,请描述一下如何复现这个问题,然后才有可能有人对解法有点感觉,我们也才能有道理地把这个issue可以被关联到某个milestone。

@backyes
Copy link
Contributor

backyes commented Oct 31, 2016

@wangkuiyi 复现方法涉及内部用户环境,@linrongyi 方便话,请将复现方法适当公开。

@linrongyi
Copy link
Author

linrongyi commented Nov 1, 2016

先说一下今天的发现.

CPU训练速度的问题, 今天发现如果编译时候把-DWITH_GPU=OFF, 也就是不编译GPU版本, CPU训练就很快; 如果-DWITH_GPU=ON, 训练CPU就会慢很多.

具体的指标, 2827967条protobuf格式的二进制训练数据,

  • -DWITH_GPU=ON, 27.s
  • -DWITH_GPU=OFF, 16.9s

速度差别很大.

我用的版本是github最新的dev版本12945b2c9017c8, 编译命令是

 cmake  \
          -DCMAKE_BUILD_TYPE="Release" \
          -DWITH_RELEASE_G=ON \
          -DWITH_RDMA=$l_rdma \
          -DWITH_GPU=$l_gpu \
          -DWITH_DOC=OFF \
          -DWITH_SWIG_PY=OFF \
          -DWITH_PYTHON=OFF \
          -DMKL_ROOT=/home/aladdin/paddle_internal_release_tools/idl/paddle/thirdparties/mkl  \
          -DRDMA_ROOT=${DEP_HOME}/rdma \
          -DCUDNN_INCLUDE_DIR=/home/work/cudnn/cudnn_v5/cuda/include \
          -DCUDNN_LIBRARY=/home/work/cudnn/cudnn_v5/cuda/lib64 \
          -DWITH_GFLAGS=OFF \
          -DWITH_GLOG=ON \
          -DWITH_TESTING=OFF \
          -DCMAKE_INSTALL_PREFIX=$l_install_dir \
          ${SRC_ROOT}

训练的命令是

${TRAINER_BIN} --config=trainer_config.conf --save_dir=output --trainer_count=11 --parallel_thread_num=1 --use_old_updater=1 --use_gpu=0 --save_dir=./output --enable_grad_share=0 --dot_period=200 --log_period=2000 --num_passes=1

这样的结果表明, 如果追求性能的话, 需要单独编译一个CPU的版本, 总觉得比较trick.

然而, 曾经发现过WITH/WITHOUT CPU的编译结果在CPU 训练上并没有显著差异. @backyes 简单说一下当时的过程, 我是在10.6日checkout最新内部的icode上面的paddle. 使用comake2编译, 训练速度就会很快, 使用cmake编译, 速度就很慢 . comake2编译的paddle_trainer和你给我的paddle_trainer速度是一样的. 同时都是with gpu的.

今天 @backyes 专门做了一个一键编译脚本, 编译出来paddle_trainer大小是100MB了, 可是速度还是慢 (27s).

@pineking
Copy link

pineking commented Nov 1, 2016

我也碰到自己编译 CPU 版 paddle 很慢的问题,
基于官方 paddle:cpu-devel-latest docker 镜像
用的 develop 分支, commit 12945b2
编译命令如下:

cd ~/paddle
git fetch
git checkout develop
mkdir -p build
cd build
cmake .. -DCMAKE_BUILD_TYPE="Release" -DWITH_RELEASE_G=ON \
   -DWITH_DOC=OFF -DWITH_GPU=OFF -DWITH_SWIG_PY=ON \
   -DCUDNN_ROOT=/usr/ -DWITH_STYLE_CHECK=OFF -DWITH_AVX=ON
make -j8
make install

编译后的文件大小如下:

root@6006d11b7879:/usr/local/opt/paddle/bin# ll
total 16964
drwxr-xr-x 2 root root      78 Nov  1 14:29 ./
drwxr-xr-x 4 root root      28 Oct 26 14:23 ../
-rwxr-xr-x 1 root root 5675189 Nov  1 14:29 paddle_merge_model*
-rwxr-xr-x 1 root root 5641468 Nov  1 14:29 paddle_pserver_main*
-rwxr-xr-x 1 root root 6049191 Nov  1 14:29 paddle_trainer*

@hedaoyuan
Copy link
Contributor

-DWITH_GPU=ON 编译出来的CPU版本比较慢是由于CMakeLists.txt里面有个bug. Fix的方式见#239 中CMakeLists.txt. GPU编译的时候nvcc没有增加-Xcompiler -mavx参数,导致编译的CPU代码没有AVX优化。

@linrongyi
Copy link
Author

@hedaoyuan

@backyes
Copy link
Contributor

backyes commented Nov 5, 2016

@hedaoyuan can you help to fix it ASAP?

thisjiang pushed a commit to thisjiang/Paddle that referenced this issue Oct 28, 2021
 update pybind's version to 2.4.3
gglin001 added a commit to graphcore/Paddle-fork that referenced this issue Dec 8, 2021
wangxicoding pushed a commit to wangxicoding/Paddle that referenced this issue Dec 9, 2021
AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this issue Sep 19, 2022
lizexu123 pushed a commit to lizexu123/Paddle that referenced this issue Feb 23, 2024
WAYKEN-TSE pushed a commit to WAYKEN-TSE/Paddle that referenced this issue Dec 6, 2024
…pretrained

[PPDiffusers] update from_pretrained patch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants