vgg16 dist train fails with latest paddle #10720

Closed
putcn opened this issue May 16, 2018 · 7 comments

putcn commented May 16, 2018

Paddle was built with the following settings (a cmake sketch of the invocation follows the list):

WITH_TESTING=OFF
WITH_GOLANG=OFF 
CMAKE_BUILD_TYPE=Release 
WITH_GPU=ON 
WITH_STYLE_CHECK=OFF 
WITH_FLUID_ONLY=ON 
WITH_MKLDNN=off 
WITH_DISTRIBUTE=ON 
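
The corresponding cmake invocation looks roughly like this (the out-of-source build directory and the make step are assumptions; the flags are the ones listed above):

# assumed layout: run from the Paddle source root
mkdir -p build && cd build
cmake .. -DWITH_TESTING=OFF \
         -DWITH_GOLANG=OFF \
         -DCMAKE_BUILD_TYPE=Release \
         -DWITH_GPU=ON \
         -DWITH_STYLE_CHECK=OFF \
         -DWITH_FLUID_ONLY=ON \
         -DWITH_MKLDNN=OFF \
         -DWITH_DISTRIBUTE=ON
make -j"$(nproc)"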

I packaged the build result with the Dockerfile and Python script in this repo:
https://github.com/putcn/vgg16_dist_test
and tagged it as paddlepaddlece/vgg16_dist:latest.

The training script is basically the same as https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/cluster/vgg16/vgg16_fluid.py; the only change is removing the dependency on import paddle.v2 as paddle by changing it to import paddle.
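
Concretely, the change is just the import line; something like this one-liner covers it (the filename is assumed to match the benchmark script):

# before: import paddle.v2 as paddle
# after:  import paddle
sed -i 's/^import paddle\.v2 as paddle$/import paddle/' vgg16_fluid.py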

I then tried to run the cluster with the following commands.
pserver:

docker run --network="host" -i \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
-e "SERVER_ENDPOINT=172.19.56.198:5436" \
-e "MASTER_ENDPOINT=172.19.56.198:5436" \
-e "TASK_NAME=nostalgic_raman" \
-e "TRAINER_INDEX=0" \
-e "TRAINING_ROLE=PSERVER" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "PSERVER_HOSTS=172.19.56.198:5436" \
-e "PSERVERS=172.19.56.198:5436" \
paddlepaddlece/vgg16_dist:latest --device CPU --local no --num_passes 1 --batch_size 128

trainer:

nvidia-docker run --network="host" -i  \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
-e "MASTER_ENDPOINT=172.31.48.60:5436" \
-e "TASK_NAME=kind_colden" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "TRAINER_INDEX=0"  \
-e "PADDLE_INIT_TRAINER_ID=0" \
-e "TRAINING_ROLE=TRAINER"  \
-e "PSERVER_HOSTS=172.19.56.198:5436"  \
-e "PSERVERS=172.19.56.198:5436" \
paddlepaddlece/vgg16_dist:latest --device GPU --local no --num_passes 1 --batch_size 128

The pserver started with no issue, but the trainer failed while trying to execute the trainer program. The error is:

*** Aborted at 1526510318 (unix time) try "date -d @1526510318" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 591 (TID 0x7f47667a2700) from PID 0; stack trace: ***
    @     0x7f4766380390 (unknown)
    @                0x0 (unknown)
Segmentation fault (core dumped)

Any ideas?


putcn commented May 17, 2018

I worked with @tonyyang-svail on this issue; it looks like it may have something to do with the conv2d op. The test fails even with --local yes and --device GPU.
Going to try turning the build flag WITH_MKLDNN to ON.
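
For reference, the local run was roughly this (same image; batch size assumed to match the commands above):

nvidia-docker run --rm -i \
  paddlepaddlece/vgg16_dist:latest --device GPU --local yes --num_passes 1 --batch_size 128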


putcn commented May 18, 2018

No luck with the MKLDNN flag.


typhoonzero commented May 18, 2018

I've tried two ways to reproduce this issue:

  1. Using the Docker image that you provided, but on my host: I can reproduce this error exactly.
  2. Using the script under your repo, but with my own Docker image and a PaddlePaddle build of my own: the issue is gone.

I'm wondering whether you are using cuDNN 7 and what the NVIDIA driver version is. I think this may be the cause, since cuDNN 7 can only run on very recent drivers.
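
For example, both can be checked roughly like this (the cuDNN library path is an assumption based on the nvidia/cuda runtime images):

# driver version on the host
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# cuDNN major version shipped inside the image
docker run --rm --entrypoint ls paddlepaddlece/vgg16_dist:latest \
  /usr/lib/x86_64-linux-gnu/ | grep libcudnn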


putcn commented May 18, 2018

Thanks @typhoonzero. I was testing on the server 172.19.38.75, the driver version is 390.25, and the Paddle Docker image is based on nvidia/cuda:8.0-cudnn5-runtime-ubuntu16.04, which is what the build process generates by default. Do you think this combination is the issue?

typhoonzero commented May 18, 2018

If you are using paddle:latest-dev to build PaddlePaddle, the build environment uses cuDNN 7, so you might need to try building the runtime image based on nvidia/cuda:8.0-cudnn7-runtime-ubuntu16.04.
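
For example, roughly (the Dockerfile location and the image tag are assumptions):

# point the runtime image at the cuDNN 7 base and rebuild
sed -i 's#nvidia/cuda:8.0-cudnn5-runtime-ubuntu16.04#nvidia/cuda:8.0-cudnn7-runtime-ubuntu16.04#' Dockerfile
docker build -t paddlepaddlece/vgg16_dist:latest .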


putcn commented May 18, 2018

I see. There might be something we need to fix in the Dockerfile in the build directory regarding the base image URI. I'll create a ticket for @wanglei828 about this.


putcn commented May 18, 2018

Changing the cuDNN version in the base image fixes it. Thanks again @typhoonzero.
