vgg16 dist train fails with latest paddle #10720

putcn · 2018-05-16T22:49:48Z

Paddle was built with the following settings:

WITH_TESTING=OFF
WITH_GOLANG=OFF 
CMAKE_BUILD_TYPE=Release 
WITH_GPU=ON 
WITH_STYLE_CHECK=OFF 
WITH_FLUID_ONLY=ON 
WITH_MKLDNN=off 
WITH_DISTRIBUTE=ON

I packaged the build result with the Dockerfile and Python script from this repo:
https://github.com/putcn/vgg16_dist_test
and tagged the image as paddlepaddlece/vgg16_dist:latest

The training script is basically the same as https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/cluster/vgg16/vgg16_fluid.py; the only change is that the dependency on import paddle.v2 as paddle was removed by changing it to import paddle.

Then I tried to run the cluster with the following commands.
pserver:

docker run --network="host" -i \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
-e "SERVER_ENDPOINT=172.19.56.198:5436" \
-e "MASTER_ENDPOINT=172.19.56.198:5436" \
-e "TASK_NAME=nostalgic_raman" \
-e "TRAINER_INDEX=0" \
-e "TRAINING_ROLE=PSERVER" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "PSERVER_HOSTS=172.19.56.198:5436" \
-e "PSERVERS=172.19.56.198:5436" \
paddlepaddlece/vgg16_dist:latest --device CPU --local no --num_passes 1 --batch_size 128

trainer:

nvidia-docker run --network="host" -i  \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
-e "MASTER_ENDPOINT=172.31.48.60:5436" \
-e "TASK_NAME=kind_colden" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "TRAINER_INDEX=0"  \
-e "PADDLE_INIT_TRAINER_ID=0" \
-e "TRAINING_ROLE=TRAINER"  \
-e "PSERVER_HOSTS=172.19.56.198:5436"  \
-e "PSERVERS=172.19.56.198:5436" \
paddlepaddlece/vgg16_dist:latest --device GPU --local no --num_passes 1 --batch_size 128

The pserver started with no issue, but the trainer failed while trying to execute the trainer program. The error is:

*** Aborted at 1526510318 (unix time) try "date -d @1526510318" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 591 (TID 0x7f47667a2700) from PID 0; stack trace: ***
    @     0x7f4766380390 (unknown)
    @                0x0 (unknown)
Segmentation fault (core dumped)

any ideas?

putcn · 2018-05-17T22:36:15Z

Worked with @tonyyang-svail on this issue. It looks like it might have something to do with the conv2d op: the test fails even with --local yes and --device GPU.
Going to try building with the mkldnn flag turned ON.

putcn · 2018-05-18T00:39:01Z

No luck with the mkldnn flag

typhoonzero · 2018-05-18T03:00:16Z

I've tried two ways to reproduce this issue:

1. Using the Docker image that you provided, but on my own host: I can reproduce this error exactly.
2. Using the script from your repo, but with my own Docker image and a PaddlePaddle that I built myself: the issue is gone.

I'm wondering whether you are using cudnn7, and what the NVIDIA driver version is. I think this may be the cause: cudnn7 can only run on very recent drivers.
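
One way to answer both questions is to list the cuDNN libraries visible to the dynamic linker inside the runtime container and to query the driver version on the host. The following is only a sketch: cudnn_sonames is a hypothetical helper name, not part of Paddle or of the images discussed in this thread, and it assumes a grep that supports -o.

```shell
#!/bin/sh
# Hypothetical helper (not part of Paddle): reads `ldconfig -p` output on
# stdin and prints each distinct libcudnn soname once, e.g. libcudnn.so.7.
cudnn_sonames() {
    grep -o 'libcudnn\.so\.[0-9][0-9]*' | sort -u
}

# Usage inside the runtime container:
#   ldconfig -p | cudnn_sonames
# Driver version on the host:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

If the soname printed inside the runtime container differs from the cuDNN major version that Paddle was built against, that would be consistent with the kind of mismatch suspected here.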

putcn · 2018-05-18T03:07:11Z

Thanks @typhoonzero. I was testing on the server 172.19.38.75; the driver version is 390.25, and the paddle Docker image is based on nvidia/cuda:8.0-cudnn5-runtime-ubuntu16.04, which is what the build process generates by default. Do you think this combination is an issue?

typhoonzero · 2018-05-18T03:11:16Z

If you are using paddle:latest-dev to build paddlepaddle, the build environment uses cudnn7, so you might need to try building the runtime image based on nvidia/cuda:8.0-cudnn7-runtime-ubuntu16.04.
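
The mismatch described above can be caught before running anything by comparing the cuDNN major version embedded in the two image tags. This is a minimal sketch: cudnn_major and same_cudnn are hypothetical helper names, and the build-image tag in the example is an assumed placeholder, since the thread only states that the build environment uses cudnn7.

```shell
#!/bin/sh
# Hypothetical helper: print the cuDNN major version embedded in an image tag
# such as nvidia/cuda:8.0-cudnn7-runtime-ubuntu16.04 (prints nothing if absent).
cudnn_major() {
    printf '%s\n' "$1" | sed -n 's/.*cudnn\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed only when both tags carry the same cuDNN major.
same_cudnn() {
    build=$(cudnn_major "$1")
    runtime=$(cudnn_major "$2")
    [ -n "$build" ] && [ "$build" = "$runtime" ]
}

# The first tag is an assumed stand-in for the cudnn7 build environment;
# the second is the cudnn5 runtime base image reported in this thread.
if same_cudnn "nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04" \
              "nvidia/cuda:8.0-cudnn5-runtime-ubuntu16.04"; then
    echo "cuDNN versions match"
else
    echo "cuDNN mismatch between build and runtime images"
fi
```

With the two tags above, the check reports a mismatch; swapping the runtime tag for the cudnn7 one makes it pass. It only inspects tag names, so it cannot detect a library that was installed or replaced inside the image afterwards.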

putcn · 2018-05-18T03:20:14Z

I see. There might be something we need to fix in the Dockerfile in the build directory regarding the base image URI. I'll create a ticket for @wanglei828 about this.

putcn · 2018-05-18T04:06:27Z

Changing the cuDNN version of the base image fixed it. Thanks again, @typhoonzero.

putcn assigned Yancey1989 and typhoonzero May 16, 2018

putcn closed this as completed May 18, 2018

putcn mentioned this issue May 18, 2018

Production Dockerfile's base image uri is not following the dev image base uri that builds it #10764

Closed

typhoonzero mentioned this issue May 18, 2018

fix runtime docker base image #10767

Closed
