vgg16 dist train fails with latest paddle #10720
Comments
Worked with @tonyyang-svail on this issue. It looks like it might have something to do with the conv2d op. The test fails even with --local yes and --device GPU.
No luck with the mkldnn flag.
I've tried two ways to reproduce this issue:
I'm wondering whether you are using cudnn7, and what the NVIDIA driver version is. I think this may be the cause: cudnn7 can only run on very recent drivers.
Thanks @typhoonzero. I was testing on the server 172.19.38.75; the driver version is 390.25, and the paddle docker image is based on nvidia/cuda:8.0-cudnn5-runtime-ubuntu16.04, which the build process generates by default. Do you think this combination is an issue?
If you are using
I see. We might need to fix the base image URI in the Dockerfile in the build directory. I'll create a ticket for @wanglei828 regarding this.
Changing the cuDNN version in the image tag fixes it. Thanks again @typhoonzero.
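For anyone hitting the same thing, here is a rough sketch of how to read the CUDA and cuDNN versions out of a base image tag such as the one quoted above, so they can be compared against what the build expects. The helper name `parse_cuda_image_tag` and the assumed tag layout (`<repo>:<cuda>-cudnn<N>-<flavor>-<os>`) are illustrative only; this is not part of Paddle or its build scripts.

```python
import re

# Assumed tag layout: <repo>:<cuda>-cudnn<N>-<flavor>-<os>
# e.g. nvidia/cuda:8.0-cudnn5-runtime-ubuntu16.04
_TAG_RE = re.compile(
    r"^(?P<repo>[^:]+):"
    r"(?P<cuda>\d+(?:\.\d+)*)-"
    r"cudnn(?P<cudnn>\d+)-"
    r"(?P<flavor>[a-z]+)-"
    r"(?P<os>.+)$"
)


def parse_cuda_image_tag(image):
    """Hypothetical helper: split a CUDA base image name into its parts."""
    match = _TAG_RE.match(image)
    if match is None:
        raise ValueError("unrecognized image tag layout: %r" % image)
    info = match.groupdict()
    info["cudnn"] = int(info["cudnn"])  # major cuDNN version as an integer
    return info


info = parse_cuda_image_tag("nvidia/cuda:8.0-cudnn5-runtime-ubuntu16.04")
print(info["cuda"], info["cudnn"], info["flavor"], info["os"])
```

This only inspects the image name; it does not check the cuDNN library actually installed in the container or the driver on the host.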
Paddle was built with the following settings:
I packaged the build result with the Dockerfile and Python script from this repo:
https://github.com/putcn/vgg16_dist_test
and tagged it as
paddlepaddlece/vgg16_dist:latest
The training script is basically the same as https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/cluster/vgg16/vgg16_fluid.py, except that the dependency on
import paddle.v2 as paddle
was removed by changing it to
import paddle
Then I tried to run the cluster with the following commands:
pserver:
trainer:
The pserver started with no issue, but the trainer failed while trying to execute the trainer program. The error is:
Any ideas?