make benchmark really working #11215
Conversation
If we have `run_fluid_benchmark.sh`, do we need the run step in README.md?
@@ -29,9 +29,11 @@ Currently supported `--model` argument include:
You can choose to use GPU/CPU training. With GPU training, you can specify
`--gpus <gpu_num>` to run multi GPU training.
* Run distributed training with parameter servers:
  * see run_fluid_benchmark.sh as an example.
Need a link here.
Not sure this is needed. The link could break, and the file is right in this folder.
[run_fluid_benchmark.sh](./run_fluid_benchmark.sh)
@@ -29,9 +29,11 @@ Currently supported `--model` argument include:
You can choose to use GPU/CPU training. With GPU training, you can specify
`--gpus <gpu_num>` to run multi GPU training.
* Run distributed training with parameter servers:
  * see run_fluid_benchmark.sh as an example.
  * start parameter servers:
    ```bash
    PADDLE_TRAINING_ROLE=PSERVER PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid_benchmark.py --model mnist --device GPU --update_method pserver
    ```
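For context, the trainer side is typically started with the matching environment, switching the role to `TRAINER`. This is a sketch, not part of the diff; the variable values here just mirror the pserver line above:

```bash
PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid_benchmark.py --model mnist --device GPU --update_method pserver
```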
The pserver runs on CPU, so it should not be started with `--device GPU`.
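Under that suggestion, the pserver start line from the diff above would become the following (only `--device` changes; everything else is taken from the diff):

```bash
PADDLE_TRAINING_ROLE=PSERVER PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid_benchmark.py --model mnist --device CPU --update_method pserver
```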
Done
@@ -0,0 +1,10 @@
#!/bin/bash

PADDLE_TRAINING_ROLE=PSERVER PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=2 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid_benchmark.py --model resnet --device GPU --update_method pserver --iterations=10000 &
Seems this command would print all logs on the terminal; we could start it as follows instead:

```bash
PADDLE_TRAINING_ROLE=PSERVER ... stdbuf -oL nohup python fluid_benchmark.py <args> > server.log 2>&1 &
```

Users can then check the logs in the `server.log` file.
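One detail worth noting in that command line: the redirection order matters. Writing `2>&1 > server.log` points stderr at the terminal *before* stdout is moved, so errors never reach the file; `> server.log 2>&1` captures both. A minimal sketch, using a tiny `worker` function as a stand-in for the benchmark process (an assumption for illustration):

```bash
#!/bin/sh
# Stand-in for the benchmark process: writes one line to each stream.
worker() {
    echo "stdout line"
    echo "stderr line" >&2
}

# Correct order: send stdout to the file first, then point stderr
# at wherever stdout now goes. Both streams end up in server.log.
worker > server.log 2>&1

grep -c "line" server.log   # prints 2: both streams were captured
rm -f server.log
```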
I think it's fine to print some logs to give the user feedback. There isn't much output:

```
Pass 0, batch 162, loss [2.7855887 2.973915 ]
Pass 0, batch 162, loss [3.0754983 3.2426462]
Pass 0, batch 171, loss [3.4701207 4.438573 ]
Pass 0, batch 171, loss [3.7791452 3.3191109]
```
LGTM
Users complained about crashes when following our doc.