aws benchmarking tool #9638
Conversation
Thanks, it's a very useful PR.
I saw we need some Python dependencies to run the client, so maybe we can run the client in a Docker image? If so, we need a Dockerfile under the path tools/aws_benchmarking/Dockerfile.
nvidia-docker run -i -e "TRAINING_ROLE=PSERVER" -e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE}
Maybe we need a `-p` argument to publish the container port to the host.
thanks, will add it
@Yancey1989 running this tool in Docker is a great idea, will do, thanks.
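As an illustration of the `-p` suggestion above, here is a minimal sketch of what adding a port mapping to the pserver launch template might look like; the template variable name, port, and image name are assumptions, not the PR's actual code:

```python
# Hypothetical sketch: the real launch template lives in the PR's Python tool.
# The -p flag publishes the pserver's listening port on the host.
PSERVER_LAUNCH_TEMPLATE = (
    'nvidia-docker run -i -p {PSERVER_PORT}:{PSERVER_PORT} '
    '-e "TRAINING_ROLE=PSERVER" '
    '-e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE}'
)

cmd = PSERVER_LAUNCH_TEMPLATE.format(
    PSERVER_PORT=5436,                               # assumed pserver port
    PSERVER_HOSTS="192.168.1.2:5436",                # example endpoint
    DOCKER_IMAGE="example/paddle-benchmark:latest",  # hypothetical image name
)
print(cmd)
```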
It would be nice if we could achieve:
- The user doesn't have to change anything in `train.py`; all they have to do is provide `train.py` and specify the command-line arguments for the pserver and trainer.
- Terminate the EC2 instances automatically when the training process finishes (either due to completion or due to a crash).
- Save all output (stdout and stderr) from the process and be able to show the output in real time (a minimal sketch follows this list).
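A minimal sketch of that last point, assuming the master simply runs the training command as a child process (the command and log path below are placeholders):

```python
# Hypothetical sketch: stream a training process's stdout/stderr in real time
# while saving a copy to a log file, and return the exit code so the tool can
# decide whether to terminate the EC2 instances afterwards.
import subprocess

def run_and_stream(cmd, log_path):
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            universal_newlines=True,
        )
        for line in proc.stdout:
            print(line, end="")        # show the output in real time
            log.write(line)            # keep a copy for later retrieval
        return proc.wait()             # non-zero exit code means a crash

if __name__ == "__main__":
    code = run_and_stream(["python", "train.py"], "/tmp/train.log")
    print("training finished with exit code", code)
```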
parser.add_argument(
    '--pserver_instance_type',
    type=str,
    default="p2.8xlarge",
We probably don't need a GPU instance for the pserver.
Agree, will update the default pserver instance type.
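A sketch of the suggested change, assuming a compute-optimized CPU instance is acceptable (the exact instance type below is an assumption):

```python
# Hypothetical sketch: parameter servers are mostly CPU/network bound, so a
# CPU instance is a more sensible default than the GPU p2.8xlarge.
import argparse

parser = argparse.ArgumentParser(description="aws benchmarking tool (sketch)")
parser.add_argument(
    '--pserver_instance_type',
    type=str,
    default="c5.2xlarge",  # assumed CPU default; was "p2.8xlarge"
    help="AWS instance type used for parameter servers")
args = parser.parse_args()
```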
    '--pserver_count', type=int, default=1, help="Pserver count")

parser.add_argument(
    '--action', type=str, default="serve", help="create|cleanup|status")
serve is not in the set "create|cleanup|status"?
thanks for catching this
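A sketch of one way to keep the default and the documented set in agreement, assuming "create" is the intended default (that choice is an assumption):

```python
# Hypothetical sketch: `choices` rejects any --action value outside the
# documented set, and the default is moved inside that set so the help text
# and the default no longer disagree.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--action',
    type=str,
    default="create",  # assumed default; "serve" is not a documented action
    choices=["create", "cleanup", "status"],
    help="create|cleanup|status")
args = parser.parse_args()
```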
@helinwang thanks for the great ideas, will update.
Going to make some final tweaks and a README file tomorrow.
***Please Note***
Training nodes will run your `ENTRYPOINT` script with the following environment variables:
Does this work with our current benchmark scripts?
The vgg16 script takes the env variables: SERVER_ENDPOINT PSERVERS TRAINERS TRAINING_ROLE
example usages:
pserver:
SERVER_ENDPOINT=172.19.61.250:8000 PSERVERS=172.19.61.250:8000 TRAINERS=1 TRAINING_ROLE=PSERVER CUDA_VISIBLE_DEVICES=2 LD_LIBRARY_PATH=`pwd`:/usr/local/cuda-8.0/lib64:/usr/local/lib/ python vgg16_fluid.py --local false --device GPU --data_set flowers --batch_size 4
trainer:
SERVER_ENDPOINT=172.19.61.250:8000 PSERVERS=172.19.61.250:8000 TRAINERS=1 TRAINING_ROLE=TRAINER CUDA_VISIBLE_DEVICES=1 LD_LIBRARY_PATH=`pwd`:/usr/local/cuda-8.0/lib64:/usr/local/lib/ python vgg16_fluid.py --local false --device GPU --data_set flowers --batch_size 4
got it, will update the env vars
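A minimal sketch, assuming the entry point reads the same variables the existing vgg16 benchmark script uses (the defaults below are assumptions):

```python
# Hypothetical sketch of an entry point reading the env variables used by the
# existing benchmark scripts (SERVER_ENDPOINT, PSERVERS, TRAINERS, TRAINING_ROLE).
import os

training_role = os.environ.get("TRAINING_ROLE", "TRAINER")  # "PSERVER" or "TRAINER"
pservers = os.environ.get("PSERVERS", "")                   # comma separated endpoints
trainers = int(os.environ.get("TRAINERS", "1"))             # number of trainer nodes
server_endpoint = os.environ.get("SERVER_ENDPOINT", "")     # this pserver's own endpoint

if training_role == "PSERVER":
    print("starting pserver on", server_endpoint, "with", trainers, "trainer(s)")
else:
    print("starting trainer against pservers:", pservers)
```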
tools/aws_benchmarking/README.md
To access the master log:

```bash
docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/<key pair name>.pem \
```
Curious why accessing the master log requires a *.pem file?
Ah, good catch, we don't need the pem file to access the log, will update.
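For reference, a sketch of the simplified invocation once the pem mount is dropped; the client image name and trailing arguments are placeholders, not the PR's actual interface:

```python
# Hypothetical sketch: only the AWS credentials directory is mounted when
# fetching the master log; no pem file is needed. The image name and the
# arguments below are assumptions.
import os
import subprocess

subprocess.check_call([
    "docker", "run", "-i",
    "-v", os.path.expanduser("~/.aws") + ":/root/.aws",
    "example/aws-benchmarking-client",  # hypothetical client image
    "--action", "status",               # assumed way to query the master
])
```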
tools/aws_benchmarking/README.md
To retrieve training logs
TBD

### Tech details
There is one special character here that needs to be deleted (it is rendered as `�Tech details`).
thanks, will update
One comment, otherwise LGTM!
- `TASK_NAME`: unique name to identify this training process.
- `TRAINING_ROLE`: current node's role in this training process, either "PSERVER" or "TRAINER"
- `PSERVER_HOSTS`: comma separated value of pserver endpoints, e.g. "192.168.1.2:5436,192.168.1.3:5436"
Do we need `PSERVER_HOSTS`? It's not in @typhoonzero's script or transpiler.py. Could we remove it? Otherwise it causes confusion about why there are "PSERVER_HOSTS" and "PSERVERS" with the exact same meaning.
The reason for leaving these duplicated env vars is to stay compatible with other existing tests; they may require different env vars for the same purpose.
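A tiny sketch of how an entry point could tolerate the duplication by accepting either variable (the fallback logic is an assumption, not code from this PR):

```python
# Hypothetical sketch: prefer PSERVER_HOSTS (exported by this tool) but fall
# back to PSERVERS so existing scripts keep working unchanged.
import os

pserver_hosts = os.environ.get("PSERVER_HOSTS") or os.environ.get("PSERVERS", "")
endpoints = [ep for ep in pserver_hosts.split(",") if ep]
print("parameter server endpoints:", endpoints)
```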
LGTM!!!
Purpose
This is an automation tool for deploying PaddlePaddle benchmark testing to AWS.
Features
For more info, please refer to the README.md in this PR.