Running a PaddlePaddle distributed training job on a Kubernetes cluster.
You can implement a distributed dataset with a reader function. For example:
```python
def dataset_from_reader(filename, reader):
    # Serialize each sample from the reader as one comma-separated line.
    with open(filename, "w") as fn:
        for batch_id, batch_data in enumerate(reader()):
            batch_data_str = [str(d) for d in batch_data]
            fn.write(",".join(batch_data_str))
            fn.write("\n")
```
A complete example for the imikolov dataset is here.
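As a hedged usage sketch, writing the imikolov training data to a file with the function above might look like this, assuming the `paddle.dataset.imikolov` package and its `build_dict` / `train` readers:

```python
import paddle.dataset.imikolov as imikolov

# Assumed API: build_dict() builds the vocabulary and train(word_dict, n)
# returns a reader yielding n-gram tuples of word ids.
word_dict = imikolov.build_dict()
dataset_from_reader("imikolov_train.txt", imikolov.train(word_dict, 5))
```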
If you haven't configured kubectl yet, please follow the tutorial.
- Fetch Runtime Information (see the sketch below):
  - trainer id: the unique ID of each trainer; you can fetch the current trainer ID from the environment variable `TRAINER_ID`.
  - trainer count: the trainer process count; you can fetch it from the environment variable `TRAINERS`.
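A minimal sketch of reading both variables in the trainer process, assuming they are set by the Kubernetes job as described above:

```python
import os

# TRAINER_ID and TRAINERS are injected into each trainer's environment.
trainer_id = int(os.environ["TRAINER_ID"])
trainers = int(os.environ["TRAINERS"])
```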
- Dist Reader Interface
You can implement a `dist_reader` to read data while the trainer is running on Kubernetes. An example implementation of a dist reader creator:
```python
def dist_reader(filename, trainers, trainer_id):
    def dist_reader_creator():
        with open(filename) as f:
            cnt = 0
            for line in f:
                cnt += 1
                # Shard lines round-robin: each trainer keeps every
                # `trainers`-th line, offset by its own id.
                if cnt % trainers == trainer_id:
                    csv_data = [int(cell) for cell in line.split(",")]
                    yield tuple(csv_data)
    return dist_reader_creator
```
NOTE: You can read files from CephFS under the directory `/data/...`
- Create PaddleJob instance
```python
import paddle.job as job

paddle_job = job.PaddleJob(
    runtime_image="yancey1989/paddle-job",
    job_name="paddle-job",
    cpu_nums=3,
    trainer_package="/example/word2vec",
    entry_point="python train.py",
    cephfs_volume=job.CephFSVolume(
        monitors_addr="172.19.32.166:6789"))
```
- Call `job.dist_train` to submit the PaddleJob:
```python
job.dist_train(
    trainer=dist_trainer(),
    paddle_job=paddle_job)
```
- `trainer` is a trainer function, for example (a fuller sketch follows):
```python
def dist_trainer():
    def trainer_creator():
        ...
    return trainer_creator
```
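A hedged sketch of what `trainer_creator` might contain, reusing the `dist_reader` defined earlier; the training step itself is elided because it depends on your model, and the dataset path is a hypothetical example:

```python
import os

def dist_trainer():
    def trainer_creator():
        # Illustrative only: each trainer reads its own shard of the data.
        reader = dist_reader("/data/imikolov_train.txt",  # assumed path
                             trainers=int(os.environ["TRAINERS"]),
                             trainer_id=int(os.environ["TRAINER_ID"]))
        for sample in reader():
            pass  # run one training step on `sample` here
    return trainer_creator
```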
- Build Runtime Docker Image on Base Docker Image
You can build a runtime Docker image with the tool `./tools/build_docker.sh`, such as:
```bash
./tools/build_docker.sh <src_trainer_package> <dest_trainer_package> <base Docker image> <runtime Docker image>
```
  - `src_trainer_package`: the trainer package on your host.
  - `dest_trainer_package`: an absolute path; the script copies `src_trainer_package` to this path in the image's filesystem.
  - `base Docker image`: usually a PaddlePaddle production Docker image that includes the paddle binary files and Python packages; you can also specify an image hosted on any Docker registry that you have access to.
  - `runtime Docker image`: your trainer package files are packaged into this image on top of the base Docker image.

Example:
```bash
./tools/build_docker.sh ./example/ /example paddlepaddle/paddle yancey1989/paddle-job
```
- Push the Runtime Docker Image
You can push your runtime Docker image to a Docker registry server:
```bash
docker push <runtime Docker image>
```
Example:
```bash
docker push yancey1989/paddle-job
```
- Submit Distributed Job
```bash
docker run --rm -it -v $HOME/.kube/config:/root/.kube/config <runtime image name> <entry point>
```
Example:
```bash
docker run --rm -it -v $HOME/.kube/config:/root/.kube/config yancey1989/paddle-job python /example/train.py
```
- Required Parameters of `PaddleJob`

parameter | type | explanation |
---|---|---|
job_name | string | the unique name for the training job |
entry_point | string | entry point to start the trainer process |
memory | string | memory allocated for the job, a plain integer using one of these suffixes: E, P, T, G, M, K |
cpu_nums | int | CPU count for the job |
runtime_image | string | runtime Docker image |
- Advanced Parameters of `PaddleJob` (a combined example follows the table)

parameter | type | default | explanation |
---|---|---|---|
pservers | int | 2 | Parameter Server process count |
trainers | int | 3 | Trainer process count |
gpu_nums | int | 0 | GPU count for the job |
cephfs_volume | CephFSVolume | None | CephFS volume configuration |
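A hedged sketch combining the required and advanced parameters above, assuming the same `paddle.job` API as in the earlier example; the concrete values are illustrative:

```python
import paddle.job as job

paddle_job = job.PaddleJob(
    job_name="paddle-job",                   # required: unique job name
    entry_point="python /example/train.py",  # required: trainer startup command
    memory="1G",                             # required: integer plus E/P/T/G/M/K suffix
    cpu_nums=3,                              # required: CPU count
    runtime_image="yancey1989/paddle-job",   # required: runtime Docker image
    pservers=2,                              # advanced: parameter server count (default 2)
    trainers=3,                              # advanced: trainer count (default 3)
    gpu_nums=0)                              # advanced: GPU count (default 0)
```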
- Required Parameters of `CephFSVolume`

parameter | type | explanation |
---|---|---|
monitors_addr | string | the address for Ceph cluster monitors. |
- Advanced Parameters of `CephFSVolume` (an example follows the table)

parameter | type | default | explanation |
---|---|---|---|
user | string | admin | Ceph cluster user name |
secret_name | string | cephfs-secret | the Kubernetes Secret name holding the Ceph cluster secret |
mount_path | string | /data | CephFS mount path inside the Pod |
path | string | / | CephFS path to mount |
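A hedged sketch of a fully specified `CephFSVolume`, spelling out the defaults listed above; only `monitors_addr` is required:

```python
import paddle.job as job

cephfs_volume = job.CephFSVolume(
    monitors_addr="172.19.32.166:6789",  # required: Ceph monitors address
    user="admin",                        # default Ceph user
    secret_name="cephfs-secret",         # default Kubernetes Secret name
    mount_path="/data",                  # default mount path in the Pod
    path="/")                            # default CephFS path
```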