
Simpler cluster train job submit code #2047

Closed
typhoonzero opened this issue May 8, 2017 · 3 comments
@typhoonzero (Contributor):

@Yancey1989 wrote a job submit tool at: https://github.com/Yancey1989/paddle-job

Currently, submitting a job looks like:

paddle.init(
            use_gpu=False,
            trainer_count=1,
            port=7164,
            ports_num=1,
            ports_num_for_sparse=1,
            num_gradient_servers=1,
            trainer_id=fetch_trainer_id(),
            pservers=fetch_pserver_ips())
job.dist_train(
        trainer=trainer,
        reader=paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
        num_passes=30,
        event_handler=event_handler,
        paddle_job=job.PaddleJob(
            pservers=3,
            base_image="yancey1989/paddle-cloud",
            input="/yanxu05",
            output="/yanxu05",
            job_name="paddle-cloud",
            namespace="yanxu",
            use_gpu=False,
            cpu_num=3,
            trainer_package_path="/example/word2vec",
            entry_point="python api_train_v2.py"))

We want to make it simpler, like this:

# init from ENV "PADDLE_*", args below will overwrite the ENVs
paddle.init(use_gpu=False)
...
myjob = job.dist_train(
        trainer=trainer,
        reader=my_dist_reader("dataset-name"),
        num_passes=30,
        event_handler=event_handler,
        paddle_job=job.PaddleJob(
            [cluster configurations...]))
print "view job status at: ", myjob.status_url()

Required ENVs:

  • "PADDLE_PSERVERS"
  • "PADDLE_TRAINER_ID"
  • "PADDLE_TRAINER_COUNT"
  • "PADDLE_NUM_GRADIENT_SERVERS"
  • "PADDLE_PORTS_NUM_FOR_SPARSE"

Optional ENVs:

  • "PADDLE_PORT": default 7164
  • "PADDLE_PORTS_NUM": default 1
  • "PADDLE_USE_GPU": default False
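A minimal sketch of how `paddle.init` could resolve the ENVs above, with explicit keyword arguments overwriting them. The helper names `_env_or` and `init_from_env` are illustrative only, not part of the real PaddlePaddle API:

```python
import os

def _env_or(name, default, cast=str):
    """Return the env var cast with `cast`, or `default` if the var is unset."""
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)

def init_from_env(**overrides):
    """Build an init config from PADDLE_* ENVs; kwargs overwrite the ENVs."""
    conf = {
        # required ENVs (no defaults)
        "pservers": _env_or("PADDLE_PSERVERS", None),
        "trainer_id": _env_or("PADDLE_TRAINER_ID", None, int),
        "trainer_count": _env_or("PADDLE_TRAINER_COUNT", None, int),
        "num_gradient_servers": _env_or("PADDLE_NUM_GRADIENT_SERVERS", None, int),
        "ports_num_for_sparse": _env_or("PADDLE_PORTS_NUM_FOR_SPARSE", None, int),
        # optional ENVs with the defaults listed above
        "port": _env_or("PADDLE_PORT", 7164, int),
        "ports_num": _env_or("PADDLE_PORTS_NUM", 1, int),
        "use_gpu": _env_or("PADDLE_USE_GPU", False,
                           lambda s: s.lower() == "true"),
    }
    conf.update(overrides)  # explicit arguments win over ENVs
    return conf
```

With this shape, `paddle.init(use_gpu=False)` keeps all cluster wiring in the environment while still letting the script pin individual values.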

Cluster Job Configurations:

Job Resources

  • parallism: parallelism equals the number of trainers; the number of pservers is calculated from the parallelism.
  • num_gpus: GPU resources needed. If num_gpus == 0 while the env "PADDLE_USE_GPU" is set to True (or the opposite), paddle will throw a warning message when submitting the job.
  • num_cpus: CPU resources needed.
  • entry_point: command to start your training program, e.g. python /data/cloud/storage/path/train.py
  • NOTE: Paddle will by default mount your cloud storage volume at /data, so your training program can read data anywhere under /data.
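The num_gpus warning described above could be sketched like this; `check_gpu_config` is a made-up name for this illustration, not an existing Paddle function:

```python
import os
import warnings

def check_gpu_config(num_gpus):
    """Warn when num_gpus and the PADDLE_USE_GPU env disagree.

    Returns True when the two settings are consistent, False otherwise.
    """
    use_gpu = os.environ.get("PADDLE_USE_GPU", "False").lower() == "true"
    if (num_gpus == 0 and use_gpu) or (num_gpus > 0 and not use_gpu):
        warnings.warn(
            "num_gpus=%d conflicts with PADDLE_USE_GPU=%s"
            % (num_gpus, use_gpu))
        return False
    return True
```

(As noted in the discussion below, this check may be unnecessary if dist_train itself generates the Kubernetes YAML and sets PADDLE_USE_GPU consistently with num_gpus.)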

Advanced settings:

  • pservers: if this is set, the number of pservers will be set to this value instead of being auto-calculated from the parallelism.
  • base_image: use your own image to run.
  • job_name: use your own job name.
  • NOTE: namespace is read from ENV: "USER_NAMESPACE"
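The pserver rule above (explicit override, else auto-calculation from parallelism) could look like the sketch below. The function name `pserver_count` and the one-pserver-per-four-trainers ratio are assumptions for illustration; the issue does not specify the actual formula:

```python
def pserver_count(parallelism, pservers=None):
    """Return the number of pservers for a job.

    Uses the explicit `pservers` advanced setting when given; otherwise
    derives it from parallelism (assumed here: ceil(parallelism / 4),
    with a minimum of one pserver).
    """
    if pservers is not None:
        return pservers
    return max(1, (parallelism + 3) // 4)
```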
@Yancey1989 (Contributor):

Thanks for @typhoonzero's suggestions, they're very useful!

@Yancey1989 (Contributor) commented May 9, 2017:

> parallism: parallelism equals the number of trainers; the number of pservers is calculated from the parallelism.

parallelism is a Kubernetes concept; in PaddlePaddle, trainers may be clearer.

> num_gpus: GPU resources needed. If num_gpus == 0 while the env "PADDLE_USE_GPU" is set to True (or the opposite), paddle will throw a warning message when submitting the job.

dist_train will generate Kubernetes YAML files, and only if num_gpus > 0 will PADDLE_USE_GPU=True be set in the env of the YAML file. So maybe we don't need to check PADDLE_USE_GPU at the submit stage?

> pservers: if this is set, the number of pservers will be set to this value instead of being auto-calculated from the parallelism.

Maybe we should use pserver_bucket instead of pservers, pserver_cpu, pserver_memory.

The design doc about submitting the paddle job is here.

@Yancey1989 (Contributor) commented May 9, 2017:

I discussed #2047 with @typhoonzero in the office just now. To simplify the parameters, maybe we will have required parameters for beginners and advanced parameters for professionals.

Required parameters

| parameter | type   | default | explanation                  |
|-----------|--------|---------|------------------------------|
| num_gpus  | int    | 0       | GPU count for the job        |
| num_cpus  | int    | 1       | CPU count for the job        |
| memory    | string | 1G      | memory allocated for the job |

Advanced parameters

| parameter   | type   | explanation                       |
|-------------|--------|-----------------------------------|
| pservers    | int    | pserver process count             |
| pserver_cpu | int    | CPU count for each pserver        |
| pserver_mem | string | memory allocated for each pserver |
| trainers    | int    | trainer process count             |
| trainer_cpu | int    | CPU count for each trainer        |
| trainer_gpu | int    | GPU count for each trainer        |
| trainer_mem | string | memory allocated for each trainer |
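The two-tier parameter scheme could be sketched as below: beginners pass only required keys (which all have defaults), while any advanced key overrides the derived values. `build_job_config` is a hypothetical helper for illustration, not an agreed-upon API:

```python
# Required parameters and their defaults, per the table above.
REQUIRED_DEFAULTS = {"num_gpus": 0, "num_cpus": 1, "memory": "1G"}

# Advanced parameters for professional users.
ADVANCED_KEYS = {
    "pservers", "pserver_cpu", "pserver_mem",
    "trainers", "trainer_cpu", "trainer_gpu", "trainer_mem",
}

def build_job_config(**params):
    """Merge user parameters over the required defaults.

    Rejects any key that is neither a required nor an advanced parameter.
    """
    unknown = set(params) - set(REQUIRED_DEFAULTS) - ADVANCED_KEYS
    if unknown:
        raise ValueError("unknown parameters: %s" % sorted(unknown))
    conf = dict(REQUIRED_DEFAULTS)
    conf.update(params)
    return conf
```

A beginner call would be `build_job_config(num_gpus=1)`; a professional could add `trainers=8, pserver_mem="2G"` without the beginner path changing.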

Related issue: #2019
