
Simpler cluster train job submit code #2047

Closed
typhoonzero opened this issue May 8, 2017 · 3 comments
@typhoonzero (Contributor):

@Yancey1989 wrote a job submit tool at: https://github.com/Yancey1989/paddle-job

Currently, submitting a job looks like:

paddle.init(
            use_gpu=False,
            trainer_count=1,
            port=7164,
            ports_num=1,
            ports_num_for_sparse=1,
            num_gradient_servers=1,
            trainer_id=fetch_trainer_id(),
            pservers=fetch_pserver_ips())
job.dist_train(
        trainer=trainer,
        reader=paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
        num_passes=30,
        event_handler=event_handler,
        paddle_job=job.PaddleJob(
            pservers=3,
            base_image="yancey1989/paddle-cloud",
            input="/yanxu05",
            output="/yanxu05",
            job_name="paddle-cloud",
            namespace="yanxu",
            use_gpu=False,
            cpu_num=3,
            trainer_package_path="/example/word2vec",
            entry_point="python api_train_v2.py"))

We want to make it simpler, like this:

# init from ENV "PADDLE_*", args below will overwrite the ENVs
paddle.init(use_gpu=False)
...
myjob = job.dist_train(
        trainer=trainer,
        reader=my_dist_reader("dataset-name"),
        num_passes=30,
        event_handler=event_handler,
        paddle_job=job.PaddleJob(
            [cluster configurations...]))
print "view job status at: ", myjob.status_url()

Required ENVs:

  • "PADDLE_PSERVERS"
  • "PADDLE_TRAINER_ID"
  • "PADDLE_TRAINER_COUNT"
  • "PADDLE_NUM_GRADIENT_SERVERS"
  • "PADDLE_PORTS_NUM_FOR_SPARSE"

Optional ENVs:

  • "PADDLE_PORT": default 7164
  • "PADDLE_PORTS_NUM": default 1
  • "PADDLE_USE_GPU": default False
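A minimal sketch of how `paddle.init` could resolve the ENVs above, with explicit keyword arguments overwriting them. The helper names `_env_or` and `init_from_env` are illustrative only, not part of the real PaddlePaddle API:

```python
import os

def _env_or(name, default, cast=str):
    """Return the env var cast with `cast`, or `default` if the var is unset."""
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)

def init_from_env(**overrides):
    """Build an init config from PADDLE_* ENVs; kwargs overwrite the ENVs."""
    conf = {
        # required ENVs (no defaults)
        "pservers": _env_or("PADDLE_PSERVERS", None),
        "trainer_id": _env_or("PADDLE_TRAINER_ID", None, int),
        "trainer_count": _env_or("PADDLE_TRAINER_COUNT", None, int),
        "num_gradient_servers": _env_or("PADDLE_NUM_GRADIENT_SERVERS", None, int),
        "ports_num_for_sparse": _env_or("PADDLE_PORTS_NUM_FOR_SPARSE", None, int),
        # optional ENVs with the defaults listed above
        "port": _env_or("PADDLE_PORT", 7164, int),
        "ports_num": _env_or("PADDLE_PORTS_NUM", 1, int),
        "use_gpu": _env_or("PADDLE_USE_GPU", False,
                           lambda s: s.lower() == "true"),
    }
    conf.update(overrides)  # explicit arguments win over ENVs
    return conf
```

With this shape, `paddle.init(use_gpu=False)` keeps all cluster wiring in the environment while still letting the script pin individual values.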

Cluster Job Configurations:

Job Resources

  • parallism: parallelism equals the number of trainers; the number of pservers is calculated from the parallelism.
  • num_gpus: GPU resources needed. If num_gpus == 0 while the env "PADDLE_USE_GPU" is set to True (or the opposite), paddle will throw a warning message when submitting the job.
  • num_cpus: CPU resources needed.
  • entry_point: command to start your training program, e.g. python /data/cloud/storage/path/train.py
  • NOTE: Paddle will by default mount your cloud storage volume at /data, so your training program can read data anywhere under /data.
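The num_gpus warning described above could be sketched like this; `check_gpu_config` is a made-up name for this illustration, not an existing Paddle function:

```python
import os
import warnings

def check_gpu_config(num_gpus):
    """Warn when num_gpus and the PADDLE_USE_GPU env disagree.

    Returns True when the two settings are consistent, False otherwise.
    """
    use_gpu = os.environ.get("PADDLE_USE_GPU", "False").lower() == "true"
    if (num_gpus == 0 and use_gpu) or (num_gpus > 0 and not use_gpu):
        warnings.warn(
            "num_gpus=%d conflicts with PADDLE_USE_GPU=%s"
            % (num_gpus, use_gpu))
        return False
    return True
```

(As noted in the discussion below, this check may be unnecessary if dist_train itself generates the Kubernetes YAML and sets PADDLE_USE_GPU consistently with num_gpus.)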

Advanced settings:

  • pservers: if this is set, the number of pservers will be set to this value instead of being auto-calculated from the parallelism.
  • base_image: use your own image to run.
  • job_name: use your own job name.
  • NOTE: namespace is read from ENV: "USER_NAMESPACE"
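The pserver rule above (explicit override, else auto-calculation from parallelism) could look like the sketch below. The function name `pserver_count` and the one-pserver-per-four-trainers ratio are assumptions for illustration; the issue does not specify the actual formula:

```python
def pserver_count(parallelism, pservers=None):
    """Return the number of pservers for a job.

    Uses the explicit `pservers` advanced setting when given; otherwise
    derives it from parallelism (assumed here: ceil(parallelism / 4),
    with a minimum of one pserver).
    """
    if pservers is not None:
        return pservers
    return max(1, (parallelism + 3) // 4)
```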
@Yancey1989 (Contributor):

Thanks for @typhoonzero's suggestions, they're very useful!

@Yancey1989 (Contributor) commented May 9, 2017:

> parallism: parallelism equals the number of trainers; the number of pservers is calculated from the parallelism.

parallelism is a Kubernetes concept; in PaddlePaddle, trainers may be clearer.

> num_gpus: GPU resources needed. If num_gpus == 0 while the env "PADDLE_USE_GPU" is set to True (or the opposite), paddle will throw a warning message when submitting the job.

dist_train will generate Kubernetes YAML files, and only if num_gpus > 0 will PADDLE_USE_GPU=True be set in the env of the YAML file. So maybe we don't need to check PADDLE_USE_GPU at the submit stage?

> pservers: if this is set, the number of pservers will be set to this value instead of being auto-calculated from the parallelism.

Maybe we should use pserver_bucket instead of pservers, pserver_cpu, pserver_memory.

The design doc about submitting the paddle job is here.

@Yancey1989 (Contributor) commented May 9, 2017:

I discussed #2047 with @typhoonzero in the office just now. To simplify the parameters, maybe we will have required parameters for beginners and advanced parameters for professionals.

Required parameters

| parameter | type   | default | explanation                  |
|-----------|--------|---------|------------------------------|
| num_gpus  | int    | 0       | GPU count for the job        |
| num_cpus  | int    | 1       | CPU count for the job        |
| memory    | string | 1G      | memory allocated for the job |

Advanced parameters

| parameter   | type   | explanation                       |
|-------------|--------|-----------------------------------|
| pservers    | int    | pserver process count             |
| pserver_cpu | int    | CPU count for each pserver        |
| pserver_mem | string | memory allocated for each pserver |
| trainers    | int    | trainer process count             |
| trainer_cpu | int    | CPU count for each trainer        |
| trainer_gpu | int    | GPU count for each trainer        |
| trainer_mem | string | memory allocated for each trainer |
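The two-tier parameter scheme could be sketched as below: beginners pass only required keys (which all have defaults), while any advanced key overrides the derived values. `build_job_config` is a hypothetical helper for illustration, not an agreed-upon API:

```python
# Required parameters and their defaults, per the table above.
REQUIRED_DEFAULTS = {"num_gpus": 0, "num_cpus": 1, "memory": "1G"}

# Advanced parameters for professional users.
ADVANCED_KEYS = {
    "pservers", "pserver_cpu", "pserver_mem",
    "trainers", "trainer_cpu", "trainer_gpu", "trainer_mem",
}

def build_job_config(**params):
    """Merge user parameters over the required defaults.

    Rejects any key that is neither a required nor an advanced parameter.
    """
    unknown = set(params) - set(REQUIRED_DEFAULTS) - ADVANCED_KEYS
    if unknown:
        raise ValueError("unknown parameters: %s" % sorted(unknown))
    conf = dict(REQUIRED_DEFAULTS)
    conf.update(params)
    return conf
```

A beginner call would be `build_job_config(num_gpus=1)`; a professional could add `trainers=8, pserver_mem="2G"` without the beginner path changing.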

Related issue: #2019
