
Add CFS design #468

Merged 3 commits into PaddlePaddle:develop on Nov 20, 2017
Conversation

typhoonzero (Collaborator):

No description provided.

  - `ReplicaSet` of the `pserver` process
  - `Job` of the `trainer` process
- Queue to sort `TrainingJob` resources for scheduling
- Scheduler to determine which job to run or to scale, by:
Collaborator:

What exactly are the meanings of these four kinds of Job priority?

Collaborator Author:

What is described below is not four kinds of priority; these are the values the scheduler uses to decide each job's desired running state, i.e. the job's score.

a job has been waiting long enough, it can finally be scheduled onto the
cluster, no matter how low its priority may be (unless the cluster is full of
production services).
1. A cluster may run both online services and offline batch jobs. The online
@gongweibao (Collaborator), Nov 7, 2017:

My understanding: we can classify jobs by their nature into tiers:

  • online
  • offline
  • team experiment
  • personal experiment

This would give users a more intuitive understanding of Job priority, and to some extent avoid the current situation where everyone bumps their own job's priority up.

A higher tier's priority is always greater than a lower tier's; finer-grained priority ordering only applies within the same tier.

When submitting a job, the user could pass both a Job nature and a Job priority parameter.

Collaborator Author:

The priority levels are already described in the Interface section below.

consumption will be considered.

The scheduler stores all nodes in a red-black tree, sorted by the score
`sum(Prio() * ResourceScore() * Running() * running time)`.
Collaborator:

This formula appears rather abruptly. It's unclear how well this ordering works; is there an explanation?
Also, how is the running time estimated?
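
For illustration, a minimal Go sketch of how such a score might be computed, assuming only the accessor names quoted in the formula; `Unit` and `RunningTime()` are assumed names, and how running time is obtained is exactly the open question raised in this comment:

```go
// Unit is a hypothetical view of the scheduling unit, exposing only the
// accessors named in the score formula above.
type Unit interface {
	Prio() int64          // priority level of the job
	ResourceScore() int64 // weight of the resources the unit requests
	Running() int64       // number of replicas currently running
	RunningTime() int64   // elapsed running time in seconds (estimation TBD)
}

// score computes sum(Prio() * ResourceScore() * Running() * running time)
// over the given scheduling units; the red-black tree would be kept
// sorted by this value.
func score(units []Unit) int64 {
	var total int64
	for _, u := range units {
		total += u.Prio() * u.ResourceScore() * u.Running() * u.RunningTime()
	}
	return total
}
```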


## References

https://en.wikipedia.org/wiki/Completely_Fair_Scheduler
Collaborator:

Indentation issue.

Collaborator Author:

Comments all done.


### Interface

The scheduler deals with an atomic scheduling unit named `Node`. The `TrainingJob`
Reviewer:

Would this be confused with a Node in Kubernetes, i.e. a physical machine?

Collaborator Author:

Yep. Will change the naming.

Collaborator Author:

Done.


## Background

We are going to define a PaddlePaddle cluster job as a Kubernetes [TPR]() or
Collaborator Author:

Done.
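
For context, a hedged Go sketch of what the `TrainingJob` custom resource in the excerpt above might look like; every field name here is an illustrative assumption, not the schema the design actually defines:

```go
// TrainingJob is a sketch of the Kubernetes TPR mentioned above.
type TrainingJob struct {
	Name string
	Spec TrainingJobSpec
}

// TrainingJobSpec holds the assumed per-job parameters the scheduler
// and parser would consume.
type TrainingJobSpec struct {
	Prio     int64 // priority level consumed by the scheduler
	Trainers int   // desired number of trainer pods
	PServers int   // desired number of pserver pods
	GPUs     int   // GPUs requested per trainer, if any
}
```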

services have high priority and are not interruptible. But training jobs can
re-use the cluster resources at times of day when the online service is not
that active.
1. About quota, users' quotas should be considered so that a scheduled job is not
@Yancey1989 (Collaborator), Nov 8, 2017:

I think pserver and etcd should have higher priority than trainers, and because

> jobs requiring GPU resources should have higher priority to run on GPU
> machines than CPU-only jobs

pserver and etcd will be assigned to CPU nodes with higher priority.

Collaborator Author:

I thought pserver, etcd, master, and trainers in a single TrainingJob should have the same priority.


Cases that need to be considered during the implementation:

1. GPUs are much more expensive than CPUs; jobs requiring GPU resources should
Collaborator:

laugch -> launch

Running() int64

// Obj returns inner scheduling unit.
Obj() *interface{}
Collaborator:

Maybe `Obj() interface{}`; an interface value is itself a "pointer".

}
```

Currently we only support 4 levels of priority. Note that the priority is not
@helinwang (Collaborator), Nov 9, 2017:

Maybe

const (
  Experiment PrioLevel = 10
  Offline    PrioLevel = 100
  Normal     PrioLevel = 1000
  Production PrioLevel = 10000
)

could be more extensible?

Collaborator Author:

All done.
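
Putting the quoted fragments together, a hedged sketch of the full scheduling-unit interface; `Unit` is an assumed name (chosen since reviewers noted `Node` collides with the Kubernetes concept), and this is a reading of the excerpts, not the final design:

```go
package scheduler

// PrioLevel is the priority type suggested above; the gaps between
// values leave room to insert new levels without renumbering.
type PrioLevel int64

const (
	Experiment PrioLevel = 10
	Offline    PrioLevel = 100
	Normal     PrioLevel = 1000
	Production PrioLevel = 10000
)

// Unit is an assumed name for the atomic scheduling unit the design
// originally called `Node`.
type Unit interface {
	// Prio returns the unit's priority level.
	Prio() PrioLevel
	// Running returns the number of replicas currently running.
	Running() int64
	// Obj returns the inner scheduling unit; per the review comment,
	// a plain interface{} rather than *interface{}.
	Obj() interface{}
}
```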

then the job result can be updated to the production service. Other jobs like
experiments and one-shot jobs have lower priority; they can be scaled up when
the cluster is free and scaled down when the cluster is busy.
1. Otherwise, jobs should share the cluster resources fairly, which means, if
Collaborator:

Topic for discussion: how "fair" do we want to be? If we are truly fair, every job's trainer count will be constantly in flux, but cold-starting a trainer has a cost.

Maybe we can have a "freezing window", and only do the non-urgent scaling when entering the next window.

Collaborator Author:

Reasonable. The idea of a "freezing window" is awesome; will add it to the design.
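
To make the idea concrete, a hedged Go sketch of such a freezing window; all names, the urgency flag, and the window length are assumptions, not part of the design yet:

```go
package scheduler

import "time"

// ScaleOp is a hypothetical scaling request.
type ScaleOp struct {
	Job    string
	Delta  int  // change in trainer count
	Urgent bool // urgent ops bypass the freeze
}

const freezeWindow = 5 * time.Minute // assumed window length

// runScalingLoop applies urgent scaling immediately but defers non-urgent
// scaling until the next window boundary, so trainer counts are not
// constantly in flux and trainer cold-start cost stays bounded.
func runScalingLoop(pending <-chan ScaleOp, apply func(ScaleOp)) {
	ticker := time.NewTicker(freezeWindow)
	defer ticker.Stop()
	var deferred []ScaleOp
	for {
		select {
		case op := <-pending:
			if op.Urgent {
				apply(op)
			} else {
				deferred = append(deferred, op)
			}
		case <-ticker.C:
			// Window boundary: apply everything deferred during the freeze.
			for _, op := range deferred {
				apply(op)
			}
			deferred = deferred[:0]
		}
	}
}
```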


- Parser to parse the `TrainingJob` resource into corresponding job components,
  including:
  - `ReplicaSet` of the master process
Collaborator:

Using ReplicaSet / StatefulSet / Job means we will depend on Kubernetes' scheduler for scheduling Pods. Since we are creating our own scheduler, should we rely on Kubernetes' scheduler or not? What are the pros and cons of each case?

@typhoonzero (Collaborator, Author), Nov 13, 2017:

Sorry for the late reply.
Yes, you are right. Not using the default k8s scheduler would let us have more control over TrainingJobs (see here). The scheduler is in charge of putting pods on nodes.

Pros:

  • Can add weight to resource types like GPU when scheduling pods.
  • Scaling and scheduling can be in the same process.
  • New resource types, like FPGA, can be taken care of.

Cons:

  • The core function of scheduling pods onto nodes seems to be the same for the TrainingJob scheduler. The resource request per node won't be changed; we only change the number of pods to run, which is already done by the autoscaler.
  • Hard to implement; we would have to implement leader election for "High Availability".

I think using the default-scheduler for the pod->node placement is enough currently; we only need to queue TrainingJobs by priority and hand them to k8s, for now.

@helinwang (Collaborator), Nov 14, 2017:

@typhoonzero I see, thanks! Agreed that using the default scheduler is better for our current use case.

Another related question: do we need to use k8s Job / StatefulSet? Another possibility is to submit the creation and deletion of Pods directly (but still using the default scheduler).

Collaborator Author:

> Another related question: do we need to use k8s Job / StatefulSet? Another possibility is to submit the creation and deletion of Pods directly (but still using the default scheduler).

This is possible and may be useful. In exchange, the controller has to track all pods' statuses, which is what the k8s Job/StatefulSet controllers already implement.

Pros for directly controlling pods:

  • Control scaling by pod status, i.e. scale up/down the slowest pod (the pod running the smallest batch id).
  • Dynamically change the resource requests of pods.
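
For concreteness, a minimal client-go sketch of submitting one trainer Pod directly, without a Job or StatefulSet, while keeping pod->node placement with the default scheduler as agreed above. The pod name and image are placeholders, and the `Create` signature shown is from newer client-go versions than this 2017 discussion:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// createTrainerPod submits one trainer Pod directly; leaving
// Spec.SchedulerName unset keeps placement with the default scheduler.
func createTrainerPod(ns string) error {
	config, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "trainer-0"}, // placeholder name
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "paddlepaddle/paddle", // placeholder image
			}},
		},
	}
	_, err = clientset.CoreV1().Pods(ns).Create(context.TODO(), pod, metav1.CreateOptions{})
	return err
}
```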

Collaborator Author:

@helinwang @Yancey1989 what do you think of "submitting the creation and deletion of Pods directly"? I'll update the doc if we all agree on this.

@Yancey1989 (Collaborator), Nov 16, 2017:

> another possibility is we can submit the creation and deletion of Pods directly (but still using the default scheduler)

Maybe we also need a custom scheduler, because merely creating and deleting Pods still produces Pending Pods, and the default-scheduler uses a FIFO queue to schedule them, so we cannot dynamically adjust the priorities of all the pending Pods.
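
To illustrate the contrast with the default scheduler's FIFO queue, a hedged Go sketch of queuing pending TrainingJobs by priority using the standard `container/heap`; all names here are assumptions:

```go
package main

import "container/heap"

// pendingJob is a hypothetical pending TrainingJob awaiting submission.
type pendingJob struct {
	name string
	prio int64 // higher value = handed to k8s first
}

// prioQueue implements heap.Interface so that, unlike a FIFO queue,
// pending jobs pop in priority order and can be re-prioritized.
type prioQueue []*pendingJob

func (q prioQueue) Len() int            { return len(q) }
func (q prioQueue) Less(i, j int) bool  { return q[i].prio > q[j].prio }
func (q prioQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *prioQueue) Push(x interface{}) { *q = append(*q, x.(*pendingJob)) }
func (q *prioQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

func main() {
	q := &prioQueue{}
	heap.Init(q)
	heap.Push(q, &pendingJob{name: "experiment-1", prio: 10})
	heap.Push(q, &pendingJob{name: "production-1", prio: 10000})
	next := heap.Pop(q).(*pendingJob) // production-1 comes out first
	_ = next
}
```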

Collaborator:

@gongweibao: please take a look at this discussion; it's related to your work converting the Python job-start code into the Go controller.

Collaborator:

OK!

@gongweibao (Collaborator) left a review comment:

LGTM

@typhoonzero typhoonzero merged commit 304c4fc into PaddlePaddle:develop Nov 20, 2017