Add CFS design #468
Conversation
doc/design/scheduler.md
Outdated
- `ReplicaSet` of `pserver` process
- `Job` of `trainer` process
- Queue to sort `TrainingJob` resource for schedule
- Scheduler to determine which job to run or to scale by:
What exactly do these four kinds of job priority mean?
What is described below is not four kinds of priority; the scheduler uses these values to decide each job's desired running state, i.e. the job's score.
a job is waiting enough long, it can finally be scheduled to the cluster, no
matter it may have very low priority (except that the cluster is full of
production service).
1. A cluster may run both online service and offline batch jobs. The online
My understanding: we could classify jobs by nature into several tiers:
- online
- offline
- team experiment
- personal experiment

This way users may have a more intuitive understanding of job priority, and it would to some extent avoid the current situation where everyone bumps their own job's priority up. A higher tier's jobs always outrank a lower tier's; finer-grained priority ordering only applies within the same tier. When submitting a job, we could let users provide two parameters: a job nature and a job priority.
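The two-parameter submission idea could be sketched as below; all type and field names here are hypothetical illustrations, not anything from the design doc.

```go
package main

import "fmt"

// JobNature is the coarse tier suggested in the comment above.
// Names are hypothetical, not from the design doc.
type JobNature int

const (
	Personal JobNature = iota
	TeamExperiment
	Offline
	Online
)

// SubmitSpec carries both parameters the comment suggests.
type SubmitSpec struct {
	Nature   JobNature
	Priority int // fine-grained priority within the tier
}

// Less orders jobs: tier first, then priority within the tier,
// so raising Priority can never outrank a higher tier.
func Less(a, b SubmitSpec) bool {
	if a.Nature != b.Nature {
		return a.Nature > b.Nature // higher tier always wins
	}
	return a.Priority > b.Priority
}

func main() {
	online := SubmitSpec{Nature: Online, Priority: 1}
	personal := SubmitSpec{Nature: Personal, Priority: 100}
	fmt.Println(Less(online, personal)) // true: tier dominates priority
}
```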
Priorities are already described in the Interface section below.
doc/design/scheduler.md
Outdated
consumption will be considered.

Scheduler stores all nodes in a red-black tree, sorted by score
`sum(Prio() * ResourceScore() * Running() * running time)`
This formula appears rather abruptly. It's unclear how well this ordering works in practice; is there an explanation? Also, how is the running time estimated?
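For concreteness, a literal reading of the quoted formula could look like the sketch below; the field names mirror the method names in the formula, and everything else (per-unit summation, `Running` as a 0/1 flag, running time as observed seconds rather than an estimate) is an assumption on my part.

```go
package main

import "fmt"

// Unit is a minimal stand-in for the doc's scheduling unit.
// Field names follow the quoted formula; the rest is assumed.
type Unit struct {
	Prio          int64
	ResourceScore int64
	Running       int64 // assumed 1 if running, 0 otherwise
	RunningTime   int64 // assumed observed seconds, not an estimate
}

// Score computes sum(Prio() * ResourceScore() * Running() * running time)
// over a job's units, as the quoted formula seems to intend.
func Score(units []Unit) int64 {
	var s int64
	for _, u := range units {
		s += u.Prio * u.ResourceScore * u.Running * u.RunningTime
	}
	return s
}

func main() {
	// The idle unit (Running == 0) contributes nothing to the score.
	fmt.Println(Score([]Unit{{10, 2, 1, 30}, {10, 2, 0, 5}})) // 600
}
```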
doc/design/scheduler.md
Outdated
## References

https://en.wikipedia.org/wiki/Completely_Fair_Scheduler
Indentation issue.
Comments all done.
doc/design/scheduler.md
Outdated
### Interface

Scheduler deal with atomic scheduling unit named `Node`. The `TraniningJob`
Would this be confused with `Node` in Kubernetes, which is a physical machine?
Yep. Will change the naming.
Done.
doc/design/scheduler.md
Outdated
## Background

We are going to define PaddlePaddle cluster job as a Kubernetes [TPR]() or
Done.
doc/design/scheduler.md
Outdated
services have high priority and is not interuptable. But trainingjobs can
re-use the cluster resource when the online service came to certain time of
day that is not that active.
1. About quota, users quota should be considered so that scheduled job is not
I think `pserver` and `etcd` should have higher priority than `trainers`, and because "jobs require GPU resource should have higher priority to run on GPU machines than CPU only jobs", `pserver` and `etcd` will be assigned to CPU nodes with higher priority.
I think `pserver`, `etcd`, `master`, and `trainers` within a single `TrainingJob` should have the same priority.
doc/design/scheduler.md
Outdated
Cases that need to be considerd during the implementaion:

1. GPU is much more expensive than CPUs, jobs require GPU resource should
laugch -> launch
doc/design/scheduler.md
Outdated
Running() int64

// Obj returns inner scheduling unit.
Obj() *interface{}
Maybe `Obj() interface{}`; an interface value is itself a "pointer".
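The suggestion could be sketched as follows. A Go interface value already references its underlying data, so `*interface{}` is almost never what callers want; the concrete type `trainingJob` here is hypothetical, only the two quoted methods come from the thread.

```go
package main

import "fmt"

// Unit sketches the scheduling-unit interface under discussion; only the
// methods quoted in this thread are shown, everything else is assumed.
type Unit interface {
	Running() int64
	// Obj returns the inner scheduling unit as a plain interface{},
	// as suggested, since an interface value already references its data.
	Obj() interface{}
}

// trainingJob is a hypothetical concrete unit for illustration.
type trainingJob struct{ name string }

func (t *trainingJob) Running() int64   { return 1 }
func (t *trainingJob) Obj() interface{} { return t }

func main() {
	var u Unit = &trainingJob{name: "job-0"}
	// Callers type-assert Obj() back to the concrete type they expect.
	j := u.Obj().(*trainingJob)
	fmt.Println(j.name) // job-0
}
```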
doc/design/scheduler.md
Outdated
}
```

Currently we only support 4 levels of priority. Note that the priority is not
Maybe

```go
const (
	Experiement PrioLevel = 10
	Offline     PrioLevel = 100
	Normal      PrioLevel = 1000
	Production  PrioLevel = 10000
)
```

could be more extensible?
All done.
doc/design/scheduler.md
Outdated
then job result can be updated to the production service. Other jobs like
experiement and one-shot jobs have lower priority, they can be scaled up when
cluster is free and can be scaled down when cluster is busy.
1. Otherwise, jobs should share the cluster resource fairly, which means, if
Topic for discussion: how "fair" do we want to be? If we are really fair, every job's trainer count will constantly be in flux, but cold-starting a trainer has a cost.
Maybe we can have some "freezing window": we only do non-urgent scaling when entering the next window.
Reasonable. The idea of "freezing window" is awesome, will add to design.
- Parser to parse `TrainingJob` resource to corresponding job components,
  including:
  - `ReplicaSet` of master process
Using `ReplicaSet` / `StatefulSet` / `Job` means we will depend on Kubernetes' scheduler for scheduling Pods. Since we are creating our own scheduler, should we rely on Kubernetes' scheduler or not? What are the pros and cons of both cases?
Sorry for the late reply.

Yes, you are right. Not using the default k8s scheduler will let us have more control over `TrainingJobs`; see here. The scheduler is in charge of putting pods on nodes.

Pros:
- Can add weight to resource types like GPU when scheduling pods.
- Scaling and scheduling can be in the same process.
- New resource types like FPGA can be taken care of.

Cons:
- The core function of scheduling pods onto nodes seems to be the same for the `TrainingJob` scheduler. The resource request per node won't be changed; we only change the number of pods to run, which is already done by the `autoscaler`.
- Hard to implement; we would have to implement leader election for "High Availability".

I think using the `default-scheduler` for the pod->nodes job is enough currently; we only need to queue `TrainingJobs` by priority and hand them to k8s, for now.
@typhoonzero I see, thanks! Agree that using the default scheduler is better for our current use case.
Another related question: do we need to use k8s Job / StatefulSet? Another possibility is that we submit the creation and deletion of Pods directly (but still use the default scheduler).
> Another related question: do we need to use k8s Job / StatefulSet, another possibility is we can submit the creation and deletion of Pods directly (but still using the default scheduler).
This is possible and may be useful. However, the controller then has to track all pods' status itself, which is what the k8s Job/StatefulSet controllers already implement.
Pros for directly controlling pods:
- control scaling by pod status, i.e. scale up/down the slowest pod (the pod running the smallest batch-id)
- dynamically change resource requests of pods.
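The first pro could be sketched roughly as below; `podStatus` and the batch-id field are hypothetical illustrations of the commenter's idea, not an existing API.

```go
package main

import "fmt"

// podStatus is a hypothetical view of a trainer pod; tracking the last
// finished batch-id per pod is the idea from the comment above.
type podStatus struct {
	Name    string
	BatchID int64 // last batch the trainer finished
}

// slowestPod returns the pod with the smallest batch-id, i.e. the one the
// comment suggests scaling down first when the controller owns pods directly.
func slowestPod(pods []podStatus) string {
	if len(pods) == 0 {
		return ""
	}
	slowest := pods[0]
	for _, p := range pods[1:] {
		if p.BatchID < slowest.BatchID {
			slowest = p
		}
	}
	return slowest.Name
}

func main() {
	pods := []podStatus{
		{"trainer-0", 120},
		{"trainer-1", 97},
		{"trainer-2", 133},
	}
	fmt.Println(slowestPod(pods)) // trainer-1
}
```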
@helinwang @Yancey1989 what do you think of "submitting the creation and deletion of Pods directly"? I'll update the doc if we all agree on this.
> another possibility is we can submit the creation and deletion of Pods directly (but still using the default scheduler)
Maybe we also need a custom scheduler, because only creating and deleting Pods still produces `Pending` Pods, and the default-scheduler uses a FIFO queue to schedule them; we cannot dynamically adjust the priorities of all the pending Pods.
@gongweibao: please take a look at this discussion; it's related to your work converting the Python job-start code to a Go controller.
OK!
LGTM