Adding node priority #587
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
CLA Signed
Please make CI happy and add an e2e test for that.
Just went through the code diff; I'll review it in detail later :)
Force-pushed 6139a93 to 6120e2b
Force-pushed 805ddc7 to 93af353
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: k82cn, thandayuthapani. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
test/e2e/prioritize.go
Outdated
// All Pods Should be Scheduled in same node
nodeName := pods[0].Spec.NodeName
for _, pod := range pods {
    Expect(pod.Spec.NodeName).To(Equal(nodeName))
rep means all resources from all worker nodes; they cannot be dispatched into the same node. But how did this test pass?
rep is not all resources from all worker nodes.
for _, node := range nodes.Items {
........
........
alloc := kbapi.NewResource(node.Status.Allocatable)
for slot.LessEqual(alloc) {
alloc.Sub(slot)
res++
}
if res > 0 {
return node.Name, res
}
}
It is the number of workloads, for a particular amount of requested resource, that can be scheduled on that node. If res is greater than zero, the function returns the value for that particular worker node; it does not traverse all the worker nodes to calculate res. If the workload's requested resource cannot be scheduled on that node, res will be zero and the next node is checked; if res is not zero, it does not check other nodes and returns those values.
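For illustration, here is a minimal, self-contained sketch of that per-node computation; the Resource type and the name computeReplicasForNode are simplified stand-ins for this example, not the actual kbapi helpers.

package main

import "fmt"

// Resource is a simplified stand-in for kbapi.Resource, tracking only CPU millicores and memory bytes.
type Resource struct {
    MilliCPU float64
    Memory   float64
}

// LessEqual reports whether r fits into o in every dimension.
func (r *Resource) LessEqual(o *Resource) bool {
    return r.MilliCPU <= o.MilliCPU && r.Memory <= o.Memory
}

// Sub subtracts o from r in place.
func (r *Resource) Sub(o *Resource) {
    r.MilliCPU -= o.MilliCPU
    r.Memory -= o.Memory
}

// computeReplicasForNode mirrors the loop quoted above: keep subtracting the requested
// slot from the node's allocatable resources until it no longer fits, counting how many
// replicas of that slot the node can hold.
func computeReplicasForNode(alloc, slot Resource) int {
    res := 0
    for slot.LessEqual(&alloc) {
        alloc.Sub(&slot)
        res++
    }
    return res
}

func main() {
    alloc := Resource{MilliCPU: 4000, Memory: 8e9} // node with 4 CPUs and 8 GB allocatable
    slot := Resource{MilliCPU: 1000, Memory: 1e9}  // workload requesting 1 CPU and 1 GB
    fmt.Println(computeReplicasForNode(alloc, slot)) // 4
}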
Yes, you're right; so that's OK to me :)
BTW, it seems this case is not stable; would you help to check the CI result?
Yeah checking CI failures, will fix it.
Force-pushed 0ec201c to 68a364e
test/e2e/predicates.go
Outdated
for _, pod := range pods {
    Expect(pod.Spec.NodeName).To(Equal(nodeName))
}
podsWithAffinity := getPodOfPodGroup(context, pg1)
why do we change that?
All the pods were becoming unschedulable, so they were stuck in Pending state and waitPodGroupReady was timing out. So I have changed the test case: first a task with one pod and a label is deployed, then a second task with podAffinity is scheduled, and we check whether both pods have the same node name, verifying that the pod affinity predicate is working (see the sketch below).
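For concreteness, the second task's podAffinity could be built like the sketch below; the label key/value and the hostname topology key here are illustrative assumptions, not necessarily the exact values used in the test.

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// affinityToLabeledPod builds a required podAffinity term that forces the pod to land on
// the same node (kubernetes.io/hostname topology) as pods carrying the given label.
func affinityToLabeledPod(key, value string) *v1.Affinity {
    return &v1.Affinity{
        PodAffinity: &v1.PodAffinity{
            RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{
                {
                    LabelSelector: &metav1.LabelSelector{
                        MatchLabels: map[string]string{key: value},
                    },
                    TopologyKey: "kubernetes.io/hostname",
                },
            },
        },
    }
}

func main() {
    aff := affinityToLabeledPod("kube-batch-e2e", "affinity-target")
    fmt.Printf("%+v\n", aff.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution[0])
}

The first task's pod carries the matching label, the second task's pod spec sets this affinity, and the test then compares both pods' Spec.NodeName.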
This is about the resources of the DIND nodes created in Travis: one node always has a different amount of resources, so if the scheduler tries to place pods on that node using the rep value computed from the other nodes, one pod goes unschedulable, waitPodGroupReady returns an error, and the test case fails. To avoid that, in this version one pod is created with a label and the next pod with podAffinity, and we check whether both pods are scheduled on the same node or not; that verifies our pod affinity predicate and avoids the unschedulable scenario.
If so, just changing oneCpu to halfCpu when creating the Job should be enough.
In the InterPodAffinity priority function, they try to get the node of existingPod from its Pod object:
processPod := func(existingPod *v1.Pod) error {
existingPodNode, err := ipa.info.GetNodeInfo(existingPod.Spec.NodeName)
........
........
}
But in gang scheduling, we bind tasks only when the number of ready tasks is greater than or equal to MinAvailable:
occupied := readyTaskNum(job)
return occupied >= job.MinAvailable
So if our job has two tasks: the first gets allotted to a node and the cache (schedulercache) in our scheduler is updated with the new pod (v1.Pod) object, but that pod object will not have a nodeName (pod.Spec.NodeName) yet. When the second task comes and computes the InterPodAffinity score, it goes over all the pods from the cache (schedulercache.NodeInfo) and tries to find each pod's node from its Pod object (v1.Pod). The first task in our case will not have a nodeName in its pod (v1.Pod) object, since it has not yet been sent for binding and is still in Pending state, so for the second task we get an error that there is no node with a nil value (empty string):
Calculate Inter Pod Affinity Priority Failed because of Error: failed to find node <>
So neither pod can switch from Pending to Running, since the occupied value stays less than MinAvailable; our gang scheduling will not work with the K8s InterPodAffinity priority logic.
In the K8s pod affinity predicates, they use topologyPairsAntiAffinityPodsMap in their predicateMetaData and run predicates for the task, so predicates were working fine with gang scheduling, but the InterPodAffinity priority function in K8s will not support the gang scheduling model.
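To make the failure mode concrete, here is a minimal runnable sketch of why a gang member that is allocated but not yet bound breaks the node lookup; the types below are simplified stand-ins, not the real scheduler cache.

package main

import "fmt"

type node struct{ name string }

// nodeInfoCache mimics the name-based GetNodeInfo lookup used by the InterPodAffinity priority.
type nodeInfoCache map[string]*node

func (c nodeInfoCache) GetNodeInfo(name string) (*node, error) {
    if n, ok := c[name]; ok {
        return n, nil
    }
    return nil, fmt.Errorf("failed to find node <%s>", name)
}

func main() {
    cache := nodeInfoCache{"node-1": {name: "node-1"}}

    // The first gang member is allocated in the scheduler cache, but binding is deferred
    // until MinAvailable tasks are ready, so its API object still has an empty Spec.NodeName.
    pendingPodNodeName := ""

    // When the second task's InterPodAffinity score is computed, the lookup fails:
    if _, err := cache.GetNodeInfo(pendingPodNodeName); err != nil {
        fmt.Println(err) // failed to find node <>
    }
}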
But that listing is done by the InterPodAffinity code provided by K8s; if we have to change that code, we will have to bring it out of vendor and place it in kube-batch. We just send the current pod object (v1.Pod) and the list of nodes (both schedulercache.NodeInfo as a map and v1.Node as a slice).
"But that listing is done by the InterPodAffinity code provided by K8s"
It will use the lister from our code, right?
mapFn := priorities.NewInterPodAffinityPriority(cn, nl, pl, v1.DefaultHardPodAffinitySymmetricWeight)
Yes, I have made the change in our lister code, and the "affinity to itself" case also passes now. And I have changed the predicate pod affinity test case back to the one that existed before.
LGTM overall; please only do DeepCopy when task.NodeName != pod.NodeName.
We don't use the List and FilteredList functions in the interPodAffinity priority code. Even if that check is included, it will not affect the behaviour of the priority, so I have just implemented the same thing as is implemented in the predicate plugin.
Force-pushed 988d7aa to a3aaf9e
for _, task := range tasks {
    if selector.Matches(labels.Set(task.Pod.Labels)) {
        pod := task.Pod.DeepCopy()
We only need to DeepCopy the pod when task.NodeName != Pod.NodeName.
In the interPodAffinity priority code, we do not use the PodLister's List and FilteredList functions; listing is done implicitly in the interPodAffinity code, which ranges over every node, gets pods from the NodeInfo object, and calls the processPod function, which uses only the GetNodeInfo function:
processNode := func(i int) {
nodeInfo := nodeNameToInfo[allNodeNames[i]]
......
......
for _, existingPod := range nodeInfo.Pods() {
if err := processPod(existingPod); err != nil {
The interPodAffinity priority code of K8s uses
existingPodNode, err := ipa.info.GetNodeInfo(existingPod.Spec.NodeName)
and since the existingPod.Spec.NodeName value was empty we were getting an error, so I have made changes only in the func (c *cachedNodeInfo) GetNodeInfo(name string) (*v1.Node, error) function; the other listers are the same as in the predicate plugin. Changes have been done only to GetNodeInfo for fixing that case, not in FilteredList and List, which are the same as in the predicate plugin. The FilteredList and List functions are not used in the interPodAffinity priority code; they are defined only because that struct has to implement an interface. Changes in those functions will not affect the behaviour of the priority function.
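A minimal sketch of the kind of change described above; the nodes map, the nodeEntry shape, and the fallback strategy for pods with an empty Spec.NodeName are illustrative assumptions, and the actual kube-batch code may differ in details.

package nodeorder

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// nodeEntry is a simplified stand-in for the per-node cache entry
// (schedulercache.NodeInfo in the real code): the node object plus its pods.
type nodeEntry struct {
    Node *v1.Node
    Pods []*v1.Pod
}

// cachedNodeInfo is a sketch of the lister backing GetNodeInfo for the InterPodAffinity
// priority; nodes stands in for the session's node map.
type cachedNodeInfo struct {
    nodes map[string]*nodeEntry
}

// GetNodeInfo returns the node by name. With gang scheduling, a task can already be
// allocated in the scheduler cache while its pod still has an empty Spec.NodeName
// (binding is deferred until MinAvailable tasks are ready); in that case the lookup
// falls back to the node holding such a pending pod instead of failing the priority run.
func (c *cachedNodeInfo) GetNodeInfo(name string) (*v1.Node, error) {
    if entry, found := c.nodes[name]; found {
        return entry.Node, nil
    }

    for _, entry := range c.nodes {
        for _, pod := range entry.Pods {
            if pod.Spec.NodeName == "" {
                return entry.Node, nil
            }
        }
    }

    return nil, fmt.Errorf("failed to find node <%s>", name)
}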
If task.NodeName == pod.NodeName, append the pod as before; if task.NodeName != pod.NodeName, do a DeepCopy and then append the copied pod (a sketch of this is below). Do you mean that does not work?
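A sketch of that suggestion applied to the lister's List function; the taskInfo and podLister shapes below are assumptions for illustration, not the exact kube-batch types.

package plugins

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
)

// taskInfo is a simplified stand-in for the scheduler's task info: the cached pod plus
// the node the task has been (tentatively) allocated to.
type taskInfo struct {
    NodeName string
    Pod      *v1.Pod
}

// podLister is a sketch of the lister handed to the K8s predicate/priority code.
type podLister struct {
    tasks []*taskInfo
}

// List returns pods matching the selector. A pod whose task has been allocated to a node
// that the API object does not reflect yet is DeepCopy'd before its NodeName is filled in,
// so the cached object is never mutated; otherwise the cached pod is reused as-is.
func (pl *podLister) List(selector labels.Selector) ([]*v1.Pod, error) {
    var pods []*v1.Pod
    for _, task := range pl.tasks {
        if !selector.Matches(labels.Set(task.Pod.Labels)) {
            continue
        }
        if task.NodeName != task.Pod.Spec.NodeName {
            pod := task.Pod.DeepCopy()
            pod.Spec.NodeName = task.NodeName
            pods = append(pods, pod)
        } else {
            pods = append(pods, task.Pod)
        }
    }
    return pods, nil
}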
That List function you are asking me to change is not used by the interPodAffinity priority code.
What's different? If NodeName does not change, why do we do a DeepCopy? Priority and predicate only read pod info.
I have changed the implementation according to your comments.
for _, task := range tasks {
    if podFilter(task.Pod) && selector.Matches(labels.Set(task.Pod.Labels)) {
        pod := task.Pod.DeepCopy()
ditto
Force-pushed a3aaf9e to 4ea817d
Force-pushed 4ea817d to e2b2f3c
/lgtm Thanks for your contribution :)
Signed-off-by: thandayuthapani [email protected]
What this PR does / why we need it:
This PR includes the implementation of the Node Priority feature in Kube-Batch.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #520
Special notes for your reviewer:
I have added priority only in the allocate action; I will add it in the preempt action once the implementation for allocate is reviewed (a conceptual sketch follows).
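For context, here is a purely conceptual sketch of how a node-ordering score could be consumed when allocating a task; nodeScoreFn, bestNode, and the candidate list are illustrative names and do not reflect the actual kube-batch allocate code.

package main

import "fmt"

// nodeScoreFn scores a candidate node for a given task; higher is better. In the real
// scheduler this role is played by the aggregated priority functions.
type nodeScoreFn func(task, node string) float64

// bestNode picks the highest-scoring node among the candidates that passed predicates.
func bestNode(task string, candidates []string, score nodeScoreFn) (string, bool) {
    best, bestScore, found := "", 0.0, false
    for _, node := range candidates {
        if s := score(task, node); !found || s > bestScore {
            best, bestScore, found = node, s, true
        }
    }
    return best, found
}

func main() {
    // Toy scoring function: prefer node-2 for every task.
    score := func(task, node string) float64 {
        if node == "node-2" {
            return 10
        }
        return 1
    }
    node, ok := bestNode("job-1/task-0", []string{"node-1", "node-2", "node-3"}, score)
    fmt.Println(node, ok) // node-2 true
}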