Adding node priority #587
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
CLA Signed
Please make CI happy and add an e2e test for that.
Just went through the code diff; I'll review it in detail later :)
Force-pushed 6139a93 to 6120e2b
Force-pushed 805ddc7 to 93af353
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: k82cn, thandayuthapani. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
test/e2e/prioritize.go
Outdated
// All Pods Should be Scheduled in same node
nodeName := pods[0].Spec.NodeName
for _, pod := range pods {
    Expect(pod.Spec.NodeName).To(Equal(nodeName))
rep means all resources from all worker nodes; they cannot be dispatched into the same node. But how did this test pass?
rep is not all resources from all worker nodes.
for _, node := range nodes.Items {
........
........
alloc := kbapi.NewResource(node.Status.Allocatable)
for slot.LessEqual(alloc) {
alloc.Sub(slot)
res++
}
if res > 0 {
return node.Name, res
}
}
It is the number of workloads, for a particular amount of requested resource, that can be scheduled on that node. If res is greater than zero, the function returns the value for that particular worker node; it does not traverse all the worker nodes to calculate res. If the workload's requested resource cannot be scheduled on that node, res will be zero and the next node is checked; if res is not zero, it does not check other nodes and returns those values.
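For illustration, here is a minimal, self-contained sketch of that per-node computation; the Resource type and the name computeReplicasForNode are simplified stand-ins for this example, not the actual kbapi helpers.

package main

import "fmt"

// Resource is a simplified stand-in for kbapi.Resource, tracking only CPU millicores and memory bytes.
type Resource struct {
    MilliCPU float64
    Memory   float64
}

// LessEqual reports whether r fits into o in every dimension.
func (r *Resource) LessEqual(o *Resource) bool {
    return r.MilliCPU <= o.MilliCPU && r.Memory <= o.Memory
}

// Sub subtracts o from r in place.
func (r *Resource) Sub(o *Resource) {
    r.MilliCPU -= o.MilliCPU
    r.Memory -= o.Memory
}

// computeReplicasForNode mirrors the loop quoted above: keep subtracting the requested
// slot from the node's allocatable resources until it no longer fits, counting how many
// replicas of that slot the node can hold.
func computeReplicasForNode(alloc, slot Resource) int {
    res := 0
    for slot.LessEqual(&alloc) {
        alloc.Sub(&slot)
        res++
    }
    return res
}

func main() {
    alloc := Resource{MilliCPU: 4000, Memory: 8e9} // node with 4 CPUs and 8 GB allocatable
    slot := Resource{MilliCPU: 1000, Memory: 1e9}  // workload requesting 1 CPU and 1 GB
    fmt.Println(computeReplicasForNode(alloc, slot)) // 4
}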
Yes, you're right; so that's OK to me :)
BTW, it seems this case is not stable; would you help to check the CI result?
Yeah checking CI failures, will fix it.
Force-pushed 0ec201c to 68a364e
test/e2e/predicates.go
Outdated
for _, pod := range pods {
    Expect(pod.Spec.NodeName).To(Equal(nodeName))
}
podsWithAffinity := getPodOfPodGroup(context, pg1)
why do we change that?
All the pods were becoming unschedulable, so they were stuck in Pending state and waitPodGroupReady was timing out. So I have changed the test case: first a task with one pod and a label is deployed, then a second task with podAffinity is scheduled, and we check whether both pods have the same node name, verifying that the pod affinity predicate is working (see the sketch below).
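For concreteness, the second task's podAffinity could be built like the sketch below; the label key/value and the hostname topology key here are illustrative assumptions, not necessarily the exact values used in the test.

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// affinityToLabeledPod builds a required podAffinity term that forces the pod to land on
// the same node (kubernetes.io/hostname topology) as pods carrying the given label.
func affinityToLabeledPod(key, value string) *v1.Affinity {
    return &v1.Affinity{
        PodAffinity: &v1.PodAffinity{
            RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{
                {
                    LabelSelector: &metav1.LabelSelector{
                        MatchLabels: map[string]string{key: value},
                    },
                    TopologyKey: "kubernetes.io/hostname",
                },
            },
        },
    }
}

func main() {
    aff := affinityToLabeledPod("kube-batch-e2e", "affinity-target")
    fmt.Printf("%+v\n", aff.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution[0])
}

The first task's pod carries the matching label, the second task's pod spec sets this affinity, and the test then compares both pods' Spec.NodeName.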
This is about the resources of the DIND nodes created in Travis: one node always has a different amount of resources, so if the scheduler tries to place pods on that node using the rep value computed from the other nodes, one pod goes unschedulable, waitPodGroupReady returns an error, and the test case fails. To avoid that, in this version one pod is created with a label and the next pod with podAffinity, and we check whether both pods are scheduled on the same node or not; that verifies our pod affinity predicate and avoids the unschedulable scenario.
If so, just changing oneCpu to halfCpu when creating the Job should be enough.
In the InterPodAffinity priority function, they try to get the node of existingPod from its Pod object:
processPod := func(existingPod *v1.Pod) error {
existingPodNode, err := ipa.info.GetNodeInfo(existingPod.Spec.NodeName)
........
........
}
But in gang scheduling, we bind tasks only when the number of ready tasks is greater than or equal to MinAvailable:
occupied := readyTaskNum(job)
return occupied >= job.MinAvailable
So if our job has two tasks: the first gets allotted to a node and the cache (schedulercache) in our scheduler is updated with the new pod (v1.Pod) object, but that pod object will not have a nodeName (pod.Spec.NodeName) yet. When the second task comes and computes the InterPodAffinity score, it goes over all the pods from the cache (schedulercache.NodeInfo) and tries to find each pod's node from its Pod object (v1.Pod). The first task in our case will not have a nodeName in its pod (v1.Pod) object, since it has not yet been sent for binding and is still in Pending state, so for the second task we get an error that there is no node with a nil value (empty string):
Calculate Inter Pod Affinity Priority Failed because of Error: failed to find node <>
So neither pod can switch from Pending to Running, since the occupied value stays less than MinAvailable; our gang scheduling will not work with the K8s InterPodAffinity priority logic.
In the K8s pod affinity predicates, they use topologyPairsAntiAffinityPodsMap in their predicateMetaData and run predicates for the task, so predicates were working fine with gang scheduling, but the InterPodAffinity priority function in K8s will not support the gang scheduling model.
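To make the failure mode concrete, here is a minimal runnable sketch of why a gang member that is allocated but not yet bound breaks the node lookup; the types below are simplified stand-ins, not the real scheduler cache.

package main

import "fmt"

type node struct{ name string }

// nodeInfoCache mimics the name-based GetNodeInfo lookup used by the InterPodAffinity priority.
type nodeInfoCache map[string]*node

func (c nodeInfoCache) GetNodeInfo(name string) (*node, error) {
    if n, ok := c[name]; ok {
        return n, nil
    }
    return nil, fmt.Errorf("failed to find node <%s>", name)
}

func main() {
    cache := nodeInfoCache{"node-1": {name: "node-1"}}

    // The first gang member is allocated in the scheduler cache, but binding is deferred
    // until MinAvailable tasks are ready, so its API object still has an empty Spec.NodeName.
    pendingPodNodeName := ""

    // When the second task's InterPodAffinity score is computed, the lookup fails:
    if _, err := cache.GetNodeInfo(pendingPodNodeName); err != nil {
        fmt.Println(err) // failed to find node <>
    }
}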
But that listing is done by the InterPodAffinity code provided by K8s; if we have to change that code, we will have to bring it out of vendor and place it in kube-batch. We just send the current pod object (v1.Pod) and the list of nodes (both schedulercache.NodeInfo as a map and v1.Node as a slice).
"But that listing is done by the InterPodAffinity code provided by K8s"
It will use the lister from our code, right?
mapFn := priorities.NewInterPodAffinityPriority(cn, nl, pl, v1.DefaultHardPodAffinitySymmetricWeight)
Yes, I have made the change in our lister code, and the "affinity to itself" case also passes now. And I have changed the predicate pod affinity test case back to the one that existed before.
LGTM overall; please only do DeepCopy when task.NodeName != pod.NodeName.
We don't use the List and FilteredList functions in the interPodAffinity priority code. Even if that check is included, it will not affect the behaviour of the priority, so I have just implemented the same thing as is implemented in the predicate plugin.
Force-pushed 988d7aa to a3aaf9e
for _, task := range tasks {
    if selector.Matches(labels.Set(task.Pod.Labels)) {
        pod := task.Pod.DeepCopy()
We only need to DeepCopy the pod when task.NodeName != Pod.NodeName.
In the interPodAffinity priority code, we do not use the PodLister's List and FilteredList functions; listing is done implicitly in the interPodAffinity code, which ranges over every node, gets pods from the NodeInfo object, and calls the processPod function, which uses only the GetNodeInfo function:
processNode := func(i int) {
nodeInfo := nodeNameToInfo[allNodeNames[i]]
......
......
for _, existingPod := range nodeInfo.Pods() {
if err := processPod(existingPod); err != nil {
The interPodAffinity priority code of K8s uses
existingPodNode, err := ipa.info.GetNodeInfo(existingPod.Spec.NodeName)
and since the existingPod.Spec.NodeName value was empty we were getting an error, so I have made changes only in the func (c *cachedNodeInfo) GetNodeInfo(name string) (*v1.Node, error) function; the other listers are the same as in the predicate plugin. Changes have been done only to GetNodeInfo for fixing that case, not in FilteredList and List, which are the same as in the predicate plugin. The FilteredList and List functions are not used in the interPodAffinity priority code; they are defined only because that struct has to implement an interface. Changes in those functions will not affect the behaviour of the priority function.
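A minimal sketch of the kind of change described above; the nodes map, the nodeEntry shape, and the fallback strategy for pods with an empty Spec.NodeName are illustrative assumptions, and the actual kube-batch code may differ in details.

package nodeorder

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// nodeEntry is a simplified stand-in for the per-node cache entry
// (schedulercache.NodeInfo in the real code): the node object plus its pods.
type nodeEntry struct {
    Node *v1.Node
    Pods []*v1.Pod
}

// cachedNodeInfo is a sketch of the lister backing GetNodeInfo for the InterPodAffinity
// priority; nodes stands in for the session's node map.
type cachedNodeInfo struct {
    nodes map[string]*nodeEntry
}

// GetNodeInfo returns the node by name. With gang scheduling, a task can already be
// allocated in the scheduler cache while its pod still has an empty Spec.NodeName
// (binding is deferred until MinAvailable tasks are ready); in that case the lookup
// falls back to the node holding such a pending pod instead of failing the priority run.
func (c *cachedNodeInfo) GetNodeInfo(name string) (*v1.Node, error) {
    if entry, found := c.nodes[name]; found {
        return entry.Node, nil
    }

    for _, entry := range c.nodes {
        for _, pod := range entry.Pods {
            if pod.Spec.NodeName == "" {
                return entry.Node, nil
            }
        }
    }

    return nil, fmt.Errorf("failed to find node <%s>", name)
}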
If task.NodeName == pod.NodeName, append the pod as before; if task.NodeName != pod.NodeName, do a DeepCopy and then append the copied pod (a sketch of this is below). Do you mean that does not work?
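A sketch of that suggestion applied to the lister's List function; the taskInfo and podLister shapes below are assumptions for illustration, not the exact kube-batch types.

package plugins

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
)

// taskInfo is a simplified stand-in for the scheduler's task info: the cached pod plus
// the node the task has been (tentatively) allocated to.
type taskInfo struct {
    NodeName string
    Pod      *v1.Pod
}

// podLister is a sketch of the lister handed to the K8s predicate/priority code.
type podLister struct {
    tasks []*taskInfo
}

// List returns pods matching the selector. A pod whose task has been allocated to a node
// that the API object does not reflect yet is DeepCopy'd before its NodeName is filled in,
// so the cached object is never mutated; otherwise the cached pod is reused as-is.
func (pl *podLister) List(selector labels.Selector) ([]*v1.Pod, error) {
    var pods []*v1.Pod
    for _, task := range pl.tasks {
        if !selector.Matches(labels.Set(task.Pod.Labels)) {
            continue
        }
        if task.NodeName != task.Pod.Spec.NodeName {
            pod := task.Pod.DeepCopy()
            pod.Spec.NodeName = task.NodeName
            pods = append(pods, pod)
        } else {
            pods = append(pods, task.Pod)
        }
    }
    return pods, nil
}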
That List function you are asking me to change is not used by the interPodAffinity priority code.
What's different? If NodeName does not change, why do we do a DeepCopy? Priority and predicate only read pod info.
I have changed the implementation according to your comments.
for _, task := range tasks {
    if podFilter(task.Pod) && selector.Matches(labels.Set(task.Pod.Labels)) {
        pod := task.Pod.DeepCopy()
ditto
Force-pushed a3aaf9e to 4ea817d
Force-pushed 4ea817d to e2b2f3c
/lgtm Thanks for your contribution :)
Signed-off-by: thandayuthapani [email protected]
What this PR does / why we need it:
This PR includes the implementation of the Node Priority feature in Kube-Batch.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #520
Special notes for your reviewer:
I have added priority only in the allocate action; I will add it in the preempt action once the implementation for allocate is reviewed (a conceptual sketch follows).
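For context, here is a purely conceptual sketch of how a node-ordering score could be consumed when allocating a task; nodeScoreFn, bestNode, and the candidate list are illustrative names and do not reflect the actual kube-batch allocate code.

package main

import "fmt"

// nodeScoreFn scores a candidate node for a given task; higher is better. In the real
// scheduler this role is played by the aggregated priority functions.
type nodeScoreFn func(task, node string) float64

// bestNode picks the highest-scoring node among the candidates that passed predicates.
func bestNode(task string, candidates []string, score nodeScoreFn) (string, bool) {
    best, bestScore, found := "", 0.0, false
    for _, node := range candidates {
        if s := score(task, node); !found || s > bestScore {
            best, bestScore, found = node, s, true
        }
    }
    return best, found
}

func main() {
    // Toy scoring function: prefer node-2 for every task.
    score := func(task, node string) float64 {
        if node == "node-2" {
            return 10
        }
        return 1
    }
    node, ok := bestNode("job-1/task-0", []string{"node-1", "node-2", "node-3"}, score)
    fmt.Println(node, ok) // node-2 true
}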