This repository has been archived by the owner on May 25, 2023. It is now read-only.

Adding node priority #587

Merged: 4 commits into master from node_priority, Feb 27, 2019

Conversation

@thandayuthapani (Contributor)

Signed-off-by: thandayuthapani [email protected]
What this PR does / why we need it:
This PR implements the Node Priority feature in kube-batch.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #520

Special notes for your reviewer:
Priority has been added only in the allocate action; it will be added to the preempt action once the allocate implementation has been reviewed.

@k8s-ci-robot (Contributor)

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 13, 2019
@thandayuthapani (Contributor Author)

CLA Signed

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 13, 2019
@k82cn (Contributor) commented Feb 14, 2019

Please make CI happy and add an e2e test for this.

@k82cn (Contributor) commented Feb 14, 2019

Just went through the code diff; I'll review it in detail later :)

@k82cn (Contributor) commented Feb 14, 2019

pkg/scheduler/util.go (review thread, outdated, resolved)
@thandayuthapani force-pushed the node_priority branch 3 times, most recently from 805ddc7 to 93af353, on February 15, 2019 07:15
@k82cn (Contributor) commented Feb 15, 2019

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k82cn, thandayuthapani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2019
// All Pods Should be Scheduled in same node
nodeName := pods[0].Spec.NodeName
for _, pod := range pods {
	Expect(pod.Spec.NodeName).To(Equal(nodeName))
Contributor:

rep means all the resources from all worker nodes; they cannot all be dispatched onto a single node. So how did this test pass?

Contributor Author:

rep is not all resources from all worker nodes.

for _, node := range nodes.Items {
	// ...
	alloc := kbapi.NewResource(node.Status.Allocatable)
	for slot.LessEqual(alloc) {
		alloc.Sub(slot)
		res++
	}

	if res > 0 {
		return node.Name, res
	}
}

res is the number of replicas of the workload, with its particular resource request, that can be scheduled on that node. If res is greater than zero, the function returns the values for that particular worker node; it does not keep traversing all the worker nodes to accumulate res. If the workload's requested resources cannot be scheduled on that node, res stays zero and the next node is checked; as soon as res is non-zero for a node, no other nodes are checked and those values are returned.
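For context, a minimal sketch of the slot-counting logic described above; the function name, its signature, and the omission of master/unschedulable-node handling are illustrative assumptions, not the actual e2e helper.

// clusterNodeSlots walks the worker nodes and returns the first node on which at
// least one replica of the requested "slot" fits, plus how many replicas fit there.
// It stops at the first node with res > 0; it does not sum across nodes.
func clusterNodeSlots(nodes []v1.Node, slot *kbapi.Resource) (string, int32) {
	for _, node := range nodes {
		res := int32(0)
		alloc := kbapi.NewResource(node.Status.Allocatable)
		for slot.LessEqual(alloc) {
			alloc.Sub(slot)
			res++
		}
		if res > 0 {
			return node.Name, res
		}
	}
	return "", 0
}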

Contributor:

Yes, you're right; that's OK with me :)

Contributor:

BTW, this case doesn't seem stable; would you help check the CI result?

Contributor Author:

Yeah, I'm checking the CI failures and will fix them.

@thandayuthapani force-pushed the node_priority branch 6 times, most recently from 0ec201c to 68a364e, on February 20, 2019 10:02
for _, pod := range pods {
	Expect(pod.Spec.NodeName).To(Equal(nodeName))
}
podsWithAffinity := getPodOfPodGroup(context, pg1)
Contributor:

why do we change that?

Contributor Author:

All the pods were becoming unschedulable, so they all stayed in the Pending state and waitPodGroupReady timed out. I changed the test case so that first a task with one pod and a label is deployed, then a second task with podAffinity is scheduled, and the test checks whether both pods have the same node name, verifying that the pod affinity predicate is working.
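A minimal sketch of the affinity used by the second task in the reworked case; the helper name and the label key/value are illustrative assumptions, not the actual e2e code (uses v1 "k8s.io/api/core/v1" and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1").

// affinityToLabel builds a required pod affinity so the second task's pod must land
// on the same node (hostname topology) as the labelled pod of the first task.
func affinityToLabel(key, value string) *v1.Affinity {
	return &v1.Affinity{
		PodAffinity: &v1.PodAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{key: value},
					},
					TopologyKey: "kubernetes.io/hostname",
				},
			},
		},
	}
}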

Contributor Author (@thandayuthapani) commented Feb 20, 2019:

[Screenshot from 2019-02-20 16:18: resources of the DIND nodes created in Travis]
This shows the resources of the DIND nodes created in Travis. One node always differs in resources, so if the scheduler tries to place pods on that node using the rep value computed from the other nodes, one pod becomes unschedulable, waitPodGroupReady returns an error, and the test case fails. To avoid that, in this version one pod is created with a label and the next pod with podAffinity, and the test checks whether both pods are scheduled on the same node; that verifies our pod affinity predicate while avoiding the unschedulable scenario.

Contributor:

If so, just changing oneCpu to halfCpu when creating the Job should be enough.

Contributor Author (@thandayuthapani) commented Feb 21, 2019:

In the InterPodAffinity priority function, they try to get the node of existingPod from its Pod object:

processPod := func(existingPod *v1.Pod) error {
	existingPodNode, err := ipa.info.GetNodeInfo(existingPod.Spec.NodeName)
	// ...
}

But in gang scheduling, we bind tasks only when the number of ready tasks is greater than or equal to MinAvailable:

occupied := readyTaskNum(job)
return occupied >= job.MinAvailable

So if our job has two tasks and the first gets allotted to a node, the cache (schedulercache) info in our scheduler is updated with the new pod (v1.Pod) object, but that pod object does not yet have a nodeName (pod.Spec.NodeName). When the second task comes and computes its InterPodAffinity score, it goes through all the pods in the cache (schedulercache.NodeInfo) and tries to find each pod's node from its Pod object (v1.Pod); the first task's pod has no nodeName, since it has not been sent for binding and is still in the Pending state, so for the second task we get an error that there is no node with that nil value (empty string):

Calculate Inter Pod Affinity Priority Failed because of Error: failed to find node <>

So neither task can switch from Pending to Running, since the occupied value stays below MinAvailable; our gang scheduling therefore does not work with the K8s InterPodAffinity priority logic.
In the K8s pod affinity predicates, they use topologyPairsAntiAffinityPodsMap in their predicateMetaData and run the predicates for the task, so the predicates work fine with gang scheduling, but the K8s InterPodAffinity priority function does not support the gang scheduling model.
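Putting the two quoted snippets together, a condensed sketch of the failure mode; readyTaskNum and MinAvailable come from the snippet above, while the jobReady wrapper and the api.JobInfo type are assumptions for illustration.

// Task A is allocated in kube-batch's cache but not bound yet, so its cached
// v1.Pod still has Spec.NodeName == "".
func jobReady(job *api.JobInfo) bool {
	occupied := readyTaskNum(job)
	return occupied >= job.MinAvailable // stays false while task A is Pending
}

// While scoring task B, the upstream priority walks every cached pod and calls
//     ipa.info.GetNodeInfo(existingPod.Spec.NodeName)
// which is "" for task A, so it fails with "failed to find node <>", and the
// job never reaches MinAvailable ready tasks.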

Contributor Author:

But that listing is done by the InterPodAffinity code provided by K8s; if we had to change that code, we would have to bring it out of vendor and place it in kube-batch. We just pass in the current pod object (v1.Pod) and the list of nodes (both schedulercache.NodeInfo as a map and v1.Node as a slice).

Contributor (@k82cn) commented Feb 22, 2019:

But that listing is done by InterPod Affinity code provided by K8s

It will use the lister from our code, right?

mapFn := priorities.NewInterPodAffinityPriority(cn, nl, pl, v1.DefaultHardPodAffinitySymmetricWeight)
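For reference, a sketch of how the returned function would be invoked, following the description earlier in this thread (current pod, the schedulercache.NodeInfo map, and the node slice); the variable names and the score aggregation are illustrative assumptions.

hostPriorityList, err := mapFn(task.Pod, nodeNameToInfo, nodes)
if err != nil {
	// e.g. "Calculate Inter Pod Affinity Priority Failed because of Error: ..."
	return err
}
for _, hp := range hostPriorityList {
	// Accumulate the per-node score into the plugin's node order result.
	nodeScores[hp.Host] += float64(hp.Score)
}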

Contributor Author:

Yes, I have made the change in our lister code, and the affinity-to-itself case also passes now. I have also changed the pod affinity predicate test case back to the one that existed before.

Contributor:

LGTM overall, please only do DeepCopy when task.NodeName != pod.NodeName.

Contributor Author:

We don't use the List and FilteredList functions in the InterPodAffinity priority code. Even if that check is included, it will not affect the behaviour of the priority, so I have just implemented the same thing as in the predicate plugin.

@thandayuthapani force-pushed the node_priority branch 2 times, most recently from 988d7aa to a3aaf9e, on February 27, 2019 06:41

for _, task := range tasks {
	if selector.Matches(labels.Set(task.Pod.Labels)) {
		pod := task.Pod.DeepCopy()
Contributor:

We only need to DeepCopy the pod when task.NodeName != Pod.NodeName.

Contributor Author:

In the InterPodAffinity priority code, we do not use the PodLister's List and FilteredList functions; the listing is done implicitly inside the InterPodAffinity code, which ranges over every node, gets the pods from the nodeinfo object, and calls the processPod function, which uses only the GetNodeInfo function:

processNode := func(i int) {
	nodeInfo := nodeNameToInfo[allNodeNames[i]]
	// ...
	for _, existingPod := range nodeInfo.Pods() {
		if err := processPod(existingPod); err != nil {

The K8s InterPodAffinity priority code then uses

	existingPodNode, err := ipa.info.GetNodeInfo(existingPod.Spec.NodeName)

Since existingPod.Spec.NodeName was nil, we were getting the error, so I made changes only to the func (c *cachedNodeInfo) GetNodeInfo(name string) (*v1.Node, error) function; the other listers are the same as in the predicate plugin. The change was made only to GetNodeInfo to fix that case, not to FilteredList and List, which remain the same as in the predicate plugin. FilteredList and List are not used by the InterPodAffinity priority code at all; they are defined only because the struct has to implement an interface, so changes to those functions would not affect the behaviour of the priority function.
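For reference, the shape of the lister method being discussed; the signature and the error text come from the comments above, while the session and Nodes field names are assumptions, and the PR's actual modification to this method is not shown in this thread.

// cachedNodeInfo resolves a node name against the scheduler's cache; this is the
// call that fails when a not-yet-bound pod still has an empty Spec.NodeName.
func (c *cachedNodeInfo) GetNodeInfo(name string) (*v1.Node, error) {
	node, found := c.session.Nodes[name]
	if !found {
		return nil, fmt.Errorf("failed to find node <%s>", name)
	}
	return node.Node, nil
}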

Contributor:

If task.NodeName == pod.NodeName, append the pod as before; if task.NodeName != pod.NodeName, do a DeepCopy and then append the copied pod. Do you mean that does not work?
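A sketch of the suggestion above; the surrounding List function context and the step that stamps the task's node name onto the copy are assumptions, while the selector match and DeepCopy follow the quoted diff.

for _, task := range tasks {
	if !selector.Matches(labels.Set(task.Pod.Labels)) {
		continue
	}
	if task.NodeName != task.Pod.Spec.NodeName {
		// The task has been allocated a node that its Pod object does not
		// reflect yet: copy the pod and stamp the node name onto the copy.
		pod := task.Pod.DeepCopy()
		pod.Spec.NodeName = task.NodeName
		pods = append(pods, pod)
	} else {
		// Nothing differs, so the cached pod can be appended as-is.
		pods = append(pods, task.Pod)
	}
}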

Contributor Author:

That List function you are asking me to change is not used by the InterPodAffinity priority code.

Contributor:

What's the difference? If NodeName does not change, why do we do a DeepCopy? Priority and predicate only read pod info.

Contributor Author:

I have changed the implementation according to your comments.


for _, task := range tasks {
	if podFilter(task.Pod) && selector.Matches(labels.Set(task.Pod.Labels)) {
		pod := task.Pod.DeepCopy()
Contributor:

ditto

@k82cn (Contributor) commented Feb 27, 2019

/lgtm

Thanks for your contribution :)

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2019
@k8s-ci-robot k8s-ci-robot merged commit e1204f3 into kubernetes-retired:master Feb 27, 2019
k8s-ci-robot added a commit that referenced this pull request on Mar 3, 2019 (…610-#609-#604-#613-upstream-release-0.4):

Automated cherry pick of #587: Adding Node Priority, #607: Fixed flaky test, #610: Updated log level, #609: Added conformance plugin, #604: Moved default implementation from pkg, #613: Removed namespaceAsQueue.
@thandayuthapani thandayuthapani deleted the node_priority branch March 8, 2019 10:22
kevin-wangzefeng pushed commits to kevin-wangzefeng/scheduler that referenced this pull request on Jun 28, 2019
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
lgtm: Indicates that a PR is ready to be merged.
size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support prioritize algorithm in kube-batch
3 participants