scheduler: fix tracking of concurrent events #120413

pohly · 2023-09-04T17:30:41Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

The previous approach was based on the assumption that an in-flight pod can use the head of the received event list as marker for identifying all events that occur while the pod is in flight. That assumption is incorrect: when that existing element gets removed from the list because all pods that were in-flight when it was received are done, that marker's Next method returns nil and the code which should have seen several concurrent events (if there were any) missed all of those.

Which issue(s) this PR fixes:

A pod with concurrent events could incorrectly get moved to the unschedulable queue where it could got stuck until the next periodic purging after 5 minutes if there was no other event for it.

Special notes for your reviewer:

The approach with maintaining a single list of concurrent events can be fixed by inserting each in-flight pod into the list and using that element to identify "more recent" events for the pod.

This affects Kubernetes >= 1.28. We need to backport the fix.

Does this PR introduce a user-facing change?

Fix a 1.28 regression in scheduler: a pod with concurrent events could incorrectly get moved to the unschedulable queue where it could got stuck until the next periodic purging after 5 minutes if there was no other event for it.

k8s-ci-robot · 2023-09-04T17:30:49Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pohly

/cc @Huang-Wei @alculquicondor

pohly · 2023-09-04T17:42:26Z

pkg/scheduler/internal/queue/scheduling_queue.go

+			logger.Error(nil, "In flight Pod isn't found in the scheduling queue. If you see this error log, it's likely a bug in the scheduler.", "pod", klog.KObj(pInfo.Pod))
+			return framework.QueueAfterBackoff
+		}
+		if p.inFlightEvents.Len() > len(p.inFlightPods) {


I am replicating the previous if len(pInfo.UnschedulablePlugins) == 0 shortcut here, but only if absolutely necessary (= when pod is not in flight).

I am not exactly sure whether there are legitimate reasons for having it. One unit test depends on it, but perhaps that test is broken (see comment below).

pohly · 2023-09-04T17:44:06Z

pkg/scheduler/internal/queue/scheduling_queue_test.go

 	// Construct a Pod, but don't associate its scheduler failure to any plugin
-	hpp1 := highPriorityPodInfo.Pod.DeepCopy()
-	hpp1.Name = "hpp1"


Having two pods with the same UID breaks the inFlightPod tracking of pods. Somehow it worked before, but not anymore with the revised implementation.

good catch. We shouldn't be testing that.

pohly · 2023-09-04T17:45:24Z

pkg/scheduler/internal/queue/scheduling_queue_test.go

 	q.activeQ.Add(q.newQueuedPodInfo(hpp1))
 	if p, err := q.Pop(); err != nil || p.Pod != hpp1 {
 		t.Errorf("Expected: %v after Pop, but got: %v", hpp1, p.Pod.Name)
 	}
+	expectInFlightPods(t, q, hpp1.UID)


I used expectInFlightPods to figure out where the test flow went astray when it still used two pods with the same UID. I kept it because it provides some additional test coverage for both inFlightPods and inFlightEvents and because it makes the test easier to read.

pohly · 2023-09-04T17:47:58Z

pkg/scheduler/internal/queue/scheduling_queue_test.go

 	q.AddUnschedulableIfNotPresent(logger, q.newQueuedPodInfo(unschedulablePodInfo.Pod, "fooPlugin"), q.SchedulingCycle())
+	expectInFlightPods(t, q, medPriorityPodInfo.Pod.UID, highPriorityPodInfo.Pod.UID, hpp1.UID)


This is about the line above with q.newQueuedPodInfo: creating a new PodInfo instance feels wrong.

The test should more realistic and use the podInfo instance that it got from Pop, just like a real scheduling attempt.

I kept it because I don't want to rock the boat too much in one PR, but this might be worth a follow-up.

pohly · 2023-09-04T19:04:16Z

pkg/scheduler/internal/queue/scheduling_queue.go

+		// This can happen while updating a pod. In that case pInfo.UnschedulablePlugins should
+		// be empty. If it is not, we may have a problem.
+		if len(pInfo.UnschedulablePlugins) != 0 {
+			logger.Error(nil, "In flight Pod isn't found in the scheduling queue. If you see this error log, it's likely a bug in the scheduler.", "pod", klog.KObj(pInfo.Pod))


Code coverage and dlv show that this line is hit when running tests, for example:

2669: for _, op := range test.operations { 2670: for _, pInfo := range pInfos { =>2671: op(logger, queue, pInfo) 2672: } 2673: }

This is another instance of "pInfo is not the result of Pop". Doesn't look easy to fix, though.

Is this line being hit consistently? Do you have the test case for it?

This code hits it reliably:

kubernetes/pkg/scheduler/internal/queue/scheduling_queue_test.go

Line 2655 in 7e9fbc4

op(logger, queue, pInfo)

That's addUnschedulablePodBackToBackoffQ called for a fabricated QueuedPodInfo, without making sure that "inFlightPods" is set up.

Current master also hits the error log message. This seems to be the only test which does, so it might be easier to fix than I thought.

ah I see. Yeah, we should make sure to add a Pop before calling unschedulable. I remember Kensei added a lot of those in several tests. These probably slipped.

alculquicondor · 2023-09-05T14:34:05Z

That assumption is incorrect: when that existing element gets removed from the list because all pods that were in-flight when it was received are done

But it only gets removed once that event no longer matters. Or is the calculation for "does it still matter" wrong?

pohly · 2023-09-05T15:13:23Z

But it only gets removed once that event no longer matters. Or is the calculation for "does it still matter" wrong?

The current definition of "does it still matter" is "the pods that where in flight when the event was received are still in flight". That's what the PodInfo counter is for.

What happens is this:

pod1 popped
pod2 popped
event1 received -> stored as list head with count == 2
pod3 popped, using that list head as previousEvent
pod1 done -> counter == 1
pod2 done -> counter == 0 -> element gets removed, which means that its Next always returns nil
event2 received and relevant for in-flight pod3, gets stored as new list head with count == 1
pod3 done -> previousEvent.Next returns nil and thus event2 is not getting checked

alculquicondor · 2023-09-05T14:51:00Z

pkg/scheduler/internal/queue/scheduling_queue.go

+		// This can happen while updating a pod. In that case pInfo.UnschedulablePlugins should
+		// be empty. If it is not, we may have a problem.
+		if len(pInfo.UnschedulablePlugins) != 0 {
+			logger.Error(nil, "In flight Pod isn't found in the scheduling queue. If you see this error log, it's likely a bug in the scheduler.", "pod", klog.KObj(pInfo.Pod))


Is this line being hit consistently? Do you have the test case for it?

alculquicondor · 2023-09-05T15:42:37Z

pkg/scheduler/internal/queue/scheduling_queue_test.go

 	}
-	return clusterEvents
+	return values
 }

 func Test_InFlightPods(t *testing.T) {


Did you add a test that would have failed before this change?

I think one scenario is

pop Pod-A

event X

pop Pod-B

event Y

requeue pod-A

requeue pod-B (should be hinted according to event Y)

I added one now. It's a bit more complicated, see second commit message:

To make it work without assumptions about the implementation, the QueuedPodInfo
returned by Pop must be the one passed to AddUnschedulableIfNotPresent
after (potentially) populating UnschedulablePlugins. This is done via callback
functions which bind to the same shared variable.

alculquicondor · 2023-09-05T15:46:03Z

pkg/scheduler/internal/queue/scheduling_queue_test.go

 				t.Errorf("Unexpected diff in inFlightPods (-want, +got):\n%s", diff)
 			}

-			if diff := cmp.Diff(listToClusterEvents(test.wantReceivedEvents), listToClusterEvents(q.receivedEvents), cmp.AllowUnexported(clusterEvent{})); diff != "" {
-				t.Errorf("Unexpected diff in receivedEvents (-want, +got):\n%s", diff)
+			wantInFlightEvents := make([]interface{}, 0)


Suggested change

wantInFlightEvents := make([]interface{}, 0)

var wantInFlightEvents []interface{}

Then I need to make the same change in listToValues and remove the pre-allocation, otherwise I end up with nil != []interface{}` differences in some cases - done.

cmpopts.EquateEmpty

alculquicondor · 2023-09-05T15:49:30Z

pkg/scheduler/internal/queue/scheduling_queue_test.go

+	}
+	sortUIDs := cmpopts.SortSlices(func(a, b types.UID) bool { return a < b })
+	if diff := cmp.Diff(uids, actualUIDs, sortUIDs); diff != "" {
+		t.Fatalf("Unexpected content of inFlightPods (--- want, +++ have):\n%s", diff)


Suggested change

t.Fatalf("Unexpected content of inFlightPods (--- want, +++ have):\n%s", diff)

t.Fatalf("Unexpected content of inFlightPods (-want, +got):\n%s", diff)

alculquicondor · 2023-09-05T15:54:28Z

pod2 done -> counter == 0 -> element gets removed, which means that its Next always returns nil

I think the bug was on the fact that we assumed that Next would still work (point to the previous next). In any case, the new approach looks good.

However, let's try to get into why the pod UID is sometimes not found in inFlightPods

The previous approach was based on the assumption that an in-flight pod can use the head of the received event list as marker for identifying all events that occur while the pod is in flight. That assumption is incorrect: when that existing element gets removed from the list because all pods that were in-flight when it was received are done, that marker's Next method returns nil and the code which should have seen several concurrent events (if there were any) missed all of those. As a result, a pod with concurrent events could incorrectly get moved to the unschedulable queue where it could got stuck until the next periodic purging after 5 minutes if there was no other event for it. The approach with maintaining a single list of concurrent events can be fixed by inserting each in-flight pod into the list and using that element to identify "more recent" events for the pod.

alculquicondor

/approve
/lgtm

alculquicondor · 2023-09-05T18:51:23Z

pkg/scheduler/internal/queue/scheduling_queue_test.go

+			wantActiveQPodNames: []string{pod2.Name},
+			wantInFlightPods:    nil,
+			wantInFlightEvents:  nil,
+			queueingHintMap: QueueingHintMapPerProfile{


I see that we didn't do this in other tests, but we should be listing outputs (wantX) last.

alculquicondor · 2023-09-05T18:53:26Z

pkg/scheduler/internal/queue/scheduling_queue_test.go

 	// Construct a Pod, but don't associate its scheduler failure to any plugin
-	hpp1 := highPriorityPodInfo.Pod.DeepCopy()
-	hpp1.Name = "hpp1"


good catch. We shouldn't be testing that.

alculquicondor · 2023-09-05T18:56:25Z

pkg/scheduler/internal/queue/scheduling_queue_test.go

 	q.AddUnschedulableIfNotPresent(logger, q.newQueuedPodInfo(unschedulablePodInfo.Pod, "fooPlugin"), q.SchedulingCycle())
+	expectInFlightPods(t, q, medPriorityPodInfo.Pod.UID, highPriorityPodInfo.Pod.UID, hpp1.UID)


k8s-ci-robot · 2023-09-05T18:56:44Z

LGTM label has been added.

Git tree hash: bd8b719532b631f6f7bda4f1314bbc7b0bdbfbf7

k8s-ci-robot · 2023-09-05T18:56:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/scheduler/OWNERS~~ [alculquicondor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The problematic scenario was having one pod in flight, one event in the list, and then detecting a concurrent event for a second pod after the first pod is done. The new test case covers that. To make it work without assumptions about the implementation, the QueuedPodInfo returned by Pop must be the one passed to AddUnschedulableIfNotPresent after (potentially) populating UnschedulablePlugins. This is done via callback functions which bind to the same shared variable.

pohly · 2023-09-05T19:01:39Z

/hold

pull-kubernetes-linter-hints is right, my new code lacks errors checking:

ERROR: pkg/scheduler/internal/queue/scheduling_queue_test.go:508:36: Error return value of `q.AddUnschedulableIfNotPresent` is not checked (errcheck)
ERROR: 					q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle())
ERROR: 					                              ^
ERROR: pkg/scheduler/internal/queue/scheduling_queue_test.go:513:36: Error return value of `q.AddUnschedulableIfNotPresent` is not checked (errcheck)
ERROR: 					q.AddUnschedulableIfNotPresent(logger, poppedPod2, q.SchedulingCycle())

Fixed: https://github.com/kubernetes/kubernetes/compare/14ba163bd05d79db4315fea34cff9c367ff19159..c131c92b9f36f7fbd2b3625c3e1d0573932007b2

pohly · 2023-09-05T19:04:39Z

/hold cancel

@alculquicondor: needs another LGTM.

alculquicondor · 2023-09-05T19:12:23Z

/lgtm
Please prepare the cherry-pick

k8s-ci-robot · 2023-09-05T19:12:29Z

LGTM label has been added.

Git tree hash: f222a7c29149f1a37c581888213d26e5850e1087

pohly · 2023-09-06T05:45:56Z

Cherry-pick is in #120445

…3-origin-release-1.28 Automated cherry pick of #120413: scheduler: fix tracking of concurrent events

k8s-ci-robot requested review from chendave and denkensk September 4, 2023 17:31

pohly commented Sep 4, 2023

View reviewed changes

k8s-ci-robot requested review from alculquicondor and Huang-Wei September 4, 2023 17:48

pohly commented Sep 4, 2023

View reviewed changes

alculquicondor reviewed Sep 5, 2023

View reviewed changes

pohly force-pushed the scheduler-in-flight-events-fix branch from c97552d to 14ba163 Compare September 5, 2023 18:29

pohly mentioned this pull request Sep 5, 2023

scheduler: fix TestIncomingPodsMetrics unit test #120434

Merged

alculquicondor reviewed Sep 5, 2023

View reviewed changes

k8s-ci-robot assigned alculquicondor Sep 5, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 5, 2023

pohly force-pushed the scheduler-in-flight-events-fix branch from 14ba163 to c131c92 Compare September 5, 2023 19:01

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2023

k8s-ci-robot requested a review from alculquicondor September 5, 2023 19:01

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 5, 2023

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 5, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2023

k8s-ci-robot merged commit a7f9e70 into kubernetes:master Sep 5, 2023

k8s-ci-robot added this to the v1.29 milestone Sep 5, 2023

pohly mentioned this pull request Sep 6, 2023

Automated cherry pick of #120413: scheduler: fix tracking of concurrent events #120445

Merged

k8s-ci-robot added a commit that referenced this pull request Sep 6, 2023

Merge pull request #120445 from pohly/automated-cherry-pick-of-#12041…

b77dd74

…3-origin-release-1.28 Automated cherry pick of #120413: scheduler: fix tracking of concurrent events

alculquicondor mentioned this pull request Oct 31, 2023

scheduler: handle in-flight Pods with less memory #120622

Closed

sanposhiho mentioned this pull request Jul 3, 2024

Improve lock utilization in scheduling queue to increase scheduling throughput while many gated pods #125692

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scheduler: fix tracking of concurrent events #120413

scheduler: fix tracking of concurrent events #120413

pohly commented Sep 4, 2023 •

edited by liggitt

Loading

k8s-ci-robot commented Sep 4, 2023

pohly left a comment

pohly Sep 4, 2023

pohly Sep 4, 2023

alculquicondor Sep 5, 2023

pohly Sep 4, 2023

pohly Sep 4, 2023

alculquicondor Sep 5, 2023

pohly Sep 4, 2023

alculquicondor Sep 5, 2023

pohly Sep 5, 2023

alculquicondor Sep 5, 2023

alculquicondor commented Sep 5, 2023

pohly commented Sep 5, 2023

alculquicondor Sep 5, 2023

alculquicondor Sep 5, 2023

pohly Sep 5, 2023

alculquicondor Sep 5, 2023

pohly Sep 5, 2023

alculquicondor Sep 5, 2023

alculquicondor Sep 5, 2023

pohly Sep 5, 2023

alculquicondor commented Sep 5, 2023

alculquicondor left a comment

alculquicondor Sep 5, 2023

alculquicondor Sep 5, 2023

alculquicondor Sep 5, 2023

k8s-ci-robot commented Sep 5, 2023

k8s-ci-robot commented Sep 5, 2023

pohly commented Sep 5, 2023

pohly commented Sep 5, 2023

alculquicondor commented Sep 5, 2023

k8s-ci-robot commented Sep 5, 2023

pohly commented Sep 6, 2023

		q.AddUnschedulableIfNotPresent(logger, q.newQueuedPodInfo(unschedulablePodInfo.Pod, "fooPlugin"), q.SchedulingCycle())
		expectInFlightPods(t, q, medPriorityPodInfo.Pod.UID, highPriorityPodInfo.Pod.UID, hpp1.UID)

	wantInFlightEvents := make([]interface{}, 0)
	var wantInFlightEvents []interface{}

	t.Fatalf("Unexpected content of inFlightPods (--- want, +++ have):\n%s", diff)
	t.Fatalf("Unexpected content of inFlightPods (-want, +got):\n%s", diff)

scheduler: fix tracking of concurrent events #120413

scheduler: fix tracking of concurrent events #120413

Conversation

pohly commented Sep 4, 2023 • edited by liggitt Loading

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

k8s-ci-robot commented Sep 4, 2023

pohly left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alculquicondor commented Sep 5, 2023

pohly commented Sep 5, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alculquicondor commented Sep 5, 2023

alculquicondor left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

k8s-ci-robot commented Sep 5, 2023

k8s-ci-robot commented Sep 5, 2023

pohly commented Sep 5, 2023

pohly commented Sep 5, 2023

alculquicondor commented Sep 5, 2023

k8s-ci-robot commented Sep 5, 2023

pohly commented Sep 6, 2023

pohly commented Sep 4, 2023 •

edited by liggitt

Loading