Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: fix tracking of concurrent events #120413

Merged

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Sep 4, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

The previous approach was based on the assumption that an in-flight pod can use the head of the received event list as marker for identifying all events that occur while the pod is in flight. That assumption is incorrect: when that existing element gets removed from the list because all pods that were in-flight when it was received are done, that marker's Next method returns nil and the code which should have seen several concurrent events (if there were any) missed all of those.

Which issue(s) this PR fixes:

A pod with concurrent events could incorrectly get moved to the unschedulable queue where it could got stuck until the next periodic purging after 5 minutes if there was no other event for it.

Special notes for your reviewer:

The approach with maintaining a single list of concurrent events can be fixed by inserting each in-flight pod into the list and using that element to identify "more recent" events for the pod.

This affects Kubernetes >= 1.28. We need to backport the fix.

Does this PR introduce a user-facing change?

Fix a 1.28 regression in scheduler: a pod with concurrent events could incorrectly get moved to the unschedulable queue where it could got stuck until the next periodic purging after 5 minutes if there was no other event for it.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 4, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 4, 2023
Copy link
Contributor Author

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logger.Error(nil, "In flight Pod isn't found in the scheduling queue. If you see this error log, it's likely a bug in the scheduler.", "pod", klog.KObj(pInfo.Pod))
return framework.QueueAfterBackoff
}
if p.inFlightEvents.Len() > len(p.inFlightPods) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am replicating the previous if len(pInfo.UnschedulablePlugins) == 0 shortcut here, but only if absolutely necessary (= when pod is not in flight).

I am not exactly sure whether there are legitimate reasons for having it. One unit test depends on it, but perhaps that test is broken (see comment below).

// Construct a Pod, but don't associate its scheduler failure to any plugin
hpp1 := highPriorityPodInfo.Pod.DeepCopy()
hpp1.Name = "hpp1"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having two pods with the same UID breaks the inFlightPod tracking of pods. Somehow it worked before, but not anymore with the revised implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. We shouldn't be testing that.

q.activeQ.Add(q.newQueuedPodInfo(hpp1))
if p, err := q.Pop(); err != nil || p.Pod != hpp1 {
t.Errorf("Expected: %v after Pop, but got: %v", hpp1, p.Pod.Name)
}
expectInFlightPods(t, q, hpp1.UID)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used expectInFlightPods to figure out where the test flow went astray when it still used two pods with the same UID. I kept it because it provides some additional test coverage for both inFlightPods and inFlightEvents and because it makes the test easier to read.

q.AddUnschedulableIfNotPresent(logger, q.newQueuedPodInfo(unschedulablePodInfo.Pod, "fooPlugin"), q.SchedulingCycle())
expectInFlightPods(t, q, medPriorityPodInfo.Pod.UID, highPriorityPodInfo.Pod.UID, hpp1.UID)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about the line above with q.newQueuedPodInfo: creating a new PodInfo instance feels wrong.

The test should more realistic and use the podInfo instance that it got from Pop, just like a real scheduling attempt.

I kept it because I don't want to rock the boat too much in one PR, but this might be worth a follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg

// This can happen while updating a pod. In that case pInfo.UnschedulablePlugins should
// be empty. If it is not, we may have a problem.
if len(pInfo.UnschedulablePlugins) != 0 {
logger.Error(nil, "In flight Pod isn't found in the scheduling queue. If you see this error log, it's likely a bug in the scheduler.", "pod", klog.KObj(pInfo.Pod))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code coverage and dlv show that this line is hit when running tests, for example:

  2669:				for _, op := range test.operations {
  2670:					for _, pInfo := range pInfos {
=>2671:						op(logger, queue, pInfo)
  2672:					}
  2673:				}

This is another instance of "pInfo is not the result of Pop". Doesn't look easy to fix, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line being hit consistently? Do you have the test case for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code hits it reliably:

That's addUnschedulablePodBackToBackoffQ called for a fabricated QueuedPodInfo, without making sure that "inFlightPods" is set up.

Current master also hits the error log message. This seems to be the only test which does, so it might be easier to fix than I thought.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see. Yeah, we should make sure to add a Pop before calling unschedulable. I remember Kensei added a lot of those in several tests. These probably slipped.

@alculquicondor
Copy link
Member

That assumption is incorrect: when that existing element gets removed from the list because all pods that were in-flight when it was received are done

But it only gets removed once that event no longer matters. Or is the calculation for "does it still matter" wrong?

@pohly
Copy link
Contributor Author

pohly commented Sep 5, 2023

But it only gets removed once that event no longer matters. Or is the calculation for "does it still matter" wrong?

The current definition of "does it still matter" is "the pods that where in flight when the event was received are still in flight". That's what the PodInfo counter is for.

What happens is this:

  • pod1 popped
  • pod2 popped
  • event1 received -> stored as list head with count == 2
  • pod3 popped, using that list head as previousEvent
  • pod1 done -> counter == 1
  • pod2 done -> counter == 0 -> element gets removed, which means that its Next always returns nil
  • event2 received and relevant for in-flight pod3, gets stored as new list head with count == 1
  • pod3 done -> previousEvent.Next returns nil and thus event2 is not getting checked

// This can happen while updating a pod. In that case pInfo.UnschedulablePlugins should
// be empty. If it is not, we may have a problem.
if len(pInfo.UnschedulablePlugins) != 0 {
logger.Error(nil, "In flight Pod isn't found in the scheduling queue. If you see this error log, it's likely a bug in the scheduler.", "pod", klog.KObj(pInfo.Pod))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line being hit consistently? Do you have the test case for it?

}
return clusterEvents
return values
}

func Test_InFlightPods(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you add a test that would have failed before this change?

I think one scenario is

  1. pop Pod-A
  2. event X
  3. pop Pod-B
  4. event Y
  5. requeue pod-A
  6. requeue pod-B (should be hinted according to event Y)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added one now. It's a bit more complicated, see second commit message:

To make it work without assumptions about the implementation, the QueuedPodInfo
returned by Pop must be the one passed to AddUnschedulableIfNotPresent
after (potentially) populating UnschedulablePlugins. This is done via callback
functions which bind to the same shared variable.

t.Errorf("Unexpected diff in inFlightPods (-want, +got):\n%s", diff)
}

if diff := cmp.Diff(listToClusterEvents(test.wantReceivedEvents), listToClusterEvents(q.receivedEvents), cmp.AllowUnexported(clusterEvent{})); diff != "" {
t.Errorf("Unexpected diff in receivedEvents (-want, +got):\n%s", diff)
wantInFlightEvents := make([]interface{}, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
wantInFlightEvents := make([]interface{}, 0)
var wantInFlightEvents []interface{}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I need to make the same change in listToValues and remove the pre-allocation, otherwise I end up with nil != []interface{}` differences in some cases - done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmpopts.EquateEmpty

}
sortUIDs := cmpopts.SortSlices(func(a, b types.UID) bool { return a < b })
if diff := cmp.Diff(uids, actualUIDs, sortUIDs); diff != "" {
t.Fatalf("Unexpected content of inFlightPods (--- want, +++ have):\n%s", diff)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Fatalf("Unexpected content of inFlightPods (--- want, +++ have):\n%s", diff)
t.Fatalf("Unexpected content of inFlightPods (-want, +got):\n%s", diff)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@alculquicondor
Copy link
Member

pod2 done -> counter == 0 -> element gets removed, which means that its Next always returns nil

I think the bug was on the fact that we assumed that Next would still work (point to the previous next). In any case, the new approach looks good.

However, let's try to get into why the pod UID is sometimes not found in inFlightPods

The previous approach was based on the assumption that an in-flight pod can use
the head of the received event list as marker for identifying all events that
occur while the pod is in flight. That assumption is incorrect: when that
existing element gets removed from the list because all pods that were
in-flight when it was received are done, that marker's Next method returns nil
and the code which should have seen several concurrent events (if there were
any) missed all of those.

As a result, a pod with concurrent events could incorrectly get moved to the
unschedulable queue where it could got stuck until the next periodic purging
after 5 minutes if there was no other event for it.

The approach with maintaining a single list of concurrent events can be fixed
by inserting each in-flight pod into the list and using that element to
identify "more recent" events for the pod.
Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
/lgtm

wantActiveQPodNames: []string{pod2.Name},
wantInFlightPods: nil,
wantInFlightEvents: nil,
queueingHintMap: QueueingHintMapPerProfile{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we didn't do this in other tests, but we should be listing outputs (wantX) last.

// Construct a Pod, but don't associate its scheduler failure to any plugin
hpp1 := highPriorityPodInfo.Pod.DeepCopy()
hpp1.Name = "hpp1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. We shouldn't be testing that.

q.AddUnschedulableIfNotPresent(logger, q.newQueuedPodInfo(unschedulablePodInfo.Pod, "fooPlugin"), q.SchedulingCycle())
expectInFlightPods(t, q, medPriorityPodInfo.Pod.UID, highPriorityPodInfo.Pod.UID, hpp1.UID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: bd8b719532b631f6f7bda4f1314bbc7b0bdbfbf7

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 5, 2023
The problematic scenario was having one pod in flight, one event in the list,
and then detecting a concurrent event for a second pod after the first pod is
done. The new test case covers that.

To make it work without assumptions about the implementation, the QueuedPodInfo
returned by Pop must be the one passed to AddUnschedulableIfNotPresent
after (potentially) populating UnschedulablePlugins. This is done via callback
functions which bind to the same shared variable.
@pohly pohly force-pushed the scheduler-in-flight-events-fix branch from 14ba163 to c131c92 Compare September 5, 2023 19:01
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2023
@pohly
Copy link
Contributor Author

pohly commented Sep 5, 2023

/hold

pull-kubernetes-linter-hints is right, my new code lacks errors checking:

ERROR: pkg/scheduler/internal/queue/scheduling_queue_test.go:508:36: Error return value of `q.AddUnschedulableIfNotPresent` is not checked (errcheck)
ERROR: 					q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle())
ERROR: 					                              ^
ERROR: pkg/scheduler/internal/queue/scheduling_queue_test.go:513:36: Error return value of `q.AddUnschedulableIfNotPresent` is not checked (errcheck)
ERROR: 					q.AddUnschedulableIfNotPresent(logger, poppedPod2, q.SchedulingCycle())

Fixed: https://github.com/kubernetes/kubernetes/compare/14ba163bd05d79db4315fea34cff9c367ff19159..c131c92b9f36f7fbd2b3625c3e1d0573932007b2

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 5, 2023
@pohly
Copy link
Contributor Author

pohly commented Sep 5, 2023

/hold cancel

@alculquicondor: needs another LGTM.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 5, 2023
@alculquicondor
Copy link
Member

/lgtm
Please prepare the cherry-pick

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: f222a7c29149f1a37c581888213d26e5850e1087

@pohly
Copy link
Contributor Author

pohly commented Sep 6, 2023

Cherry-pick is in #120445

k8s-ci-robot added a commit that referenced this pull request Sep 6, 2023
…3-origin-release-1.28

Automated cherry pick of #120413: scheduler: fix tracking of concurrent events
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants