scheduler: fix TestIncomingPodsMetrics unit test #120434

Merged on Sep 18, 2023 (1 commit)

Conversation

Contributor

@pohly pohly commented Sep 5, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

addUnschedulablePodBackToBackoffQ happened to put the pod into the backoff queue because

  • the pod was not popped earlier and thus not in flight
  • the PodInfo had UnschedulablePlugins set
  • determineSchedulingHintForInFlightPod has code for "if UnschedulablePlugins is set and pod not in flight -> internal error, use backoff"

Relying on such special code is not good. A better way to force backoff is by recording some concurrent event. isPodWorthRequeuing then calls the queueHintReturnQueueAfterBackoff function and the pod goes to the backoff queue.
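
The two routes into the backoff queue described above can be sketched with a toy model. This is not the kube-scheduler code: `pod`, `decideQueue`, and the returned queue names are hypothetical stand-ins, and the real decision in determineSchedulingHintForInFlightPod and isPodWorthRequeuing takes more inputs than shown here.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for framework.QueuedPodInfo.
type pod struct {
	name                 string
	unschedulablePlugins map[string]struct{}
}

// decideQueue is a simplified model of the behavior described in the PR
// description, not the real scheduler logic.
func decideQueue(p pod, inFlight, concurrentEvent bool) string {
	if !inFlight && len(p.unschedulablePlugins) > 0 {
		// Internal-error fallback: the path the old test hit by accident.
		return "backoff"
	}
	if concurrentEvent {
		// An event was recorded while the pod was in flight, so the pod
		// is retried after backoff instead of being parked.
		return "backoff"
	}
	return "unschedulable"
}

func main() {
	p := pod{
		name:                 "test-pod-1",
		unschedulablePlugins: map[string]struct{}{"fooPlugin": {}},
	}
	fmt.Println(decideQueue(p, false, false)) // accidental path: backoff
	fmt.Println(decideQueue(p, true, true))   // intended path: backoff
	fmt.Println(decideQueue(p, true, false))  // no event: unschedulable
}
```

The point of the change is that a test should reach "backoff" through the second branch, which models normal operation, rather than through the first, which models an error path.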

Which issue(s) this PR fixes:

Related-to: #120413 (comment)

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 5, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Sep 5, 2023
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 5, 2023
@alculquicondor
Member

/cc

@pohly pohly force-pushed the scheduler-backoff-metric-test branch 2 times, most recently from 7e2fb1e to 7a5000e Compare September 6, 2023 07:27
@pohly
Contributor Author

pohly commented Sep 6, 2023

/retest

 addUnschedulablePodBackToUnschedulablePods = func(logger klog.Logger, queue *PriorityQueue, pInfo *framework.QueuedPodInfo) {
         // To simulate the pod is failed in scheduling in the real world, Pop() the pod from activeQ before AddUnschedulableIfNotPresent() below.
-        queue.activeQ.Add(queue.newQueuedPodInfo(pInfo.Pod))
-        if p, err := queue.Pop(); err != nil || p.Pod != pInfo.Pod {
+        expectNoError("Add", queue.activeQ.Add(queue.newQueuedPodInfo(pInfo.Pod)))
Member

Contributor Author

I would have preferred that myself. But t is not passed down into these operations. I could change that, but then the diff becomes larger.

Contributor Author

Or did you mean not having an expectNoError function (DAMP)?

I find expectNoError more readable than an if check. But if that's just me, then I can change to if ... t.Fatal.

Contributor Author

Both done (using t.Fatal and removal of expectNoError).

Member

Thanks! More importantly I wanted you to use t.Fatal.
But I also don't like adding "assert" libraries.
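
A minimal sketch of the pattern settled on in this thread: pass the test handle into each operation and fail with t.Fatalf instead of panicking. The `fataler` interface, `fakeQueue`, `addAndPop`, and `recordingT` are hypothetical and exist only so the example runs outside `go test`; in a real test the parameter would simply be `*testing.T`.

```go
package main

import (
	"errors"
	"fmt"
)

// fataler is the subset of *testing.T used here (hypothetical name).
type fataler interface {
	Helper()
	Fatalf(format string, args ...any)
}

// fakeQueue is a toy FIFO, not the scheduler's PriorityQueue.
type fakeQueue struct{ items []string }

func (q *fakeQueue) Add(name string) error {
	q.items = append(q.items, name)
	return nil
}

func (q *fakeQueue) Pop() (string, error) {
	if len(q.items) == 0 {
		return "", errors.New("queue is empty")
	}
	name := q.items[0]
	q.items = q.items[1:]
	return name, nil
}

// addAndPop reports failures through t, so they are attributed to the
// calling test instead of crashing the whole test binary via panic.
func addAndPop(t fataler, q *fakeQueue, want string) {
	t.Helper()
	if err := q.Add(want); err != nil {
		t.Fatalf("Unexpected error during Add: %v", err)
	}
	got, err := q.Pop()
	if err != nil {
		t.Fatalf("Unexpected error during Pop: %v", err)
	}
	if got != want {
		t.Fatalf("Expected: %v after Pop, but got: %v", want, got)
	}
}

// recordingT stands in for *testing.T. Unlike the real Fatalf, it only
// records the message and does not stop the calling goroutine.
type recordingT struct{ failures []string }

func (r *recordingT) Helper() {}

func (r *recordingT) Fatalf(format string, args ...any) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

func main() {
	t := &recordingT{}
	addAndPop(t, &fakeQueue{}, "test-pod-1")
	fmt.Println("failures:", len(t.failures))
}
```

The cost mentioned earlier in the thread is visible in the signature: every operation gains a `t` parameter, which is why the diff grew.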

        expectNoError("Add", queue.activeQ.Add(queue.newQueuedPodInfo(pInfo.Pod)))
        p, err := queue.Pop()
        expectNoError("Pop", err)
        if p.Pod != pInfo.Pod {
                panic(fmt.Sprintf("Expected: %v after Pop, but got: %v", pInfo.Pod.Name, p.Pod.Name))
Member

uhm... this one slipped, please use t.Fatal

Contributor Author

Done.

        if err != nil {
                panic(fmt.Sprintf("Unexpected error from %s: %v", what, err))
        }
}
addUnschedulablePodBackToUnschedulablePods = func(logger klog.Logger, queue *PriorityQueue, pInfo *framework.QueuedPodInfo) {
Member

let's change the function name to: popAndRequeueAsUnschedulable. Similarly for the other one

Contributor Author

Done.

}

// A concurrent event forces the pod into the backoff queue instead of the unschedulable queue.
queue.MoveAllToActiveOrBackoffQueue(logger, NodeAdd, nil, nil, nil)
Member

I don't like the fact that, to understand what each test case is doing, I have to go back and read this function.
I would prefer if we would just have:

operations: []{pop, moveAll, addUnschedulable, ...}

But up to you if you want to do it or leave it for someone else to follow up.
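
A sketch of the table-driven shape suggested above, where each test case spells out its steps. Everything here (`toyQueue`, `operation`, `pop`, `moveAll`, `addUnschedulable`, `run`) is hypothetical and only mirrors the names used in the comment; the real test would operate on *PriorityQueue and take *testing.T.

```go
package main

import "fmt"

// toyQueue tracks which internal queue each pod name is in.
type toyQueue struct {
	active, unschedulable, inFlight []string
}

// operation is one named step of a test case.
type operation func(q *toyQueue)

var (
	// pop moves the head of the active queue into the in-flight set.
	pop operation = func(q *toyQueue) {
		if len(q.active) == 0 {
			return
		}
		q.inFlight = append(q.inFlight, q.active[0])
		q.active = q.active[1:]
	}
	// addUnschedulable parks every in-flight pod as unschedulable.
	addUnschedulable operation = func(q *toyQueue) {
		q.unschedulable = append(q.unschedulable, q.inFlight...)
		q.inFlight = nil
	}
	// moveAll moves every unschedulable pod back to the active queue.
	moveAll operation = func(q *toyQueue) {
		q.active = append(q.active, q.unschedulable...)
		q.unschedulable = nil
	}
)

// run applies the operations in order to a queue seeded with initial.
func run(initial []string, ops []operation) *toyQueue {
	q := &toyQueue{active: append([]string(nil), initial...)}
	for _, op := range ops {
		op(q)
	}
	return q
}

func main() {
	// The test case reads as a script: no helper has to be opened to see
	// what happens to the pod.
	q := run([]string{"test-pod-1"}, []operation{pop, addUnschedulable, moveAll})
	fmt.Println(len(q.active), len(q.unschedulable), len(q.inFlight)) // 1 0 0
}
```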

Contributor Author

Someone else, please 😓

@pohly pohly force-pushed the scheduler-backoff-metric-test branch from 7a5000e to 4ffc9a4 Compare September 11, 2023 11:12
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 11, 2023
@pohly
Contributor Author

pohly commented Sep 11, 2023

/retest

Member

@alculquicondor alculquicondor left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 7ad32faa35c49b86862d918ca068b3f5f9b2c9c1

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2023
@pacoxu
Member

pacoxu commented Sep 12, 2023

/test pull-kubernetes-node-e2e-containerd

@pohly
Contributor Author

pohly commented Sep 12, 2023

The tests pass in the PR branch, but not when merged into master. The cause is 0d3eafd: with that commit applied, some pods in the unit tests no longer go to the expected queues:

   --- FAIL: TestIncomingPodsMetrics/add_pods_to_unschedulablePods_and_then_move_all_to_activeQ (0.10s)
        scheduling_queue.go:617: I0912 07:53:19.010475] Checking events for in-flight pod pod="ns1/test-pod-1" unschedulablePlugins={} inFlightEventsSize=1 inFlightPodsSize=1
        scheduling_queue.go:752: I0912 07:53:19.010547] Pod moved to an internal scheduling queue pod="ns1/test-pod-1" event="ScheduleAttemptFailure" queue="Backoff" schedulingCycle=1 hint="QueueAfterBackoff"
        scheduling_queue.go:617: I0912 07:53:19.010586] Checking events for in-flight pod pod="ns2/test-pod-2" unschedulablePlugins={} inFlightEventsSize=1 inFlightPodsSize=1
        scheduling_queue.go:752: I0912 07:53:19.010620] Pod moved to an internal scheduling queue pod="ns2/test-pod-2" event="ScheduleAttemptFailure" queue="Backoff" schedulingCycle=1 hint="QueueAfterBackoff"
        scheduling_queue.go:617: I0912 07:53:19.010652] Checking events for in-flight pod pod="ns3/test-pod-3" unschedulablePlugins={} inFlightEventsSize=1 inFlightPodsSize=1
        scheduling_queue.go:752: I0912 07:53:19.010681] Pod moved to an internal scheduling queue pod="ns3/test-pod-3" event="ScheduleAttemptFailure" queue="Backoff" schedulingCycle=1 hint="QueueAfterBackoff"
        scheduling_queue_test.go:2939: unexpected collecting result:
            
            
            Diff:
            --- metric output does not match expectation; want
            +++ got:
            @@ -2,4 +2,3 @@
             # TYPE scheduler_queue_incoming_pods_total counter
            -scheduler_queue_incoming_pods_total{event="ScheduleAttemptFailure",queue="unschedulable"} 3
            -scheduler_queue_incoming_pods_total{event="UnschedulableTimeout",queue="active"} 3
            +scheduler_queue_incoming_pods_total{event="ScheduleAttemptFailure",queue="backoff"} 3

I'll rebase and check whether the unit tests made some invalid assumptions about how the queue works.
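
The mismatch in the log above comes down to which (event, queue) label pairs the counter was incremented for. A toy version of such a labeled counter, assuming nothing about the real metrics code (`incomingPods`, `inc`, and `render` are hypothetical; only the metric name and label values are taken from the log):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labels is one (event, queue) combination.
type labels struct{ event, queue string }

// incomingPods is a toy labeled counter, not the scheduler's metric type.
type incomingPods map[labels]int

func (c incomingPods) inc(event, queue string) { c[labels{event, queue}]++ }

// render prints one line per label pair in a stable order so that two
// counters can be compared as text.
func (c incomingPods) render() string {
	keys := make([]labels, 0, len(c))
	for k := range c {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].event != keys[j].event {
			return keys[i].event < keys[j].event
		}
		return keys[i].queue < keys[j].queue
	})
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "scheduler_queue_incoming_pods_total{event=%q,queue=%q} %d\n",
			k.event, k.queue, c[k])
	}
	return b.String()
}

func main() {
	want, got := incomingPods{}, incomingPods{}
	for i := 0; i < 3; i++ {
		// Expected by the test: park as unschedulable, then move to active.
		want.inc("ScheduleAttemptFailure", "unschedulable")
		want.inc("UnschedulableTimeout", "active")
		// Observed after the merge: the pods went straight to backoff.
		got.inc("ScheduleAttemptFailure", "backoff")
	}
	fmt.Print("want:\n", want.render(), "got:\n", got.render())
}
```

Because the pods never reached the unschedulable queue, the later move to the active queue had nothing to count, which is why two expected lines are missing and one new line appears.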

@pohly pohly force-pushed the scheduler-backoff-metric-test branch from 4ffc9a4 to 819edda Compare September 12, 2023 06:39
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 12, 2023
@pohly
Contributor Author

pohly commented Sep 12, 2023

The relevant difference is that pods with no unschedulable plugins now go into the backoff queue. To simulate putting a pod into the unschedulable queue, pInfo.UnschedulablePlugins must be set. Also, a concurrent event is no longer needed when the backoff queue is the intended target:

diff --git a/pkg/scheduler/internal/queue/scheduling_queue_test.go b/pkg/scheduler/internal/queue/scheduling_queue_test.go
index 13a9a89bb02..b17e5a3c3cc 100644
--- a/pkg/scheduler/internal/queue/scheduling_queue_test.go
+++ b/pkg/scheduler/internal/queue/scheduling_queue_test.go
@@ -2288,6 +2288,8 @@ var (
        }
        popAndRequeueAsUnschedulable = func(t *testing.T, logger klog.Logger, queue *PriorityQueue, pInfo *framework.QueuedPodInfo) {
                // To simulate the pod is failed in scheduling in the real world, Pop() the pod from activeQ before AddUnschedulableIfNotPresent() below.
+               // UnschedulablePlugins will get cleared by Pop, so make a copy first.
+               unschedulablePlugins := pInfo.UnschedulablePlugins.Clone()
                if err := queue.activeQ.Add(queue.newQueuedPodInfo(pInfo.Pod)); err != nil {
                        t.Fatalf("Unexpected error during Add: %v", err)
                }
@@ -2298,6 +2300,8 @@ var (
                if p.Pod != pInfo.Pod {
                        t.Fatalf("Expected: %v after Pop, but got: %v", pInfo.Pod.Name, p.Pod.Name)
                }
+               // Simulate plugins that are waiting for some events.
+               p.UnschedulablePlugins = unschedulablePlugins
                if err := queue.AddUnschedulableIfNotPresent(logger, p, 1); err != nil {
                        t.Fatalf("Unexpected error during AddUnschedulableIfNotPresent: %v", err)
                }
@@ -2314,10 +2318,7 @@ var (
                if p.Pod != pInfo.Pod {
                        t.Fatalf("Expected: %v after Pop, but got: %v", pInfo.Pod.Name, p.Pod.Name)
                }
-
-               // A concurrent event forces the pod into the backoff queue instead of the unschedulable queue.
-               queue.MoveAllToActiveOrBackoffQueue(logger, NodeAdd, nil, nil, nil)
-
+               // When there is no known unschedulable plugin, pods always go to the backoff queue.
                if err := queue.AddUnschedulableIfNotPresent(logger, p, 1); err != nil {
                        t.Fatalf("Unexpected error during AddUnschedulableIfNotPresent: %v", err)
                }
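
The reason for the Clone call in the diff above can be shown with a toy model. It assumes, as the added code comment states, that Pop clears the plugin set; `pluginSet`, `podInfo`, `pop`, and `requeue` are hypothetical names, and clearing the set in place is an assumption of this sketch rather than a statement about the real Pop.

```go
package main

import "fmt"

// pluginSet is a toy string set standing in for UnschedulablePlugins.
type pluginSet map[string]struct{}

// clone returns an independent copy of the set.
func (s pluginSet) clone() pluginSet {
	out := make(pluginSet, len(s))
	for k := range s {
		out[k] = struct{}{}
	}
	return out
}

// podInfo is a hypothetical stand-in for framework.QueuedPodInfo.
type podInfo struct {
	name                 string
	unschedulablePlugins pluginSet
}

// pop models Pop resetting per-attempt state (assumption: cleared in place).
func pop(p *podInfo) *podInfo {
	for k := range p.unschedulablePlugins {
		delete(p.unschedulablePlugins, k)
	}
	return p
}

// requeue models the rule quoted above: with no known unschedulable
// plugin the pod goes to backoff, otherwise to the unschedulable queue.
func requeue(p *podInfo) string {
	if len(p.unschedulablePlugins) == 0 {
		return "backoff"
	}
	return "unschedulable"
}

func main() {
	p := &podInfo{name: "test-pod-1", unschedulablePlugins: pluginSet{"fooPlugin": {}}}
	saved := p.unschedulablePlugins.clone() // copy before Pop wipes it

	popped := pop(p)
	fmt.Println(requeue(popped)) // backoff: the set was cleared

	popped.unschedulablePlugins = saved // simulate plugins waiting for events
	fmt.Println(requeue(popped))        // unschedulable
}
```

Without the copy, a saved reference to the original set would be emptied along with it, and the pod could not be steered into the unschedulable queue.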

@alculquicondor
Member

/approve cancel
I'll leave this review to @sanposhiho

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 12, 2023
@sanposhiho
Member

/assign

Member

@sanposhiho sanposhiho left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 18, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: eeb1b2270bc6d52f11ae720e7eb784398d93b2e8

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 18, 2023
@k8s-ci-robot
Contributor

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                  Commit    Details   Required   Rerun command
pull-kubernetes-e2e-kind   819edda   link      unknown    /test pull-kubernetes-e2e-kind
pull-kubernetes-e2e-gce    819edda   link      unknown    /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@pohly
Contributor Author

pohly commented Sep 18, 2023

/retest

@k8s-ci-robot k8s-ci-robot merged commit 3cfdf3c into kubernetes:master Sep 18, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Sep 18, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

5 participants