Optimize a request to find device plugin pods. #599
Conversation
Thanks for your PR.
To skip the vendor CIs, use one of: /test-all
@alan-kut
At what number of nodes did you hit this issue?
LGTM.
FieldSelector: "spec.nodeName=" + vars.NodeName, | ||
LabelSelector: "app=sriov-device-plugin", | ||
FieldSelector: "spec.nodeName=" + vars.NodeName, | ||
ResourceVersion: "0", |
As we just care about the existence of such a pod, I believe we are OK with specifying this.
Some additional reading on the semantics of the ResourceVersion field for other reviewers [1].
[1] https://kubernetes.io/docs/reference/using-api/api-concepts/#semantics-for-get-and-list
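For other reviewers, here is a minimal client-go sketch of what a list with these options could look like. The package name, function name, clientset argument, and namespace handling are illustrative assumptions, not code from this PR:

```go
package podcheck

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// hasDevicePluginPod reports whether a device plugin pod exists on nodeName.
// ResourceVersion "0" lets the kube-apiserver answer from its watch cache
// instead of issuing a quorum read against etcd; the result may be slightly
// stale, which is fine for a pure existence check.
func hasDevicePluginPod(ctx context.Context, c kubernetes.Interface, namespace, nodeName string) (bool, error) {
	pods, err := c.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector:   "app=sriov-device-plugin",
		FieldSelector:   "spec.nodeName=" + nodeName,
		ResourceVersion: "0",
	})
	if err != nil {
		return false, err
	}
	return len(pods.Items) > 0, nil
}
```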
@adrianchiris thanks for linking the documentation.
I hit the issue with 4k nodes running the plugin.
The failures are in the controllers, not the daemon.
LGTM
Add resourceVersion=0 to the request that finds device plugin pods on a given node, so it is served from the latest cache of the kube-apiserver. Without a resourceVersion, the request has to reach etcd first and the kube-apiserver then filters out the pods, which can overload etcd. Since the lookup runs in a DaemonSet, the requests are issued from every node; in a cluster with many nodes these list requests can bring the cluster down.
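To make the mechanics concrete, the selectors and the resource version travel as ordinary query parameters, so the same call can be expressed as a raw GET against the pods endpoint. A minimal sketch, assuming a client-go clientset and hypothetical package, function, namespace, and node name:

```go
package podcheck

import (
	"context"

	"k8s.io/client-go/kubernetes"
)

// listDevicePluginPodsRaw issues the same list as a raw GET so the query
// parameters are visible: labelSelector, fieldSelector, and resourceVersion
// are standard Kubernetes API parameters. Without resourceVersion the
// kube-apiserver does a quorum read from etcd and filters the result; with
// resourceVersion=0 it may answer from its watch cache and never hit etcd.
func listDevicePluginPodsRaw(ctx context.Context, c kubernetes.Interface, namespace, nodeName string) ([]byte, error) {
	return c.CoreV1().RESTClient().
		Get().
		Resource("pods").
		Namespace(namespace).
		Param("labelSelector", "app=sriov-device-plugin").
		Param("fieldSelector", "spec.nodeName="+nodeName).
		Param("resourceVersion", "0").
		DoRaw(ctx)
}
```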
Thanks for your PR.
To skip the vendor CIs, use one of: /test-all
Rebased.
Pull Request Test Coverage Report for Build 7741878100
💛 - Coveralls