-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-19074: Performance and efficiency improvements in daemon/server mode #181
OCPBUGS-19074: Performance and efficiency improvements in daemon/server mode #181
Conversation
We want the in-cluster client that the multus server uses to use the same client config (QPS, protobuf, grpc, etc) as the regular client. Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit fff8519)
Multus is a pretty critical piece of infrastructure, so it shouldn't be subject to the same lower QPS limits as most components are. Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit 752f28c)
Then we can just use the Server struct kube client and exec rather than passing them through the function parameters. Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit 6b8d24c)
We'll need these for the next commit. Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit 7c68481)
Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit 1605ffc)
Move server start code to a common function that both regular and test code can use. Also shut down the server from the testcases. Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit cec1a53)
…access When running in server mode we can use a shared informer to listen for Pod events from the apiserver, and grab pod info from that cache rather than doing direct apiserver requests each time. This reduces apiserver load and retry latency, since multus can poll the local cache more frequently than it should do direct apiserver requests. Oddly static pods don't show up in the informer by the timeout and require a direct apiserver request. Since static pods are not common and are typically long-running, it should not be a big issue to fall back to direct apiserver access for them. Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit 50c0357)
A couple of the setup variables for NewManager*() are already in the multus config that it gets passed, so use those instead of passing explicitly. Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit 4ade856)
Simplify setup by moving the post-creation operations like GenerateConfig() and PersistMultusConfig() into a new Start() function that also begins watching the configuration directory. This better encapsulates the manager functionality in the object. We can also get rid of the done channel passed to the config manager and just use the existing WaitGroup to determine when to exit the daemon main(). Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit 8539a47)
The test was comparing the same configuration to itself, since nothing in the changed CNI configuration is used in the written multus configuration. Instead make sure the updated CNI config contains something that will be reflected in the written multus configuration, and while we're there use a more robust way to wait for the config to be written via gomega.Eventually(). Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit c2add82)
…dule Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit fb4f4aa)
Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit b0df7dd)
…Manager For whatever reason calling os.Stat() on the readiness indicator file from CmdAdd()/CmdDel() when multus is running in server mode and is containerized often returns "file not found", which triggers the polling behavior of GetReadinessIndicatorFile(). This greatly delays CNI operations that should be pretty quick. Even if an exponential backoff is used, os.Stat() can still return "file not found" multiple times, even though the file clearly exists. But it turns out we don't need to check the readiness file in server mode when running with MultusConfigFile == "auto". In this mode the server starts the ConfigManager which (a) waits until the file exists and (b) fsnotify watches the readiness and (c) exits the daemon immediately if the file is deleted or moved. This means we can assume that while the daemon is running and the server is handling CNI requests that the readiness file exists; otherwise the daemon would have exited. Thus CmdAdd/CmdDel don't need to run a lot of possibly failing os.Stat() calls in the CNI hot paths. Signed-off-by: Dan Williams <[email protected]> (cherry picked from commit d9c06e9)
@dcbw: This pull request references Jira Issue OCPBUGS-19074, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@dcbw: This pull request references Jira Issue OCPBUGS-19074, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/jira refresh |
@dcbw: This pull request references Jira Issue OCPBUGS-19074, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
/label backport-risk-assessed |
/retest-required |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dcbw, dougbtv The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/jira refresh |
@dougbtv: This pull request references Jira Issue OCPBUGS-19074, which is valid. The bug has been moved to the POST state. 6 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@dcbw: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/label cherry-pick-approved |
@dcbw: Jira Issue OCPBUGS-19074: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-19074 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Fix included in accepted release 4.14.0-0.nightly-2023-09-19-201452 |
4.14 backport of #180