Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-19074: Performance and efficiency improvements in daemon/server mode #181

Merged
merged 13 commits into from
Sep 18, 2023

Conversation

dcbw
Copy link
Contributor

@dcbw dcbw commented Sep 15, 2023

4.14 backport of #180

dcbw added 13 commits September 15, 2023 13:06
We want the in-cluster client that the multus server uses to use
the same client config (QPS, protobuf, grpc, etc) as the regular
client.

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit fff8519)
Multus is a pretty critical piece of infrastructure, so it shouldn't
be subject to the same lower QPS limits as most components are.

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit 752f28c)
Then we can just use the Server struct kube client and exec rather
than passing them through the function parameters.

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit 6b8d24c)
We'll need these for the next commit.

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit 7c68481)
Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit 1605ffc)
Move server start code to a common function that both regular
and test code can use. Also shut down the server from the
testcases.

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit cec1a53)
…access

When running in server mode we can use a shared informer to listen for
Pod events from the apiserver, and grab pod info from that cache rather
than doing direct apiserver requests each time.

This reduces apiserver load and retry latency, since multus can poll
the local cache more frequently than it should do direct apiserver
requests.

Oddly static pods don't show up in the informer by the timeout and
require a direct apiserver request. Since static pods are not common
and are typically long-running, it should not be a big issue to
fall back to direct apiserver access for them.

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit 50c0357)
A couple of the setup variables for NewManager*() are already in the
multus config that it gets passed, so use those instead of passing
explicitly.

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit 4ade856)
Simplify setup by moving the post-creation operations like
GenerateConfig() and PersistMultusConfig() into a new Start() function
that also begins watching the configuration directory. This better
encapsulates the manager functionality in the object.

We can also get rid of the done channel passed to the config
manager and just use the existing WaitGroup to determine when to
exit the daemon main().

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit 8539a47)
The test was comparing the same configuration to itself, since
nothing in the changed CNI configuration is used in the written
multus configuration.

Instead make sure the updated CNI config contains something
that will be reflected in the written multus configuration,
and while we're there use a more robust way to wait for the
config to be written via gomega.Eventually().

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit c2add82)
…dule

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit fb4f4aa)
Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit b0df7dd)
…Manager

For whatever reason calling os.Stat() on the readiness indicator file
from CmdAdd()/CmdDel() when multus is running in server mode and is
containerized often returns "file not found", which triggers the
polling behavior of GetReadinessIndicatorFile(). This greatly delays
CNI operations that should be pretty quick. Even if an exponential
backoff is used, os.Stat() can still return "file not found"
multiple times, even though the file clearly exists.

But it turns out we don't need to check the readiness file in server
mode when running with MultusConfigFile == "auto". In this mode the
server starts the ConfigManager which (a) waits until the file exists
and (b) fsnotify watches the readiness and (c) exits the daemon
immediately if the file is deleted or moved.

This means we can assume that while the daemon is running and the
server is handling CNI requests that the readiness file exists;
otherwise the daemon would have exited. Thus CmdAdd/CmdDel don't
need to run a lot of possibly failing os.Stat() calls in the CNI
hot paths.

Signed-off-by: Dan Williams <[email protected]>
(cherry picked from commit d9c06e9)
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 15, 2023
@openshift-ci-robot
Copy link
Contributor

@dcbw: This pull request references Jira Issue OCPBUGS-19074, which is invalid:

  • expected the bug to target the "4.15.0" version, but it targets "4.14.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

4.14 backport of #180

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dcbw dcbw changed the base branch from master to release-4.14 September 15, 2023 18:09
@openshift-ci-robot
Copy link
Contributor

@dcbw: This pull request references Jira Issue OCPBUGS-19074, which is invalid:

  • expected dependent Jira Issue OCPBUGS-18307 to be in one of the following states: MODIFIED, ON_QA, VERIFIED, but it is POST instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

4.14 backport of #180

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from fepan and s1061123 September 15, 2023 18:10
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 15, 2023
@dcbw
Copy link
Contributor Author

dcbw commented Sep 15, 2023

/jira refresh

@openshift-ci-robot
Copy link
Contributor

@dcbw: This pull request references Jira Issue OCPBUGS-19074, which is invalid:

  • expected dependent Jira Issue OCPBUGS-18307 to be in one of the following states: MODIFIED, ON_QA, VERIFIED, but it is POST instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dougbtv
Copy link
Member

dougbtv commented Sep 16, 2023

/lgtm

@dougbtv
Copy link
Member

dougbtv commented Sep 16, 2023

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Sep 16, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 16, 2023
@dougbtv
Copy link
Member

dougbtv commented Sep 16, 2023

/retest-required

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcbw, dougbtv

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dougbtv
Copy link
Member

dougbtv commented Sep 16, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 16, 2023
@openshift-ci-robot
Copy link
Contributor

@dougbtv: This pull request references Jira Issue OCPBUGS-19074, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-18307 is in the state ON_QA, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-18307 targets the "4.15.0" version, which is one of the valid target versions: 4.15.0
  • bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 16, 2023

@dcbw: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws 14bfa73 link true /test e2e-aws
ci/prow/e2e-aws-upgrade 14bfa73 link true /test e2e-aws-upgrade
ci/prow/images 14bfa73 link true /test images

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@weliang1
Copy link

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Sep 18, 2023
@openshift-merge-robot openshift-merge-robot merged commit 4b13f2d into openshift:release-4.14 Sep 18, 2023
@openshift-ci-robot
Copy link
Contributor

@dcbw: Jira Issue OCPBUGS-19074: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-19074 has been moved to the MODIFIED state.

In response to this:

4.14 backport of #180

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot
Copy link
Contributor

Fix included in accepted release 4.14.0-0.nightly-2023-09-19-201452

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.