OCPBUGS-18676: ovnkube: set northd backoff-interval and use a single thread to save CPU #1990

Merged
merged 2 commits into from
Sep 12, 2023

Conversation

dcbw
Contributor

@dcbw dcbw commented Sep 7, 2023

northd has an option to sleep for a short amount of time after processing changes from NB/SB that allows it to trade off a bit of latency for a lot of CPU savings. Since events from NB come frequently during scale tests northd doesn't have a lot of time to sleep. Until we have more incremental processing, most of that CPU time is burned just recalculating things that haven't changed, so it's mostly wasted.

Letting northd sleep has been shown in density-light and density-cni 120 node scale tests to have almost no adverse effect on P99 PodReady times, but a huge improvement in CPU utilization.
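The trade-off described above can be illustrated with a toy model in shell. This is only a sketch under stated assumptions: `count_runs`, the 100 ms recompute time, and the 50 ms event spacing are invented for illustration. The model also assumes that the delay after each run equals the previous run time, capped by the configured interval, which may not match northd's actual scheduling.

```shell
#!/bin/sh
# Toy model (not northd code): count full recompute runs for a stream of events.
# Usage: count_runs CAP_MS RUN_MS EVENT_TIME_MS...
# Assumption: after each run the engine waits min(RUN_MS, CAP_MS) before it may
# start again; events that arrive before a run starts are absorbed by that run.
count_runs() {
  cap=$1; run_ms=$2; shift 2
  delay=$run_ms
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
  last_start=-1    # start time of the most recently scheduled run
  next_allowed=0   # earliest time the next run may start
  runs=0
  for t in "$@"; do
    # Event already covered by a run that has not started yet: no extra work.
    if [ "$t" -le "$last_start" ]; then continue; fi
    start=$t
    if [ "$start" -lt "$next_allowed" ]; then start=$next_allowed; fi
    runs=$((runs + 1))
    last_start=$start
    next_allowed=$((start + run_ms + delay))
  done
  echo "$runs"
}

# Hypothetical workload: one NB event every 50 ms for one second.
events=""
ts=0
while [ "$ts" -lt 1000 ]; do
  events="$events $ts"
  ts=$((ts + 50))
done

echo "backoff disabled (cap 0 ms): $(count_runs 0 100 $events) runs"
echo "backoff cap 300 ms:          $(count_runs 300 100 $events) runs"
```

In this model every run costs the same CPU no matter how many events it absorbs, so fewer runs means less CPU, at the price of some events waiting longer before they are processed.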

Upstream equivalent is ovn-kubernetes/ovn-kubernetes#3877

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 7, 2023
@dcbw
Contributor Author

dcbw commented Sep 7, 2023

From an NBDB container on the cluster:

sh-5.1# ovn-nbctl list NB_Global | grep backoff   
options             : {e2e_timestamp="1694097130", ipsec_encapsulation="false", mac_prefix="12:ff:3d", max_tunid="16711680", name=ci-op-qirq0bxi-b3a20-jmtrb-master-0, northd-backoff-interval-ms="300", northd_internal_version="23.06.1-20.27.2-70.6", northd_probe_interval="10000", svc_monitor_mac="b2:37:8d:4b:ce:00"}
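To check the value in a script rather than by eye, the `options` map can be split into key/value pairs. This is a minimal sketch: `get_nb_option` is a hypothetical helper, not part of OVN, and the sample line is copied from the output above so the snippet runs without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: read an "options : {k=v, k=v}" line on stdin and print
# the value stored under the given key, with surrounding quotes removed.
get_nb_option() {
  key=$1
  sed -e 's/^[^{]*{//' -e 's/}[^}]*$//' |  # keep only what is inside the braces
    tr ',' '\n' |                          # one key=value pair per line
    sed -e 's/^ *//' |                     # drop the space after each comma
    awk -F= -v k="$key" '$1 == k { gsub(/"/, "", $2); print $2 }'
}

# Sample taken from the NB_Global output above.
sample='options             : {e2e_timestamp="1694097130", ipsec_encapsulation="false", mac_prefix="12:ff:3d", max_tunid="16711680", name=ci-op-qirq0bxi-b3a20-jmtrb-master-0, northd-backoff-interval-ms="300", northd_internal_version="23.06.1-20.27.2-70.6", northd_probe_interval="10000", svc_monitor_mac="b2:37:8d:4b:ce:00"}'

printf '%s\n' "$sample" | get_nb_option northd-backoff-interval-ms

# On a cluster the same helper could be fed from the live database, e.g.:
#   ovn-nbctl list NB_Global | grep backoff | get_nb_option northd-backoff-interval-ms
```

On a live cluster, `ovn-nbctl get NB_Global . options:northd-backoff-interval-ms` should return the same value directly. The helper is mainly useful when only captured output is available.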

@dcbw
Contributor Author

dcbw commented Sep 7, 2023

/retest

AWS InsufficientInstanceCapacity

@dcbw dcbw changed the title ovnkube: set northd backoff-interval to save CPU OCPBUGS-18676: ovnkube: set northd backoff-interval to save CPU Sep 7, 2023
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Sep 7, 2023
@openshift-ci-robot
Contributor

@dcbw: This pull request references Jira Issue OCPBUGS-18676, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@trozet
Contributor

trozet commented Sep 7, 2023

/test

@openshift-ci
Contributor

openshift-ci bot commented Sep 7, 2023

@trozet: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test 4.14-upgrade-from-stable-4.13-images
  • /test e2e-aws-ovn-network-migration
  • /test e2e-aws-ovn-windows
  • /test e2e-aws-sdn-multi
  • /test e2e-aws-sdn-network-migration-rollback
  • /test e2e-aws-sdn-network-reverse-migration
  • /test e2e-gcp-ovn
  • /test e2e-gcp-sdn
  • /test e2e-hypershift-ovn
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-vsphere-ovn-windows
  • /test images
  • /test lint
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.14-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade
  • /test 4.14-upgrade-from-stable-4.13-e2e-azure-ovn-upgrade
  • /test 4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-upgrade
  • /test e2e-aws-hypershift-ovn-kubevirt
  • /test e2e-aws-ovn-local-to-shared-gateway-mode-migration
  • /test e2e-aws-ovn-serial
  • /test e2e-aws-ovn-shared-to-local-gateway-mode-migration-periodic
  • /test e2e-aws-ovn-single-node
  • /test e2e-aws-sdn-upgrade
  • /test e2e-azure-ovn
  • /test e2e-azure-ovn-dualstack
  • /test e2e-azure-ovn-manual-oidc
  • /test e2e-gcp-ovn-upgrade
  • /test e2e-metal-ipi-ovn-ipv6-ipsec
  • /test e2e-network-mtu-migration-ovn-ipv4
  • /test e2e-network-mtu-migration-ovn-ipv6
  • /test e2e-network-mtu-migration-sdn-ipv4
  • /test e2e-openstack-kuryr
  • /test e2e-openstack-ovn
  • /test e2e-openstack-sdn
  • /test e2e-ovn-hybrid-step-registry
  • /test e2e-ovn-ipsec-step-registry
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere-ovn
  • /test e2e-vsphere-ovn-dualstack
  • /test qe-perfscale-aws-ovn-cluster-density

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-cluster-network-operator-master-4.14-upgrade-from-stable-4.13-images
  • pull-ci-openshift-cluster-network-operator-master-e2e-aws-hypershift-ovn-kubevirt
  • pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-serial
  • pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-shared-to-local-gateway-mode-migration-periodic
  • pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-single-node
  • pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-windows
  • pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi
  • pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-network-migration-rollback
  • pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-network-reverse-migration
  • pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-upgrade
  • pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn
  • pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn
  • pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-cluster-network-operator-master-e2e-gcp-sdn
  • pull-ci-openshift-cluster-network-operator-master-e2e-hypershift-ovn
  • pull-ci-openshift-cluster-network-operator-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-cluster-network-operator-master-e2e-metal-ipi-ovn-ipv6-ipsec
  • pull-ci-openshift-cluster-network-operator-master-e2e-network-mtu-migration-ovn-ipv4
  • pull-ci-openshift-cluster-network-operator-master-e2e-network-mtu-migration-ovn-ipv6
  • pull-ci-openshift-cluster-network-operator-master-e2e-network-mtu-migration-sdn-ipv4
  • pull-ci-openshift-cluster-network-operator-master-e2e-openstack-ovn
  • pull-ci-openshift-cluster-network-operator-master-e2e-openstack-sdn
  • pull-ci-openshift-cluster-network-operator-master-e2e-ovn-hybrid-step-registry
  • pull-ci-openshift-cluster-network-operator-master-e2e-ovn-ipsec-step-registry
  • pull-ci-openshift-cluster-network-operator-master-e2e-ovn-step-registry
  • pull-ci-openshift-cluster-network-operator-master-e2e-vsphere-ovn
  • pull-ci-openshift-cluster-network-operator-master-e2e-vsphere-ovn-dualstack
  • pull-ci-openshift-cluster-network-operator-master-e2e-vsphere-ovn-windows
  • pull-ci-openshift-cluster-network-operator-master-images
  • pull-ci-openshift-cluster-network-operator-master-lint
  • pull-ci-openshift-cluster-network-operator-master-unit
  • pull-ci-openshift-cluster-network-operator-master-verify


@trozet
Contributor

trozet commented Sep 7, 2023

/test qe-perfscale-aws-ovn-cluster-density

@trozet
Contributor

trozet commented Sep 9, 2023

/test qe-perfscale-aws-ovn-cluster-density

@jtaleric

/test

@openshift-ci
Contributor

openshift-ci bot commented Sep 11, 2023

@jtaleric: The /test command needs one or more targets.

@trozet
Contributor

trozet commented Sep 11, 2023

/test qe-perfscale-aws-ovn-cluster-density

northd has an option to sleep for a short amount of time after
processing changes from NB/SB that allows it to trade off a bit
of latency for a lot of CPU savings. Since events from NB come
frequently during scale tests northd doesn't have a lot of time
to sleep. Until we have more incremental processing, most of that
CPU time is burned just recalculating things that haven't changed,
so it's mostly wasted.

Letting northd sleep has been shown in density-light and density-cni
120 node scale tests to have almost no adverse effect on P99 PodReady
times, but a huge improvement in CPU utilization.

Northd threading parallelizes the logical flow (lflow) building part
of the northd processing loop. While this speeds up northd processing
it does have a slight CPU cost (~20%) to map/reduce the work. Threading
improved latency when northd processed large numbers of logical flows
in centralized OVN clusters.

With IC each northd only handles a single node in the cluster and
thus processes fewer lflows. Scale testing indicates that the threading
tradeoff is no longer worth it; we achieve the same P99 PodReadyLatency
across multiple scenarios with 1 or 4 threads. We might as well save
the CPU if there is no longer any latency benefit.
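The threading decision above can be sketched as a small shell helper. The function names and the mode strings are hypothetical, and the value 4 for centralized deployments is an assumption taken from the "1 or 4 threads" comparison, not from the actual change. `--n-threads` is the ovn-northd option that sets the number of lflow-building threads.

```shell
#!/bin/sh
# Hypothetical helper: pick the lflow-build thread count for ovn-northd.
# "ic" means interconnect mode (one northd per node); anything else is
# treated as a centralized deployment.
northd_threads_for_mode() {
  case "$1" in
    ic) echo 1 ;;  # few lflows per northd, so the map/reduce overhead is not repaid
    *)  echo 4 ;;  # assumed value for centralized clusters with many lflows
  esac
}

# Print the command-line argument for the chosen mode.
northd_thread_arg() {
  printf '%s\n' "--n-threads=$(northd_threads_for_mode "$1")"
}

northd_thread_arg ic

# The real invocation is left commented out so the sketch runs anywhere:
#   exec ovn-northd "$(northd_thread_arg ic)" ...other flags elided...
```

Keeping the choice in one function makes it easy to revisit if later scale tests show a latency benefit from threading again.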
@trozet
Contributor

trozet commented Sep 11, 2023

Critical fix for perf/scale. From our testing, this cuts northd CPU in half while having no impact on pod latency.

/label acknowledge-critical-fixes-only
/lgtm

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Sep 11, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 11, 2023
@openshift-ci
Contributor

openshift-ci bot commented Sep 11, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcbw, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 692a344 and 2 for PR HEAD 703ccb5 in total

@dcbw dcbw changed the title OCPBUGS-18676: ovnkube: set northd backoff-interval to save CPU OCPBUGS-18676: ovnkube: set northd backoff-interval and use a single thread to save CPU Sep 11, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 619ceff and 1 for PR HEAD 703ccb5 in total

@dcbw
Contributor Author

dcbw commented Sep 12, 2023

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Sep 12, 2023

@dcbw: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-ovn-ipsec-step-registry 703ccb5 link false /test e2e-ovn-ipsec-step-registry
ci/prow/e2e-network-mtu-migration-sdn-ipv4 703ccb5 link false /test e2e-network-mtu-migration-sdn-ipv4
ci/prow/e2e-network-mtu-migration-ovn-ipv6 703ccb5 link false /test e2e-network-mtu-migration-ovn-ipv6
ci/prow/e2e-network-mtu-migration-ovn-ipv4 703ccb5 link false /test e2e-network-mtu-migration-ovn-ipv4
ci/prow/e2e-openstack-sdn 703ccb5 link false /test e2e-openstack-sdn
ci/prow/e2e-aws-hypershift-ovn-kubevirt 703ccb5 link false /test e2e-aws-hypershift-ovn-kubevirt
ci/prow/e2e-vsphere-ovn-dualstack 703ccb5 link false /test e2e-vsphere-ovn-dualstack
ci/prow/e2e-metal-ipi-ovn-ipv6-ipsec 703ccb5 link false /test e2e-metal-ipi-ovn-ipv6-ipsec
ci/prow/e2e-vsphere-ovn 703ccb5 link false /test e2e-vsphere-ovn
ci/prow/e2e-ovn-hybrid-step-registry 703ccb5 link false /test e2e-ovn-hybrid-step-registry


@dcbw
Contributor Author

dcbw commented Sep 12, 2023

/tide refresh

@openshift-merge-robot openshift-merge-robot merged commit 14c0445 into openshift:master Sep 12, 2023
@openshift-ci-robot
Contributor

@dcbw: Jira Issue OCPBUGS-18676: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-18676 has been moved to the MODIFIED state.


@dcbw
Contributor Author

dcbw commented Sep 12, 2023

/cherry-pick release-4.14

@openshift-cherrypick-robot

@dcbw: new pull request created: #1998



6 participants