cnf-tests: dynamically created KubeletConfig #1583

zeeke
Member

@zeeke zeeke commented Jul 31, 2023

Make the NUMA/SR-IOV integration tests create their own PerformanceProfile that sets the single-numa-node policy and reserves an entire NUMA node as Isolated.

Add functions to manipulate node-role.kubernetes.io/x labels on nodes, so that the performance profile can be applied to arbitrary nodes.

cc @gregkopels

@openshift-ci openshift-ci bot requested review from aneeshkp and lack July 31, 2023 10:59
@openshift-ci
Contributor

openshift-ci bot commented Jul 31, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zeeke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2023
@zeeke
Member Author

zeeke commented Jul 31, 2023

/test ?

@openshift-ci
Contributor

openshift-ci bot commented Jul 31, 2023

@zeeke: The following commands are available to trigger required jobs:

  • /test ci
  • /test e2e-gcp-ovn
  • /test images

The following commands are available to trigger optional jobs:

  • /test e2e-aws-ran-profile

Use /test all to run all jobs.

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zeeke zeeke force-pushed the numa-sriov-dynamic-perfprofile branch 2 times, most recently from 0f296c3 to bea3895 Compare July 31, 2023 14:18
@zeeke zeeke force-pushed the numa-sriov-dynamic-perfprofile branch 4 times, most recently from e616fda to 6029de6 Compare July 31, 2023 15:27
@zeeke
Member Author

zeeke commented Aug 1, 2023

/hold

Reconfiguring a PerformanceProfile causes multiple node reboots that can lead to a job timeout
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/41815/rehearse-41815-periodic-ci-openshift-release-master-nightly-4.14-e2e-telco5g-cnftests/1686044202741272576

Need to find a different way to test these scenarios

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 1, 2023
@zeeke zeeke force-pushed the numa-sriov-dynamic-perfprofile branch 7 times, most recently from 41cf0d9 to 287969f Compare August 4, 2023 16:22
@zeeke zeeke changed the title cnf-tests: dynamically created PerformanceProfile cnf-tests: dynamically created KubeletConfig Aug 4, 2023
@zeeke
Member Author

zeeke commented Aug 7, 2023

/hold cancel

Using KubeletConfig instead of PerformanceProfile makes it easier to reboot the node only once, leading to a suite setup time of ~8 minutes.

This rehearsal job [1] shows the NUMA test passing:

  Aug  6 11:02:00.702: [INFO]: found mcd machine-config-daemon-j2cl4 for node cnfdu10
I0806 11:02:01.370117   18034 machineconfigpool.go:199] Created KubeletConfig test-sriov-numa
I0806 11:02:01.494985   18034 machineconfigpool.go:206] Created MachineConfigPool test-sriov-numa
I0806 11:02:01.495006   18034 machineconfigpool.go:208] Waiting for KubeletConfig to be rendered to MCP
I0806 11:02:21.666136   18034 machineconfigpool.go:221] Removed role[worker-cnf] from node cnfdu10
I0806 11:02:21.771728   18034 machineconfigpool.go:231] Added role[test-sriov-numa] to node cnfdu10
  STEP: Performance profile test-sriov-numa applied to cnfdu10 @ 08/06/23 11:09:55.058

(see [2] for full build-log.txt. The last line says "Performance profile ..." because of a leftover, fixed in a recent commit)

@gregkopels, @SchSeba Please take a look

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/41815/rehearse-41815-periodic-ci-openshift-release-master-nightly-4.14-e2e-telco5g-cnftests/1688107838901063680
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/41815/rehearse-41815-periodic-ci-openshift-release-master-nightly-4.14-e2e-telco5g-cnftests/1688107838901063680/artifacts/e2e-telco5g-cnftests/telco5g-cnf-tests/build-log.txt

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2023
@zeeke zeeke force-pushed the numa-sriov-dynamic-perfprofile branch 2 times, most recently from 67c7507 to 6b40d67 Compare August 10, 2023 09:44
@zeeke
Member Author

zeeke commented Aug 10, 2023

/retest

mcp, err := client.Client.MachineConfigPools().Get(context.Background(), name, metav1.GetOptions{})
if err != nil {
	klog.Warningf("Error while waiting for MachineConfigPool[%s] to be updated: %v", name, err)
	return false, nil
Contributor

Don't we expect getting the MCP to always succeed here?
I'm wondering why this returns nil rather than err.

Member Author

There can be an API server error or some other transient error. Here we return nil because the wait.Poll(...) function stops looping if the callback returns an error. It's different than the gomega Eventually loop.

Comment on lines +405 to +409
if label == "node-role.kubernetes.io/worker" {
	continue
Contributor

This if statement, which filters out the "worker" role, imposes a non-trivial limitation on this function.
Maybe make the function more generic by removing the statement and letting the caller handle the returned value as needed.
It's just an opinion. wdyt?

Member Author

@zeeke zeeke Aug 11, 2023

Good point, but node-role.kubernetes.io/worker is a special role: a node can have only one role other than worker, the so-called "custom role". If you put two or more custom roles on a node, the MachineConfigOperator stops managing that node and starts complaining about it. [1]

I would rename this function FindCustomRoleLabel(...) and add a comment explaining that.

[1] https://github.com/openshift/machine-config-operator/blob/3fb306d53f555ab6125d82cc790833f8a7bffa30/pkg/controller/node/node_controller.go#L623
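
As a rough illustration of the custom-role idea, the sketch below scans a node's labels and collects every node-role.kubernetes.io/ label except worker. It works on a plain map rather than a real Node object, and the function name `findCustomRoleLabels` and the sample labels are assumptions made for this example, not the code in this PR. It also does not treat any other role (for example master) specially.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const (
	roleLabelPrefix = "node-role.kubernetes.io/"
	workerRoleLabel = roleLabelPrefix + "worker"
)

// findCustomRoleLabels (hypothetical helper) returns every role label on
// the node except the worker one, sorted for deterministic output.
// More than one result means the node carries several custom roles.
func findCustomRoleLabels(labels map[string]string) []string {
	var custom []string
	for label := range labels {
		if !strings.HasPrefix(label, roleLabelPrefix) {
			continue // not a role label at all
		}
		if label == workerRoleLabel {
			continue // worker may coexist with one custom role
		}
		custom = append(custom, label)
	}
	sort.Strings(custom)
	return custom
}

func main() {
	// Sample labels, invented for illustration.
	nodeLabels := map[string]string{
		"kubernetes.io/hostname":              "cnfdu10",
		"node-role.kubernetes.io/worker":      "",
		"node-role.kubernetes.io/worker-cnf":  "",
	}
	fmt.Println(findCustomRoleLabels(nodeLabels))
}
```

A caller that wants to move a node to another pool could use the single returned label as the one to remove before adding the new role label.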

@zeeke zeeke force-pushed the numa-sriov-dynamic-perfprofile branch from 6b40d67 to a871c19 Compare August 11, 2023 10:15
@zeeke zeeke force-pushed the numa-sriov-dynamic-perfprofile branch from a871c19 to c359af3 Compare August 11, 2023 10:17
@liornoy
Contributor

liornoy commented Aug 11, 2023

Thank you for addressing the comments. Just a small nit; besides that, it LGTM.
(I have no reviewer permissions in this repo at the moment.)

@zeeke zeeke force-pushed the numa-sriov-dynamic-perfprofile branch from c359af3 to 1ca3ced Compare August 11, 2023 14:11
mcv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"

"github.com/openshift/cluster-node-tuning-operator/pkg/performanceprofile/controller/performanceprofile/components"

NIT - Maybe move up with other github.com/openshift-kni/ imports.

Member Author

sure! import list looks nicer, thanks

@gkopels

gkopels commented Aug 14, 2023

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2023

@gkopels: changing LGTM is restricted to collaborators

In response to this:

/lgtm


zeeke added 2 commits August 14, 2023 16:47
Make NUMA/SR-IOV integration tests create their own
KubeletConfig that sets the single-numa-node policy
and reserves an entire NUMA node for the system.

Add functions to manipulate `node-role.kubernetes.io/x` labels
on nodes to apply the performance profile to arbitrary nodes.

Signed-off-by: Andrea Panattoni <[email protected]>
@zeeke zeeke force-pushed the numa-sriov-dynamic-perfprofile branch from 1ca3ced to 7853e40 Compare August 14, 2023 14:47
@cgoncalves
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 15, 2023
@zeeke
Member Author

zeeke commented Aug 21, 2023

/retest

@zeeke
Member Author

zeeke commented Aug 21, 2023

Failing test

CNF Features e2e integration tests: [rfe_id:27368][performance] Network latency parameters adjusted by the Node Tuning Operator [test_id:28467][crit:high][vendor:[email protected]][level:acceptance] Should contain configuration injected through the openshift-node-performance profile 

doesn't belong to this PR.

/override ci/prow/e2e-gcp-ovn

@openshift-ci
Contributor

openshift-ci bot commented Aug 21, 2023

@zeeke: Overrode contexts on behalf of zeeke: ci/prow/e2e-gcp-ovn

