
OCPBUGS-35911: E2E: Add test to verify runc process excludes the cpus used by pod. #1088

Open
wants to merge 1 commit into master from runc_cpu_isolation

Conversation

SargunNarula
Contributor

@SargunNarula SargunNarula commented Jun 21, 2024

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Original bug link - https://bugzilla.redhat.com/show_bug.cgi?id=1910386

@openshift-ci-robot
Contributor

@SargunNarula: This pull request explicitly references no jira issue.

In response to this:

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 21, 2024
@openshift-ci openshift-ci bot requested review from jmencak and swatisehgal June 21, 2024 10:08
@SargunNarula SargunNarula changed the title NO-JIRA: E2E: Add test to verify runc uses valid cpus OCPBUGS-35911: E2E: Add test to verify runc uses valid cpus Jun 21, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 21, 2024
@openshift-ci-robot
Contributor

@SargunNarula: This pull request references Jira Issue OCPBUGS-35911, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Contributor

@SargunNarula: This pull request references Jira Issue OCPBUGS-35911, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Original bug link - https://bugzilla.redhat.com/show_bug.cgi?id=1910386

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@SargunNarula SargunNarula changed the title OCPBUGS-35911: E2E: Add test to verify runc uses valid cpus OCPBUGS-35911: E2E: Add test to verify runc process excludes the cpus used by pod. Jun 21, 2024
Contributor

@shajmakh shajmakh left a comment


Thanks for this. I added a few initial comments below.

@SargunNarula SargunNarula force-pushed the runc_cpu_isolation branch 2 times, most recently from 83fa93a to c6d89e5 Compare August 16, 2024 11:31
@mrniranjan
Contributor

Looks good to me from my side.

@SargunNarula SargunNarula force-pushed the runc_cpu_isolation branch 2 times, most recently from 02e4f19 to 73a257b Compare September 24, 2024 11:46
Contributor

@ffromani ffromani left a comment


The testing logic could maybe be simplified, but no major objections, it seems.
Questions and possible improvements are inside.

@SargunNarula SargunNarula force-pushed the runc_cpu_isolation branch 2 times, most recently from 083325f to 10d3c35 Compare September 27, 2024 12:02
Contributor

@ffromani ffromani left a comment


I'm not sure this test checks the correct thing. We do check that a BE pod has no overlap with CPUs exclusively assigned to a guaranteed pod, but the problem here is not what happens at runtime, but what happened at pod creation time. Once the pod is running, runc has terminated, and there is no trace of where it ran.

@SargunNarula
Contributor Author

@ffromani The original issue identified was that when launching a guaranteed pod running a cyclic test, the runc container creation process was observed to be running on isolated CPUs. This process inadvertently utilized the CPUs allocated to the cyclic test.

The resolution involved ensuring that the cpuset.cpus configuration is passed during container creation.

Additionally, since runc follows a two-step creation process, the initialization process (executed as /usr/bin/pod, which is a symlink to /usr/bin/runc) is started within a container. This container is assigned the cpuset.cpus values. This behavior can be confirmed by examining the config.json of the initialization container to verify that the appropriate CPU allocation is applied: reserved CPUs in the case of a guaranteed pod, or all available CPUs in the case of a best-effort (BE) pod.
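For illustration, a minimal sketch of how such a config.json check could look in Go (the field names follow the OCI runtime spec; the cpuset package choice and the helper name are assumptions, not the PR's actual code):

```go
package main

import (
	"encoding/json"
	"fmt"

	"k8s.io/utils/cpuset"
)

// ociConfig models only the OCI runtime-spec fields needed to read the cpuset;
// it is an illustrative type, not the ContainerConfig used in the PR.
type ociConfig struct {
	Linux struct {
		Resources struct {
			CPU struct {
				Cpus string `json:"cpus"`
			} `json:"cpu"`
		} `json:"resources"`
	} `json:"linux"`
}

// cpusFromConfigJSON extracts linux.resources.cpu.cpus from a raw config.json.
func cpusFromConfigJSON(raw []byte) (cpuset.CPUSet, error) {
	var cfg ociConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cpuset.CPUSet{}, err
	}
	return cpuset.Parse(cfg.Linux.Resources.CPU.Cpus)
}

func main() {
	raw := []byte(`{"linux":{"resources":{"cpu":{"cpus":"0-1"}}}}`)
	cpus, err := cpusFromConfigJSON(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cpus.String()) // prints "0-1"
}
```

In the real test, the raw bytes would come from the init container's config.json on the node, and the parsed set would be compared against reserved or isolated CPUs depending on the pod's QoS class.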

Reference:

Based on these observations, the current patch may not effectively validate this scenario. I will work on a revised patch to accurately verify the CPUs being utilized.

@SargunNarula SargunNarula force-pushed the runc_cpu_isolation branch 2 times, most recently from 6d925ba to 4e72640 Compare December 9, 2024 12:16
@SargunNarula
Contributor Author

/retest

Comment on lines +923 to +927
var guaranteedPodCpus, guaranteedInitPodCpus cpuset.CPUSet
var bestEffortPodCpus, bestEffortInitPodCpus cpuset.CPUSet
Contributor

do we ever use or need init containers?

Contributor Author

Init containers here refer to the runc container used to initialize and deploy the main pod container.
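For context, a hedged sketch of the kind of assertion these cpusets feed into (Gomega style as in the surrounding suite; it assumes the k8s.io/utils/cpuset package and that the variables were populated from the containers' config.json, so it may not match the PR's exact code):

```go
// The runc init container that creates a guaranteed pod should not overlap
// the CPUs exclusively assigned to that pod.
overlap := guaranteedInitPodCpus.Intersection(guaranteedPodCpus)
Expect(overlap.IsEmpty()).To(BeTrue(),
	"runc init container CPUs %q overlap guaranteed pod CPUs %q",
	guaranteedInitPodCpus.String(), guaranteedPodCpus.String())
```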

return &config, nil
}

func getConfigJsonInfo(pod *corev1.Pod, containerName string, workerRTNode *corev1.Node) []*ContainerConfig {
Contributor

I think this is the best we can do, but at the same time, AFAIK this is not sufficient to ensure that runc (or crun) never actually runs on any isolated CPU.
If, IIRC, the goal is to test that crun (or runc) never runs on isolated CPUs, then this test helps but is not sufficient.

Contributor Author

One approach to testing is to verify whether runc/crun isolates itself from the isolated CPUs by following the method defined in the OCI specification, which involves generating a config.json and adhering to the specified CPUs based on the applied profile.

Alternatively, a tracing tool like bpftrace can be used to monitor all runc calls and inspect their CPU affinity. However, this method is quite invasive and may be challenging to implement IMO.

Contributor

One approach to testing is to verify whether runc/crun isolates itself from the isolated CPUs by following the method defined in the OCI specification, which involves generating a config.json and adhering to the specified CPUs based on the applied profile.

Well, does the process which writes config.json run only on reserved CPUs? And does crun (or runc) honor the settings in config.json when setting its own affinity? IOW, who decides and who enforces that runc (or crun) only runs on reserved CPUs?

Your test seems fine for the workload, but there are open questions for the infra proper.

Alternatively, a tracing tool like bpftrace can be used to monitor all runc calls and inspect their CPU affinity. However, this method is quite invasive and may be challenging to implement IMO.

I agree, but the problem here is first and foremost testing the right thing.

Contributor Author

@ffromani Do you have any other way in mind?

Contributor Author

To ensure the check is robust, I have added another Expect to verify whether it honors the CPUs provided through the profile. You can review the change in the latest commit, which addresses your concerns.
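Roughly, the added assertion can be imagined along these lines (a sketch only; the reserved-CPU field path from the performance profile and the variable names are assumptions and may differ from the commit):

```go
// Reserved CPUs as defined in the applied performance profile
// (field path assumed from the performance profile API).
reservedCPUs, err := cpuset.Parse(string(*profile.Spec.CPU.Reserved))
Expect(err).ToNot(HaveOccurred())

// The runc init container's cpuset (read from its config.json) should be
// confined to the reserved CPUs.
Expect(guaranteedInitPodCpus.Equals(reservedCPUs)).To(BeTrue(),
	"runc init container CPUs %q do not match reserved CPUs %q",
	guaranteedInitPodCpus.String(), reservedCPUs.String())
```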

Contributor

thanks, I'll defer the decision to @yanirq and @Tal-or

Contributor

openshift-ci bot commented Dec 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SargunNarula
Once this PR has been reviewed and has the lgtm label, please assign marsik for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Signed-off-by: Sargun Narula <[email protected]>

Updated CPU checking per container; runc provides a config.json for each type of pod,
but runc's own container always uses reserved CPUs.

Signed-off-by: Sargun Narula <[email protected]>
Contributor

openshift-ci bot commented Dec 19, 2024

@SargunNarula: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/okd-scos-e2e-aws-ovn
Commit: 143f831
Details: link
Required: false
Rerun command: /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.