
OCPBUGS-35911: E2E: Add test to verify runc process excludes the cpus used by pod. #1088

Open · SargunNarula wants to merge 1 commit into master from runc_cpu_isolation

Conversation

@SargunNarula SargunNarula (Contributor) commented Jun 21, 2024

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Original bug link - https://bugzilla.redhat.com/show_bug.cgi?id=1910386

@openshift-ci-robot (Contributor) commented:
@SargunNarula: This pull request explicitly references no jira issue.

In response to this:

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 21, 2024
@openshift-ci openshift-ci bot requested review from jmencak and swatisehgal June 21, 2024 10:08
@SargunNarula SargunNarula changed the title from "NO-JIRA: E2E: Add test to verify runc uses valid cpus" to "OCPBUGS-35911: E2E: Add test to verify runc uses valid cpus" on Jun 21, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 21, 2024
@openshift-ci-robot (Contributor) commented:
@SargunNarula: This pull request references Jira Issue OCPBUGS-35911, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.


@openshift-ci-robot (Contributor) commented:
@SargunNarula: This pull request references Jira Issue OCPBUGS-35911, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Original bug link - https://bugzilla.redhat.com/show_bug.cgi?id=1910386


@SargunNarula SargunNarula changed the title from "OCPBUGS-35911: E2E: Add test to verify runc uses valid cpus" to "OCPBUGS-35911: E2E: Add test to verify runc process excludes the cpus used by pod." on Jun 21, 2024
@shajmakh shajmakh (Contributor) left a comment

Thanks for this. I added a few initial comments below.

@SargunNarula SargunNarula force-pushed the runc_cpu_isolation branch 2 times, most recently from 83fa93a to c6d89e5 on August 16, 2024, 11:31
@mrniranjan (Contributor) commented:

Looks good to me from my side.

@SargunNarula SargunNarula force-pushed the runc_cpu_isolation branch 2 times, most recently from 02e4f19 to 73a257b on September 24, 2024, 11:46
@ffromani ffromani (Contributor) left a comment

The testing logic could perhaps be simplified, but there are no major objections; questions and possible improvements are noted inline.

@SargunNarula SargunNarula force-pushed the runc_cpu_isolation branch 2 times, most recently from 083325f to 10d3c35 on September 27, 2024, 12:02
@ffromani ffromani (Contributor) left a comment

I'm not sure this test checks the correct thing. We do check that a Best-Effort pod has no overlap with CPUs exclusively assigned to a Guaranteed pod, but the problem here is not what happens at runtime; it's what happened at pod creation time. Once the pod is running, runc has terminated, and there is no trace of where it ran.

@SargunNarula (Contributor, Author) commented:
@ffromani The original issue identified was that when launching a guaranteed pod running a cyclic test, the runc container creation process was observed to be running on isolated CPUs. This process inadvertently utilized the CPUs allocated to the cyclic test.

The resolution involved ensuring that the cpuset.cpus configuration is passed during container creation.

Additionally, since runc follows a two-step creation process, the initialization process (executed as /usr/bin/pod, a symlink to /usr/bin/runc) is started within its own container, and that container is assigned the cpuset.cpus values. This behavior can be confirmed by examining the config.json of the initialization container to verify that the appropriate CPU allocation is applied: reserved CPUs in the case of a guaranteed pod, or all available CPUs in the case of a Best-Effort (BE) pod.

Reference:

Based on these observations, the current patch may not effectively validate this scenario. I will work on a revised patch to accurately verify the CPUs being utilized.
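
As an illustration of the config.json check described above, here is a minimal sketch (not the PR's actual helper; the struct and function names are hypothetical) that parses the cpuset assigned to a container, assuming the config.json contents have already been fetched from the worker node. The field path follows the OCI runtime spec (linux.resources.cpu.cpus):

```go
package main

import (
	"encoding/json"
	"fmt"

	"k8s.io/utils/cpuset"
)

// ociConfig captures only the field needed here: linux.resources.cpu.cpus
// from an OCI runtime config.json.
type ociConfig struct {
	Linux struct {
		Resources struct {
			CPU struct {
				Cpus string `json:"cpus"`
			} `json:"cpu"`
		} `json:"resources"`
	} `json:"linux"`
}

// cpusFromConfigJSON returns the cpuset assigned to a container, parsed from
// the raw contents of its config.json.
func cpusFromConfigJSON(data []byte) (cpuset.CPUSet, error) {
	var cfg ociConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cpuset.CPUSet{}, err
	}
	return cpuset.Parse(cfg.Linux.Resources.CPU.Cpus)
}

func main() {
	// Example fragment; in the real test the file would be read from the node.
	raw := []byte(`{"linux":{"resources":{"cpu":{"cpus":"0-1,52-53"}}}}`)
	cpus, err := cpusFromConfigJSON(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("container cpuset:", cpus.String())
}
```

Per the description above, the expectation for a guaranteed pod is that the main container's set contains its exclusive CPUs while the runc init ("pod") container's set stays within the reserved CPUs.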

@SargunNarula SargunNarula force-pushed the runc_cpu_isolation branch 2 times, most recently from 6d925ba to 4e72640 on December 9, 2024, 12:16
@SargunNarula (Contributor, Author) commented:
/retest

Comment on lines +923 to +927
// cpusets observed for the main and runc init ("pod") containers of the guaranteed and best-effort test pods
var guaranteedPodCpus, guaranteedInitPodCpus cpuset.CPUSet
var bestEffortPodCpus, bestEffortInitPodCpus cpuset.CPUSet
Contributor

do we ever use or need init containers?

Contributor Author

Init containers here refer to the runc container used to initialize and deploy the main pod container.

return &config, nil
}

func getConfigJsonInfo(pod *corev1.Pod, containerName string, workerRTNode *corev1.Node) []*ContainerConfig {
Contributor

I think this is the best we can do, but at the same time, AFAIK this is not sufficient to ensure that runc (or crun) never actually runs on any isolated CPU. If, as IIRC, the goal is to test that crun (or runc) never runs on an isolated CPU, then this test helps but is not sufficient.

Contributor Author

One approach to testing is to verify whether runc/crun isolates itself from the isolated CPUs by following the method defined in the OCI specification, which involves generating a config.json and adhering to the specified CPUs based on the applied profile.

Alternatively, a tracing tool like bpftrace can be used to monitor all runc calls and inspect their CPU affinity. However, this method is quite invasive and may be challenging to implement IMO.

Contributor

One approach to testing is to verify whether runc/crun isolates itself from the isolated CPUs by following the method defined in the OCI specification, which involves generating a config.json and adhering to the specified CPUs based on the applied profile.

Well, does the process which writes config.json run only on reserved CPUs? And does crun (or runc) honor the settings on config.json when setting its own affinity? IOW, who decides and who enforces that runc (or crun) only runs on reserved CPUs?

Your test seems fine for the workload side, but there are open questions for the infra proper.

Alternatively, a tracing tool like bpftrace can be used to monitor all runc calls and inspect their CPU affinity. However, this method is quite invasive and may be challenging to implement IMO.

I agree, but the problem here is first and foremost testing the right thing.

Contributor Author

@ffromani Do you have any other way in mind?

Contributor Author

To ensure the check is robust, I have added another Expect to verify that it honors the CPUs provided through the profile. The latest commit addresses your concerns.
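
As a rough sketch of the kind of assertions being discussed (not the PR's code; the helper and variable names are hypothetical, and the cpusets are assumed to have been collected from config.json and the performance profile):

```go
package example

import (
	. "github.com/onsi/gomega"

	"k8s.io/utils/cpuset"
)

// expectInitContainerOnReservedCpus sketches the two checks discussed here:
// the runc init ("pod") container must stay within the reserved CPUs, and it
// must not overlap the CPUs exclusively assigned to the guaranteed pod.
func expectInitContainerOnReservedCpus(initCpus, exclusiveCpus, reservedCpus cpuset.CPUSet) {
	Expect(initCpus.IsSubsetOf(reservedCpus)).To(BeTrue(),
		"init container cpuset %s is not within reserved CPUs %s",
		initCpus.String(), reservedCpus.String())

	Expect(initCpus.Intersection(exclusiveCpus).IsEmpty()).To(BeTrue(),
		"init container cpuset %s overlaps exclusive CPUs %s",
		initCpus.String(), exclusiveCpus.String())
}
```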

Contributor

thanks, I'll defer the decision to @yanirq and @Tal-or

@Tal-or Tal-or (Contributor) commented Dec 23, 2024

@SargunNarula would it be correct to generalize the problem as whether the host's processes respect the isolated CPUs when those CPUs are allocated exclusively to a Guaranteed pod?

If the answer is yes, maybe we can create a short shell script that runs and prints its associated CPUs; then we can check whether it also has access to CPUs that are already allocated to a Guaranteed pod.

Another suggestion I had in mind is to examine whether we can add a test for this in the crun/runc repository. Maybe the test context there would be more suitable for this kind of verification.
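
A minimal sketch of that idea, written in Go (the language of this test suite) rather than shell: a hypothetical standalone checker that prints its own allowed CPUs from /proc/self/status and fails if they overlap a set passed on the command line (for example, the CPUs exclusively allocated to a Guaranteed pod):

```go
// A hypothetical checker: reads its own allowed CPUs and fails on any overlap
// with the exclusive cpuset given as the first argument.
package main

import (
	"fmt"
	"os"
	"strings"

	"k8s.io/utils/cpuset"
)

// allowedCpus returns the cpuset this process may run on, taken from the
// Cpus_allowed_list field of /proc/self/status.
func allowedCpus() (cpuset.CPUSet, error) {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		return cpuset.CPUSet{}, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "Cpus_allowed_list:") {
			return cpuset.Parse(strings.TrimSpace(strings.TrimPrefix(line, "Cpus_allowed_list:")))
		}
	}
	return cpuset.CPUSet{}, fmt.Errorf("Cpus_allowed_list not found in /proc/self/status")
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: cpucheck <exclusive-cpus, e.g. 2-5,34-37>")
		os.Exit(2)
	}
	exclusive, err := cpuset.Parse(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	mine, err := allowedCpus()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("allowed CPUs:", mine.String())
	// Fail if this process is allowed to run on any exclusively allocated CPU.
	if !mine.Intersection(exclusive).IsEmpty() {
		fmt.Fprintln(os.Stderr, "overlap with exclusive CPUs:", mine.Intersection(exclusive).String())
		os.Exit(1)
	}
}
```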

Contributor Author

@Tal-or Yes, the problem statement is correct. However, I cannot definitively state which other host processes we might consider for this purpose.

Create a short shell script that runs and prints its associated CPUs

The concept of using a shell script was addressed in the original verification process, where we employed a wrapper around the runc process to capture and log its associated CPUs to a separate file. The challenge, however, was that despite replacing the binary with the wrapper, capturing the CPUs remained difficult due to the extremely short-lived nature of the runc process. The original test logic was designed to address that testing approach, as detailed in this link.
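
For context, a minimal sketch of what such a wrapper could look like (again in Go rather than shell; this is not the wrapper that was actually used, and the renamed-binary path /usr/bin/runc.real and the log path are assumptions):

```go
package main

import (
	"os"
	"os/exec"
	"strings"
)

// A hypothetical wrapper installed in place of runc: it logs the CPUs this
// short-lived invocation is allowed to run on, then hands off to the real
// binary, assumed to have been renamed to /usr/bin/runc.real.
func main() {
	if status, err := os.ReadFile("/proc/self/status"); err == nil {
		for _, line := range strings.Split(string(status), "\n") {
			if strings.HasPrefix(line, "Cpus_allowed_list:") {
				if f, ferr := os.OpenFile("/var/log/runc-cpus.log",
					os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644); ferr == nil {
					f.WriteString(line + "\n")
					f.Close()
				}
			}
		}
	}

	// Forward the original arguments and propagate the exit code.
	cmd := exec.Command("/usr/bin/runc.real", os.Args[1:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		if ee, ok := err.(*exec.ExitError); ok {
			os.Exit(ee.ExitCode())
		}
		os.Exit(1)
	}
}
```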

Add a test for that in the crun/runc repository

For runc, it might be feasible to design a test. However, verifying its behavior against the CPU ranges defined by a performance profile would be a challenge there, which is where this test could be valuable.

For crun, this test would not be applicable due to its distinct workflow. Unlike runc, crun does not rely on init containers. Instead, its workflow involves the following steps:

  • Creating a temporary bind mount of the current executable.
  • Remounting it as read-only.
  • Opening the remounted file.
  • Unmounting and deleting the temporary file.
  • Using the file descriptor to execute the program.

This difference in implementation makes this testing approach unsuitable for crun.

Contributor

the challenge, however, was that despite replacing the binary with the wrapper, capturing the CPUs remained difficult due to the extremely short-lived nature of the runc process.

If we're creating our own shell script for test purposes, we can make it run as long as needed, and the script will print the CPUs. The script doesn't have to be related to runc/crun; we only want to make sure it doesn't violate the isolated-CPU isolation.

as detailed in this link.

The link is a Go file; not sure if that's what you meant.

For runc, it might be feasible to design a test. However, verifying its behavior against the CPU ranges defined by a performance profile would be a challenge there, which is where this test could be valuable.

For crun, this test would not be applicable due to its distinct workflow. Unlike runc, crun does not rely on init containers. Instead, its workflow involves the following steps:

  • Creating a temporary bind mount of the current executable.
  • Remounting it as read-only.
  • Opening the remounted file.
  • Unmounting and deleting the temporary file.
  • Using the file descriptor to execute the program.

This difference in implementation makes this testing approach unsuitable for crun.

Ok too complex, I get it.

Contributor Author

If we aim to check the CPU usage of the shell script, we could extend its runtime. However, this would deviate from the primary objective, which is to verify whether the runc process is violating CPU isolation.

In the future, this might need to be extended to other processes, but as of now, that is not a requirement. The original bug specifically pertains to the runc process utilizing isolated CPUs. Bug link

The Go file contains a single test for this bug, utilizing the shell script approach. Link

openshift-ci bot commented Dec 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SargunNarula
Once this PR has been reviewed and has the lgtm label, please assign marsik for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Signed-off-by: Sargun Narula <[email protected]>

Updated the CPU check to be per container: runc provides a config.json for each type of pod, but runc's own container always uses reserved CPUs.

Signed-off-by: Sargun Narula <[email protected]>
openshift-ci bot commented Dec 19, 2024

@SargunNarula: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/okd-scos-e2e-aws-ovn
Commit: 143f831
Required: false
Rerun command: /test okd-scos-e2e-aws-ovn


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
