fix ulimit issue #18721

Merged
merged 1 commit into from
May 31, 2023

Conversation

Cydox
Contributor

@Cydox Cydox commented May 29, 2023

Fixes #18714

Does this PR introduce a user-facing change?

Fixed an issue where lowering the ulimit -u after a container was created caused the container to fail to launch.

This touches a lot of code I'm not familiar with, so I'm playing it slow. Starting by adding tests that are supposed to fail on current main.

@openshift-ci openshift-ci bot added release-note do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels May 29, 2023
@Cydox
Contributor Author

Cydox commented May 29, 2023

Looks like the test is "failing successfully"

@Cydox Cydox force-pushed the fix-ulimit-pr branch 2 times, most recently from 63f2459 to b282074 Compare May 29, 2023 03:54
@Cydox
Contributor Author

Cydox commented May 29, 2023

I pushed the fix now.

There appear to be two different failures here.

  1. The added Container inspect WIP test still fails for the rootful configuration. (example)
  2. For the rootless configurations, the limits test is failing. (example)

@Cydox
Contributor Author

Cydox commented May 29, 2023

For 2. (limits test): This test works on my system.

Is there a simple way to get the VM image used in the CI to test locally? The script in /hack/get_ci_vm.sh is asking me for GCP credentials (is it trying to provision a VM on GCP?).

Member

@Luap99 Luap99 left a comment

@giuseppe PTAL

@@ -135,38 +133,6 @@ func (s *SpecGenerator) Validate() error {
// default:
// return errors.New("unrecognized option for cgroups; supported are 'default', 'disabled', 'no-conmon'")
// }
invalidUlimitFormatError := errors.New("invalid default ulimit definition must be form of type=soft:hard")
// set ulimits if not rootless
if len(s.ContainerResourceConfig.Rlimits) < 1 && !rootless.IsRootless() {
Member

Why did you remove this? This sets the limit based on containers.conf defaults so we need to keep it. Although doing this in Validate() looks wrong.

Contributor Author

I'm under the impression that it's already set by

s.Rlimits, err = GenRlimits(c.Ulimit)

Doing a quick test, it looks like a ulimit from containers.conf still makes it into a container, so this was redundant. However, it's in there twice right now, so I think the containers.conf defaults also make it to:

rlimits, err := specgenutil.GenRlimits(rtc.Ulimits())

Contributor Author

The issue with the limits being in the list twice is fixed.
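
For readers following this thread, the error string in the hunk above names the expected form type=soft:hard. The following is a minimal sketch of parsing that form, not the code podman uses: the Rlimit type and the parseUlimit helper are hypothetical names introduced for illustration, and the sketch assumes both values are plain non-negative integers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Rlimit is a hypothetical stand-in for one rlimit entry in a runtime spec.
type Rlimit struct {
	Type string // for example "RLIMIT_NOFILE"
	Soft uint64
	Hard uint64
}

// parseUlimit parses a string of the form type=soft:hard, such as
// "nofile=1024:2048". Illustration only: it rejects anything else.
func parseUlimit(s string) (Rlimit, error) {
	name, values, ok := strings.Cut(s, "=")
	if !ok || name == "" {
		return Rlimit{}, fmt.Errorf("invalid ulimit %q: must be of the form type=soft:hard", s)
	}
	softStr, hardStr, ok := strings.Cut(values, ":")
	if !ok {
		return Rlimit{}, fmt.Errorf("invalid ulimit %q: must be of the form type=soft:hard", s)
	}
	soft, err := strconv.ParseUint(softStr, 10, 64)
	if err != nil {
		return Rlimit{}, fmt.Errorf("invalid soft limit in %q: %w", s, err)
	}
	hard, err := strconv.ParseUint(hardStr, 10, 64)
	if err != nil {
		return Rlimit{}, fmt.Errorf("invalid hard limit in %q: %w", s, err)
	}
	// A soft limit can never exceed its hard limit.
	if soft > hard {
		return Rlimit{}, fmt.Errorf("invalid ulimit %q: soft limit exceeds hard limit", s)
	}
	return Rlimit{Type: "RLIMIT_" + strings.ToUpper(name), Soft: soft, Hard: hard}, nil
}

func main() {
	rl, err := parseUlimit("nofile=1024:2048")
	fmt.Println(rl, err)
}
```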

// If not explicitly overridden by the user, default number of open
// files and number of processes to the maximum they can be set to
// (without overriding a sysctl)
if !nofileSet {
Member

I am pretty sure we actually want this. The problem is that we only set it once on create instead of for each start.

So I think this must be moved into func (c *Container) generateSpec() instead

Contributor Author

Is there a problem with leaving the ulimits in the spec file empty? The container process should inherit the limits, right?

Is that generateSpec invoked on each run and the spec passed to crun, or is it also saved on disk for the next time the container is started?

Contributor Author

Ok, so func (c *Container) generateSpec() from container_internal_common.go gets executed each time the container is started if it's in ContainerStateConfigured or ContainerStateExited. I'll put setting the max ulimits in there and see if that fixes the CI issues for rootless.

I'll also disable the new test for rootful, as this issue is not really applicable to rootful.

Contributor Author

Looks like this approach works on my end.

Contributor Author

Incorporated into latest version of this PR
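
The idea settled on in this thread can be sketched as follows: defaults for open files and processes are computed from the host's limits at every start, and only for types the user did not set. This is an illustration under stated assumptions, not the code in generateSpec(); Rlimit, hostLimits, and withStartDefaults are hypothetical names, and the host limits are passed in as plain numbers rather than read with getrlimit(2).

```go
package main

import "fmt"

// Rlimit is a hypothetical stand-in for one rlimit entry in a runtime spec.
type Rlimit struct {
	Type string
	Soft uint64
	Hard uint64
}

// hostLimits is a hypothetical snapshot of the hard limits of the process
// that starts the container, taken at start time.
type hostLimits struct {
	Nofile uint64
	Nproc  uint64
}

// withStartDefaults returns the rlimits for a single start. Limits set by
// the user are kept as-is; nofile and nproc default to the current host
// maximum when the user did not set them. The input slice is not modified.
func withStartDefaults(userSet []Rlimit, host hostLimits) []Rlimit {
	out := append([]Rlimit(nil), userSet...)
	has := func(t string) bool {
		for _, rl := range userSet {
			if rl.Type == t {
				return true
			}
		}
		return false
	}
	if !has("RLIMIT_NOFILE") {
		out = append(out, Rlimit{Type: "RLIMIT_NOFILE", Soft: host.Nofile, Hard: host.Nofile})
	}
	if !has("RLIMIT_NPROC") {
		out = append(out, Rlimit{Type: "RLIMIT_NPROC", Soft: host.Nproc, Hard: host.Nproc})
	}
	return out
}

func main() {
	user := []Rlimit{{Type: "RLIMIT_NOFILE", Soft: 1024, Hard: 1024}}
	// First start: the host allows many processes.
	fmt.Println(withStartDefaults(user, hostLimits{Nofile: 1048576, Nproc: 63000}))
	// Second start after "ulimit -u" was lowered: the default follows the host.
	fmt.Println(withStartDefaults(user, hostLimits{Nofile: 1048576, Nproc: 4096}))
}
```

Because nothing computed here is stored with the container, a limit lowered between two starts is picked up on the second start instead of being requested at its old, now unreachable value.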

@Luap99
Member

Luap99 commented May 30, 2023

> For 2. (limits test): This test works on my system.
>
> Is there a simple way to get the VM image used in the CI to test locally? The script in /hack/get_ci_vm.sh is asking me for GCP credentials (is it trying to provision a VM on GCP?).

Yes, the goal is to have a VM that matches the CI environment exactly. It is only intended for us (Red Hat maintainers) because it requires special GCE permissions; however, I can create a VM for you and add your public SSH key if you really need access.

@Cydox
Contributor Author

Cydox commented May 31, 2023

Alright, ulimits are now set to the maximum possible value (if not overridden) in the generateSpec() function each time the container is started instead of at creation time. I also fixed the issue this PR had with the limits test. The problem was that I was appending limits to the list instead of using g.AddProcessRlimits. If the resulting list of limits had a lower limit before a higher limit, this led to permission errors.
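
The difference between appending and an add-or-replace call can be sketched as below. This is an illustration only: addOrReplace is a hypothetical helper standing in for the behavior attributed to g.AddProcessRlimits here, namely that the list ends up with at most one entry per limit type. The assumption is that a list holding the same type twice gets applied entry by entry, so a low value followed by a higher one amounts to raising a limit that was just lowered.

```go
package main

import "fmt"

// Rlimit is a hypothetical stand-in for one rlimit entry in a runtime spec.
type Rlimit struct {
	Type string
	Soft uint64
	Hard uint64
}

// addOrReplace keeps at most one entry per Type. If an entry of the same
// type already exists it is overwritten in place; otherwise rl is appended.
func addOrReplace(list []Rlimit, rl Rlimit) []Rlimit {
	for i := range list {
		if list[i].Type == rl.Type {
			list[i] = rl
			return list
		}
	}
	return append(list, rl)
}

func main() {
	incoming := []Rlimit{
		{Type: "RLIMIT_NPROC", Soft: 100, Hard: 100},
		{Type: "RLIMIT_NPROC", Soft: 200, Hard: 200},
	}
	var appended, replaced []Rlimit
	for _, rl := range incoming {
		appended = append(appended, rl)       // keeps both entries, low before high
		replaced = addOrReplace(replaced, rl) // keeps only the last value per type
	}
	fmt.Println(len(appended), len(replaced)) // prints: 2 1
}
```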

Also added a test under test/system that starts a container, lowers the ulimit and starts the container a second time. Couldn't get the assert statement properly escaped. It's currently only matching against the number of the ulimit. I would prefer to match against "Soft": $nproc_limit. How do I escape this properly? Could also use jq alternatively.

Assuming all tests pass, this PR is ready from my perspective (except for the matching in the system test).

@Cydox Cydox requested a review from Luap99 May 31, 2023 04:21
test/system/045-start.bats Outdated
@Cydox Cydox changed the title [WIP] fix ulimit issue fix ulimit issue May 31, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 31, 2023
@Cydox
Contributor Author

Cydox commented May 31, 2023

/retest

Only changed the assert in the new system test. Unrelated stuff failed. Looks like a flake to me.

@openshift-ci
Contributor

openshift-ci bot commented May 31, 2023

@Cydox: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@Luap99 Luap99 left a comment

LGTM

@giuseppe @umohnani8 PTAL

@openshift-ci
Contributor

openshift-ci bot commented May 31, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Cydox, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2023
@umohnani8
Member

Changes LGTM but would like a review from @rhatdan @giuseppe

@rhatdan
Member

rhatdan commented May 31, 2023

/lgtm
Thanks @Cydox

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note
Development

Successfully merging this pull request may close these issues.

change in ulimit -u causes existing containers to not start
5 participants