
Add a memory and cpu limits on deployments #20193

Merged 6 commits into ManageIQ:master on Jul 2, 2020

Conversation

@carbonin
Member

This is based on the worker's memory_threshold setting

Fixes #20163

This really needs to be extended, but I'm not sure where to get the other settings from.
We have something that looks like CPU settings already, but they appear to be either broken or unused, as the only values we have are 0 or 100 percent.

I'm also not sure how to handle requests, which Kubernetes uses for scheduling pods on nodes. For the rest of the components in github.com/manageiq/manageiq-pods I don't set a request by default so that the containers will be schedulable out of the box, even on small environments like minikube.

In this repo we would need to expose settings for the requests, which would be 0 (or some other flavor of "unset") by default, to get similar behavior.
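
For illustration, a minimal sketch (with hypothetical method and setting names; not the actual orchestrator code) of the container resources stanza under discussion:

```ruby
# A minimal sketch of the kind of container "resources" stanza being
# discussed. NOT the actual orchestrator code; the method and setting
# names are assumptions for illustration only.
def container_resources(worker)
  {
    :limits => {
      # derived from the worker's memory_threshold setting
      :memory => worker.worker_settings[:memory_threshold] # e.g. "500Mi"
    }
    # :requests is deliberately left out; note that when a limit is set
    # without a request, Kubernetes defaults the request to the limit,
    # a wrinkle that comes up later in this thread.
  }
end
```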

@carbonin
Member Author

This also doesn't handle updating the value if the settings change. Right now we don't edit the deployment because it's hooked into the worker start. Not sure if that's something we also want for the first pass on this.

@Fryguy
Member

Fryguy commented May 20, 2020

@jrafanie can you give some color to the CPU stuff that @carbonin is seeing?

@jrafanie
Member

Hmmm, weird, worker_cpu_usage_threshold was added in:

commit a868647c2f26e609a673a79987d33af792631e5a
Author: <REDACTED>
Date:   Fri Mar 6 16:06:36 2009 +0000

    BugzID: 5096
    Added logic to check system CPU usage for generic worker and sleep if over the threshold.

    git-svn-id: http://miq-ubuntusub.manageiq.com/svn/svnrepos/Manageiq/trunk@12912 3c68ef56-dcc3-11dc-9475-a42b84ecc76f

The only info in the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=MANAGEIQ-5096) was: "To avoid CPU Alarms, throttle getting units of work from MiqQueue based on configurable CPU threshold"

It would skip delivering queue messages in the generic worker if it reached 90% CPU usage. I believe this is when workers polled the queue on their own.

Looking to see why it was changed to 0 and 100%.... 👀

@jrafanie
Member

It was changed to 99% in 326ea6579ab573e4dd5d3288fb6a9baf0962e502 for:
https://bugzilla.redhat.com/show_bug.cgi?id=MANAGEIQ-5668

then to 100% along with other settings in 15de028089b8661077a8ede57091883e90e635b6 for https://bugzilla.redhat.com/show_bug.cgi?id=MANAGEIQ-5749 (oh, that was me... but they said to do it)

2e59d858b5646b9229fecf54ebce6a2a532038a4 changed the event handler to have a 0 threshold (effectively skipping checks).

So yes, it's effectively disabled right now. Note, it's only checked (and always skipped, because of the 0 or 100 setting) in queue workers.

We can enable it or change it if we need to repurpose it for pods.
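
To make "effectively disabled" concrete, here is a hypothetical sketch of the kind of guard described above; the names are illustrative, not the actual ManageIQ code:

```ruby
# Hypothetical sketch of the throttle described above; names are
# illustrative, not the actual ManageIQ implementation.
def deliver_queue_message?(cpu_usage_percent, threshold_percent)
  # 0 means "no threshold configured", and 100 can never be exceeded,
  # so both values effectively disable the check.
  return true if threshold_percent <= 0 || threshold_percent >= 100

  cpu_usage_percent < threshold_percent
end

deliver_queue_message?(95, 0)   # => true  (disabled)
deliver_queue_message?(95, 100) # => true  (disabled)
deliver_queue_message?(95, 90)  # => false (would throttle, as in the 2009 behavior)
```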

@jrafanie
Member

The other "correct" answer is to delete it and add it if and when we want something like it. ✂️ 🗑️ 🚽

Let me know if you agree on the "correct" answer...

@carbonin
Member Author

I guess I want to keep it, but it will only do anything in pods, as opposed to memory, which will be enforced everywhere. So my question now is: if I change it to use "milli-cores" as Kubernetes does, should I change the name to something pods-specific, or keep it matching memory even though we'll probably never enforce it on the appliance?

@carbonin carbonin force-pushed the add_resource_constraints branch from f6f5d60 to f2d2bdc on June 1, 2020 19:41
@carbonin carbonin requested review from agrare and kbrock as code owners June 1, 2020 19:41
@carbonin
Member Author

carbonin commented Jun 1, 2020

So this now adds a CPU and memory limit to all the workers, but unfortunately this makes the requests default to the limits (ref).

This means that by default you would need a cluster with at least 6 cores and 6 GiB of memory just to schedule the default set of workers, never mind adding provider workers or scaling up. This also doesn't account for any requests set on the rest of the pods in the project.

When we deploy the non-worker containers (postgres, httpd, orchestrator, etc.) we default to no limits/requests, but they can easily be set in the CR at deploy time. I'm not sure how we would make a similar change for the workers. I don't think it's a good experience to have to deploy the whole app and then go into advanced settings to mess about with resource constraints.

Maybe we have a single flag that tells the orchestrator whether or not to add the constraints? Then that could be set in settings at deploy time?
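
A rough sketch of that flag idea (the Settings path and helper names here are assumptions; only the flag concept is from this thread):

```ruby
# Rough sketch of the proposed flag; the Settings path and method
# names are assumptions, not the merged implementation.
def resource_constraints(worker)
  return {} unless Settings.enforce_resource_constraints

  {
    :limits => {
      :cpu    => worker.cpu_threshold,    # e.g. "500m"
      :memory => worker.memory_threshold  # e.g. "500Mi"
    }
  }
end
```

With the flag off, worker deployments carry no constraints and stay schedulable on small clusters.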

@carbonin carbonin requested a review from gtanzillo as a code owner June 2, 2020 20:40
@carbonin carbonin force-pushed the add_resource_constraints branch from af8f9d3 to ff24b2c on June 2, 2020 20:53
@carbonin carbonin changed the title from "[WIP] Add a memory limit constraint on deployments" to "Add a memory limit constraint on deployments" on Jun 3, 2020
@carbonin carbonin removed the wip label Jun 3, 2020
@carbonin
Member Author

carbonin commented Jun 3, 2020

Settled on adding yet another flag to disable this by default. It can be enabled in the CR at deploy time; see ManageIQ/manageiq-pods#532.

@agrare
Member

agrare commented Jun 4, 2020

@carbonin systemd does have a CPUQuota setting (ref) which takes a percentage of a single CPU that the unit is allowed to use, so it might be possible to enforce this on the appliance as well in the future. We should be able to convert between millicores and percent (I think 500 millicores = 0.5 CPU = 50% of a single CPU).
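
For example, a hypothetical helper for that conversion:

```ruby
# Hypothetical conversion helper: Kubernetes millicores to a systemd
# CPUQuota percentage of a single CPU. 1000 millicores = 1 CPU = 100%,
# so divide by 10.
def millicores_to_cpu_quota(millicores)
  "CPUQuota=#{(millicores / 10.0).round}%"
end

millicores_to_cpu_quota(500) # => "CPUQuota=50%"
```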

@carbonin
Member Author

carbonin commented Jun 4, 2020

Yup, that makes sense @agrare. I guess I meant that we won't be enforcing it in the server process. Pushing the resource monitoring to systemd makes a lot of sense.

config/settings.yml (review comment; outdated, resolved)
@carbonin carbonin changed the title from "Add a memory limit constraint on deployments" to "Add a memory and cpu limits on deployments" on Jun 5, 2020
@carbonin
Member Author

carbonin commented Jun 5, 2020

I also have some pseudo-real-world data that might help with choosing better values for CPU usage:
[image: per-worker CPU and memory usage from a test environment]

This env has ~125 real VMware VMs, 1000 simulated VMware VMs (vcsim), and ~250 AWS instances, but isn't doing any app-side work (reporting, provisioning, automate, etc.).

Based on that, half a core seems reasonable for things like the generic worker, but maybe we can tone it down for refresh workers? @agrare what do you think?
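
As a sketch of how that might be expressed (the override value and worker name below are hypothetical; only the half-core default is from this thread):

```ruby
# Half a core for every worker type by default, with room for
# per-worker overrides as more data comes in. The override below is
# purely illustrative, not a merged value.
DEFAULT_CPU_LIMIT = "500m".freeze # 500 millicores = half a core

CPU_LIMIT_OVERRIDES = {
  "refresh_worker" => "250m" # hypothetical lower value for refresh workers
}.freeze

def cpu_limit_for(worker_type)
  CPU_LIMIT_OVERRIDES.fetch(worker_type, DEFAULT_CPU_LIMIT)
end
```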

@jrafanie
Member

jrafanie commented Jun 5, 2020

> I also have some pseudo-real-world data that might help with choosing better values for CPU usage:

I can't help but see the orchestrator taking up 1/6 of the memory and roughly 1/6 or 1/7 of the CPU for the project. ✨ Shiny object.

carbonin added 6 commits July 2, 2020 10:18

- This is based on the worker's memory_threshold setting. Fixes ManageIQ#20163
- This was only ever set to either 100% or 0%, which means it was never used in either case.
- For now, we'll start with half a core for all the worker types; we can change this for individual workers as we get more data.
- This setting will be set by the orchestrator startup process, which is in turn driven by a parameter in the CR. The default for this is false in the operator, so it will also be false in the settings. Fixes ManageIQ#20163
@carbonin carbonin force-pushed the add_resource_constraints branch from 07c5295 to b06cb7d on July 2, 2020 14:21
@miq-bot
Member

miq-bot commented Jul 2, 2020

Checked commits carbonin/manageiq@ec269f7~...b06cb7d with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
3 files checked, 0 offenses detected
Everything looks fine. 🍪

@jrafanie jrafanie left a comment
Member


LGTM, will merge when green.

@jrafanie
Member

jrafanie commented Jul 2, 2020

hmm, @carbonin it failed with

  1) ExtManagementSystem validates type
     Failure/Error: expect(ManageIQ::Providers::Foreman::ConfigurationManager.new(:provider => foreman_provider, :hostname => "abc", :name => "abc", :zone => FactoryBot.build(:zone)).validate!).to eq(true)
     
     ActiveRecord::RecordInvalid:
       Validation failed: ManageIQ::Providers::Foreman::ConfigurationManager: Url can't be blank, ManageIQ::Providers::Foreman::ConfigurationManager: Url can't be blank
     # ./spec/models/ext_management_system_spec.rb:144:in `block (2 levels) in <top (required)>'
Finished in 7 minutes 58 seconds (files took 20.57 seconds to load)
6543 examples, 1 failure, 3 pending
Failed examples:
rspec ./spec/models/ext_management_system_spec.rb:129 # ExtManagementSystem validates type

seems unrelated

@agrare
Member

agrare commented Jul 2, 2020

Yeah that's me, I'm on it

@agrare
Member

agrare commented Jul 2, 2020

#20330 should resolve that

@agrare agrare closed this Jul 2, 2020
@agrare agrare reopened this Jul 2, 2020
@agrare
Member

agrare commented Jul 2, 2020

Merged, sorry for the interruption

@jrafanie
Member

jrafanie commented Jul 2, 2020

Merging; it's enforce_resource_constraints: false to start, so it should not be turned on yet...

ManageIQ/manageiq-pods#532 will enable and use it.

@jrafanie jrafanie merged commit 7c7d7f7 into ManageIQ:master Jul 2, 2020
@jrafanie jrafanie assigned jrafanie and unassigned Fryguy Jul 2, 2020
simaishi pushed a commit that referenced this pull request Jul 2, 2020
Add a memory and cpu limits on deployments

(cherry picked from commit 7c7d7f7)
@simaishi
Contributor

simaishi commented Jul 2, 2020

Jansa backport details:

$ git log -1
commit 6479ebf04ea8c9842496d8aa76cb86b32b80d020
Author: Joe Rafaniello <[email protected]>
Date:   Thu Jul 2 13:39:47 2020 -0400

    Merge pull request #20193 from carbonin/add_resource_constraints

    Add a memory and cpu limits on deployments

    (cherry picked from commit 7c7d7f7d5d16bc849d45bafc090e59a4bccb65d3)

Successfully merging this pull request may close these issues:

Worker deployments should have resource requests and limits (#20163)