
Add a memory and cpu limits on deployments #20193

Merged 6 commits into ManageIQ:master on Jul 2, 2020

Conversation

@carbonin
Member

This is based on the worker's memory_threshold setting

Fixes #20163

This really needs to be extended, but I'm not sure where to get the other settings from.
We have something that looks like CPU settings already, but they appear to be either broken or unused, as the only values we have are 0 or 100 percent.

I'm also not sure how to handle requests, which Kubernetes uses for scheduling pods on nodes. For the rest of the components in github.com/manageiq/manageiq-pods I don't set a request by default so that the containers will be schedulable out of the box, even on small environments like minikube.

In this repo we would need to expose settings for the requests, which would be 0 (or some other flavor of "unset") by default, to get similar behavior.
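
For illustration, a minimal sketch (with hypothetical method and setting names; not the actual orchestrator code) of the container resources stanza under discussion:

```ruby
# A minimal sketch of the kind of container "resources" stanza being
# discussed. NOT the actual orchestrator code; the method and setting
# names are assumptions for illustration only.
def container_resources(worker)
  {
    :limits => {
      # derived from the worker's memory_threshold setting
      :memory => worker.worker_settings[:memory_threshold] # e.g. "500Mi"
    }
    # :requests is deliberately left out; note that when a limit is set
    # without a request, Kubernetes defaults the request to the limit,
    # a wrinkle that comes up later in this thread.
  }
end
```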

@carbonin
Member Author

This also doesn't handle updating the value if the settings change. Right now we don't edit the deployment because it's hooked into the worker start. Not sure if that's something we also want for the first pass on this.

@Fryguy
Member

Fryguy commented May 20, 2020

@jrafanie can you give some color to the CPU stuff that @carbonin is seeing?

@jrafanie
Member

Hmmm, weird, worker_cpu_usage_threshold was added in:

commit a868647c2f26e609a673a79987d33af792631e5a
Author: <REDACTED>
Date:   Fri Mar 6 16:06:36 2009 +0000

    BugzID: 5096
    Added logic to check system CPU usage for generic worker and sleep if over the threshold.

    git-svn-id: http://miq-ubuntusub.manageiq.com/svn/svnrepos/Manageiq/trunk@12912 3c68ef56-dcc3-11dc-9475-a42b84ecc76f

The only info in the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=MANAGEIQ-5096) was: "To avoid CPU Alarms, throttle getting units of work from MiqQueue based on configurable CPU threshold"

It would skip delivering queue messages in the generic worker if it reached 90% CPU usage. I believe this is when workers polled the queue on their own.

Looking to see why it was changed to 0 and 100%.... 👀

@jrafanie
Member

It was changed to 99% in 326ea6579ab573e4dd5d3288fb6a9baf0962e502 for:
https://bugzilla.redhat.com/show_bug.cgi?id=MANAGEIQ-5668

then to 100% along with other settings in 15de028089b8661077a8ede57091883e90e635b6 for https://bugzilla.redhat.com/show_bug.cgi?id=MANAGEIQ-5749 (oh, that was me... but they said to do it)

2e59d858b5646b9229fecf54ebce6a2a532038a4 changed the event handler to have a 0 threshold (effectively skipping checks).

So yes, it's effectively disabled right now. Note, it's only checked (and always skipped, because of the 0 or 100 setting) in queue workers.

We can enable it or change it if we need to repurpose it for pods.
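
To make "effectively disabled" concrete, here is a hypothetical sketch of the kind of guard described above; the names are illustrative, not the actual ManageIQ code:

```ruby
# Hypothetical sketch of the throttle described above; names are
# illustrative, not the actual ManageIQ implementation.
def deliver_queue_message?(cpu_usage_percent, threshold_percent)
  # 0 means "no threshold configured", and 100 can never be exceeded,
  # so both values effectively disable the check.
  return true if threshold_percent <= 0 || threshold_percent >= 100

  cpu_usage_percent < threshold_percent
end

deliver_queue_message?(95, 0)   # => true  (disabled)
deliver_queue_message?(95, 100) # => true  (disabled)
deliver_queue_message?(95, 90)  # => false (would throttle, as in the 2009 behavior)
```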

@jrafanie
Member

The other "correct" answer is to delete it and add it if and when we want something like it. ✂️ 🗑️ 🚽

Let me know if you agree on the "correct" answer...

@carbonin
Member Author

I guess I want to keep it, but it will only do anything in pods, as opposed to memory, which will be enforced everywhere. So my question now is: if I change it to use "milli-cores" as Kubernetes does, should I change the name to something pods-specific, or keep it matching memory even though we'll probably never enforce it on the appliance?

@carbonin carbonin force-pushed the add_resource_constraints branch from f6f5d60 to f2d2bdc on June 1, 2020 19:41
@carbonin carbonin requested review from agrare and kbrock as code owners June 1, 2020 19:41
@carbonin
Member Author

carbonin commented Jun 1, 2020

So this now adds a CPU and memory limit to all the workers, but unfortunately this makes the requests default to the limits (ref).

This means that by default you would need a cluster with at least 6 cores and 6 GiB of memory just to schedule the default set of workers, never mind adding provider workers or scaling up. This also doesn't account for any requests set on the rest of the pods in the project.

When we deploy the non-worker containers (postgres, httpd, orchestrator, etc.) we default to no limits/requests, but they can easily be set in the CR at deploy time. I'm not sure how we would make a similar change for the workers. I don't think it's a good experience to have to deploy the whole app and then go into advanced settings to mess about with resource constraints.

Maybe we have a single flag that tells the orchestrator whether or not to add the constraints? Then that could be set in settings at deploy time?
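
A rough sketch of that flag idea (the Settings path and helper names here are assumptions; only the flag concept is from this thread):

```ruby
# Rough sketch of the proposed flag; the Settings path and method
# names are assumptions, not the merged implementation.
def resource_constraints(worker)
  return {} unless Settings.enforce_resource_constraints

  {
    :limits => {
      :cpu    => worker.cpu_threshold,    # e.g. "500m"
      :memory => worker.memory_threshold  # e.g. "500Mi"
    }
  }
end
```

With the flag off, worker deployments carry no constraints and stay schedulable on small clusters.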

@carbonin carbonin requested a review from gtanzillo as a code owner June 2, 2020 20:40
@carbonin carbonin force-pushed the add_resource_constraints branch from af8f9d3 to ff24b2c on June 2, 2020 20:53
@carbonin carbonin changed the title from "[WIP] Add a memory limit constraint on deployments" to "Add a memory limit constraint on deployments" on Jun 3, 2020
@carbonin carbonin removed the wip label Jun 3, 2020
@carbonin
Member Author

carbonin commented Jun 3, 2020

Settled on adding yet another flag to disable this by default. It can be enabled in the CR at deploy time; see ManageIQ/manageiq-pods#532.

@agrare
Member

agrare commented Jun 4, 2020

@carbonin systemd does have a CPUQuota setting (ref) which takes a percentage of a single CPU that the unit is allowed to use, so it might be possible to enforce this on the appliance as well in the future. We should be able to convert between millicores and percent (I think 500 millicores = 0.5 CPU = 50% of a single CPU).
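
For example, a hypothetical helper for that conversion:

```ruby
# Hypothetical conversion helper: Kubernetes millicores to a systemd
# CPUQuota percentage of a single CPU. 1000 millicores = 1 CPU = 100%,
# so divide by 10.
def millicores_to_cpu_quota(millicores)
  "CPUQuota=#{(millicores / 10.0).round}%"
end

millicores_to_cpu_quota(500) # => "CPUQuota=50%"
```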

@carbonin
Member Author

carbonin commented Jun 4, 2020

Yup, that makes sense @agrare. I guess I meant that we won't be enforcing it in the server process. Pushing the resource monitoring to systemd makes a lot of sense.

config/settings.yml (review comment; outdated, resolved)
@carbonin carbonin changed the title from "Add a memory limit constraint on deployments" to "Add a memory and cpu limits on deployments" on Jun 5, 2020
@carbonin
Member Author

carbonin commented Jun 5, 2020

I also have some pseudo-real-world data that might help with choosing better values for CPU usage:
[image: per-worker CPU and memory usage from a test environment]

This env has ~125 real VMware VMs, 1000 simulated VMware VMs (vcsim), and ~250 AWS instances, but isn't doing any app-side work (reporting, provisioning, automate, etc.).

Based on that, half a core seems reasonable for things like the generic worker, but maybe we can tone it down for refresh workers? @agrare what do you think?
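
As a sketch of how that might be expressed (the override value and worker name below are hypothetical; only the half-core default is from this thread):

```ruby
# Half a core for every worker type by default, with room for
# per-worker overrides as more data comes in. The override below is
# purely illustrative, not a merged value.
DEFAULT_CPU_LIMIT = "500m".freeze # 500 millicores = half a core

CPU_LIMIT_OVERRIDES = {
  "refresh_worker" => "250m" # hypothetical lower value for refresh workers
}.freeze

def cpu_limit_for(worker_type)
  CPU_LIMIT_OVERRIDES.fetch(worker_type, DEFAULT_CPU_LIMIT)
end
```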

@jrafanie
Member

jrafanie commented Jun 5, 2020

> I also have some pseudo-real-world data that might help with choosing better values for CPU usage:

I can't help but see the orchestrator taking up 1/6 of the memory and roughly 1/6 or 1/7 of the CPU for the project. ✨ Shiny object.

carbonin added 6 commits July 2, 2020 10:18

- This is based on the worker's memory_threshold setting. Fixes ManageIQ#20163
- This was only ever set to either 100% or 0%, which means it was never used in either case.
- For now, we'll start with half a core for all the worker types; we can change this for individual workers as we get more data.
- This setting will be set by the orchestrator startup process, which is in turn driven by a parameter in the CR. The default for this is false in the operator, so it will also be false in the settings. Fixes ManageIQ#20163
@carbonin carbonin force-pushed the add_resource_constraints branch from 07c5295 to b06cb7d on July 2, 2020 14:21
@miq-bot
Member

miq-bot commented Jul 2, 2020

Checked commits carbonin/manageiq@ec269f7~...b06cb7d with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
3 files checked, 0 offenses detected
Everything looks fine. 🍪

@jrafanie jrafanie left a comment
Member


LGTM, will merge when green.

@jrafanie
Member

jrafanie commented Jul 2, 2020

hmm, @carbonin it failed with

  1) ExtManagementSystem validates type
     Failure/Error: expect(ManageIQ::Providers::Foreman::ConfigurationManager.new(:provider => foreman_provider, :hostname => "abc", :name => "abc", :zone => FactoryBot.build(:zone)).validate!).to eq(true)
     
     ActiveRecord::RecordInvalid:
       Validation failed: ManageIQ::Providers::Foreman::ConfigurationManager: Url can't be blank, ManageIQ::Providers::Foreman::ConfigurationManager: Url can't be blank
     # ./spec/models/ext_management_system_spec.rb:144:in `block (2 levels) in <top (required)>'
Finished in 7 minutes 58 seconds (files took 20.57 seconds to load)
6543 examples, 1 failure, 3 pending
Failed examples:
rspec ./spec/models/ext_management_system_spec.rb:129 # ExtManagementSystem validates type

seems unrelated

@agrare
Member

agrare commented Jul 2, 2020

Yeah that's me, I'm on it

@agrare
Member

agrare commented Jul 2, 2020

#20330 should resolve that

@agrare agrare closed this Jul 2, 2020
@agrare agrare reopened this Jul 2, 2020
@agrare
Member

agrare commented Jul 2, 2020

Merged, sorry for the interruption

@jrafanie
Member

jrafanie commented Jul 2, 2020

Merging; it's enforce_resource_constraints: false to start, so it should not be turned on yet...

ManageIQ/manageiq-pods#532 will enable and use it.

@jrafanie jrafanie merged commit 7c7d7f7 into ManageIQ:master Jul 2, 2020
@jrafanie jrafanie assigned jrafanie and unassigned Fryguy Jul 2, 2020
simaishi pushed a commit that referenced this pull request Jul 2, 2020
Add a memory and cpu limits on deployments

(cherry picked from commit 7c7d7f7)
@simaishi
Contributor

simaishi commented Jul 2, 2020

Jansa backport details:

$ git log -1
commit 6479ebf04ea8c9842496d8aa76cb86b32b80d020
Author: Joe Rafaniello <[email protected]>
Date:   Thu Jul 2 13:39:47 2020 -0400

    Merge pull request #20193 from carbonin/add_resource_constraints

    Add a memory and cpu limits on deployments

    (cherry picked from commit 7c7d7f7d5d16bc849d45bafc090e59a4bccb65d3)

Successfully merging this pull request may close these issues:

Worker deployments should have resource requests and limits (#20163)