Introduce OOM watcher to allow graceful shutdown #628

Merged (3 commits) on Mar 7, 2023

hiddeco (Member) commented Mar 3, 2023

This PR introduces an OOM watcher, which can be enabled using
--feature-gates=OOMWatch=true. The OOM watcher watches the current
memory usage as reported by cgroups via memory.current and cancels
the context when it reaches a certain threshold compared to
memory.max (default 95%, configurable using
--oom-watch-memory-threshold).

This allows ongoing Helm processes to gracefully exit with a failure
before the controller is forcefully OOM killed, preventing a deadlock
of releases in a pending state.

The OOM watcher polls the memory.current file on an interval (default
500ms, configurable using --oom-watch-interval), as subscribing to
file updates using inotify is not possible for cgroups (v2) except for
*.events files. The memory.events file does provide signals, but these
will generally arrive too late for our use case: for example, high
equals max in most containers, which buys us little time to gracefully
stop our processes.

In addition, because we simply compare current usage to max usage in
bytes, this approach should work for cgroups v1 as well, given that it
has (most of the time) files for these values available too, albeit at
times at different locations. This commit does not introduce a flag for
those locations yet, but the library takes into account that they could
be configured at some point.

Should help with #149 when enabled.

hiddeco force-pushed the oom-watcher branch 4 times, most recently from 20ba683 to 304cbed (March 3, 2023 23:01)
hiddeco added the enhancement label (Mar 3, 2023)
hiddeco force-pushed the oom-watcher branch 5 times, most recently from d7f919a to 3dbb013 (March 4, 2023 10:43)
pjbgf (Member) left a comment

A few nits, but otherwise really good stuff. 👏

Three resolved review threads on internal/oomwatch/watch.go (one of them outdated).
hiddeco marked this pull request as ready for review (March 6, 2023 17:26)
hiddeco requested review from pjbgf and stefanprodan (March 6, 2023 17:29)
hiddeco force-pushed the oom-watcher branch 2 times, most recently from 718daea to 27cd21e (March 6, 2023 17:45)
hiddeco added 3 commits March 7, 2023 10:39
This commit introduces an OOM watcher, which can be enabled using
`--feature-gates=OOMWatch=true`. The OOM watcher watches the current
memory usage as reported by cgroups via `memory.current` and cancels
the context when it reaches a certain threshold compared to
`memory.max` (default `95`%, configurable using
`--oom-watch-memory-threshold`).

This allows ongoing Helm processes to gracefully exit with a failure
before the controller is forcefully OOM killed, preventing a deadlock
of releases in a pending state.

The OOM watcher polls the `memory.current` file on an interval (default
`500ms`, configurable using `--oom-watch-interval`), as subscribing to
file updates using inotify is not possible for cgroups (v2) except for
`*.events` files. The `memory.events` file does provide signals, but
these will generally arrive too late for our use case: for example,
`high` equals `max` in most containers, which buys us little time to
gracefully stop our processes.

In addition, because we simply compare current usage to max usage in
bytes, this approach should work for cgroups v1 as well, given that it
has (most of the time) files for these values available, albeit at
times at different locations. This commit does not introduce a flag for
those locations yet, but the library takes into account that they could
be configured at some point.

Signed-off-by: Hidde Beydals <[email protected]>
Signed-off-by: Hidde Beydals <[email protected]>
- Change memory usage percent threshold to `uint8` to no longer allow
  fractions.
- Validate interval to prevent configurations `<50ms`.

Signed-off-by: Hidde Beydals <[email protected]>
stefanprodan (Member) left a comment

LGTM

Thanks @hiddeco 🏅

hiddeco merged commit 352b7f2 into main (Mar 7, 2023)
hiddeco deleted the oom-watcher branch (March 7, 2023 09:57)