
Fix runtime watcher memory leak #473

Merged
merged 1 commit on Jun 17, 2022

Conversation

guilhermocc (Contributor)

What ❓

Use the pod informer's indexer to retrieve the objects referenced by workqueue keys, instead of a custom storage. Clean up the resource handlers so that they are only responsible for enqueuing object keys and do not perform any other action.
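
A minimal sketch of the intended handler shape, assuming a client-go SharedIndexInformer for pods and a rate-limiting workqueue (function and variable names are illustrative, not the actual Maestro code):

```go
package watcher

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// registerHandlers wires the pod informer to the workqueue: each handler only
// derives the object's "namespace/name" key and enqueues it, nothing else.
func registerHandlers(informer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// DeletionHandlingMetaNamespaceKeyFunc also handles tombstone objects.
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
}
```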

Why 🤔

We have identified a possible memory leak in the runtime watcher through our metrics.

After investigating and profiling the runtime watcher's memory locally, we found the leak happening in a client-go component called StreamWatcher, and the way we were using the pod informer together with the workqueue appeared to be the cause.

Test: 50 schedulers, each with 1 room that keeps being recreated (30 minutes):
(memory profile snapshots)

Test: 1 scheduler with 50 rooms that keep being recreated (30 minutes):
(memory profile snapshots)

While searching for the reason this specific component leaks memory, we found that it can happen when the resource informer's handlers block event consumption: client-go keeps events that have not yet been processed in memory, so when consumption is slower than production, memory accumulates. These are some of the issues we used as references for finding the cause of the leak:
kubernetes/kubernetes#91686
kubernetes/kubernetes#103789

So, after looking at our code, we identified a likely source of this memory leak: a misuse of the pod informer and the workqueue.

As the official kubernetes sample-controller documentation on the client-go informer architecture explains (https://github.com/kubernetes/sample-controller/blob/master/docs/controller-client-go.md), the only work the resource handlers should do is enqueue object keys for processing; when processing items from the queue, we should then use the informer's indexer to retrieve the object from client-go's thread-safe internal storage.

(diagram: client-go controller architecture, from the sample-controller documentation)
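
On the consumer side, a minimal sketch under the same assumptions as above (the indexer would come from the informer's GetIndexer method) of retrieving the pod through the informer's thread-safe cache instead of a side store:

```go
package watcher

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// processNextItem pops one key from the queue and looks the pod up in the
// informer's indexer; it returns false once the queue has been shut down.
func processNextItem(queue workqueue.RateLimitingInterface, indexer cache.Indexer) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)

	obj, exists, err := indexer.GetByKey(key.(string))
	if err != nil {
		// Cache lookup failed; retry this key later with rate limiting.
		queue.AddRateLimited(key)
		return true
	}
	if !exists {
		// The pod is no longer in the cache, so this key represents a deletion.
		queue.Forget(key)
		return true
	}

	pod := obj.(*corev1.Pod)
	_ = pod // handle the add/update for this pod here

	queue.Forget(key)
	return true
}
```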

Currently, we are creating a custom storage: our resource handlers, before enqueuing object keys, store the object itself in this storage (which does not follow the official proposal), and when processing items from the queue the object is retrieved from our custom storage instead of through the informer's indexer.
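
For contrast, a purely hypothetical illustration of the kind of side storage described above (not the actual Maestro code): the handler copies the whole object into its own map before enqueuing the key, duplicating what the informer cache already holds:

```go
package watcher

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// customStore is a hypothetical side storage that duplicates pods already
// kept in the informer's cache.
type customStore struct {
	mu   sync.Mutex
	pods map[string]*corev1.Pod
}

// enqueueWithCopy shows the shape of the removed pattern: store a full copy
// of the object, then enqueue its key.
func enqueueWithCopy(obj interface{}, store *customStore, queue workqueue.RateLimitingInterface) {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		return
	}
	key, err := cache.MetaNamespaceKeyFunc(pod)
	if err != nil {
		return
	}
	store.mu.Lock()
	store.pods[key] = pod.DeepCopy() // whole object duplicated before enqueuing
	store.mu.Unlock()
	queue.Add(key)
}
```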

After changing the code to remove the custom storage and use the informer's indexer instead, we ran the same tests and got the following results:

Test: 50 schedulers, each with 1 room that keeps being recreated (30 minutes):
(memory profile snapshots)

Test: 1 scheduler with 50 rooms that keep being recreated (30 minutes):
(memory profile snapshots)

We can see that StreamWatcher is no longer accumulating memory as before.

Full profiling snapshots can be found here:
Memory pprof.zip

guilhermocc self-assigned this on Jun 17, 2022
arthur29 (Contributor) left a comment:

LGTM
