What ❓
Use the pods informer's indexer to retrieve the objects referenced by workqueue keys instead of a custom storage. Clean up the resource handlers so that they are only responsible for enqueuing object keys and do not perform any other action.
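As a rough illustration of the intended shape (a sketch, not the actual runtime watcher code; the function name, `queue` and the wiring are assumptions), the handlers end up doing nothing but translating events into keys:

```go
package main

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// buildPodInformer wires a pods informer to the workqueue: the handlers only
// enqueue object keys, nothing is copied into any other storage.
func buildPodInformer(clientset kubernetes.Interface, queue workqueue.RateLimitingInterface) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(clientset, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// DeletionHandlingMetaNamespaceKeyFunc also unwraps the tombstones
			// the informer leaves behind when a delete event was missed.
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})

	return podInformer
}
```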
Why 🤔
We have identified a possible memory leak in the runtime watcher through our metrics.
After investigating and profiling the runtime watcher's memory locally, we identified the leak in a client-go component called StreamWatcher, and the way we were using the pods informer + workqueue seemed to be the cause.
Test 50 schedulers each with 1 room that keeps being recreated (30 minutes):
Test 1 scheduler with 50 rooms that keep being recreated (30 minutes):
While researching why this specific component leaks memory, we found that it can happen when the resource informer's handlers block event consumption: client-go keeps events that have not yet been processed in memory, so when events are consumed more slowly than they are produced, memory accumulates. These are some of the issues we used as a reference while tracking down the leak:
kubernetes/kubernetes#91686
kubernetes/kubernetes#103789
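A hypothetical sketch of the failure mode those issues describe (not our actual handlers): slow, synchronous work inside a handler stalls this listener's event delivery, so client-go has to keep the unprocessed notifications in memory.

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// Anti-pattern for illustration only: the handler does slow work inline,
// so events are consumed more slowly than the watch produces them.
func addBlockingHandler(podInformer cache.SharedIndexInformer) {
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			_ = obj.(*corev1.Pod)            // the object itself is available here
			time.Sleep(2 * time.Second)      // stands in for a slow, blocking operation per event
		},
	})
}
```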
So, after looking at our code, we identified a likely source of this memory leak: a misuse of the pods informer and the workqueue.
As the official kubernetes/sample-controller documentation explaining the client-go informer architecture shows (https://github.com/kubernetes/sample-controller/blob/master/docs/controller-client-go.md), the only work the resource handlers should do is enqueue object keys to be processed; then, when processing items from the queue, the informer's indexer should be used to retrieve the object from client-go's thread-safe internal storage.
Currently we create a custom storage: before enqueuing object keys, our resource handlers store the object itself in this storage (which does not follow the official recommendation), and when processing items from the queue the object is retrieved from our custom storage rather than through the informer's indexer.
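Following the sample-controller pattern, the processing side looks roughly like the sketch below, where the object is looked up through the informer's indexer instead of any custom storage (the function name and the surrounding wiring are assumptions for illustration):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// processNextWorkItem pops one key from the queue and resolves it through the
// informer's indexer (client-go's thread-safe cache) instead of a custom store.
func processNextWorkItem(queue workqueue.RateLimitingInterface, indexer cache.Indexer) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)

	obj, exists, err := indexer.GetByKey(key.(string))
	if err != nil {
		// Lookup failed; retry the key later.
		queue.AddRateLimited(key)
		return true
	}
	if !exists {
		// The pod is gone from the cache, i.e. it was deleted.
		queue.Forget(key)
		return true
	}

	pod := obj.(*corev1.Pod)
	fmt.Printf("processing pod %s/%s\n", pod.Namespace, pod.Name) // placeholder for the real handling logic
	queue.Forget(key)
	return true
}
```

Because the indexer is kept up to date by the informer itself, the handlers no longer need to copy objects anywhere, which is exactly what the sample-controller documentation recommends.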
After changing the code to remove the custom storage and use the informer's indexer instead, we ran the same tests and got the following results:
Test 50 schedulers each with 1 room that keeps being recreated (30 minutes):
Test 1 scheduler with 50 rooms that keep being recreated (30 minutes):
We can see that StreamWatcher is no longer accumulating memory as before.
Full profiling snapshots can be found here:
Memory pprof.zip
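These heap snapshots come from Go's built-in pprof. A minimal way to expose the profiling endpoint in a process under test (a generic sketch, not necessarily how the runtime watcher is instrumented) is:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With this endpoint exposed, a heap snapshot can be captured with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```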