
[Kubernetes secret provider] Add cache for the secrets #3822

Merged
merged 29 commits into elastic:main from kubernetes-cache-secrets on Dec 18, 2023

Conversation

Contributor

@constanca-m constanca-m commented Nov 27, 2023

What does this PR do?

Currently, we make a request to the API Server every time we need to get the value of a secret. This is explained in more detail here; the discussion is also present in the issue.

With this PR, we use a map to store the secrets, as follows (see the sketch further down):

  1. Once the Kubernetes secrets provider starts running, we launch a goroutine that updates the cache every TTL minutes. TTL is a new option in the provider configuration; by default it is set to 10 minutes.
  2. To update the cache, we range over every secret stored there, and make a new request to the API Server to obtain the current value.

The secrets are accessed from the outside through the Fetch function. If we try to retrieve a secret that is not present in our map, we make a request to the API Server to obtain its value and store it in the map. This function never updates the value of a secret that is already cached.

The only difference between this approach and the current one is that we now store the secrets and their values.

Warning: Sometimes the secret values are not updated in time, and we may be using stale ones until the next update happens. This is also the case today.
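
For illustration, a minimal sketch of this scheme. All identifiers below (secretsCache, fetchFromAPIServer, startRefresher, ...) are hypothetical and do not match the actual names in this PR:

```go
package kubernetessecrets

import (
	"sync"
	"time"
)

type secretsCache struct {
	mu      sync.RWMutex
	secrets map[string]string
	ttl     time.Duration // refresh interval; 10 minutes by default
}

// fetchFromAPIServer stands in for the real request to the Kubernetes API Server.
func fetchFromAPIServer(key string) (string, bool) { return "", false }

// Fetch returns a secret from the cache, requesting it from the API Server
// only when the key is not cached yet. It never refreshes an existing entry.
func (c *secretsCache) Fetch(key string) (string, bool) {
	c.mu.RLock()
	value, ok := c.secrets[key]
	c.mu.RUnlock()
	if ok {
		return value, true
	}
	value, ok = fetchFromAPIServer(key)
	if ok {
		c.mu.Lock()
		c.secrets[key] = value
		c.mu.Unlock()
	}
	return value, ok
}

// startRefresher launches the goroutine that re-fetches every cached secret
// from the API Server every TTL. Note that this version holds the lock for
// the whole refresh; the review discussion below is about avoiding that.
func (c *secretsCache) startRefresher(done <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(c.ttl)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				c.mu.Lock()
				for key := range c.secrets {
					if value, ok := fetchFromAPIServer(key); ok {
						c.secrets[key] = value
					}
				}
				c.mu.Unlock()
			}
		}
	}()
}
```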

Why is it important?

To stop overwhelming the API Server with secret requests.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

  1. Clone this branch.
  2. Follow these steps from the README file.

Related issues

Screenshots

There should not be a change in behavior apart from a decrease in calls to the API Server. Everything else works as before:

[screenshot]

@constanca-m constanca-m added the enhancement and Team:Cloudnative-Monitoring labels Nov 27, 2023
@constanca-m constanca-m requested a review from a team November 27, 2023 13:14
@constanca-m constanca-m self-assigned this Nov 27, 2023
@constanca-m constanca-m requested a review from a team as a code owner November 27, 2023 13:14
Contributor

mergify bot commented Nov 27, 2023

This pull request does not have a backport label. Could you fix it @constanca-m? 🙏
To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@pierrehilbert pierrehilbert added the Team:Elastic-Agent label Nov 27, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Contributor

gizas commented Nov 27, 2023

Reminder that we also need a documentation update for this change in https://www.elastic.co/guide/en/fleet/current/kubernetes_secrets-provider.html.

@gizas gizas requested a review from axw November 27, 2023 14:34
Contributor

@blakerouse blakerouse left a comment


The biggest issue I see in this PR is: how does a key get removed from the cache if it is no longer referenced from the policy? Let's say the policy references kubernetes_secrets.somenamespace.somesecret.value1 and then changes to kubernetes_secrets.somenamespace.somesecret.value2. This change will keep caching kubernetes_secrets.somenamespace.somesecret.value1 forever.

To solve the problem, you should add a last-accessed time per cached key, and during each update cycle remove keys that have not been accessed within a given interval. That will solve the issue I described above.
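
A minimal sketch of that idea, reusing the hypothetical names from the sketch in the PR description (not the actual code in this PR), with the cached value changed to a struct that records when it was last read:

```go
// Building on the earlier sketch: c.secrets becomes map[string]cacheEntry.
type cacheEntry struct {
	value      string
	lastAccess time.Time
}

// refresh re-fetches cached secrets and evicts entries that have not been
// read within `expire` (e.g. 1h), so keys no longer referenced by the policy
// eventually age out of the cache.
func (c *secretsCache) refresh(expire time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	for key, entry := range c.secrets {
		if now.Sub(entry.lastAccess) > expire {
			delete(c.secrets, key) // not accessed recently: drop it
			continue
		}
		if value, ok := fetchFromAPIServer(key); ok {
			entry.value = value
			c.secrets[key] = entry
		}
	}
}
```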

Contributor

gizas commented Nov 27, 2023

To solve the problem you should add a last accessed time per cached key

@blakerouse thanks for this. Makes sense.
We can also consider fetching the secrets from Kubernetes and checking again every X minutes: #3822 (comment), and caching if the secret exists or the value is different. Maybe that is simpler than checking a new field again.

Contributor Author

constanca-m commented Nov 27, 2023

To solve the problem you should add a last accessed time per cached key

@blakerouse thanks for this. Makes sense.

And should we also give that option to the user, or should we enforce that time ourselves, @gizas?

Edit: I enforced it as 1h. Currently, the user cannot change it.

- Update duration config format
- Update duration config format
@blakerouse
Contributor

This is also missing unit test coverage for these additions; that needs to be added as well.

@blakerouse
Contributor

Are there any blockers you still want to address, @blakerouse, since you did not remove the requested changes?

Yes I am still waiting on my comment about holding the lock the entire time the code is refreshing the cache. I think that is still a big issue that needs to be looked at.

@constanca-m
Contributor Author

Yes I am still waiting on my comment about holding the lock the entire time the code is refreshing the cache. I think that is still a big issue that needs to be looked at.

I don't have any explanation for that other than this comment. Is that a blocker, @blakerouse?

Member

@cmacknz cmacknz left a comment


Took another look; there is still at least one bug, and a few cases I could spot where splitting the cache changes across separate "read lock" and "write lock" sections introduces concurrency problems.

This is because the manipulation you are doing to the cache is no longer atomic across a single function body, which allows the cache to change in the middle of the function execution when you unlock the read lock and acquire the write lock.
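
To illustrate the kind of problem being described (a hypothetical example using the earlier sketch's names, not code from the PR diff):

```go
// Racy: the check under the read lock and the write under the write lock are
// not atomic, so another goroutine can insert, evict, or update the key in
// the gap between RUnlock and Lock.
func (c *secretsCache) updateIfPresentRacy(key, value string) {
	c.mu.RLock()
	_, ok := c.secrets[key]
	c.mu.RUnlock()
	// <-- another goroutine may evict or replace the entry right here
	if ok {
		c.mu.Lock()
		c.secrets[key] = cacheEntry{value: value, lastAccess: time.Now()}
		c.mu.Unlock() // may resurrect an evicted entry or clobber a newer value
	}
}

// Doing the check and the write under a single write lock keeps the operation atomic.
func (c *secretsCache) updateIfPresent(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if entry, ok := c.secrets[key]; ok {
		entry.value = value
		c.secrets[key] = entry
	}
}
```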

Member

@cmacknz cmacknz left a comment


One last thought: it seems like it would be straightforward to add a configuration entry that disables the cache.

That way, if there is some unexpected surprise, we can revert to the previous behavior without needing a release.
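
For example, the provider configuration could expose something like the following. This is a sketch only; the field names, keys, and tags here are made up and may not match what was actually added in the PR:

```go
// Hypothetical configuration sketch; names may differ from the PR.
type Config struct {
	// DisableCache reverts to the previous behavior of querying the API
	// Server on every secret lookup.
	DisableCache bool `config:"disable_cache"`
	// RefreshInterval (the TTL) is how often cached secret values are re-fetched.
	RefreshInterval time.Duration `config:"refresh_interval"`
	// ExpireAfter is how long an unread entry is kept before being evicted.
	ExpireAfter time.Duration `config:"expire_after"`
}
```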

- Remove goroutine
- Refactor timeout name to requestTimeout
@constanca-m
Contributor Author

it seems like it would be straightforward to add a configuration entry that disables the cache.

I added this option in my last commit. I also added the proper unit test for it.


	// if value is still not present in cache, it is possible we haven't tried to fetch it yet
	if !ok {
		value, ok := p.addToCache(key)
Contributor


Why grab the lock inside of the function to set the new value, and then only 4 lines below grab it again to set the lastAccess time?

Why not just set lastAccess inside of addToCache and then return inside of the if statement? That would prevent the need to grab the lock twice for the same key.

Contributor Author


Why not just set lastAccess inside of addToCache and then return inside of the if statement?

Because then we would be changing lastAccess in two functions, when we don't need to. The logic is that lastAccess is only changed when it is required (that is, when the value is read from the cache).

Contributor


To me it's worse to grab a lock, release the lock, grab the lock again, and then release it again, all in the same path. It should grab the lock, do what it needs to do, and release it.
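
Sketched with the hypothetical names used earlier (not the actual code), the suggestion amounts to recording lastAccess in the same critical section that inserts the entry:

```go
// Hypothetical sketch of the suggestion: fetch the missing secret and record
// both the value and lastAccess under one lock acquisition instead of two.
func (c *secretsCache) addToCache(key string) (string, bool) {
	value, ok := fetchFromAPIServer(key)
	if !ok {
		return "", false
	}
	c.mu.Lock()
	c.secrets[key] = cacheEntry{value: value, lastAccess: time.Now()}
	c.mu.Unlock()
	return value, true
}
```

Fetch could then return the result of addToCache directly instead of taking the lock a second time just to stamp lastAccess.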

Contributor

@blakerouse blakerouse left a comment


Thanks for all the changes and fixes. I am glad a solution could be found for not holding the lock the entire time a refresh loop runs.

My last point about a hold/release followed by another hold/release is not a blocker; it just seems like it would be fewer CPU cycles to grab the lock once. I understand you're trying to keep the code DRY, but for something as simple as setting a single timestamp attribute I think less lock contention would be worth it.
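
For reference, one common pattern for refreshing without holding the lock for the whole loop (a sketch under the same hypothetical names; not necessarily the solution implemented in this PR): snapshot the keys under a read lock, call the API Server with no lock held, then apply the results under a short write lock.

```go
// Hypothetical sketch: refresh without holding the lock while talking to the
// API Server.
func (c *secretsCache) refreshWithoutHoldingLock() {
	// 1. Snapshot the keys under a read lock.
	c.mu.RLock()
	keys := make([]string, 0, len(c.secrets))
	for key := range c.secrets {
		keys = append(keys, key)
	}
	c.mu.RUnlock()

	// 2. Fetch new values without holding any lock.
	updated := make(map[string]string, len(keys))
	for _, key := range keys {
		if value, ok := fetchFromAPIServer(key); ok {
			updated[key] = value
		}
	}

	// 3. Apply the results under a short write lock, skipping keys that were
	//    evicted in the meantime.
	c.mu.Lock()
	for key, value := range updated {
		if entry, ok := c.secrets[key]; ok {
			entry.value = value
			c.secrets[key] = entry
		}
	}
	c.mu.Unlock()
}
```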

@constanca-m constanca-m merged commit 52a4275 into elastic:main Dec 18, 2023
9 checks passed
@constanca-m constanca-m deleted the kubernetes-cache-secrets branch December 18, 2023 07:14