
[Kubernetes secret provider] Add cache for the secrets #3822

Merged
merged 29 commits into elastic:main from kubernetes-cache-secrets on Dec 18, 2023

Conversation

Contributor

@constanca-m constanca-m commented Nov 27, 2023

What does this PR do?

Currently, we make a request to the API Server every time we need to get the value of a secret. This is explained in more detail here; the discussion is also present in the issue.

With this PR, we use a map to store the secrets, as follows (see the sketch further down):

  1. Once the Kubernetes secrets provider starts running, we launch a goroutine that updates the cache every TTL minutes. TTL is a new option in the provider configuration; by default it is set to 10 minutes.
  2. To update the cache, we range over every secret stored there, and make a new request to the API Server to obtain the current value.

The secrets are accessed from the outside through the Fetch function. If we try to retrieve a secret that is not present in our map, we make a request to the API Server to obtain its value and store it in the map. This function never updates the value of a secret that is already cached.

The only difference between this approach and the current one is that we now store the secrets and their values.

Warning: Sometimes the secret values are not updated in time, and we may be using stale ones until the next update happens. This is also the case today.
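
For illustration, a minimal sketch of this scheme. All identifiers below (secretsCache, fetchFromAPIServer, startRefresher, ...) are hypothetical and do not match the actual names in this PR:

```go
package kubernetessecrets

import (
	"sync"
	"time"
)

type secretsCache struct {
	mu      sync.RWMutex
	secrets map[string]string
	ttl     time.Duration // refresh interval; 10 minutes by default
}

// fetchFromAPIServer stands in for the real request to the Kubernetes API Server.
func fetchFromAPIServer(key string) (string, bool) { return "", false }

// Fetch returns a secret from the cache, requesting it from the API Server
// only when the key is not cached yet. It never refreshes an existing entry.
func (c *secretsCache) Fetch(key string) (string, bool) {
	c.mu.RLock()
	value, ok := c.secrets[key]
	c.mu.RUnlock()
	if ok {
		return value, true
	}
	value, ok = fetchFromAPIServer(key)
	if ok {
		c.mu.Lock()
		c.secrets[key] = value
		c.mu.Unlock()
	}
	return value, ok
}

// startRefresher launches the goroutine that re-fetches every cached secret
// from the API Server every TTL. Note that this version holds the lock for
// the whole refresh; the review discussion below is about avoiding that.
func (c *secretsCache) startRefresher(done <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(c.ttl)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				c.mu.Lock()
				for key := range c.secrets {
					if value, ok := fetchFromAPIServer(key); ok {
						c.secrets[key] = value
					}
				}
				c.mu.Unlock()
			}
		}
	}()
}
```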

Why is it important?

To stop overwhelming the API Server with secret requests.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

  1. Clone this branch.
  2. Follow these steps from the README file.

Related issues

Screenshots

There should not be a change in behavior apart from a decrease in calls to the API Server. Everything else works as before:

[screenshot]

@constanca-m constanca-m added the enhancement and Team:Cloudnative-Monitoring labels Nov 27, 2023
@constanca-m constanca-m requested a review from a team November 27, 2023 13:14
@constanca-m constanca-m self-assigned this Nov 27, 2023
@constanca-m constanca-m requested a review from a team as a code owner November 27, 2023 13:14
Contributor

mergify bot commented Nov 27, 2023

This pull request does not have a backport label. Could you fix it @constanca-m? 🙏
To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@pierrehilbert pierrehilbert added the Team:Elastic-Agent label Nov 27, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Contributor

gizas commented Nov 27, 2023

Reminder that we also need a documentation update for this change in https://www.elastic.co/guide/en/fleet/current/kubernetes_secrets-provider.html.

@gizas gizas requested a review from axw November 27, 2023 14:34
Contributor

@blakerouse blakerouse left a comment


The biggest issue I see in this PR is: how does a key get removed from the cache if it is no longer referenced from the policy? Let's say the policy references kubernetes_secrets.somenamespace.somesecret.value1 and then changes to kubernetes_secrets.somenamespace.somesecret.value2. This change will keep caching kubernetes_secrets.somenamespace.somesecret.value1 forever.

To solve the problem, you should add a last-accessed time per cached key, and during each update cycle remove keys that have not been accessed within a given interval. That will solve the issue I described above.
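
A minimal sketch of that idea, reusing the hypothetical names from the sketch in the PR description (not the actual code in this PR), with the cached value changed to a struct that records when it was last read:

```go
// Building on the earlier sketch: c.secrets becomes map[string]cacheEntry.
type cacheEntry struct {
	value      string
	lastAccess time.Time
}

// refresh re-fetches cached secrets and evicts entries that have not been
// read within `expire` (e.g. 1h), so keys no longer referenced by the policy
// eventually age out of the cache.
func (c *secretsCache) refresh(expire time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	for key, entry := range c.secrets {
		if now.Sub(entry.lastAccess) > expire {
			delete(c.secrets, key) // not accessed recently: drop it
			continue
		}
		if value, ok := fetchFromAPIServer(key); ok {
			entry.value = value
			c.secrets[key] = entry
		}
	}
}
```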

Contributor

gizas commented Nov 27, 2023

To solve the problem you should add a last accessed time per cached key

@blakerouse thanks for this. Makes sense.
We can also consider fetching the secrets from Kubernetes and checking again every X minutes: #3822 (comment), and caching if the secret exists or the value is different. Maybe that is simpler than checking a new field again.

Contributor Author

constanca-m commented Nov 27, 2023

To solve the problem you should add a last accessed time per cached key

@blakerouse thanks for this. Makes sense.

And should we also give that option to the user, or should we enforce that time ourselves, @gizas?

Edit: I enforced it as 1h. Currently, the user cannot change it.

- Update duration config format
- Update duration config format
@blakerouse
Contributor

This is also missing unit test coverage for these additions; that needs to be added as well.

@blakerouse
Contributor

Are there any blockers you still want to address, @blakerouse, since you did not remove the requested changes?

Yes I am still waiting on my comment about holding the lock the entire time the code is refreshing the cache. I think that is still a big issue that needs to be looked at.

@constanca-m
Contributor Author

Yes I am still waiting on my comment about holding the lock the entire time the code is refreshing the cache. I think that is still a big issue that needs to be looked at.

I don't have any explanation for that other than this comment. Is that a blocker, @blakerouse?

Member

@cmacknz cmacknz left a comment


Took another look; there is still at least one bug, and a few cases I could spot where splitting the cache changes across separate "read lock" and "write lock" sections introduces concurrency problems.

This is because the manipulation you are doing to the cache is no longer atomic across a single function body, which allows the cache to change in the middle of the function execution when you unlock the read lock and acquire the write lock.
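
To illustrate the kind of problem being described (a hypothetical example using the earlier sketch's names, not code from the PR diff):

```go
// Racy: the check under the read lock and the write under the write lock are
// not atomic, so another goroutine can insert, evict, or update the key in
// the gap between RUnlock and Lock.
func (c *secretsCache) updateIfPresentRacy(key, value string) {
	c.mu.RLock()
	_, ok := c.secrets[key]
	c.mu.RUnlock()
	// <-- another goroutine may evict or replace the entry right here
	if ok {
		c.mu.Lock()
		c.secrets[key] = cacheEntry{value: value, lastAccess: time.Now()}
		c.mu.Unlock() // may resurrect an evicted entry or clobber a newer value
	}
}

// Doing the check and the write under a single write lock keeps the operation atomic.
func (c *secretsCache) updateIfPresent(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if entry, ok := c.secrets[key]; ok {
		entry.value = value
		c.secrets[key] = entry
	}
}
```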

Member

@cmacknz cmacknz left a comment


One last thought: it seems like it would be straightforward to add a configuration entry that disables the cache.

That way, if there is some unexpected surprise, we can revert to the previous behavior without needing a release.
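
For example, the provider configuration could expose something like the following. This is a sketch only; the field names, keys, and tags here are made up and may not match what was actually added in the PR:

```go
// Hypothetical configuration sketch; names may differ from the PR.
type Config struct {
	// DisableCache reverts to the previous behavior of querying the API
	// Server on every secret lookup.
	DisableCache bool `config:"disable_cache"`
	// RefreshInterval (the TTL) is how often cached secret values are re-fetched.
	RefreshInterval time.Duration `config:"refresh_interval"`
	// ExpireAfter is how long an unread entry is kept before being evicted.
	ExpireAfter time.Duration `config:"expire_after"`
}
```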

- Remove goroutine
- Refactor timeout name to requestTimeout
@constanca-m
Contributor Author

it seems like it would be straightforward to add a configuration entry that disables the cache.

I added this option in my last commit. I also added the proper unit test for it.


	// if value is still not present in cache, it is possible we haven't tried to fetch it yet
	if !ok {
		value, ok := p.addToCache(key)
Contributor


Why grab the lock inside of the function to set the new value, and then only 4 lines below grab it again to set the lastAccess time?

Why not just set lastAccess inside of addToCache and then return inside of the if statement? That would prevent the need to grab the lock twice for the same key.

Contributor Author


Why not just set lastAccess inside of addToCache and then return inside of the if statement?

Because then we would be changing lastAccess in two functions, when we don't need to. The logic is that lastAccess is only changed when it is required (that is, when the value is read from the cache).

Contributor


To me it's worse to grab a lock, release the lock, grab the lock again, and then release it again, all in the same path. It should grab the lock, do what it needs to do, and release it.
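
Sketched with the hypothetical names used earlier (not the actual code), the suggestion amounts to recording lastAccess in the same critical section that inserts the entry:

```go
// Hypothetical sketch of the suggestion: fetch the missing secret and record
// both the value and lastAccess under one lock acquisition instead of two.
func (c *secretsCache) addToCache(key string) (string, bool) {
	value, ok := fetchFromAPIServer(key)
	if !ok {
		return "", false
	}
	c.mu.Lock()
	c.secrets[key] = cacheEntry{value: value, lastAccess: time.Now()}
	c.mu.Unlock()
	return value, true
}
```

Fetch could then return the result of addToCache directly instead of taking the lock a second time just to stamp lastAccess.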

Contributor

@blakerouse blakerouse left a comment


Thanks for all the changes and fixes. I am glad a solution could be found for not holding the lock the entire time a refresh loop runs.

My last point about a hold/release followed by another hold/release is not a blocker; it just seems like it would be fewer CPU cycles to grab the lock once. I understand you're trying to keep the code DRY, but for something as simple as setting a single timestamp attribute I think less lock contention would be worth it.
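
For reference, one common pattern for refreshing without holding the lock for the whole loop (a sketch under the same hypothetical names; not necessarily the solution implemented in this PR): snapshot the keys under a read lock, call the API Server with no lock held, then apply the results under a short write lock.

```go
// Hypothetical sketch: refresh without holding the lock while talking to the
// API Server.
func (c *secretsCache) refreshWithoutHoldingLock() {
	// 1. Snapshot the keys under a read lock.
	c.mu.RLock()
	keys := make([]string, 0, len(c.secrets))
	for key := range c.secrets {
		keys = append(keys, key)
	}
	c.mu.RUnlock()

	// 2. Fetch new values without holding any lock.
	updated := make(map[string]string, len(keys))
	for _, key := range keys {
		if value, ok := fetchFromAPIServer(key); ok {
			updated[key] = value
		}
	}

	// 3. Apply the results under a short write lock, skipping keys that were
	//    evicted in the meantime.
	c.mu.Lock()
	for key, value := range updated {
		if entry, ok := c.secrets[key]; ok {
			entry.value = value
			c.secrets[key] = entry
		}
	}
	c.mu.Unlock()
}
```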

@constanca-m constanca-m merged commit 52a4275 into elastic:main Dec 18, 2023
9 checks passed
@constanca-m constanca-m deleted the kubernetes-cache-secrets branch December 18, 2023 07:14