-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow returning stale data from tenants cache #665
Conversation
In the case where an entry in the cache was expired, if the data could not be refreshed by requesting it to the API, an error was returned. This implied a blocking behavior for various actions, such as: - Check executions: If limits could not be verified. - Publish events: If tenant remotes could not be obtained. These actions should continue to run even if the agents lose connection with the API, or the API is behaving abnormally. Therefore, this change favours returning stale tenants data from the cache in case it can not be refreshed by requesting it to the API. The worst case scenario for each mentioned case after this change is: - Check execution: The check will be executed with limits that might have changed. - Publish: The metrics and logs will be pushed using remotes that might have changed. Both cases can lead to metrics and logs backends to reject our requests. But these changes happen infrequently, so most often the data returned by the cache will still be valid and checks will be able to run even if the agent is disconnected from the API.
// wait for tenants to expire | ||
time.Sleep(cacheExpirationTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a side thought: we could maybe use something like a TestClock
to test cache expirations in general - instead of calling time.Sleep
we would do something like testClock.Advance(200 ms)
and it would validate that these expirations work without having to pause execution.
Then you don't have to consider reducing the timeout from 500ms
-> 200ms
in tests as well 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are absolutely right that sleeping in a test is not ideal, but mocking time is not so straightforward, as we would have to define an interface for it, have two different implementations, one for std and one for our "test time", and have an instance for it in the cache, so it gets kind of cumbersome. For other cases where it's required, such as when using a ticker, I usually do this, but for this case sleeping some time seems reasonable 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
* dependabot: group prometheus updates (#664) * Chore(deps): Bump the prometheus-go group with 2 updates (#668) * Allow returning stale data from tenants cache (#665) * Chore(deps): Bump the prometheus-go group with 2 updates (#669) * Chore(deps): Bump golang.org/x/net from 0.22.0 to 0.23.0 (#670) * Chore(deps): Bump golang.org/x/net from 0.23.0 to 0.24.0 (#672) * Chore(deps): Bump golang.org/x/sync from 0.6.0 to 0.7.0 (#673) * Update grafana-build-tools to v0.10.0 (#676) Signed-off-by: Marcelo E. Magallon <[email protected]>
* dependabot: group prometheus updates (#664) * Chore(deps): Bump the prometheus-go group with 2 updates (#668) * Allow returning stale data from tenants cache (#665) * Chore(deps): Bump the prometheus-go group with 2 updates (#669) * Chore(deps): Bump golang.org/x/net from 0.22.0 to 0.23.0 (#670) * Chore(deps): Bump golang.org/x/net from 0.23.0 to 0.24.0 (#672) * Chore(deps): Bump golang.org/x/sync from 0.6.0 to 0.7.0 (#673) * Update grafana-build-tools to v0.10.0 (#676) Signed-off-by: Marcelo E. Magallon <[email protected]>
In the case where an entry in the cache was expired, if the data could not be refreshed by requesting it to the API, an error was returned. This implied a blocking behavior for various actions, such as:
These actions should continue to run even if the agents lose connection with the API, or the API is behaving abnormally. Therefore, this change favours returning stale tenants data from the cache in case it can not be refreshed by requesting it to the API.
The worst case scenario for each mentioned case after this change is:
Both cases can lead to metrics and logs backends to reject our requests. But these changes happen infrequently, so most often the data returned by the cache will still be valid and checks will be able to run even if the agent is disconnected from the API. And even in the case on which these errors happen, is better than blocking preventively.