Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow returning stale data from tenants cache #665

Merged
merged 1 commit into from
Apr 2, 2024

Conversation

ka3de
Copy link
Contributor

@ka3de ka3de commented Mar 25, 2024

In the case where an entry in the cache was expired, if the data could not be refreshed by requesting it to the API, an error was returned. This implied a blocking behavior for various actions, such as:

These actions should continue to run even if the agents lose connection with the API, or the API is behaving abnormally. Therefore, this change favours returning stale tenants data from the cache in case it can not be refreshed by requesting it to the API.

The worst case scenario for each mentioned case after this change is:

  • Check execution: The check will be executed with limits that might have changed.
  • Publish: The metrics and logs will be pushed using remotes that might have changed.

Both cases can lead to metrics and logs backends to reject our requests. But these changes happen infrequently, so most often the data returned by the cache will still be valid and checks will be able to run even if the agent is disconnected from the API. And even in the case on which these errors happen, is better than blocking preventively.

In the case where an entry in the cache was expired, if the data could
not be refreshed by requesting it to the API, an error was returned.
This implied a blocking behavior for various actions, such as:
- Check executions: If limits could not be verified.
- Publish events: If tenant remotes could not be obtained.

These actions should continue to run even if the agents lose connection
with the API, or the API is behaving abnormally. Therefore, this change
favours returning stale tenants data from the cache in case it can not
be refreshed by requesting it to the API.

The worst case scenario for each mentioned case after this change is:
- Check execution: The check will be executed with limits that might
have changed.
- Publish: The metrics and logs will be pushed using remotes that might
have changed.
Both cases can lead to metrics and logs backends to reject our requests.
But these changes happen infrequently, so most often the data returned
by the cache will still be valid and checks will be able to run even if
the agent is disconnected from the API.
@ka3de ka3de marked this pull request as ready for review March 26, 2024 08:40
@ka3de ka3de requested a review from a team as a code owner March 26, 2024 08:40
Comment on lines +151 to +152
// wait for tenants to expire
time.Sleep(cacheExpirationTime)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a side thought: we could maybe use something like a TestClock to test cache expirations in general - instead of calling time.Sleep we would do something like testClock.Advance(200 ms) and it would validate that these expirations work without having to pause execution.

Then you don't have to consider reducing the timeout from 500ms -> 200ms in tests as well 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right that sleeping in a test is not ideal, but mocking time is not so straightforward, as we would have to define an interface for it, have two different implementations, one for std and one for our "test time", and have an instance for it in the cache, so it gets kind of cumbersome. For other cases where it's required, such as when using a ticker, I usually do this, but for this case sleeping some time seems reasonable 😕

Copy link
Contributor

@The-9880 The-9880 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@ka3de ka3de merged commit 7e0454a into main Apr 2, 2024
4 checks passed
@ka3de ka3de deleted the fix/blocking-tenants-cache branch April 2, 2024 07:41
mem added a commit that referenced this pull request Apr 8, 2024
* dependabot: group prometheus updates (#664)
* Chore(deps): Bump the prometheus-go group with 2 updates (#668)
* Allow returning stale data from tenants cache (#665)
* Chore(deps): Bump the prometheus-go group with 2 updates (#669)
* Chore(deps): Bump golang.org/x/net from 0.22.0 to 0.23.0 (#670)
* Chore(deps): Bump golang.org/x/net from 0.23.0 to 0.24.0 (#672)
* Chore(deps): Bump golang.org/x/sync from 0.6.0 to 0.7.0 (#673)
* Update grafana-build-tools to v0.10.0 (#676)

Signed-off-by: Marcelo E. Magallon <[email protected]>
@mem mem mentioned this pull request Apr 8, 2024
mem added a commit that referenced this pull request Apr 8, 2024
* dependabot: group prometheus updates (#664)
* Chore(deps): Bump the prometheus-go group with 2 updates (#668)
* Allow returning stale data from tenants cache (#665)
* Chore(deps): Bump the prometheus-go group with 2 updates (#669)
* Chore(deps): Bump golang.org/x/net from 0.22.0 to 0.23.0 (#670)
* Chore(deps): Bump golang.org/x/net from 0.23.0 to 0.24.0 (#672)
* Chore(deps): Bump golang.org/x/sync from 0.6.0 to 0.7.0 (#673)
* Update grafana-build-tools to v0.10.0 (#676)

Signed-off-by: Marcelo E. Magallon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants