Allow returning stale data from tenants cache #665

ka3de · 2024-03-25T08:16:28Z

Previously, when a cache entry had expired and the data could not be refreshed from the API, an error was returned. This blocked several actions, such as:

- Check executions: If limits could not be verified.
- Publish events: If tenant remotes could not be obtained.

These actions should continue to run even if the agents lose connection with the API, or the API is behaving abnormally. Therefore, this change favours returning stale tenant data from the cache when it cannot be refreshed from the API.

The worst case scenario for each mentioned case after this change is:

- Check execution: The check will be executed with limits that might have changed.
- Publish: The metrics and logs will be pushed using remotes that might have changed.

Both cases can lead the metrics and logs backends to reject our requests. However, these changes happen infrequently, so most often the data returned by the cache will still be valid and checks will be able to run even if the agent is disconnected from the API. And even when these errors do happen, that is better than preventively blocking.


The-9880 · 2024-03-26T15:28:30Z

internal/tenants/manager_test.go

+	// wait for tenants to expire
+	time.Sleep(cacheExpirationTime)


Just a side thought: we could maybe use something like a TestClock to test cache expirations in general - instead of calling time.Sleep we would do something like testClock.Advance(200 ms) and it would validate that these expirations work without having to pause execution.

Then you don't have to consider reducing the timeout from 500ms -> 200ms in tests as well 😄

ka3de · 2024-04-02

You are absolutely right that sleeping in a test is not ideal, but mocking time is not so straightforward: we would have to define an interface for it, provide two implementations (one backed by the standard library and one for our "test time"), and hold an instance of it in the cache, so it gets kind of cumbersome. For other cases where it's required, such as when using a ticker, I usually do this, but for this case sleeping for some time seems reasonable 😕

The-9880

Looks good to me!

* dependabot: group prometheus updates (#664)
* Chore(deps): Bump the prometheus-go group with 2 updates (#668)
* Allow returning stale data from tenants cache (#665)
* Chore(deps): Bump the prometheus-go group with 2 updates (#669)
* Chore(deps): Bump golang.org/x/net from 0.22.0 to 0.23.0 (#670)
* Chore(deps): Bump golang.org/x/net from 0.23.0 to 0.24.0 (#672)
* Chore(deps): Bump golang.org/x/sync from 0.6.0 to 0.7.0 (#673)
* Update grafana-build-tools to v0.10.0 (#676)

Signed-off-by: Marcelo E. Magallon <[email protected]>

ka3de marked this pull request as ready for review March 26, 2024 08:40

ka3de requested a review from a team as a code owner March 26, 2024 08:40

The-9880 reviewed Mar 26, 2024


The-9880 approved these changes Mar 26, 2024


ka3de merged commit 7e0454a into main Apr 2, 2024
4 checks passed

ka3de deleted the fix/blocking-tenants-cache branch April 2, 2024 07:41

mem mentioned this pull request Apr 8, 2024

Release v0.23.2 #677

Merged
