Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent cache backend reuse #11022

Closed
wants to merge 3 commits into from
Closed

Conversation

espadolini
Copy link
Contributor

@espadolini espadolini commented Mar 10, 2022

We're currently relying on opening the same sqlite databases as the backend for our caches (in data_dir/cache/componentname) because we also use the cache as some sort of persistent resource storage to be used in case auth is not available at start time; however, all non-auth services except for SSH will outright just block waiting for a connection to auth at init time, and the SSH service will get halfway through the init and then it will block waiting to see if the connection to auth is direct (meaning that the service should open a listening port) or goes through the proxy (meaning that the service should only serve connections through the reverse tunnel) - as it stands, cold starts and restarts while auth is not available are effectively only working for ssh if proxy is available and the ssh service runs in reverse tunnel mode (and even then, I'm not sure that anything would be usable while the proxy itself has no connectivity to auth).

In addition, during restarts (both in-process restarts caused primarily by CA rotations and graceful process restarts caused by HUP/USR2) we end up opening multiple instances of Cache that use the same underlying storage (unless we're using in-memory caches). There isn't (and probably can't be) a mechanism that lets two caches gracefully use the same backend storage, especially not across the process boundary (the fanout system is entirely in memory), and as it stands, the caches will fight between each other and leave the underlying storage (and thus the actual view that the cache has over the upstream) in a potentially incoherent state.

Given that the persistent cache storage is not actually used in practice (as soon as a connection to auth is established each cache will wipe itself and fill itself with fresh data) and that the particular way it's supposed to be used is actually broken in every occasion except a cold start, this PR removes the concept of cache backend reuse, by changing the default cache type to in-memory, setting up a temporary directory that's unique for each TeleportProcess instance to house the sqlite cache backends if the cache type is explicitly configured like that, and removing the (useless, at that point) cache tombstone mechanism.

Support for on-disk caches is maintained for the sake of (partial) backwards compatibility, even though memory usage for an in-memory cache for the Node component wouldn't meaningfully increase over the memory used by RBAC for one or two inbound connections anyway.

Persistent resource storage for startup while auth is not available should be handled in some other, more explicit (and perhaps opt-in?) way but is out of scope for this PR which just addresses the cache correctness problem that we currently have during restarts with on-disk caches.

Product decision:

  • this PR changes the default cache type to in-memory. Should we leave it as sqlite instead? Should we get rid of sqlite caches altogether?

@espadolini espadolini force-pushed the espadolini/sqlite-corruption branch from 65c2539 to 4fc8b86 Compare March 10, 2022 11:06
@espadolini espadolini force-pushed the espadolini/ephemeral-cache branch from 7f934ad to 3d6abeb Compare March 10, 2022 11:06
@espadolini espadolini changed the base branch from espadolini/sqlite-corruption to master March 10, 2022 11:10
@espadolini espadolini force-pushed the espadolini/ephemeral-cache branch from 3d6abeb to 982e4ff Compare March 10, 2022 11:11
@espadolini espadolini changed the base branch from master to espadolini/sqlite-corruption March 10, 2022 11:47
@espadolini espadolini added the robustness Resistance to crashes and reliability label Mar 11, 2022
lib/service/service.go Outdated Show resolved Hide resolved
Base automatically changed from espadolini/sqlite-corruption to master March 15, 2022 16:54
@espadolini espadolini force-pushed the espadolini/ephemeral-cache branch 2 times, most recently from d30ba60 to 65b01f4 Compare March 15, 2022 17:31
@espadolini espadolini marked this pull request as ready for review March 15, 2022 17:43
@espadolini espadolini requested a review from rosstimothy March 15, 2022 17:43
@espadolini espadolini force-pushed the espadolini/ephemeral-cache branch 2 times, most recently from 38f26ba to 5b5c856 Compare March 17, 2022 16:35
As the cache system cannot support multiple caches running over the same
backend storage, we create a temporary cacheDir for each TeleportProcess
instance, and we use it to store all the sqlite backends for the caches.

With on-disk caches being single-use, we can run the underlying sqlite
backend with the MEMORY rollback journal to improve performance and
reduce filesystem activity.
As we're never reopening the same cache backend twice, this is no longer
useful.
@espadolini
Copy link
Contributor Author

espadolini commented Mar 23, 2022

Closed as per @zmb3 in favor of disabling on-disk caches altogether in #11386.

@espadolini espadolini closed this Mar 23, 2022
@espadolini espadolini deleted the espadolini/ephemeral-cache branch March 23, 2022 14:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
robustness Resistance to crashes and reliability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants