Listen for ConfigMap updates to mitigate timing issues during install #2671
Comments
I thought the problem we saw was that the proxy auto-injector too eagerly loaded a newer config than it knew how to support -- i.e. it's already hot-loading configs. Wouldn't hot-loading configs worsen the existing problem?
I bet that it happens right at the start of the inject handler, before the injector even gets a chance to decide whether to skip the pod mutation or not.
In some cases we are hot-loading; in others we are reading at startup time. The goal of this ticket is to identify and fix the failure modes where there is a mismatch between the ConfigMap version and the control-plane component version, typically manifesting during install/upgrade when things are changing.
Ok, good. I think we explicitly do NOT want to hot-reload configs in some places, then... This needs to be audited on a case-by-case basis. Specifically, the identity service should NOT be able to hot-reload its trust roots, etc.
We're currently hot-reloading in the proxy auto-injector and the gRPC

The auto-injector appears to behave fine across config changes, except in the particular case where we upgrade from any version between 2.2.1 and Edge-19.3.3 (as of Edge-19.4.1 we no longer fail when the config contains unknown fields) and the updated ConfigMap happens to be available before the auto-injector gets redeployed. I don't see what we can do in this case, because it's the pre-update code that is causing the failure.

Both the Destination and Identity services use the config only to extract Identity context info. Should we apply the version check @siggy suggests? If so, what should we do when the check fails? Block startup until the right version is available? (From what I gather, ConfigMap updates can take up to a minute at worst.)

The Web service is only concerned with the UUID generated during install (or during upgrade, if missing). If upgrading from a version with no UUID in the config and the Web service starts first, the UUID will be empty (I tested this), but it will eventually be populated whenever there's a redeployment in the future. Is it worth also adding a watch for this?
One approach would be to create an annotation on each control-plane component that is a hash of the The highest priority part of this task is ensuring the proxy auto-injector does not hot-reload a new config and break during |
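The annotation idea above could be sketched as follows. The annotation key `linkerd.io/config-hash` and the helper are hypothetical, not Linkerd's actual scheme; the point is that stamping a config hash onto each Deployment's pod template makes any config change alter the template, so Kubernetes rolls the pods against the new ConfigMap:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configHashAnnotation returns an annotation key/value pair derived from
// the rendered linkerd-config contents. Placed on a Deployment's pod
// template, a changed value forces a rolling restart of the pods.
func configHashAnnotation(config []byte) (string, string) {
	sum := sha256.Sum256(config)
	// "linkerd.io/config-hash" is a hypothetical key for illustration.
	return "linkerd.io/config-hash", hex.EncodeToString(sum[:8])
}

func main() {
	key, val := configHashAnnotation([]byte(`{"linkerdVersion":"edge-19.4.1"}`))
	fmt.Printf("%s: %s\n", key, val)
}
```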
Even after multi-stage upgrade we still have this problem, although I can only reproduce it by contrived means:
@siggy your annotations suggestion won't do the trick for this particular issue, because it's the old deployed proxy injector that is preventing the upgrade from completing, never even giving itself a chance to be redeployed. I believe that, given this is a problem with the edge releases and not a stable version, and it's a rare condition, we can just add a note to the 2.4 upgrade notes advising users to delete the proxy injector deployment before upgrading, if coming from one of these edge versions. WDYT?
@alpeb This all makes sense, and I agree we can address this through upgrade docs. I'm ok closing this issue for now. This ticket was written immediately after we implemented upgrade and observed a linkerd-config / auto-injector issue, and the goal was to get an understanding of this class of issue. 👍
Great, thanks everyone for the feedback!
Problem statement
The `linkerd upgrade` command may deploy and restart new control-plane components before an updated `linkerd-config` ConfigMap is available. Because the control-plane components read `linkerd-config` via a volume mount at startup time, any subsequent updates to the ConfigMap go unnoticed.

Proposal
Modify the control-plane components to poll the ConfigMap, or watch for changes via the Kubernetes API. Also validate that the ConfigMap is the expected version.
Open questions
Upcoming multi-stage install work may render this work moot, if we can guarantee that ConfigMaps are always deployed in a stage prior to dependent pods starting up:
#2656 (comment)