
Listen for ConfigMap updates to mitigate timing issues during install #2671

Closed

siggy opened this issue Apr 9, 2019 · 10 comments

siggy (Member) commented Apr 9, 2019

Problem statement

The linkerd upgrade command may deploy and restart new control-plane components before an updated linkerd-config ConfigMap is available. Because the control-plane components read linkerd-config via a volume mount at startup time, any subsequent updates to the ConfigMap go unnoticed.

Proposal

Modify the control-plane components to poll the ConfigMap, or watch for changes via the Kubernetes API. Also validate that the ConfigMap is the expected version.

Open questions

Upcoming multi-stage install work may render this work moot, if we can guarantee that ConfigMaps are always deployed in a stage prior to dependent pods starting up:
#2656 (comment)

@siggy siggy changed the title Read ConfigMaps from K8s to mitigate timing issues during install Poll ConfigMaps to mitigate timing issues during install Apr 9, 2019
@siggy siggy changed the title Poll ConfigMaps to mitigate timing issues during install Listen for ConfigMap updates to mitigate timing issues during install Apr 9, 2019
olix0r (Member) commented Apr 9, 2019

I thought the problem we saw was that the proxy-autoinjector too eagerly loaded a newer config than it knew how to support -- i.e. it's already hot-loading configs. Wouldn't hot-loading configs worsen the existing problem?

ihcsim (Contributor) commented Apr 9, 2019

> I thought the problem we saw was that the proxy-autoinjector too eagerly loaded a newer config than it knew how to support

I bet that it happens right at the start of the inject handler, before the injector even gets a chance to decide whether to skip the pod mutation or not.

siggy (Member, Author) commented Apr 9, 2019

In some cases we are hot-loading, in others we are reading at startup time:
https://github.com/search?q=MountPathGlobalConfig&type=Code
https://github.com/search?q=MountPathInstallConfig&type=Code
https://github.com/search?q=MountPathProxyConfig&type=Code

The goal of this ticket is to identify and fix the failure modes where there is a mismatch between the ConfigMap version and the control-plane component version, typically manifesting during install/upgrade when things are changing.

olix0r (Member) commented Apr 9, 2019

> The goal of this ticket is to identify and fix the failure modes where there is a mismatch between the ConfigMap version and the control-plane component version, typically manifesting during install/upgrade when things are changing.

Ok, good. I think we explicitly do NOT want to hot-reload configs in some places, then... This needs to be audited on a case-by-case basis. Specifically, the identity service should NOT be able to hot-reload its trust roots, etc.

@siggy siggy added the priority/P0 Release Blocker label Apr 9, 2019
@alpeb alpeb self-assigned this Apr 16, 2019
alpeb (Member) commented Apr 16, 2019

We're currently hot-reloading in the proxy auto-injector and the gRPC Config endpoint, and not hot-reloading everywhere else (Destination, Identity and Web services). That setup sounds correct to me.

The auto-injector appears to handle config changes fine, except in the particular case when we upgrade from any version between 2.2.1 and edge-19.3.3 (as of edge-19.4.1 we no longer fail when the config contains unknown fields) and the updated ConfigMap happens to be available before the auto-injector gets redeployed. I don't see what we can do in this case, because it's the pre-upgrade code that is causing the failure.

Both the Destination and Identity services use the config only to extract Identity context info. Should we apply the version check @siggy suggests? If so, what should we do if the check fails? Block startup until the right version is available? (From what I gather, ConfigMap updates can take up to a minute to propagate at worst.)

The Web service is only concerned with the UUID generated during install (or during upgrade, if missing). If upgrading from a version with no UUID in the config and the Web service starts first, the UUID will be empty (I tested this), but it will eventually be populated whenever there's a redeployment in the future. Is it worth also adding a watch for this?

siggy (Member, Author) commented Apr 16, 2019

@alpeb Bear in mind that some of these timing issues may be resolved via #2337.

siggy (Member, Author) commented Apr 22, 2019

One approach would be to add an annotation to each control-plane component containing a hash of linkerd-config, so that changing linkerd-config during linkerd upgrade causes an annotation change, which in turn triggers a redeploy of the pod spec.

The highest priority part of this task is ensuring the proxy auto-injector does not hot-reload a new config and break during linkerd upgrade.

alpeb (Member) commented May 2, 2019

Even after the multi-stage upgrade work, we still have this problem, although I can only reproduce it by contrived means:

  • Only when upgrading from edge-19.2.5, edge-19.3.1, edge-19.3.2 or edge-19.3.3. These are the versions that started shipping the linkerd-config CM, but whose proxy-injector failed whenever there was a new field in the JSON config.
  • And only when I extract the linkerd-config CM from linkerd upgrade, apply it, wait for a minute for it to propagate into the pods (yes it takes a while), and then apply the rest of the upgrade YAML.

@siggy your annotations suggestion won't do the trick for this particular issue, because it's the previously deployed proxy injector that is blocking the upgrade from completing, before it even gets a chance to be redeployed itself.

I believe that, given this is a problem with edge releases and not a stable version, and it's a rare condition, we can just add a note to the 2.4 upgrade notes advising users to delete the proxy injector deployment before upgrading if coming from one of those edge versions. WDYT?

siggy (Member, Author) commented May 6, 2019

@alpeb This all makes sense, and I agree we can address this through upgrade docs. I'm ok with closing this issue for now. This ticket was written immediately after we implemented upgrade and observed a linkerd-config / auto-injector issue; the goal was to get an understanding of this class of issue. 👍

alpeb (Member) commented May 6, 2019

Great, thanks everyone for the feedback!

@alpeb alpeb closed this as completed May 6, 2019
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021