
Listen for ConfigMap updates to mitigate timing issues during install #2671

Closed

siggy opened this issue Apr 9, 2019 · 10 comments

siggy (Member) commented Apr 9, 2019

Problem statement

The linkerd upgrade command may deploy and restart new control-plane components before an updated linkerd-config ConfigMap is available. Because the control-plane components read linkerd-config via a volume mount at startup time, any subsequent updates to the ConfigMap go unnoticed.

Proposal

Modify the control-plane components to poll the ConfigMap, or watch for changes via the Kubernetes API. Also validate that the ConfigMap is the expected version.

Open questions

Upcoming multi-stage install work may render this work moot, if we can guarantee that ConfigMaps are always deployed in a stage prior to dependent pods starting up:
#2656 (comment)

@siggy siggy changed the title Read ConfigMaps from K8s to mitigate timing issues during install Poll ConfigMaps to mitigate timing issues during install Apr 9, 2019
@siggy siggy changed the title Poll ConfigMaps to mitigate timing issues during install Listen for ConfigMap updates to mitigate timing issues during install Apr 9, 2019
olix0r (Member) commented Apr 9, 2019

I thought the problem we saw was that the proxy-autoinjector too eagerly loaded a newer config than it knew how to support -- i.e. it's already hot-loading configs. Wouldn't hot-loading configs worsen the existing problem?

ihcsim (Contributor) commented Apr 9, 2019

> I thought the problem we saw was that the proxy-autoinjector too eagerly loaded a newer config than it knew how to support

I bet that it happens right at the start of the inject handler, before the injector even gets a chance to decide whether to skip the pod mutation or not.

siggy (Member, Author) commented Apr 9, 2019

In some cases we are hot-loading, in others we are reading at startup time:
https://github.com/search?q=MountPathGlobalConfig&type=Code
https://github.com/search?q=MountPathInstallConfig&type=Code
https://github.com/search?q=MountPathProxyConfig&type=Code

The goal of this ticket is to identify and fix the failure modes where there is a mismatch between the ConfigMap version and the control-plane component version, typically manifesting during install/upgrade when things are changing.

olix0r (Member) commented Apr 9, 2019

> The goal of this ticket is to identify and fix the failure modes where there is a mismatch between the ConfigMap version and the control-plane component version, typically manifesting during install/upgrade when things are changing.

Ok, good. I think we explicitly do NOT want to hot-reload configs in some places, then... This needs to be audited on a case-by-case basis. Specifically, the identity service should NOT be able to hot-reload its trust roots, etc.

@siggy siggy added the priority/P0 Release Blocker label Apr 9, 2019
@alpeb alpeb self-assigned this Apr 16, 2019
alpeb (Member) commented Apr 16, 2019

We're currently hot-reloading in the proxy auto-injector and the gRPC Config endpoint, and not hot-reloading everywhere else (Destination, Identity and Web services). That setup sounds correct to me.

The auto-injector appears to handle config changes fine, except in the particular case when we upgrade from any version between 2.2.1 and edge-19.3.3 (as of edge-19.4.1 we no longer fail when the config contains unknown fields) and the updated ConfigMap happens to be available before the auto-injector gets redeployed. I don't see what we can do in this case, because it's the pre-upgrade code that is causing the failure.

Both the Destination and Identity services use the config only to extract Identity context info. Should we apply the version check @siggy suggests? If so, what should we do if the check fails? Block startup until the right version is available? (From what I gather, ConfigMap updates can take up to a minute to propagate at worst.)

The Web service is only concerned with the UUID generated during install (or during upgrade, if missing). If upgrading from a version with no UUID in the config and the Web service starts first, the UUID will be empty (I tested this), but it will eventually be populated whenever there's a redeployment in the future. Is it worth also adding a watch for this?

siggy (Member, Author) commented Apr 16, 2019

@alpeb Bear in mind that some of these timing issues may be resolved via #2337.

siggy (Member, Author) commented Apr 22, 2019

One approach would be to add an annotation to each control-plane component containing a hash of linkerd-config, so that changing linkerd-config during linkerd upgrade causes an annotation change, which in turn triggers a redeploy of the pod spec.

The highest priority part of this task is ensuring the proxy auto-injector does not hot-reload a new config and break during linkerd upgrade.

alpeb (Member) commented May 2, 2019

Even after the multi-stage upgrade work, we still have this problem, although I can only reproduce it by contrived means:

  • Only when upgrading from edge-19.2.5, edge-19.3.1, edge-19.3.2 or edge-19.3.3. These are the versions that started shipping the linkerd-config CM, but whose proxy-injector failed whenever there was a new field in the JSON config.
  • And only when I extract the linkerd-config CM from linkerd upgrade, apply it, wait for a minute for it to propagate into the pods (yes it takes a while), and then apply the rest of the upgrade YAML.

@siggy your annotations suggestion won't do the trick for this particular issue, because it's the previously deployed proxy injector that is blocking the upgrade from completing, before it even gets a chance to be redeployed itself.

I believe that, given this is a problem with edge releases and not a stable version, and it's a rare condition, we can just add a note to the 2.4 upgrade notes advising users to delete the proxy injector deployment before upgrading if coming from one of those edge versions. WDYT?

siggy (Member, Author) commented May 6, 2019

@alpeb This all makes sense, and I agree we can address this through upgrade docs. I'm ok with closing this issue for now. This ticket was written immediately after we implemented upgrade and observed a linkerd-config / auto-injector issue; the goal was to get an understanding of this class of issue. 👍

alpeb (Member) commented May 6, 2019

Great, thanks everyone for the feedback!

@alpeb alpeb closed this as completed May 6, 2019
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021