
Should nmi (Pod Managed Identity) add-on tolerate all taints? #2146

Closed
ckittel opened this issue Feb 25, 2021 · 17 comments
Labels
feature-request Requested Features resolution/answer-provided Provided answer to issue, question or feedback.

Comments

@ckittel
Member

ckittel commented Feb 25, 2021

My cluster has four node pools (one system, three user). Two of the three user node pools share a taint, and the remaining one has its own (different) taint, so every user node pool is tainted and every workload in this cluster needs the appropriate tolerations (one or both of those). I was caught by surprise when nmi was subject to the taints. All other system add-ons seem to have the standard :NoExecute op=Exists, :NoSchedule op=Exists, and CriticalAddonsOnly op=Exists tolerations, but nmi has no tolerations (other than the default DaemonSet node-condition tolerations, e.g. disk-pressure, pid-pressure, etc.).

Two observations:

  1. nmi doesn't have the CriticalAddonsOnly toleration. For me this is fine: if I taint the system node pool (as suggested by the docs), nmi won't be scheduled there, and I have no need for it there. However, I could see some wanting it on the system node pool (even if you shouldn't be running user workloads there).
  2. Because nmi doesn't have the :NoExecute op=Exists and :NoSchedule op=Exists tolerations, my node pool taints used for the dedicated-nodes use case (as opposed to system-capacity concerns) block the deployment of nmi cluster-wide.

Should nmi -- as deployed as an add-on -- be more permissive in its tolerations?

Kubernetes version: 1.20.2
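To make the scenario concrete, here's a sketch of the dedicated-nodes pattern I'm describing (the taint key and value below are hypothetical, not the actual ones from my cluster):

```yaml
# Hypothetical taint applied to a dedicated user node pool:
#   workload=batch:NoSchedule
# Application pods opt in with a matching toleration:
tolerations:
  - key: workload
    operator: Equal
    value: batch
    effect: NoSchedule
# nmi has no toleration for this taint, and no blanket
# `operator: Exists` toleration, so its DaemonSet pods are
# never scheduled onto these nodes.
```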

@ghost ghost added the triage label Feb 25, 2021
@ghost

ghost commented Feb 25, 2021

Hi ckittel, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be addressed in the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@ghost ghost added the action-required label Feb 28, 2021
@ghost

ghost commented Feb 28, 2021

Triage required from @Azure/aks-pm

@ghost

ghost commented Mar 5, 2021

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Mar 5, 2021
@ghost

ghost commented Mar 20, 2021

Issue needing attention of @Azure/aks-leads

@miwithro
Contributor

@aramase @bcho

@ghost ghost added action-required and removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Mar 23, 2021
@ghost

ghost commented Mar 25, 2021

Triage required from @Azure/aks-pm

@ghost

ghost commented Mar 30, 2021

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Mar 30, 2021
@bcho
Member

bcho commented Mar 31, 2021

Ack, and thanks for the feedback. We will add these tolerations to the nmi deployment in the upcoming release. cc @miwithro

@ghost ghost added action-required and removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Mar 31, 2021
@ghost

ghost commented Apr 2, 2021

Triage required from @Azure/aks-pm

@ghost

ghost commented Apr 7, 2021

@Azure/aks-pm issue needs labels

@miwithro miwithro added feature feature-request Requested Features labels Apr 12, 2021
@palma21 palma21 removed the feature label May 12, 2021
@adammal

adammal commented Jul 20, 2021

This request would be very useful to us. We would like to make use of dedicated node pools, and at present the NMI DaemonSet is not deploying to our tainted node pool.
Any idea when this might become available?

@bcho
Member

bcho commented Jul 21, 2021

On the AKS side, we have already added these tolerations to the nmi daemonset:

      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists

@adammal do these tolerations work for you?

@adammal

adammal commented Jul 21, 2021

On the AKS side, we have already added these tolerations to the nmi daemonset:

      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists

@adammal do these tolerations work for you?

Thanks for responding. Unfortunately, no.
We have tainted our node pool with key=value:NoExecute

I've noticed that DaemonSets like kube-proxy and calico-node (for example) are configured as follows.

  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
    - operator: Exists
      effect: NoExecute
    - operator: Exists
      effect: NoSchedule

and these DaemonSets are deploying to the tainted node pool as expected.

Should I be tainting differently?

Kubernetes version: v1.19.11
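If I understand the toleration matching semantics correctly, the gap is that a toleration with an explicit `key` only tolerates taints with that exact key, so the `node.kubernetes.io/*` entries added to nmi can never match a custom `key=value:NoExecute` taint. The keyless `operator: Exists` form used by kube-proxy matches any key. A sketch, with a hypothetical taint:

```yaml
# Node tainted with (hypothetical): dedicated=ml:NoExecute
tolerations:
  # Does NOT match: the key "node.kubernetes.io/unreachable"
  # differs from the taint key "dedicated".
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
  # DOES match: an omitted key with operator Exists tolerates
  # every taint with the given effect, whatever its key.
  - operator: Exists
    effect: NoExecute
```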

@adammal

adammal commented Jul 21, 2021

I've also noticed that the same problem exists with the AKS-AzureKeyVaultSecretsProvider addon.

https://docs.microsoft.com/en-us/azure/aks/csi-secrets-store-driver

The two DaemonSets it establishes do not specify any tolerations and as a result do not deploy to tainted node pools either.
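Presumably the fix is the same shape for these DaemonSets. A sketch of the blanket tolerations they would need in order to schedule onto tainted node pools (mirroring the kube-proxy block quoted above):

```yaml
# Sketch only: the tolerations these DaemonSets would need to
# land on tainted node pools (same shape as kube-proxy's).
tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - operator: Exists
    effect: NoExecute
  - operator: Exists
    effect: NoSchedule
```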

@bcho
Member

bcho commented Jul 21, 2021

Hi @adammal, thanks for the feedback. I will update the nmi DaemonSet tolerations to align with kube-proxy's. For the Secrets Store CSI driver, @ZeroMagic will update the toleration settings as well.

@adammal

adammal commented Jul 21, 2021

Thanks very much guys! Really appreciate it.

@miwithro miwithro added the resolution/answer-provided Provided answer to issue, question or feedback. label Jul 21, 2021
@ghost

ghost commented Jul 24, 2021

Thanks for reaching out. I'm closing this issue as it was marked with "Answer Provided" and it hasn't had activity for 2 days.

@ghost ghost closed this as completed Jul 24, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 23, 2021
This issue was closed.

5 participants