Failed to create pod sandbox: rpc - error getting ClusterInformation connection is unauthorized: Unauthorized #8379

Closed
eliassal opened this issue Dec 28, 2023 · 31 comments · Fixed by #8563

Comments

@eliassal

I have K8s up and running and am able to deploy and run different Pods/containers. Today, I tried to deploy MySQL to it with a PVC and PV.
After deploying, the container gets stuck in "ContainerCreating" status, then gets terminated and recreated.

When I describe the Pod, I see this:

Events:
  Type     Reason                  Age               From               Message
  ----     ------                  ----              ----               -------
  Normal   Scheduled               61s               default-scheduler  Successfully assigned default/mysql-74799d694c-j4mcr to chef-u16desk
  Warning  FailedCreatePodSandBox  60s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d9f4c91394d548cca1b189e665fd0532f158bb5bb4407153aa48e1af40afe2f0": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
  Normal   SandboxChanged          5s (x5 over 60s)  kubelet            Pod sandbox changed, it will be killed and re-created.

Expected Behavior

Pod should run with persistent volume

Current Behavior

Pod gets stuck in "ContainerCreating" status

Context

Enclosed are the YAML files for the PV, PVC and MySQL deployment:
mysql-storage.txt
mysql-deployment.txt

Your Environment

Calico version : as indicated above I used https://raw.githubusercontent.com/projectcalico/calico/master/manifests/calico.yaml
Orchestrator version: kubernetes 1.26
Operating System and version: ubuntu 22.04

@caseydavenport
Member

I'd recommend using manifests from an official release, as it's possible that master is unstable for some reason. v3.27.0 is the most recent release right now.

error getting ClusterInformation: connection is unauthorized: Unauthorized

This error suggests that the calico-cni-plugin serviceaccount doesn't have permission to get ClusterInformations. Can you share the contents of this command?

kubectl get clusterrole calico-cni-plugin -o yaml

@eliassal
Author

Here is the output of the command:
~/Projects/DeployMySQL-OnKubernetes$ kubectl get clusterrole calico-cni-plugin -o yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"name":"calico-cni-plugin"},"rules":[{"apiGroups":[""],"resources":["pods","nodes","namespaces"],"verbs":["get"]},{"apiGroups":[""],"resources":["pods/status"],"verbs":["patch"]},{"apiGroups":["crd.projectcalico.org"],"resources":["blockaffinities","ipamblocks","ipamhandles","clusterinformations","ippools","ipreservations","ipamconfigs"],"verbs":["get","list","create","update","delete"]}]}
  creationTimestamp: "2023-04-14T14:53:00Z"
  name: calico-cni-plugin
  resourceVersion: "766"
  uid: e328e973-e60a-4d13-96b0-901df58d1ccc
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  - clusterinformations
  - ippools
  - ipreservations
  - ipamconfigs
  verbs:
  - get
  - list
  - create
  - update
  - delete

@caseydavenport
Member

Seems like the CNI plugin has permissions to get ClusterInformation, which suggests this isn't an RBAC issue as much as a more general authorization issue.

This thread includes a number of potential reasons why this might happen: #5712

Including:

  • NTP synchronization issues
  • Expired certificates being given to the CNI plugin

How old is this cluster by the way?

@caseydavenport
Member

One thing that might be useful to check is if restarting calico/node on the affected node improves things at all.

@eliassal
Author

@caseydavenport, if you mean NTP synchronization issues between the master and the node, they show exactly the same time, with no difference.
The cluster is 1 year old; it was set up in January of this year.

"Expired certificates being given to the CNI plugin": how can I check this and renew them?

@eliassal
Author

@caseydavenport YES, I rebooted the VM and the pod was created successfully, and I was able to access the mysql container. Can you please tell me what could be the root cause that kept Calico from functioning correctly?
Thanks again for your help

@caseydavenport
Member

Restarting the node suggests there was some temporary state in place that had expired and was refreshed on reboot. The most likely thing would be the CNI plugin's bearer token.

What version of Calico do you have installed?

e.g.,

kubectl get clusterinformations -o yaml

Newer versions of Calico (starting in v3.24, it seems) should automatically refresh the token to prevent cases like this: #5910

However, you would need to be on v3.24 or greater and also have properly updated manifests that volume mount the necessary CNI configuration directory into calico/node so that it can provide refreshed tokens to the CNI plugin. Otherwise, I think the tokens expire after about a year.
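For anyone who wants to check whether the CNI plugin's token has expired, here is a minimal sketch that reads the `exp` claim from a JWT. It assumes the token is a standard three-segment JWT and that the CNI kubeconfig lives at /etc/cni/net.d/calico-kubeconfig with a `token:` line; both the path and the file layout are assumptions that may differ on your install, and `jwt_exp` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# jwt_exp: print the "exp" claim (Unix seconds) of the JWT given as $1.
# Prints nothing if the token carries no "exp" claim.
jwt_exp() {
  local payload
  # The payload is the second dot-separated segment, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url omits.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d 2>/dev/null |
    sed -n 's/.*"exp":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Example usage on the affected node (not run here; path is an assumption):
#   token=$(sudo awk '/token:/ {print $2}' /etc/cni/net.d/calico-kubeconfig)
#   date -d "@$(jwt_exp "$token")"   # GNU date: expiry as a readable timestamp
```

If the printed time is in the past, the plugin is presenting an expired token, which would match the plain "Unauthorized" error rather than an RBAC denial.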

@eliassal
Author

Hi, here is the output of the command

kubectl get clusterinformations -o yaml

Calico is V 3.26

apiVersion: v1
items:
- apiVersion: crd.projectcalico.org/v1
  kind: ClusterInformation
  metadata:
    annotations:
      projectcalico.org/metadata: '{"uid":"fc5f4dde-71ef-4937-9430-46a80ed38299","creationTimestamp":"2023-04-14T15:04:55Z"}'
    creationTimestamp: "2023-04-14T15:04:55Z"
    generation: 1
    name: default
    resourceVersion: "1443"
    uid: 51fdcd2c-eb0b-40c8-834c-8986d0bdd420
  spec:
    calicoVersion: v3.26.0-0.dev-403-gf8c46d4273ba
    clusterGUID: 6ddd81728f22472096a3e0c64e1ba716
    clusterType: k8s,bgp,kubeadm,kdd
    datastoreReady: true
kind: List
metadata:
  resourceVersion: ""

@caseydavenport
Member

3.26.0-0.dev-403-gf8c46d4273ba

Interesting, looks like a dev build is being run rather than a production release?

@eliassal
Author

OK, so what should I do? Should I switch to production stable release? If yes, how? Thanks

@matthewdupre
Member

@eliassal I'm curious how you ended up installing the newest code from github (back in ~April) rather than a production release - do you remember where you started installing from?

Generally speaking everyone should stick with a stable release unless you're testing something out that hasn't been released yet. https://docs.tigera.io/calico/latest/about/ has the docs for the current release (v3.27.0)

@eliassal
Author

eliassal commented Jan 2, 2024

Thanks @matthewdupre, but the link you provided does not indicate how to upgrade or whether there is any chance of breaking the current config.
I don't remember exactly how either, but I remember I followed instructions from one of the courses on A Cloud Guru or Pluralsight.

@caseydavenport
Member

There are upgrade docs in the side bar: https://docs.tigera.io/calico/latest/operations/upgrading/kubernetes-upgrade

@eliassal I'm afraid I can't guarantee you won't break your config - you're running an unreleased / unsupported version of Calico.

@eliassal
Author

eliassal commented Jan 3, 2024

@caseydavenport OK, I will go through the upgrade doc, but tell me, what is calicoctl? I don't have this tool on my cluster.

@caseydavenport
Member

@eliassal you can ignore the section about Host Endpoints - that's only for upgrades from versions older than v3.14.

You can read about calicoctl in the documentation, it's a CLI tool.

@davhdavh
Contributor

I had the same problem after upgrading to 3.27.0, but a complete restart of Calico solved it:

kubectl delete pods --all -n calico-system --force

@davhdavh
Contributor

Nope, seems 3.27 is total fubar. Had to restart calico 4 times already

@mazdakn
Member

mazdakn commented Jan 30, 2024

@davhdavh what's the error you get in your cluster?
If it's the same access error mentioned in the description of this issue, then what's the output of this command:

kubectl get clusterrole calico-cni-plugin -o yaml

@caseydavenport
Member

I believe the "Unauthorized" error message to be distinct from the typical RBAC error. IIUC, if this was an RBAC issue, we'd see additional context along the lines of this:

system:serviceaccount:calico-system:calico-cni-plugin is unable to get clusterinformations at cluster scope

(or similar, writing it out from memory)

I believe the simple "Unauthorized" means that there is something more fundamental going on - i.e., the certificates in-use have expired or perhaps the token itself has expired.

@caseydavenport
Member

Another issue with the same symptom: #7171

Relevant bit:

When API server token/certificate get rotated, calico is trying to authenticate using current token, which is invalid as API server token was rotated. Due to this calico is failing to authenticate with API server which results in failing to add network to POD.

One thing to check here would be the calico/node pod logs from the affected node - does it contain any logs indicating that it has successfully (or unsuccessfully) refreshed the CNI plugin token? You'll want to look for logs from token_watch.go
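As a rough sketch of that check, the snippet below filters calico/node logs down to lines mentioning token_watch.go. The namespace, pod name and container name in the usage comment are placeholders (operator installs typically use calico-system, manifest installs kube-system), and `token_refresh_logs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# token_refresh_logs: read log lines on stdin and keep only those
# that reference token_watch.go, the source file named above.
token_refresh_logs() {
  grep -F 'token_watch.go'
}

# Example usage (not run here; pod name and namespace are placeholders):
#   kubectl logs -n calico-system <calico-node-pod> -c calico-node | token_refresh_logs
```

An empty result over a long-running pod would suggest the token refresher is not running at all on that node.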

@davhdavh
Contributor

davhdavh commented Feb 1, 2024

It is installed via Helm with pretty basic settings, except that it uses the new Windows setup that 3.27 brings...
And kubectl get clusterrole calico-cni-plugin -o yaml returns the same as above.
Every single time, it is the Windows node that breaks; i.e., if I restart 'calico-node-windows-xxxx' it works again.
image
Last log entry for calico-node-windows is yesterday evening, so nothing.
Killed calico-node-windows, and 1 min later:
image

@caseydavenport
Member

Aha, yes that's important context if it's only happening on Windows nodes. Likely a bug in how the token refresh works on Windows nodes (or perhaps isn't being enabled on Windows nodes?). CC @coutinhop

@davhdavh
Contributor

davhdavh commented Feb 5, 2024

Any workaround? Pretty tired of the clusters being half broken every morning

@coutinhop
Member

@davhdavh if I understood you correctly, you're now using the Windows operator install that came out in v3.27.0, right? Could you set LogSeverityScreen to debug in the default FelixConfiguration and provide logs for the Windows pods (ideally all of them: uninstall-calico, install-cni, node, felix). Anything in particular to look out for when trying to reproduce your issue? If there is something broken with the token refresh mechanism, I'd assume this happens after a set period of time that the cluster is running, is that correct? How many Linux and Windows nodes do you have in your cluster? Are you using VXLAN? What version of kubernetes are you using? What version of containerd in the Windows nodes?
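One way to raise the log level, sketched under the assumption that the cluster has a FelixConfiguration named default and that kubectl can patch it directly; `felix_log_patch` is a hypothetical helper that only builds the merge patch:

```shell
#!/usr/bin/env bash
# felix_log_patch: print a JSON merge patch that sets Felix's
# logSeverityScreen to the level given as $1 (Debug if omitted).
felix_log_patch() {
  printf '{"spec":{"logSeverityScreen":"%s"}}' "${1:-Debug}"
}

# Example usage (not run here; needs cluster access):
#   kubectl patch felixconfiguration default --type merge -p "$(felix_log_patch Debug)"
#   kubectl patch felixconfiguration default --type merge -p "$(felix_log_patch Info)"   # revert afterwards
```

Debug logging is verbose, so it is probably worth reverting once the logs have been collected.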

@davhdavh
Contributor

davhdavh commented Feb 6, 2024

if I understood you correctly, you're now using the Windows operator install that came out in v3.27.0, right?

Yes. We were using the manual host-process setup in 3.26, so it really shouldn't be a very big change.

Could you set LogSeverityScreen to debug in the default FelixConfiguration and provide logs for the Windows pods (ideally all of them: uninstall-calico, install-cni, node, felix).

sure, will send next time it is stuck.

Anything in particular to look out for when trying to reproduce your issue?

No, we should be running with the most basic setup there is that includes windows.

If there is something broken with the token refresh mechanism, I'd assume this happens after a set period of time that the cluster is running, is that correct?

Yes, but it is long enough that I haven't figured out the timing yet.

How many Linux and Windows nodes do you have in your cluster?

It happens on both our dev cluster (1 main Linux worker + control-plane, 2 micro control-planes and 1 Windows node)
and our preprod cluster (3 main Linux workers + control-plane and 2 Windows nodes). Those are the only clusters we have upgraded to 3.27 so far.

Are you using VXLAN?

Yes.

tigera-operator:
  enabled: true
  installation:
    serviceCIDRs:
    - 10.96.0.0/12
    calicoNetwork:
      windowsDataplane: HNS
      # enable iptable port forwarding
      containerIPForwarding: Enabled
      bgp: Disabled
      # Note: The ipPools section cannot be modified post-install.
      ipPools:
      - blockSize: 26
        cidr: 10.168.0.0/16
        disableBGPExport: false
        encapsulation: VXLAN
        natOutgoing: Enabled
        nodeSelector: all()
Plus kubernetes-services-endpoint and the kube-proxy-windows daemonset; that is our entire config.

What version of kubernetes are you using?

v1.29.0

What version of containerd in the Windows nodes?

1.7.2 for dev and 1.7.11 for preprod.

@davhdavh
Contributor

davhdavh commented Feb 7, 2024

@coutinhop
Detected problem at 01:39:32 (log time).
I have very few pods starting on windows around that time, and it is only a problem on start and terminate. No problem for pods that keep running.
So it is probably the update at 01:33:39.297 or 01:33:43.858 that was the cause.

Here are the logs...
calico-node-windows.zip

@davhdavh
Contributor

@coutinhop any workarounds? it is getting quite annoying to have to fix this manually every single day

@davhdavh
Contributor

Here is a small workaround script to monitor the problem, and kill the pods

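# Watch cluster events forever; whenever the ClusterInformation error shows up,
# list and then force-delete the calico-node-windows pods so they restart.
while true; do
  kubectl get events --all-namespaces -o json --watch --watch-only | \
  # Treat a missing message as empty so jq does not error on such events.
  jq --unbuffered 'select((.message // "") | test("error getting ClusterInformation")) | .reportingInstance' | \
  while read -r line; do
    kubectl -n calico-system get pods --selector app.kubernetes.io/name=calico-node-windows
    kubectl -n calico-system delete pod --selector app.kubernetes.io/name=calico-node-windows --force
    date  # record when the restart was triggered
  done
done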

@coutinhop
Member

@davhdavh sorry for the delay! While I could not find anything relevant in the logs you provided, that led me to look into the exact reason why I couldn't find any token refresher messages in the logs, and it turns out it doesn't run on Windows 😢
It is currently only invoked in the runit service scripts, which are not used by Calico for Windows:

exec calico-node -monitor-token

I'll get started right away on working that into the Windows scripts...

In the meantime, I'm glad you found a workaround. I'll keep you posted on a fix...

@eliassal
Author

@caseydavenport @matthewdupre Hi, I decided to reinstall K8s on a fresh Ubuntu, and I am a little confused about the instructions at https://docs.tigera.io/calico/latest/getting-started/kubernetes/quickstart
I need the pod network to be 192.168.200.0; should I run step 1 and step 2?
Second, it says at the beginning that we should run
sudo kubeadm init --pod-network-cidr=192.168......
I have already run kubeadm init as follows
sudo kubeadm init --control-plane-endpoint=kubernetes --upload-certs
and it was successful, so should I run the init again with the pod network 192.168.200.0, or download the manifests in steps 1 and 2, update them, and then apply them?
Thanks for your help

@caseydavenport
Member

@eliassal please open a new issue - sounds unrelated to the original problem here and best to keep separate concerns separated for anyone looking in the future.

6 participants