
Cluster validation didn't pass after upgrading to kops version 1.11.0 #6292

Closed
tsahoo opened this issue Jan 3, 2019 · 17 comments

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

tsahoo commented Jan 3, 2019

1. What kops version are you running?
Version 1.11.0

2. What Kubernetes version are you running?
v1.10.11

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops upgrade cluster $NAME
ITEM     PROPERTY           OLD       NEW
Cluster  KubernetesVersion  1.10.11   1.11.6

kops upgrade cluster $NAME --yes
kops rolling-update cluster --yes

5. What happened after the commands executed?
Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "*" has not yet joined cluster
master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duration of "5m0s""

6. What did you expect to happen?
Cluster validation should pass after upgrading both kops and the Kubernetes version.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 
  name: 
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://
  etcdClusters:
  - etcdMembers:
    - instanceGroup: 
      name: a-1
    - instanceGroup: 
      name: b-1
    - instanceGroup: 
      name: a-2
    name: main
  - etcdMembers:
    - instanceGroup: 
      name: a-1
    - instanceGroup: 
      name: b-1
    - instanceGroup: 
      name: a-2
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 
  kubernetesVersion: 1.11.6
  masterInternalName: 
  masterPublicName: 
  networkCIDR: 
  networking:
    calico: {}
  nonMasqueradeCIDR: 
  sshAccess:
  - 
  subnets:
  - cidr: 
    name: 
    type: Private
    zone: 
  - cidr: 
    name: 
    type: Private
    zone: 
  - cidr: 
    name: 
    type: Utility
    zone: 
  - cidr: 
    name: 
    type: Utility
    zone: 
  topology:
    dns:
      type: Public
    masters: private
    nodes: private


justinsb (Member) commented Jan 4, 2019

I'm not able to reproduce this. I've tried with --topology=private, --networking=calico, both HA and non-HA. Is there anything additional that I can try to reproduce this?

Does the cluster recover despite the validation failure? In other words, is it just that 5 minutes is too short a time? It seems unlikely, but maybe if you have a pod that is slow to terminate or restart.

tsahoo (Author) commented Jan 4, 2019

@justinsb Is kops 1.11.0 supporting etcdv2 ?

justinsb (Member) commented Jan 4, 2019

@tsahoo yes, and etcd3. The upgrade from etcd2 -> etcd3 relies on etcd-manager, and the plan is to finish up the final edge cases for that upgrade in kops 1.12.

justinsb (Member) commented Jan 4, 2019

I've also realized that we really should print the validation failure on a kops rolling-update validation failure. (I take it we don't, which just isn't helpful)

@tsahoo I don't suppose you ran kops validate cluster and were able to see the problem?

Edit: actually, it looks like we know what happened - the new machine did not join the cluster.

tsahoo (Author) commented Jan 4, 2019

@justinsb Yes. While upgrading the cluster, the new master node does not join the cluster with Kubernetes version 1.11.6, and after that cluster validation doesn't pass. kops 1.11.0 runs fine with Kubernetes versions below 1.11.x.

justinsb (Member) commented Jan 4, 2019

Thanks @tsahoo - are you able to SSH to the instance which didn't join (it should be the one that started most recently) and look at the logs to figure out what went wrong? The error should be in journalctl -u kops-configuration, journalctl -u kubelet, or maybe journalctl -u protokube. Hopefully one of those gives us a hint as to why the node isn't rejoining the cluster.
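
A sketch of those steps (the SSH user and instance address are placeholders and depend on your AMI and private topology):

# from a bastion or another host with access to the private subnet
ssh <user>@<new-master-private-ip>

# then, on the instance, look for errors in the relevant units
sudo journalctl -u kops-configuration --no-pager | tail -n 200
sudo journalctl -u kubelet --no-pager | tail -n 200
sudo journalctl -u protokube --no-pager | tail -n 200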

You could also try kops validate cluster again to see if the master was just very slow to join.

loshz commented Jan 8, 2019

Also experiencing this problem upgrading from 1.10.11 to 1.11.6. Very similar cluster config to the OP. Also using weave as the CNI.

I am seeing this log multiple times in kubelet logs:

Jan 08 16:05:23 ip-172-21-37-167 kubelet[2512]: W0108 16:05:23.846996    2512 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/

kube-apiserver seems to be in a CrashLoopBackoff.
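
For anyone hitting a similar state, a rough way to inspect the failing apiserver on the master (assuming the default Docker runtime and the usual kops log locations) is:

# on the affected master
sudo docker ps -a | grep kube-apiserver
sudo tail -n 50 /var/log/kube-apiserver.log

# and check whether the CNI config directory really is empty
ls -l /etc/cni/net.d/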

vivekgarg20 (Contributor) commented:

I had the same problem, but it turned out to be caused by the enable-custom-metrics flag, which is deprecated in 1.11. Please make sure you do not have that. Instead, use the following spec:

spec:
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
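
The change can then be applied with the usual kops flow, roughly (cluster name is a placeholder):

kops edit cluster $NAME        # remove kubelet.enableCustomMetrics and add the spec above
kops update cluster $NAME --yes
kops rolling-update cluster --yes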

vivekgarg20 (Contributor) commented:

@justinsb It would be a good idea to put this in the required actions.

loshz commented Jan 9, 2019

I think I've figured out the cause of my problems so I'll post a new issue as I don't want to hijack this one.

BrianChristie commented:

For folks having problems with 1.11, if you are using OIDC for cluster authentication, see this comment
#6046 (comment)

authorization-rbac-super-user was removed in 1.11 so you'll need to remove that from your cluster spec if you were using it.
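
A rough way to check whether your spec still carries that setting before upgrading (the exact field name depends on how it was set):

kops get --name $NAME -o yaml | grep -i super
kops edit cluster $NAME   # remove the authorization-rbac-super-user setting if it is present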

ghost commented Jan 16, 2019

I have the same issue on AWS with kops 1.11, trying to upgrade from 1.10.6, then 1.10.12, to 1.11.6. Every time I get something like this:

VALIDATION ERRORS
KIND     NAME                  MESSAGE
Machine  i-088ee22081adaa2b1   machine "i-088ee22081adaa2b1" has not yet joined cluster
Machine  i-0cd125be94a9e05fd   machine "i-0cd125be94a9e05fd" has not yet joined cluster

None of the advice about horizontalPodAutoscalerUseRestClients and RBAC worked for me.

myspotontheweb commented Feb 12, 2019

Here is the upgrade procedure that worked for me going from 1.9 -> 1.11.

Procedure

Pre-upgrade

The kubelet configuration (in my case) needed to be changed from:

  kubelet:
    enableCustomMetrics: true

to

  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook

The --enable-custom-metrics flag is no longer supported in v1.13 and will cause the kubelet to fail on startup. The new settings are there to secure the kubelet process. In kops v1.13, anonymous authentication defaults to being switched off, which in turn means we must enable webhook authentication so that processes like tiller (helm) and metrics-server can log in using bearer tokens.
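
A rough check of the effect of anonymousAuth: false after the rolling update (10250 is the default kubelet API port; the node address is a placeholder):

curl -sk https://<node-private-ip>:10250/pods
# expect an Unauthorized response instead of a pod listing once anonymous auth is off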

Post-upgrade

Necessary kubelet-api fix:

kubectl create clusterrolebinding kubelet-api-admin --clusterrole=system:kubelet-api-admin --user=kubelet-api

Introduced in v1.10: you need to authorize kubelet-api to access the kubelet API.
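
To sanity-check afterwards (the pod name is a placeholder; treat this as a sketch):

kubectl get clusterrolebinding kubelet-api-admin -o yaml
kubectl -n kube-system logs <any-running-pod>   # should no longer fail with a kubelet-api authorization error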

Looks like it's fixed in the next version of kops

fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 13, 2019
fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 12, 2019
fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented:

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
