
Cluster validation didn't pass after upgrading to kops version 1.11.0 #6292

Closed
tsahoo opened this issue Jan 3, 2019 · 17 comments

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

tsahoo commented Jan 3, 2019

1. What kops version are you running?
Version 1.11.0

2. What Kubernetes version are you running?
v1.10.11

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops upgrade cluster $NAME
ITEM     PROPERTY           OLD       NEW
Cluster  KubernetesVersion  1.10.11   1.11.6

kops upgrade cluster $NAME --yes
kops rolling-update cluster --yes

5. What happened after the commands executed?
Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "*" has not yet joined cluster
master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duration of "5m0s""

6. What did you expect to happen?
Cluster validation should pass after upgrading both kops and the Kubernetes version.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 
  name: 
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://
  etcdClusters:
  - etcdMembers:
    - instanceGroup: 
      name: a-1
    - instanceGroup: 
      name: b-1
    - instanceGroup: 
      name: a-2
    name: main
  - etcdMembers:
    - instanceGroup: 
      name: a-1
    - instanceGroup: 
      name: b-1
    - instanceGroup: 
      name: a-2
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 
  kubernetesVersion: 1.11.6
  masterInternalName: 
  masterPublicName: 
  networkCIDR: 
  networking:
    calico: {}
  nonMasqueradeCIDR: 
  sshAccess:
  - 
  subnets:
  - cidr: 
    name: 
    type: Private
    zone: 
  - cidr: 
    name: 
    type: Private
    zone: 
  - cidr: 
    name: 
    type: Utility
    zone: 
  - cidr: 
    name: 
    type: Utility
    zone: 
  topology:
    dns:
      type: Public
    masters: private
    nodes: private


justinsb (Member) commented Jan 4, 2019

I'm not able to reproduce this. I've tried with --topology=private, --networking=calico, both HA and non-HA. Is there anything additional that I can try to reproduce this?

Does the cluster recover despite the validation failure? In other words, is it just that 5 minutes is too short a time? It seems unlikely, but maybe if you have a pod that is slow to terminate or restart.

tsahoo (Author) commented Jan 4, 2019

@justinsb Is kops 1.11.0 supporting etcdv2 ?

justinsb (Member) commented Jan 4, 2019

@tsahoo yes, and etcd3. The upgrade from etcd2 -> etcd3 relies on etcd-manager, and the plan is to finish up the final edge cases for that upgrade in kops 1.12.

justinsb (Member) commented Jan 4, 2019

I've also realized that we really should print the validation failure on a kops rolling-update validation failure. (I take it we don't, which just isn't helpful)

@tsahoo I don't suppose you ran kops validate cluster and were able to see the problem?

Edit: actually, it looks like we know what happened - the new machine did not join the cluster.

tsahoo (Author) commented Jan 4, 2019

@justinsb Yes. While upgrading the cluster, the new master node does not join the cluster with Kubernetes version 1.11.6, and after that cluster validation doesn't pass. kops 1.11.0 runs fine with Kubernetes versions below 1.11.x.

justinsb (Member) commented Jan 4, 2019

Thanks @tsahoo - are you able to SSH to the instance which didn't join (it should be the one that started most recently) and look at the logs to figure out what went wrong? The error should be in journalctl -u kops-configuration, journalctl -u kubelet, or maybe journalctl -u protokube. Hopefully one of those gives us a hint as to why the node isn't rejoining the cluster.
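
A sketch of those steps (the SSH user and instance address are placeholders and depend on your AMI and private topology):

# from a bastion or another host with access to the private subnet
ssh <user>@<new-master-private-ip>

# then, on the instance, look for errors in the relevant units
sudo journalctl -u kops-configuration --no-pager | tail -n 200
sudo journalctl -u kubelet --no-pager | tail -n 200
sudo journalctl -u protokube --no-pager | tail -n 200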

You could also try kops validate cluster again to see if the master was just very slow to join.

loshz commented Jan 8, 2019

Also experiencing this problem upgrading from 1.10.11 to 1.11.6. Very similar cluster config to the OP. Also using weave as the CNI.

I am seeing this log multiple times in kubelet logs:

Jan 08 16:05:23 ip-172-21-37-167 kubelet[2512]: W0108 16:05:23.846996    2512 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/

kube-apiserver seems to be in a CrashLoopBackoff.
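
For anyone hitting a similar state, a rough way to inspect the failing apiserver on the master (assuming the default Docker runtime and the usual kops log locations) is:

# on the affected master
sudo docker ps -a | grep kube-apiserver
sudo tail -n 50 /var/log/kube-apiserver.log

# and check whether the CNI config directory really is empty
ls -l /etc/cni/net.d/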

vivekgarg20 (Contributor) commented:

I had the same problem, but it turned out to be caused by the enable-custom-metrics flag, which is deprecated in 1.11. Please make sure you do not have that. Instead, use the following spec:

spec:
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
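
The change can then be applied with the usual kops flow, roughly (cluster name is a placeholder):

kops edit cluster $NAME        # remove kubelet.enableCustomMetrics and add the spec above
kops update cluster $NAME --yes
kops rolling-update cluster --yes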

vivekgarg20 (Contributor) commented:

@justinsb It would be a good idea to put this in the required actions.

loshz commented Jan 9, 2019

I think I've figured out the cause of my problems so I'll post a new issue as I don't want to hijack this one.

BrianChristie commented:

For folks having problems with 1.11, if you are using OIDC for cluster authentication, see this comment
#6046 (comment)

authorization-rbac-super-user was removed in 1.11 so you'll need to remove that from your cluster spec if you were using it.
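
A rough way to check whether your spec still carries that setting before upgrading (the exact field name depends on how it was set):

kops get --name $NAME -o yaml | grep -i super
kops edit cluster $NAME   # remove the authorization-rbac-super-user setting if it is present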

ghost commented Jan 16, 2019

I have the same issue on AWS with kops 1.11, trying to upgrade from 1.10.6, then 1.10.12, to 1.11.6. Every time I get something like this:

VALIDATION ERRORS
KIND     NAME                  MESSAGE
Machine  i-088ee22081adaa2b1   machine "i-088ee22081adaa2b1" has not yet joined cluster
Machine  i-0cd125be94a9e05fd   machine "i-0cd125be94a9e05fd" has not yet joined cluster

None of the advice about horizontalPodAutoscalerUseRestClients and RBAC worked for me.

myspotontheweb commented Feb 12, 2019

Here is the upgrade procedure that worked for me going from 1.9 -> 1.11.

Procedure

Pre-upgrade

The kubelet configuration (in my case) needed to be changed from:

  kubelet:
    enableCustomMetrics: true

to

  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook

The --enable-custom-metrics flag is no longer supported in v1.13 and will cause the kubelet to fail on startup. The new settings are there to secure the kubelet process. In kops v1.13, anonymous authentication defaults to being switched off, which in turn means we must enable webhook authentication so that processes like tiller (helm) and metrics-server can log in using bearer tokens.
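
A rough check of the effect of anonymousAuth: false after the rolling update (10250 is the default kubelet API port; the node address is a placeholder):

curl -sk https://<node-private-ip>:10250/pods
# expect an Unauthorized response instead of a pod listing once anonymous auth is off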

Post-upgrade

Necessary kubelet-api fix:

kubectl create clusterrolebinding kubelet-api-admin --clusterrole=system:kubelet-api-admin --user=kubelet-api

Introduced in v1.10: you need to authorize kubelet-api to access the kubelet API.
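
To sanity-check afterwards (the pod name is a placeholder; treat this as a sketch):

kubectl get clusterrolebinding kubelet-api-admin -o yaml
kubectl -n kube-system logs <any-running-pod>   # should no longer fail with a kubelet-api authorization error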

Looks like it's fixed in the next version of kops

fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 13, 2019
fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 12, 2019
fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented:

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
