
kops upgrade 1.7 -> 1.8 release notes - downtime required with canal? #3911

Closed · pierreozoux opened this issue Nov 22, 2017 · 15 comments
@pierreozoux (Contributor)

I'm preparing our migration to 1.8, and reading the release notes I'm a bit worried.

They say the upgrade "will involve downtime". Does that mean that if we are using canal, we have to suffer downtime to upgrade?

Reading this document, it doesn't seem like any downtime is needed at all:
https://github.com/projectcalico/calico/blob/master/upgrade/v2.5/README.md

I'm just wondering:

  • is it the path you described that requires "downtime"?
  • is there a path that doesn't require "downtime"?
  • can you clarify what exactly "downtime" means here?

If there is a path without downtime, I'd be happy to find it with you and document it!

@pierreozoux pierreozoux changed the title kops 1.8 release notes - downtime with canal? kops 1.8 release notes - downtime required with canal? Nov 22, 2017
@pierreozoux pierreozoux changed the title kops 1.8 release notes - downtime required with canal? kops upgrade 1.7 -> 1.8 release notes - downtime required with canal? Nov 22, 2017
@chrislovecnm (Contributor)

See #3905

And we just dropped a PR

@chrislovecnm (Contributor)

#3908

@pierreozoux (Contributor, Author)

@chrislovecnm I read all of that before posting this issue :)

This line worries me:

> will involve downtime

Thanks for the clarification :)

@chrislovecnm (Contributor)

/assign @KashifSaadat

These are the instructions that @KashifSaadat worked out. @caseydavenport, who would be a good Canal guru to weigh in on this?

@pierreozoux I will let the experts comment, but yes, downtime with Kubernetes is not ideal.

@KashifSaadat (Contributor) commented Nov 27, 2017

I ran into the following issues when trying to do a gradual rolling-update with no downtime:

  • On some occasions, rolling a new node on k8s v1.8 would incorrectly pick up and deploy the old Canal manifest (for v1.6-v1.7), so the Canal pods would error attempting to access TPRs. I'm not sure if this is some odd delay / caching issue where the manifest / DaemonSet still had the old contents at the time the node was rebuilt. This may be a bug, or I may need to leave more of a delay between the kops update and kops rolling-update commands.
  • Most of the way through the deployment, kube-dns is left on a v1.7 node while the cluster is mostly upgraded to v1.8. You then run into communication issues with the API, the cluster fails to validate, and the rolling-update breaks. The cluster won't be fully functional at this point. Logs from kube-dns:
dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Endpoints: Get https://10.10.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.10.0.1:443: getsockopt: no route to host
reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Service: Get https://10.10.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.10.0.1:443: getsockopt: no route to host
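
For reference, a minimal sketch of the command sequence being described (the cluster name variable and the sleep are assumptions; the delay is only the workaround speculated about in the first bullet, not a verified fix):

    # push the updated cluster spec and addon manifests to the state store
    kops update cluster $CLUSTER_NAME --yes

    # assumption: allow the updated Canal manifest time to propagate, so
    # rebuilt nodes don't pick up the old v1.6-v1.7 manifest
    sleep 120

    # rebuild nodes one at a time, validating the cluster in between
    kops rolling-update cluster $CLUSTER_NAME --yes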

@caseydavenport (Member) commented Nov 29, 2017

The linked procedure is designed to have no downtime, but it needs to be performed in a particular order due to the removal of TPR support from the k8s API.

Essentially, these things need to happen in this order:

  1. Data is migrated from TPR -> CRD (as per the provided script).
  2. Canal is upgraded to v2.5 (works on both k8s 1.7 and 1.8).
  3. Kubernetes is upgraded to v1.8.

The key point is that any canal v2.4 pod will stop working once k8s is updated to v1.8, so the canal upgrade needs to happen first.
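
A rough shell sketch of that ordering (the manifest filename and cluster name are assumptions; the TPR -> CRD migration should follow the tooling in the calico upgrade README linked at the top of this issue):

    # 1. While still on k8s v1.7: migrate Calico data from TPRs to CRDs,
    #    using the migration steps from the projectcalico/calico
    #    upgrade/v2.5 README

    # 2. Upgrade canal to v2.5+ (works on both k8s 1.7 and 1.8), e.g. by
    #    applying the updated canal manifest (filename assumed):
    kubectl apply -f canal.yaml

    # 3. Only then upgrade Kubernetes itself to v1.8:
    kops upgrade cluster $CLUSTER_NAME --yes
    kops update cluster $CLUSTER_NAME --yes
    kops rolling-update cluster $CLUSTER_NAME --yes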

@justinsb justinsb added this to the 1.8.0 milestone Dec 1, 2017
@jurgenweber commented Dec 4, 2017

@caseydavenport do you have a DaemonSet manifest for the Canal upgrade to v2.6, or can I just edit my current one?

Does the 1.6/1.7 kops version of canal support the CRD storage engine? Can I do the data migration and then leave it at that before I move on?

@caseydavenport (Member)

@jurgenweber the manifests here will work for v2.6 - https://github.com/projectcalico/canal/tree/ff2a346124ac0a2203237c3f76e1a5428c8369ab/k8s-install/1.7

You'll need the new CRDs if you're upgrading from pre-v2.5.

> Does the 1.6/1.7 kops version of canal support the CRD storage engine? Can I do the data migration and then leave it at that before I move on?

You should be able to do the data migration, then upgrade kops/canal. You need at least Kubernetes v1.7 in order to use Canal v2.5+, because CRDs do not exist in earlier versions of Kubernetes.
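
A hedged example of applying those manifests (the exact filenames under that directory are assumptions, so check the linked tree first):

    kubectl apply -f https://raw.githubusercontent.com/projectcalico/canal/ff2a346124ac0a2203237c3f76e1a5428c8369ab/k8s-install/1.7/rbac.yaml
    kubectl apply -f https://raw.githubusercontent.com/projectcalico/canal/ff2a346124ac0a2203237c3f76e1a5428c8369ab/k8s-install/1.7/canal.yaml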

@jurgenweber commented Dec 4, 2017

I am currently on k8s 1.7.10.

So looking at the currently deployed DS I have:

        image: quay.io/calico/cni:v1.10.0
        image: quay.io/calico/node:v2.5.1
        image: quay.io/coreos/flannel:v0.8.0

Calico 2.5.1 supports both TPRs and CRDs? And so does k8s 1.7?

I already have the configuration in CRDs from my first botched attempt to go to k8s 1.8:

$ kubectl get crds
NAME                                          AGE
bgppeers.crd.projectcalico.org                26d
globalbgpconfigs.crd.projectcalico.org        26d
globalfelixconfigs.crd.projectcalico.org      26d
globalnetworkpolicies.crd.projectcalico.org   26d
ippools.crd.projectcalico.org                 26d

But I still have the TPRs as well:

$ kubectl get thirdpartyresources -n kube-system
NAME                              DESCRIPTION                   VERSION(S)
global-config.projectcalico.org   Calico Global Configuration   v1
ip-pool.projectcalico.org         Calico IP Pools               v1

Should I blow these away and do the data migration again? How do I know what datastore calico is currently using?

I see this in the DS:

        - name: DATASTORE_TYPE
          value: kubernetes

But that does not tell me whether it's using TPRs or CRDs.

So by the looks of things I need to work out:

  • what datastore is in use, and whether I need to do the data migration again
  • whether my current version of Calico is fine for both 1.7 and 1.8, so I won't upgrade it unless kops does

Sorry for all the questions; my first attempt was a bit of a disaster. I had some pods with no internet access that were unable to function. I managed to roll back after noticing the issue, but by that time I had already upgraded the masters and one of my instance groups. It did take a bit of hacking in etcd-server to get cronjobs working again. :) The good news is that I had no production downtime.

Thanks

@caseydavenport (Member) commented Dec 5, 2017

@jurgenweber Calico v2.4 and less uses TPRs to store data.

Calico v2.5+ uses CRDs only.

As for Kubernetes, k8s v1.7 supports both CRDs and TPRs, but k8s v1.8 supports only CRDs.

So, you need to migrate the data from TPRs to CRDs on k8s v1.7 before upgrading to Calico v2.5.

All of this applies ONLY when using the DATASTORE_TYPE=kubernetes option, which kops does not use for Calico, but DOES use for canal.
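
One way to check what a running cluster is actually configured with (a sketch, assuming the DaemonSet is named "canal" in kube-system):

    kubectl -n kube-system get daemonset canal \
      -o jsonpath='{.spec.template.spec.containers[*].env[?(@.name=="DATASTORE_TYPE")].value}'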

@jurgenweber commented Dec 18, 2017

Sorry, I have not gotten back to this; I've been really busy, but over Christmas when things are slow I hope to do the upgrade.

> Calico v2.5+ uses CRDs only.

So going by my Canal deployment image tags, I am on CRDs... assuming "Calico v2.5+" == "image: quay.io/calico/node:v2.5.1"? Or is the CNI image the version I should be concerned with?

> So, you need to migrate the data from TPRs to CRDs on k8s v1.7 before upgrading to Calico v2.5.

Do I? What image/part of Calico is in question here?

> As for Kubernetes, k8s v1.7 supports both CRDs and TPRs, but k8s v1.8 supports only CRDs.

Ok.

> All of this applies ONLY when using the DATASTORE_TYPE=kubernetes option, which kops does not use for Calico, but DOES use for canal.

How can I tell which one is in use?

@caseydavenport (Member)

@jurgenweber yep, calico/node:v2.5.1 means Calico v2.5.1.

@jurgenweber commented Dec 18, 2017

I'll be honest, I do not know how that happened, or maybe it was always 2.5.1... I dunno. Anyway, it sounds like I am on CRDs. Thank you for all of your patience with my questions.

@jurgenweber
Just dropping a note to say I managed the upgrade to 1.8.6 this morning. Thank you for all your advice and help.

@KashifSaadat (Contributor)

Good to hear it's working, and thanks for letting us know. I'm going to close this issue; feel free to update / reopen if you think there's anything outstanding here.

/close
