[WIP] - Add ability to configure number of Typha Replicas #7181

Closed

Conversation

gjtempleton
Member

@gjtempleton gjtempleton commented Jun 24, 2019

Adds the ability to configure the number of Typha replicas when using Calico CNI in 1.12+
to limit the impact of Calico on the APIServer and increase the scalability of the cluster.

Also adds the ability to configure Typha's Prometheus config.

Resolves #7158

Dependent on #7051 due to the Typha image being 3.7.2. This has now been merged. Just dependent on me getting around to testing the migration path now.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 24, 2019
@k8s-ci-robot
Contributor

Hi @gjtempleton. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 24, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gjtempleton
To complete the pull request process, please assign chrislovecnm
You can assign the PR to them by writing /assign @chrislovecnm in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@gambol99 gambol99 left a comment

Looks good to me. I left just the one comment. Also, does this work with Canal? Note, you'll need to bump the manifest versions to get this to roll out, I believe here.

```
networking:
  calico:
    typhaReplicas: 3
```

Contributor

Can an existing Calico install be upgraded to Typha without incident? I'm guessing not, and if so, perhaps just chuck in a helpful comment saying so.

Member Author

I'll need to double check if it's still the case, but it used to be possible (around version 3.2, from memory): as the new calico-node pods come up and see the new value for FELIX_TYPHAK8SSERVICENAME, they connect to Typha rather than directly to the APIServer.

I'll have a play around with that at some point over the next couple of days to confirm though.
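
For reference, a minimal sketch of the setting being discussed. The Service name `calico-typha` is an assumption based on upstream Calico manifests; the name used in the kops manifest may differ.

```yaml
# Fragment of the calico-node DaemonSet container spec (a sketch, not the kops manifest).
env:
  # When set, Felix discovers Typha through this Kubernetes Service
  # instead of watching the APIServer directly.
  - name: FELIX_TYPHAK8SSERVICENAME
    # Assumed Service name, taken from upstream Calico manifests.
    value: "calico-typha"
```

Because the variable is read when a calico-node pod starts, pods that have not yet been replaced would keep talking to the APIServer until they are rolled.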

Contributor

Agreed that this is an important thing for us to test and confirm. I feel like quite a few people will be excited for this and jump in when they see it in the release notes!

@gambol99
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 24, 2019
@gjtempleton
Member Author

Thanks for the super-quick response and feedback, knew I'd missed something around the manifests!

@gjtempleton
Member Author

On the Canal front, I've just had to bring myself back up to speed with the current state of Canal. It seems the docs don't have the same recommendations around Typha usage, given Flannel is the bit performing the heavy lifting of networking whilst Calico is just the policy, so there's no need for the fan-out.

Worth noting that I've put in calico/typha:v3.7.2 on the assumption that #7051 gets merged and I rebase this before merging.

@mikesplain
Contributor

/retest

@semoac

semoac commented Jun 27, 2019

Hi.

I just manually applied these changes on my cluster (kops 1.12.2/k8s v1.12.9).

The commit uses the image calico/typha:v3.7.2. This version is not compatible with the rest of the manifest or with the version of the calico nodes (calico/node:v3.4.0), which causes problems in the startup of Typha (and of the calico-node pods that are waiting for the service).

Version 3.7.2 needs an extra set of CRDs and resources in the cluster rolebinding. (Please check this to see the difference.)

After switching the image to calico/typha:v3.4.0-amd64 everything worked properly.

By the way I tried this on a cluster with a working calico installation (without typha) and route reflectors.
It would be awesome to have a one-shot pod per master that uses calicoctl to set itself as a route reflector, to avoid manual steps. Any tips on how to do that on kops v1.12? Does the IG spec support custom "scripts"? (kops n00b here).

Finally, some errors I got on the first try:

2019-06-27 14:22:33.606 [INFO][8] watchercache.go 204: Failed to perform list of current data during resync ListRoot="/calico/ipam/v2/host/" error=connection is unauthorized: blockaffinities.crd.projectcalico.org is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot list resource "blockaffinities" in API group "crd.projectcalico.org" at the cluster scope
2019-06-27 14:33:33.387 [INFO][8] watchercache.go 259: Failed to create watcher ListRoot="/calico/ipam/v2/assignment/" error=resource does not exist: {0} with error: the server could not find the requested resource (get IPAMBlocks.crd.projectcalico.org) performFullResync=true
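
The first error above reads as a missing RBAC rule. A rough sketch of the kind of ClusterRole rule involved follows: the resource names come from the two log lines, while the verbs and the placement in the calico-node ClusterRole are assumptions, not copied from any manifest.

```yaml
# Sketch of an extra rule for the ClusterRole bound to the calico-node
# service account. Verbs are an assumption.
rules:
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities   # from the "cannot list resource" error
      - ipamblocks        # from the "get IPAMBlocks" error
    verbs: ["get", "list", "watch"]
```

The second error ("the server could not find the requested resource") suggests the CRD itself is absent, so a rule like this would not be enough on its own; the matching CRDs would also need to exist in the cluster.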

@gjtempleton
Member Author

gjtempleton commented Jun 27, 2019

Hi @semoac,

Thanks for giving it a try and sorry about the issues you had. I made the assumption that #7051 will be merged before this is merged, and that will take care of the extra CRDs required etc. (I mentioned this in passing in an earlier comment, but should have been clearer in the PR description; will add that now.)

Good to know it worked for you once you downgraded the image.

ETA: Stuck a WIP on it until it's fully tested (to answer the question around migrations from an existing 0-Typha setup) and the prerequisite PR is merged.

@gjtempleton gjtempleton changed the title Add ability to configure number of Typha Replicas [WIP] - Add ability to configure number of Typha Replicas Jun 27, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 27, 2019
@semoac

semoac commented Jun 27, 2019

> Hi @semoac,
>
> Thanks for giving it a try and sorry about the issues you had, I made the assumption that #7051 will be merged before this is merged, and that will take care of the extra CRDs required etc. (I throw-away mentioned this in an earlier comment, but should have been clearer in the PR description, will add that now.)
>
> Good to know it worked for you once you downgraded the image.

My mistake.
I overlooked the comment in the PR description. I thought the issue of the version was due to a technical problem rather than simply waiting for an update to the version of Calico that is currently used by Kops v1.12.2.

Thanks for pointing that out to me.

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 30, 2019
Adds the ability to configure the number of Typha replicas when using Calico CNI in 1.12+
to limit the impact of Calico on the APIServer and increase the scalability of the cluster.

Also adds the ability to configure Typha's Prometheus config.

Add Passing TyphaReplicas Validation Test
@gjtempleton gjtempleton force-pushed the 1.12-Typha-Configurability branch from dce7003 to fcc85e2 Compare July 30, 2019 14:25
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 30, 2019
@mikesplain mikesplain mentioned this pull request Sep 7, 2019
3 tasks
@gjtempleton gjtempleton deleted the 1.12-Typha-Configurability branch September 16, 2019 09:59
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Ability to Configure Typha for Calico CNI
5 participants