AKS / UToronto: calico-typha scaled to three replicas, forcing three core nodes #3592

Closed
consideRatio opened this issue Jan 9, 2024 · 1 comment

Comments

@consideRatio (Contributor)

This is similar to #2490, where the same problem is considered for GKE clusters. In the AKS clusters, though, calico is deployed in the "calico-system" namespace via the tigera-operator, which itself runs in a namespace of the same name ("tigera-operator").

I've not yet figured out how to reconfigure it to not require 3 replicas, but it would be good to reduce that number, or to allow at least two of the three pods to co-locate on a node, so we avoid forcing the use of 3 core nodes.

@consideRatio (Contributor, Author) commented Jan 9, 2024

Findings

  • calico-typha pods fail to schedule next to each other because they request host ports; the scheduler reports # node(s) didn't have free ports for the requested pod ports (see the fragment after this list).
  • AKS installs calico via the tigera-operator, and the installation of calico is controlled by an Installation resource named default
  • while Installation resources allow typha configuration under typhaDeployment, we can't declare the number of replicas directly or influence how the tigera-operator calculates that number behind the scenes
  • the tigera-operator developers describe in Support for defining the replica count for Deployments via Helm Chart projectcalico/calico#6100 (comment) that the replica count for calico-typha is calculated based on cluster size
  • the calico-typha autoscaling logic is declared here, deriving the number of replicas from the number of nodes in the cluster.
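
For reference, below is roughly the fragment of the calico-typha pod spec (as rendered by the tigera-operator) that triggers the scheduling failure. The port number 5473 is typha's default and is shown here for illustration only:

```yaml
# Illustrative fragment of the calico-typha container spec rendered by the
# tigera-operator; the hostPort is what prevents two typha pods from being
# scheduled onto the same node.
containers:
  - name: calico-typha
    ports:
      - name: calico-typha
        containerPort: 5473  # typha's default port, assumed here
        hostPort: 5473       # a host port can only be bound by one pod per node
        protocol: TCP
```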

A way forward

There is no great solution; at the moment, only an upstream feature request and mitigations seem reasonable.

  1. Mitigation ideas
    • DONE: Stop using core node pool autoscaling and pin to 2 core nodes
    • (cluster re-creation required) Use 2 CPU / 16 GB nodes instead of the 4 CPU / 32 GB nodes, so that the forced core nodes are cheaper
    • Schedule calico-typha on user nodes as well, by configuring the default Installation resource's typhaDeployment.spec.template.spec.affinity and adding a toleration (see the sketch after this list)
  2. Upstream: request that the tigera-operator supports user-provided config for the autoscaling ladder, like GKE's horizontal autoscaling configmaps adjusted in Decide and document on pod scaling of GKE's clusters calico-typha pods, and konnectivity-agent pods #2490
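
Below is a minimal sketch (untested) of what the third mitigation could look like on the default Installation resource. The typhaDeployment.spec.template.spec.affinity path comes from the Installation API as noted above; whether tolerations can be set the same way, and the user-pool taint key/value shown, are assumptions that would need verifying against our actual clusters:

```yaml
# Sketch of an edit to the existing "default" Installation resource (e.g. via
# `kubectl edit installation default`), letting calico-typha tolerate the taint
# on user nodes so its pods are not confined to the core node pool.
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  typhaDeployment:
    spec:
      template:
        spec:
          tolerations:
            # Assumed user-pool taint; must match what the user node pools use.
            # Note: this may replace (not append to) the operator's default
            # tolerations, which needs checking before rolling it out.
            - key: hub.jupyter.org_dedicated
              operator: Equal
              value: user
              effect: NoSchedule
```

If that works, the host-port constraint is still satisfied, but the replicas can spread over core plus user nodes instead of requiring three core nodes.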
