AKS / UToronto: calico-typha scaled to three replicas, forcing three core nodes #3592

Closed
consideRatio opened this issue Jan 9, 2024 · 1 comment

Comments

@consideRatio (Contributor)

This is similar to #2490, where the same problem is considered for GKE clusters. In the AKS clusters, though, calico is deployed in the "calico-system" namespace via the tigera-operator, which itself runs in a namespace of the same name ("tigera-operator").

I've not yet figured out how to reconfigure it to not require 3 replicas, but it would be good to reduce that number, or to allow at least two of the three pods to co-locate on a node, so we avoid forcing the use of 3 core nodes.

@consideRatio (Contributor, Author) commented Jan 9, 2024

Findings

  • calico-typha pods fail to schedule next to each other because they request host ports; the scheduler reports # node(s) didn't have free ports for the requested pod ports (see the fragment after this list).
  • AKS installs calico via the tigera-operator, and the installation of calico is controlled by an Installation resource named default
  • while Installation resources allow typha configuration under typhaDeployment, we can't declare the number of replicas directly or influence how the tigera-operator calculates that number behind the scenes
  • the tigera-operator developers describe in Support for defining the replica count for Deployments via Helm Chart projectcalico/calico#6100 (comment) that the replica count for calico-typha is calculated based on cluster size
  • the calico-typha autoscaling logic is declared here, deriving the number of replicas from the number of nodes in the cluster.
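
For reference, below is roughly the fragment of the calico-typha pod spec (as rendered by the tigera-operator) that triggers the scheduling failure. The port number 5473 is typha's default and is shown here for illustration only:

```yaml
# Illustrative fragment of the calico-typha container spec rendered by the
# tigera-operator; the hostPort is what prevents two typha pods from being
# scheduled onto the same node.
containers:
  - name: calico-typha
    ports:
      - name: calico-typha
        containerPort: 5473  # typha's default port, assumed here
        hostPort: 5473       # a host port can only be bound by one pod per node
        protocol: TCP
```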

A way forward

There is no great solution; at the moment, only an upstream feature request and mitigations seem reasonable.

  1. Mitigation ideas
    • DONE: Stop using core node pool autoscaling and pin to 2 core nodes
    • (cluster re-creation required) Use 2 CPU / 16 GB nodes instead of the 4 CPU / 32 GB nodes, so that the forced core nodes are cheaper
    • Schedule calico-typha on user nodes as well, by configuring the default Installation resource's typhaDeployment.spec.template.spec.affinity and adding a toleration (see the sketch after this list)
  2. Upstream: request that the tigera-operator supports user-provided config for the autoscaling ladder, like GKE's horizontal autoscaling configmaps adjusted in Decide and document on pod scaling of GKE's clusters calico-typha pods, and konnectivity-agent pods #2490
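
Below is a minimal sketch (untested) of what the third mitigation could look like on the default Installation resource. The typhaDeployment.spec.template.spec.affinity path comes from the Installation API as noted above; whether tolerations can be set the same way, and the user-pool taint key/value shown, are assumptions that would need verifying against our actual clusters:

```yaml
# Sketch of an edit to the existing "default" Installation resource (e.g. via
# `kubectl edit installation default`), letting calico-typha tolerate the taint
# on user nodes so its pods are not confined to the core node pool.
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  typhaDeployment:
    spec:
      template:
        spec:
          tolerations:
            # Assumed user-pool taint; must match what the user node pools use.
            # Note: this may replace (not append to) the operator's default
            # tolerations, which needs checking before rolling it out.
            - key: hub.jupyter.org_dedicated
              operator: Equal
              value: user
              effect: NoSchedule
```

If that works, the host-port constraint is still satisfied, but the replicas can spread over core plus user nodes instead of requiring three core nodes.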
