The logic in `buildPodEquivalenceGroups` and `filterOutSchedulable` groups pods by their scheduling requirements, as a scalability optimization. This is done by first grouping by the controller UID, and then comparing pod specs for pods from one controller. If there's something in the pod spec that's unique to a single pod within a controller, every pod ends up in a group of its own, and the optimization breaks.
In extreme cases when there are a lot of such pods (a couple thousand can be enough), CA can spend so long in a single loop iteration that it fails its health checks and is killed by the kubelet. Then everything repeats once it comes back up, and CA is effectively broken until the pods are scheduled or deleted.
One trigger for pod specs being different is the `BoundServiceAccountTokenVolume` feature, which injects uniquely-named projected volumes into each pod's spec. This was taken into account by CA in #4441.
We've just run into another one: Jobs using `completionMode: Indexed`. In this mode, each pod gets a unique, indexed hostname in its spec. This is documented here: https://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode. AFAIU the hostname shouldn't affect scheduling, so sanitizing it in `PodSpecSemanticallyEqual` should be enough to fix this particular issue.
However, this approach of "fixing" single fields as issues pop up doesn't scale very well. We should come up with a more generic solution to these kinds of problems. One idea could be a cutoff for the number of groups within one controller, as proposed in #4441 (comment).