
fix pod equivalency checks for pods with projected volumes #4441

Merged
merged 1 commit into kubernetes:master on Dec 24, 2021

Conversation

marwanad (Member) commented Nov 4, 2021

With the BoundServiceAccountTokenVolume feature being GA, the ServiceAccount admission controller uses volume projection to add a new volume to the pod, rather than the old secret-based volume. This means that two replicas of the same pod will have different PodSpecs, because the volume added by the plugin gets a different random suffix each time.

E1103 21:20:54.328103 1 equivalence_groups.go:81] mismatch (-want +got):
  v1.PodSpec{
   Volumes: []v1.Volume{
   {
-  Name: "kube-api-access-8n6v5",
+  Name: "kube-api-access-7pnsd",
   },
   },
   InitContainers: nil,
   Containers: []v1.Container{
   {
   ... // 7 identical fields
   Env: nil,
   Resources: v1.ResourceRequirements{Limits: v1.ResourceList{s"cpu": {i: resource.int64Amount{value: 500, scale: -3}, s: "500m", Format: "DecimalSI"}}, Requests: v1.ResourceList{s"cpu": {i: resource.int64Amount{value: 200, scale: -3}, s: "200m", Format: "DecimalSI"}}},
   VolumeMounts: []v1.VolumeMount{
   {
-  Name: "kube-api-access-8n6v5",
+  Name: "kube-api-access-7pnsd",
  ReadOnly: true,
   MountPath: "/var/run/secrets/kubernetes.io/serviceaccount",
   ... // 3 identical fields
   },
   },
   VolumeDevices: nil,
   LivenessProbe: nil,
   ... // 10 identical fields
   },
   },
   EphemeralContainers: nil,
   RestartPolicy: "Always",
   ... // 30 identical fields

While this started as an effort to fix the service account projection case, with the fix initially just dropping the service account volume name from the deep-equals check, midway we realized that this in fact affects the majority of cases involving volumes.

Wouldn't the old code imply that, for StatefulSets, you basically end up with O(NumberOfUnschedulablePods) equivalence groups and have to run binpacking on each group?

This also raises the question: do we even need to do an equality check on the pod spec? Wouldn't the ownerRef hash be sufficient for this grouping?

This change substantially reduces the method's runtime, especially for a large unschedulable workload: from hours to just seconds.

I think this strictly affects K8s 1.21+, but I'm not sure about the exact scope.
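For illustration, here is a minimal sketch of the comparison this PR moves towards. The helper below is a simplified stand-in, not the exact code in the PR (the real change also strips the matching VolumeMounts from the containers; see the diff later in the thread):

package core

import (
	apiv1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// podSpecSemanticallyEqual compares two pod specs after dropping projected
// volumes, so the randomly suffixed kube-api-access-* volume added by the
// ServiceAccount admission controller no longer splits otherwise identical
// replicas into separate equivalence groups.
func podSpecSemanticallyEqual(p1, p2 apiv1.PodSpec) bool {
	return apiequality.Semantic.DeepEqual(dropProjectedVolumes(p1), dropProjectedVolumes(p2))
}

// dropProjectedVolumes returns a copy of the spec with projected volumes removed.
func dropProjectedVolumes(spec apiv1.PodSpec) apiv1.PodSpec {
	var kept []apiv1.Volume
	for _, v := range spec.Volumes {
		if v.Projected == nil {
			kept = append(kept, v)
		}
	}
	spec.Volumes = kept // spec was passed by value, so the caller's PodSpec is untouched
	return spec
}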

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 4, 2021
@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch 2 times, most recently from 2561133 to e57a8fc Compare November 4, 2021 01:53
marwanad (Member, Author) commented Nov 4, 2021

@mwielgus @MaciekPytel adding GCP folks for historical context and thoughts

@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch from e57a8fc to ea5b9aa Compare November 4, 2021 02:05
// This is also applicable to a pod without volumes scenario due to the volume
// projection of the default service account onto the kube-api-access volume
func sanitizeVolumesAndMounts(podSpec apiv1.PodSpec) apiv1.PodSpec {
podSpec.Volumes = nil
MaciekPytel (Contributor) commented Nov 8, 2021

Caching the result of scheduler Filters only makes sense if we know the pods are identical as far as scheduling is concerned. As far as I understand, projected volumes don't affect scheduling, so it makes sense to ignore them in the comparison function.
However, PVs do impact scheduling. If a pod is using a PVC that already has a bound PV, the zone where the pod needs to run is determined by where the PV is. IIRC this would generally be the case when using Immediate volumeBindingMode. Even with WaitForFirstConsumer we should make sure that allowedTopologies match.

So I think this implementation is overly broad. We should narrow it down to make sure the pods in an equivalence group really are identical as far as scheduling is concerned. As much as I dislike the added complexity, the most obvious way to do this seems to be selectively filtering out volumes that don't affect scheduling.

marwanad (Member, Author):

Agreed on the broadness comment. We can filter out the projected volumes by name and then null out the corresponding volume mounts in the containers.

One question though: doesn't this mean the performance with StatefulSets is subpar? Every pod gets its own claim name, so in a large-scale PVC scenario building these equivalence groups can take forever. I understand it's necessary because the scheduler needs to consider those PVC constraints, which CA has no visibility into. But is there something better we can do? Maybe short-circuit this method somehow?

Contributor:

Replying to the question:

  • Conceptually, in scale-up we run Filters in two places: initial filtering (checking if pods fit on an empty node in a given nodegroup) and later in binpacking. In each step we run PreFilters once per pod; in initial filtering we also run Filters once, while in binpacking we run them O(node_delta) times.
  • Equivalence groups are just an optimization for the initial filtering. Since we do the check for each pod independently, we opportunistically cache the results and reuse them for identical pods. This idea doesn't really work for StatefulSets, since every pod in a StatefulSet has a different PVC and so may need to run in a different zone (and since binpacking is done on a zonal NodeGroup, the result of initial filtering should actually differ between pods in the SS).
    • Note that this optimization reduces the time spent running PreFilters/Filters in scale-up by at most 50% (as explained above, equivalence groups don't feature in the actual binpacking, and we run the binpacking simulation per-NodeGroup, not per-EquivalenceGroup). We also run PreFilters/Filters earlier in the CA loop, as well as doing a bunch of API calls to the provider and other time-consuming operations. So, while significant, this optimization shouldn't make or break CA performance. If it does, can you share the logs and the function_duration_seconds metric for an example run?

The problem with StatefulSets goes deeper than this, though. CA does the entire binpacking using a NodeGroup in one zone and, if all the pods in the scale-up can run in any zone, it splits the scale-up between zones. If you have a StatefulSet that uses Immediate volumeBindingMode and gets spread over 3 zones, the individual pods in the StatefulSet are no longer 'zone-agnostic' and we can't just calculate the total number of nodes needed and arbitrarily split it between zones. Instead, a zonal binpacking will only accommodate pods in one zone (i.e. 1/3 of all pods) and the rest of the pods will only be handled by the next run of the scale-up logic. In other words, for a StatefulSet you would generally need 3x more autoscaler loops (swap 3 with the number of zones the StatefulSet can schedule into).

So StatefulSets are expected to be much slower, but I don't think this particular piece of logic should be the primary reason for it. However, I suspect the first version of your change may produce much shorter latency for StatefulSets, particularly if you test by just creating a large SS in one go.
I think what happens is that your logic combines with bug #4129. By placing all the pods in a StatefulSet in the same equivalence group you're essentially bypassing the initial filtering phase (it's enough for one pod to fit in a given zone for CA to decide they all fit). #4129 is caused by the fact that the binpacking logic never rechecks the initial filtering and just blindly assumes that anything that passed initial filtering will fit on a new node. This means that any StatefulSet pod that can't run in the zone used in binpacking is translated into an extra node in the scale-up decision (even if multiple pods would fit on a node, each will get a separate node!). CA will overshoot the scale-up by a large factor on every loop, which negates the need to run the extra loops described above and obviously helps the latency a lot. In a single, large scale-up test the end result may look reasonable: CA starts scaling up before all the pods are created, so the fact that CA adds way too many nodes initially is hidden by the fact that more pods are created later on and schedule on those nodes.

To sum up: the fact that pods from a StatefulSet get into different equivalence groups is, I think, correct, because they're not equivalent for scheduling purposes. The poor StatefulSet performance comes from how CA handles multi-zonal clusters, and would probably require doing multi-zonal binpacking (i.e. completely rewriting scale-up...) to fix.

Contributor:

Also, a general recommendation for any PV is to never use Immediate volumeBindingMode with CA and instead use WaitForFirstConsumer. If I understand correctly, this way StatefulSet pods are zone-agnostic and can all be handled in a single scale-up. They will still end up in separate equivalence groups, but that should not have as big an impact by itself (it just requires one extra CheckPredicates call per pod, and shouldn't impact binpacking).

marwanad (Member, Author) commented Nov 9, 2021

Thanks for the detailed response on the StatefulSet case.

The test case (and what the PR is targeting) is a simple nginx deployment on a 1.21 cluster, where the default service account token gets added as a projected volume by default. The scenario made me think about the SS case, but it is tangential.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      run: nginx
  replicas: 40000
  template:
    metadata:
      labels:
        run: nginx
    spec:
      containers:
      - name: nginx
        image: docker.io/library/nginx:1.13.12-alpine
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    run: nginx
spec:
  ports:
  - port: 80
  selector:
    run: nginx
  • CA changes:
diff --git i/cluster-autoscaler/core/equivalence_groups.go w/cluster-autoscaler/core/equivalence_groups.go
--- i/cluster-autoscaler/core/equivalence_groups.go
+++ w/cluster-autoscaler/core/equivalence_groups.go
@@ -17,7 +17,10 @@ limitations under the License.
package core

import (
+	"k8s.io/autoscaler/cluster-autoscaler/metrics"
+	"k8s.io/klog/v2"
	"reflect"
+	"time"

	apiv1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
@@ -34,6 +37,8 @@ type podEquivalenceGroup struct {

// buildPodEquivalenceGroups prepares pod groups with equivalent scheduling properties.
func buildPodEquivalenceGroups(pods []*apiv1.Pod) []*podEquivalenceGroup {
+	klog.V(1).Info("At the start of buildPodEquivalence Group")
+	buildPodEquivalenceGroupStart := time.Now()
	podEquivalenceGroups := []*podEquivalenceGroup{}
	for _, pods := range groupPodsBySchedulingProperties(pods) {
		podEquivalenceGroups = append(podEquivalenceGroups, &podEquivalenceGroup{
@@ -42,6 +47,8 @@ func buildPodEquivalenceGroups(pods []*apiv1.Pod) []*podEquivalenceGroup {
			schedulable:      false,
		})
	}
+     klog.V(1).Infof("Done computing equivalence groups with length: %s", len(podEquivalenceGroups))
+	metrics.UpdateDurationFromStart("buildPodEquivalenceGroups", buildPodEquivalenceGroupStart)
	return podEquivalenceGroups
}
diff --git i/cluster-autoscaler/metrics/metrics.go w/cluster-autoscaler/metrics/metrics.go
--- i/cluster-autoscaler/metrics/metrics.go
+++ w/cluster-autoscaler/metrics/metrics.go
@@ -277,9 +277,9 @@ func UpdateDurationFromStart(label FunctionLabel, start time.Time) {
func UpdateDuration(label FunctionLabel, duration time.Duration) {
	// TODO(maciekpytel): remove second condition if we manage to get
	// asynchronous node drain
-	if duration > LogLongDurationThreshold && label != ScaleDown {
-		klog.V(4).Infof("Function %s took %v to complete", label, duration)
-	}
+	//if duration > LogLongDurationThreshold && label != ScaleDown {
+		klog.V(1).Infof("Function %s took %v to complete", label, duration)
+	//}
	functionDuration.WithLabelValues(string(label)).Observe(duration.Seconds())
	functionDurationSummary.WithLabelValues(string(label)).Observe(duration.Seconds())
}
  • Start CA with 40k pods already pending, no cpu-limits, and a scan-interval of 60s. I understand that an incremental run where pods come up gradually will perform differently from a one-shot run with a large number of pending pods, but I think the bottlenecks here apply to both cases.

  • There are two issues I've hit here:

    • The initial filtering (filterOutSchedulablePodListProcessor) takes a long time to compute, which is expected since it runs the filters per pod. I haven't gotten to digging into what we can optimize there yet (can we build pod equivalence groups and possibly exit early instead of running 40k times?). You can wait for it to complete, or just return unschedulablePods from it, since that would be the end result anyway if you are starting CA at this point in time.
    • buildPodEquivalenceGroups takes a long time as well prior to the changes in this PR.
  • First Run without the changes:

I1109 17:44:48.843677       1 metrics.go:281] Function updateClusterState took 1.63602ms to complete
I1109 17:44:48.889530       1 filter_out_schedulable.go:64] Filtering out schedulables
I1109 18:34:55.929658       1 metrics.go:281] Function filterOutSchedulable took 50m7.040078821s to complete
I1109 18:34:55.929766       1 filter_out_schedulable.go:81] No schedulable pods
I1109 18:34:55.959080       1 klogx.go:86] Pod default/nginx-7485d99856-p9ww8 is unschedulable
I1109 18:34:55.959143       1 klogx.go:86] Pod default/nginx-7485d99856-5b577 is unschedulable
I1109 18:34:55.959154       1 klogx.go:86] Pod default/nginx-7485d99856-vx8h5 is unschedulable
I1109 18:34:55.959162       1 klogx.go:86] Pod default/nginx-7485d99856-m7ndx is unschedulable
I1109 18:34:55.959169       1 klogx.go:86] Pod default/nginx-7485d99856-xwn56 is unschedulable
I1109 18:34:55.959177       1 klogx.go:86] Pod default/nginx-7485d99856-qmqcf is unschedulable
I1109 18:34:55.959184       1 klogx.go:86] Pod default/nginx-7485d99856-pwb9w is unschedulable
I1109 18:34:55.959192       1 klogx.go:86] Pod default/nginx-7485d99856-pwjw9 is unschedulable
I1109 18:34:55.959199       1 klogx.go:86] Pod default/nginx-7485d99856-84xqq is unschedulable
I1109 18:34:55.959206       1 klogx.go:86] Pod default/nginx-7485d99856-5q5sw is unschedulable
I1109 18:34:55.959214       1 klogx.go:86] Pod default/nginx-7485d99856-7jlrc is unschedulable
I1109 18:34:55.959223       1 klogx.go:86] Pod default/nginx-7485d99856-cbwq2 is unschedulable
I1109 18:34:55.959231       1 klogx.go:86] Pod default/nginx-7485d99856-b8v2f is unschedulable
I1109 18:34:55.959238       1 klogx.go:86] Pod default/nginx-7485d99856-x2vdf is unschedulable
I1109 18:34:55.959245       1 klogx.go:86] Pod default/nginx-7485d99856-hhzbh is unschedulable
I1109 18:34:55.959253       1 klogx.go:86] Pod default/nginx-7485d99856-bqlz6 is unschedulable
I1109 18:34:55.959261       1 klogx.go:86] Pod default/nginx-7485d99856-p8clz is unschedulable
I1109 18:34:55.959269       1 klogx.go:86] Pod default/nginx-7485d99856-x9m6s is unschedulable
I1109 18:34:55.959276       1 klogx.go:86] Pod default/nginx-7485d99856-955ph is unschedulable
I1109 18:34:55.959284       1 klogx.go:86] Pod default/nginx-7485d99856-w2hdh is unschedulable
I1109 18:34:55.971382       1 klogx.go:86] 39946 other pods are also unschedulable
I1109 18:34:55.971659       1 equivalence_groups.go:40] At the start of buildPodEquivalence Group
I1109 19:26:27.564098       1 equivalence_groups.go:50] Done computing equivalence groups with length: %!s(int=39891)
I1109 19:26:27.564156       1 metrics.go:281] Function buildPodEquivalenceGroups took 51m31.592478269s to complete

buildPodEquivalenceGroup took 51m31.592478269s in this run.

The filterOutSchedulable duration isn't the focus here but I'd like to tackle this next.

  • Second run with the changes in this PR
I1109 20:13:56.833444       1 metrics.go:281] Function filterOutSchedulable took 44m53.991263278s to complete
I1109 20:13:56.833541       1 filter_out_schedulable.go:81] No schedulable pods
I1109 20:13:56.856611       1 klogx.go:86] Pod default/nginx-7485d99856-9cwbx is unschedulable
I1109 20:13:56.856650       1 klogx.go:86] Pod default/nginx-7485d99856-j8lzg is unschedulable
I1109 20:13:56.856656       1 klogx.go:86] Pod default/nginx-7485d99856-wf2qt is unschedulable
I1109 20:13:56.856662       1 klogx.go:86] Pod default/nginx-7485d99856-xlw6c is unschedulable
I1109 20:13:56.856666       1 klogx.go:86] Pod default/nginx-7485d99856-rqsfb is unschedulable
I1109 20:13:56.856672       1 klogx.go:86] Pod default/nginx-7485d99856-pbg86 is unschedulable
I1109 20:13:56.856677       1 klogx.go:86] Pod default/nginx-7485d99856-nmj74 is unschedulable
I1109 20:13:56.856681       1 klogx.go:86] Pod default/nginx-7485d99856-rnjxl is unschedulable
I1109 20:13:56.856686       1 klogx.go:86] Pod default/nginx-7485d99856-z5qcx is unschedulable
I1109 20:13:56.856691       1 klogx.go:86] Pod default/nginx-7485d99856-8qmfq is unschedulable
I1109 20:13:56.856696       1 klogx.go:86] Pod default/nginx-7485d99856-t8nmt is unschedulable
I1109 20:13:56.856700       1 klogx.go:86] Pod default/nginx-7485d99856-ttdrf is unschedulable
I1109 20:13:56.856705       1 klogx.go:86] Pod default/nginx-7485d99856-mf29b is unschedulable
I1109 20:13:56.856710       1 klogx.go:86] Pod default/nginx-7485d99856-rmw9z is unschedulable
I1109 20:13:56.856715       1 klogx.go:86] Pod default/nginx-7485d99856-5r44s is unschedulable
I1109 20:13:56.856720       1 klogx.go:86] Pod default/nginx-7485d99856-f5jnl is unschedulable
I1109 20:13:56.856725       1 klogx.go:86] Pod default/nginx-7485d99856-twc9f is unschedulable
I1109 20:13:56.856729       1 klogx.go:86] Pod default/nginx-7485d99856-w2st5 is unschedulable
I1109 20:13:56.856734       1 klogx.go:86] Pod default/nginx-7485d99856-f7vp6 is unschedulable
I1109 20:13:56.856739       1 klogx.go:86] Pod default/nginx-7485d99856-6ck5x is unschedulable
I1109 20:13:56.865525       1 klogx.go:86] 39946 other pods are also unschedulable
I1109 20:13:56.865784       1 equivalence_groups.go:40] At the start of buildPodEquivalence Group
I1109 20:13:58.577666       1 equivalence_groups.go:50] Done computing equivalence groups with length: %!s(int=1)
I1109 20:13:58.577713       1 metrics.go:281] Function buildPodEquivalenceGroups took 1.71191175s to complete

You can see that the filterOutSchedulable duration has decreased significantly for that run. I think that without this change the inner loop is effectively quadratic, since you keep adding more groups as you loop through the 40k pods.
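To make the blow-up concrete, here is a rough sketch of the shape of the grouping loop (a hypothetical stand-in, not the actual groupPodsBySchedulingProperties code): each pod is compared against the groups built so far, so when every pod lands in its own group the work grows roughly quadratically with the number of pending pods, while a single shared group keeps it linear.

package core

import apiv1 "k8s.io/api/core/v1"

// groupBySchedulingProperties is a hypothetical illustration of the grouping
// cost, not the CA implementation.
func groupBySchedulingProperties(pods []*apiv1.Pod, equal func(a, b *apiv1.Pod) bool) [][]*apiv1.Pod {
	var groups [][]*apiv1.Pod
	for _, pod := range pods {
		placed := false
		for i, group := range groups {
			// One expensive spec comparison per existing group.
			if equal(pod, group[0]) {
				groups[i] = append(group, pod)
				placed = true
				break
			}
		}
		if !placed {
			groups = append(groups, []*apiv1.Pod{pod})
		}
	}
	// With 40k pods and one group per pod this does roughly 40k*40k/2
	// comparisons; when all replicas collapse into one group (projected
	// volumes ignored) it only needs ~40k comparisons.
	return groups
}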

Contributor:

On that note, I did see FilterOutSchedulable complete significantly faster with CA 1.17 compared to the latest master. I probably won't get to do another run till the new year but it was something I wanted to dig into.

That's not unexpected. In 1.18 the scheduler moved to the 'scheduler framework', and to remain compatible we had to completely rewrite the scheduler integration in CA and a lot of associated code (including the entire filterOutSchedulable logic). The new version is arguably more correct, but we've lost some of our old optimizations (e.g. the ability to completely disable the InterPodAffinity logic if there are no pods in the cluster using it).

It took around 1.5 minute per nodegroup

That is very interesting. After some digging through the code, I reached the following conclusions:

  • It definitely makes me suspect the buildPodEquivalenceGroup cost is not due to filters. DeepEqual seems like the likely culprit, but I'm not sure if there is any way to be completely sure other than instrumenting / profiling it and running the test.
  • Turns out I was completely wrong about filterOutSchedulable - we do negative caching there too, which seems very similar to the one used in podEquivalenceGroups: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/filter_out_schedulable.go#L150. And it also runs a ton of DeepEqual calls in your scenario (we essentially have a hash map where every pod has a hash conflict, so it degenerates into a list).
    • I bet changing https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/utils/pod_schedulable.go#L53 to PodSpecSemanticallyEqual is going to fix the problem with filterOutSchedulable in your test. Would you mind testing and, if it works, adding it to this PR? Alternatively, we could merge this one and do it in a follow-up.
    • I think the fact that we have 2 implementations of essentially the same caching logic is not great. This PR is a great example of how easily an optimization can get applied to one implementation, but not the other. I think the ideal state would be to try and merge the two implementations (and maybe consider using them in estimator too? I see no reason why negative caching wouldn't be applicable there). Obviously this is a much bigger refactor and not in scope of this PR.

marwanad (Member, Author) commented Dec 20, 2021

scheduler framework

Yup, I remember the check to skip the pod affinity predicate since it was costly.

PodSpecSemanticallyEqual still uses DeepEqual, right? I think the culprit is the combination of that #pending_pods^2 factor and the DeepEqual call, all bundled together.

With the same setup, I did one run with the main branch and didn't wait for it to complete in the interest of time. At 30 minutes, it was still running.

After changing the call to PodSpecSemanticallyEqual in pod_schedulable.go (and thus shrinking the lists per controller):

Complete run with both changes:

I1221 11:46:09.059008    1 filter_out_schedulable.go:65] Filtering out schedulables
I1221 11:46:11.187135       1 metrics.go:322] Function filterOutSchedulable took 2.128068927s to complete
I1221 11:46:11.207504       1 klogx.go:86] Pod default/nginx-867649cf6c-r52rk is unschedulable
I1221 11:46:11.207550       1 klogx.go:86] Pod default/nginx-867649cf6c-f7dp8 is unschedulable
I1221 11:46:11.207557       1 klogx.go:86] Pod default/nginx-867649cf6c-7l8vf is unschedulable
I1221 11:46:11.207562       1 klogx.go:86] Pod default/nginx-867649cf6c-pvwf4 is unschedulable
I1221 11:46:11.207568       1 klogx.go:86] Pod default/nginx-867649cf6c-qfdmj is unschedulable
I1221 11:46:11.207572       1 klogx.go:86] Pod default/nginx-867649cf6c-gb2pg is unschedulable
I1221 11:46:11.207578       1 klogx.go:86] Pod default/nginx-867649cf6c-xgjkl is unschedulable
I1221 11:46:11.207582       1 klogx.go:86] Pod default/nginx-867649cf6c-5v2qn is unschedulable
I1221 11:46:11.207587       1 klogx.go:86] Pod default/nginx-867649cf6c-px5p5 is unschedulable
I1221 11:46:11.207592       1 klogx.go:86] Pod default/nginx-867649cf6c-l56v7 is unschedulable
I1221 11:46:11.207597       1 klogx.go:86] Pod default/nginx-867649cf6c-mf5lr is unschedulable
I1221 11:46:11.207603       1 klogx.go:86] Pod default/nginx-867649cf6c-4qvxj is unschedulable
I1221 11:46:11.207608       1 klogx.go:86] Pod default/nginx-867649cf6c-khh22 is unschedulable
I1221 11:46:11.207612       1 klogx.go:86] Pod default/nginx-867649cf6c-kbc29 is unschedulable
I1221 11:46:11.207617       1 klogx.go:86] Pod default/nginx-867649cf6c-l978v is unschedulable
I1221 11:46:11.207622       1 klogx.go:86] Pod default/nginx-867649cf6c-l26vb is unschedulable
I1221 11:46:11.207628       1 klogx.go:86] Pod default/nginx-867649cf6c-r5fsc is unschedulable
I1221 11:46:11.207633       1 klogx.go:86] Pod default/nginx-867649cf6c-mrkhm is unschedulable
I1221 11:46:11.207638       1 klogx.go:86] Pod default/nginx-867649cf6c-lmqmq is unschedulable
I1221 11:46:11.207643       1 klogx.go:86] Pod default/nginx-867649cf6c-njj57 is unschedulable
I1221 11:46:11.222752       1 klogx.go:86] 39795 other pods are also unschedulable
I1221 11:46:11.223188       1 equivalence_groups.go:40] At the start of buildPodEquivalence Group
I1221 11:46:13.350307       1 equivalence_groups.go:50] Done computing equivalence groups with length: 1
I1221 11:46:13.350361       1 metrics.go:322] Function buildPodEquivalenceGroups took 2.127137979s to complete
I1221 11:47:00.024298       1 metrics.go:322] Function Estimate took 46.673038307s to complete
I1221 11:47:45.443291       1 metrics.go:322] Function Estimate took 45.418670444s to complete
I1221 11:47:48.877763       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I1221 11:47:48.877829       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 25.801µs
I1221 11:48:30.754422       1 metrics.go:322] Function Estimate took 45.310845406s to complete
I1221 11:49:14.609391       1 metrics.go:322] Function Estimate took 43.854693687s to complete
I1221 11:49:48.906943       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I1221 11:49:48.909349       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 35.202µs
I1221 11:50:00.243521       1 metrics.go:322] Function Estimate took 45.633812159s to complete
I1221 11:50:00.243571       1 scale_up.go:468] Best option to resize: pool4-36302984
I1221 11:50:00.243587       1 scale_up.go:472] Estimated 1077 nodes needed in pool4-36302984

This was done with 5 nodegroups.

I updated the PR with that change too.

Good catch there! That can probably give us some seconds back in medium-scale scenarios too for those K8s versions.

I think the ideal state would be to try and merge the two implementations (and maybe consider using them in estimator too? I see no reason why negative caching wouldn't be applicable there).

Yeah there's definitely room for logic sharing there. I'll try and revisit this point again in the new year.

Contributor:

PodSemanticallyEqual still uses DeepEquals right? I think the culprit the combination of that #pending_pods^2 factor and DeepEqual call all bundled together.

Yeah, that's what I meant. If all those pods end up as separate entries in podSchedulableMap, each pod is compared with every previously processed pod. With PodSpecSemanticallyEqual there is just one comparison needed per pod.

I wonder if we could come up with some sort of fallback. The caching in podSchedulableMap and equivalenceGroups is meant as an optimization, but as this case demonstrated, it can instead become prohibitively expensive when there are thousands of pods with different scheduling requirements owned by the same controller. Regardless of why those pods are different, at some point the caching becomes too expensive to be worth it. Maybe we could come up with a cutoff point, where we stop caching once the list of 'different' pods from the same controller grows too long? An example implementation could be something like adding if len(podMap[uid]) < maxLength in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/utils/pod_schedulable.go#L80.
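A minimal sketch of that cutoff idea (the constant, the map shape, and the helper names below are assumptions for illustration, not the actual pod_schedulable.go code):

package utils

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// maxVariantsPerController is a hypothetical cutoff: past this many distinct
// pod "classes" for one controller, caching costs more than re-running Filters.
const maxVariantsPerController = 100

type podSchedulableInfo struct {
	spec   apiv1.PodSpec
	labels map[string]string
	err    error
}

type podSchedulableMap map[types.UID][]podSchedulableInfo

func (m podSchedulableMap) set(pod *apiv1.Pod, err error) {
	if len(pod.OwnerReferences) == 0 {
		return // only controller-owned pods are cached
	}
	uid := pod.OwnerReferences[0].UID
	if len(m[uid]) >= maxVariantsPerController {
		// Stop caching: the linear scan over this list now outweighs the
		// benefit, so later pods from this controller are simply re-checked.
		return
	}
	m[uid] = append(m[uid], podSchedulableInfo{spec: pod.Spec, labels: pod.Labels, err: err})
}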

marwanad (Member, Author):

Hmm, interesting idea. So beyond that maxLength, we just consider the remaining pods different?

Contributor:

Yeah, basically once we reach a certain list length we don't append anymore. We risk missing out on some caching opportunity, but once the list gets long enough the cost of comparisons exceeds the cost of running Filters and so the caching hurts us more than it helps.

In practice, my guess is that either all pods in a given controller are identical (or there are 2 variants in case of some ongoing update), or there is some issue which makes them all different (like the one we have here). I can't really think of a case where a single controller has 100+ "classes" of pods, with multiple pods in each class. So my guess is that if the list gets very long, it probably means we don't get anything out of caching anyway.

cluster-autoscaler/core/equivalence_groups.go (outdated review thread, resolved)
@marwanad marwanad changed the title fix incorrect pod equivalency check for pods with volumes fix incorrect pod equivalency check for pods with projected volumes Nov 8, 2021
@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch from ea5b9aa to 1a4e562 Compare November 9, 2021 00:53
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 9, 2021
@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch from 1a4e562 to 1998b4d Compare November 9, 2021 00:55
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 9, 2021
@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch 3 times, most recently from 985e66d to 4e5c11d Compare November 9, 2021 01:24
@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch from 4e5c11d to 0e66339 Compare November 9, 2021 20:30
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 9, 2021
@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch from 0e66339 to 5304d71 Compare November 9, 2021 20:45
marwanad (Member, Author):

@MaciekPytel friendly ping on this PR and my response above with the test results.

marwanad (Member, Author):

@MaciekPytel @towca @mwielgus please?

/assign @MaciekPytel

MaciekPytel (Contributor):

/lgtm
/approve
/hold
Sorry for the delay. I added a hold in case you want to continue the discussion on this PR (I'll try to be more responsive this time...). Please feel free to cancel it and merge.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 16, 2021
@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch from 5304d71 to b0b7ca0 Compare December 17, 2021 11:50
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 17, 2021
@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch from b0b7ca0 to 2421e47 Compare December 20, 2021 21:57
@marwanad marwanad changed the title fix incorrect pod equivalency check for pods with projected volumes fix pod equivalency checks for pods with projected volumes Dec 20, 2021
@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch from 2421e47 to 23280bd Compare December 20, 2021 21:58
MaciekPytel (Contributor):

/lgtm
Leaving the hold so you can add the unit test you mentioned in your last comment.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 21, 2021
@marwanad marwanad force-pushed the fix-pod-equivalence-perf branch from 23280bd to 286f44e Compare December 21, 2021 15:09
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 21, 2021
podInRc1_2.OwnerReferences = GenerateOwnerReferences(rc1.Name, "ReplicationController", "extensions/v1beta1", rc1.UID)
err, found = pMap.Get(podInRc1_2)
assert.True(t, found)
assert.Nil(t, err)

// A replica in rc1 with a projected volume
marwanad (Member, Author):

Newly added test cases.


// sanitizeProjectedVolumesAndMounts returns a pod spec with projected volumes
// and their mounts removed
func sanitizeProjectedVolumesAndMounts(podSpec apiv1.PodSpec) apiv1.PodSpec {
marwanad (Member, Author):

I updated this method to build a local list of the volumes to keep and assign that, instead of overwriting projected volumes with empty structs.

The reason is that previously, comparing:

p1Spec.Volumes = nil
p2Spec.Volumes = []Volume{{}} // where {} is a former projected volume that was overwritten with an empty struct

would be false.
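A quick standalone illustration of that pitfall (not code from the PR):

package main

import (
	"fmt"
	"reflect"

	apiv1 "k8s.io/api/core/v1"
)

func main() {
	var a []apiv1.Volume    // volume removed entirely: nil slice
	b := []apiv1.Volume{{}} // projected volume blanked in place: one zero-valued element
	// Prints "false": the two specs would never compare as equal,
	// which is why the new code rebuilds the slice instead.
	fmt.Println(reflect.DeepEqual(a, b))
}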

Here's the diff:

 func sanitizeProjectedVolumesAndMounts(podSpec apiv1.PodSpec) apiv1.PodSpec {
-       var projectedVolumeNames []string
-       for i, v := range podSpec.Volumes {
-               if v.Projected != nil {
-                       projectedVolumeNames = append(projectedVolumeNames, v.Name)
-                       podSpec.Volumes[i] = apiv1.Volume{}
+       projectedVolumeNames := map[string]bool{}
+       var volumes []apiv1.Volume
+       for _, v := range podSpec.Volumes {
+               if v.Projected == nil {
+                       volumes = append(volumes, v)
+               } else {
+                       projectedVolumeNames[v.Name] = true
                }
        }
+       podSpec.Volumes = volumes

-       for _, pv := range projectedVolumeNames {
-               for c := range podSpec.Containers {
-                       for v, mount := range podSpec.Containers[c].VolumeMounts {
-                               if mount.Name == pv {
-                                       podSpec.Containers[c].VolumeMounts[v] = apiv1.VolumeMount{}
-                               }
+       for i := range podSpec.Containers {
+               var volumeMounts []apiv1.VolumeMount
+               for _, mount := range podSpec.Containers[i].VolumeMounts {
+                       if ok := projectedVolumeNames[mount.Name]; !ok {
+                               volumeMounts = append(volumeMounts, mount)
                        }
                }
+               podSpec.Containers[i].VolumeMounts = volumeMounts
        }
-
        return podSpec

limitations under the License.
*/

package utils
marwanad (Member, Author):

Added this file as well.

MaciekPytel (Contributor):

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 24, 2021
k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel, marwanad

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

MaciekPytel (Contributor):

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 24, 2021
@k8s-ci-robot k8s-ci-robot merged commit 6d19e3d into kubernetes:master Dec 24, 2021