
test: HasInstance reporting accurate global deleting count #7155

Open
Bryce-Soghigian wants to merge 9 commits into master from bsoghigian/has-instance-e2e

Conversation

@Bryce-Soghigian (Member) commented Aug 12, 2024

What type of PR is this?

/kind test

What this PR does / why we need it:

This PR adds an e2e test that validates the lifecycle of HasInstance. We need to check that nodes are not counted as deleted while the backing VM still exists.
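
At a high level, the core assertion the test builds toward looks roughly like this (a minimal sketch only; parseStatusConfigMap is a hypothetical stand-in for the ConfigMap-parsing helper added in this PR, and the status fields mirror the ones referenced in the review below):

    // After scaling the isolated nodepool down, the node carries the
    // ToBeDeletedByClusterAutoscaler taint but its VM still exists, so the
    // cluster-autoscaler-status ConfigMap must not count it as BeingDeleted.
    newStatus, err := parseStatusConfigMap(ctx, k8s) // hypothetical helper
    Expect(err).ToNot(HaveOccurred())
    Expect(newStatus.ClusterWide.Health.NodeCounts.Registered.BeingDeleted).To(BeZero())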

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 12, 2024
@k8s-ci-robot k8s-ci-robot requested a review from feiskyer August 12, 2024 17:23
@k8s-ci-robot k8s-ci-robot requested a review from nilo19 August 12, 2024 17:23
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 12, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 4, 2024
@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Bryce-Soghigian
Once this PR has been reviewed and has the lgtm label, please assign bigdarkclown for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 5, 2024
@Bryce-Soghigian Bryce-Soghigian changed the title [WIP DONT REVIEW!!!] test: HasInstance reporting accurate global deleting count test: HasInstance reporting accurate global deleting count Sep 5, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 5, 2024
@Bryce-Soghigian (Member, Author) commented Sep 5, 2024

Some preliminary review can happen now. There are still some kinks to iron out in the framework.

@Bryce-Soghigian (Member, Author) commented Sep 5, 2024

I would like to migrate things over to use test.Deployment from Karpenter, but there are some dependency conflicts. I don't want to stay blocked on those, so I will just do it the old-fashioned way (see the plain-Deployment sketch after the snippet below).

		deployment := test.Deployment(test.DeploymentOptions{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "php-apache",
				Namespace: namespace.Name,
			},
			Replicas: 30,
			PodOptions: test.PodOptions{
				Image: "registry.k8s.io/hpa-example",
				ResourceRequirements: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						corev1.ResourceCPU: resource.MustParse("500m"),
					},
					Requests: corev1.ResourceList{
						corev1.ResourceCPU: resource.MustParse("200m"),
					},
				},
			},
		})
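
For reference, a sketch of the "old-fashioned way" mentioned above, i.e. a plain appsv1.Deployment built from the same values. This assumes the suite's ctx, k8s client, and namespace variables, and reuses the php-apache name/labels from the snippet and review comments; it is an illustration, not the exact code in the PR:

    // Plain Deployment equivalent of the test.Deployment snippet above
    // (sketch; assumes appsv1, corev1, metav1, and resource imports from
    // k8s.io/api and k8s.io/apimachinery are already in scope).
    replicas := int32(30)
    deployment := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "php-apache",
            Namespace: namespace.Name,
        },
        Spec: appsv1.DeploymentSpec{
            Replicas: &replicas,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"run": "php-apache"},
            },
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels: map[string]string{"run": "php-apache"},
                },
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{
                        Name:  "php-apache",
                        Image: "registry.k8s.io/hpa-example",
                        Resources: corev1.ResourceRequirements{
                            Limits: corev1.ResourceList{
                                corev1.ResourceCPU: resource.MustParse("500m"),
                            },
                            Requests: corev1.ResourceList{
                                corev1.ResourceCPU: resource.MustParse("200m"),
                            },
                        },
                    }},
                },
            },
        },
    }
    Expect(k8s.Create(ctx, deployment)).To(Succeed())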


@Bryce-Soghigian (Member, Author):
/test pull-cluster-autoscaler-e2e-azure

@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/has-instance-e2e branch from e90ee07 to fe6bacc Compare September 5, 2024 23:14
@Bryce-Soghigian (Member, Author):
/test pull-cluster-autoscaler-e2e-azure

@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/has-instance-e2e branch 4 times, most recently from d8d15d3 to 2b517a8 Compare September 9, 2024 21:06
// have a VM will be counted as deleted rather than unready due to cluster state checking if a
// node is being deleted based on the existence of that taint.
// Inside the scale-up loop, we call GetUpcomingNodes, which returns how many nodes will be added to each of the node groups.
// https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.30/cluster-autoscaler/clusterstate/clusterstate.go#L987
Member Author:
TODO: Replace with blob link rather than release branch

fix: updating makefile to include node resource group

fix: semantics

fix: update

fix: node rg

test: increasing replicas

test: adding validation that a subsequent scaleup isn't stuck

fix: using live deployment rather than older object

test: adding more pods, and validating deployment is scaled up before proceeding

fix: test should have nodepools that actually can schedule the 100 pods for the test

fix: including namespace in test.DeploymentOptions call

fix: using higher mincount and minor refactor to deployment readiness logic

fix: removing helper

fix: validating deployment scales up at the end

test: validating deployment replica counts all go ready and available

fix: force cas onto systempool

test: trying new pod spec to get around metrics server scaling issues

test: removing flaky check

test: theory

fix: ci lint

fix: chasing after a smidgeon of telemetry

test: simplifying test to one node and lock rather than workload based simulation

test: using nodepool client to create isolated nodepool for scaleup

agentpool api requires count

adding lock client, cluster autoscaler status configmap parsing, and validation that HasInstance doesn't falsely report BeingDeleted

chore: semantic updates

fix: conflicting package versions

test: dependency fix and fixing bug in id parsing

fix: conflict between karpenter + CAS

fix: resource group name

fix: go.mod

fix: error check for item with two values

fix: deref :)

fix

fix: cluster resource group

fix: removing locks no race check but its not worth the pain of getting rbac

fix: vm check

refactor: cleanup some unused clients

fix: :=

fix: ns get
@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/has-instance-e2e branch from 2b517a8 to 9d63b1d Compare September 9, 2024 21:11
@tallaxes (Contributor) left a comment:
Overall looks good, left some questions and suggestions.

@@ -257,7 +257,7 @@ spec:
cluster-autoscaler-enabled: "true"
cluster-autoscaler-name: ${CLUSTER_NAME}
max: "5"
min: "1"
min: "0"
@tallaxes (Contributor) commented Sep 10, 2024:
Does this affect other tests? We need to make sure we don't break existing expectations when modifying the template.

Contributor:
Maybe having a separate template for scale-from-zero would be preferable in general, since CAS behavior often differs due to node templating with min count == 0.

@@ -322,4 +322,4 @@ rules:
verbs:
- get
- list
- watch
- watch
Contributor:
What requires watching (does not work without it)?

Member Author:
not sure

return Status, nil
}

func CreateNodepool(ctx context.Context, npClient *armcontainerservice.AgentPoolsClient, rg, clusterName, nodepoolName string, agentpool armcontainerservice.AgentPool) (*armcontainerservice.AgentPool, error) {
Contributor:
Is this (and DeleteNodePool) used?

})
})

func ExpectTaintedSystempool(ctx context.Context, k8sClient client.Client) {
Contributor:
This (and some other helpers) should be extracted out of this specific test

Contributor:
Also, can this be done via template instead?

Member Author:
Yeah, I considered doing this in the template, but that messes with the logic of the other test and I didn't want to muck with it too much.

Member Author:
Figured we can extract these helpers out when they are actually used by something else.

Contributor:
Hmm, I would not expect other tests to be touching the system pool either... But if needed, we could also have multiple templates, no? If you ran into an issue with this, let's capture it and address it later, but it would be good to start establishing best practices around this.

return false
}, "5m", "10s").Should(BeTrue(), "Workload should be scheduled on a new node")

By("verifying the new node's ProviderID matches the expected resource ID and the resource exists before placing a delete lock on it")
Contributor:
There are no delete locks involved (anymore)

Contributor:
There are no delete locks involved (anymore)

do we know if this is true for all of our supported release versions?

Member Author:
The delete locks were part of this test originally; I removed them.

Comment on lines 188 to 189
// We should not be reporting this node as being deleted
// even though it has the ToBeDeleted CAS Taint with HasInstance implemented
Contributor:
Wording can be improved

Contributor:
Maybe just: "config map should not show this node as being deleted, despite it having the ToBeDeleted taint"

Expect(err).ToNot(HaveOccurred())
// We should not be reporting this node as being deleted
// even though it has the ToBeDeleted CAS Taint with HasInstance implemented
Expect(newStatus.ClusterWide.Health.NodeCounts.Registered.BeingDeleted).To(
Contributor:
Should we also check where this node should be counted instead? (Unready?)

// newNodes := ar.CurrentTarget - (len(readiness.Ready) + len(readiness.Unready) + len(readiness.LongUnregistered))
//
// We will falsely report the count of newNodes here, which leads to not creating new nodes in the scale-up loop.
// We want to validate that for the Azure provider, HasInstance solves the case where we have a VM that has not been deleted yet,
Contributor:
The full description of why HasInstance is needed in the first place, the implications of it missing, etc. is probably too much for here.

It should be enough to describe what needs testing (something like "VM that still exists is not reported as deleted") and maybe how the test goes about it (something like "scale down, and check that before VM is deleted it counted as Unready rather than Deleted in the autoscaler cluster state, as reported via ConfigMap").

The fact this is achieved internally via HasInstance implementation is almost irrelevant here.

Contributor:
I'd recommend using this information, but leaving it in reviewer notes for this PR instead. That way, we can refer to this context later. Agree that it's a bit too verbose for code comments.

Spec: appsv1.DeploymentSpec{
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{
"run": "php-apache",
Contributor:
Yes, pretty much anything would work here, if consistent, but using php-apache is still a bit confusing when deploying pause.

@@ -203,7 +203,7 @@ require (
gopkg.in/natefinch/lumberjack.v2 v2.2.1 // indirect
gopkg.in/warnings.v0 v0.1.2 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
-k8s.io/apiextensions-apiserver v0.0.0 // indirect
+k8s.io/apiextensions-apiserver v0.29.0 // indirect
Contributor:
This is worrying (and inconsistent with other versions). There don't seem to be any changes outside of test. What caused this?

@@ -70,7 +70,7 @@ var _ = Describe("Azure Provider", func() {
Expect(k8s.List(ctx, nodes)).To(Succeed())
nodeCountBefore := len(nodes.Items)

By("Creating 100 Pods")
By("Creating 30 Pods")
Contributor:
Did you update to 30 to reduce the amount of time it takes to run this test? Just trying to understand, since I believe it's unrelated to your HasInstance case.

Member Author:
I added a check earlier that verified the replicas were ready, and we would get stuck at 40. From what I observed, the 100 replicas don't fit on the two pools of 5 nodes each.

I did runs with a higher max count and we could schedule the 100 pods, but I decided it's better to lower the pod count so that all the pods get scheduled in each test run.


By("tainting all existing nodes to ensure workload gets scheduled on a new node")
ExpectTaintedSystempool(ctx, k8s)
By("schedule workload to go on the node")
Contributor:
nit: replace "schedule" here with "scheduling a"


Comment on lines 174 to 182
Eventually(func() bool {
Expect(k8s.Get(ctx, client.ObjectKey{Name: newNode.Name}, newNode)).To(Succeed())
for _, taint := range newNode.Spec.Taints {
if taint.Key == "ToBeDeletedByClusterAutoscaler" {
return true
}
}
return false
}, "1m", "1s").Should(BeTrue(), "Node should have ToBeDeletedByClusterAutoscaler taint")
Contributor:
maybe a more generic HasTaint() helper?
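
Something along these lines could work as a generic helper (a sketch with a hypothetical name, assuming corev1 from k8s.io/api/core/v1 is imported):

    // hasTaint reports whether the node currently carries a taint with the given key.
    func hasTaint(node *corev1.Node, key string) bool {
        for _, taint := range node.Spec.Taints {
            if taint.Key == key {
                return true
            }
        }
        return false
    }

    // The Eventually block above would then reduce to:
    // Eventually(func() bool {
    //     Expect(k8s.Get(ctx, client.ObjectKey{Name: newNode.Name}, newNode)).To(Succeed())
    //     return hasTaint(newNode, "ToBeDeletedByClusterAutoscaler")
    // }, "1m", "1s").Should(BeTrue(), "Node should have ToBeDeletedByClusterAutoscaler taint")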

}, "1m", "1s").Should(BeTrue(), "Node should have ToBeDeletedByClusterAutoscaler taint")
_, err = vmssVMsClient.Get(ctx, nodeResourceGroup, vmssName, instanceID, nil)
Expect(err).To(BeNil())
By("Expecting cluster autoscaler status to not report this node as BeingDeleted")
Contributor:
nit: lower case "Expecting"


@k8s-ci-robot (Contributor):
@Bryce-Soghigian: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-autoscaler-e2e-azure-master · Commit: 44b8f28 · Required: false · Rerun command: /test pull-cluster-autoscaler-e2e-azure-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


namespace = &corev1.Namespace{
ObjectMeta: metav1.ObjectMeta{
GenerateName: "azure-e2e-",
Contributor:
I wonder if using a different namespace (maybe has-instance-e2e) would sufficiently isolate the basic CAS test from the hasinstance test.

The AfterEach() checks that the namespace is deleted, but I'm unsure how that interacts with the resources within that namespace.

I wonder if some of those resources can still exist if the namespace create happens quickly enough after the namespace delete...

I'll do two things:

  1. double check the k8s expected behavior when a namespace is deleted
  2. if the resources within the namespace are expected to still be there, I'll update the has-instance namespace.

Contributor:
Yes, it is possible for a namespace to be deleted while some resources within it haven't yet been removed. This situation can occur due to several reasons:

  1. Finalizers: Kubernetes uses finalizers to ensure that certain cleanup tasks are completed before a resource is fully deleted. If a namespace has resources with finalizers that haven't been processed, the namespace can get stuck in a terminating state. This means the namespace deletion is initiated, but it won't complete until all finalizers are removed.

  2. Stuck Resources: Sometimes, resources within a namespace might not be properly cleaned up due to issues such as network problems, misconfigurations, or bugs. This can prevent the namespace from being fully deleted.

  3. Manual Cleanup Required: In cases where resources are stuck, manual intervention might be required to remove the finalizers or force delete the resources. This can involve using commands like kubectl edit to remove finalizers or kubectl delete to forcefully remove the resources.

  4. API Server Load: High load on the Kubernetes API server can also cause delays in the deletion process, leading to resources not being removed promptly.

If you encounter this issue, you can try the following steps to resolve it:

  • Use kubectl get namespace [namespace-name] -o json to check for any finalizers.
  • If finalizers are present, use kubectl edit namespace [namespace-name] to remove them.
  • Force delete the namespace using kubectl delete namespace [namespace-name] --force --grace-period=0.

@rakechill (Contributor) commented Sep 25, 2024:
Thinking more about the right approach here.

Option 1: We use the same namespace for each set of tests and instead add a finalizer onto each namespace to ensure all of its resources are deleted before it's deleted. With this approach, I'm worried about the namespace taking too long to delete.

Option 2: We use a different namespace for each set of tests. In this case, it should prevent each set from polluting each other, but won't prevent individual cases within a set from polluting each other. If we always have one case per set, this shouldn't be an issue. However, I don't think that's something we can necessarily rely on.

Option 3: Ensure we properly clean up all resources in AfterEach() similar to Karpenter setup helpers (link).
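
A rough sketch of what Option 3 could look like (assuming the suite-level ctx, k8s client, and namespace variables used elsewhere in this test, plus apierrors from k8s.io/apimachinery/pkg/api/errors and client from sigs.k8s.io/controller-runtime/pkg/client):

    var _ = AfterEach(func() {
        // Delete the per-test namespace and wait until it is actually gone, so
        // leftover Deployments/Pods cannot leak into the next spec.
        Expect(client.IgnoreNotFound(k8s.Delete(ctx, namespace))).To(Succeed())
        Eventually(func() bool {
            err := k8s.Get(ctx, client.ObjectKey{Name: namespace.Name}, &corev1.Namespace{})
            return apierrors.IsNotFound(err)
        }, "5m", "5s").Should(BeTrue(), "namespace and its resources should be fully deleted")
    })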

@@ -76,7 +82,12 @@ func TestE2E(t *testing.T) {
var _ = BeforeSuite(func() {
Contributor:
Hmm, I wonder if we should also have an AfterSuite, or if we assume that the resources will be properly deleted by some other method... I'll check on this as well.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2024
@k8s-ci-robot (Contributor):
PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
4 participants