
test: HasInstance reporting accurate global deleting count #7155

Open
Bryce-Soghigian wants to merge 9 commits into master from bsoghigian/has-instance-e2e

Conversation

@Bryce-Soghigian (Member) commented Aug 12, 2024

What type of PR is this?

/kind test

What this PR does / why we need it:

This PR adds an e2e test that validates the lifecycle of HasInstance. We need to check that nodes are not counted as deleted while the backing VM still exists.
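
At a high level, the core assertion the test builds toward looks roughly like this (a minimal sketch only; parseStatusConfigMap is a hypothetical stand-in for the ConfigMap-parsing helper added in this PR, and the status fields mirror the ones referenced in the review below):

    // After scaling the isolated nodepool down, the node carries the
    // ToBeDeletedByClusterAutoscaler taint but its VM still exists, so the
    // cluster-autoscaler-status ConfigMap must not count it as BeingDeleted.
    newStatus, err := parseStatusConfigMap(ctx, k8s) // hypothetical helper
    Expect(err).ToNot(HaveOccurred())
    Expect(newStatus.ClusterWide.Health.NodeCounts.Registered.BeingDeleted).To(BeZero())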

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 12, 2024
@k8s-ci-robot k8s-ci-robot requested a review from feiskyer August 12, 2024 17:23
@k8s-ci-robot k8s-ci-robot requested a review from nilo19 August 12, 2024 17:23
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 12, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 4, 2024
@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Bryce-Soghigian
Once this PR has been reviewed and has the lgtm label, please assign bigdarkclown for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 5, 2024
@Bryce-Soghigian Bryce-Soghigian changed the title [WIP DONT REVIEW!!!] test: HasInstance reporting accurate global deleting count test: HasInstance reporting accurate global deleting count Sep 5, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 5, 2024
@Bryce-Soghigian (Member, Author) commented Sep 5, 2024

Some preliminary review can happen now. There are still some kinks to iron out in the framework.

@Bryce-Soghigian (Member, Author) commented Sep 5, 2024

I would like to migrate things over to use test.Deployment from Karpenter, but there are some dependency conflicts. I don't want to stay blocked on those, so I will just do it the old-fashioned way (see the plain-Deployment sketch after the snippet below).

		deployment := test.Deployment(test.DeploymentOptions{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "php-apache",
				Namespace: namespace.Name,
			},
			Replicas: 30,
			PodOptions: test.PodOptions{
				Image: "registry.k8s.io/hpa-example",
				ResourceRequirements: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						corev1.ResourceCPU: resource.MustParse("500m"),
					},
					Requests: corev1.ResourceList{
						corev1.ResourceCPU: resource.MustParse("200m"),
					},
				},
			},
		})
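
For reference, a sketch of the "old-fashioned way" mentioned above, i.e. a plain appsv1.Deployment built from the same values. This assumes the suite's ctx, k8s client, and namespace variables, and reuses the php-apache name/labels from the snippet and review comments; it is an illustration, not the exact code in the PR:

    // Plain Deployment equivalent of the test.Deployment snippet above
    // (sketch; assumes appsv1, corev1, metav1, and resource imports from
    // k8s.io/api and k8s.io/apimachinery are already in scope).
    replicas := int32(30)
    deployment := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "php-apache",
            Namespace: namespace.Name,
        },
        Spec: appsv1.DeploymentSpec{
            Replicas: &replicas,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"run": "php-apache"},
            },
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels: map[string]string{"run": "php-apache"},
                },
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{
                        Name:  "php-apache",
                        Image: "registry.k8s.io/hpa-example",
                        Resources: corev1.ResourceRequirements{
                            Limits: corev1.ResourceList{
                                corev1.ResourceCPU: resource.MustParse("500m"),
                            },
                            Requests: corev1.ResourceList{
                                corev1.ResourceCPU: resource.MustParse("200m"),
                            },
                        },
                    }},
                },
            },
        },
    }
    Expect(k8s.Create(ctx, deployment)).To(Succeed())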


@Bryce-Soghigian (Member, Author):
/test pull-cluster-autoscaler-e2e-azure

@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/has-instance-e2e branch from e90ee07 to fe6bacc Compare September 5, 2024 23:14
@Bryce-Soghigian (Member, Author):
/test pull-cluster-autoscaler-e2e-azure

@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/has-instance-e2e branch 4 times, most recently from d8d15d3 to 2b517a8 Compare September 9, 2024 21:06
// have a VM will be counted as deleted rather than unready due to cluster state checking if a
// node is being deleted based on the existence of that taint.
// Inside the scale-up loop, we call GetUpcomingNodes, which returns how many nodes will be added to each of the node groups.
// https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.30/cluster-autoscaler/clusterstate/clusterstate.go#L987
Member Author:
TODO: Replace with blob link rather than release branch

fix: updating makefile to include node resource group

fix: semantics

fix: update

fix: node rg

test: increasing replicas

test: adding validation that a subsequent scaleup isn't stuck

fix: using live deployment rather than older object

test: adding more pods, and validating deployment is scaled up before proceeding

fix: test should have nodepools that actually can schedule the 100 pods for the test

fix: including namespace in test.DeploymentOptions call

fix: using higher mincount and minor refactor to deployment readiness logic

fix: removing helper

fix: validating deployment scales up at the end

test: validating deployment replica counts all go ready and available

fix: force cas onto systempool

test: trying new pod spec to get around metrics server scaling issues

test: removing flaky check

test: theory

fix: ci lint

fix: chasing after a smidgeon of telemetry

test: simplifying test to one node and lock rather than workload based simulation

test: using nodepool client to create isolated nodepool for scaleup

agentpool api requires count

adding lock client, cluster autoscaler status configmap parsing, and validation that HasInstance doesn't falsely report BeingDeleted

chore: semantic updates

fix: conflicting package versions

test: dependency fix and fixing bug in id parsing

fix: conflict between karpenter + CAS

fix: resource group name

fix: go.mod

fix: error check for item with two values

fix: deref :)

fix

fix: cluster resource group

fix: removing locks no race check but its not worth the pain of getting rbac

fix: vm check

refactor: cleanup some unused clients

fix: :=

fix: ns get
@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/has-instance-e2e branch from 2b517a8 to 9d63b1d Compare September 9, 2024 21:11
@tallaxes (Contributor) left a comment:
Overall looks good, left some questions and suggestions.

@@ -257,7 +257,7 @@ spec:
cluster-autoscaler-enabled: "true"
cluster-autoscaler-name: ${CLUSTER_NAME}
max: "5"
min: "1"
min: "0"
@tallaxes (Contributor) commented Sep 10, 2024:
Does this affect other tests? We need to make sure we don't break existing expectations when modifying the template.

Contributor:
Maybe having a separate template for scale-from-zero would be preferable in general, since CAS behavior often differs due to node templating with min count == 0.

@@ -322,4 +322,4 @@ rules:
verbs:
- get
- list
- watch
- watch
Contributor:
What requires watching (does not work without it)?

Member Author:
not sure

return Status, nil
}

func CreateNodepool(ctx context.Context, npClient *armcontainerservice.AgentPoolsClient, rg, clusterName, nodepoolName string, agentpool armcontainerservice.AgentPool) (*armcontainerservice.AgentPool, error) {
Contributor:
Is this (and DeleteNodePool) used?

})
})

func ExpectTaintedSystempool(ctx context.Context, k8sClient client.Client) {
Contributor:
This (and some other helpers) should be extracted out of this specific test

Contributor:
Also, can this be done via template instead?

Member Author:
Yeah, I considered doing this in the template, but that messes with the logic of the other test and I didn't want to muck with it too much.

Member Author:
Figured we can extract these helpers out when they are actually used by something else.

Contributor:
Hmm, I would not expect other tests to be touching the system pool either... But if needed, we could also have multiple templates, no? If you ran into an issue with this, let's capture it and address it later, but it would be good to start establishing best practices around this.

return false
}, "5m", "10s").Should(BeTrue(), "Workload should be scheduled on a new node")

By("verifying the new node's ProviderID matches the expected resource ID and the resource exists before placing a delete lock on it")
Contributor:
There are no delete locks involved (anymore)

Contributor:
There are no delete locks involved (anymore)

do we know if this is true for all of our supported release versions?

Member Author:
The delete locks were part of this test originally; I removed them.

Comment on lines 188 to 189
// We should not be reporting this node as being deleted
// even though it has the ToBeDeleted CAS Taint with HasInstance implemented
Contributor:
Wording can be improved

Contributor:
Maybe just: "config map should not show this node as being deleted, despite it having the ToBeDeleted taint"

Expect(err).ToNot(HaveOccurred())
// We should not be reporting this node as being deleted
// even though it has the ToBeDeleted CAS Taint with HasInstance implemented
Expect(newStatus.ClusterWide.Health.NodeCounts.Registered.BeingDeleted).To(
Contributor:
Should we also check where this node should be counted instead? (Unready?)

// newNodes := ar.CurrentTarget - (len(readiness.Ready) + len(readiness.Unready) + len(readiness.LongUnregistered))
//
// We will falsely report the count of newNodes here, which leads to not creating new nodes in the scale-up loop.
// We want to validate that for the Azure provider, HasInstance solves the case where we have a VM that has not been deleted yet,
Contributor:
The full description of why HasInstance is needed in the first place, the implications of it missing, etc. is probably too much for here.

It should be enough to describe what needs testing (something like "VM that still exists is not reported as deleted") and maybe how the test goes about it (something like "scale down, and check that before VM is deleted it counted as Unready rather than Deleted in the autoscaler cluster state, as reported via ConfigMap").

The fact this is achieved internally via HasInstance implementation is almost irrelevant here.

Contributor:
I'd recommend using this information, but leaving it in reviewer notes for this PR instead. That way, we can refer to this context later. Agree that it's a bit too verbose for code comments.

Spec: appsv1.DeploymentSpec{
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{
"run": "php-apache",
Contributor:
Yes, pretty much anything would work here, if consistent, but using php-apache is still a bit confusing when deploying pause.

@@ -203,7 +203,7 @@ require (
gopkg.in/natefinch/lumberjack.v2 v2.2.1 // indirect
gopkg.in/warnings.v0 v0.1.2 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
-k8s.io/apiextensions-apiserver v0.0.0 // indirect
+k8s.io/apiextensions-apiserver v0.29.0 // indirect
Contributor:
This is worrying (and inconsistent with other versions). There don't seem to be any changes outside of test. What caused this?

@@ -70,7 +70,7 @@ var _ = Describe("Azure Provider", func() {
Expect(k8s.List(ctx, nodes)).To(Succeed())
nodeCountBefore := len(nodes.Items)

By("Creating 100 Pods")
By("Creating 30 Pods")
Contributor:
Did you update to 30 to reduce the amount of time it takes to run this test? Just trying to understand, since I believe it's unrelated to your HasInstance case.

Member Author:
I added a check earlier that verified the replicas were ready, and we would get stuck at 40. From what I observed, the 100 replicas don't fit on the two pools of 5 nodes each.

I did runs with a higher max count and we could schedule the 100 pods, but I decided it's better to lower the pod count so that all the pods get scheduled in each test run.


By("tainting all existing nodes to ensure workload gets scheduled on a new node")
ExpectTaintedSystempool(ctx, k8s)
By("schedule workload to go on the node")
Contributor:
nit: replace "schedule" here with "scheduling a"


Comment on lines 174 to 182
Eventually(func() bool {
Expect(k8s.Get(ctx, client.ObjectKey{Name: newNode.Name}, newNode)).To(Succeed())
for _, taint := range newNode.Spec.Taints {
if taint.Key == "ToBeDeletedByClusterAutoscaler" {
return true
}
}
return false
}, "1m", "1s").Should(BeTrue(), "Node should have ToBeDeletedByClusterAutoscaler taint")
Contributor:
maybe a more generic HasTaint() helper?
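
Something along these lines could work as a generic helper (a sketch with a hypothetical name, assuming corev1 from k8s.io/api/core/v1 is imported):

    // hasTaint reports whether the node currently carries a taint with the given key.
    func hasTaint(node *corev1.Node, key string) bool {
        for _, taint := range node.Spec.Taints {
            if taint.Key == key {
                return true
            }
        }
        return false
    }

    // The Eventually block above would then reduce to:
    // Eventually(func() bool {
    //     Expect(k8s.Get(ctx, client.ObjectKey{Name: newNode.Name}, newNode)).To(Succeed())
    //     return hasTaint(newNode, "ToBeDeletedByClusterAutoscaler")
    // }, "1m", "1s").Should(BeTrue(), "Node should have ToBeDeletedByClusterAutoscaler taint")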

}, "1m", "1s").Should(BeTrue(), "Node should have ToBeDeletedByClusterAutoscaler taint")
_, err = vmssVMsClient.Get(ctx, nodeResourceGroup, vmssName, instanceID, nil)
Expect(err).To(BeNil())
By("Expecting cluster autoscaler status to not report this node as BeingDeleted")
Contributor:
nit: lower case "Expecting"


@k8s-ci-robot (Contributor):
@Bryce-Soghigian: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-autoscaler-e2e-azure-master · Commit: 44b8f28 · Required: false · Rerun command: /test pull-cluster-autoscaler-e2e-azure-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


namespace = &corev1.Namespace{
ObjectMeta: metav1.ObjectMeta{
GenerateName: "azure-e2e-",
Contributor:
I wonder if using a different namespace (maybe has-instance-e2e) would sufficiently isolate the basic CAS test from the hasinstance test.

The AfterEach() checks that the namespace is deleted, but I'm unsure how that interacts with the resources within that namespace.

I wonder if some of those resources can still exist if the namespace create happens quickly enough after the namespace delete...

I'll do two things:

  1. double check the k8s expected behavior when a namespace is deleted
  2. if the resources within the namespace are expected to still be there, I'll update the has-instance namespace.

Contributor:
Yes, it is possible for a namespace to be deleted while some resources within it haven't yet been removed. This situation can occur due to several reasons:

  1. Finalizers: Kubernetes uses finalizers to ensure that certain cleanup tasks are completed before a resource is fully deleted. If a namespace has resources with finalizers that haven't been processed, the namespace can get stuck in a terminating state. This means the namespace deletion is initiated, but it won't complete until all finalizers are removed.

  2. Stuck Resources: Sometimes, resources within a namespace might not be properly cleaned up due to issues such as network problems, misconfigurations, or bugs. This can prevent the namespace from being fully deleted.

  3. Manual Cleanup Required: In cases where resources are stuck, manual intervention might be required to remove the finalizers or force delete the resources. This can involve using commands like kubectl edit to remove finalizers or kubectl delete to forcefully remove the resources.

  4. API Server Load: High load on the Kubernetes API server can also cause delays in the deletion process, leading to resources not being removed promptly.

If you encounter this issue, you can try the following steps to resolve it:

  • Use kubectl get namespace [namespace-name] -o json to check for any finalizers.
  • If finalizers are present, use kubectl edit namespace [namespace-name] to remove them.
  • Force delete the namespace using kubectl delete namespace [namespace-name] --force --grace-period=0.

@rakechill (Contributor) commented Sep 25, 2024:
Thinking more about the right approach here.

Option 1: We use the same namespace for each set of tests and instead add a finalizer onto each namespace to ensure all of its resources are deleted before it's deleted. With this approach, I'm worried about the namespace taking too long to delete.

Option 2: We use a different namespace for each set of tests. In this case, it should prevent each set from polluting each other, but won't prevent individual cases within a set from polluting each other. If we always have one case per set, this shouldn't be an issue. However, I don't think that's something we can necessarily rely on.

Option 3: Ensure we properly clean up all resources in AfterEach() similar to Karpenter setup helpers (link).
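
A rough sketch of what Option 3 could look like (assuming the suite-level ctx, k8s client, and namespace variables used elsewhere in this test, plus apierrors from k8s.io/apimachinery/pkg/api/errors and client from sigs.k8s.io/controller-runtime/pkg/client):

    var _ = AfterEach(func() {
        // Delete the per-test namespace and wait until it is actually gone, so
        // leftover Deployments/Pods cannot leak into the next spec.
        Expect(client.IgnoreNotFound(k8s.Delete(ctx, namespace))).To(Succeed())
        Eventually(func() bool {
            err := k8s.Get(ctx, client.ObjectKey{Name: namespace.Name}, &corev1.Namespace{})
            return apierrors.IsNotFound(err)
        }, "5m", "5s").Should(BeTrue(), "namespace and its resources should be fully deleted")
    })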

@@ -76,7 +82,12 @@ func TestE2E(t *testing.T) {
var _ = BeforeSuite(func() {
Contributor:
Hmm, I wonder if we should also have an AfterSuite, or if we assume that the resources will be properly deleted by some other method... I'll check on this as well.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2024
@k8s-ci-robot (Contributor):
PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
4 participants