test: HasInstance reporting accurate global deleting count #7155
base: master
Conversation
...ler/cloudprovider/azure/test/templates/cluster-template-prow-aks-aso-cluster-autoscaler.yaml (outdated, resolved)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Bryce-Soghigian. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Some preliminary review can occur now. Still some kinks to iron out in the framework.
I would like to migrate things over to use test.Deployment from karpenter, but there are some dependency conflicts. I don't want to stay blocked on those conflicts, so I will just do it the old-fashioned way.
/test pull-cluster-autoscaler-e2e-azure
Force-pushed from e90ee07 to fe6bacc
/test pull-cluster-autoscaler-e2e-azure
Force-pushed from d8d15d3 to 2b517a8
// have a VM will be counted as deleted rather than unready due to cluster state checking if a
// node is being deleted based on the existence of that taint.
// Inside the scale-up loop, we call GetUpcomingNodes, which returns how many nodes will be added to each of the node groups.
// https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.30/cluster-autoscaler/clusterstate/clusterstate.go#L987
TODO: Replace with blob link rather than release branch
- fix: updating makefile to include node resource group
- fix: semantics
- fix: update
- fix: node rg
- test: increasing replicas
- test: adding validation that a subsequent scaleup isn't stuck
- fix: using live deployment rather than older object
- test: adding more pods, and validating deployment is scaled up before proceeding
- fix: test should have nodepools that actually can schedule the 100 pods for the test
- fix: including namespace in test.DeploymentOptions call
- fix: using higher mincount and minor refactor to deployment readiness logic
- fix: removing helper
- fix: validating deployment scales up at the end
- test: validating deployment replica counts all go ready and available
- fix: force cas onto systempool
- test: trying new pod spec to get around metrics server scaling issues
- test: removing flakey check
- test: theory
- fix: ci lint
- fix: chasing after a smidgeon of telemetry
- test: simplifying test to one node and lock rather than workload based simulation
- test: using nodepool client to create isolated nodepool for scaleup
- agentpool api requires count
- adding lock client, cluster autoscaler status configmap parsing, and validation that HasInstance doesn't falsely report BeingDeleted
- chore: semantic updates
- fix: conflicting package versions
- test: depedency fix and fixing bug in id parsing
- fix: conflict between karpenter + CAS
- fix: resource group name
- fix: go.mod
- fix: error check for item with two values
- fix: deref :)
- fix
- fix: cluster resource group
- fix: removing locks
- no race check but its not worth the pain of getting rbac
- fix: vm check
- refactor: cleanup some unused clients
- fix: :=
- fix: ns get
Force-pushed from 2b517a8 to 9d63b1d
…ed before proceeding further
Overall looks good, left some questions and suggestions.
@@ -257,7 +257,7 @@ spec:
      cluster-autoscaler-enabled: "true"
      cluster-autoscaler-name: ${CLUSTER_NAME}
      max: "5"
-     min: "1"
+     min: "0"
Does this affect other tests? Need to make sure we don't break existing expectations when modifying template.
Maybe having separate template for scale from zero would be preferred in general, since CAS behavior often differs due to node templating w/ min count == 0
@@ -322,4 +322,4 @@ rules:
    verbs:
    - get
    - list
    - watch
What requires watching (does not work without it)?
not sure
	return Status, nil
}

func CreateNodepool(ctx context.Context, npClient *armcontainerservice.AgentPoolsClient, rg, clusterName, nodepoolName string, agentpool armcontainerservice.AgentPool) (*armcontainerservice.AgentPool, error) {
Is this (and DeleteNodePool) used?
	})
})

func ExpectTaintedSystempool(ctx context.Context, k8sClient client.Client) {
This (and some other helpers) should be extracted out of this specific test
Also, can this be done via template instead?
Yeah, I considered doing this in the template, but that messes with the logic of the other test and I didn't want to muddy that too much.
Figured we can extract these helpers out when they are actually used by something else.
Hmm, I would not expect other tests to touch the system pool either... But, if needed, we could also have multiple templates, no? If you ran into issues with this, let's capture and address them later, but it would be good to start establishing best practices around this.
	return false
}, "5m", "10s").Should(BeTrue(), "Workload should be scheduled on a new node")

By("verifying the new node's ProviderID matches the expected resource ID and the resource exists before placing a delete lock on it")
There are no delete locks involved (anymore)
> There are no delete locks involved (anymore)

do we know if this is true for all of our supported release versions?
The delete locks were part of this test originally; I removed them.
// We should not be reporting this node as being deleted
// even though it has the ToBeDeleted CAS Taint with HasInstance implemented
Wording can be improved
Maybe just: "config map should not show this node as being deleted, despite it having the ToBeDeleted taint"
Expect(err).ToNot(HaveOccurred())
// We should not be reporting this node as being deleted
// even though it has the ToBeDeleted CAS Taint with HasInstance implemented
Expect(newStatus.ClusterWide.Health.NodeCounts.Registered.BeingDeleted).To(
Should we also check where this node should be counted instead? (Unready?)
// newNodes := ar.CurrentTarget - (len(readiness.Ready) + len(readiness.Unready) + len(readiness.LongUnregistered))
//
// We will falsely report the count of newNodes here, which leads to not creating new nodes in the scale-up loop.
// We want to validate that for the Azure provider, HasInstance solves the case where we have a VM that has not been deleted yet,
The full description of why HasInstance is needed in the first place, the implications of it missing, etc. is probably too much for here.
It should be enough to describe what needs testing (something like "VM that still exists is not reported as deleted") and maybe how the test goes about it (something like "scale down, and check that before VM is deleted it counted as Unready rather than Deleted in the autoscaler cluster state, as reported via ConfigMap").
The fact this is achieved internally via HasInstance implementation is almost irrelevant here.
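For readers who do want the full context, the quoted computation can be modeled with plain integers. This is an illustrative sketch, not the autoscaler's actual code; the function name `upcomingNodes` and the clamping to zero are assumptions made for the example.

```go
package main

import "fmt"

// upcomingNodes mirrors the quoted clusterstate computation:
// newNodes := ar.CurrentTarget - (Ready + Unready + LongUnregistered).
// A node miscounted as "deleted" drops out of all three readiness buckets,
// which inflates the result: the scale-up loop then believes replacement
// capacity is already on the way and skips creating new nodes.
func upcomingNodes(currentTarget, ready, unready, longUnregistered int) int {
	n := currentTarget - (ready + unready + longUnregistered)
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	// Target of 5, with 4 ready nodes and 1 unready node: nothing upcoming.
	fmt.Println(upcomingNodes(5, 4, 1, 0)) // 0
	// Same cluster, but the unready node is miscounted as deleted:
	// the autoscaler now thinks a node is still coming.
	fmt.Println(upcomingNodes(5, 4, 0, 0)) // 1
}
```

This is the "falsely report the count of newNodes" failure mode the test comment describes: the VM still exists, but because the node is counted as deleted instead of unready, scale-up is suppressed.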
I'd recommend using this information, but leaving it in reviewer notes for this PR instead. That way, we can refer to this context later. Agree that it's a bit too verbose for code comments.
Spec: appsv1.DeploymentSpec{
	Selector: &metav1.LabelSelector{
		MatchLabels: map[string]string{
			"run": "php-apache",
Yes, pretty much anything would work here, if consistent, but using php-apache is still a bit confusing when deploying pause.
@@ -203,7 +203,7 @@ require (
	gopkg.in/natefinch/lumberjack.v2 v2.2.1 // indirect
	gopkg.in/warnings.v0 v0.1.2 // indirect
	gopkg.in/yaml.v3 v3.0.1 // indirect
-	k8s.io/apiextensions-apiserver v0.0.0 // indirect
+	k8s.io/apiextensions-apiserver v0.29.0 // indirect
This is worrying (and inconsistent with other versions). There don't seem to be any changes outside of test. What caused this?
@@ -70,7 +70,7 @@ var _ = Describe("Azure Provider", func() {
	Expect(k8s.List(ctx, nodes)).To(Succeed())
	nodeCountBefore := len(nodes.Items)

-	By("Creating 100 Pods")
+	By("Creating 30 Pods")
did you update to 30 to reduce the amount of time it took to run this test? just trying to understand since I believe it's unrelated to your HasInstance case
I added a check earlier that validated the replicas were ready, and we would get stuck at 40. From what I observed, the 100 replicas don't all schedule on the two pools of 5 nodes.
I did runs with a higher max count and we could schedule the 100 pods, but decided it's better to lower the pod count instead, so that all the pods get scheduled in each test run.
By("tainting all existing nodes to ensure workload gets scheduled on a new node")
ExpectTaintedSystempool(ctx, k8s)
By("schedule workload to go on the node")
nit: replace "schedule" here with "scheduling a"
Eventually(func() bool {
	Expect(k8s.Get(ctx, client.ObjectKey{Name: newNode.Name}, newNode)).To(Succeed())
	for _, taint := range newNode.Spec.Taints {
		if taint.Key == "ToBeDeletedByClusterAutoscaler" {
			return true
		}
	}
	return false
}, "1m", "1s").Should(BeTrue(), "Node should have ToBeDeletedByClusterAutoscaler taint")
maybe a more generic HasTaint() helper?
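A sketch of what such a helper could look like. The `Taint` and `Node` types below are local stand-ins for `corev1.Taint`/`corev1.Node` (an assumption to keep the example self-contained); the real helper would take the corev1 types and replace the inlined loop over `node.Spec.Taints`.

```go
package main

import "fmt"

// Taint is a stand-in for corev1.Taint; only Key matters for this check.
type Taint struct {
	Key string
}

// Node is a stand-in for corev1.Node, carrying just its taints.
type Node struct {
	Taints []Taint
}

// HasTaint is the generic predicate the review suggests: it reports
// whether the node carries a taint with the given key, so individual
// tests no longer need to inline the loop.
func HasTaint(n Node, key string) bool {
	for _, t := range n.Taints {
		if t.Key == key {
			return true
		}
	}
	return false
}

func main() {
	n := Node{Taints: []Taint{{Key: "ToBeDeletedByClusterAutoscaler"}}}
	fmt.Println(HasTaint(n, "ToBeDeletedByClusterAutoscaler")) // true
	fmt.Println(HasTaint(n, "node.kubernetes.io/unreachable")) // false
}
```

The Eventually block above would then reduce to polling `HasTaint(newNode, "ToBeDeletedByClusterAutoscaler")`.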
}, "1m", "1s").Should(BeTrue(), "Node should have ToBeDeletedByClusterAutoscaler taint")
_, err = vmssVMsClient.Get(ctx, nodeResourceGroup, vmssName, instanceID, nil)
Expect(err).To(BeNil())
By("Expecting cluster autoscaler status to not report this node as BeingDeleted")
nit: lower case "Expecting"
@Bryce-Soghigian: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
namespace = &corev1.Namespace{
	ObjectMeta: metav1.ObjectMeta{
		GenerateName: "azure-e2e-",
I wonder if using a different namespace (maybe `has-instance-e2e`) would sufficiently isolate the basic CAS test from the hasinstance test.
The AfterEach() checks that the namespace is deleted, but I'm unsure how that interacts with the resources within that namespace.
I wonder if some of those resources can end up existing still if the namespace create happens fast enough after the namespace delete...
I'll do two things:
- double check the k8s expected behavior when a namespace is deleted
- if the resources within the namespace are expected to still be there, I'll update the has-instance namespace.
Yes, it is possible for a namespace to be deleted while some resources within it haven't yet been removed. This situation can occur for several reasons:

- Finalizers: Kubernetes uses finalizers to ensure that certain cleanup tasks are completed before a resource is fully deleted. If a namespace has resources with finalizers that haven't been processed, the namespace can get stuck in a terminating state. This means the namespace deletion is initiated, but it won't complete until all finalizers are removed.
- Stuck resources: Sometimes, resources within a namespace might not be properly cleaned up due to issues such as network problems, misconfigurations, or bugs. This can prevent the namespace from being fully deleted.
- Manual cleanup required: In cases where resources are stuck, manual intervention might be required to remove the finalizers or force delete the resources. This can involve using commands like `kubectl edit` to remove finalizers or `kubectl delete` to forcefully remove the resources.
- API server load: High load on the Kubernetes API server can also cause delays in the deletion process, leading to resources not being removed promptly.

If you encounter this issue, you can try the following steps to resolve it:

- Use `kubectl get namespace [namespace-name] -o json` to check for any finalizers.
- If finalizers are present, use `kubectl edit namespace [namespace-name]` to remove them.
- Force delete the namespace using `kubectl delete namespace [namespace-name] --force --grace-period=0`.
Thinking more about the right approach here.
Option 1: We use the same namespace for each set of tests and instead add a finalizer onto each namespace to ensure all of its resources are deleted before it's deleted. With this approach, I'm worried about the namespace taking too long to delete in that case.
Option 2: We use a different namespace for each set of tests. In this case, it should prevent each set from polluting each other, but won't prevent individual cases within a set from polluting each other. If we always have one case per set, this shouldn't be an issue. However, I don't think that's something we can necessarily rely on.
Option 3: Ensure we properly clean up all resources in AfterEach(), similar to Karpenter setup helpers (link).
@@ -76,7 +82,12 @@ func TestE2E(t *testing.T) {
var _ = BeforeSuite(func() {
hmmm I wonder if we also should have an AfterSuite, or if we assume that the resources will be properly deleted by some other method...
I'll check on this, as well.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind test
What this PR does / why we need it:
This PR adds an e2e test that validates the lifecycle of HasInstance. We need to check that nodes are not counted as deleted if the VM is still there.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: