
Load balancers getting deleted randomly on deletion of ingress records with a whole diff groupname #3304

Closed
someshkoli opened this issue Aug 2, 2023 · 19 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@someshkoli

someshkoli commented Aug 2, 2023

Describe the bug
This has happened twice now, and the sequence of events was the same both times.

We had a few Helm releases in namespace=namespace1 that created Ingress records. The group name attached to those Ingress records was group1.

There were a few other Helm releases in namespace=namespace2 that also created Ingress records. The group name attached to these Ingress records was group2. (A minimal sketch of this setup is below.)
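
For reference, a rough sketch of how the grouping is expressed (resource name and host are placeholders, not taken from our cluster): each Ingress carries the alb.ingress.kubernetes.io/group.name annotation, which is what ties it to group1 or group2.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app            # placeholder name
  namespace: namespace2        # Ingresses in namespace1 used group1 instead
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/group.name: group2
spec:
  rules:
    - host: app.example.com    # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80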

Now the Ingress records in namespace2 were failing to reconcile due to the following error:

{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}

We never paid attention to this until today when we saw the logs.

Since the Helm releases in namespace1 were stale, we went ahead and deleted all of them, which deleted all of their Ingress records. (We assume this also triggers reconciliation of those Ingress records in the controller.)

This resulted in the Ingress records in namespace2 getting deleted (we don't know how or why).
While debugging I found an audit log entry where the ALB controller patches the finalizers on these Ingresses to null (not pasting it here right now, let me know if it's needed).
In the ALB controller logs I found the following lines:

{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}
{"level":"info","ts":"2023-08-02T08:34:23Z","logger":"controllers.ingress","msg":"successfully built model","model":"{\"id\":\"group2\",\"resources\":{}}"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleting loadBalancer","arn":"arn:aws:elasticloadbalancing:us-east-1:9999999999999999loadbalancer/app/k8s-group2-awdawdawdaw/awdawdawdaw"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleted securityGroup","securityGroupID":"sg-054d"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"successfully deployed model","ingressGroup":"group1"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleted loadBalancer","arn":"arn:aws:elasticloadbalancing:us-east-1:9999999999999999loadbalancer/app/k8s-group2-awdawdawdaw/awdawdawdaw"}

Steps to reproduce
Mentioned above ^

Expected outcome
Ingresses / load balancers in group2 should not get deleted when a deletion is triggered for group1
Environment
production

  • AWS Load Balancer controller version: 2.5.2
  • Kubernetes version: 1.23
  • Using EKS (yes/no), if so version? yes

Additional Context:

@someshkoli
Author

someshkoli commented Aug 2, 2023

From what I found after going through the code:

// isSDKLoadBalancerRequiresReplacement decides whether the existing ALB has to be
// deleted and recreated: it returns true when the desired spec differs from the
// live load balancer in a field the ELBv2 API treats as immutable (type or scheme).
func isSDKLoadBalancerRequiresReplacement(sdkLB LoadBalancerWithTags, resLB *elbv2model.LoadBalancer) bool {
	if string(resLB.Spec.Type) != awssdk.StringValue(sdkLB.LoadBalancer.Type) {
		return true
	}
	if resLB.Spec.Scheme != nil && string(*resLB.Spec.Scheme) != awssdk.StringValue(sdkLB.LoadBalancer.Scheme) {
		return true
	}
	return false
}

This piece of code can mark the load balancer for replacement/deletion when the spec does not match. My hunch is that whatever caused the log line below knocked the spec out of sync, making this function return true.

{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}

@oliviassss
Collaborator

@someshkoli, Hi, do you have multiple controllers in different namespaces?

@someshkoli
Author

@someshkoli, Hi, do you have multiple controllers in different namespaces?

@oliviassss no, only one controller

@M00nF1sh
Collaborator

M00nF1sh commented Aug 2, 2023

@someshkoli

  1. Did the Ingress objects in your namespace2 get deleted, or just the ALB for group2? If the Ingress objects themselves were deleted, it must have been triggered from your end, since our controller does not delete Ingress objects. You can check the audit logs to see which user/component triggered the Ingress deletion.

  2. As for the minimum field value of 1, CreateTargetGroupInput.Port error, this is unexpected. Could you post more logs, especially the lines with "successfully built model", which contain a large JSON-encoded model?

  3. As for the code you mentioned, the replacement logic only triggers when you change the scheme or the load balancer type; since these fields are immutable in the ELB APIs, we have to recreate a replacement. However, that's not the case per your logs, since the model is empty ("model":"{\"id\":\"group2\",\"resources\":{}}"), which means every Ingress in group2 is in a "deleting" state.

BTW, in general each Ingress group is reconciled independently; changing one Ingress group shouldn't impact another.

@someshkoli
Author

someshkoli commented Aug 2, 2023

@M00nF1sh Hey,

  1. I'm not entirely sure what exactly happened; the Ingress record was missing, and all I found in the audit log was a patch request from the controller to the Ingress record setting finalizers=null.

  2. It may be; I had seen this error randomly before, but seeing it right before the deletion line made me curious. (PS: I saw it last time as well, when the exact same situation happened.) Here's the JSON model I found in the logs:
    {"level":"info","ts":"2023-08-02T08:34:19Z","logger":"controllers.ingress","msg":"successfully built model","model":{"id":"test-ingress","resources":{"AWS::EC2::SecurityGroup":{"ManagedLBSecurityGroup":{"spec":{"groupName":"k8s-group2-907ce91fe3","description":"[k8s] Managed SecurityGroup for LoadBalancer","ingress":[{"ipProtocol":"tcp","fromPort":443,"toPort":443,"ipRanges":[{"cidrIP":"0.0.0.0/0"}]},{"ipProtocol":"tcp","fromPort":80,"toPort":80,"ipRanges":[{"cidrIP":"0.0.0.0/0"}]}]}}},"AWS::ElasticLoadBalancingV2::Listener":{"80":{"spec":{"loadBalancerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::LoadBalancer/LoadBalancer/status/loadBalancerARN"},"port":80,"protocol":"HTTP","defaultActions":[{"type":"fixed-response","fixedResponseConfig":{"contentType":"text/plain","statusCode":"404"}}]}},"443":{"spec":{"loadBalancerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::LoadBalancer/LoadBalancer/status/loadBalancerARN"},"port":443,"protocol":"HTTPS","defaultActions":[{"type":"fixed-response","fixedResponseConfig":{"contentType":"text/plain","statusCode":"404"}}],"certificates":[{"certificateARN":"arn:aws:acm:us-east-1:999499138329:certificate/60f46466-c2f4-43e7-a30f-fa201b99f8ba"}],"sslPolicy":"ELBSecurityPolicy-2016-08"}}},"AWS::ElasticLoadBalancingV2::ListenerRule":{"443:1":{"spec":{"listenerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::Listener/443/status/listenerARN"},"priority":1,"actions":[{"type":"forward","forwardConfig":{"targetGroups":[{"targetGroupARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::TargetGroup/helm/app1:8081/status/targetGroupARN"}}]}}],"conditions":[{"field":"host-header","hostHeaderConfig":{"values":["app1.test-domain.com"]}},{"field":"path-pattern","pathPatternConfig":{"values":["*"]}}]}},"80:1":{"spec":{"listenerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::Listener/80/status/listenerARN"},"priority":1,"actions":[{"type":"redirect","redirectConfig":{"port":"443","protocol":"HTTPS","statusCode":"HTTP_301"}}],"conditions":[{"field":"host-header","hostHeaderConfig":{"values":["app1.test-domain.com"]}},{"field":"path-pattern","pathPatternConfig":{"values":["*"]}}]}}},"AWS::ElasticLoadBalancingV2::LoadBalancer":{"LoadBalancer":{"spec":{"name":"k8s-group1-997d1c003f","type":"application","scheme":"internet-facing","ipAddressType":"ipv4","subnetMapping":[{"subnetID":"subnet-00000000000000000"},{"subnetID":"subnet-00000000000000000"}],"securityGroups":[{"$ref":"#/resources/AWS::EC2::SecurityGroup/ManagedLBSecurityGroup/status/groupID"},"sg-00000000000000000"]}}},"AWS::ElasticLoadBalancingV2::TargetGroup":{"helm/app1:8081":{"spec":{"name":"k8s-helm-app1-c423627cdc","targetType":"instance","port":0,"protocol":"HTTP","protocolVersion":"HTTP1","ipAddressType":"ipv4","healthCheckConfig":{"port":"traffic-port","protocol":"HTTP","path":"/","matcher":{"httpCode":"200"},"intervalSeconds":15,"timeoutSeconds":5,"healthyThresholdCount":2,"unhealthyThresholdCount":2}}}},"K8S::ElasticLoadBalancingV2::TargetGroupBinding":{"helm/app1:8081":{"spec":{"template":{"metadata":{"name":"k8s-helm-app1-c423627cdc","namespace":"helm","creationTimestamp":null},"spec":{"targetGroupARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::TargetGroup/helm/app1:8081/status/targetGroupARN"},"targetType":"instance","serviceRef":{"name":"app1","port":8081},"networking":{"ingress":[{"from":[{"securityGroup":{"groupID":"sg-00000000000000000"}}],"ports":[{"protocol":"TCP","port":0}]}]},"ipAddressType":"ipv4"}}}}}}}}

However, that's not the case per your logs since the model is empty ,"model":"{"id":"group2","resources":{}}"}, which means all Ingress in group2 is in "deleting state"

Yes, exactly my concern: how did this happen in the first place? I'm assuming this is what caused the LB to get marked as deleted -> causing the controller to patch the finalizers on the Ingress record to null -> which queued the Ingress for deletion.
I haven't looked into the code yet, but is it possible that the broken model (the error I sent you) is somehow getting applied, setting the entire resource model to empty {} and causing the scheme condition to take the deletion path?

BTW, In general, each Ingress group is reconciled independently, changing one Ingress group shouldn't impact another.

That is how it's supposed to behave; I've no idea why this happened (twice).

@johngmyers
Contributor

I'm having trouble figuring out where that line could be logged from.

@someshkoli
Author

@johngmyers which one ?

minimum field value of 1, CreateTargetGroupInput.Port.\n"}

This one? I found that this error pops up when you have alb.ingress.kubernetes.io/target-type: instance with an underlying Service of type ClusterIP, which is fair, but the error is a bit misleading. It is well documented, though; see the sketch below.
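
For anyone hitting the same error, here is a minimal sketch of the combination described above (names, namespace and ports are placeholders loosely based on the model posted earlier): with target-type: instance the controller registers worker nodes through the Service's node port, and a ClusterIP Service has none, so the built model ends up with a target group port of 0 and ELBv2 rejects it with the CreateTargetGroupInput.Port validation error.

apiVersion: v1
kind: Service
metadata:
  name: app1             # placeholder
  namespace: helm
spec:
  type: ClusterIP        # no nodePort is allocated for ClusterIP
  selector:
    app: app1
  ports:
    - port: 8081
      targetPort: 8081
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app1
  namespace: helm
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/group.name: group2
    alb.ingress.kubernetes.io/target-type: instance   # expects a node port; use "ip" or a NodePort Service instead
spec:
  rules:
    - host: app1.test-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app1
                port:
                  number: 8081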

@someshkoli
Author

someshkoli commented Aug 8, 2023

Another interesting thing I found while trying to replicate this whole scenario.

PS: this is a whole new thing; I might raise a separate issue for it.

When a faulty Ingress (i1) is applied with group g1 and host entry h1 -> reconciliation fails -> no ALB is allocated -> then another faulty Ingress (i2) is applied with group g1 and host entry h2.

You will notice that Ingress record i2 now has host entry h1. I thought this was a reconciliation issue that would get fixed once the fault in the Ingress was fixed, but after fixing the fault it still kept host h1 in Ingress i2 💀

PS: by "faulty" above I mean setting alb.ingress.kubernetes.io/target-type: instance on a ClusterIP Service (sketched below).
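
To make the repro concrete, a rough sketch of the two Ingresses (i1/i2, hosts h1/h2 and group g1 are the placeholders from the description; the backing Services and hosts here are made up):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: i1
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/group.name: g1
    alb.ingress.kubernetes.io/target-type: instance   # the fault: backing Service is ClusterIP
spec:
  rules:
    - host: h1.example.com    # "h1"
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: svc1    # ClusterIP Service
                port:
                  number: 80
---
# i2 is identical apart from name: i2, host: h2.example.com ("h2") and service: svc2.
# Per the observation above, after applying i2 its rules showed host h1 instead of h2,
# and fixing the target-type fault did not restore h2.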

@blakebarnett

blakebarnett commented Sep 14, 2023

We also encountered this issue. An Ingress resource was deleted in namespace1 and the LoadBalancers for 3 ingresses in namespace2 were deleted. This caused an outage for 3 services; the Ingress resources for these 3 didn't change, other than the LB hostname status field eventually going blank.

Annotations in use for the 3 ingresses that had their LoadBalancers incorrectly deleted:

    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-west-2:xxxxxxxxxxxxx:certificate/xxxxxxxxx
    alb.ingress.kubernetes.io/healthcheck-path: /
    alb.ingress.kubernetes.io/healthcheck-port: "80"
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    kubernetes.io/ingress.class: alb

The Service type is ClusterIP in all cases.

This occurred on v2.4.5 (image version), helm chart v1.4.6

It is extremely alarming that this can happen.

@someshkoli
Author

someshkoli commented Sep 14, 2023

We also encountered this issue. An Ingress resource was deleted in namespace1 and LoadBalancers for 3 ingresses in namespace2 were deleted. [...]

Finally, someone who can relate. We've had such an outage twice, and since there was no update on this conversation I had started to think I might have deleted it by mistake somehow.

I tried reproducing it but couldn't. What about you?

@johngmyers
Contributor

Controller logs for the sync that did the inappropriate deletion would be helpful.

@blakebarnett

Unfortunately logs for this controller weren't being shipped at the time and the pods were restarted during troubleshooting so we lost them. I do have the CloudTrail events that show that the IRSA role the controller was using is what did the deletion, but not much other than that.

@someshkoli
Author

I have the container logs, let me know if you want me to send them.
Hopefully they don't contain any sensitive information.
PS: they're the raw logs, I haven't touched them.

@blakebarnett

Oh, also: I should note that deleting and recreating the Ingress resources fixed it immediately. I've been testing v2.6.1 in a dev cluster; I manually deleted the AWS LoadBalancer resources, and the controller started throwing 403 IAM errors like this:

{"level":"error","ts":"2023-09-14T17:52:06Z","msg":"Reconciler error","controller":"ingress","object":{"name":"cd-demo-frontend","namespace":"development"},"namespace":"development","name":"cd-demo-frontend","reconcileID":"4ecd3e0a-5acd-47ab-8127-6f4fdd1fc6d6","error":"AccessDenied: User: arn:aws:sts::XXXXXXX:assumed-role/alb-ingress-irsa-role/XXXXX is not authorized to perform: elasticloadbalancing:AddTags on resource: arn:aws:elasticloadbalancing:us-west-2:XXXXXX:targetgroup/k8s-developm-cddemofr-d994612186/* because no identity-based policy allows the elasticloadbalancing:AddTags action\n\tstatus code: 403, request id: 2a3686f4-d682-4fe4-b3b3-54e1e7be32ec"}

I waited > 10 hours for the default --sync-period but it didn't recreate them.

@oliviassss
Collaborator

@blakebarnett, this is a separate issue, see: #3383 (comment)

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 28, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 28, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:


/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
