
Load balancers getting deleted randomly on deletion of ingress records with a whole diff groupname #3304

Closed
someshkoli opened this issue Aug 2, 2023 · 19 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@someshkoli

someshkoli commented Aug 2, 2023

Describe the bug
This has happened twice now, and the sequence of events was the same both times.

We had a few Helm releases in namespace=namespace1 that created Ingress records. The group name attached to those Ingress records was group1.

There were a few other Helm releases in namespace=namespace2 that also created Ingress records. The group name attached to these Ingress records was group2. (A minimal sketch of this setup is below.)
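
For reference, a rough sketch of how the grouping is expressed (resource name and host are placeholders, not taken from our cluster): each Ingress carries the alb.ingress.kubernetes.io/group.name annotation, which is what ties it to group1 or group2.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app            # placeholder name
  namespace: namespace2        # Ingresses in namespace1 used group1 instead
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/group.name: group2
spec:
  rules:
    - host: app.example.com    # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80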

Now the Ingress records in namespace2 were failing to reconcile due to the following error:

{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}

We never paid attention to this until today when we saw the logs.

Since the Helm releases in namespace1 were stale, we went ahead and deleted all of them, which deleted all of their Ingress records. (We assume this also triggers reconciliation of those Ingress records in the controller.)

This resulted in the Ingress records in namespace2 getting deleted (we don't know how or why).
While debugging I found an audit log entry where the ALB controller patches the finalizers on these Ingresses to null (not pasting it here right now, let me know if it's needed).
In the ALB controller logs I found the following lines:

{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}
{"level":"info","ts":"2023-08-02T08:34:23Z","logger":"controllers.ingress","msg":"successfully built model","model":"{\"id\":\"group2\",\"resources\":{}}"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleting loadBalancer","arn":"arn:aws:elasticloadbalancing:us-east-1:9999999999999999loadbalancer/app/k8s-group2-awdawdawdaw/awdawdawdaw"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleted securityGroup","securityGroupID":"sg-054d"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"successfully deployed model","ingressGroup":"group1"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleted loadBalancer","arn":"arn:aws:elasticloadbalancing:us-east-1:9999999999999999loadbalancer/app/k8s-group2-awdawdawdaw/awdawdawdaw"}

Steps to reproduce
Mentioned above ^

Expected outcome
Ingresses / load balancers in group2 should not get deleted when a deletion is triggered for group1
Environment
production

  • AWS Load Balancer controller version: 2.5.2
  • Kubernetes version: 1.23
  • Using EKS (yes/no), if so version? yes

Additional Context:

@someshkoli
Author

someshkoli commented Aug 2, 2023

From what I found after going through the code:

// isSDKLoadBalancerRequiresReplacement decides whether the existing ALB has to be
// deleted and recreated: it returns true when the desired spec differs from the
// live load balancer in a field the ELBv2 API treats as immutable (type or scheme).
func isSDKLoadBalancerRequiresReplacement(sdkLB LoadBalancerWithTags, resLB *elbv2model.LoadBalancer) bool {
	if string(resLB.Spec.Type) != awssdk.StringValue(sdkLB.LoadBalancer.Type) {
		return true
	}
	if resLB.Spec.Scheme != nil && string(*resLB.Spec.Scheme) != awssdk.StringValue(sdkLB.LoadBalancer.Scheme) {
		return true
	}
	return false
}

This piece of code can mark the load balancer for replacement/deletion when the spec does not match. My hunch is that whatever caused the log line below knocked the spec out of sync, making this function return true.

{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}

@oliviassss
Collaborator

@someshkoli, Hi, do you have multiple controllers in different namespaces?

@someshkoli
Author

@someshkoli, Hi, do you have multiple controllers in different namespaces?

@oliviassss no, only one controller

@M00nF1sh
Collaborator

M00nF1sh commented Aug 2, 2023

@someshkoli

  1. Did the Ingress objects in your namespace2 get deleted, or just the ALB for group2? If the Ingress objects themselves were deleted, it must have been triggered from your end, since our controller does not delete Ingress objects. You can check the audit logs to see which user/component triggered the Ingress deletion.

  2. As for the minimum field value of 1, CreateTargetGroupInput.Port error, this is unexpected. Could you post more logs, especially the lines with "successfully built model", which contain a large JSON-encoded model?

  3. As for the code you mentioned, the replacement logic only triggers when you change the scheme or the load balancer type; since these fields are immutable in the ELB APIs, we have to recreate a replacement. However, that's not the case per your logs, since the model is empty ("model":"{\"id\":\"group2\",\"resources\":{}}"), which means every Ingress in group2 is in a "deleting" state.

BTW, in general each Ingress group is reconciled independently; changing one Ingress group shouldn't impact another.

@someshkoli
Author

someshkoli commented Aug 2, 2023

@M00nF1sh Hey,

  1. I'm not entirely sure what exactly happened; the Ingress record was missing, and all I found in the audit log was a patch request from the controller to the Ingress record setting finalizers=null.

  2. It may be; I had seen this error randomly before, but seeing it right before the deletion line made me curious. (PS: I saw it last time as well, when the exact same situation happened.) Here's the JSON model I found in the logs:
    {"level":"info","ts":"2023-08-02T08:34:19Z","logger":"controllers.ingress","msg":"successfully built model","model":{"id":"test-ingress","resources":{"AWS::EC2::SecurityGroup":{"ManagedLBSecurityGroup":{"spec":{"groupName":"k8s-group2-907ce91fe3","description":"[k8s] Managed SecurityGroup for LoadBalancer","ingress":[{"ipProtocol":"tcp","fromPort":443,"toPort":443,"ipRanges":[{"cidrIP":"0.0.0.0/0"}]},{"ipProtocol":"tcp","fromPort":80,"toPort":80,"ipRanges":[{"cidrIP":"0.0.0.0/0"}]}]}}},"AWS::ElasticLoadBalancingV2::Listener":{"80":{"spec":{"loadBalancerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::LoadBalancer/LoadBalancer/status/loadBalancerARN"},"port":80,"protocol":"HTTP","defaultActions":[{"type":"fixed-response","fixedResponseConfig":{"contentType":"text/plain","statusCode":"404"}}]}},"443":{"spec":{"loadBalancerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::LoadBalancer/LoadBalancer/status/loadBalancerARN"},"port":443,"protocol":"HTTPS","defaultActions":[{"type":"fixed-response","fixedResponseConfig":{"contentType":"text/plain","statusCode":"404"}}],"certificates":[{"certificateARN":"arn:aws:acm:us-east-1:999499138329:certificate/60f46466-c2f4-43e7-a30f-fa201b99f8ba"}],"sslPolicy":"ELBSecurityPolicy-2016-08"}}},"AWS::ElasticLoadBalancingV2::ListenerRule":{"443:1":{"spec":{"listenerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::Listener/443/status/listenerARN"},"priority":1,"actions":[{"type":"forward","forwardConfig":{"targetGroups":[{"targetGroupARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::TargetGroup/helm/app1:8081/status/targetGroupARN"}}]}}],"conditions":[{"field":"host-header","hostHeaderConfig":{"values":["app1.test-domain.com"]}},{"field":"path-pattern","pathPatternConfig":{"values":["*"]}}]}},"80:1":{"spec":{"listenerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::Listener/80/status/listenerARN"},"priority":1,"actions":[{"type":"redirect","redirectConfig":{"port":"443","protocol":"HTTPS","statusCode":"HTTP_301"}}],"conditions":[{"field":"host-header","hostHeaderConfig":{"values":["app1.test-domain.com"]}},{"field":"path-pattern","pathPatternConfig":{"values":["*"]}}]}}},"AWS::ElasticLoadBalancingV2::LoadBalancer":{"LoadBalancer":{"spec":{"name":"k8s-group1-997d1c003f","type":"application","scheme":"internet-facing","ipAddressType":"ipv4","subnetMapping":[{"subnetID":"subnet-00000000000000000"},{"subnetID":"subnet-00000000000000000"}],"securityGroups":[{"$ref":"#/resources/AWS::EC2::SecurityGroup/ManagedLBSecurityGroup/status/groupID"},"sg-00000000000000000"]}}},"AWS::ElasticLoadBalancingV2::TargetGroup":{"helm/app1:8081":{"spec":{"name":"k8s-helm-app1-c423627cdc","targetType":"instance","port":0,"protocol":"HTTP","protocolVersion":"HTTP1","ipAddressType":"ipv4","healthCheckConfig":{"port":"traffic-port","protocol":"HTTP","path":"/","matcher":{"httpCode":"200"},"intervalSeconds":15,"timeoutSeconds":5,"healthyThresholdCount":2,"unhealthyThresholdCount":2}}}},"K8S::ElasticLoadBalancingV2::TargetGroupBinding":{"helm/app1:8081":{"spec":{"template":{"metadata":{"name":"k8s-helm-app1-c423627cdc","namespace":"helm","creationTimestamp":null},"spec":{"targetGroupARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::TargetGroup/helm/app1:8081/status/targetGroupARN"},"targetType":"instance","serviceRef":{"name":"app1","port":8081},"networking":{"ingress":[{"from":[{"securityGroup":{"groupID":"sg-00000000000000000"}}],"ports":[{"protocol":"TCP","port":0}]}]},"ipAddressType":"ipv4"}}}}}}}}

However, that's not the case per your logs since the model is empty ,"model":"{"id":"group2","resources":{}}"}, which means all Ingress in group2 is in "deleting state"

Yes, exactly my concern: how did this happen in the first place? I'm assuming this is what caused the LB to get marked as deleted -> causing the controller to patch the finalizers on the Ingress record to null -> which queued the Ingress for deletion.
I haven't looked into the code yet, but is it possible that the broken model (the error I sent you) is somehow getting applied, setting the entire resource model to empty {} and causing the scheme condition to take the deletion path?

BTW, In general, each Ingress group is reconciled independently, changing one Ingress group shouldn't impact another.

That is how it's supposed to behave; I've no idea why this happened (twice).

@johngmyers
Contributor

I'm having trouble figuring out where that line could be logged from.

@someshkoli
Author

@johngmyers which one ?

minimum field value of 1, CreateTargetGroupInput.Port.\n"}

This one? I found that this error pops up when you have alb.ingress.kubernetes.io/target-type: instance with an underlying Service of type ClusterIP, which is fair, but the error is a bit misleading. It is well documented, though; see the sketch below.
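
For anyone hitting the same error, here is a minimal sketch of the combination described above (names, namespace and ports are placeholders loosely based on the model posted earlier): with target-type: instance the controller registers worker nodes through the Service's node port, and a ClusterIP Service has none, so the built model ends up with a target group port of 0 and ELBv2 rejects it with the CreateTargetGroupInput.Port validation error.

apiVersion: v1
kind: Service
metadata:
  name: app1             # placeholder
  namespace: helm
spec:
  type: ClusterIP        # no nodePort is allocated for ClusterIP
  selector:
    app: app1
  ports:
    - port: 8081
      targetPort: 8081
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app1
  namespace: helm
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/group.name: group2
    alb.ingress.kubernetes.io/target-type: instance   # expects a node port; use "ip" or a NodePort Service instead
spec:
  rules:
    - host: app1.test-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app1
                port:
                  number: 8081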

@someshkoli
Author

someshkoli commented Aug 8, 2023

Another interesting thing I found while trying to replicate this whole scenario.

PS: this is a whole new thing; I might raise a separate issue for it.

When a faulty Ingress (i1) is applied with group g1 and host entry h1 -> reconciliation fails -> no ALB is allocated -> then another faulty Ingress (i2) is applied with group g1 and host entry h2.

You will notice that Ingress record i2 now has host entry h1. I thought this was a reconciliation issue that would get fixed once the fault in the Ingress was fixed, but after fixing the fault it still kept host h1 in Ingress i2 💀

PS: by "faulty" above I mean setting alb.ingress.kubernetes.io/target-type: instance on a ClusterIP Service (sketched below).
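
To make the repro concrete, a rough sketch of the two Ingresses (i1/i2, hosts h1/h2 and group g1 are the placeholders from the description; the backing Services and hosts here are made up):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: i1
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/group.name: g1
    alb.ingress.kubernetes.io/target-type: instance   # the fault: backing Service is ClusterIP
spec:
  rules:
    - host: h1.example.com    # "h1"
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: svc1    # ClusterIP Service
                port:
                  number: 80
---
# i2 is identical apart from name: i2, host: h2.example.com ("h2") and service: svc2.
# Per the observation above, after applying i2 its rules showed host h1 instead of h2,
# and fixing the target-type fault did not restore h2.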

@blakebarnett

blakebarnett commented Sep 14, 2023

We also encountered this issue. An Ingress resource was deleted in namespace1 and the LoadBalancers for 3 ingresses in namespace2 were deleted. This caused an outage for 3 services; the Ingress resources for these 3 didn't change, other than the LB hostname status field eventually going blank.

Annotations in use for the 3 ingresses that had their LoadBalancers incorrectly deleted:

    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-west-2:xxxxxxxxxxxxx:certificate/xxxxxxxxx
    alb.ingress.kubernetes.io/healthcheck-path: /
    alb.ingress.kubernetes.io/healthcheck-port: "80"
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    kubernetes.io/ingress.class: alb

The Service type is ClusterIP in all cases.

This occurred on v2.4.5 (image version), helm chart v1.4.6

It is extremely alarming that this can happen.

@someshkoli
Author

someshkoli commented Sep 14, 2023

We also encountered this issue. An Ingress resource was deleted in namespace1 and LoadBalancers for 3 ingresses in namespace2 were deleted. [...]

Finally, someone who can relate. We've had such an outage twice, and since there was no update on this conversation I had started to think I might have deleted it by mistake somehow.

I tried reproducing it but couldn't. What about you?

@johngmyers
Contributor

Controller logs for the sync that did the inappropriate deletion would be helpful.

@blakebarnett

Unfortunately logs for this controller weren't being shipped at the time and the pods were restarted during troubleshooting so we lost them. I do have the CloudTrail events that show that the IRSA role the controller was using is what did the deletion, but not much other than that.

@someshkoli
Author

I have the container logs, let me know if you want me to send them.
Hopefully they don't contain any sensitive information.
PS: they're the raw logs, I haven't touched them.

@blakebarnett

Oh, also: I should note that deleting and recreating the Ingress resources fixed it immediately. I've been testing v2.6.1 in a dev cluster; I manually deleted the AWS LoadBalancer resources, and the controller started throwing 403 IAM errors like this:

{"level":"error","ts":"2023-09-14T17:52:06Z","msg":"Reconciler error","controller":"ingress","object":{"name":"cd-demo-frontend","namespace":"development"},"namespace":"development","name":"cd-demo-frontend","reconcileID":"4ecd3e0a-5acd-47ab-8127-6f4fdd1fc6d6","error":"AccessDenied: User: arn:aws:sts::XXXXXXX:assumed-role/alb-ingress-irsa-role/XXXXX is not authorized to perform: elasticloadbalancing:AddTags on resource: arn:aws:elasticloadbalancing:us-west-2:XXXXXX:targetgroup/k8s-developm-cddemofr-d994612186/* because no identity-based policy allows the elasticloadbalancing:AddTags action\n\tstatus code: 403, request id: 2a3686f4-d682-4fe4-b3b3-54e1e7be32ec"}

I waited > 10 hours for the default --sync-period but it didn't recreate them.

@oliviassss
Collaborator

@blakebarnett, this is a separate issue, see: #3383 (comment)

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 28, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 28, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:


/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
