
Remove backend from external backends if same backend name #8430

Closed
wants to merge 10 commits

Conversation


@freddyesteban freddyesteban commented Apr 4, 2022

What this PR does / why we need it:

Removes a backend from the external backends table when a new backend has the same name. This prevents a stale cached external backend from being used after the backend's type changes.

When the Service object's type is changed from ExternalName to ClusterIP, the backend is never removed from backends_with_external_name in the Lua balancer, so the external backend keeps serving traffic. This PR removes the backend from backends_with_external_name when a non-external backend with the same name is synced.
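The idea can be sketched as follows. This is an illustration only, not the actual patch: the table name backends_with_external_name comes from the PR description, while the helper function and the shape of the backend table here are simplified assumptions.

```lua
-- Minimal sketch of the fix (illustrative; not the real balancer.lua).
-- backends_with_external_name caches backends backed by ExternalName
-- Services; sync_backend runs for every backend on each configuration push.
local backends_with_external_name = {}

local function is_external_name(backend)
  -- assumption: ExternalName backends carry the service type on the spec
  return backend.service ~= nil
     and backend.service.spec ~= nil
     and backend.service.spec.type == "ExternalName"
end

local function sync_backend(backend)
  if is_external_name(backend) then
    backends_with_external_name[backend.name] = backend
    return
  end

  -- the fix: when a non-ExternalName backend is synced under a name that
  -- is still cached as external, evict the stale entry so the old
  -- external backend stops serving traffic
  if backends_with_external_name[backend.name] then
    backends_with_external_name[backend.name] = nil
  end

  -- ... regular ClusterIP backend sync would continue here ...
end

-- demonstration: a backend that flips from ExternalName to ClusterIP
sync_backend({ name = "demo", service = { spec = { type = "ExternalName" } } })
sync_backend({ name = "demo", service = { spec = { type = "ClusterIP" } } })
assert(backends_with_external_name["demo"] == nil)
```

In the real controller, sync_backend also resolves endpoints and updates shared balancer state; the sketch only shows the eviction this PR adds.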

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation only

Which issue/s this PR fixes

fixes #8440

How Has This Been Tested?

  • Updated balancer.lua with the fix
  • Ran make dev-env
  • Created a Deployment to serve a static page using tag nginx:latest
  • Created Service object of type ExternalName pointing to an external website, traffic is routed appropriately
  • Changed Service object to type ClusterIP to route traffic to Deployment pod, traffic is routed appropriately (prior to this PR, traffic was still routed using external backend).

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have added tests to cover my changes.
  • All new and existing tests passed.


linux-foundation-easycla bot commented Apr 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: freddyesteban / name: Freddy Esteban Perez (a826cea, fb2b55b)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 4, 2022
@k8s-ci-robot
Contributor

@freddyesteban: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 4, 2022
@k8s-ci-robot
Contributor

Hi @freddyesteban. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. area/lua Issues or PRs related to lua code labels Apr 4, 2022
@longwuyuan
Contributor

  • I think there is too much assumption here.
  • It would be nice to see how a user lands in the situation where there is a lingering backend.
  • I think there was an issue raised about a similar situation, but there too, a detailed description and step-by-step instructions to reproduce the lingering-backend problem were not provided.
  • It may well be true that there is a lingering backend that needs to be manually removed, but this comment is to get clarity on the simple question of why someone would create an ingress with an externalName-type service as a backend in the first place, and then follow that up by editing it instead of deleting the Ingress and creating a new one.

@freddyesteban
Author

  • I think there is too much assumption here.
  • It would be nice to see how a user lands in the situation where there is a lingering backend.
  • I think there was an issue raised about a similar situation, but there too, a detailed description and step-by-step instructions to reproduce the lingering-backend problem were not provided.
  • It may well be true that there is a lingering backend that needs to be manually removed, but this comment is to get clarity on the simple question of why someone would create an ingress with an externalName-type service as a backend in the first place, and then follow that up by editing it instead of deleting the Ingress and creating a new one.

@longwuyuan thank you for taking the time to look at our PR.

Our use case for changing the Service type in flight is to spin down all pods when not in use (scale to zero) and point to an external service that signals to the user that the pod is spun down (a "please wait" page) until it wakes up. Our automation "wakes up" the pods by scaling the deployment replicas back to 1, and the Service object is switched back to ClusterIP. This behavior worked on an older version of the controller, though the nginx controller would trigger a reload for the change. This version of the controller attempts to perform the change dynamically using the Lua balancer, which is great because we can hopefully avoid the reload.

You've suggested deleting the ingress object, and we could do that, but we found that just updating the Service type avoids a reload. Our clusters are relatively big, so this could be very advantageous for our use case.

To replicate the issue, I've put together a step-by-step guide here.

To see the difference with our change, follow step-by-step guide here.

Why is removing the cached external backend with the same name important to us?
Changing the Service type to ClusterIP is a dynamic reconfiguration without a reload, as shown in the logs I provided here.

We're aware that we could delete the ingress object, or even update it in place, and that would work, but it causes a reload. That could be acceptable, since the old ingress controller did this, but we'd like to take advantage of the reload-free reconfiguration our fix provides.

Here is a step-by-step guide for the approach of deleting the ingress object and recreating it; the logs will show a backend reload, see here.

Here is a step-by-step guide for the approach of updating the ingress object to use a separate Service object; the logs will show a backend reload, see here.

@longwuyuan
Contributor

Hi @freddyesteban ,
Thanks for the detailed explanation and the reproduction procedure. It helps.
My first request is that you create a new issue and put all the details you explained here in that issue.

  • Please include all the data related to reproducing the problem in that issue.
  • Link that issue here with the string fixes <pound/hash symbol> <issuenumber>

My next request is: please write tests. I think there should be some assurance that just checking for a pre-existing stale backend in Lua does not interfere with any other code path. Please write tests that you think will provide this assurance. I don't expect any panic from an if-condition check, but a test should confirm that there will be no impact on users who don't create an externalName-type service and edit it later. I don't even know whether an externalName-type service that points not to the internet but to a custom destination makes a difference or not. Basically, please write all the tests that will provide the needed assurance.

@longwuyuan
Contributor

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Apr 7, 2022
@longwuyuan
Contributor

@freddyesteban on a very different note: if you have already tried https://kubernetes.github.io/ingress-nginx/examples/customization/custom-errors/#custom-errors, please write a note on why custom errors are not a preferred solution for the use case you described. It would be such a clean and supported way to serve the "please wait" page from a custom backend.

@tao12345666333
Member

Please sign CLA

@tao12345666333
Member

/assign

@freddyesteban
Author

Please sign CLA

@tao12345666333 We were under the impression that we, as a company, had already done that, but I think the project moved to EasyCLA and that's no longer the case. My manager has filed a ticket to get that fixed. Thank you.

@freddyesteban
Author

freddyesteban commented Apr 7, 2022

@longwuyuan

Thank you. I created the issue and linked it to this PR.

Regarding testing: the change only removes the backend from backends_with_external_name after sync_backend runs for a non-external backend, so removing it from the externals table does not affect a user who has no external backend. The lookup of the backend name in the table is safe and will not panic. A change between external types, e.g. an external backend being reconfigured to point to a custom destination, is not affected, because to reach that code path the user would have to change the Service type to a non-external one.

I attempted to add a test before, but in order to test the backends_with_external_name table I'd have to break encapsulation, because the function updating the table and the table itself are not exported. I'm working on it anyway at the moment. Any thoughts on exporting sync_backends and backends_with_external_name for testing purposes?

I'm new to Lua; apologies if there's a better way to approach the testing. If you have any suggestions or could point me in the right direction, I'd appreciate it.

@freddyesteban
Author

@freddyesteban on a very different note: if you have already tried https://kubernetes.github.io/ingress-nginx/examples/customization/custom-errors/#custom-errors, please write a note on why custom errors are not a preferred solution for the use case you described. It would be such a clean and supported way to serve the "please wait" page from a custom backend.

For us, at least, it's more about routing to a particular external service when the Service changes than about creating a default backend that could handle our particular use case. With enough work, I think we could find multiple solutions, including your suggestion of deleting ingress objects. We'd like to keep our "please wait" service decoupled from the nginx controller deployments, as it serves multiple clusters; that's just one factor.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 7, 2022
@freddyesteban
Author

@longwuyuan @tao12345666333 thoughts on the changes?

@rikatz
Contributor

rikatz commented May 1, 2022

/ok-to-test
@tao12345666333 can you please take a look? It does make sense to me, but I'm a bit worried every time I mess with Lua code ;)

Thanks

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 1, 2022
@tao12345666333
Member

Sure. It's on my queue.

@tao12345666333
Member

/test pull-ingress-nginx-test-lua

@tao12345666333
Member

/retest

@tao12345666333
Member

The errors in CI are not related to the code changes; they may have something to do with test-infra. I will do a code review.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/ingress-nginx/8430/pull-ingress-nginx-test-lua/1521739764124880896/build-log.txt

build/run-in-docker.sh: line 65: USER: unbound variable
build/run-in-docker.sh: line 65: USER: unbound variable
build/run-in-docker.sh: line 65: docker: command not found
make: *** [Makefile:146: lua-test] Error 127

@longwuyuan
Contributor

The errors in CI are not related to the code changes; they may have something to do with test-infra. I will do a code review.

Ricardo had to remove an if condition that checks for DIND, because prow was failing e2e while local/laptop e2e was working.
Now you are reporting a run-in-docker.sh related error message. I hope we are aware of any underlying infra/prow changes, to avoid spiralling out of control. There was no announcement, though, and I had a successful e2e run on my laptop in the last 24 hours, so it is surely related to prow.

@tao12345666333
Member

/retest

@freddyesteban
Author

@tao12345666333 any updates on this or anything I should be doing to help get this over the line? TIA

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2022
@freddyesteban
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 10, 2022
@freddyesteban
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 12, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: freddyesteban
Once this PR has been reviewed and has the lgtm label, please ask for approval from tao12345666333. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

@freddyesteban: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-ingress-nginx-test-lua | Commit: 39ecf8b | Details: link | Required: true | Rerun command: /test pull-ingress-nginx-test-lua

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Labels
  • area/lua (Issues or PRs related to lua code)
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • kind/feature (Categorizes issue or PR as related to a new feature.)
  • needs-priority
  • needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stale external backend when changing service type from ExternalName to ClusterIP
6 participants