
Health check with dynamic configuration #2325

Closed
valeriano-manassero opened this issue Apr 10, 2018 · 13 comments

Comments

@valeriano-manassero

Readiness probe must return 200 only when the first Lua config load is OK.

Feature added here: NCCloud/fluid#24

@aledbf
Member

aledbf commented Apr 10, 2018

This was implemented here #2308

@valeriano-manassero
Author

As far as I understand, with the quoted implementation the healthz URI can return a 200 before all backends are loaded.
If this happens during a rolling update, the update is not transparent and can cause some downtime, especially with large datasets.

@aledbf
Member

aledbf commented Apr 10, 2018

@ElvinEfendi ping

@ElvinEfendi
Member

I'm trying to understand the issue. I can see that on boot the readiness probe will pass as soon as the controller POSTs the default backend (or really any backend) to Lua land. The k8s API might keep sending more Ingress/Service updates incrementally from that point on (@valeriano-manassero I think that's what you're referring to with the rolling update). Between those increments ingress-nginx will already be ready to receive traffic, and if there's a request to one of the backends that has not been POSTed to Lua yet, the client will see a 503.
But this behaviour is not much different without dynamic configuration enabled. What will happen there is that right after the first backend data arrives the controller will generate the Nginx template and the readiness probe will pass, but it will still not have all the upstreams defined in the Nginx config, so certain requests during this time will not succeed.

So in this sense this is more of a general problem, not specific to dynamic configuration. I can see that this might cause an issue when, for some reason, an ingress-nginx pod restarts.

But IMHO the changes introduced in NCCloud/fluid#24 do not fix this issue - that implementation seems, to me at least, effectively the same as #2308.


A more complete solution IMO would be to query the ingresses from the k8s API on health check and pass the respective backend names to a specific Lua health check endpoint that asserts all those backends have been registered in the Lua shared dictionary.
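For illustration, a controller-side sketch of that idea could look roughly like the following; the endpoint path, port, and function name are hypothetical and do not correspond to an existing ingress-nginx API:

```go
// Hypothetical sketch: collect the backend names the controller currently expects
// and ask a (made-up) Lua endpoint whether every one of them has been registered
// in the shared dictionary. Anything but a 200 is treated as "not ready".
package healthcheck

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// backendsReady POSTs the expected backend names to a hypothetical Lua endpoint.
func backendsReady(expected []string) error {
	body, err := json.Marshal(expected)
	if err != nil {
		return err
	}
	// Port and path are assumptions for this sketch.
	resp, err := http.Post("http://127.0.0.1:18080/configuration/backends-ready",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("Lua side is missing some backends, status %d", resp.StatusCode)
	}
	return nil
}
```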


@valeriano-manassero could you come up with an e2e regression test to show this behaviour? Or at least a step-by-step guide for us to reproduce it - that would be very valuable before spending more time on the solution.

@valeriano-manassero
Author

valeriano-manassero commented Apr 10, 2018

@ElvinEfendi I'm sorry, but right now I'm very busy and I don't have time to come up with an e2e regression test. You can show this behavior simply by creating something like 10k to 20k backends and then updating the deployment.
You will see some 503 errors due to backends not being fully loaded while the readinessProbe is already getting 200 from the healthz URI.
The trick is to simply return a 200 response only when the first load is done. In the meantime, the failing readiness probe prevents k8s from destroying the old pod until the new one is really ready to serve requests.
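To make the gating idea concrete, here is a minimal, hypothetical sketch (not the code from #2308 or NCCloud/fluid#24): the readiness endpoint stays unhealthy until the first full configuration load has finished.

```go
// Minimal sketch of a readiness gate: /healthz returns 503 until the controller
// has pushed the initial configuration at least once.
package main

import (
	"net/http"
	"sync/atomic"
)

var firstLoadDone atomic.Bool // set to true after the initial backend sync

// markFirstLoadDone would be called by the controller once the initial
// configuration (all backends) has been loaded successfully.
func markFirstLoadDone() { firstLoadDone.Store(true) }

func healthz(w http.ResponseWriter, r *http.Request) {
	if !firstLoadDone.Load() {
		// Readiness fails here, so Kubernetes keeps the old pod serving
		// traffic during a rolling update until the new one is ready.
		http.Error(w, "initial configuration not loaded yet", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	http.ListenAndServe(":10254", nil) // port is an assumption for this sketch
}
```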

@ElvinEfendi
Member

ElvinEfendi commented Apr 10, 2018

I'm sorry, but right now I'm very busy and I don't have time to come up with an e2e regression test. You can show this behavior simply by creating something like 10k to 20k backends and then updating the deployment.

No worries! I'll try that when I have some time.

The trick is to simply return a 200 response only when the first load is done.

@valeriano-manassero but this is exactly what @aledbf did at https://github.com/kubernetes/ingress-nginx/pull/2308/files, no?

@aledbf
Member

aledbf commented Apr 10, 2018

and then update the deployment

the ingress controller deployment?

@ElvinEfendi
Member

the ingress controller deployment?

I think so yeah, in order to trigger a restart

@valeriano-manassero
Author

Yep, for example, I see the issue when updating to the latest version of Fluid in my environment. Many customers get some 503s for a few seconds during the rolling update, and that's because the readinessProbe was already returning 200 while the new pods were not yet ready to serve all the backends.

@aledbf
Member

aledbf commented Apr 10, 2018

I think so yeah, in order to trigger a restart

For that scenario, I want to do a refactoring that splits the Go code from nginx.
This would allow us to fix some edge cases like:

  • having different configurations for some time window when the ingress controller runs on different nodes
  • reducing the number of connections to the API server
  • scaling only nginx and not the ingress controller
  • having the nginx deployment only update the configuration, which fixes the issue @valeriano-manassero mentions

After the next release, I will work on this. Right now I am looking at how to secure the communication between these two deployments (for example using https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster/).

@mumoshu

mumoshu commented May 9, 2018

@aledbf Hi. Thanks for maintaining nginx ingress!

Just wondered if we really need the secure communication thing.

How about the ingress controller just periodically writing the rendered nginx.conf and the dynamic config into a ConfigMap? Then the nginx pod could mount the ConfigMap as a volume, watch it for changes with inotify/fsnotify, and on update either run an nginx reload or feed the config via Lua?
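As a rough illustration of that approach (the mount path and reload mechanics are assumptions, not an existing ingress-nginx feature), the nginx pod could run a small watcher along these lines:

```go
// Hypothetical sketch: watch a ConfigMap-mounted directory with fsnotify and
// reload nginx whenever the rendered configuration changes.
package main

import (
	"log"
	"os/exec"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// ConfigMap volumes are updated via symlink swaps, so watch the whole
	// directory (path assumed for this sketch) rather than a single file.
	if err := watcher.Add("/etc/nginx/conf"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event := <-watcher.Events:
			if event.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				log.Printf("config changed (%s), reloading nginx", event.Name)
				if out, err := exec.Command("nginx", "-s", "reload").CombinedOutput(); err != nil {
					log.Printf("reload failed: %v: %s", err, out)
				}
			}
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```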

@aledbf
Member

aledbf commented May 9, 2018

How about the ingress controller just periodically writing the rendered nginx.conf and the dynamic config into a ConfigMap?

I thought about that, and the issue I see is the permissions on that ConfigMap. We would not only need to dump the nginx.conf but also the SSL certificates.

@aledbf
Member

aledbf commented Jun 15, 2018

As far as I understand, with the quoted implementation the healthz URI can return a 200 before all backends are loaded.

Closing. This can also occur with the "normal" reload mode. We cannot block serving traffic.

@aledbf aledbf closed this as completed Jun 15, 2018