
Health check with dynamic configuration #2325

Closed
valeriano-manassero opened this issue Apr 10, 2018 · 13 comments

Comments

@valeriano-manassero

Readiness probe must return 200 only when the first Lua config load is OK.

Feature added here: NCCloud/fluid#24

@aledbf
Member

aledbf commented Apr 10, 2018

This was implemented here #2308

@valeriano-manassero
Author

As far as I understand, with the quoted implementation the healthz URI can return a 200 before all backends are loaded.
If this happens during a rolling update, the update is not transparent and can cause some downtime, especially with large datasets.

@aledbf
Member

aledbf commented Apr 10, 2018

@ElvinEfendi ping

@ElvinEfendi
Member

I'm trying to understand the issue. I can see that on boot the readiness probe will pass as soon as the controller POSTs the default backend (or really any backend) to Lua land. The k8s API might keep sending more Ingress/Service updates incrementally from that point on (@valeriano-manassero I think that's what you're referring to with the rolling update). Between those increments ingress-nginx will already be ready to receive traffic, and if there's a request to one of the backends that has not been POSTed to Lua yet, the client will see a 503.
But this behaviour is not much different without dynamic configuration enabled. What will happen there is that right after the first backend data arrives the controller will generate the Nginx template and the readiness probe will pass, but it will still not have all the upstreams defined in the Nginx config, so certain requests during this time will not succeed.

So in this sense this is more of a general problem, not specific to dynamic configuration. I can see that this might cause an issue when, for some reason, an ingress-nginx pod restarts.

But IMHO the changes introduced in NCCloud/fluid#24 do not fix this issue - that implementation seems, to me at least, effectively the same as #2308.


A more complete solution IMO would be to query the ingresses from the k8s API on health check and pass the respective backend names to a specific Lua health check endpoint that asserts all those backends have been registered in the Lua shared dictionary.
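For illustration, a controller-side sketch of that idea could look roughly like the following; the endpoint path, port, and function name are hypothetical and do not correspond to an existing ingress-nginx API:

```go
// Hypothetical sketch: collect the backend names the controller currently expects
// and ask a (made-up) Lua endpoint whether every one of them has been registered
// in the shared dictionary. Anything but a 200 is treated as "not ready".
package healthcheck

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// backendsReady POSTs the expected backend names to a hypothetical Lua endpoint.
func backendsReady(expected []string) error {
	body, err := json.Marshal(expected)
	if err != nil {
		return err
	}
	// Port and path are assumptions for this sketch.
	resp, err := http.Post("http://127.0.0.1:18080/configuration/backends-ready",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("Lua side is missing some backends, status %d", resp.StatusCode)
	}
	return nil
}
```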


@valeriano-manassero could you come up with an e2e regression test to show this behaviour? Or at least a step-by-step guide for us to reproduce it - that would be very valuable before spending more time on the solution.

@valeriano-manassero
Author

valeriano-manassero commented Apr 10, 2018

@ElvinEfendi I'm sorry, but right now I'm very busy and I don't have time to come up with an e2e regression test. You can show this behavior simply by creating something like 10k to 20k backends and then updating the deployment.
You will see some 503 errors due to backends not being fully loaded while the readinessProbe is already getting 200 from the healthz URI.
The trick is to simply return a 200 response only when the first load is done. In the meantime, the failing readiness probe prevents k8s from destroying the old pod until the new one is really ready to serve requests.
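To make the gating idea concrete, here is a minimal, hypothetical sketch (not the code from #2308 or NCCloud/fluid#24): the readiness endpoint stays unhealthy until the first full configuration load has finished.

```go
// Minimal sketch of a readiness gate: /healthz returns 503 until the controller
// has pushed the initial configuration at least once.
package main

import (
	"net/http"
	"sync/atomic"
)

var firstLoadDone atomic.Bool // set to true after the initial backend sync

// markFirstLoadDone would be called by the controller once the initial
// configuration (all backends) has been loaded successfully.
func markFirstLoadDone() { firstLoadDone.Store(true) }

func healthz(w http.ResponseWriter, r *http.Request) {
	if !firstLoadDone.Load() {
		// Readiness fails here, so Kubernetes keeps the old pod serving
		// traffic during a rolling update until the new one is ready.
		http.Error(w, "initial configuration not loaded yet", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	http.ListenAndServe(":10254", nil) // port is an assumption for this sketch
}
```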

@ElvinEfendi
Member

ElvinEfendi commented Apr 10, 2018

I'm sorry, but right now I'm very busy and I don't have time to come up with an e2e regression test. You can show this behavior simply by creating something like 10k to 20k backends and then updating the deployment.

No worries! I'll try that when I have some time.

The trick is to simply return a 200 response only when the first load is done.

@valeriano-manassero but this is exactly what @aledbf did at https://github.com/kubernetes/ingress-nginx/pull/2308/files, no?

@aledbf
Member

aledbf commented Apr 10, 2018

and then update the deployment

the ingress controller deployment?

@ElvinEfendi
Member

the ingress controller deployment?

I think so yeah, in order to trigger a restart

@valeriano-manassero
Author

Yep, for example, I see the issue when updating to the latest version of Fluid in my environment. Many customers get some 503s for a few seconds during the rolling update, and that's because the readinessProbe was already returning 200 while the new pods were not yet ready to serve all the backends.

@aledbf
Member

aledbf commented Apr 10, 2018

I think so yeah, in order to trigger a restart

For that scenario, I want to do a refactoring that splits the Go code from nginx.
This would allow us to fix some edge cases like:

  • having different configurations for some time window when the ingress controller runs on different nodes
  • reducing the number of connections to the API server
  • scaling only nginx and not the ingress controller
  • having the nginx deployment only update the configuration, which fixes the issue @valeriano-manassero mentions

After the next release, I will work on this. Right now I am looking at how to secure the communication between these two deployments (for example using https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster/).

@mumoshu

mumoshu commented May 9, 2018

@aledbf Hi. Thanks for maintaining nginx ingress!

Just wondered if we really need the secure communication thing.

How about the ingress controller just periodically writing the rendered nginx.conf and the dynamic config into a ConfigMap? Then the nginx pod could mount the ConfigMap as a volume, watch it for changes with inotify/fsnotify, and on update either run an nginx reload or feed the config via Lua?
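As a rough illustration of that approach (the mount path and reload mechanics are assumptions, not an existing ingress-nginx feature), the nginx pod could run a small watcher along these lines:

```go
// Hypothetical sketch: watch a ConfigMap-mounted directory with fsnotify and
// reload nginx whenever the rendered configuration changes.
package main

import (
	"log"
	"os/exec"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// ConfigMap volumes are updated via symlink swaps, so watch the whole
	// directory (path assumed for this sketch) rather than a single file.
	if err := watcher.Add("/etc/nginx/conf"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event := <-watcher.Events:
			if event.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				log.Printf("config changed (%s), reloading nginx", event.Name)
				if out, err := exec.Command("nginx", "-s", "reload").CombinedOutput(); err != nil {
					log.Printf("reload failed: %v: %s", err, out)
				}
			}
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```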

@aledbf
Member

aledbf commented May 9, 2018

How about the ingress controller just periodically writing the rendered nginx.conf and the dynamic config into a ConfigMap?

I thought about that, and the issue I see is the permissions on that ConfigMap. We would not only need to dump the nginx.conf but also the SSL certificates.

@aledbf
Member

aledbf commented Jun 15, 2018

As far as I understand, with the quoted implementation the healthz URI can return a 200 before all backends are loaded.

Closing. This can also occur with the "normal" reload mode. We cannot block serving traffic.

@aledbf aledbf closed this as completed Jun 15, 2018