Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to drain endpoints before removal #7218

Open
concaf opened this issue Jun 9, 2019 · 13 comments
Open

Unable to drain endpoints before removal #7218

concaf opened this issue Jun 9, 2019 · 13 comments
Labels
design proposal Needs design doc/proposal before implementation help wanted Needs help!

Comments

@concaf
Copy link

concaf commented Jun 9, 2019

Title: Unable to drain endpoints before removal

Description:
I'm trying to implement endpoint connection draining (and not draining envoy itself) with sticky sessions configured.

At t=0, there are 5 endpoints configured in my envoy cluster with cookie based sticky sessions configured.
At t=10, I wish to drain 1 out of the 5 endpoints I have i.e. I do not want any new connections to be routed to that endpoint, but I want the current sessions to persist before I remove the endpoint from the list.

In order to drain an endpoint, I've tried 2 approaches -

  1. Set 'health_status': 'DRAINING' in the endpoint (See https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/core/health_check.proto#envoy-api-enum-core-healthstatus)
    This takes care of not sending any new connections to the endpoint, but does not allow older sticky sessions to persist and they're broken off.
    I got leads for this from Feature request: a means to refuse subsequent TCP connections while allowing current connections enough time to drain #2920 (comment), but doesn't seem to work 🤷‍♂️

  2. Set load_balancing_weight: 1 in the endpoint in the hope that a lower weight will not route any new requests to the endpoint but the sticky sessions will persist, but that didn't happen.

So, what am I missing here? Any pointers are appreciated! Thanks :)

Config:
This is how the endpoint configuration looks when I'm trying to drain it -

                                {
                                    "endpoint": {
                                        "address": {
                                            "socket_address": {
                                                "address": "172.17.0.9",
                                                "port_value": 8080
                                            }
                                        }
                                    }
                                },
                                {
                                    "endpoint": {
                                        "address": {
                                            "socket_address": {
                                                "address": "172.17.0.14",
                                                "port_value": 8080
                                            }
                                        }
                                    },
                                    "health_status": "DRAINING"
                                },
@mattklein123 mattklein123 added the question Questions that are neither investigations, bugs, nor enhancements label Jun 10, 2019
@mattklein123
Copy link
Member

I think I understand what you are after, but I'm not sure how it can be implemented cleanly. How would Envoy know that a session is OK to keep sending to an endpoint? Do you expect this to be somehow encoded in the cookie?

@concaf
Copy link
Author

concaf commented Jun 10, 2019

@mattklein123 thanks for the prompt reply ;)

I think the cookie does not need to have any information on the upcoming drain, I'm expecting a way to signal envoy of this drain in the configuration somewhere.

I think the best I can explain this is the way is quoting how Nginx drains sticky sessions (more at https://www.nginx.com/blog/nginx-plus-backend-upgrades-individual-servers#server-api-persistence)

Session persistence is required for any application that keeps state information for users (such as a shopping cart), but it complicates upgrades because now it is not enough just to wait until there are no active connections to our server before taking it offline. There might be clients that aren’t sending requests right now but haven’t ended their session with the server. For the best user experience, we want to keep the active sessions open – the amount of time depends on the application – but don’t want any new sessions to start.

In this case, before taking the server offline we don’t only want the number of active connections to be zero, but also for all sessions to end. That translates to the server being idle for some amount of time, which depends on the application. We can periodically check the dashboard or use the status API to determine that the server is idle, but we can also automate the process of marking a server drain and verifying it is idle before marking it down.

So, from what I understand, Envoy could do this 2 ways -

  1. Allow the application to specify the drain time - during which no new connections are sent but older ones persist - and after which the endpoint is removed from the configuration
  2. Allow an endpoint to be marked as draining (which is more like 'unhealthy soon' than 'unhealthy now') - and when marked no new connections are sent but older ones persist.

More reading

  1. haproxy-ingress implements this by setting 0 weight to the endpoint which translates to no new connections but continue older ones - the implementation is at Support a "drain" mode for terminating pods and servers failing readiness checks jcmoraisjr/haproxy-ingress#95 which is well explained
  2. An outstanding feature request on Nginx's Kubernetes' ingress at Enter drain mode when a pod is terminating and sticky-sessions are enabled nginx/kubernetes-ingress#267 - some pointers in the discussion there
  3. Feature request on Ambassador (on which I am working on) - Support for draining node during pod termination emissary-ingress/emissary#1473

@concaf
Copy link
Author

concaf commented Jun 10, 2019

I just tried separating the draining endpoints and healthy endpoints into 2 different clusters, and then pointed the route to both of them via weighted_clusters (draining cluster with weight 0, and healthy cluster with weight 100), but no luck. Sticky sessions do not persist -

                                                            "weighted_clusters": {
                                                                "clusters": [
                                                                    {
                                                                        "name": "cluster_qotm_er_ring_hash_cookie_sticky_cookie",
                                                                        "weight": 100
                                                                    },
                                                                    {
                                                                        "name": "draining_cluster",
                                                                        "weight": 0
                                                                    }
                                                                ]
                                                            }

@mattklein123
Copy link
Member

I'm sorry I don't fully understand what you are after. Session stickiness is supported effectively via hashing of an incoming request cookie. How can Envoy understand whether to continue to route existing sessions without some state in the cookie itself?

Do you mean that you want users with existing cookies to keep routing, but no new cookies to be issued that target that upstream?

@concaf
Copy link
Author

concaf commented Jun 11, 2019

Do you mean that you want users with existing cookies to keep routing, but no new cookies to be issued that target that upstream?

@mattklein123 Yep, exactly 👍

I want to drain an endpoint/upstream and I want to give it 1 hour to drain. In that one hour, I want already existing sticky sessions to keep routing to that very upstream, but I don't want any newer sticky sessions to be issued to that upstream.

@mattklein123
Copy link
Member

OK I think I understand what you are after. Right now the cookie hashing policies just effectively store a hash value in the cookie and are meant for use with a consistent hashing load balancer. When a host is failed, the rest of the hosts will rehash. There is no actual state stored in the cookie itself.

I think we could support something like this but it would take a bunch of thinking on how to do it. @alyssawilk any thoughts on this one given the existing cookie support?

@alyssawilk
Copy link
Contributor

I'm not terribly familiar with Envoy upstream selection. I also kind of see how this would work for H2, where you can check connection lifetime vs drain time if we move a bunch of APIs around but I'm less convinced it works well for HTTP/1 given that you may have a preexisting pipeline connection and then prefetch more, and have inconsistent state?

I think rather than base this on connection lifetime it might make sense to do some logic around cookie max-age, where if the client connection were in continued use Envoy could continue extending the max-age and if the client went idle the cookie would expire. Then it's just a question of having a state where your backend selection always uses the hash when the cookie is present, and only selects from non-draining backend when the cookie is not present?

@kflynn
Copy link
Contributor

kflynn commented Jun 20, 2019

After kicking this around at Datawire for a bit, we think that it would work to add a way to mark a given upstream as "blacklisted for new hashes only":

  • if a request comes in with a valid (non-expired) cookie directing it to the blacklisted upstream, fine, go ahead and route.
  • if a request comes in with no cookie, or with an invalid or expired cookie, never allow the hash to select any blacklisted upstream.

Does that sound like a reasonable approach?

@mattklein123
Copy link
Member

Does that sound like a reasonable approach?

Conceptually it would work, but I think the implementation will be complex, as then we need a per-host draining concept (which I guess we have today but I would need to refresh the details). Worse, and this is where I would really have to think it through, I'm not convinced this works well at all if the upstream hosts are stable and things rehash. Are we assuming upstream stability? Even if we assume upstream stability, once a host drains and then goes away, 1/N approximately are going to rehash again anyway so don't you still need to support migration?

@stale
Copy link

stale bot commented Jul 21, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 21, 2019
@stale
Copy link

stale bot commented Jul 28, 2019

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this as completed Jul 28, 2019
@concaf
Copy link
Author

concaf commented Aug 20, 2019

@mattklein123 @alyssawilk can we mark this as 'help wanted'?

@alyssawilk alyssawilk reopened this Aug 20, 2019
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Aug 20, 2019
@alyssawilk alyssawilk added the help wanted Needs help! label Aug 20, 2019
@mattklein123 mattklein123 added design proposal Needs design doc/proposal before implementation and removed question Questions that are neither investigations, bugs, nor enhancements labels Aug 20, 2019
@Pawan-Bishnoi
Copy link
Contributor

OK I think I understand what you are after. Right now the cookie hashing policies just effectively store a hash value in the cookie and are meant for use with a consistent hashing load balancer. When a host is failed, the rest of the hosts will rehash. There is no actual state stored in the cookie itself.

I think we could support something like this but it would take a bunch of thinking on how to do it. @alyssawilk any thoughts on this one given the existing cookie support?

@mattklein123 this isn't a concern now with sessionFilter
Since podIP is used directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design proposal Needs design doc/proposal before implementation help wanted Needs help!
Projects
None yet
Development

No branches or pull requests

5 participants