stateful session persistence #16698

Closed
wbpcode opened this issue May 27, 2021 · 18 comments
Labels
area/load balancing enhancement Feature requests. Not bugs or questions. stale stalebot believes this issue/PR has not been touched recently

Comments

@wbpcode
Member

wbpcode commented May 27, 2021

HTTP session persistence is currently achieved through hash-based (consistent hashing) load balancing. When the state of the backend servers changes (a new server is added, an existing server is removed, a server's state is updated, etc.), session persistence breaks.

Is it possible to consider adding cookie-based stateful session persistence? When this feature is turned on, a host ID or a similar identifier is added to the cookie. When new requests arrive, the corresponding host is resolved from the cookie and passed to the LoadBalancer via Upstream::LoadBalancerContext. The LoadBalancer can then return this host first, if possible.
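
The cookie handling proposed above could look roughly like the sketch below. It is only an illustration: the cookie name `sticky-host`, the helper names `findCookie` and `makeStickyCookie`, and the choice to store a plain ip:port value are all assumptions made for this example, not part of Envoy.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical cookie name chosen for this sketch; Envoy does not define it.
constexpr std::string_view kStickyCookie = "sticky-host";

// Returns the value of the cookie called `name` from a Cookie request header,
// or std::nullopt when the cookie is absent.
std::optional<std::string> findCookie(std::string_view header, std::string_view name) {
  while (!header.empty()) {
    // Split off the next "key=value" pair; pairs are separated by ';'.
    const std::size_t semi = header.find(';');
    std::string_view pair = header.substr(0, semi);
    header = (semi == std::string_view::npos) ? std::string_view{} : header.substr(semi + 1);
    // Drop the spaces that follow the separator.
    while (!pair.empty() && pair.front() == ' ') pair.remove_prefix(1);
    const std::size_t eq = pair.find('=');
    if (eq != std::string_view::npos && pair.substr(0, eq) == name) {
      return std::string(pair.substr(eq + 1));
    }
  }
  return std::nullopt;
}

// Builds a Set-Cookie value that pins the session to `host_address`.
std::string makeStickyCookie(std::string_view host_address) {
  assert(!host_address.empty());
  return std::string(kStickyCookie) + "=" + std::string(host_address) + "; Path=/; HttpOnly";
}
```

On the first response the proxy would emit `makeStickyCookie("10.0.0.1:8080")`; on later requests `findCookie(cookie_header, kStickyCookie)` recovers the address to hand to the load balancer.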

@wbpcode wbpcode added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels May 27, 2021
@wbpcode wbpcode changed the title from "stateful session sticky based on cookie" to "stateful session persistence based on cookie" May 27, 2021
@phlax phlax added area/load balancing and removed triage Issue requires triage labels May 27, 2021
@phlax
Member

phlax commented May 27, 2021

cc @alyssawilk

@wbpcode
Member Author

wbpcode commented May 27, 2021

  • The Upstream::LoadBalancerContext may need to be extended with an interface similar to the one shown below:
#include <cstddef>
#include <string>

struct PrimaryHost {
   // Something that uniquely identifies a host (ID, hash, address, ...).
   std::string host_id_or_hash_or_address_or_something;
   // Expected health status of the host.
   size_t expected_status;
};

PrimaryHost primaryHostShouldSelected();
  • The HTTP router needs a new API to enable or disable this feature, plus some code to handle the cookie.

  • The LoadBalancer can optionally select the host based on the above interface. Of course, when this feature is enabled, we need an additional hash table, which brings some additional memory overhead.
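
The three points above can be sketched as a small load balancer that consults a primary host before falling back to its normal algorithm. This is a minimal sketch, not Envoy code: `HealthFlag`, `Host`, `StickyRoundRobin`, and the bitmask interpretation of `expected_status` are all hypothetical names and choices made for illustration, and the round-robin fallback ignores health for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical health flags; a bitmask lets the caller accept several states.
enum HealthFlag : uint32_t { Healthy = 1, Degraded = 2, Unhealthy = 4 };

struct Host {
  std::string address;
  HealthFlag health;
};

// Mirrors the PrimaryHost idea from the comment above.
struct PrimaryHost {
  std::string address;       // uniquely identifies the host
  uint32_t expected_status;  // bitmask of acceptable HealthFlag values
};

class StickyRoundRobin {
public:
  explicit StickyRoundRobin(std::vector<Host> hosts) : hosts_(std::move(hosts)) {
    assert(!hosts_.empty());
    // The additional hash table: address -> index into hosts_.
    for (std::size_t i = 0; i < hosts_.size(); ++i) by_address_[hosts_[i].address] = i;
  }

  // Returns the primary host when it exists and its health is acceptable;
  // otherwise falls back to plain round robin (which ignores health here).
  const Host& chooseHost(const std::optional<PrimaryHost>& primary) {
    if (primary) {
      auto it = by_address_.find(primary->address);
      if (it != by_address_.end() && (hosts_[it->second].health & primary->expected_status)) {
        return hosts_[it->second];
      }
    }
    return hosts_[next_++ % hosts_.size()];
  }

private:
  std::vector<Host> hosts_;
  std::unordered_map<std::string, std::size_t> by_address_;
  std::size_t next_ = 0;
};
```

The hash table costs one entry per host, which is the memory overhead mentioned in the last bullet.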

In general, it is not complicated and does not require much new code.

I am not sure, but this may be able to close #7218.

@wbpcode
Member Author

wbpcode commented Jun 10, 2021

I added a doc that describes more details and some possible solutions:
https://docs.google.com/document/d/1IU4b76AgOXijNa4sew1gfBfSiOMbZNiEt5Dhis8QpYg/edit?usp=sharing

@wbpcode wbpcode changed the title from "stateful session persistence based on cookie" to "stateful session persistence" Jun 10, 2021
@wbpcode
Member Author

wbpcode commented Jun 10, 2021

@alyssawilk @mattklein123 Is this plan feasible for #7218?

@alyssawilk
Contributor

@wbpcode
Member Author

wbpcode commented Jun 11, 2021

@alyssawilk The most important reason is that, no matter which hash policy is used, the existing session persistence is stateless. In the current solution, the cookie generated by Envoy does not directly store the upstream host information. Instead, a host is selected by the load balancing algorithm from the computed hash value.
When the state of the backend servers changes (new servers are added, some servers' state is updated, etc.), the hash ring is rebuilt and session persistence is broken.
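
A toy hash ring makes the breakage described above concrete. This is a sketch under simplifying assumptions: the `Ring` alias and `lookup` helper are hypothetical names, and the ring positions are hand-picked so the result can be checked by eye, whereas a real ring derives positions by hashing each host.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// A toy hash ring: ring position -> host name. Positions are hand-picked for
// illustration; a real ring computes them by hashing each host.
using Ring = std::map<uint32_t, std::string>;

// Returns the first host at or after `key_position`, wrapping around to the
// start of the ring when no such host exists.
const std::string& lookup(const Ring& ring, uint32_t key_position) {
  assert(!ring.empty());
  auto it = ring.lower_bound(key_position);
  if (it == ring.end()) it = ring.begin();
  return it->second;
}
```

With hosts A, B, and C at positions 10, 40, and 70, a session whose hash lands on position 50 is served by C. After a host D is added at position 55, the same session is served by D, even though C is still available. The cookie only carries the hash input, so nothing records that the session used to belong to C.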

The requirement in #7218 is actually a good example. When host x is marked as draining or degraded or unhealthy, for the existing session requests, we hope that they will still be routed to host x. If we have the ability to maintain stateful session persistence, and for each session, save the relevant host information, then we can do this.

@alyssawilk
Contributor

so at some point if a host is draining it's going to go away. It sounds like what you really want here is a period where no new "sticky" sessions are assigned, but sessions already sticky to an upstream will be routed there?

You say on the doc that maglev might meet your needs if only you could route while draining, but I don't think either of your design options include routing to a host while it's draining, which I don't think your cookie solution would necessarily do either.

That said, encoding ip:port of backend in cookies and using them for direct routing is a pretty common method of LB, and if you want to implement that I think it would be a useful addition to Envoy.

@wbpcode
Member Author

wbpcode commented Jun 15, 2021

> so at some point if a host is draining it's going to go away. It sounds like what you really want here is a period where no new "sticky" sessions are assigned, but sessions already sticky to an upstream will be routed there?

Yes.

> You say on the doc that maglev might meet your needs if only you could route while draining, but I don't think either of your design options include routing to a host while it's draining, which I don't think your cookie solution would necessarily do either.

A draining host is a special case. Because it has already been removed from the cluster's host set, we may need to think more about it. But unhealthy and degraded hosts can use the cookie scheme to keep their existing sticky sessions.

> That said, encoding ip:port of backend in cookies and using them for direct routing is a pretty common method of LB, and if you want to implement that I think it would be a useful addition to Envoy.

I am happy to implement it.

@rgs1
Member

rgs1 commented Jul 8, 2021

Ok reposting some of the stuff shared with @wbpcode offline, since it might be useful for future cases of custom load balancing.

We've used two methods for achieving something similar to what could be called stateful session persistence.

Case 1:
As described in [0], we route requests based on a cookie value which is extracted and then applied as dynamic metadata using a filter similar to the header-to-metadata filter [1]. This metadata is in turn used by the subset LB to find the right endpoint. For @wbpcode's use case, you could have one subset per endpoint to achieve this. The case of missing endpoints could be addressed by computing an availability map from the main thread or just checking directly from the data path if the endpoint is available. For the use case described in the blog post, we compute a map of capacity per version from the main thread. The additional work from the main thread might not be needed for wbpcode's case.

Case 2:
We have another use case where we balance requests by hashing on the output of a regex applied to a given header. This might not be directly applicable, since there isn't – currently at least – a way to handle fallbacks in a custom way when using a hash policy for routing.
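
Case 2 can be illustrated with the sketch below, which hashes the first capture group of a regex applied to a header value. Everything here is assumed for the example: the helper names `fnv1a` and `pickHostIndex`, the use of FNV-1a, and plain modulo selection in place of a real hash ring. The missing custom fallback shows up as the `std::nullopt` branch, which the caller has to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <regex>
#include <string>

// 64-bit FNV-1a, used only so the sketch behaves the same on every platform.
uint64_t fnv1a(const std::string& s) {
  uint64_t h = 0xcbf29ce484222325ULL;
  for (unsigned char c : s) {
    h ^= c;
    h *= 0x100000001b3ULL;
  }
  return h;
}

// Applies `pattern` to the header value and hashes the first capture group
// onto one of `host_count` hosts. Returns std::nullopt when the pattern does
// not match, leaving the fallback decision to the caller.
std::optional<std::size_t> pickHostIndex(const std::string& header_value,
                                         const std::regex& pattern, std::size_t host_count) {
  assert(host_count > 0);
  std::smatch m;
  if (!std::regex_search(header_value, m, pattern) || m.size() < 2) return std::nullopt;
  return static_cast<std::size_t>(fnv1a(m[1].str()) % host_count);
}
```

Two requests whose header values differ but contain the same captured key map to the same host index, as long as the host count does not change.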

Happy to expand more on any of these details.

[0] https://medium.com/pinterest-engineering/simplifying-web-deploys-19244fe13737
[1] https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/header_to_metadata/v3/header_to_metadata.proto
[2] #11819

@wbpcode
Member Author

wbpcode commented Jul 8, 2021

Reposting some replies to @rgs1 so that people who want to learn about this work can get more information.

Custom filter + dynamic metadata + subset LB is a powerful method that can be used to implement various kinds of complex routing.

But I think the problem that custom filter + dynamic metadata + subset LB solves is still not the same as stateful session stickiness. The problem I hope to solve is how to persist the session.
For example, say we have three upstream instances, A, B, and C, and requests from client X are initially routed to A. Later, A is marked as degraded by health checking. I would like traffic from X to still be routed to A at that point, while traffic from a new client Y is routed to B or C as normal.

In essence, subset LB still matches routing rules and then selects a host. (With clever matching rules you can achieve a result similar to stateful session stickiness in some scenarios, but I think this usage is too complicated for persisting sessions, and it only covers some of the scenarios.)

Stateful session stickiness records the result of the first match and then routes directly to the corresponding host. It is completely independent of the load balancing algorithm, so any load balancing algorithm can gain this property.
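
The A/B/C example and the claim of algorithm independence can be sketched as a thin wrapper around any host-picking function. This is an illustration under assumptions: `Picker`, `Decision`, and `route` are hypothetical names, and the wrapped algorithm is reduced to a callable that returns a host address.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <set>
#include <string>

// Any load balancing algorithm, reduced to "pick a host address".
using Picker = std::function<std::string()>;

struct Decision {
  std::string host;  // host that receives the request
  bool set_cookie;   // true when the client must be told to remember `host`
};

// Routes to the host recorded in the cookie while that host is still known;
// otherwise asks the wrapped algorithm and records its answer.
Decision route(const std::optional<std::string>& cookie_host,
               const std::set<std::string>& known_hosts, const Picker& pick) {
  assert(pick != nullptr);
  if (cookie_host && known_hosts.count(*cookie_host) > 0) {
    return {*cookie_host, false};
  }
  return {pick(), true};
}
```

If A is degraded and the wrapped algorithm therefore only hands out B or C, client X (whose cookie names A) is still routed to A, while a new client Y without a cookie gets whatever the algorithm picks and receives a cookie for it.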

More detailed doc: https://docs.google.com/document/d/1IU4b76AgOXijNa4sew1gfBfSiOMbZNiEt5Dhis8QpYg/edit?usp=sharing

A similar feature in another proxy: https://docs.nginx.com/nginx/admin-guide/load-balancer/http-load-balancer/#enabling-session-persistence

@github-actions

github-actions bot commented Aug 7, 2021

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Aug 7, 2021
@wbpcode
Member Author

wbpcode commented Aug 11, 2021

working

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Aug 11, 2021
@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 10, 2021
@wbpcode
Member Author

wbpcode commented Sep 11, 2021

working

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Sep 11, 2021
@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 11, 2021
@wbpcode
Member Author

wbpcode commented Oct 12, 2021

working

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Oct 12, 2021
@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 11, 2021
@github-actions

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

4 participants