Add Ratelimit API #1767

hzxuzhonghu · 2020-12-02T09:16:07Z

Based on the design

istio-policy-bot · 2020-12-02T09:16:11Z

😊 Welcome @hzxuzhonghu! This is either your first contribution to the Istio api repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, code of conduct, and contributing guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

istio-testing · 2020-12-02T09:16:27Z

@hzxuzhonghu: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| release-notes_api | `ab20dd7` | link | `/test release-notes_api` |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

hzxuzhonghu · 2020-12-02T09:17:53Z

cc @gargnupur @mandarjog @bianpengyuan @douglas-reid @howardjohn @ramaraochavali

costinm · 2020-12-07T03:36:52Z

mesh/v1alpha1/config.proto

@@ -631,3 +635,27 @@ message Certificate {
  // multiple DNS names.
  repeated string dns_names = 2;
 }
+
+// RateLimitService describes the configuration for an external rate limit service provider.
+message RateLimitService {


As usual: do we expect the rate limit service to be the same in all namespaces and pods? Is there no possible use case where different namespaces would use different rate limit services?

Why not add it to ProxyConfig instead, and use the existing pattern of RemoteService?

If you want to keep it in MeshConfig, maybe the ExtensionProvider (which was just added) would be needed, but I don't think we have an approved design on using ExtensionProvider.

Generally, I think this should be global, because that is easy to operate. I understand ProxyConfig is used to generate configs for the proxy on the proxy side, but RLS is used in istiod. As for ExtensionProvider, it is only for external authz.

costinm · 2020-12-07T03:38:27Z

mesh/v1alpha1/config.proto

+
+  // The timeout for the rate limit service RPC.
+  // If not set, this defaults to 20ms.
+  google.protobuf.Duration timeout = 4;


For all services we use in Istio, we need to document how auth is going to be performed. Are all rate limit services using standard Istio mTLS, or do we need to support other mechanisms?

Also, is this service subject to normal Istio APIs, DestinationRule for example? If yes (and I assume it should be), do we want to duplicate the timeout here?

The timeout here is from the application's view (envoy -> rls). A DestinationRule can be used, but it would require an additional config.

costinm · 2020-12-07T03:39:39Z

mesh/v1alpha1/config.proto

+  // The filter’s behaviour in case the rate limiting service does not respond back.
+  // When it is set to true, Envoy will allow traffic in case of communication failure
+  // between rate limiting service and the proxy.
+  bool fail_open = 3;


The default bool value is false, i.e. fail closed. I don't think that's correct for a rate limit service. I would rather have the setting be 'fail_closed', so the user needs to add it explicitly if they want this.

Ah, this is the behavior of the legacy mixer.
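To make the fail-open question concrete, here is a minimal Python sketch of the decision being discussed. It is not Istio or Envoy code: `RateLimitSettings`, `allow_request`, and the `check` callable are hypothetical names, and the only facts taken from the thread are the `fail_open` flag, the proto3 default of `false`, and the proposed 20ms timeout.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RateLimitSettings:
    # Hypothetical mirror of the proposed fields, not the generated Istio type.
    fail_open: bool = False          # proto3 bool default: false, i.e. fail closed
    timeout_seconds: float = 0.020   # the proposed default of 20ms


def allow_request(settings: RateLimitSettings,
                  check: Callable[[float], bool]) -> bool:
    """Return True if the request may proceed.

    `check` stands in for the RPC to the rate limit service. It returns True
    when the request is within limits and raises on timeout or connection
    failure.
    """
    try:
        return check(settings.timeout_seconds)
    except (TimeoutError, ConnectionError):
        # The service did not answer, so fall back to the configured policy.
        return settings.fail_open


def unreachable(timeout: float) -> bool:
    """Simulates a rate limit service that never responds."""
    raise TimeoutError("rate limit service did not respond")
```

With the default settings an unreachable service rejects traffic, which is the behavior the review comment objects to; a `fail_closed` field would invert the default without changing this logic.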

costinm · 2020-12-07T03:44:20Z

mesh/v1alpha1/config.proto

+// RateLimitService describes the configuration for an external rate limit service provider.
+message RateLimitService {
+  // REQUIRED. Specifies the service that implements rate limit service.
+  // The format is "[<Namespace>/]<Hostname>". The <Hostname> is the full qualified host name in the Istio service


I know we use this pattern in some configs, like Gateway, for a specific purpose (delegation).

Do we need this here? If there is a strong use case, I will request that any implementation have test cases for each option before we unhide this API. Maybe start with the simple solution first: use a ServiceEntry/DestinationRule/etc. in istio-system, and avoid the complexity of allowing an arbitrary namespace.

We should not require that it be in istio-system.
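For reference, the `"[<Namespace>/]<Hostname>"` format under discussion could be parsed roughly as follows. This is a sketch under assumptions: `parse_service` is a hypothetical helper, and the validation rules (rejecting an empty namespace or more than one `/`) are guesses, since the thread does not specify them.

```python
from typing import Optional, Tuple


def parse_service(ref: str) -> Tuple[Optional[str], str]:
    """Split "[<Namespace>/]<Hostname>" into (namespace, hostname).

    The namespace is None when the optional prefix is omitted.
    """
    namespace, sep, hostname = ref.rpartition("/")
    if not hostname:
        raise ValueError(f"missing hostname in {ref!r}")
    if sep and (not namespace or "/" in namespace):
        raise ValueError(f"invalid namespace in {ref!r}")
    return (namespace if sep else None), hostname
```

Each accepted shape (with and without a namespace) and each rejected one is a separate case an implementation would need to test, which is the cost raised in the review comment.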

costinm · 2020-12-07T03:47:56Z

networking/v1alpha3/virtual_service.proto

+
+  // The following descriptor entry is appended to the descriptor:
+  // `("generic_key", "<descriptor_value>")`
+  message GenericKey{


I think this PR is introducing a lot of things that feel duplicated and quite complex. Don't we have other APIs that deal with header matching, etc. (in Policy, routing)?

This allows specifying an arbitrary key-value descriptor entry, and it is optional.

It would be better to take the input from request properties, as matches do, instead of introducing key-value pairs.

Both are useful. Generic value could be the route name for example.
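As an illustration of the `GenericKey` action, the sketch below appends the entry described in the proto comment, using the `descriptor_key` default of `generic_key` and the required `descriptor_value` from the PR diff. The function name and the tuple representation of a descriptor are assumptions made for this example only.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, str]  # one (key, value) descriptor entry


def append_generic_key(descriptor: List[Entry],
                       descriptor_value: str,
                       descriptor_key: Optional[str] = None) -> List[Entry]:
    """Return a new descriptor with the generic key entry appended."""
    if not descriptor_value:
        raise ValueError("descriptor_value is required")
    # An unset key falls back to "generic_key", as the field comment states.
    return descriptor + [(descriptor_key or "generic_key", descriptor_value)]
```

For example, passing a route name as the value (the use case mentioned above) yields an entry the rate limit service can key on without inspecting any request property.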

costinm · 2020-12-07T03:48:47Z

networking/v1alpha3/virtual_service.proto

+      string prefix = 5;
+
+      // RE2 style regex-based match (https://github.com/google/re2/wiki/Syntax).
+      string regex = 6;


Usual question about regex: what are the test plans (I don't think the design doc includes enough detail), and do we really need regex? Perf impact, complexity, etc.

Even if Envoy supports it, we don't have to.

Regex provides much more flexibility, for example URI regex matching. Shouldn't we make it extensible?
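The trade-off between the match kinds can be sketched as below. This is an illustration only: the `exact` option and the one-of-three rule are assumptions (the diff hunk shows only `prefix` and `regex`), and Python's `re` module is not RE2, so syntax and performance characteristics differ from what Envoy would use.

```python
import re
from typing import Optional


def matches(value: str, *,
            exact: Optional[str] = None,
            prefix: Optional[str] = None,
            regex: Optional[str] = None) -> bool:
    """Match `value` against exactly one of exact, prefix, or regex."""
    if sum(arg is not None for arg in (exact, prefix, regex)) != 1:
        raise ValueError("set exactly one of exact, prefix, regex")
    if exact is not None:
        return value == exact
    if prefix is not None:
        return value.startswith(prefix)
    # Assumption: the regex must match the whole value, not a substring.
    return re.fullmatch(regex, value) is not None
```

Prefix and exact matches have a small, easily enumerated test surface; the regex branch is where the test-plan and performance concerns above apply.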

costinm

I think all new APIs must be added as hidden, and will only be unhidden when the implementation is ready (including tests for anything that is getting added, in particular for beta APIs).

costinm · 2020-12-07T03:52:46Z

networking/v1alpha3/virtual_service.proto

+
+  // A RateLimitDescriptor is a list of hierarchical entries that are used by the service to
+  // determine the final rate limit key and overall allowed limit. Here are some examples of how
+  // they might be used for the domain "envoy".


I think the examples are missing, also to be able to have test coverage we'll need a more explicit list of what is supported.

Could the global rate limit API provide similar attributes, such as max_tokens and an interval, that define the rate limit itself? The current global rate limit API defines the input required by Envoy's rate limit server rather than the rate limit rule, which is hard for end users.

costinm · 2020-12-07T03:55:25Z

networking/v1alpha3/virtual_service.proto

@@ -1273,6 +1397,53 @@ message RouteDestination {
  // version. If there is only one destination in a rule, all traffic will be
  // routed to it irrespective of the weight.
  int32 weight = 2;
+
+  // L4 local rate limiting policy.
+  LocalRateLimit local_rate_limit = 3;


Given the goal to switch to BTS, and the likely situation where L4 rate limit will need to take into account attributes of the request - including TLS and metadata we add (telemetry for example): I would very much prefer to fold this into a single RateLimit, that applies for both HTTP and TCP.

We will need to document which headers/features are supported depending on connection type - but there are too many variations depending on TLS, transport, protocol to create a separate proto for each case - it seems cleaner to just treat all 'features' the same for http and TCP.

Combining TCP and HTTP is confusing; they have many differences. For TCP, I do not think metadata can be rate limited with upstream Envoy.

costinm · 2020-12-07T03:56:34Z

networking/v1alpha3/virtual_service.proto

+  int32 status_code = 1;
+
+  // The token bucket configuration to use for rate limiting requests.
+  TokenBucket token_bucket = 2 [(google.api.field_behavior) = REQUIRED];


So the HTTP rate limit doesn't use any of the request attributes? It seems identical to the L4 one, except for the status code.

Yeah, request attributes are not supported with Envoy's local rate limit.
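A token bucket that ignores request attributes, as described here, can be sketched in a few lines of Python. The field names `max_tokens`, `tokens_per_fill`, and `fill_interval` follow Envoy's TokenBucket message; the class, the `handle` helper, and the explicit `now` clock parameter are hypothetical and chosen only to keep the example deterministic.

```python
class TokenBucket:
    """Local token bucket: one shared budget, no per-request attributes."""

    def __init__(self, max_tokens: int, tokens_per_fill: int,
                 fill_interval: float, start: float = 0.0) -> None:
        self.max_tokens = max_tokens
        self.tokens_per_fill = tokens_per_fill
        self.fill_interval = fill_interval  # seconds between refills
        self.tokens = max_tokens            # the bucket starts full
        self.last_fill = start

    def allow(self, now: float) -> bool:
        # Credit every whole fill interval that elapsed since the last refill.
        fills = int((now - self.last_fill) // self.fill_interval)
        if fills > 0:
            self.tokens = min(self.max_tokens,
                              self.tokens + fills * self.tokens_per_fill)
            self.last_fill += fills * self.fill_interval
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


def handle(bucket: TokenBucket, now: float) -> int:
    """Return the HTTP status for a request arriving at time `now`."""
    return 200 if bucket.allow(now) else 429  # 429 Too Many Requests
```

The only HTTP-specific part is the status code returned by `handle`; the bucket itself would serve an L4 connection limit unchanged, which is the similarity the comment points out.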

costinm · 2020-12-07T03:57:08Z

networking/v1alpha3/virtual_service.proto

+message HTTPLocalRateLimit {
+  // StatusCode allows for a custom HTTP response status code to the downstream client
+  // when the request has been rate limited. Defaults to 429 (TooManyRequests).
+  int32 status_code = 1;


Use case? 429 is a pretty standard response, so why would we need another option? It adds tests and complexity.

Agreed, I can remove it.

costinm

Also, since this is touching beta APIs that we'll have to support for a LONG time, we again need to figure out how to clearly document that the added fields are experimental/brand new and may change (which is likely, given how many there are and how complex they look).

costinm

One more thing, and sorry for not reading the design carefully: I realize there is a major issue here. The VirtualService and the model here are applied outbound, i.e. the rate limit service will have no effect if a client doesn't use Istio.

It is possible to apply the rate limit in Gateway for example, but that would be a different API ( i.e. if only gateway is supposed to do rate limit calls ).

We do need some way to enforce rate limits on inbound as well.

costinm · 2020-12-07T15:30:06Z

On Mon, Dec 7, 2020 at 4:47 AM Zhonghu Xu wrote:

networking/v1alpha3/virtual_service.proto

+  // If not set it defaults to ‘generic_key’ as the descriptor key.
+  string descriptor_key = 1;
+
+  // The value to use in the descriptor entry.
+  string descriptor_value = 2 [(google.api.field_behavior) = REQUIRED];
+  }
+}
+
+// HTTPLocalRateLimit used to rate limit the HTTP requests locally
+message HTTPLocalRateLimit {
+  // StatusCode allows for a custom HTTP response status code to the downstream client
+  // when the request has been rate limited. Defaults to 429 (TooManyRequests).
+  int32 status_code = 1;
+
+  // The token bucket configuration to use for rate limiting requests.
+  TokenBucket token_bucket = 2 [(google.api.field_behavior) = REQUIRED];

Yeah, request attributes is not supported with envoy's local rate limit.

From an API perspective, the question is whether this is a fundamental, intentional design choice that needs to be reflected in the API design, or just a limitation of the current implementation. I strongly suspect it's the latter. In addition, advanced rate limiting (as an API) may use a combination of local token bucket and remote calls.

I think it is critical to agree that an Istio RateLimit API should be an abstraction that allows multiple implementations, including 'proxy-less' or non-Envoy proxies, as well as different implementations inside Envoy (WASM, other styles). So I would also use a single proto to represent both rate limits, and maybe add a section to the design doc covering the previous rate limit API we had (when mixer provided an implementation), with an analysis of what was wrong with the old API and an explanation of how we want to improve on it.

I also strongly believe in "less is more", so maybe we can start with a smaller API, and later, if we find the need to have 4, we can add the others (i.e. a dedicated TCP pair, and local/remote differentiation).

costinm · 2020-12-07T15:39:50Z

On Mon, Dec 7, 2020 at 4:20 AM Zhonghu Xu wrote:

mesh/v1alpha1/config.proto

@@ -631,3 +635,27 @@ message Certificate {
  // multiple DNS names.
  repeated string dns_names = 2;
 }
+
+// RateLimitService describes the configuration for an external rate limit service provider.
+message RateLimitService {

Generally, i think this should be global. It is easy to operate. I understand ProxyConfig is used to generate configs for proxy in proxy side. But rls is used in istiod. As ExtensionProvider, it is only for external authz.

Why? In a large org, MeshConfig is operated by the 'istiod admin team'. We are trying to reduce the pressure on the istiod team (which may be an external vendor for 'managed istiod'). Without Istio, a user has the ability to deploy their own redis (or other rate limit service) and configure their workloads to use it without asking a central team to make changes (which involve running the operator and, in practice, restarting Istio, at least with current istioctl or helm, since MeshConfig is overridden).

ExtensionProvider: I agree, it was designed for external authz and unfortunately uses a generic and confusing name that suggests that all external providers will be covered. To be fair, there is a need for a common style and design for all integrations. I don't think ExtensionProvider in MeshConfig is the right one, in part for the reasons above, but there are far more issues. However, the telemetry team is attempting to use it, and the TOC may override my objections, so I have to raise the point: if the TOC decides ExtensionProvider is required, this PR will also be affected.

costinm · 2020-12-07T15:44:49Z

On Mon, Dec 7, 2020 at 4:20 AM Zhonghu Xu wrote:

mesh/v1alpha1/config.proto

@@ -631,3 +635,27 @@ message Certificate {
  // multiple DNS names.
  repeated string dns_names = 2;
 }
+
+// RateLimitService describes the configuration for an external rate limit service provider.
+message RateLimitService {

Generally, i think this should be global. It is easy to operate. I understand ProxyConfig is used to generate configs for proxy in proxy side. But rls is used in istiod. As ExtensionProvider, it is only for external authz.

MeshConfig.default is also global, and as easy to operate as fields in MeshConfig (yes, users have the ability to override, but they don't have to).

ProxyConfig is intended for configs affecting the proxy; it doesn't matter who generates the configs. For example, we discussed the option to generate the bootstrap in istiod, and tracing, which is now in bootstrap, is moving to proper dynamic config.

I think we are confusing implementation details (where some code happens to run or how the API is implemented) with the intent. For example, if we wanted istiod itself to make rate limit calls, I would put it in MeshConfig, because it would clearly not be a config of the proxies. But everything that affects the proxy I would keep in ProxyConfig.

howardjohn

We have 2 open design docs, and now a PR, for the same feature. Can we keep discussions in one place?

gargnupur · 2020-12-07T16:02:54Z

We have 2 open design docs, and now a PR, for the same feature. Can we keep discussions in one place?

Agreed... it might be easier to just have a meeting and sort it out.. @hzxuzhonghu : what do you think?
Looks like networking group sync up is coming this week, we can use time after that?

Docs in review:

hzxuzhonghu · 2020-12-08T02:48:31Z

The meeting time is not friendly for my timezone (1am for me). Can we move it to something like PST 17:00?

gargnupur · 2020-12-09T07:29:31Z

@hzxuzhonghu: We are discussing a good time to meet and will get back to you in a day or two.

costinm · 2020-12-09T15:00:30Z

We can discuss in the networking meeting, currently at 9 PST. PST 17:00 sounds good too; maybe we should discuss moving the networking meeting to this time if it is more convenient for people. I don't think we have a lot of participants from Europe, but East Coast people may be unhappy.

I am starting to lean towards simply defining separate protos, in dedicated packages (rate.istio.io/v1alpha1), with an eye towards adopting external specs or converging with other APIs in this space instead of being Istio-only, and moving them out of the istio.io space. This is similar to what we plan for Gateway (moving towards a K8S CR).


hzxuzhonghu · 2020-12-11T02:10:40Z

I am starting to lean towards simply defining separate protos, in dedicated
packages (rate.istio.io/v1alpha1)

I am all for this. What's more, we should provide a rate limit server that can also consume this API, so users do not need to configure things twice (both in Istio and on the rate limit server side). Previously, when we had mixer, this was natural.

hzxuzhonghu · 2020-12-13T10:10:39Z

@costinm @gargnupur This is a separate API that is much more direct and is not coupled to VirtualService. It can be applied on inbound or outbound, at the user's choice.
https://docs.google.com/document/d/1ySvR6s-6Ngs0Uaj_e3-8MaM4bkm44I_HkHLuP6fABVA/edit#

istio-testing · 2020-12-15T16:25:31Z

@hzxuzhonghu: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SpecialYang · 2021-04-07T07:08:28Z

How is this pr going? Need this feature, too.

hzxuzhonghu · 2021-06-04T08:44:11Z

@kyessenov @mandarjog @gargnupur @costinm @howardjohn I am not sure how to push this feature forward. I started working on it last year, half a year has passed, and we haven't come to any agreement. Most users, including us, are using EnvoyFilter, or have designed a separate CRD and translate it to EnvoyFilter, which is very bad UX.

cc @istio/technical-oversight-committee is this still on the roadmap?

kyessenov · 2021-06-04T17:01:28Z

@hzxuzhonghu It's still something we want. Besides the concerns around the low-level details of actions/descriptors, we need to figure out how to decouple rate limit policy from networking APIs.

hzxuzhonghu · 2021-06-07T01:42:26Z

@kyessenov You are absolutely right. With a separate API, maybe we could implement rate limiting as a plugin, as auth does.

jwendell · 2021-07-23T12:50:06Z

Is this obsoleted by #2028?

hzxuzhonghu · 2021-07-26T02:48:38Z

I think so, though it has some downsides

hzxuzhonghu added 2 commits December 2, 2020 17:13

add rate limit

567dd7d

auto gen

ab20dd7

hzxuzhonghu requested review from dcberg, duderino, linsun, louiscryan, nrjpoddar, rshriram and smawson as code owners December 2, 2020 09:16

google-cla bot added the cla: yes label (set by the Google CLA bot to indicate the author of a PR has signed the Google CLA) Dec 2, 2020

istio-testing added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) Dec 2, 2020

costinm reviewed Dec 7, 2020

costinm requested changes Dec 7, 2020

costinm reviewed Dec 7, 2020

howardjohn requested changes Dec 7, 2020

istio-testing added the needs-rebase label (indicates a PR needs to be rebased before being merged) Dec 15, 2020

arkodg mentioned this pull request Feb 10, 2023

Support Rate Limiting using Gateway API Extensions istio/istio#43295

Open

hzxuzhonghu closed this May 22, 2023
