-
Notifications
You must be signed in to change notification settings - Fork 559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Outlier detection support in DestinationRule for grpc proxyless #3000
base: master
Are you sure you want to change the base?
Outlier detection support in DestinationRule for grpc proxyless #3000
Conversation
😊 Welcome @aditya-32! This is either your first contribution to the Istio api repo, or it's been You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines Thanks for contributing! Courtesy of your friendly welcome wagon. |
Hi @aditya-32. Thanks for your PR. I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/easycla |
b9abfe4
to
73bbc00
Compare
73bbc00
to
0b22a81
Compare
FailurePercentageEjection failure_percentage_ejection = 10; | ||
|
||
// SuccessRateEjection Algorithm for Grpc-xds proxyless for deciding | ||
// if the host is an outlier or not. | ||
SuccessRateEjection success_rate_ejection = 11; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this is only for grpc not sidecar.
How does it collaborate with the existing configs like Consecutive_5Xx
? Does it have a limitation only some of these can be set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently istio provides outlierDetection which is supported by envoy proxy only. But these FailurePercent and SuccessRate are not supported for both grpc-xds and envoy.
Does it have a limitation only some of these can be set?
regarding this , in order to trigger outlierDetection we need to set properties if either of these algorithm
reference for previous discussion:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does envoy not support the same ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Envoy has support envoy outlier detection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok but they do apply to envoy as well, right? if so, we shouldn't say its "for Grpc-xds proxyless", as its not. Its for any proxy type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot get the answer, can you reply positively
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hzxuzhonghu envoy proto for outlierDetection has those parameters for FailurePercent and SuccessRate algorithm, but currently we don't have a way to configure it from istio. Only way to configure those for application using envoy-proxy is through EnvoyFilter. But our use case is for grpc proxyless application(no envoy sidecar proxy) , and istio does not support a way to configure those Algorithm parameters. Also currently there is no way to configure outlierDetection for grpc-proxyless application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@howardjohn @hzxuzhonghu let me answer these questions on @aditya-32 behalf.
Q. Does envoy support these parameters?
Yes
Q. How does it collaborate with the existing configs like Consecutive_5XX?
From Success Rate and Failure Percentage
There is no mention of how they are able to co-relate with the consecutive_5xx, only mention are of the configurations like split_external_local_origin_errors
and few of the configurations.
In this scenario, I believe that it is for the envoy / grpc to handle how they would want to process the configuration and istio should help with translation and leave the client and the user to understand what combination works, because there will always be few internal implementations which are not documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my suggestion is to make them explicit, what does each option control and how will they collaborate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hzxuzhonghu can you please re-review, made the changes
// interval duration above) to perform failure percentage-based ejection for this address. If the | ||
// volume is lower than this setting, failure percentage-based ejection will not be performed for | ||
// this host. | ||
google.protobuf.UInt32Value request_volume = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it is a sliding window or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the request for the host should be atleast request_volume count within the interval period in order to be eligible for outlierDetection
FailurePercentageEjection failure_percentage_ejection = 10; | ||
|
||
// SuccessRateEjection Algorithm for Grpc-xds proxyless for deciding | ||
// if the host is an outlier or not. | ||
SuccessRateEjection success_rate_ejection = 11; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does envoy not support the same ones?
// The % chance that an address will be actually ejected when an outlier status is detected through | ||
// failure percentage statistics. This setting can be used to disable ejection or to ramp it up | ||
// slowly. | ||
google.protobuf.UInt32Value enforcement_percentage = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? I am struggling to see the use case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enforcement_percentage this is like what is the probability for a host to be removed from LB pool when detected as outlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know what it is, but I am not clear why we need to offer it or why a user would want to set it. "It exists in the Envoy API" is not a reason why, either -- Envoy's API has different design principals than Istio
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@howardjohn I think I understand the use-case here, lets say that there are 500 servers and 50 of them started showing ejection behaviour. Then the outlier detection algorithm will eject all these 50 servers and distribute its load on rest of the 450 servers, this might be a big traffic shift for other servers to handle.
On the contrary, having this setting to lets say 50 will eliminate 25 and then 12 and so on, such that the traffic is ramped up in corresponding intervals.
Although none of the documentations (both enovy / grpc-xds) clearly mentions in which scenario this is applicable. But from what I see it may help prevent issues with temporary spikes and help smoothen out the curve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well - the documentation on this PR is also pretty unclear. My concern is that reading either doc - I wouldn't know how to use it, and most likely people will use it wrong or not take advantage if it is actually needed - on an already over-complex system.
I understand there is a need - and envoy and proxyless may support this - but if we go down the path of supporting each and every envoy field we may be better off with just adopting Envoy API ( which we already do in EnvoyFilter - with well known problems ).
// The minimum number of addresses in order to perform failure percentage-based ejection. | ||
// If the total number of addresses is less than this value, failure percentage-based | ||
// ejection will not be performed. | ||
google.protobuf.UInt32Value minimum_hosts = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this in the context of Istio? How do we expect users to set it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@howardjohn for this and the remaining fields, I am a little confused when we say in the context of istio, I would like to better understand how do we decide which properties do we need to translate to data plane like envoy / grpc-xds, because istio is itself might not be doing any action here excpet for providing the configurations and if a user would like to configure these things for their setup, how would one ensure that they are made available to them via the destination rule otherwise?
I would say these are these parameters are more or less tuning parameters and may not be left as well left to default values (which is 5), because they would otherwise not work for the cases where the service is having less than 5 hosts. They are better tuned as per the traffic patterns and the use cases of the user.
I would have given better answer if I would have used it on staging / production setup and tried and tested it, but to test it we would as well be needing these features in istio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally users try it with EnvoyFilter - adding it to a beta API just to test it out and figure out that only 1 or 2 users understand how to use it - but we have to maintain it for a long time is not ideal.
I know John has a different opinion on new CRDs - but given that Ambient will have particular troubles with DestinationRule ( per-node ztunnel won't apply them - and waypoints can't respect the full semantics ) - this sounds like something that would be much better off in a separate CRD, like CircuitBreaking, with the new attachment model.
DR is in particular tricky because it can be used both client side and server side. It is unreasonable to expect clients to know how many servers are running ( and all of them to update when this change). With auto-scaling - I doubt even the producer can know this. Some of those settings would be far better left to Istiod to compute.
// interval duration above) to perform failure percentage-based ejection for this address. If the | ||
// volume is lower than this setting, failure percentage-based ejection will not be performed for | ||
// this host. | ||
google.protobuf.UInt32Value request_volume = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this in the context of Istio? How do we expect users to set it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@howardjohn make sense will check and update regarding the fields priority and usecase!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least this will need a LOT more docs and examples. Adding an API that even Istio reviewers have trouble understanding leaves little chance to regular users.
I don't disagree that some of it may be useful - but also not clear if this is something the user should configure or Istiod should set, knowing the current number of endpoints.
Another concern is that I don't understand how this relates to traffic shifting and versioning - if I deploy v2 and all the new endpoints are outliers, how would the setting affect this ?
In any case - at the very least the new fields should be annotated with whatever 'experimental' we use to indicate they are NOT part of the beta API and are very experimental. Also a LOT of docs on how this will work with ambient ( we still share many APIs, and for new stuff we should document if it applies or not, where it applies and how the workload selector vs other composition rules are changed)
// The % chance that an address will be actually ejected when an outlier status is detected through | ||
// failure percentage statistics. This setting can be used to disable ejection or to ramp it up | ||
// slowly. | ||
google.protobuf.UInt32Value enforcement_percentage = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well - the documentation on this PR is also pretty unclear. My concern is that reading either doc - I wouldn't know how to use it, and most likely people will use it wrong or not take advantage if it is actually needed - on an already over-complex system.
I understand there is a need - and envoy and proxyless may support this - but if we go down the path of supporting each and every envoy field we may be better off with just adopting Envoy API ( which we already do in EnvoyFilter - with well known problems ).
// The minimum number of addresses in order to perform failure percentage-based ejection. | ||
// If the total number of addresses is less than this value, failure percentage-based | ||
// ejection will not be performed. | ||
google.protobuf.UInt32Value minimum_hosts = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally users try it with EnvoyFilter - adding it to a beta API just to test it out and figure out that only 1 or 2 users understand how to use it - but we have to maintain it for a long time is not ideal.
I know John has a different opinion on new CRDs - but given that Ambient will have particular troubles with DestinationRule ( per-node ztunnel won't apply them - and waypoints can't respect the full semantics ) - this sounds like something that would be much better off in a separate CRD, like CircuitBreaking, with the new attachment model.
DR is in particular tricky because it can be used both client side and server side. It is unreasonable to expect clients to know how many servers are running ( and all of them to update when this change). With auto-scaling - I doubt even the producer can know this. Some of those settings would be far better left to Istiod to compute.
@@ -972,7 +1037,7 @@ message OutlierDetection { | |||
// for connections to upstream database cluster. | |||
// | |||
// {{<tabset category-name="example">}} | |||
// {{<tab name="v1alpha3" category-value="v1alpha3">}} | |||
// {{<tab name="v1alpha3" category-value="v1lpha3">}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is accidental.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I'd love to see it implemented as we are adopting proxyless setup. |
Closes #3001