Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG/Feature] - Consistent behavior for outlier detection for local and remote endpoints of a target service #205

Closed
nirvanagit opened this issue Apr 16, 2022 · 1 comment
Assignees
Labels
bug Something isn't working
Milestone

Comments

@nirvanagit
Copy link
Collaborator

Describe the bug
Outlier detection behaves differently when target service is local to client, and when it is in a remote cluster.

When target service is in the client cluster

Target service is accessed over .svc service endpoint, which is a single IP (kubernetes service IP)

When target service is in a remote cluster

Target service is accessed over load balancer cname, which resolves to 3 IPs, one for each availability zone.

When target service in local cluster goes down, envoy marks it as unhealthy based on max_ejection_percent, which defaults to 1, as admiral does not configure this currently.

When target service in remote cluster goes down, envoy marks only one out of 3 endpoints as unhealthy, because it ejects the endpoints based on max_ejection_percentage, which as mentioned earlier is currently not set by admiral, hence the default value of 1 is used.

Steps To Reproduce

  1. Create client and server applications
  2. Create server app in two clusters.
  3. Ensure service entry for the target service has two endpoints, one for each region/cluster.
  4. Start traffic from client pod using fortio - fortio load -qps 100 -t 0 <.MESH endpoint>
  5. Inject fault on the client, by blackholing the target service IP (which is local to the client) - ip route add blackhole X.X.X.X
  6. Check envoy configuration, it will show the (only) endpoint as unhealthy.
  7. Traffic will get diverted to the healthier region.

Repeat the above steps with a slight modification

  1. Update SE so that the local endpoint also points to the load balancer CNAME and not the service address.
  2. Make it healthy, so that traffic goes to the local endpoint.
  3. Inject failure, but this time for all the 3 IPs corresponding to the load balancer of the local cluster.
  4. Check envoy configuration, it will show only one out of three endpoints as unhealthy
  5. Traffic gets diverted to healthier region, but client observes a spike of 5xx errors, which remains steady.

Expected behavior
Outlier detection should work the same way irrespective of where the target service is wrt to the client.

@nirvanagit nirvanagit added the bug Something isn't working label Apr 16, 2022
@aattuluri aattuluri self-assigned this Apr 20, 2022
@aattuluri aattuluri added this to the v1.3 milestone Apr 20, 2022
@aattuluri
Copy link
Contributor

Fixed with #207

kpharasi pushed a commit that referenced this issue Dec 19, 2024
* Temporary jenkins file to deploy only in qal

* Only Build in new jenkins - testing

* boilerplate code for traffic config crd

* envoy filter creation code

* WIP - refactored to separate Traffic Controller and Remote controllers

* adding code for decoupled logic

* removed commented code, added functionality for remote worker manager

* changing workerMgrMap from normal map to sync.Map to handle concurrent access

* test checkin

* Bootstrap changes

* add dependency, rollout & service controllers

* add dependency, rollout & service controllers

* service entry not needed for admiral-unification

* implemented decideEnvoyFilter logic

* passing stopCh to processItem to propogate os.exit signal

* added new function to get allkeys of a cache map and improved envoy filter creation

* Branch name correction in jenkins

* Branch name correction in jenkins (#183)

* change to us west2

* Ratelimiting filter logic (#184)

* Branch name correction in jenkins

* change to us west2

* Change default traffic config namespace

* Ratelimiting filter logic (#185)

* Branch name correction in jenkins

* change to us west2

* Change default traffic config namespace

* do not proceed with vs geberation in case of rollout

* Ratelimiting filter logic (#186)

* Branch name correction in jenkins

* change to us west2

* Change default traffic config namespace

* do not proceed with vs geberation in case of rollout

* adding debug statements to identify GetKey() cache panic

* reverting back the envoyfilter namespace to admiral-sync

* Handle remote cluster unavailability (#191)

* Ratelimiting filter logic (#192)

* Handle remote cluster unavailability

* remove unwanted space before api version

* adding new fields to traffic config crd

* completed envoy filter operations

* updated error handling for processItem function

* fixing crd not registered issue

* added workload labels i.e. deployment/rollout labels to envoyfilter

* updating traffic config CRD to add timeout for edgeservice spec and handling nil cache panic

* Ignore updates in passive mode (#197)

* Merge heliograph changes (#198)

* Ignore updates in passive mode

* merge heliograph changes(temporary)

* Ratelimiting filter logic (#199)

* Ignore updates in passive mode

* merge heliograph changes(temporary)

* Update heliograph from admiral-deployment, not from jenkins

* Temporary hack to override heliograph tag (#200)

* Ignore updates in passive mode

* merge heliograph changes(temporary)

* Update heliograph from admiral-deployment, not from jenkins

* temporary hack to override heliograph tag

* undo heliograph changes in jenkins

* undo heliograph changes in jenkins (#201)

* undo heliograph changes in jenkins (#201)

* change envoy filter namespace & move vs creation to a different file

* change envoy filter namespace & move vs creation to a different file (#203)

* undo heliograph changes in jenkins

* change envoy filter namespace & move vs creation to a different file

* fixing context

* vs deletion

* logging improvement, deletion (#205)

* undo heliograph changes in jenkins

* change envoy filter namespace & move vs creation to a different file

* vs deletion

* asseralias to be in lowercase

* debugging labels issue (#207)

* no need of retries in vs

* Deploy label debugging (#209)

* debugging labels issue

* handling map update

* adding code changes to correct envoy filters and logging struct (#210)

* Envofilter formatting (#211)

* adding code changes to correct envoy filters and logging struct

* updating service now image

* Envofilter formatting (#212)

* adding code changes to correct envoy filters and logging struct

* updating service now image

* envoyfilter namespace changed from admiral-sync to istio-system

* changing envoy filter namespace from admiral-sync to istio-system (#213)

* Ns change (#214)

* changing envoy filter namespace from admiral-sync to istio-system

* fixing envoyfilter issue for name and struct

* vs env delete fix

* handling traffic config isDisbaled annotation, also handling update and delete (#217)

* cname identity map pupulation fix

* position gervice entry selector only for admiral

* altering traffic config persona code flow in service entry

* altering traffic config persona code flow in service entry (#221)

* making strings as const in common for resue

* Unification test (#222)

* altering traffic config persona code flow in service entry

* making strings as const in common for resue

* adding adaptive concurrency filter

* altering traffic config persona code flow in service entry

* remove cluster from map upon deletion

* debug traces

* review comments

* null check for virtual service

* worker manager pointer corrected

* add traffic config object for all clusters the asset is in

* envoy filter env crud fix

* admiral-unification in qal east

* handling empty workloadenvselector lable case (#228)

* Delete registry.go_backup

Not needed

* Delete types.go_backup

* Delete virtualservice.go_backup

* Delete envoyfilter-operations.go_backup

* merge conflict for pull req 229

Co-authored-by: Rajendra Gosavi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants