
listener: performance degradation when exact balance used with original dst #15146

Closed
caitong93 opened this issue Feb 23, 2021 · 7 comments · Fixed by #15842

Comments

caitong93 commented Feb 23, 2021

Title: listener: performance degradation when exact balance used with original dst

Description:

We use Envoy as a sidecar: all outbound traffic is first redirected to 127.0.0.1:15001 (Envoy) by iptables and then forwarded to different listeners by original dst. When exact balance is enabled (on all listeners), we found that the connection balance got worse. Tested with 32 downstream connections and 2 workers, the connection distribution between the two handlers is always 1:31 (sometimes 2:30), as observed via downstream_cx_active.
When a new connection is received, it is first handled by ExactConnectionBalancer, which increases numConnections() of the selected handler by one. The connection is then forwarded to a new listener in ConnectionHandlerImpl::ActiveTcpSocket::newConnection(), which decreases that gauge immediately.
If connections arrive quickly, there is a high chance that the first handler is always selected, since numConnections() of all handlers is zero.
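
For context, here is a minimal sketch of the kind of two-listener setup described above. It is only a sketch, not the reporter's actual config: the listener and cluster names, the tcp_proxy chains, and port 9080 are illustrative, and fields such as use_original_dst, bind_to_port, and connection_balance_config are written from memory and worth checking against your Envoy version.

```yaml
static_resources:
  listeners:
  # Catch-all listener; iptables redirects all outbound traffic here first.
  - name: virtual_outbound
    address:
      socket_address: { address: 0.0.0.0, port_value: 15001 }
    # Hand connections whose restored destination matches another listener
    # off to that listener.
    use_original_dst: true
    # Restores the original destination address on the accepted socket.
    listener_filters:
    - name: envoy.filters.listener.original_dst
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.listener.original_dst.v3.OriginalDst
    # Exact balance on the catch-all listener is where the imbalance shows up:
    # the connection is counted against the chosen worker, then handed off
    # almost immediately, so numConnections() is back to zero for every worker
    # by the time the next connection is balanced.
    connection_balance_config:
      exact_balance: {}
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: passthrough
          cluster: passthrough
  # Per-port listener that ends up owning the connection after the handoff.
  - name: outbound_9080
    address:
      socket_address: { address: 0.0.0.0, port_value: 9080 }
    bind_to_port: false  # reached only via the handoff from 15001
    connection_balance_config:
      exact_balance: {}
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: outbound_9080
          cluster: original_dst_9080
  clusters:
  # Both clusters simply forward to the restored original destination.
  - name: passthrough
    connect_timeout: 5s
    type: ORIGINAL_DST
    lb_policy: CLUSTER_PROVIDED
  - name: original_dst_9080
    connect_timeout: 5s
    type: ORIGINAL_DST
    lb_policy: CLUSTER_PROVIDED
```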

caitong93 added the bug and triage (Issue requires triage) labels on Feb 23, 2021

lambdai commented Feb 23, 2021

IMHO the rebalance should be applied to the second listener. In most cases, the first listener doesn't own any connections.


caitong93 commented Feb 23, 2021

> IMHO the rebalance should be applied to the second listener. In most cases, the first listener doesn't own any connections.

I also expect the rebalance to be applied to the second listener. But (correct me if I am wrong) it seems the rebalance can only happen at the first listener, see the comments here. If exact_balance is only enabled for the second listener, I guess it won't work.


lambdai commented Feb 23, 2021

@caitong93 It won't work under the current code.

This is a reasonable scenario in which to apply the rebalance at the second listener. @mattklein123 I can change this if you agree.

@mattklein123

Yeah, I agree we probably need to handle this case specially: for forwarded connections, do the rebalance at that point and not initially.

mattklein123 added the area/listener, area/perf, and help wanted (Needs help!) labels and removed the bug and triage (Issue requires triage) labels on Feb 23, 2021
lambdai self-assigned this on Feb 23, 2021

lambdai commented Mar 9, 2021

Plan to fix this along with #15126


boeboe commented Apr 16, 2021

@lambdai

Will the fix mean that users have to configure exact_balance on the first catch_all listener 0.0.0.0:15001, or do users have to configure it on the next listener 0.0.0.0:9080 (in case the upstream service/cluster is at 9080)?

Related to istio/istio#18152, where @hobbytp tried to apply the exact_balance on the second listener.

From an end-user perspective, somebody digging into this setting is doing so for performance tuning in high-throughput, low-latency environments, and I would assume they expect to have to tune this setting only once, instead of once for every target cluster handled by a separate second-in-line 0.0.0.0:<svc_port> listener. Or do you foresee that users should be able to configure this per second-in-line-listener/upstream-service pair?


lambdai commented Apr 20, 2021

> Will the fix mean that users have to configure exact_balance on the first catch_all listener 0.0.0.0:15001, or do users have to configure it on the next listener 0.0.0.0:9080 (in case the upstream service/cluster is at 9080)?

Sorry for the late reply.
For Istio, where the 15001 listener usually doesn't hold connections, 15001 should use the no-op balancer (the goal is to reduce latency), and the large number of "9080" sub-listeners should use the exact balancer (trading cross-thread migration for balancing).

Yeah, you can also use the exact balancer for the 9080 listener and no balancer for the 9070 listener.
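
Concretely, a hedged sketch of the split described above (listener names and port 9080 are illustrative; filter chains and clusters are omitted, see the fuller sketch in the issue description; field names should be checked against your Envoy version): with the fix in #15842, where the target listener's balancer field decides whether to rebalance, the catch-all 15001 listener stays on the default no-op balancer and only the per-port sub-listeners enable exact_balance.

```yaml
# Catch-all listener: no connection_balance_config, i.e. the default no-op
# balancer, since it hands connections off almost immediately.
- name: virtual_outbound
  address:
    socket_address: { address: 0.0.0.0, port_value: 15001 }
  use_original_dst: true
  listener_filters:
  - name: envoy.filters.listener.original_dst
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.listener.original_dst.v3.OriginalDst

# Per-port sub-listener: exact balance here, trading a cross-thread handoff
# for an even connection distribution across workers.
- name: outbound_9080
  address:
    socket_address: { address: 0.0.0.0, port_value: 9080 }
  bind_to_port: false
  connection_balance_config:
    exact_balance: {}
```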

htuch pushed a commit that referenced this issue Apr 29, 2021
…15842)

If listener1 redirects the connection to listener2, the balancer field in listener2 decides whether to rebalance.
Previously we relied on rebalancing at listener1; however, that rebalance is weak because listener1 is likely not to own any connections, so the rebalance is a no-op.

Risk Level: MID. Rebalancing may introduce latency. Users need to clear the balancer field of listener2 to recover the original behavior.

Fix #15146 #16113

Signed-off-by: Yuchen Dai <[email protected]>