[PERF] Noticeable perf regression over time #13412
I don't think we have anything set up to help us understand what might be going on, so I'm just tagging some perf-interested folks to chime in.
Thanks for raising this. Very interesting. Note that producing this type of information is actually the goal of #11266 (cc @abaptiste), so it would be great to make sure we aren't duplicating effort there. As for the slow regression you are seeing, is it possible to get a flamegraph of a run at the beginning and at the end of the range?
@htuch and @mum4k as well. This is very interesting. We also run continuous benchmarks internally and have noticed some hiccups here and there, but not a slow degradation. There are so many variables in how to run a load test. I am not familiar with bombardier. There's also a benchmark tool in the Envoy project called Nighthawk, which has all kinds of settings for […] and other parameters. It would be helpful to know exactly how to repro these results. Thanks!
I can do runs on a version from the beginning of the range and on the current one, and take a trace (perf/lttng). It would help if someone could give me a Docker Hub tag that represents the state as of early June. With that I could confirm whether it's a real issue or just hardware slippage (though again, for nginx and haproxy there isn't any). If anyone is interested in how we (the dotnet team) do our perf testing, I'm happy to explain. Our infra is in https://github.com/dotnet/crank, very easy to set up, and then I just need one command line to start the scenario and get results from all the machines.
There's quite a bit of stuff there. Can you summarize your load test in more detail in terms of HTTP parameters? E.g. protocol, open vs. closed, amount of concurrency, and any other params you can think of?
Configuration is here: https://github.com/aspnet/Benchmarks/blob/master/docker/envoy/envoy.yaml. It's HTTP/1.1 between the load generator and the proxy. I think the config doesn't force what is used between the proxy and the cluster (the service supports h2 and HTTP/1.1). But I already said that in previous comments. I don't know what open vs. closed means, though.
"open" means that the load-generator sends traffic at a constant rate, independent of whether the system-under-test can handle that rate. We expect that with sufficient load, some amount of the requests will fail, and throughput will remain constant in terms of request-count (success+failure) "closed" means that the each concurrent stream in the load-generator sends a request and waits for the response, before sending another one. In this case we don't expect requests to fail generally, but throughput will slow down to wait for the system to respond. |
I picked this build from 4 months ago, before the huge cliff you can see on the charts around 6/8, and I can repro the regression: https://hub.docker.com/layers/envoyproxy/envoy-dev/9461f6bad1044133d81d69eba44f20f93771aa2e/images/sha256-0c14d10b416bfbc02e15b3d15ceeee58ac619f1b0ffe9c01d9bf22e1857fa34c?context=explore

Before:
After:
The numbers don't match the charts because they were not run on the same machine (same topology, though). I assume at this point it's up to you to decide if and how to investigate this regression.
I took a trace with perf, but the results don't show any user code. I am not used to tracing native code, and maybe it requires some symbol source. I can still share the traces, 22 MB each. Let me know where I can drop them.
A 10% linear decrease in performance over a quarter due to slow degradation seems alarming. Also, are there known regressions that correspond to some of the large drops seen in the graphs above? I'm most curious about the big drop around the third week of June that we seem to have partially recovered from by July 1st.
As part of the ASP.NET team, I am continuously measuring the perf of envoy and other proxies.
For the past few months we can see that envoy's perf is slowly degrading.
Overall, compared to nginx or haproxy it's still doing very well, but you might have missed these numbers, so I am sharing them with you.
The following chart represents RPS and mean latency for different payload sizes over time (10 B, 100 B, 1 KB).
This is based on the envoyproxy/envoy-dev:latest Docker image, and it runs around once a day using this tag. The full configuration can be found here: https://github.com/aspnet/Benchmarks/tree/master/docker/envoy
The benchmark uses bombardier with 256 connections, across 3 different machines (load, proxy, downstream), with dedicated 40 Gb/s NICs on a private LAN and a dedicated switch. Other proxies like nginx and haproxy, and even the baseline (no proxy), are stable during this time frame.
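For illustration, a slow drift like the one described can be quantified by fitting a linear trend to the daily RPS series; the sketch below uses invented numbers, not the actual benchmark data:

```python
# Sketch: fit a linear trend to a daily RPS series to quantify slow drift.
# The data points are made up for illustration; real runs would supply one
# (day, rps) pair per benchmark execution.
from statistics import mean

runs = [
    (0, 112_000), (30, 110_500), (60, 108_200), (90, 104_900),  # hypothetical
]

xs = [x for x, _ in runs]
ys = [y for _, y in runs]
mx, my = mean(xs), mean(ys)

# Ordinary least-squares slope: change in RPS per day.
slope = sum((x - mx) * (y - my) for x, y in runs) / sum((x - mx) ** 2 for x in xs)

pct_per_quarter = slope * 90 / ys[0] * 100
print(f"trend: {slope:.1f} rps/day, ~{pct_per_quarter:.1f}% over 90 days")
```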