
[PERF] Noticeable perf regression over time #13412

Open
sebastienros opened this issue Oct 6, 2020 · 11 comments
Labels: area/perf, help wanted, investigate

Comments

@sebastienros

As part of the ASP.NET team, I am continuously measuring the performance of Envoy and other proxies.
For the past few months we have seen Envoy's performance slowly degrading.
Overall it is still doing very well compared to nginx or haproxy, but you might have missed these numbers, so I am sharing them with you.

The following chart shows RPS and mean latency over time for different payload sizes (10 B, 100 B, 1 KB).

[Chart: RPS and mean latency over time for 10 B, 100 B, and 1 KB payloads]

This is based on the envoyproxy/envoy-dev:latest docker image, run roughly once a day using that tag.
The full configuration can be found here: https://github.com/aspnet/Benchmarks/tree/master/docker/envoy
The benchmark uses bombardier with 256 connections across 3 machines (load generator, proxy, downstream), each with a dedicated 40 Gb/s NIC on a private LAN behind a dedicated switch. Other proxies such as nginx and haproxy, and even the baseline (no proxy), are stable over this time frame.
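For context, the load-generation side of a run like this boils down to something along these lines. This is a rough sketch: the endpoint, port, and duration are placeholders, not the actual crank scenario parameters.

```sh
# Pull the nightly image that the charts track.
docker pull envoyproxy/envoy-dev:latest

# 256 concurrent HTTP/1.1 connections against the proxy, similar to the charted runs.
# (-c = connections, -d = duration; endpoint and duration are illustrative.)
bombardier -c 256 -d 60s http://proxy-host:8080/
```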

sebastienros added the triage label Oct 6, 2020
@snowp (Contributor) commented Oct 6, 2020

I don't think we have anything set up to help us understand what might be going on, so I'm just tagging some perf-interested folks to chime in.

@jmarantz @antoniovicente @oschaaf

snowp added the area/perf and investigate labels and removed the triage label Oct 6, 2020
@mattklein123 (Member)

Thanks for raising this. Very interesting. Note that producing this type of information is actually the goal of #11266 (cc @abaptiste), so it would be great to make sure we aren't duplicating effort there.

As for the slow regression you are seeing, is it possible to get a flame graph of a run from the beginning and one from the end of the range?

@jmarantz (Contributor) commented Oct 6, 2020

@htuch and @mum4k as well. This is very interesting. We also run continuous benchmarks internally and have noticed some hiccups here and there, but not a slow degradation.

There are so many variables in how to run a load test. I am not familiar with bombardier. There's also a benchmark tool in the Envoy project called Nighthawk, which has all kinds of settings for

  • open-loop vs closed loop
  • h2 vs h1
  • concurrency

and other settings. It would be helpful to know exactly how to repro these results.
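For reference, a Nighthawk invocation exercising those knobs might look roughly like the following. The flags are standard nighthawk_client options, but the endpoint, rate, and duration are placeholders rather than a description of the actual benchmark.

```sh
# Closed-loop (Nighthawk's default) HTTP/1.1 run: 256 connections for a fixed duration.
# --rps acts as a cap; set it high enough that connections are the limiting factor.
nighthawk_client --connections 256 --duration 60 --rps 10000 http://proxy-host:8080/

# Open-loop HTTP/2 variant: send at a fixed rate regardless of responses.
nighthawk_client --open-loop --h2 --rps 10000 --duration 60 http://proxy-host:8080/
```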

Thanks!

@sebastienros (Author)

I can do runs on a version from the beginning of the range and on the current one, and take a trace (perf/LTTng). It would help if someone could give me a Docker Hub tag that represents the state as of early June. With that I could confirm whether it's a real issue or just hardware slippage (but again, for nginx and haproxy there isn't any).

If someone is interested in how we (the dotnet team) do our perf testing, I'm happy to explain. Our infra is in https://github.com/dotnet/crank; it's very easy to set up, and then I just need one command line to start the scenario and get results from all the machines.

mattklein123 added the help wanted label Oct 6, 2020
@jmarantz (Contributor) commented Oct 6, 2020

There's quite a bit of material there. Can you summarize your load test in terms of HTTP parameters? E.g. protocol, open vs. closed loop, amount of concurrency, and any other parameters you can think of.

@sebastienros (Author)

Configuration is here: https://github.com/aspnet/Benchmarks/blob/master/docker/envoy/envoy.yaml

HTTP/1.1 between the load generator and the proxy. I think the config doesn't force what is used between the proxy and the cluster (the service supports h2 and HTTP/1.1).
256 concurrent connections.
Requests with only a small query string are forwarded, and the responses have different sizes depending on the request.
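(As a side note on that proxy-to-cluster leg: unless the cluster configuration explicitly enables HTTP/2, Envoy talks HTTP/1.1 to the upstream by default. A quick way to check a local copy of the linked envoy.yaml; the field names below are the standard Envoy cluster options, not a claim about this particular file.)

```sh
# If neither field appears in the cluster definition, the proxy-to-upstream leg
# is plain HTTP/1.1 by default, consistent with "doesn't force" above.
grep -nE 'http2_protocol_options|typed_extension_protocol_options' envoy.yaml
```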

But I already said that in previous comments. I don't know what open vs. closed means, though.

@jmarantz (Contributor) commented Oct 6, 2020

"open" means that the load-generator sends traffic at a constant rate, independent of whether the system-under-test can handle that rate. We expect that with sufficient load, some amount of the requests will fail, and throughput will remain constant in terms of request-count (success+failure)

"closed" means that the each concurrent stream in the load-generator sends a request and waits for the response, before sending another one. In this case we don't expect requests to fail generally, but throughput will slow down to wait for the system to respond.

@sebastienros (Author)

I picked this build from 4 months ago, before the huge cliff you can see on the charts around 6/8, and I can repro the regression: https://hub.docker.com/layers/envoyproxy/envoy-dev/9461f6bad1044133d81d69eba44f20f93771aa2e/images/sha256-0c14d10b416bfbc02e15b3d15ceeee58ac619f1b0ffe9c01d9bf22e1857fa34c?context=explore
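For anyone else who wants to compare, both builds can be pulled by tag. The SHA tag comes from the link above; the port and config mount below are illustrative.

```sh
# Build from ~June 2020 (commit tag from the Docker Hub link above).
docker pull envoyproxy/envoy-dev:9461f6bad1044133d81d69eba44f20f93771aa2e
# Current nightly for comparison.
docker pull envoyproxy/envoy-dev:latest

# Run either one with the same benchmark config, then repeat the load test.
docker run -d --rm -p 8080:8080 \
  -v "$PWD/envoy.yaml:/etc/envoy/envoy.yaml" \
  envoyproxy/envoy-dev:9461f6bad1044133d81d69eba44f20f93771aa2e
```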

Before vs. after:

| Metric | Before | After |
| --- | --- | --- |
| Mean latency (µs) | 2,276 | 2,586 |
| Max latency (µs) | 86,032 | 94,641 |
| Requests/sec | 112,395 | 98,941 |
| Requests/sec (max) | 118,835 | 106,922 |

The numbers don't match the charts because these runs were not done on the same machines (same topology, though).

I assume at that point it's up to you to decide if and how to investigate this regression.

@mattklein123 (Member)

If there is any way for you to gather a perf trace / flame graph for the 2 runs, that would be appreciated, since you already know how to run this on your testbed. If not, we can try to take a look at these 2 SHAs and your config using Nighthawk. cc @oschaaf @jmarantz
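If it helps, a typical capture on the proxy host looks roughly like this. The PID lookup, sampling rate, and output names are placeholders; the FlameGraph scripts are Brendan Gregg's tooling, not part of Envoy.

```sh
# Sample the running envoy process for 60 s with call stacks.
perf record -F 99 -g -p "$(pgrep -f envoy | head -1)" -- sleep 60

# Fold the stacks and render a flame graph.
perf script > out.stacks
git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-perf.pl out.stacks | ./FlameGraph/flamegraph.pl > envoy.svg
```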

@sebastienros (Author)

I took a trace with perf, but the results don't show any user code. I am not used to tracing native code; maybe it requires some symbol source. I can still share the traces, 22 MB each. Let me know where I can drop them.
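A hedged guess about the missing user code: if the binary in that image is stripped and/or built without frame pointers, frame-pointer unwinding loses most Envoy frames. DWARF-based unwinding plus a symbols check usually clarifies which it is; the binary path below assumes the layout of the official Envoy images.

```sh
# DWARF unwinding often recovers user-space frames that plain -g misses.
perf record --call-graph dwarf -F 99 -p "$(pgrep -f envoy | head -1)" -- sleep 60
perf report --stdio | head -50

# Check whether the envoy binary still carries symbols (run inside the container);
# if it reports "stripped", a debug build of the same SHA is needed to resolve names.
file /usr/local/bin/envoy
```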

@antoniovicente (Contributor)

> We also run continuous benchmarks internally and have noticed some hiccups here and there, but not a slow degradation.

A 10% linear decrease in performance over a quarter due to slow degradation seems alarming.

Also, are there known regressions that correspond to some of the large drops seen in the graphs above? I'm most curious about the big drop around the third week of June that we seem to have partially recovered from by July 1st.


jmarantz changed the title from "[PERF] Noticeable perf regression overtime" to "[PERF] Noticeable perf regression over time" on Feb 14, 2022