
[PERF] Noticeable perf regression over time #13412

Open
sebastienros opened this issue Oct 6, 2020 · 11 comments
Labels: area/perf, help wanted, investigate

Comments

@sebastienros

As part of the ASP.NET team, I am continuously measuring the performance of Envoy and other proxies.
For the past few months we have seen Envoy's performance slowly degrading.
Overall it is still doing very well compared to nginx or haproxy, but you might have missed these numbers, so I am sharing them with you.

The following chart shows RPS and mean latency over time for different payload sizes (10 B, 100 B, 1 KB).

[Chart: RPS and mean latency over time for 10 B, 100 B, and 1 KB payloads]

This is based on the envoyproxy/envoy-dev:latest docker image, run roughly once a day using that tag.
The full configuration can be found here: https://github.com/aspnet/Benchmarks/tree/master/docker/envoy
The benchmark uses bombardier with 256 connections across 3 machines (load generator, proxy, downstream), each with a dedicated 40 Gb/s NIC on a private LAN behind a dedicated switch. Other proxies such as nginx and haproxy, and even the baseline (no proxy), are stable over this time frame.
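For context, the load-generation side of a run like this boils down to something along these lines. This is a rough sketch: the endpoint, port, and duration are placeholders, not the actual crank scenario parameters.

```sh
# Pull the nightly image that the charts track.
docker pull envoyproxy/envoy-dev:latest

# 256 concurrent HTTP/1.1 connections against the proxy, similar to the charted runs.
# (-c = connections, -d = duration; endpoint and duration are illustrative.)
bombardier -c 256 -d 60s http://proxy-host:8080/
```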

sebastienros added the triage label Oct 6, 2020
@snowp (Contributor) commented Oct 6, 2020

I don't think we have anything set up to help us understand what might be going on, so I'm just tagging some perf-interested folks to chime in.

@jmarantz @antoniovicente @oschaaf

snowp added the area/perf and investigate labels and removed the triage label Oct 6, 2020
@mattklein123 (Member)

Thanks for raising this. Very interesting. Note that producing this type of information is actually the goal of #11266 (cc @abaptiste), so it would be great to make sure we aren't duplicating effort there.

As for the slow regression you are seeing, is it possible to get a flame graph of a run from the beginning and one from the end of the range?

@jmarantz (Contributor) commented Oct 6, 2020

@htuch and @mum4k as well. This is very interesting. We also run continuous benchmarks internally and have noticed some hiccups here and there, but not a slow degradation.

There are so many variables in how to run a load test. I am not familiar with bombardier. There's also a benchmark tool in the Envoy project called Nighthawk, which has all kinds of settings for

  • open-loop vs closed loop
  • h2 vs h1
  • concurrency

and other settings. It would be helpful to know exactly how to repro these results.
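For reference, a Nighthawk invocation exercising those knobs might look roughly like the following. The flags are standard nighthawk_client options, but the endpoint, rate, and duration are placeholders rather than a description of the actual benchmark.

```sh
# Closed-loop (Nighthawk's default) HTTP/1.1 run: 256 connections for a fixed duration.
# --rps acts as a cap; set it high enough that connections are the limiting factor.
nighthawk_client --connections 256 --duration 60 --rps 10000 http://proxy-host:8080/

# Open-loop HTTP/2 variant: send at a fixed rate regardless of responses.
nighthawk_client --open-loop --h2 --rps 10000 --duration 60 http://proxy-host:8080/
```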

Thanks!

@sebastienros (Author)

I can do runs on a version from the beginning of the range and on the current one, and take a trace (perf/LTTng). It would help if someone could give me a Docker Hub tag that represents the state as of early June. With that I could confirm whether it's a real issue or just hardware slippage (but again, for nginx and haproxy there isn't any).

If someone is interested in how we (the dotnet team) do our perf testing, I'm happy to explain. Our infra is in https://github.com/dotnet/crank; it's very easy to set up, and then I just need one command line to start the scenario and get results from all the machines.

mattklein123 added the help wanted label Oct 6, 2020
@jmarantz (Contributor) commented Oct 6, 2020

There's quite a bit of material there. Can you summarize your load test in terms of HTTP parameters? E.g. protocol, open vs. closed loop, amount of concurrency, and any other parameters you can think of.

@sebastienros (Author)

Configuration is here: https://github.com/aspnet/Benchmarks/blob/master/docker/envoy/envoy.yaml

HTTP/1.1 between the load generator and the proxy. I think the config doesn't force what is used between the proxy and the cluster (the service supports h2 and HTTP/1.1).
256 concurrent connections.
Requests with only a small query string are forwarded, and the responses have different sizes depending on the request.
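(As a side note on that proxy-to-cluster leg: unless the cluster configuration explicitly enables HTTP/2, Envoy talks HTTP/1.1 to the upstream by default. A quick way to check a local copy of the linked envoy.yaml; the field names below are the standard Envoy cluster options, not a claim about this particular file.)

```sh
# If neither field appears in the cluster definition, the proxy-to-upstream leg
# is plain HTTP/1.1 by default, consistent with "doesn't force" above.
grep -nE 'http2_protocol_options|typed_extension_protocol_options' envoy.yaml
```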

But I already said that in previous comments. I don't know what open vs. closed means, though.

@jmarantz (Contributor) commented Oct 6, 2020

"open" means that the load-generator sends traffic at a constant rate, independent of whether the system-under-test can handle that rate. We expect that with sufficient load, some amount of the requests will fail, and throughput will remain constant in terms of request-count (success+failure)

"closed" means that the each concurrent stream in the load-generator sends a request and waits for the response, before sending another one. In this case we don't expect requests to fail generally, but throughput will slow down to wait for the system to respond.

@sebastienros (Author)

I picked this build from 4 months ago, before the huge cliff you can see on the charts around 6/8, and I can repro the regression: https://hub.docker.com/layers/envoyproxy/envoy-dev/9461f6bad1044133d81d69eba44f20f93771aa2e/images/sha256-0c14d10b416bfbc02e15b3d15ceeee58ac619f1b0ffe9c01d9bf22e1857fa34c?context=explore
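For anyone else who wants to compare, both builds can be pulled by tag. The SHA tag comes from the link above; the port and config mount below are illustrative.

```sh
# Build from ~June 2020 (commit tag from the Docker Hub link above).
docker pull envoyproxy/envoy-dev:9461f6bad1044133d81d69eba44f20f93771aa2e
# Current nightly for comparison.
docker pull envoyproxy/envoy-dev:latest

# Run either one with the same benchmark config, then repeat the load test.
docker run -d --rm -p 8080:8080 \
  -v "$PWD/envoy.yaml:/etc/envoy/envoy.yaml" \
  envoyproxy/envoy-dev:9461f6bad1044133d81d69eba44f20f93771aa2e
```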

Before vs. after:

| Metric | Before | After |
| --- | --- | --- |
| Mean latency (µs) | 2,276 | 2,586 |
| Max latency (µs) | 86,032 | 94,641 |
| Requests/sec | 112,395 | 98,941 |
| Requests/sec (max) | 118,835 | 106,922 |

The numbers don't match the charts because these runs were not done on the same machines (same topology, though).

I assume at that point it's up to you to decide if and how to investigate this regression.

@mattklein123 (Member)

If there is any way for you to gather a perf trace / flame graph for the 2 runs, that would be appreciated, since you already know how to run this on your testbed. If not, we can try to take a look at these 2 SHAs and your config using Nighthawk. cc @oschaaf @jmarantz
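If it helps, a typical capture on the proxy host looks roughly like this. The PID lookup, sampling rate, and output names are placeholders; the FlameGraph scripts are Brendan Gregg's tooling, not part of Envoy.

```sh
# Sample the running envoy process for 60 s with call stacks.
perf record -F 99 -g -p "$(pgrep -f envoy | head -1)" -- sleep 60

# Fold the stacks and render a flame graph.
perf script > out.stacks
git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-perf.pl out.stacks | ./FlameGraph/flamegraph.pl > envoy.svg
```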

@sebastienros (Author)

I took a trace with perf, but the results don't show any user code. I am not used to tracing native code; maybe it requires some symbol source. I can still share the traces, 22 MB each. Let me know where I can drop them.
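A hedged guess about the missing user code: if the binary in that image is stripped and/or built without frame pointers, frame-pointer unwinding loses most Envoy frames. DWARF-based unwinding plus a symbols check usually clarifies which it is; the binary path below assumes the layout of the official Envoy images.

```sh
# DWARF unwinding often recovers user-space frames that plain -g misses.
perf record --call-graph dwarf -F 99 -p "$(pgrep -f envoy | head -1)" -- sleep 60
perf report --stdio | head -50

# Check whether the envoy binary still carries symbols (run inside the container);
# if it reports "stripped", a debug build of the same SHA is needed to resolve names.
file /usr/local/bin/envoy
```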

@antoniovicente (Contributor)

> We also run continuous benchmarks internally and have noticed some hiccups here and there, but not a slow degradation.

A 10% linear decrease in performance over a quarter due to slow degradation seems alarming.

Also, are there known regressions that correspond to some of the large drops seen in the graphs above? I'm most curious about the big drop around the third week of June that we seem to have partially recovered from by July 1st.


jmarantz changed the title from "[PERF] Noticeable perf regression overtime" to "[PERF] Noticeable perf regression over time" on Feb 14, 2022