proposal: Add tracking latencies and flamegraphs in CI #11266

Open
oschaaf opened this issue May 19, 2020 · 23 comments
@oschaaf
Member

oschaaf commented May 19, 2020

Filing this issue to get a feel for interest in this.

Goal:

Add a means to track and persist latency numbers and perf visualizations like flamegraphs in CI. This would let us see how we're doing over time, and have perf information at hand when a latency regression is observed.

Description:

Nighthawk uses a lightweight python-based framework for integration testing.
This framework serves as a basis for writing NH's own benchmarks.

With a small amount of modification, this could be extended to:

  • make consumption very low-friction in foreign code bases (like Envoy)
  • allow it to inject proxies between the client and the test server (for example, Envoy at a certain SHA)
  • scavenge tests from external locations

More details, and some concrete scripts that give an idea of what this would look like, can be found here.
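
A rough sketch of what such a low-friction benchmark test could look like, in the style of NH's pytest-based integration framework (the fixture and method names below are illustrative, not the framework's actual API):

```python
# Hypothetical sketch only: the fixture and its methods are illustrative.
def test_http_small_request_latency(http_test_server_fixture):
    # Drive a fixed-rate load against the fixture's test server and
    # parse the client's JSON output.
    parsed_json, _ = http_test_server_fixture.runNighthawkClient([
        "--rps", "100",
        "--duration", "30",
        http_test_server_fixture.getTestServerRootUri(),
    ])
    counters = http_test_server_fixture.getNighthawkCounterMapFromJson(parsed_json)
    # A real benchmark would persist the latency histogram here and
    # assert on percentiles; this just checks that the run succeeded.
    assert counters["benchmark.http_2xx"] > 0
```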

/cc @danzh2010 @htuch

@htuch
Member

htuch commented May 19, 2020

I think even redline QPS would be an amazing contribution here, everything else proposed seems like gravy. +1000.

@mattklein123
Member

See #961. I desperately want this. This will require a lot of thought in terms of how to structure repeatable tests, but yes, we really need to do this.

@antoniovicente
Contributor

I don't know if this exists yet, but it would also be good to have a few relatively small benchmark scenarios that can be used for A/B comparison of performance after changes to data plane components, especially in cases where we expect some performance impact. Tracking performance data for the small benchmarks over time in a calibrated environment would be great.
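
A minimal sketch of what an A/B comparison over such benchmark output might look like, assuming each run writes a JSON file with latency percentiles (the result layout and file names here are hypothetical):

```python
# Assumes each run wrote a JSON file shaped like
# {"latency_percentiles": {"p50": 0.00021, "p90": 0.0004, "p99": 0.0011}}
# (values in seconds); that layout is hypothetical.
import json


def load_percentiles(path):
    with open(path) as f:
        return json.load(f)["latency_percentiles"]


def find_regressions(baseline_path, candidate_path, allowed_regression=0.05):
    """Return the percentiles where the candidate run is more than
    allowed_regression (5% by default) slower than the baseline."""
    baseline = load_percentiles(baseline_path)
    candidate = load_percentiles(candidate_path)
    return {
        percentile: (candidate[percentile] - value) / value
        for percentile, value in baseline.items()
        if (candidate[percentile] - value) / value > allowed_regression
    }


if __name__ == "__main__":
    regressions = find_regressions("baseline.json", "candidate.json")
    for percentile, delta in sorted(regressions.items()):
        print(f"{percentile}: +{delta:.1%} over baseline")
```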

@oschaaf
Member Author

oschaaf commented May 25, 2020

I have started exploring this. Tracking progress here.
@antoniovicente It might be good to take a look at test_benchmarks.py to see if that allows enough flexibility. The idea is that consumers can specify their own locations where the suite should scavenge tests, which in turn can supply custom fixtures with custom Envoy configurations.
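
For illustration, a consumer-owned test scavenged from a custom location might look roughly like this; the fixture that injects Envoy (at a pinned SHA) between the Nighthawk client and the test server, and its helper methods, are hypothetical:

```python
# Illustrative only: the fixture and its helpers are hypothetical, and a
# real version would come with a consumer-provided Envoy configuration.
def test_latency_through_envoy(inject_envoy_proxy_fixture):
    fixture = inject_envoy_proxy_fixture
    parsed_json, _ = fixture.runNighthawkClient([
        "--rps", "500",
        "--duration", "60",
        fixture.getEnvoyProxyRootUri(),  # hypothetical helper
    ])
    assert parsed_json is not None
```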

@mattklein123
Member

cc @marcomagdy who is also interested in helping with this effort.

@snowp
Contributor

snowp commented Jun 4, 2020

We'd also be interested in this, so let me know how I can help move this forward.

@oschaaf
Member Author

oschaaf commented Jun 9, 2020

Update: a good part of this is in review over at envoyproxy/nighthawk#337.

Nighthawk is eating its own dogfood via a new CI task, and is dropping simple visualizations per test (example).

CPU profiles are also collected, but flamegraphing needs more work: to get sensible output, we need to account for the binaries and libraries involved in generating the profile.
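
For reference, a sketch of the kind of symbol-aware pipeline this implies, assuming a Linux `perf` profile and Brendan Gregg's FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl) on the PATH; the paths used are illustrative:

```python
import subprocess


def flamegraph(perf_data, symfs_dir, output_svg):
    # --symfs points perf at the binaries and libraries that actually
    # produced the profile, which is the step that needs care to get
    # sensible output.
    script = subprocess.run(
        ["perf", "script", "-i", perf_data, f"--symfs={symfs_dir}"],
        check=True, capture_output=True, text=True)
    folded = subprocess.run(
        ["stackcollapse-perf.pl"], input=script.stdout,
        check=True, capture_output=True, text=True)
    svg = subprocess.run(
        ["flamegraph.pl"], input=folded.stdout,
        check=True, capture_output=True, text=True)
    with open(output_svg, "w") as f:
        f.write(svg.stdout)


if __name__ == "__main__":
    flamegraph("perf.data", "/path/to/run/binaries", "profile.svg")
```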

@antoniovicente
Contributor

Any updates?

@oschaaf
Member Author

oschaaf commented Aug 7, 2020

Well, I got sidetracked for a bit, but this has been happily test-driving in NH's own CI. So far so good.

For example see the .html files in the artefacts of a recent PR.
We could consider wiring up the current state in Envoy's CI as an MVP based on the docker-based flow (Nighthawk's CI runs with its locally produced binaries). This should be pretty doable, but I would appreciate help/guidance there; a rough sketch follows at the end of this comment.

Some important improvements that others have expressed interest in tackling are:

  • The current UI is limited to a directory listing of artefacts as offered by the CI env (CircleCI in NH).
  • There's no regression analysis / detection.

For more detailed status, see https://github.com/envoyproxy/nighthawk/tree/master/benchmarks#todos
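
And a very rough placeholder for the docker-based wiring mentioned above, running the suite against prebuilt images instead of locally produced binaries (the image tags, environment-variable names, and bazel target are placeholders, not the actual interface):

```python
import os
import subprocess

# All names below are placeholders for whatever the real wiring uses.
env = dict(
    os.environ,
    NH_DOCKER_IMAGE="envoyproxy/nighthawk-dev:latest",         # placeholder
    ENVOY_DOCKER_IMAGE_TO_TEST="envoyproxy/envoy:<some-sha>",  # placeholder
)
subprocess.run(["bazel", "test", "//benchmarks/..."], env=env, check=True)
```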

@abaptiste
Contributor

Hello Folks. We have a design doc for a framework that we'd like your comments on:

https://docs.google.com/document/d/14Iz8j--Mvb06QFB8RurtYlwmy657YbAVfqDr-jKgtaQ/edit#heading=h.grkfe6onmtgv

@htuch
Member

htuch commented Sep 8, 2020

@abaptiste thanks. My super high-level comment is that as a developer and performance engineer (user story), I'd like to be able to have control over the benchmark execution environment. So, any framework should be capable of running 100% locally. It's fine to make it also available as a SaaS via buckets or e-mail, but I think we're limiting applicability if those are the only options.

@mattklein123
Member

> @abaptiste thanks. My super high-level comment is that as a developer and performance engineer (user story), I'd like to be able to have control over the benchmark execution environment. So, any framework should be capable of running 100% locally. It's fine to make it also available as a SaaS via buckets or e-mail, but I think we're limiting applicability if those are the only options.

+1 I left a bunch of comments around this. I also want to make sure we have a clear post-MVP path for CI integration as IMO this is the thing we really want to unlock ASAP. Thank you for working on this!

@abaptiste
Contributor

Thank you for the comments. These are the major themes I've captured:

  • Define all JSON schemas using proto3 messages (this will be done as part of the MVP)
  • We need a better authentication mechanism
  • CI integration so that builds run nightly or upon master check-in
  • Long term storage of results so that we can chart the performance of prior builds
  • Ability to do performance runs complementing local development

If there are additional items I may have inadvertently missed or misunderstood, please let me know.

@mattklein123
Member

@abaptiste that list LGTM and also similar to our offline conversation. Thanks for working on this! This will be awesome.

@abaptiste
Contributor

I posted a separate doc based on the feedback from the initial review. Please feel free to take a look and comment.

@htuch
Member

htuch commented Sep 29, 2020

@abaptiste the new doc LGTM, tagging @oschaaf @mattklein123 @antoniovicente @mum4k @pamorgan @snowp for comments/sign-off.

@mattklein123
Member

I looked at the doc and at a high level it looks great to me. Very excited for this work!

@oschaaf
Member Author

oschaaf commented Sep 29, 2020

Looks good to me!

htuch pushed a commit to envoyproxy/envoy-perf that referenced this issue Nov 9, 2020
This is the initial 'official' commit for the Salvo tool. This aims to abstract the execution of nighthawk to benchmark a given envoy version.

See this issue for some background. The two design docs for this project are referenced here.

In this commit, salvo is placed into a separate directory of the envoy-perf repository, and is referenced from the main README.md.

Testing: unit tests included; addressed as many pylint3 issues as feasible.
[#Issue] envoyproxy/envoy#11266

Signed-off-by: Alvin Baptiste <[email protected]>
@gyohuangxin
Member

Any updates? We are interested in the integration; is there any help I can offer?

@mum4k
Contributor

mum4k commented Jan 12, 2022

Hi @gyohuangxin, we will gladly accept help. We expect to be able to staff this work in about 6 months, but I would gladly work with you in the meantime if you have the cycles. If you are able to help, it would be good to get in touch and discuss priorities and direction. Are you on the Envoy Slack by any chance?

@gyohuangxin
Member

gyohuangxin commented Jan 12, 2022

@mum4k Thank you! Yes, let's discuss on Slack.

@keithmattix
Contributor

What's the latest on this effort? This would be extremely beneficial.

@mum4k
Contributor

mum4k commented Aug 15, 2024

This effort has been de-staffed temporarily. If anyone wants to pick it up in the meantime, I will gladly transfer the latest state and/or provide guidance and code review as desired.
