Connectivity Test: Add latency measurement #2094

Merged (1 commit) on Nov 17, 2023

Contributor

@darox darox commented Nov 7, 2023

This PR adds netperf-based latency tests that can be triggered by running:

cilium connectivity test --perf --perf-latency --perf-samples 5
ℹ️  Monitor aggregation detected, will skip some flow validation steps
⚠️  Each zone only has a single node - could impact the performance test results
⌛ [arn:aws:eks:eu-central-1:679388779924:cluster/test-test-dario] Waiting for deployment cilium-test/perf-client to become ready...
⌛ [arn:aws:eks:eu-central-1:679388779924:cluster/test-test-dario] Waiting for deployment cilium-test/perf-server to become ready...
⌛ [arn:aws:eks:eu-central-1:679388779924:cluster/test-test-dario] Waiting for deployment cilium-test/perf-client-other-node to become ready...
⌛ [arn:aws:eks:eu-central-1:679388779924:cluster/test-test-dario] Waiting for CiliumEndpoint for pod cilium-test/perf-client-567c9764b5-ghqt9 to appear...
⌛ [arn:aws:eks:eu-central-1:679388779924:cluster/test-test-dario] Waiting for CiliumEndpoint for pod cilium-test/perf-client-other-node-f98d8784-dmsgx to appear...
⌛ [arn:aws:eks:eu-central-1:679388779924:cluster/test-test-dario] Waiting for CiliumEndpoint for pod cilium-test/perf-server-bb67f66b5-vzxsk to appear...
🔭 Enabling Hubble telescope...
⚠️  Unable to contact Hubble Relay, disabling Hubble telescope and flow validation: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:4245: connect: connection refused"
ℹ️  Expose Relay locally with:
   cilium hubble enable
   cilium hubble port-forward&
ℹ️  Cilium version: 1.13.6
🏃 Running tests...
[=] Test [network-perf]
..

🔥 Latency Test Summary: 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
📋 Scenario        | Pod                                                | Test            | Num Samples     | Duration        | Min             | Mean            | Max             | P50             | P90             | P99            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
📋 pod-net         | perf-client-567c9764b5-ghqt9                       | TCP_RR          | 5               | 10s             | 22.80        μs | 29.64        μs | 4088.20      μs | 29.00        μs | 30.00        μs | 41.80        μs
📋 pod-net         | perf-client-other-node-f98d8784-dmsgx              | TCP_RR          | 5               | 10s             | 478.40       μs | 501.04       μs | 6432.00      μs | 496.60       μs | 509.00       μs | 553.20       μs
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

✅ All 1 tests (2 actions) successful, 0 tests skipped, 0 scenarios skipped.
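
The review discussion below indicates that the summary values are averaged over the `--perf-samples` runs. A minimal sketch of such an aggregation, assuming a hypothetical `LatencySample` type and `AverageSamples` helper (neither name is taken from the PR) and made-up input values:

```go
package main

import "fmt"

// LatencySample holds the latency statistics reported by one test run,
// in microseconds. The type and its fields are hypothetical and only
// mirror the columns of the summary table.
type LatencySample struct {
	Min, Mean, Max, P50, P90, P99 float64
}

// AverageSamples returns the field-wise arithmetic mean of the samples,
// or the zero value when no samples are given.
func AverageSamples(samples []LatencySample) LatencySample {
	var avg LatencySample
	if len(samples) == 0 {
		return avg
	}
	for _, s := range samples {
		avg.Min += s.Min
		avg.Mean += s.Mean
		avg.Max += s.Max
		avg.P50 += s.P50
		avg.P90 += s.P90
		avg.P99 += s.P99
	}
	n := float64(len(samples))
	avg.Min /= n
	avg.Mean /= n
	avg.Max /= n
	avg.P50 /= n
	avg.P90 /= n
	avg.P99 /= n
	return avg
}

func main() {
	// Made-up values, not taken from the run above.
	samples := []LatencySample{
		{Min: 20, Mean: 30, Max: 100, P50: 29, P90: 31, P99: 40},
		{Min: 24, Mean: 32, Max: 200, P50: 31, P90: 33, P99: 44},
	}
	avg := AverageSamples(samples)
	fmt.Printf("min=%.2f mean=%.2f max=%.2f p50=%.2f p90=%.2f p99=%.2f\n",
		avg.Min, avg.Mean, avg.Max, avg.P50, avg.P90, avg.P99)
}
```

Averaging whole-microsecond values over five samples would be consistent with the fractional values in the table, but that is an inference, not something the output confirms.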

Contributor

@learnitall learnitall left a comment

One suggestion here is to add the ability to see percentile results in addition to the average. Having the 50th, 90th, and 99th percentiles can be really helpful when trying to understand the average.
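
To illustrate why percentiles help when reading an average, here is a small sketch that computes both for a set of made-up latencies with a single slow outlier. `Percentile` uses the nearest-rank method; both helper names are assumptions for this example and are not part of cilium-cli or netperf.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Percentile returns the p-th percentile (0 < p <= 100) of values using
// the nearest-rank method. It returns 0 for an empty slice and does not
// modify the input.
func Percentile(values []float64, p float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// Mean returns the arithmetic mean of values, or 0 for an empty slice.
func Mean(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	var sum float64
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

func main() {
	// Made-up round-trip latencies in microseconds: nine fast
	// transactions and one slow outlier.
	latencies := []float64{28, 29, 29, 30, 29, 28, 30, 29, 29, 4000}
	fmt.Printf("mean=%.1f p50=%.0f p90=%.0f p99=%.0f\n",
		Mean(latencies),
		Percentile(latencies, 50),
		Percentile(latencies, 90),
		Percentile(latencies, 99))
}
```

With this input the single outlier pulls the mean far above the P50 and P90 values, and only P99 reveals where the extra latency comes from.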

@darox darox marked this pull request as ready for review November 8, 2023 11:09
Contributor Author

darox commented Nov 8, 2023

> One suggestion here is to add the ability to see percentile results in addition to the average. Having the 50th, 90th, and 99th percentiles can be really helpful when trying to understand the average.

Implemented

@darox darox requested a review from learnitall November 8, 2023 11:34
Contributor

@learnitall learnitall left a comment

Great! Thank you for this. I have one nit, but otherwise LGTM.

Also, just a note, averaging percentiles can sometimes get you into trouble (see https://www.circonus.com/2018/11/the-problem-with-percentiles-aggregation-brings-aggravation/). However, I think that for this use case it's totally valid. The percentiles you are averaging together are samples from the same test, so the distributions should be the same.
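
A rough sketch of that caveat, using made-up numbers: when the samples share a distribution, the average of the per-sample percentiles matches the percentile of the pooled measurements, and when they differ the two can diverge sharply. The helper names `percentile` and `averagedVsPooled` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (1..100) of values using the
// nearest-rank method, or 0 for an empty slice.
func percentile(values []float64, p int) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer ceil(p*n/100)
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// averagedVsPooled returns the mean of the per-sample p-th percentiles
// and the p-th percentile of all measurements pooled together.
func averagedVsPooled(samples [][]float64, p int) (averaged, pooled float64) {
	var all []float64
	for _, s := range samples {
		averaged += percentile(s, p)
		all = append(all, s...)
	}
	if len(samples) > 0 {
		averaged /= float64(len(samples))
	}
	return averaged, percentile(all, p)
}

func main() {
	// Two samples drawn from the same (made-up) distribution.
	similar := [][]float64{{10, 11, 12, 13, 50}, {10, 11, 12, 13, 50}}
	// Two samples with different distributions: only one has an outlier.
	skewed := [][]float64{{10, 10, 10, 10, 10}, {10, 10, 10, 10, 500}}

	a, p := averagedVsPooled(similar, 90)
	fmt.Printf("similar: averaged P90=%.0f pooled P90=%.0f\n", a, p)
	a, p = averagedVsPooled(skewed, 90)
	fmt.Printf("skewed:  averaged P90=%.0f pooled P90=%.0f\n", a, p)
}
```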

Review thread on connectivity/tests/perfpod.go (outdated, resolved)
Contributor

@derailed derailed left a comment

@darox Nice work!

Two review threads on connectivity/tests/perfpod.go (outdated, resolved)
Contributor Author

darox commented Nov 9, 2023

> @darox Nice work!

Thank you. I implemented all of your proposals.

Member

@christarazi christarazi left a comment

LGTM. It might make sense at some point to consider collapsing all of the --perf-* flags into one --perf flag that accepts enums such as --perf=latency or --perf=net. The former runs latency tests and the latter only runs the base "network" tests. We could have --perf=all to run all perf-related tests. This would reduce the need to keep adding more and more flags. Not required for this PR though.
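
A minimal sketch of what such an enum-valued flag could look like. This is hypothetical and not part of the PR: the `PerfMode` type, its constants, and `parsePerfMode` are invented for illustration, and the standard library `flag` package is used to keep the example self-contained (cilium-cli builds on cobra/pflag, whose `Value` interface additionally requires a `Type()` method).

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// PerfMode is a hypothetical enum for a single --perf flag.
type PerfMode string

const (
	PerfNet     PerfMode = "net"
	PerfLatency PerfMode = "latency"
	PerfAll     PerfMode = "all"
)

// String implements flag.Value.
func (m *PerfMode) String() string {
	if m == nil {
		return ""
	}
	return string(*m)
}

// Set implements flag.Value and rejects unknown modes.
func (m *PerfMode) Set(v string) error {
	switch PerfMode(v) {
	case PerfNet, PerfLatency, PerfAll:
		*m = PerfMode(v)
		return nil
	}
	return fmt.Errorf("invalid perf mode %q (want net, latency, or all)", v)
}

// RunsNet reports whether the base network tests are selected.
func (m PerfMode) RunsNet() bool { return m == PerfNet || m == PerfAll }

// RunsLatency reports whether the latency tests are selected.
func (m PerfMode) RunsLatency() bool { return m == PerfLatency || m == PerfAll }

// parsePerfMode parses args with a throwaway FlagSet so that the
// example does not touch the process-wide flag state.
func parsePerfMode(args []string) (PerfMode, error) {
	var mode PerfMode
	fs := flag.NewFlagSet("connectivity-test", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	fs.Var(&mode, "perf", "perf tests to run: net, latency, or all")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return mode, nil
}

func main() {
	mode, err := parsePerfMode([]string{"--perf=latency"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("mode=%s net=%t latency=%t\n", mode, mode.RunsNet(), mode.RunsLatency())
}
```

Validating in `Set` means an unknown value is rejected at parse time, so adding a new perf test would only require a new constant rather than a new flag.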

Contributor Author

darox commented Nov 15, 2023

> LGTM. It might make sense at some point to consider collapsing all of the --perf-* flags into one --perf flag that accepts enums such as --perf=latency or --perf=net. The former runs latency tests and the latter only runs the base "network" tests. We could have --perf=all to run all perf-related tests. This would reduce the need to keep adding more and more flags. Not required for this PR though.

Thank you for the feedback. Absolutely, this is something I want to work on after this PR is merged.

Member

@tklauser tklauser left a comment

LGTM. One non-blocking nit regarding summary headers for latency tests vs. existing performance tests. Can be addressed as a follow-up. Thanks.

Two review threads on connectivity/check/context.go (one outdated, both resolved)
Contributor Author

darox commented Nov 16, 2023

> LGTM. One non-blocking nit regarding summary headers for latency tests vs. existing performance tests. Can be addressed as a follow-up. Thanks.

Awesome catch. I didn't see that.

Add netperf based latency tests that can be triggered by running `cilium
connectivity test --perf --perf-latency --perf-samples 5`

Signed-off-by: Dario Mader <[email protected]>
@tklauser tklauser merged commit d8f3d93 into cilium:main Nov 17, 2023
17 of 19 checks passed