CFP: Cilium CLI connectivity tests speedup. #15

Merged
52 changes: 52 additions & 0 deletions cilium/CFP-1189-connectivity-tests-speedup.md
# CFP-1189: Cilium CLI connectivity tests speedup

**SIG: SIG-USER**

**Begin Design Discussion:** 2024-01-19

**Cilium Release:** 1.15

**Authors:** Viktor Kurchenko <[email protected]>

## Summary

This CFP describes a new approach to running Cilium connectivity tests:
group the tests into small independent sets and run those sets in parallel.

## Motivation

Currently, connectivity tests can take over an hour to run in CI (depending
on many factors), and the test case count is constantly increasing.
To make CI pipelines faster and cheaper, we can consider parallelizing the
connectivity tests.

## Goals

* Group connectivity tests into independent sets that can be run concurrently.
* Run each independent test set concurrently (all together or in batches).
**Reviewer comment:**
Grouping tests into sets that can be run concurrently may be a difficult exercise:

* How do you know whether your test impacts a test from a different set?
* When adding a new test, you need to understand all the tests from the other sets to know whether your test can run in parallel with them.

Possible alternative approaches:

* Define a few configuration flavours, something like what is used in conformance e2e. Is this approach used consistently in other workflows, and are there areas for improvement?
* Filter tests depending on the configuration, which is done in the Cilium CLI.
* Group tests by code area. As part of CI, it may be acceptable to run a subset of the test suite for a localised change.
* Mark tests that are destructive and cannot be run in parallel, e.g. Cilium or cluster update, uninstall, failure simulation, etc.
* Do we have overlap between tests/workflows that we could reduce?

Non-destructive tests that share the same configuration flavour may then be run in parallel on the same cluster. Tests for code areas that have not been touched by a PR can be trimmed from the CI run.

* Collect test results and display them periodically.
**Reviewer comment:**

It looks like a different subject to me and it may make sense to have the CFP focused on test parallelisation.

**Author reply:**

> It looks like a different subject to me and it may make sense to have the CFP focused on test parallelisation.

Maybe, but if some tests are run in parallel, how will it look from the user's
perspective? The output might be unreadable, wouldn't it?

**Reviewer reply:**

I am not sure I follow. I have looked at a simple example: the curl test should be free of side effects. For flow validation I am guessing so, and for metrics validation I don't know.
Currently, tests are run sequentially in the order they were registered. Each runs in a separate goroutine, but the loop reads from a channel populated at the completion of a test case before moving on to the next one:
https://github.com/cilium/cilium-cli/blob/main/connectivity/check/context.go#L402-L433
If we move the channel read outside of the loop, we can still collect the results and present them in an ordered way. Is that what you mean? Or am I missing your point?
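The pattern described above — launch every test concurrently, then read the result channels outside the launch loop so output stays in registration order — can be sketched as follows. This is a minimal illustration, not the actual `context.go` code; the `result` type and test body are hypothetical placeholders.

```go
package main

import "fmt"

// result is a hypothetical stand-in for a test-case outcome.
type result struct {
	name string
	ok   bool
}

// collectResults starts every test in its own goroutine, then reads each
// result channel *after* the launch loop, so collecting one result no
// longer gates the start of the next test, while output still comes back
// in registration order.
func collectResults(tests []string) []result {
	chans := make([]chan result, len(tests))
	for i, name := range tests {
		chans[i] = make(chan result, 1)
		go func(i int, name string) {
			// Placeholder for the real connectivity test body.
			chans[i] <- result{name: name, ok: true}
		}(i, name)
	}
	out := make([]result, 0, len(tests))
	for i := range tests {
		out = append(out, <-chans[i]) // ordered read, after all tests started
	}
	return out
}

func main() {
	for _, r := range collectResults([]string{"curl", "flows", "metrics"}) {
		fmt.Printf("%s: ok=%v\n", r.name, r.ok)
	}
}
```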

**Author reply:**

Yes, that is what I meant.


## Proposal

### Walkthrough

1. Cilium CLI groups all connectivity tests into independent sets that
do not interfere with each other. A separate component can be implemented
to keep this responsibility (e.g.: `TestSetFactory`).
2. Each produced test set can be run concurrently. In some cases, it might
not be acceptable to run many test sets concurrently (e.g.: due to limited
resources in a cluster). The CLI should provide an option
(e.g.: `--test-batch-size`) that allows grouping test sets into fixed-size
batches and running the batches one after another.
3. Each test set should provision its namespace and all the required resources.
4. A separate component (e.g.: `TestMonitor`) can be implemented and run in a
dedicated goroutine to collect each test set execution results and display
them periodically.

### Concurrent output example

![Conn tests concurrent output](./images/conn-tests-concurrent-output.gif)

However, this example might not render properly in GitHub Actions due to the
in-place output updates. In pipelines, the CLI can cache each test result and
display the results sequentially, in a predefined order.
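The sequential-display fallback for CI could buffer out-of-order results and flush them strictly in the predefined order. A minimal sketch, assuming a hypothetical `setResult` type carrying each set's position in the display order (this is not the proposed `TestMonitor` implementation, just an illustration of the idea):

```go
package main

import "fmt"

// setResult is a hypothetical rendered outcome of one test set.
type setResult struct {
	index int    // position in the predefined display order
	text  string // rendered output of the whole test set
}

// orderResults buffers results that arrive out of order and returns them
// strictly in the predefined order, so CI logs stay readable even though
// the sets ran concurrently.
func orderResults(results <-chan setResult, total int) []string {
	pending := make(map[int]string)
	ordered := make([]string, 0, total)
	next := 0
	for r := range results {
		pending[r.index] = r.text
		// Flush every result that is now contiguous with what was printed.
		for next < total {
			text, ok := pending[next]
			if !ok {
				break
			}
			ordered = append(ordered, text)
			delete(pending, next)
			next++
		}
	}
	return ordered
}

func main() {
	results := make(chan setResult, 3)
	// Simulate test sets finishing out of order.
	results <- setResult{2, "set-2: ok"}
	results <- setResult{0, "set-0: ok"}
	results <- setResult{1, "set-1: ok"}
	close(results)
	for _, line := range orderResults(results, 3) {
		fmt.Println(line)
	}
}
```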
Binary file added cilium/images/conn-tests-concurrent-output.gif