
CFP: Cilium CLI connectivity tests speedup. #15

Merged

Conversation

viktor-kurchenko (Contributor)

No description provided.

@christarazi (Member) left a comment


Proposal sounds good to me. As mentioned offline by Andre, it would be good to see a POC of how this would work. Namely, the aspect I have concerns about is that many of the connectivity tests configure the cluster in a specific way that may conflict with other test runs (such as policies, etc.). It would be good to understand how you propose to approach that problem.

@viktor-kurchenko (Contributor, Author)

> Proposal sounds good to me. As mentioned offline by Andre, it would be good to see a POC of how this would work. Namely, the aspect I have concerns about is that many of the connectivity tests configure the cluster in a specific way that may conflict with other test runs (such as policies, etc.). It would be good to understand how you propose to approach that problem.

Thank you, @christarazi!

Yeah, it should be challenging, but I want to try it.
Do you know what else, besides cluster-wide network policies, can interfere across namespaces?

@christarazi (Member)

In general, any policy, whether it's a CNP or a CCNP, can interfere, especially if the workloads it selects are also selected by other policies. It seems like one approach could be to completely separate workloads via namespaces for each "group" of connectivity tests. This way the policies applied only have an effect within the namespaces they belong to, so namespaces would be the separation barrier that allows parallelism.
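As a rough illustration of that idea (the group names, test names, and namespace scheme below are hypothetical, not the current cilium-cli layout), each group of tests could be pinned to its own namespace so that namespaced policies from one group cannot select another group's workloads:

```go
package main

import "fmt"

// Hypothetical grouping: these group and test names are illustrative
// only and do not correspond to the actual cilium-cli test suite.
var groups = map[string][]string{
	"no-policies":   {"pod-to-pod", "pod-to-service"},
	"client-egress": {"client-egress-to-echo"},
}

// namespaceFor derives a dedicated namespace per test group, so a
// namespaced policy applied by one group only selects workloads
// deployed into that group's namespace.
func namespaceFor(prefix, group string) string {
	return fmt.Sprintf("%s-%s", prefix, group)
}

func main() {
	for group, tests := range groups {
		fmt.Printf("%s runs in namespace %s: %v\n", group, namespaceFor("cilium-test", group), tests)
	}
}
```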

@fgiloux left a comment


It would be useful to people like me if the document included a high-level list of the strategies that have already been explored and why they fall short.


* Group connectivity tests into independent sets that can be run concurrently.
* Run each independent test set concurrently (all together or in batches).
* Collect test results and display them periodically.

It looks like a different subject to me and it may make sense to have the CFP focused on test parallelisation.

Contributor Author


> It looks like a different subject to me and it may make sense to have the CFP focused on test parallelisation.

Maybe, but if some tests run in parallel, how will it look from the user's perspective?
The output might be unreadable, wouldn't it?


I am not sure I follow. I have looked at a simple example: the curl test should be free of side effects; I am guessing the same holds for flow validation, and for metrics validation I don't know.
Currently, tests run sequentially in the order they were registered. Each test runs in a separate goroutine, but the loop reads from a channel, populated at the completion of a test case, before going on to the next one:
https://github.com/cilium/cilium-cli/blob/main/connectivity/check/context.go#L402-L433
If we move the channel read outside of the loop, we can still collect the results and present them in an ordered way. Is that what you mean? Or am I not getting your point?
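A minimal Go sketch of that pattern, assuming a simplified runner (the TestResult type and runAll function are illustrative stand-ins, not the real connectivity/check types): all tests are started first, the completion channel is drained afterwards, and results are printed back in registration order.

```go
package main

import "fmt"

// TestResult is an illustrative stand-in for whatever the real runner
// sends on its completion channel; the actual types live in
// connectivity/check.
type TestResult struct {
	Name string
	Err  error
}

// runAll starts every test before draining the completion channel,
// instead of reading the channel inside the loop after each test.
func runAll(tests []string, run func(string) error) []TestResult {
	ch := make(chan TestResult, len(tests))
	for _, name := range tests {
		go func(name string) {
			ch <- TestResult{Name: name, Err: run(name)}
		}(name)
	}

	// Collect completions in whatever order they arrive.
	byName := make(map[string]TestResult, len(tests))
	for range tests {
		r := <-ch
		byName[r.Name] = r
	}

	// Present results in registration order, even though execution
	// was concurrent.
	ordered := make([]TestResult, 0, len(tests))
	for _, name := range tests {
		ordered = append(ordered, byName[name])
	}
	return ordered
}

func main() {
	results := runAll([]string{"client-to-client", "pod-to-service"}, func(name string) error {
		fmt.Println("running", name)
		return nil
	})
	for _, r := range results {
		fmt.Printf("%s -> %v\n", r.Name, r.Err)
	}
}
```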

Contributor Author


Yes, that is what I meant.

## Goals

* Group connectivity tests into independent sets that can be run concurrently.
* Run each independent test set concurrently (all together or in batches).

Grouping into test sets that can be run concurrently may be a difficult exercise:

* How do you know whether your test impacts a test from a different set?
* When adding a new test, you need to understand all the tests in the other sets to know whether yours can run in parallel with them.

Possible alternative approaches:

* Define a few configuration flavours, something like what is used in conformance e2e. Is this approach used consistently in other workflows, and are there areas for improvement?
* Test filtering depending on the configuration, which is done in the Cilium CLI.
* Group tests by code area. As part of CI it may be acceptable to run a subset of the test suite for a localised change.
* Mark tests that are destructive and can't be run in parallel, e.g. Cilium/cluster update, uninstall, failure simulation, etc. (sketched below).
* Do we have overlap between tests/workflows that we could reduce?

Non-destructive tests that share the same configuration flavour may then be run in parallel on the same cluster. Tests for code areas that have not been touched by a PR can be trimmed from the CI run.
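A possible shape for the destructive-test marking mentioned in the list above; the testCase type and its Destructive flag are hypothetical, not existing cilium-cli API:

```go
package main

import "fmt"

// testCase is hypothetical: the Destructive flag does not exist in
// cilium-cli today and only illustrates the proposed split.
type testCase struct {
	Name        string
	Destructive bool // e.g. upgrade, uninstall, failure simulation
}

// partition separates tests that may run in parallel from tests that
// must run alone, sequentially, after the parallel phase.
func partition(all []testCase) (parallel, sequential []testCase) {
	for _, t := range all {
		if t.Destructive {
			sequential = append(sequential, t)
		} else {
			parallel = append(parallel, t)
		}
	}
	return parallel, sequential
}

func main() {
	suite := []testCase{
		{Name: "pod-to-pod"},
		{Name: "pod-to-service"},
		{Name: "cilium-uninstall", Destructive: true},
	}
	p, s := partition(suite)
	fmt.Printf("parallel: %d, sequential: %d\n", len(p), len(s))
}
```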

@viktor-kurchenko (Contributor, Author)

> It would be useful to people like me if the document included a high-level list of the strategies that have already been explored and why they fall short.

Yeah, it would be useful for me as well.

@viktor-kurchenko (Contributor, Author)

I was thinking about the PoC plan and realized that the Cilium CLI can already be used to run multiple tests in parallel (at least for testing).
The --test-namespace and --test parameters were used to validate the idea.

I've selected 46 tests (from the EKS CNI conformance test workflow) and used the attached bash script to run them in parallel batches.
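The attached script itself isn't reproduced here, but this is roughly the shape of such a run, driven from Go instead of bash (the --test-namespace and --test flags are the ones mentioned above; the namespace prefix and test name patterns are placeholders):

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// runBatch starts one `cilium connectivity test` invocation per entry,
// each against its own namespace, and waits for all of them to finish.
// The --test-namespace and --test flags are the ones mentioned above;
// the namespace prefix and test patterns are placeholders.
func runBatch(batch []string) {
	var wg sync.WaitGroup
	for i, pattern := range batch {
		wg.Add(1)
		go func(i int, pattern string) {
			defer wg.Done()
			ns := fmt.Sprintf("cilium-test-%d", i)
			cmd := exec.Command("cilium", "connectivity", "test",
				"--test-namespace", ns, "--test", pattern)
			out, err := cmd.CombinedOutput()
			fmt.Printf("--- %s (namespace %s, err: %v) ---\n%s", pattern, ns, err, out)
		}(i, pattern)
	}
	wg.Wait()
}

func main() {
	// Illustrative batch; the PoC selected 46 tests from the EKS workflow.
	runBatch([]string{"no-policies", "client-egress", "to-entities-world"})
}
```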

You can find results in the table: https://docs.google.com/spreadsheets/d/1csmszEtlohqPpgMV8N_aJUI4yCoW8mPke46k-rG-Uec/edit?usp=sharing

Conclusions:

* Cilium CLI has no tests that use CiliumClusterwideNetworkPolicy yet.
* At least the 46 selected tests can be run in parallel with no interference!
* In practice, it won't be possible to run every test in its own namespace because of the number of pods and IP allocations that would require.
* Ideally, the CLI should create and verify all the required test namespaces/deployments only once, up front.

Further steps (order might be different):

1. Rename the --test-namespace parameter to --test-namespace-prefix.
2. Implement a new parameter, --test-parallel-runs, with a default value of 1.
3. Move the test namespace/deployment creation and verification logic ahead of the test run function.
4. Implement logic to group tests into batches of --test-parallel-runs size (see the sketch after this list).
5. Think about how to collect and display output (considering GH runners that might have different behavior than a local terminal).
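A minimal sketch of the batching logic from step 4; the batches helper and the way the batches would later be executed are illustrative assumptions, with n standing in for the proposed --test-parallel-runs value:

```go
package main

import "fmt"

// batches splits the registered test names into groups of at most n,
// where n stands in for the proposed --test-parallel-runs value; each
// group would then be run concurrently, one batch after another.
func batches(tests []string, n int) [][]string {
	if n < 1 {
		n = 1
	}
	var out [][]string
	for len(tests) > 0 {
		end := n
		if end > len(tests) {
			end = len(tests)
		}
		out = append(out, tests[:end])
		tests = tests[end:]
	}
	return out
}

func main() {
	fmt.Println(batches([]string{"a", "b", "c", "d", "e"}, 2))
	// Output: [[a b] [c d] [e]]
}
```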

Also, I was thinking about implementing this as a new CLI command (e.g. cilium tests ..., maybe even hidden).
That way, for some time, we can have both the old and new approaches with shared test sources and will be able to test and compare them without any impact.

CC: @aanm @christarazi @fgiloux @brlbil @michi-covalent.

@christarazi (Member)

That sounds good to me.

Just one thing on:

> Cilium CLI has no tests that use CiliumClusterwideNetworkPolicy yet.

I imagine #16 will get merged soon and we'll very likely have tests with CCNP, so it is something we'll need to consider in this proposal.

@brlbil commented Feb 3, 2024

Sounds great!

> Think about how to collect and display output (considering GH runners that might have different behavior than a local terminal).

One thing might be tricky: printing test logs correctly, given that the tests would run concurrently. Also, JUnit collection should be considered.
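One common way to keep concurrent logs readable, not necessarily what the CLI will end up doing: give each test its own buffer and flush it as a single block when the test finishes, so output from different tests never interleaves line by line. A minimal sketch with placeholder test names:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// Each test writes to its own buffer, which is flushed as one block
// when the test completes, so concurrent tests never interleave their
// individual log lines. Test names here are placeholders.
func main() {
	var mu sync.Mutex // serializes flushes to stdout
	var wg sync.WaitGroup
	for _, name := range []string{"test-a", "test-b"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			var buf bytes.Buffer
			fmt.Fprintf(&buf, "[%s] step 1\n", name)
			fmt.Fprintf(&buf, "[%s] step 2\n", name)
			mu.Lock()
			defer mu.Unlock()
			fmt.Print(buf.String())
		}(name)
	}
	wg.Wait()
}
```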

@viktor-kurchenko (Contributor, Author)

> One thing might be tricky: printing test logs correctly, given that the tests would run concurrently. Also, JUnit collection should be considered.

Thanks!
I've already tested this: cilium/images/conn-tests-concurrent-output.gif

@xmulligan (Member)

@viktor-kurchenko we just added statuses for CFPs. Where do you think this one currently falls? https://github.com/cilium/design-cfps#status

@viktor-kurchenko (Contributor, Author)

> @viktor-kurchenko we just added statuses for CFPs. Where do you think this one currently falls? https://github.com/cilium/design-cfps#status

@xmulligan I think the status should be: Released cilium/cilium-cli 0.16

@michi-covalent merged commit b869f50 into cilium:main on Aug 10, 2024
@xmulligan mentioned this pull request on Aug 14, 2024