
github: Use Canonical runners for scheduled system tests #469

Merged: 8 commits on Dec 19, 2024

Conversation


@roosterfish roosterfish commented Nov 8, 2024

This PR splits the rather complex matrix system test into three groups.
The system test code is moved into a repo local action which is leveraged by each of the groups.
See an example of the restructured tests workflow here.

In addition, the "instances" suite is now also executed on the Canonical runners when the workflow is triggered on a schedule.

@roosterfish roosterfish force-pushed the self_hosted_runners branch 2 times, most recently from 8af0edc to 4f38ed2 on November 8, 2024 11:07
@roosterfish roosterfish marked this pull request as ready for review November 8, 2024 13:58
@roosterfish
Contributor Author

@masnax I did some tests regarding this error.

Unfortunately, the timeout in the MicroCeph GetConfig client func is only 5s.
Based on your suggestion in the meeting earlier, can it be that right after forming the MicroCloud, the proxy is waiting for something in the cluster to settle before it can forward the request to MicroCeph's local unix socket?

I have bootstrapped a single node MicroCloud and fired requests to /1.0/services/microceph in parallel.
Right around the time the MicroCloud is bootstrapped I saw a delay in the responses, which could prove that something is going on there.
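A rough sketch of that parallel probing; the socket path is an assumption based on typical snap layouts, not taken from this PR:

```shell
# Sketch only: fire repeated requests against the local MicroCloud unix socket
# and print per-request latency; run while bootstrapping in another shell.
# The socket path below is an assumption for illustration.
SOCKET=/var/snap/microcloud/common/state/control.socket
for i in $(seq 1 5); do
  start=$(date +%s%N)
  curl -s --unix-socket "$SOCKET" http://localhost/1.0/services/microceph >/dev/null || true
  end=$(date +%s%N)
  echo "request $i: $(( (end - start) / 1000000 )) ms"
done
```

A spike in the printed latency around the bootstrap window would point at the settling behaviour discussed here.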


masnax commented Nov 8, 2024

I have bootstrapped a single node MicroCloud and fired requests to /1.0/services/microceph in parallel. Right around the time the MicroCloud is bootstrapped I saw a delay in the responses, which could prove that something is going on there.

Well this current failure is happening long before MicroCloud is bootstrapped, as it happens right after system discovery and before asking any setup questions. In the bootstrap case, the only delay would be related to refreshing the truststore and waiting for the lock, but even that wouldn't happen on a single-node request as it all goes through the unix socket which skips truststore verification.

When bootstrapping, the listeners also restart, so that could be the delay you're seeing locally. But again that wouldn't affect the test failure since it's not during bootstrap.

can it be that right after forming the MicroCloud, the proxy is waiting for something in the cluster to settle before it can forward the request to MicroCeph's local unix socket?

This is the whole local proxy block in MicroCloud, so it's definitely not waiting for anything here.

Since it's a network request, there is the additional overhead of authHandlerMTLS pulling the truststore.

@roosterfish roosterfish force-pushed the self_hosted_runners branch 2 times, most recently from 7c6d5ff to ae01100 on November 11, 2024 10:37
@roosterfish
Contributor Author

Well this current failure is happening long before MicroCloud is bootstrapped, as it happens right after system discovery and before asking any setup questions

Hm, it looks like we can fix it by waiting for the microceph cluster bootstrap to settle and only continuing once it's done.
I have added another commit that adds a wrapper function we can use throughout the test suite to wait until microceph status reports that the single node cluster services are present. Waiting for this condition looks to be enough.
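A minimal sketch of such a wrapper; the helper name and the expected `microceph status` output are assumptions for illustration, not the actual code added in this PR:

```shell
# Hypothetical polling helper: retry a command up to N times with 1s sleeps;
# return non-zero if the condition never became true.
wait_for_condition() {
  local retries="$1"
  shift
  for ((i = 0; i < retries; i++)); do
    if "$@"; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Assumed usage: block until microceph status lists the expected services.
# wait_for_condition 60 sh -c "microceph status | grep -qF 'Services: mds, mgr, mon'"
```

Polling a user-facing status command is a pragmatic stopgap until MicroCeph exposes a proper readiness signal, as discussed below.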


masnax commented Nov 12, 2024

Well this current failure is happening long before MicroCloud is bootstrapped, as it happens right after system discovery and before asking any setup questions

Hm, it looks like we can fix it by waiting for the microceph cluster bootstrap to settle and only continuing once it's done. I have added another commit that adds a wrapper function we can use throughout the test suite to wait until microceph status reports that the single node cluster services are present. Waiting for this condition looks to be enough.

Is this something that can be checked over the API? Perhaps microceph cluster bootstrap shouldn't return until all its services are finished, or there could be a ready API that we can check against before sending requests.

@roosterfish roosterfish force-pushed the self_hosted_runners branch 11 times, most recently from d8a0c34 to 56c1b8a on November 25, 2024 13:23
@roosterfish
Contributor Author

Is this something that can be checked over the API? Perhaps microceph cluster bootstrap shouldn't return until all its services are finished

Issue logged in canonical/microceph#473.

@roosterfish roosterfish marked this pull request as draft November 26, 2024 09:24
@roosterfish roosterfish force-pushed the self_hosted_runners branch 2 times, most recently from 62ec1f0 to 222713b on November 26, 2024 11:02
@roosterfish
Contributor Author

@MggMuggins have you ever seen this one https://github.com/canonical/microcloud/actions/runs/12031420307/job/33588930317?pr=469#step:17:1490?

It might be that this is caused by the runners being slower. We already saw various other scenarios caused by the "slowness". Maybe you have an idea.

@MggMuggins
Contributor

IIRC the trust store is held in memory in MicroCluster and synchronized (fanotify?); I think the recovery process doesn't use the in-memory synchronization because it's expected that it won't be accessing the trust store at the same time as any other thread. I wonder if this was an incorrect assumption. I'll try and take a look this afternoon.

@MggMuggins
Contributor

I have downloaded the logs for the failed PR run and will plan to look into this next pulse.

@roosterfish roosterfish force-pushed the self_hosted_runners branch 7 times, most recently from cf0fcd4 to 1e31a6e on December 13, 2024 15:27
@roosterfish roosterfish changed the title github: Use Canonical runners for system tests github: Use Canonical runners for scheduled system tests Dec 13, 2024
@roosterfish roosterfish marked this pull request as ready for review December 13, 2024 15:34
@roosterfish roosterfish force-pushed the self_hosted_runners branch 2 times, most recently from fb5cff6 to 7be4bc3 on December 17, 2024 10:46
api/session_join.go — review comment (outdated, resolved)
masnax previously approved these changes Dec 18, 2024
@tomponline
Member

needs a rebase please

Ensure MicroCeph is fully started after bootstrapping to prevent running into timeouts
if the test suite is too fast.

Signed-off-by: Julian Pelizäus <[email protected]>
When trying to install the LXD snap while it already exists, the exit code isn't >0,
so the refresh will never happen.

Signed-off-by: Julian Pelizäus <[email protected]>
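The fix described in this commit message might look roughly like the following; the function name and channel handling are assumptions for illustration, not the actual change:

```shell
# Hypothetical sketch: `snap install` exits 0 when the snap is already
# installed, so an `install || refresh` pattern never refreshes.
# Check for presence explicitly and refresh in that case instead.
install_or_refresh_snap() {
  local name="$1" channel="$2"
  if snap list "$name" >/dev/null 2>&1; then
    snap refresh "$name" --channel="$channel"
  else
    snap install "$name" --channel="$channel"
  fi
}
```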
This allows reducing the time between sending the password and starting to listen for join intents.
On slow test runners we saw errors because the initiator hadn't yet started listening for join
intents while potential joiners were already dialing in with the passphrase.

Signed-off-by: Julian Pelizäus <[email protected]>
This allows having a much cleaner matrix definition grouped by
core, upgrade and Canonical-specific system tests.

Signed-off-by: Julian Pelizäus <[email protected]>
This allows reusing the system test steps for all groups:
core, upgrade and Canonical-specific tests.

Signed-off-by: Julian Pelizäus <[email protected]>
@roosterfish
Contributor Author

@tomponline @masnax rebased.

@roosterfish roosterfish merged commit 48f53f7 into canonical:main Dec 19, 2024
24 checks passed
@roosterfish roosterfish deleted the self_hosted_runners branch December 19, 2024 10:46