
github: Use Canonical runners for scheduled system tests #469

Merged: 8 commits on Dec 19, 2024

Conversation


@roosterfish roosterfish commented Nov 8, 2024

This PR splits the rather complex matrix system test into three groups.
The system test code is moved into a repo local action which is leveraged by each of the groups.
See an example of the restructured tests workflow here.

In addition, the "instances" suite is now also executed on the Canonical runners when the workflow is triggered on a schedule.

@roosterfish roosterfish force-pushed the self_hosted_runners branch 2 times, most recently from 8af0edc to 4f38ed2 on November 8, 2024 11:07
@roosterfish roosterfish marked this pull request as ready for review November 8, 2024 13:58
@roosterfish
Contributor Author

@masnax I did some tests regarding this error.

Unfortunately, the timeout in the MicroCeph GetConfig client func is only 5s.
Based on your suggestion in the meeting earlier, can it be that right after forming the MicroCloud, the proxy is waiting for something in the cluster to settle before it can forward the request to MicroCeph's local unix socket?

I have bootstrapped a single node MicroCloud and fired requests to /1.0/services/microceph in parallel.
Right around the time the MicroCloud is bootstrapped I saw a delay in the responses, which could prove that something is going on there.
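A rough sketch of that parallel probing; the socket path is an assumption based on typical snap layouts, not taken from this PR:

```shell
# Sketch only: fire repeated requests against the local MicroCloud unix socket
# and print per-request latency; run while bootstrapping in another shell.
# The socket path below is an assumption for illustration.
SOCKET=/var/snap/microcloud/common/state/control.socket
for i in $(seq 1 5); do
  start=$(date +%s%N)
  curl -s --unix-socket "$SOCKET" http://localhost/1.0/services/microceph >/dev/null || true
  end=$(date +%s%N)
  echo "request $i: $(( (end - start) / 1000000 )) ms"
done
```

A spike in the printed latency around the bootstrap window would point at the settling behaviour discussed here.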


masnax commented Nov 8, 2024

I have bootstrapped a single node MicroCloud and fired requests to /1.0/services/microceph in parallel. Right around the time the MicroCloud is bootstrapped I saw a delay in the responses, which could prove that something is going on there.

Well this current failure is happening long before MicroCloud is bootstrapped, as it happens right after system discovery and before asking any setup questions. In the bootstrap case, the only delay would be related to refreshing the truststore and waiting for the lock, but even that wouldn't happen on a single-node request as it all goes through the unix socket which skips truststore verification.

When bootstrapping, the listeners also restart, so that could be the delay you're seeing locally. But again that wouldn't affect the test failure since it's not during bootstrap.

can it be that right after forming the MicroCloud, the proxy is waiting for something in the cluster to settle before it can forward the request to MicroCeph's local unix socket?

This is the whole local proxy block in MicroCloud, so it's definitely not waiting for anything here.

Since it's a network request, there is the additional overhead of authHandlerMTLS pulling the truststore.

@roosterfish roosterfish force-pushed the self_hosted_runners branch 2 times, most recently from 7c6d5ff to ae01100 on November 11, 2024 10:37
@roosterfish
Contributor Author

Well this current failure is happening long before MicroCloud is bootstrapped, as it happens right after system discovery and before asking any setup questions

Hm, it looks like we can fix it by waiting for the microceph cluster bootstrap to settle and only continuing once it's done.
I have added another commit that adds a wrapper function we can use throughout the test suite to wait until microceph status reports that the single node cluster services are present. Waiting for this condition looks to be enough.
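A minimal sketch of such a wrapper; the helper name and the expected `microceph status` output are assumptions for illustration, not the actual code added in this PR:

```shell
# Hypothetical polling helper: retry a command up to N times with 1s sleeps;
# return non-zero if the condition never became true.
wait_for_condition() {
  local retries="$1"
  shift
  for ((i = 0; i < retries; i++)); do
    if "$@"; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Assumed usage: block until microceph status lists the expected services.
# wait_for_condition 60 sh -c "microceph status | grep -qF 'Services: mds, mgr, mon'"
```

Polling a user-facing status command is a pragmatic stopgap until MicroCeph exposes a proper readiness signal, as discussed below.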


masnax commented Nov 12, 2024

Well this current failure is happening long before MicroCloud is bootstrapped, as it happens right after system discovery and before asking any setup questions

Hm, it looks like we can fix it by waiting for the microceph cluster bootstrap to settle and only continuing once it's done. I have added another commit that adds a wrapper function we can use throughout the test suite to wait until microceph status reports that the single node cluster services are present. Waiting for this condition looks to be enough.

Is this something that can be checked over the API? Perhaps microceph cluster bootstrap shouldn't return until all its services are finished, or there could be a ready API that we can check against before sending requests.

@roosterfish roosterfish force-pushed the self_hosted_runners branch 11 times, most recently from d8a0c34 to 56c1b8a on November 25, 2024 13:23
@roosterfish
Contributor Author

Is this something that can be checked over the API? Perhaps microceph cluster bootstrap shouldn't return until all its services are finished

Issue logged in canonical/microceph#473.

@roosterfish roosterfish marked this pull request as draft November 26, 2024 09:24
@roosterfish roosterfish force-pushed the self_hosted_runners branch 2 times, most recently from 62ec1f0 to 222713b on November 26, 2024 11:02
@roosterfish
Contributor Author

@MggMuggins have you ever seen this one https://github.com/canonical/microcloud/actions/runs/12031420307/job/33588930317?pr=469#step:17:1490?

It might be that this is caused by the runners being slower. We already saw various other scenarios caused by the "slowness". Maybe you have an idea.

@MggMuggins
Contributor

IIRC the trust store is held in memory in MicroCluster and synchronized (fanotify?); I think the recovery process doesn't use the in-memory synchronization because it's expected that it won't be accessing the trust store at the same time as any other thread. I wonder if this was an incorrect assumption. I'll try and take a look this afternoon.

@MggMuggins
Contributor

I have downloaded the logs for the failed PR run and will plan to look into this next pulse.

@roosterfish roosterfish force-pushed the self_hosted_runners branch 7 times, most recently from cf0fcd4 to 1e31a6e on December 13, 2024 15:27
@roosterfish roosterfish changed the title github: Use Canonical runners for system tests github: Use Canonical runners for scheduled system tests Dec 13, 2024
@roosterfish roosterfish marked this pull request as ready for review December 13, 2024 15:34
@roosterfish roosterfish force-pushed the self_hosted_runners branch 2 times, most recently from fb5cff6 to 7be4bc3 on December 17, 2024 10:46
api/session_join.go — review comment (outdated, resolved)
masnax previously approved these changes Dec 18, 2024
@tomponline
Member

needs a rebase please

Ensure MicroCeph is fully started after bootstrapping to prevent running into timeouts
if the test suite is too fast.

Signed-off-by: Julian Pelizäus <[email protected]>
When trying to install the LXD snap while it already exists, the exit code isn't >0,
so the refresh will never happen.

Signed-off-by: Julian Pelizäus <[email protected]>
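The fix described in this commit message might look roughly like the following; the function name and channel handling are assumptions for illustration, not the actual change:

```shell
# Hypothetical sketch: `snap install` exits 0 when the snap is already
# installed, so an `install || refresh` pattern never refreshes.
# Check for presence explicitly and refresh in that case instead.
install_or_refresh_snap() {
  local name="$1" channel="$2"
  if snap list "$name" >/dev/null 2>&1; then
    snap refresh "$name" --channel="$channel"
  else
    snap install "$name" --channel="$channel"
  fi
}
```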
This allows reducing the time between sending the password and starting to listen for join intents.
On slow test runners we saw errors because the initiator hadn't yet started listening for join
intents while potential joiners were already dialing in with the passphrase.

Signed-off-by: Julian Pelizäus <[email protected]>
This allows having a much cleaner matrix definition grouped by
core, upgrade and Canonical-specific system tests.

Signed-off-by: Julian Pelizäus <[email protected]>
This allows reusing the system test steps for all groups:
core, upgrade and Canonical-specific tests.

Signed-off-by: Julian Pelizäus <[email protected]>
@roosterfish
Contributor Author

@tomponline @masnax rebased.

@roosterfish roosterfish merged commit 48f53f7 into canonical:main Dec 19, 2024
24 checks passed
@roosterfish roosterfish deleted the self_hosted_runners branch December 19, 2024 10:46