Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connectivity test: introduce connectivity test suite timeout flag #2145

Conversation

mhofstetter
Copy link
Member

@mhofstetter mhofstetter commented Dec 4, 2023

Currently, the connectivity test suite command doesn't have a timeout. Its context only gets cancelled due to interrupt of SIGTERM signal.

For most connectivity test scenarios and actions, this isn't a problem as they have individual fine-grained timeouts (e.g. request timeout).

But there are test scenarios (e.g. health - Cilium Health Probe) that might run infinitely on failiure (e.g. if a Cilium Agent Pod no longer is available).

Therefore, this commit introduces a new flag --timeout that provides the possibility to configure an overall timeout for the connectivity test suite. It defaults to 0 - meaning no timeout by default (current behavior).

Alternative: Introduce timeout for health probe test (I thought it's better to solve with a global timeout - because other test scenarios / actions have the same problem)

Example of an infinite health probe attempt (until GitHub action timeout) in cilium/cilium: https://github.com/cilium/cilium/actions/runs/7060997016/job/19221753163 (Of course it would also be possible to only solve it on the usage side - e.g. by setting a timeout in the connectivity test step. But i thought it might be worth for other use-cases too.)


go func() {
<-ctx.Done()
cc.Log("Interrupt received, cancelling tests...")
cc.Logf("Cancellation request (%s) received, cancelling tests...", context.Cause(ctx))
Copy link
Member Author

@mhofstetter mhofstetter Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of an interrupt, we might lose a little bit of detail in the log. This is because signal.NotifyContext doesn't provide the actual cause. But this should be fixed once golang/go#60756 is solved.

For the time being we would have to live with what context.Cause(ctx) provides -> ctx.Error()

Interrupt: Cancellation request (context canceled) received, cancelling tests...
Timeout: Cancellation request (connectivity test suite timeout (5s) reached) received, cancelling tests...

@@ -135,6 +137,9 @@ const (
// ClustermeshMaxConnectedClusters is the default number of the maximum
// number of clusters that should be allowed to connect to the Clustermesh.
ClustermeshMaxConnectedClusters = 255

// Default timeout for Connectivity Test Suite (disabled by default)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IF we want to enable a default timeout by default - what would it be? 30min?

(Part of the reason why i thought it's best to keep the current behavior and the timeout disabled by default)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to leave the default timeout disable by default. Otherwise users outside of GHA might easily hit it, depending on their infrastructure.

Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the connectivity test suite command doesn't have a timeout.
Its context only gets cancelled due to interrupt of SIGTERM signal.

For most connectivity test scenarios and actions, this isn't a problem
as they have individual fine-grained timeouts (e.g. request timeout).

But there are test scenarios (e.g. `health ` - `Cilium Health Probe`)
that might run infinitely on failiure (e.g. if a Cilium Agent Pod no
longer is available).

Therefore, this commit introduces a new flag `--timeout` that provides
the possibility to configure an overall timeout for the connectivity
test suite. It defaults to `0` - meaning no timeout by default (current
behaviour).

Signed-off-by: Marco Hofstetter <[email protected]>
@mhofstetter mhofstetter force-pushed the pr/mhofstetter/connectivity-test-timeout branch from 93951f1 to aad0c99 Compare December 4, 2023 15:22
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the difficulty in selecting a default timeout value because most of these tests run with a GH Actions timeout anyway. With that said, what's the point of this PR if the default is going to be 0 anyway? When is it intended to set the timeout to a non-zero value?

@mhofstetter
Copy link
Member Author

mhofstetter commented Dec 4, 2023

I agree with the difficulty in selecting a default timeout value because most of these tests run with a GH Actions timeout anyway. With that said, what's the point of this PR if the default is going to be 0 anyway? When is it intended to set the timeout to a non-zero value?

Might be of use for other use-cases beside testing in GHA too? At least we would provide the possibility - without having to rely on the calling infrastructure to provide this.

And maybe, at best, we can agree on a reasonable default. This way it would be aligned with other cilium-cli commands (e.g. cilium status --wait - that also provides a default timeout of 5min)

/cc @tklauser WDYT?

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I think this might benefit users running connectivity tests manually outside of GHA. I like that the default timeout is disabled by default and we don't change existing behavior.

@@ -135,6 +137,9 @@ const (
// ClustermeshMaxConnectedClusters is the default number of the maximum
// number of clusters that should be allowed to connect to the Clustermesh.
ClustermeshMaxConnectedClusters = 255

// Default timeout for Connectivity Test Suite (disabled by default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to leave the default timeout disable by default. Otherwise users outside of GHA might easily hit it, depending on their infrastructure.

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Dec 5, 2023
@tklauser
Copy link
Member

tklauser commented Dec 5, 2023

Let's wait for Chris to approve as well before merging this.

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this change. I have a slight worry in the back of my mind that when this timeout is set, if we are not carefully enough we'll run into weird situations where it might overlap just enough with the GH Actions timeout.

Maybe it's worth just mentioning in the help msg that this is not recommended to set in GH Actions?

@michi-covalent michi-covalent merged commit 03ee972 into cilium:main Dec 5, 2023
20 checks passed
@mhofstetter mhofstetter deleted the pr/mhofstetter/connectivity-test-timeout branch December 5, 2023 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants