-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connectivity test: introduce connectivity test suite timeout flag #2145
connectivity test: introduce connectivity test suite timeout flag #2145
Conversation
|
||
go func() { | ||
<-ctx.Done() | ||
cc.Log("Interrupt received, cancelling tests...") | ||
cc.Logf("Cancellation request (%s) received, cancelling tests...", context.Cause(ctx)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of an interrupt, we might lose a little bit of detail in the log. This is because signal.NotifyContext
doesn't provide the actual cause. But this should be fixed once golang/go#60756 is solved.
For the time being we would have to live with what context.Cause(ctx)
provides -> ctx.Error()
Interrupt: Cancellation request (context canceled) received, cancelling tests...
Timeout: Cancellation request (connectivity test suite timeout (5s) reached) received, cancelling tests...
213f782
to
eba2389
Compare
eba2389
to
93951f1
Compare
@@ -135,6 +137,9 @@ const ( | |||
// ClustermeshMaxConnectedClusters is the default number of the maximum | |||
// number of clusters that should be allowed to connect to the Clustermesh. | |||
ClustermeshMaxConnectedClusters = 255 | |||
|
|||
// Default timeout for Connectivity Test Suite (disabled by default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IF we want to enable a default timeout by default - what would it be? 30min?
(Part of the reason why i thought it's best to keep the current behavior and the timeout disabled by default)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to leave the default timeout disable by default. Otherwise users outside of GHA might easily hit it, depending on their infrastructure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhofstetter LGTM
Currently, the connectivity test suite command doesn't have a timeout. Its context only gets cancelled due to interrupt of SIGTERM signal. For most connectivity test scenarios and actions, this isn't a problem as they have individual fine-grained timeouts (e.g. request timeout). But there are test scenarios (e.g. `health ` - `Cilium Health Probe`) that might run infinitely on failiure (e.g. if a Cilium Agent Pod no longer is available). Therefore, this commit introduces a new flag `--timeout` that provides the possibility to configure an overall timeout for the connectivity test suite. It defaults to `0` - meaning no timeout by default (current behaviour). Signed-off-by: Marco Hofstetter <[email protected]>
93951f1
to
aad0c99
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the difficulty in selecting a default timeout value because most of these tests run with a GH Actions timeout anyway. With that said, what's the point of this PR if the default is going to be 0 anyway? When is it intended to set the timeout to a non-zero value?
Might be of use for other use-cases beside testing in GHA too? At least we would provide the possibility - without having to rely on the calling infrastructure to provide this. And maybe, at best, we can agree on a reasonable default. This way it would be aligned with other cilium-cli commands (e.g. /cc @tklauser WDYT? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I think this might benefit users running connectivity tests manually outside of GHA. I like that the default timeout is disabled by default and we don't change existing behavior.
@@ -135,6 +137,9 @@ const ( | |||
// ClustermeshMaxConnectedClusters is the default number of the maximum | |||
// number of clusters that should be allowed to connect to the Clustermesh. | |||
ClustermeshMaxConnectedClusters = 255 | |||
|
|||
// Default timeout for Connectivity Test Suite (disabled by default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to leave the default timeout disable by default. Otherwise users outside of GHA might easily hit it, depending on their infrastructure.
Let's wait for Chris to approve as well before merging this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with this change. I have a slight worry in the back of my mind that when this timeout is set, if we are not carefully enough we'll run into weird situations where it might overlap just enough with the GH Actions timeout.
Maybe it's worth just mentioning in the help msg that this is not recommended to set in GH Actions?
Currently, the connectivity test suite command doesn't have a timeout. Its context only gets cancelled due to interrupt of SIGTERM signal.
For most connectivity test scenarios and actions, this isn't a problem as they have individual fine-grained timeouts (e.g. request timeout).
But there are test scenarios (e.g.
health
-Cilium Health Probe
) that might run infinitely on failiure (e.g. if a Cilium Agent Pod no longer is available).Therefore, this commit introduces a new flag
--timeout
that provides the possibility to configure an overall timeout for the connectivity test suite. It defaults to0
- meaning no timeout by default (current behavior).Alternative: Introduce timeout for health probe test (I thought it's better to solve with a global timeout - because other test scenarios / actions have the same problem)
Example of an infinite health probe attempt (until GitHub action timeout) in
cilium/cilium
: https://github.com/cilium/cilium/actions/runs/7060997016/job/19221753163 (Of course it would also be possible to only solve it on the usage side - e.g. by setting a timeout in the connectivity test step. But i thought it might be worth for other use-cases too.)