Add CLI flag for aggregation timeout #686

johnSchnake · 2019-04-22T18:58:39Z

What this PR does / why we need it:
One of the most common config settings that may change between
different environments and test cases is the amount of time
Sonobuoy will wait for results. Currently, this value is configurable
but it is in a nested config structure and not terribly well
documented at this time.

Personally, in the past, I had projects where we had to start
using a sonobuoy config-based flow instead of flags due solely
to this field.

This change adds the flag to make it more easy to set.

Signed-off-by: John Schnake [email protected]

Which issue(s) this PR fixes
Fixes #482

Release note:

Added --timeout flag to set the number of seconds Sonobuoy will wait for plugins to complete.

johnSchnake · 2019-04-22T18:58:48Z

// @ceridwen

codecov-io · 2019-04-22T19:13:44Z

Codecov Report

Merging #686 into master will increase coverage by 0.1%.
The diff coverage is 75%.

@@            Coverage Diff            @@
##           master     #686     +/-   ##
=========================================
+ Coverage   42.44%   42.55%   +0.1%     
=========================================
  Files          62       62             
  Lines        3409     3419     +10     
=========================================
+ Hits         1447     1455      +8     
- Misses       1864     1866      +2     
  Partials       98       98

Impacted Files	Coverage Δ
pkg/plugin/aggregation/run.go	`0% <0%> (ø)`	⬆️
cmd/sonobuoy/app/gen.go	`48.36% <100%> (+1.3%)`	⬆️
pkg/config/config.go	`57.89% <100%> (ø)`	⬆️
cmd/sonobuoy/app/args.go	`88.88% <100%> (+0.37%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b56c31...52c65ee. Read the comment docs.

stevesloka · 2019-04-23T14:04:24Z

So now there are 2 flags which determine "wait". There's --wait and now --timeout (added in this PR). Should those do the same thing and be combined?

stevesloka · 2019-04-23T14:08:59Z

This could be a future design PR, but is there a way to define the timeout per plugin? For instance, the e2e tests may take a long time, but maybe my systemd_logs only takes 2mins? This flag could be the default of all and then later maybe let plugin developers further define?

stevesloka · 2019-04-23T14:05:34Z

cmd/sonobuoy/app/args.go

+func AddTimeoutFlag(flag *int, flags *pflag.FlagSet) {
+	flags.IntVar(
+		flag, timeoutFlag, config.DefaultAggregationServerTimeoutSeconds,
+		"How long (in seconds) Sonobuoy will wait for plugins to complete before exiting. 0 indicates no timeout.",


Might be cool to make this a duration type and let users enter 1m or 2h`, etc. Otherwise, you've got to do the math to figure out the number of seconds.

nit: Maybe print the default to the help so users know what it is if not defined?

cobra adds the default; this was copied from my console just now:

... before exiting. 0 indicates no timeout. (default 10800)

johnSchnake · 2019-04-23T15:03:31Z

So now there are 2 flags which determine "wait". There's --wait and now --timeout (added in this PR). Should those do the same thing and be combined?

So there are two flags but they really do indicate different things and could be used in various ways.

The --timeout flag indicates the server timeout and when that is hit, everything turns off and explodes (more or less).

The --wait flag is a CLI-only flag that just indicates how long you are willing to block in the terminal for. It won't err the run just because you hit that limit.

The reason I don't want to sync the two is that a pretty reasonable use case is the following:

run and allow the server to timeout after 2hr
you really dont think it'll take that long so you just have your CLI stop waiting and return after an hour.
after the hour is up you can check the status; you assume it should be done but if not (and it is still running) you can investigate everything while it is still running.

johnSchnake · 2019-04-23T15:06:48Z

This could be a future design PR, but is there a way to define the timeout per plugin? For instance, the e2e tests may take a long time, but maybe my systemd_logs only takes 2mins? This flag could be the default of all and then later maybe let plugin developers further define?

So I agree something like that may eventually be implemented. There are definitely impacts of this overall timeout and per-plugin timeout that play a role in #627 My POC implementation of plugin ordering also showed me that it could be very easy to adjust some of these details by adding them into the plugin manifest metadata.

But yes, I agree it is for another day since this is currently coded for just one big timeout and this was just to make it easier to set.

stevesloka · 2019-04-23T15:10:05Z

Yeah that holds up, but I can still sonobuoy run --wait, then CTRL-C and then sonobuoy status and all works.

The --wait flag is a CLI-only flag that just indicates how long you are willing to block in the terminal for. It won't err the run just because you hit that limit.

But should it? If you had CI running and this happened you'd get a false positive wouldn't you?

I'm just trying to challenge the overall design to make sure we're solving the problem and not just throwing flags at it. Is there a scenario where you'd want the client to wait 5 mins and the server wait 2 hours?

johnSchnake · 2019-04-23T15:48:28Z

Yeah that holds up, but I can still sonobuoy run --wait, then CTRL-C and then sonobuoy status and all works.

True, though you may want to do something like this even in CI. Before the --wait commands existed thats what I had seen done when people were rolling their own polling logic. Wait while it runs for a while, then when we hit our own timeout logic go and gather logs and maybe run some checks while it Sonobuoy is still running, then err out ourselves and clean up.

But should it? If you had CI running and this happened you'd get a false positive wouldn't you?

As the example above says, I don't think forcing the server to error and stop running because we hit a CLI timeout is always the right thing to do. It prevents a user (including CI systems) from investigating if they want to.

Just because sonobuoy run --wait returns doesn't guarantee anything about whether or not anything succeeded; do you mean false positive in that you might think it is done? That is partially why the default time for --wait is much longer than any typical run, currently at a day.

I'm just trying to challenge the overall design to make sure we're solving the problem and not just throwing flags at it.

Absolutely 👍 , IMO that is the discussion that we had when deciding to put this ticket into the milestone though. I appreciate the design questions though and am happy to defend the choice to make the new flag.

Is there a scenario where you'd want the client to wait 5 mins and the server wait 2 hours?

Yes but I also think your comment about CTRL+C plays a role here too; personally, I run checks all the time where this is the [somewhat] the case. If I am iterating on some plugin (in my case, e2e test) and I think it will just take 5m or less and I run it as such. If it runs and I find out it actually took 7m, I really didn't want the server to err at 5m. But the CLI --wait gave me an opportunity to see that 5m has passed and it still isn't done. Again, then I investigate the situation and usually restart my experiment whenever I'm done.

TL;DR; I think that in most situations your sentiment is right, you'd just want the server to pass/fail on its own and then the CLI wait for that terminal status. But I can think of situations where you want to ensure the CLI stops and gives you time to investigate before cleaning anything up. I think some of these things are also related to when/how/if plugin resources are cleaned up. In the end though, this is already a customizable value, just one that multiple people edit regularly and want it to be more simple.

If we decide to prioritize per-plugin timeouts, cleanup logic, etc then I think we should bring up these ideas again though because I do acknowledge there are surely some improvements that could be made.

One of the most common config settings that may change between different environments and test cases is the amount of time Sonobuoy will wait for results. Currently, this value is configurable but it is in a nested config structure and not terribly well documented at this time. Personally, in the past, I had projects where we had to start using a sonobuoy config-based flow instead of flags due solely to this field. This change adds the flag to make it more easy to set. Fixes #482 Signed-off-by: John Schnake <[email protected]>

johnSchnake requested a review from stevesloka April 22, 2019 18:58

stevesloka reviewed Apr 23, 2019

View reviewed changes

johnSchnake merged commit aaa5e13 into vmware-tanzu:master Apr 26, 2019

johnSchnake deleted the timeoutFlag branch April 26, 2019 02:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CLI flag for aggregation timeout #686

Add CLI flag for aggregation timeout #686

johnSchnake commented Apr 22, 2019

johnSchnake commented Apr 22, 2019

codecov-io commented Apr 22, 2019 •

edited

Loading

stevesloka commented Apr 23, 2019

stevesloka commented Apr 23, 2019

stevesloka Apr 23, 2019

stevesloka Apr 23, 2019

johnSchnake Apr 23, 2019

johnSchnake commented Apr 23, 2019

johnSchnake commented Apr 23, 2019

stevesloka commented Apr 23, 2019

johnSchnake commented Apr 23, 2019

Add CLI flag for aggregation timeout #686

Add CLI flag for aggregation timeout #686

Conversation

johnSchnake commented Apr 22, 2019

johnSchnake commented Apr 22, 2019

codecov-io commented Apr 22, 2019 • edited Loading

Codecov Report

stevesloka commented Apr 23, 2019

stevesloka commented Apr 23, 2019

stevesloka Apr 23, 2019

Choose a reason for hiding this comment

stevesloka Apr 23, 2019

Choose a reason for hiding this comment

johnSchnake Apr 23, 2019

Choose a reason for hiding this comment

johnSchnake commented Apr 23, 2019

johnSchnake commented Apr 23, 2019

stevesloka commented Apr 23, 2019

johnSchnake commented Apr 23, 2019

codecov-io commented Apr 22, 2019 •

edited

Loading