flow: always allow running with a partially evaluated graph #4411
* Components can now opt in to being run even if their initial evaluation fails by returning a component.Component instance alongside their evaluation error.

  Previously, if the initial evaluation of a component failed, it would not be run until a successful evaluation of the component occurred.

  This change allows polling-based components (such as `remote.http`) to rely on their native polling mechanisms to auto-resolve health issues if their initial evaluation fails. Without this change, a `remote.http` component with no dependencies that fails on the first evaluation would not be re-evaluated until the config file is reloaded.

* LoadFile will no longer require the first evaluation to complete without error before scheduling components to be run.

  Previously, no components would run until all components evaluated successfully at least once.

  This change allows operating the Flow controller in a partially applied state, where a subset of components are healthy. This is critical when running a managed agent, where a subset of components may not be immediately healthy on startup.

Implementing the second behavior implies that the first behavior should also be implemented, since users may not understand when they must reload the config file to fix polling-based components.

Callers may emulate the previous behavior by not calling `Run` until `LoadFile` succeeds at least once. For the scope of this PR, this is what `grafana-agent run` has always done, and it hasn't changed here.

Supersedes grafana#4395.
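The opt-in rule in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: the `Component` interface, `poller`, `buildPoller`, `buildStrict`, and `schedule` are all hypothetical names introduced here, and the one-method interface is an assumption standing in for `component.Component`.

```go
package main

import (
	"errors"
	"fmt"
)

// Component is a stand-in for component.Component; the single Run method
// is an assumption made for this sketch.
type Component interface {
	Run() error
}

// poller is a hypothetical polling-based component, loosely modeled on the
// remote.http case from the commit message.
type poller struct{ url string }

func (p *poller) Run() error { return nil }

// buildPoller is a hypothetical builder. When the first fetch fails it still
// returns the instance alongside the error, which is the opt-in signal.
func buildPoller(url string, firstFetchOK bool) (Component, error) {
	p := &poller{url: url}
	if !firstFetchOK {
		return p, errors.New("initial fetch failed")
	}
	return p, nil
}

// buildStrict is a hypothetical builder that does not opt in: on failure it
// returns no instance at all.
func buildStrict() (Component, error) {
	return nil, errors.New("invalid arguments")
}

// schedule reports whether the component should be run. An instance returned
// next to an error is still scheduled; the error is passed through so the
// caller can report the component as unhealthy.
func schedule(build func() (Component, error)) (bool, error) {
	c, err := build()
	return c != nil, err
}

func main() {
	run, err := schedule(func() (Component, error) {
		return buildPoller("http://example.invalid", false)
	})
	fmt.Println("opt-in component:", run, err)

	run, err = schedule(buildStrict)
	fmt.Println("strict component:", run, err)
}
```

In this sketch the failing poller is still scheduled, so its own polling loop gets a chance to recover, while the strict builder's component is not run at all.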
@thampiotr PTAL, I updated the PR to remove the "non-strict mode" terminology in favor of just changing the default semantics, with instructions for how callers can emulate the previous behavior. I've also updated the commit message / PR description to try to make it more clear exactly what behaviors this changes, including when/why it's useful.
Yeah, this looks pretty good - I guess we'll need a flag and docs for this?
@@ -272,6 +272,9 @@ func (fr *flowRun) Run(configFile string) error {
// Perform the initial reload. This is done after starting the HTTP server so
// that /metric and pprof endpoints are available while the Flow controller
// is loading.
//
// TODO(rfratto): add a flag to permit the initial load failing, allowing
This looks reasonable to me, but I'd like to hear the opinions of @spartan0x117 and @jcreixell before moving forward.
I'm taking this out of draft since this is essentially ready for final review and a decision about the direction.
Do we want/need to add documentation about this behavior?
I don't think so at this current stage, since nothing changes (yet) from the user's perspective. This only unlocks the capability for Flow to behave correctly if we stopped asserting that the first load is successful.
// We do the error check at the end to allow the component to still return
// a component instance with an error to signal that it should still run.
if err != nil {
	return fmt.Errorf("building component: %w", err)
What do you think about logging this error in addition to returning it?
I am coming at this from the perspective of someone debugging the agent. We should provide as much telemetry data as possible for failed nodes. This is a small tradeoff for the extra flexibility in behavior (not stopping the agent on a single bad node).
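As a rough illustration of that suggestion, logging the failure and still returning it might look like the sketch below. It uses the standard library `log` package only to stay self-contained; the logger the agent actually uses, the log fields, and the `buildNode` helper with its `nodeID` parameter are assumptions, not code from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errRefused is a sample failure used by main.
var errRefused = errors.New("connection refused")

// buildNode is a hypothetical stand-in for the controller code in the diff
// above. On failure it logs the error for whoever is debugging the agent and
// still returns it wrapped, so the caller's behavior is unchanged.
func buildNode(nodeID string, build func() error) error {
	if err := build(); err != nil {
		log.Printf("level=error node=%s msg=%q err=%q", nodeID, "failed to build component", err)
		return fmt.Errorf("building component: %w", err)
	}
	return nil
}

func main() {
	err := buildNode("remote.http.example", func() error { return errRefused })
	// The wrapped error still matches the original cause.
	fmt.Println(err, errors.Is(err, errRefused))
}
```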
@rfratto what's missing to move forward with this? The solution seems reasonable to me. I will test it locally to make sure it behaves as expected in the context of agent management, but other than that LGTM.
just tested this with
local agent config:
remote config:
We're still missing general consensus on whether this is a behavioral change we want to make to the Flow controller.
I'm unlikely to have time to finish this up over the next few weeks. If this is needed sooner, is someone else able to take this on?
Seems like this is not needed at the moment / not blocking anything. Should we close this PR?
Yeah, let's close for now. We can revisit if/when needed
This commit changes two behaviors about the Flow system:

* Components can now opt in to being run even if their initial evaluation fails by returning a component.Component instance alongside their evaluation error.

  Previously, if the initial evaluation of a component failed, it would not be run until a successful evaluation of the component occurred.

  This change allows polling-based components (such as `remote.http`) to rely on their native polling mechanisms to auto-resolve health issues if their initial evaluation fails. Without this change, a `remote.http` component with no dependencies that fails on the first evaluation would not be re-evaluated until the config file is reloaded.

* LoadFile will no longer require the first evaluation to complete without error before scheduling components to be run.

  Previously, no components would run until all components evaluated successfully at least once.

  This change allows operating the Flow controller in a partially applied state, where a subset of components are healthy. This is critical when running a managed agent, where a subset of components may not be immediately healthy on startup.

Implementing the second behavior implies that the first behavior should also be implemented, since users may not understand when they must reload the config file to fix polling-based components.

Callers may emulate the previous behavior by not calling `Run` until `LoadFile` succeeds at least once. For the scope of this PR, this is what `grafana-agent run` has always done, and it hasn't changed here.

Supersedes #4395.
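For callers that want the previous behavior, the ordering described above could be sketched as follows. The `controller` interface here is a hypothetical, trimmed-down view: the real `LoadFile` and `Run` signatures are not shown in this PR and are assumed away, and `fakeController` and `runStrict` are illustrative names only.

```go
package main

import (
	"errors"
	"fmt"
)

// errBadConfig is a sample load failure used by main.
var errBadConfig = errors.New("invalid config")

// controller is a hypothetical view of the Flow controller. Only the call
// ordering matters for this sketch; the real method signatures differ.
type controller interface {
	LoadFile() error
	Run()
}

// fakeController records whether Run was called.
type fakeController struct {
	loadErr error
	ran     bool
}

func (f *fakeController) LoadFile() error { return f.loadErr }
func (f *fakeController) Run()            { f.ran = true }

// runStrict emulates the previous behavior: Run is only called once LoadFile
// has succeeded, so nothing runs from a partially evaluated graph.
func runStrict(c controller) error {
	if err := c.LoadFile(); err != nil {
		return fmt.Errorf("initial load failed, not running: %w", err)
	}
	c.Run()
	return nil
}

func main() {
	bad := &fakeController{loadErr: errBadConfig}
	fmt.Println(runStrict(bad), bad.ran)

	good := &fakeController{}
	fmt.Println(runStrict(good), good.ran)
}
```

A caller that wants the new, partially applied behavior would simply call `Run` regardless of the `LoadFile` result.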