flow: always allow running with a partially evaluated graph #4411

rfratto · 2023-07-12T15:14:11Z

This commit changes two behaviors about the Flow system:

Components can now opt-in to being run even if their initial evaluation fails by returning a component.Component instance alongside their evaluation error.

Previously, if the initial evaluation of a component failed, it would not be run until a succesful evaluation of the component occurred.

This change allows polling-based components (such as remote.http) to rely on their native polling mechanisms to auto-resolve health issues if their initial evaluation fails.

Without this change, a remote.http component with no dependencies that fails on the first evaluate would not be re-evaluated again until the config file is reloaded.
LoadFile will no longer require the first evaluation to complete without error before scheduling components to be run.

Previously, no components would run until all components evaluated successfully at least once.

This change allows operating the Flow controller in a partially applied state, where a subset of components are healthy. This is critical in scenarios where a managed agent is being run where a subset of components may not always be immediately healthy on startup.

Implementing the second behavior implies that the first behavior should also be implemented, since users may not understand when it is required to reload the config file to fix polling-based components.

Callers may emulate the previous behavior by not calling Run until LoadFile succeeds at least once. For the scope of this PR, this is what grafana-agent run has always done, and it hasn't changed here.

Supersedes #4395.

pkg/flow/flow.go

* Components can now opt-in to being run even if their initial evaluation fails by returning a component.Component instance alongside their evaluation error. Previously, if the initial evaluation of a component failed, it would not be run until a succesful evaluation of the component occurred. This change allows polling-based components (such as `remote.http`) to rely on their native polling mechanisms to auto-resolve health issues if their initial evaluation fails. Without this change, a `remote.http` component with no dependencies that fails on the first evaluate would not be re-evaluated again until the config file is reloaded. * LoadFile will no longer require the first evaluation to complete without error before scheduling components to be run. Previously, no components would run until all components evaluated successfully at least once. This change allows operating the Flow controller in a partially applied state, where a subset of components are healthy. This is critical in scenarios where a managed agent is being run where a subset of components may not always be immediately healthy on startup. Implementing the second behavior implies that the first behavior should also be implemented, since users may not understand when it is required to reload the config file to fix polling-based components. Callers may emulate the previous behavior by not calling `Run` until `LoadFile` succeeds at least once. For the scope of this PR, this is what `grafana-agent run` has always done, and it hasn't changed here. Supersedes grafana#4395.

rfratto · 2023-08-01T15:34:21Z

@thampiotr PTAL, I updated the PR to remove the "non-strict mode" terminology in favor of just changing the default semantics, with instructions for how callers can emulate the previous behavior.

I've also updated the commit message / PR description to try to make it more clear exactly what behaviors this changes, including when/why it's useful.

thampiotr

Yeah, this looks pretty good - I guess we'll need a flag and docs for this?

marctc · 2023-08-02T13:40:02Z

cmd/internal/flowmode/cmd_run.go

@@ -272,6 +272,9 @@ func (fr *flowRun) Run(configFile string) error {
 	// Perform the initial reload. This is done after starting the HTTP server so
 	// that /metric and pprof endpoints are available while the Flow controller
 	// is loading.
+	//
+	// TODO(rfratto): add a flag to permit the initial load failing, allowing


For me looks reasonable. But I'd like to hear the opinion of @spartan0x117 and @jcreixell before moving forward.

rfratto · 2023-08-15T15:59:39Z

I'm taking this out of draft since this is essentially ready for final review and a decision about the direction.

mattdurham · 2023-08-15T19:23:37Z

Do we want/need to add documentation about this behavior?

rfratto · 2023-08-15T19:25:45Z

Do we want/need to add documentation about this behavior?

I don't think so at this current stage, since nothing changes (yet) from the user's perspective. This only unlocks the capability for Flow to behave correctly if we stopped asserting that the first load is successful.

erikbaranowski · 2023-08-21T13:53:56Z

pkg/flow/internal/controller/node_component.go

+		// We do the error check at the end to allow the component to still return
+		// a component instance with an error to signal that it should still run.
+		if err != nil {
+			return fmt.Errorf("building component: %w", err)


what do you think about logging this error in addition to returning it?

I am coming at this from the perspective of someone debugging the agent. We should provide as much telemetry data as possible for failed nodes. This is a small tradeoff for giving the extra flexibility in behavior (not stopping the agent on a single bad node)

jcreixell · 2023-08-28T08:53:44Z

@rfratto what's missing to move forward with this? the solution seems reasonable to me, I will test it locally to make sure it behaves as expected in the context of agent management but other than that LGTM

jcreixell · 2023-08-28T10:32:39Z

just tested this with remote.http and module.string and getting the following nil pointer error:

Error: module.string.agent_management:1:1: Unrecognized component name "local.files"

Error: module.string.agent_management:10:18: component "local.file.metrics_writer.content" does not exist
ts=2023-08-28T10:24:12.51244216Z level=info msg="scheduling loaded components and services"
interrupt received
ts=2023-08-28T10:24:12.512466596Z level=debug msg="flow controller exiting"
ts=2023-08-28T10:24:12.512467964Z level=error msg="failed to start reporter" err="context canceled"
ts=2023-08-28T10:24:12.512483342Z level=info component=remote.http.remoteconfig msg="component exited"
ts=2023-08-28T10:24:12.512738041Z level=info service=http msg="now listening for http traffic" addr=127.0.0.1:12345
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3ed7dbe]

goroutine 259 [running]:
github.com/grafana/agent/component/module/string.(*Component).Run(0xc002732280?, {0x7f256c8?, 0xc0027090e0?})
	/home/jorge/workspace/src/github.com/rfratto/agent/component/module/string/string.go:63 +0x1e
github.com/grafana/agent/pkg/flow/internal/controller.(*ComponentNode).Run(0xc002732280, {0x7f256c8, 0xc0027090e0})
	/home/jorge/workspace/src/github.com/rfratto/agent/pkg/flow/internal/controller/node_component.go:330 +0x122
github.com/grafana/agent/pkg/flow/internal/controller.newTask.func1()
	/home/jorge/workspace/src/github.com/rfratto/agent/pkg/flow/internal/controller/scheduler.go:140 +0x74
created by github.com/grafana/agent/pkg/flow/internal/controller.newTask
	/home/jorge/workspace/src/github.com/rfratto/agent/pkg/flow/internal/controller/scheduler.go:137 +0x150

local agent config:

remote.http "remoteconfig" {
  url = "some_url"
  headers = {
    "Authorization" = "Basic <some_key>",
  }
  poll_frequency = "11s"
}

module.string "agent_management" {
  content = remote.http.remoteconfig.content
}

remote config:

  // intentional typo in component name
  local.files "metrics_writer" {
    filename = "<some+path>"
  }

  prometheus.remote_write "grafanacloud" {
    endpoint {
      url = "<some_push_url>"
      basic_auth {
        username = "<some_id>"
        password = local.file.metrics_writer.content
      }
   }
  }

rfratto · 2023-08-28T13:22:17Z

@rfratto what's missing to move forward with this? the solution seems reasonable to me, I will test it locally to make sure it behaves as expected in the context of agent management but other than that LGTM

We're still missing general consensus on whether this is a behavioral change we want to make to the Flow controller.

rfratto · 2023-08-31T18:54:26Z

I'm unlikely to have time to finish this up over the next few weeks. If this is needed sooner, is someone else able to take this on?

thampiotr · 2024-01-08T16:08:49Z

Seems like this is not needed at the moment / not blocking anything. Should we close this PR?

rfratto · 2024-01-08T16:09:17Z

Yeah, let's close for now. We can revisit if/when needed

thampiotr reviewed Jul 12, 2023

View reviewed changes

pkg/flow/flow.go Outdated Show resolved Hide resolved

This comment was marked as outdated.

Sign in to view

rfratto commented Aug 1, 2023

View reviewed changes

pkg/flow/flow.go Outdated Show resolved Hide resolved

rfratto force-pushed the non-strict-mode branch from 9de39bc to 5ad351e Compare August 1, 2023 15:27

rfratto changed the title ~~flow: prototype "non-strict" loading mode~~ flow: allow running a partially evaluated graph Aug 1, 2023

rfratto changed the title ~~flow: allow running a partially evaluated graph~~ flow: always allow running with a partially evaluated graph Aug 1, 2023

rfratto force-pushed the non-strict-mode branch from 5ad351e to 6c02044 Compare August 1, 2023 15:32

rfratto mentioned this pull request Aug 1, 2023

flow: enable managing services #4664

Merged

thampiotr reviewed Aug 1, 2023

View reviewed changes

marctc reviewed Aug 2, 2023

View reviewed changes

Merge branch 'main' into non-strict-mode

92885b7

rfratto marked this pull request as ready for review August 15, 2023 15:59

rfratto requested a review from a team as a code owner August 15, 2023 15:59

erikbaranowski reviewed Aug 21, 2023

View reviewed changes

tpaschalis mentioned this pull request Apr 11, 2024

errors in services don't abort the Agent startup grafana/alloy#364

Open

rfratto mentioned this pull request Sep 21, 2023

Module id fail #5261

Merged

2 tasks

erikbaranowski assigned marctc, jcreixell, thampiotr and erikbaranowski Nov 2, 2023

erikbaranowski assigned mattdurham Nov 2, 2023

rfratto closed this Jan 8, 2024

github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024

github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flow: always allow running with a partially evaluated graph #4411

flow: always allow running with a partially evaluated graph #4411

rfratto commented Jul 12, 2023 •

edited

Loading

This comment was marked as outdated.

This comment was marked as outdated.

rfratto commented Aug 1, 2023

thampiotr left a comment

marctc Aug 2, 2023

rfratto commented Aug 15, 2023

mattdurham commented Aug 15, 2023

rfratto commented Aug 15, 2023

erikbaranowski Aug 21, 2023

jcreixell commented Aug 28, 2023

jcreixell commented Aug 28, 2023

rfratto commented Aug 28, 2023

rfratto commented Aug 31, 2023 •

edited

Loading

thampiotr commented Jan 8, 2024

rfratto commented Jan 8, 2024

flow: always allow running with a partially evaluated graph #4411

flow: always allow running with a partially evaluated graph #4411

Conversation

rfratto commented Jul 12, 2023 • edited Loading

This comment was marked as outdated.

This comment was marked as outdated.

rfratto commented Aug 1, 2023

thampiotr left a comment

Choose a reason for hiding this comment

marctc Aug 2, 2023

Choose a reason for hiding this comment

rfratto commented Aug 15, 2023

mattdurham commented Aug 15, 2023

rfratto commented Aug 15, 2023

erikbaranowski Aug 21, 2023

Choose a reason for hiding this comment

jcreixell commented Aug 28, 2023

jcreixell commented Aug 28, 2023

rfratto commented Aug 28, 2023

rfratto commented Aug 31, 2023 • edited Loading

thampiotr commented Jan 8, 2024

rfratto commented Jan 8, 2024

rfratto commented Jul 12, 2023 •

edited

Loading

rfratto commented Aug 31, 2023 •

edited

Loading