Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flow: always allow running with a partially evaluated graph #4411

Closed
wants to merge 2 commits into from

Conversation

rfratto
Copy link
Member

@rfratto rfratto commented Jul 12, 2023

This commit changes two behaviors about the Flow system:

  • Components can now opt-in to being run even if their initial evaluation fails by returning a component.Component instance alongside their evaluation error.

    Previously, if the initial evaluation of a component failed, it would not be run until a succesful evaluation of the component occurred.

    This change allows polling-based components (such as remote.http) to rely on their native polling mechanisms to auto-resolve health issues if their initial evaluation fails.

    Without this change, a remote.http component with no dependencies that fails on the first evaluate would not be re-evaluated again until the config file is reloaded.

  • LoadFile will no longer require the first evaluation to complete without error before scheduling components to be run.

    Previously, no components would run until all components evaluated successfully at least once.

    This change allows operating the Flow controller in a partially applied state, where a subset of components are healthy. This is critical in scenarios where a managed agent is being run where a subset of components may not always be immediately healthy on startup.

Implementing the second behavior implies that the first behavior should also be implemented, since users may not understand when it is required to reload the config file to fix polling-based components.

Callers may emulate the previous behavior by not calling Run until LoadFile succeeds at least once. For the scope of this PR, this is what grafana-agent run has always done, and it hasn't changed here.

Supersedes #4395.

pkg/flow/flow.go Outdated Show resolved Hide resolved
@rfratto

This comment was marked as outdated.

@thampiotr

This comment was marked as outdated.

pkg/flow/flow.go Outdated Show resolved Hide resolved
@rfratto rfratto changed the title flow: prototype "non-strict" loading mode flow: allow running a partially evaluated graph Aug 1, 2023
@rfratto rfratto changed the title flow: allow running a partially evaluated graph flow: always allow running with a partially evaluated graph Aug 1, 2023
* Components can now opt-in to being run even if their initial
  evaluation fails by returning a component.Component instance alongside
  their evaluation error.

  Previously, if the initial evaluation of a component failed, it would
  not be run until a succesful evaluation of the component occurred.

  This change allows polling-based components (such as `remote.http`) to
  rely on their native polling mechanisms to auto-resolve health issues
  if their initial evaluation fails.

  Without this change, a `remote.http` component with no dependencies
  that fails on the first evaluate would not be re-evaluated again until
  the config file is reloaded.

* LoadFile will no longer require the first evaluation to complete
  without error before scheduling components to be run.

  Previously, no components would run until all components evaluated
  successfully at least once.

  This change allows operating the Flow controller in a partially
  applied state, where a subset of components are healthy. This is
  critical in scenarios where a managed agent is being run where a
  subset of components may not always be immediately healthy on startup.

Implementing the second behavior implies that the first behavior should
also be implemented, since users may not understand when it is required
to reload the config file to fix polling-based components.

Callers may emulate the previous behavior by not calling `Run` until
`LoadFile` succeeds at least once. For the scope of this PR, this is
what `grafana-agent run` has always done, and it hasn't changed here.

Supersedes grafana#4395.
@rfratto
Copy link
Member Author

rfratto commented Aug 1, 2023

@thampiotr PTAL, I updated the PR to remove the "non-strict mode" terminology in favor of just changing the default semantics, with instructions for how callers can emulate the previous behavior.

I've also updated the commit message / PR description to try to make it more clear exactly what behaviors this changes, including when/why it's useful.

Copy link
Contributor

@thampiotr thampiotr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this looks pretty good - I guess we'll need a flag and docs for this?

@@ -272,6 +272,9 @@ func (fr *flowRun) Run(configFile string) error {
// Perform the initial reload. This is done after starting the HTTP server so
// that /metric and pprof endpoints are available while the Flow controller
// is loading.
//
// TODO(rfratto): add a flag to permit the initial load failing, allowing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me looks reasonable. But I'd like to hear the opinion of @spartan0x117 and @jcreixell before moving forward.

@rfratto
Copy link
Member Author

rfratto commented Aug 15, 2023

I'm taking this out of draft since this is essentially ready for final review and a decision about the direction.

@rfratto rfratto marked this pull request as ready for review August 15, 2023 15:59
@rfratto rfratto requested a review from a team as a code owner August 15, 2023 15:59
@mattdurham
Copy link
Collaborator

Do we want/need to add documentation about this behavior?

@rfratto
Copy link
Member Author

rfratto commented Aug 15, 2023

Do we want/need to add documentation about this behavior?

I don't think so at this current stage, since nothing changes (yet) from the user's perspective. This only unlocks the capability for Flow to behave correctly if we stopped asserting that the first load is successful.

// We do the error check at the end to allow the component to still return
// a component instance with an error to signal that it should still run.
if err != nil {
return fmt.Errorf("building component: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think about logging this error in addition to returning it?

I am coming at this from the perspective of someone debugging the agent. We should provide as much telemetry data as possible for failed nodes. This is a small tradeoff for giving the extra flexibility in behavior (not stopping the agent on a single bad node)

@jcreixell
Copy link
Contributor

@rfratto what's missing to move forward with this? the solution seems reasonable to me, I will test it locally to make sure it behaves as expected in the context of agent management but other than that LGTM

@jcreixell
Copy link
Contributor

just tested this with remote.http and module.string and getting the following nil pointer error:

Error: module.string.agent_management:1:1: Unrecognized component name "local.files"

Error: module.string.agent_management:10:18: component "local.file.metrics_writer.content" does not exist
ts=2023-08-28T10:24:12.51244216Z level=info msg="scheduling loaded components and services"
interrupt received
ts=2023-08-28T10:24:12.512466596Z level=debug msg="flow controller exiting"
ts=2023-08-28T10:24:12.512467964Z level=error msg="failed to start reporter" err="context canceled"
ts=2023-08-28T10:24:12.512483342Z level=info component=remote.http.remoteconfig msg="component exited"
ts=2023-08-28T10:24:12.512738041Z level=info service=http msg="now listening for http traffic" addr=127.0.0.1:12345
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3ed7dbe]

goroutine 259 [running]:
github.com/grafana/agent/component/module/string.(*Component).Run(0xc002732280?, {0x7f256c8?, 0xc0027090e0?})
	/home/jorge/workspace/src/github.com/rfratto/agent/component/module/string/string.go:63 +0x1e
github.com/grafana/agent/pkg/flow/internal/controller.(*ComponentNode).Run(0xc002732280, {0x7f256c8, 0xc0027090e0})
	/home/jorge/workspace/src/github.com/rfratto/agent/pkg/flow/internal/controller/node_component.go:330 +0x122
github.com/grafana/agent/pkg/flow/internal/controller.newTask.func1()
	/home/jorge/workspace/src/github.com/rfratto/agent/pkg/flow/internal/controller/scheduler.go:140 +0x74
created by github.com/grafana/agent/pkg/flow/internal/controller.newTask
	/home/jorge/workspace/src/github.com/rfratto/agent/pkg/flow/internal/controller/scheduler.go:137 +0x150

local agent config:

remote.http "remoteconfig" {
  url = "some_url"
  headers = {
    "Authorization" = "Basic <some_key>",
  }
  poll_frequency = "11s"
}

module.string "agent_management" {
  content = remote.http.remoteconfig.content
}

remote config:

  // intentional typo in component name
  local.files "metrics_writer" {
    filename = "<some+path>"
  }

  prometheus.remote_write "grafanacloud" {
    endpoint {
      url = "<some_push_url>"
      basic_auth {
        username = "<some_id>"
        password = local.file.metrics_writer.content
      }
   }
  }

@rfratto
Copy link
Member Author

rfratto commented Aug 28, 2023

@rfratto what's missing to move forward with this? the solution seems reasonable to me, I will test it locally to make sure it behaves as expected in the context of agent management but other than that LGTM

We're still missing general consensus on whether this is a behavioral change we want to make to the Flow controller.

@rfratto
Copy link
Member Author

rfratto commented Aug 31, 2023

I'm unlikely to have time to finish this up over the next few weeks. If this is needed sooner, is someone else able to take this on?

@thampiotr
Copy link
Contributor

Seems like this is not needed at the moment / not blocking anything. Should we close this PR?

@rfratto
Copy link
Member Author

rfratto commented Jan 8, 2024

Yeah, let's close for now. We can revisit if/when needed

@rfratto rfratto closed this Jan 8, 2024
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants