Make designing pipelines easier by providing lists of compatible components #4754

Closed
Tracked by #5778
thampiotr opened this issue Aug 8, 2023 · 6 comments · Fixed by #5791
Assignees
Labels
frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.), proposal (Proposal or RFC), proposal-accepted (Proposal has been accepted.)

Comments

@thampiotr
Contributor

thampiotr commented Aug 8, 2023

Problem

When designing a telemetry pipeline, it is currently difficult to discover what components can connect to what other components.

Specifically, I believe we should focus on making it easier to design a telemetry pipeline - where we are concerned about a conceptually high-level data flow and transformations (e.g. discover files -> read logs -> add labels -> send logs to DB). This is in contrast to a lower-level details where component connections are used for further pipeline configuration (e.g. read a string from env variable and set it as a username argument).

Background

My initial thought was that a better naming convention that clearly differentiates between, for example, sources, transformers and sinks would alleviate this problem. However, it quickly led to rather rigid and verbose names while still leaving some confusion. Renaming nearly all of the components would also be a large breaking change.

Our current naming convention groups components into namespaces, which in most cases makes it easier to narrow down the set of components one needs to look at. However, identifying potential links between components is frequently challenging, and uncovering connections that span namespaces is even more difficult.

What makes it even harder to navigate is that there are two ways the data can flow into the component:

  • Directly passed by value through arguments - this is the case for e.g. Targets
  • Written to the component's receiver, which is one of its exported fields

Similarly, data can leave the component through an export or be written to another component's receiver that is passed to it as an argument. As a side note, this also makes the graphs in the UI somewhat confusing, since the data flow is not reflected by the direction of the arrows.
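For illustration, here is a minimal Go sketch of the two patterns, using simplified, hypothetical types (the real component interfaces in the codebase are more involved):

```go
package sketch

// Target, LogEntry, and LogsReceiver are simplified, hypothetical stand-ins
// for the real data types; they only illustrate the shapes involved.
type Target map[string]string
type LogEntry struct{ Line string }
type LogsReceiver chan LogEntry

// Pattern 1: data flows in by value through arguments.
// A discovery.relabel-style component takes targets as an argument and
// exports the rewritten targets as a value.
type RelabelArguments struct {
	Targets []Target // accepted directly, passed by value
}
type RelabelExports struct {
	Output []Target // produced as an exported value
}

// Pattern 2: data flows in by being written to an exported receiver.
// A loki.process-style component exports a receiver that upstream components
// write log entries to, and forwards entries to the receivers it was given.
type ProcessArguments struct {
	ForwardTo []LogsReceiver // data leaves by writing to these receivers
}
type ProcessExports struct {
	Receiver LogsReceiver // data enters when others write to this receiver
}
```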

Proposal

  1. Help to conceptually design pipelines (on a high level) by making it clear:

    • What data every component accepts
    • What data every component outputs

    For example:

    • discovery.kubernetes or discovery.gce - accept nothing and output targets
    • discovery.relabel - accepts targets and outputs targets
    • loki.source.file - accepts targets and outputs Loki logs
    • loki.process or loki.relabel - accepts Loki logs and outputs Loki logs
    • loki.write - accepts Loki logs and outputs nothing
  2. Use the above information to add an auto-generated section to every component's reference documentation page that will list:

    • Components that it can accept data from
    • Components that can accept the data from it

Prototype

There is a prototype available here: #4753

  • It defines in code the connections between components as described above
  • It has a test that verifies that docs include the auto-generated section for each component
  • It allows auto-updating the sections in the generated docs as new metadata is added
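The exact shape used in the prototype PR may differ, but conceptually the per-component metadata could be as simple as this hedged Go sketch (names and entries are illustrative, based on the examples in the proposal above, not taken from the PR):

```go
package metadata

// DataType identifies the kind of telemetry data a component consumes or produces.
type DataType string

const (
	Targets  DataType = "targets"
	LokiLogs DataType = "loki_logs"
)

// Metadata describes, at a high level, what a component accepts and outputs.
type Metadata struct {
	Accepts []DataType
	Outputs []DataType
}

// Illustrative entries matching the examples in the proposal above.
var components = map[string]Metadata{
	"discovery.kubernetes": {Outputs: []DataType{Targets}},
	"discovery.relabel":    {Accepts: []DataType{Targets}, Outputs: []DataType{Targets}},
	"loki.source.file":     {Accepts: []DataType{Targets}, Outputs: []DataType{LokiLogs}},
	"loki.process":         {Accepts: []DataType{LokiLogs}, Outputs: []DataType{LokiLogs}},
	"loki.write":           {Accepts: []DataType{LokiLogs}},
}

// AcceptorsOf returns the names of components that can accept data of type t -
// i.e. the list an auto-generated docs section would show for a producer of t.
func AcceptorsOf(t DataType) []string {
	var names []string
	for name, md := range components {
		for _, a := range md.Accepts {
			if a == t {
				names = append(names, name)
				break
			}
		}
	}
	return names
}
```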

Here's an example of what the generated docs look like:
[screenshot of the auto-generated docs section]

@rfratto
Member

rfratto commented Aug 8, 2023

I'm not opposed to this, and I appreciate that you wrote a prototype which includes generating the documentation, which I think is a requirement to help prevent the list from ever being stale.

I have concerns about the specific way the relationships are defined in the prototype (I'd want to find a way to do it using just the existing arguments/exports types instead of defining a new Metadata type), but overall I'm personally in favor of this proposal.

@thampiotr
Contributor Author

I'd want to find a way to do it using just the existing arguments/exports types instead of defining a new Metadata type

There are some challenges with that approach:

  1. There are two ways the data can flow into the component:

    • Directly passed by value through arguments - this is the case for e.g. Targets
    • Written to the component's receiver, which is one of its exported fields

    Similarly, data can leave the component through an export or be written to another component's receiver that is passed to it as an argument.

    But what we are interested in here is the conceptual data pipeline - what data a component accepts and what data it outputs.

  2. We want to focus on the telemetry data that helps build pipelines at a high level; we don't want to, for example, include components that can export a string or produce a string, because I think it would produce a ton of connections and make the lists hard to navigate.

  3. In theory, the exported values and arguments that we're interested in can be nested inside structs a few levels deep, which means we may need to write more complex reflection code to discover them.

I think I may be able to overcome these by implementing a method that infers the current prototype's Metadata from a component's arguments and exports automatically. We will need to classify carefully based on both exports and arguments (to address 1). We will only look for certain data types to keep the focus on high-level telemetry pipeline connections (to address 2). We may start with just simple top-level reflection and address 3 in the future.
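A hedged sketch of what that inference could look like, with hypothetical type names and only the top-level reflection described above:

```go
package compat

import "reflect"

// Hypothetical stand-ins for the concrete data types the inference looks for.
type Target map[string]string
type LogsReceiver chan string

var (
	targetsType   = reflect.TypeOf([]Target(nil))
	receiverType  = reflect.TypeOf(LogsReceiver(nil))
	receiversType = reflect.TypeOf([]LogsReceiver(nil))
)

// infer classifies a component from its Arguments and Exports values, looking
// only at top-level struct fields and only at the data types we care about.
func infer(args, exports interface{}) (accepts, outputs []string) {
	for _, ft := range topLevelFieldTypes(args) {
		switch ft {
		case targetsType:
			accepts = append(accepts, "targets") // targets passed in by value
		case receiversType:
			outputs = append(outputs, "loki_logs") // the component writes to these receivers
		}
	}
	for _, ft := range topLevelFieldTypes(exports) {
		switch ft {
		case targetsType:
			outputs = append(outputs, "targets") // targets exported by value
		case receiverType:
			accepts = append(accepts, "loki_logs") // other components write logs into this receiver
		}
	}
	return accepts, outputs
}

// topLevelFieldTypes returns the types of the top-level fields of a struct
// (or pointer to struct); deeper nesting is deliberately ignored for now.
func topLevelFieldTypes(v interface{}) []reflect.Type {
	t := reflect.TypeOf(v)
	for t != nil && t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t == nil || t.Kind() != reflect.Struct {
		return nil
	}
	types := make([]reflect.Type, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		types = append(types, t.Field(i).Type)
	}
	return types
}
```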

Thoughts?

@thampiotr
Contributor Author

@rfratto I've updated the prototype with an implementation that infers the Metadata from arguments and exports. Less code and less error-prone, so I like it more :)

@ptodev
Contributor

ptodev commented Aug 14, 2023

@thampiotr - thank you so much for working on this! I've thought about the need for such a thing as well. With this solution we definitely help users more than we currently do, but when they actually try to wire up their components, I still think they will struggle with exactly how to do it.

For example, users may still not know how to point loki.source.file to a loki.process:

  • Do they set up this relationship in loki.source.file, or in loki.process?
    • How do they find out how to do this?
    • Different components do it differently. In otelcol, a component always sets up what the next component should be. But for example in discovery.file, there is an exported attribute which users are expected to specify inside a component like loki.source.file.
  • When they set up loki.source.file, should they say forward_to = loki.process.local.receiver? Or forward_to = [loki.process.local.receiver] ?
  • They may not know why exactly two components are listed in the docs as "linked". What is the attribute that links them?

Personally, what I think might be even more useful is if each type listed in a component's docs hyperlinked to a new page which lists the exported attributes of components that can supply this type. It would probably save people from having to figure out which attribute to pipe to which other attribute.

Also, a note on otelcol: many otelcol components accept logs, metrics, and traces, but some only accept a subset of these three signal types. So we would need three different sections listing which components can accept the outputs of an otelcol component.

@rfratto
Member

rfratto commented Aug 15, 2023

Thoughts?

There are some assumptions that can be made:

  • We're currently talking about passing around compatible capsule values between components.
  • Today, we exclusively use capsule values for event streams, where a component with a capsule in its exports receives data, and a component with a capsule in its arguments exports data.

I believe these two assumptions can be combined to automatically build a component compatibility list without having to define a top-level metadata package.

The most recent approach seems better. My overall concern is whether we should consider it out of scope for the component package to be aware of the different component namespaces; this may make Flow feel more rigid, such that adding new pipelines requires updating more code than it used to.

An additional twist on how you could implement this is to generate component schemas of the arguments and exports, and then build tooling on top of those schemas, such as generating the compatible-components documentation. This adds a layer of indirection over what you have now, but would allow the schemas to be used for other useful tools too, such as editors or config validators that don't import the project as a whole.
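A hedged sketch of what such a schema might look like, with hypothetical names; the docs generator (and, later, editors or validators) would consume the serialized form instead of importing the components themselves:

```go
package schema

// ComponentSchema is a hypothetical, serializable description of a component,
// generated from its arguments and exports and consumed by external tooling
// (docs generation, editors, config validators) without importing the project.
type ComponentSchema struct {
	Name      string        `json:"name"` // e.g. "loki.source.file"
	Arguments []FieldSchema `json:"arguments"`
	Exports   []FieldSchema `json:"exports"`
}

// FieldSchema describes a single argument or export field.
type FieldSchema struct {
	Name     string `json:"name"`     // config attribute name, e.g. "forward_to"
	Type     string `json:"type"`     // the field's data type, e.g. a targets or receiver type
	Required bool   `json:"required"` // whether the attribute must be set
}
```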

@thampiotr
Contributor Author

where a component with a capsule in its exports receives data, and a component with a capsule in its arguments exports data.

I want this to also be used with targets, though, which are pull-based in the config.

adding new pipelines requires updating more code than it used to.

I think we could come up with a convention using capsules (the way you describe above), make sure that Targets also work (they seem to be an exception?), and, if some capsules appear that we don't want included, use a marker interface to exclude them?
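For the exclusion idea, a minimal sketch of what such a marker interface might look like (hypothetical names):

```go
package compat

// ExcludeFromCompatibility is a hypothetical marker interface: capsule types
// implementing it would be skipped when building the compatibility lists.
type ExcludeFromCompatibility interface {
	excludeFromCompatibility()
}

// isExcluded reports whether a value's type opts out of the compatibility lists.
func isExcluded(v interface{}) bool {
	_, ok := v.(ExcludeFromCompatibility)
	return ok
}
```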

An additional twist for how you can implement this is to generate component schemas of arguments and exports, and then build tooling on top of those schemas, such as generating compatible component documentation.

That's a good idea! Even if this representation has only the fields we need for now, I think it would make sense to set up the foundations for it so it can grow in the future. I can look into this when we do the actual implementation.

@mattdurham mattdurham added the proposal-accepted label Aug 16, 2023
@thampiotr thampiotr self-assigned this Oct 24, 2023
@rfratto rfratto removed the type/core label Nov 2, 2023
@github-actions github-actions bot added the frozen-due-to-age label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024