Make designing pipelines easier by providing lists of compatible components #4754

Closed
Tracked by #5778
thampiotr opened this issue Aug 8, 2023 · 6 comments · Fixed by #5791
Assignees
Labels
frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.), proposal (Proposal or RFC), proposal-accepted (Proposal has been accepted.)

Comments

@thampiotr
Contributor

thampiotr commented Aug 8, 2023

Problem

When designing a telemetry pipeline, it is currently difficult to discover what components can connect to what other components.

Specifically, I believe we should focus on making it easier to design a telemetry pipeline - where we are concerned about a conceptually high-level data flow and transformations (e.g. discover files -> read logs -> add labels -> send logs to DB). This is in contrast to a lower-level details where component connections are used for further pipeline configuration (e.g. read a string from env variable and set it as a username argument).

Background

My initial thought was that a better naming convention that clearly differentiates between, for example, sources, transformers and sinks would alleviate this problem. However, it quickly led to rather rigid and verbose names while still leaving some confusion. Renaming nearly all of the components would also be a large breaking change.

Our current naming convention groups components into namespaces, which in most cases makes it easier to narrow down the set of components one needs to look at. However, identifying potential links between components is frequently challenging, and uncovering connections that span namespaces is even more difficult.

What makes it even harder to navigate is that there are two ways the data can flow into the component:

  • Directly passed by value through arguments - this is the case for e.g. Targets
  • Written to the component's receiver, which is one of its exported fields

Similarly, data can leave the component through an export or be written to another component's receiver that is passed to it as an argument. As a side note, this also makes the graphs in the UI somewhat confusing, since the data flow is not reflected by the direction of the arrows.
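For illustration, here is a minimal Go sketch of the two patterns, using simplified, hypothetical types (the real component interfaces in the codebase are more involved):

```go
package sketch

// Target, LogEntry, and LogsReceiver are simplified, hypothetical stand-ins
// for the real data types; they only illustrate the shapes involved.
type Target map[string]string
type LogEntry struct{ Line string }
type LogsReceiver chan LogEntry

// Pattern 1: data flows in by value through arguments.
// A discovery.relabel-style component takes targets as an argument and
// exports the rewritten targets as a value.
type RelabelArguments struct {
	Targets []Target // accepted directly, passed by value
}
type RelabelExports struct {
	Output []Target // produced as an exported value
}

// Pattern 2: data flows in by being written to an exported receiver.
// A loki.process-style component exports a receiver that upstream components
// write log entries to, and forwards entries to the receivers it was given.
type ProcessArguments struct {
	ForwardTo []LogsReceiver // data leaves by writing to these receivers
}
type ProcessExports struct {
	Receiver LogsReceiver // data enters when others write to this receiver
}
```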

Proposal

  1. Help to conceptually design pipelines (on a high level) by making it clear:

    • What data every component accepts
    • What data every component outputs

    For example:

    • discovery.kubernetes or discovery.gce - accept nothing and output targets
    • discovery.relabel - accepts targets and outputs targets
    • loki.source.file - accepts targets and outputs Loki logs
    • loki.process or loki.relabel - accepts Loki logs and outputs Loki logs
    • loki.write - accepts Loki logs and outputs nothing
  2. Use the above information to add an auto-generated section to every component's reference documentation page that will list:

    • Components that it can accept data from
    • Components that can accept the data from it

Prototype

There is a prototype available here: #4753

  • It defines in code the connections between components as described above
  • It has a test that verifies that docs include the auto-generated section for each component
  • It allows auto-updating the sections in the generated docs as new metadata is added
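The exact shape used in the prototype PR may differ, but conceptually the per-component metadata could be as simple as this hedged Go sketch (names and entries are illustrative, based on the examples in the proposal above, not taken from the PR):

```go
package metadata

// DataType identifies the kind of telemetry data a component consumes or produces.
type DataType string

const (
	Targets  DataType = "targets"
	LokiLogs DataType = "loki_logs"
)

// Metadata describes, at a high level, what a component accepts and outputs.
type Metadata struct {
	Accepts []DataType
	Outputs []DataType
}

// Illustrative entries matching the examples in the proposal above.
var components = map[string]Metadata{
	"discovery.kubernetes": {Outputs: []DataType{Targets}},
	"discovery.relabel":    {Accepts: []DataType{Targets}, Outputs: []DataType{Targets}},
	"loki.source.file":     {Accepts: []DataType{Targets}, Outputs: []DataType{LokiLogs}},
	"loki.process":         {Accepts: []DataType{LokiLogs}, Outputs: []DataType{LokiLogs}},
	"loki.write":           {Accepts: []DataType{LokiLogs}},
}

// AcceptorsOf returns the names of components that can accept data of type t -
// i.e. the list an auto-generated docs section would show for a producer of t.
func AcceptorsOf(t DataType) []string {
	var names []string
	for name, md := range components {
		for _, a := range md.Accepts {
			if a == t {
				names = append(names, name)
				break
			}
		}
	}
	return names
}
```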

Here's an example of what the generated docs look like:
[screenshot of the auto-generated docs section]

@rfratto
Member

rfratto commented Aug 8, 2023

I'm not opposed to this, and I appreciate that you wrote a prototype which includes generating the documentation, which I think is a requirement to help prevent the list from ever being stale.

I have concerns about the specific way the relationships are defined in the prototype (I'd want to find a way to do it using just the existing arguments/exports types instead of defining a new Metadata type), but overall I'm personally in favor of this proposal.

@thampiotr
Contributor Author

I'd want to find a way to do it using just the existing arguments/exports types instead of defining a new Metadata type

There are some challenges with that approach:

  1. There are two ways the data can flow into the component:

    • Directly passed by value through arguments - this is the case for e.g. Targets
    • Written to the component's receiver, which is one of its exported fields

    Similarly, data can leave the component through an export or be written to another component's receiver that is passed to it as an argument.

    But what we are interested in here is the conceptual data pipeline - what data a component accepts and what data it outputs.

  2. We want to focus on the telemetry data that helps build pipelines at a high level; we don't want to, for example, include components that can export a string or produce a string, because I think it would produce a ton of connections and make the lists hard to navigate.

  3. In theory, the exported values and arguments that we're interested in can be nested inside structs a few levels deep, which means we may need to write more complex reflection code to discover them.

I think I may be able to overcome these by implementing a method that infers the current prototype's Metadata from a component's arguments and exports automatically. We will need to classify carefully based on both exports and arguments (to address 1). We will only look for certain data types to keep the focus on high-level telemetry pipeline connections (to address 2). We may start with just simple top-level reflection and address 3 in the future.
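A hedged sketch of what that inference could look like, with hypothetical type names and only the top-level reflection described above:

```go
package compat

import "reflect"

// Hypothetical stand-ins for the concrete data types the inference looks for.
type Target map[string]string
type LogsReceiver chan string

var (
	targetsType   = reflect.TypeOf([]Target(nil))
	receiverType  = reflect.TypeOf(LogsReceiver(nil))
	receiversType = reflect.TypeOf([]LogsReceiver(nil))
)

// infer classifies a component from its Arguments and Exports values, looking
// only at top-level struct fields and only at the data types we care about.
func infer(args, exports interface{}) (accepts, outputs []string) {
	for _, ft := range topLevelFieldTypes(args) {
		switch ft {
		case targetsType:
			accepts = append(accepts, "targets") // targets passed in by value
		case receiversType:
			outputs = append(outputs, "loki_logs") // the component writes to these receivers
		}
	}
	for _, ft := range topLevelFieldTypes(exports) {
		switch ft {
		case targetsType:
			outputs = append(outputs, "targets") // targets exported by value
		case receiverType:
			accepts = append(accepts, "loki_logs") // other components write logs into this receiver
		}
	}
	return accepts, outputs
}

// topLevelFieldTypes returns the types of the top-level fields of a struct
// (or pointer to struct); deeper nesting is deliberately ignored for now.
func topLevelFieldTypes(v interface{}) []reflect.Type {
	t := reflect.TypeOf(v)
	for t != nil && t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t == nil || t.Kind() != reflect.Struct {
		return nil
	}
	types := make([]reflect.Type, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		types = append(types, t.Field(i).Type)
	}
	return types
}
```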

Thoughts?

@thampiotr
Contributor Author

@rfratto I've updated the prototype with an implementation that infers the Metadata from arguments and exports. Less code and less error-prone, so I like it more :)

@ptodev
Contributor

ptodev commented Aug 14, 2023

@thampiotr - thank you so much for working on this! I've thought about the need for such a thing as well. With this solution we definitely help users more than we currently do, but when they actually try to wire up their components, I still think they will struggle with exactly how to do it.

For example, users may still not know how to point loki.source.file to a loki.process:

  • Do they set up this relationship in loki.source.file, or in loki.process?
    • How do they find out how to do this?
    • Different components do it differently. In otelcol, a component always sets up what the next component should be. But for example in discovery.file, there is an exported attribute which users are expected to specify inside a component like loki.source.file.
  • When they set up loki.source.file, should they say forward_to = loki.process.local.receiver? Or forward_to = [loki.process.local.receiver] ?
  • They may not know why exactly two components are listed in the docs as "linked". What is the attribute that links them?

Personally, what I think might be even more useful is if each type listed in a component's docs hyperlinked to a new page which lists the exported attributes of components that can supply this type. It would probably save people from having to figure out which attribute to pipe to which other attribute.

Also, a note on otelcol: many otelcol components accept logs, metrics, and traces, but some only accept a subset of these three signal types. So we would need three different sections listing which components can accept the outputs of an otelcol component.

@rfratto
Member

rfratto commented Aug 15, 2023

Thoughts?

There are some assumptions that can be made:

  • We're currently talking about passing around compatible capsule values between components.
  • Today, we exclusively use capsule values for event streams, where a component with a capsule in its exports receives data, and a component with a capsule in its arguments exports data.

I believe these two assumptions can be combined to automatically build a component compatibility list without having to define a top-level metadata package.

The most recent approach seems better. My overall concern is whether we should consider it out of scope for the component package to be aware of the different component namespaces; this may make Flow feel more rigid, such that adding new pipelines requires updating more code than it used to.

An additional twist on how you could implement this is to generate component schemas of the arguments and exports, and then build tooling on top of those schemas, such as generating the compatible-components documentation. This adds a layer of indirection over what you have now, but would allow the schemas to be used for other useful tools too, such as editors or config validators that don't import the project as a whole.
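A hedged sketch of what such a schema might look like, with hypothetical names; the docs generator (and, later, editors or validators) would consume the serialized form instead of importing the components themselves:

```go
package schema

// ComponentSchema is a hypothetical, serializable description of a component,
// generated from its arguments and exports and consumed by external tooling
// (docs generation, editors, config validators) without importing the project.
type ComponentSchema struct {
	Name      string        `json:"name"` // e.g. "loki.source.file"
	Arguments []FieldSchema `json:"arguments"`
	Exports   []FieldSchema `json:"exports"`
}

// FieldSchema describes a single argument or export field.
type FieldSchema struct {
	Name     string `json:"name"`     // config attribute name, e.g. "forward_to"
	Type     string `json:"type"`     // the field's data type, e.g. a targets or receiver type
	Required bool   `json:"required"` // whether the attribute must be set
}
```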

@thampiotr
Contributor Author

where a component with a capsule in its exports receives data, and a component with a capsule in its arguments exports data.

I want this to also be used with targets, though, which are pull-based in the config.

adding new pipelines requires updating more code than it used to.

I think we could come up with a convention using capsules (the way you describe above), make sure that Targets also work (they seem to be an exception?), and, if some capsules appear that we don't want included, use a marker interface to exclude them?
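For the exclusion idea, a minimal sketch of what such a marker interface might look like (hypothetical names):

```go
package compat

// ExcludeFromCompatibility is a hypothetical marker interface: capsule types
// implementing it would be skipped when building the compatibility lists.
type ExcludeFromCompatibility interface {
	excludeFromCompatibility()
}

// isExcluded reports whether a value's type opts out of the compatibility lists.
func isExcluded(v interface{}) bool {
	_, ok := v.(ExcludeFromCompatibility)
	return ok
}
```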

An additional twist for how you can implement this is to generate component schemas of arguments and exports, and then build tooling on top of those schemas, such as generating compatible component documentation.

That's a good idea! Even if this representation has only the fields we need for now, I think it would make sense to set up the foundations for it so it can grow in the future. I can look into this when we do the actual implementation.

@mattdurham mattdurham added the proposal-accepted label Aug 16, 2023
@thampiotr thampiotr self-assigned this Oct 24, 2023
@rfratto rfratto removed the type/core label Nov 2, 2023
@github-actions github-actions bot added the frozen-due-to-age label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024