Proposal: Pipelines V2 #1679
Comments
Sub Proposal 1
Based on the simpler spec for a compose transform (#1653) we introduce a concept of component copies (versus references). This allows us to refer to the configuration of a component but create a copy for our pipeline rather than mutate the
Next, we expand the
If a pipeline does not follow this spec then we are able to deliver a clear error message. With this spec I'm fairly confident that we can support all of the same topologies as we currently do without any unintended side effects. However, if we decide to go ahead with this proposal we need to investigate further.

Remaining Issues
I still feel as though this spec is somewhat hostile to new users. This is mostly down to the fact that you're looking at a linear list of component names as if they're all equivalent:

[pipelines]
p1 = [ "src2", "tfm1", "tfm2" ]
p2 = [ "src1", "p1", "tfm3", "snk1" ]

Whereas in reality you're looking at a combined list of three different element types, more clearly represented as:

[pipelines.p1]
inputs = [ "src2" ]
transforms = [ "tfm1", "tfm2" ]
[pipelines.p2]
inputs = [ "src1", "p1" ]
transforms = [ "tfm3" ]
outputs = [ "snk1" ]

(NOTE: I'm NOT suggesting this as a spec)

We're pushing the spec in favor of writing speed at the cost of readability, and I'm not 100% convinced the sacrifices we're making aren't going to sting new users trying to grok Vector configs (i.e. does |
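To make the "three different element types" point concrete, here is a minimal Python sketch that takes the flat lists from the example and sorts each name into a tier. The `KINDS` registry and the `split_tiers` helper are hypothetical names invented for this illustration, and the rule that a pipeline name counts as an input is an assumption; none of this is part of Vector.

```python
# Hypothetical component registry: the names come from the example above,
# the "kind" labels are an assumption made for this sketch.
KINDS = {
    "src1": "source", "src2": "source",
    "tfm1": "transform", "tfm2": "transform", "tfm3": "transform",
    "snk1": "sink",
}

PIPELINES = {
    "p1": ["src2", "tfm1", "tfm2"],
    "p2": ["src1", "p1", "tfm3", "snk1"],
}


def split_tiers(pipeline, pipelines):
    """Split a flat pipeline list into inputs, transforms and outputs."""
    tiers = {"inputs": [], "transforms": [], "outputs": []}
    for name in pipeline:
        if name in pipelines or KINDS.get(name) == "source":
            # Sources and upstream pipelines both feed data in.
            tiers["inputs"].append(name)
        elif KINDS.get(name) == "transform":
            tiers["transforms"].append(name)
        elif KINDS.get(name) == "sink":
            tiers["outputs"].append(name)
        else:
            raise ValueError(f"unknown component: {name}")
    return tiers


print(split_tiers(PIPELINES["p2"], PIPELINES))
```

The point of the sketch is that the reader of the flat list has to do this classification in their head, using knowledge (the registry) that is not visible at the point where the pipeline is declared.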
Sub Proposal 2
Exactly the same as proposal 1 except we make it more explicit:
The change being that source and sink lists within a pipeline must themselves be in an array. The purpose of this requirement is purely for the sake of distinguishing the tiers:

[pipelines]
p1 = [ [ "src2" ], "tfm1", "tfm2" ]
p2 = [ [ "src1", "p1" ], "tfm3", [ "snk1" ] ]

Advantages
One key technical advantage over proposal 1 is that because we are explicitly declaring which components are inputs and which are simply transformations of the pipeline, we are now able to specify a transform as an input (and therefore a reference). This makes it possible to add the pipeline syntax into existing configs with transforms in the topology. From the usability perspective a user familiar with the

Remaining Issues
This syntax still doesn't provide a full picture, but merely a hint of what's going on. Adding brackets also adds more opportunities for typos to break the topology. There's also the (unlikely) problem of pipelines that are only a list of sinks. Imagine if we were to create a group of sinks that all want to consume data from the same range of pipelines. For convenience we might group them in their own pipeline with something like:

[pipelines]
p1 = [ [ "src1" ], "tfm1", "tfm2" ]
p2 = [ [ "src2" ], "tfm3" ]
p3 = [ [ "snk1", "snk2", "snk3" ] ]
p4 = [ [ "p1" ], [ "p3" ] ]
p5 = [ [ "p2" ], "tfm4", [ "p3" ] ]

There's a more concise way of expressing these pipelines, but assuming that this were the best way to structure it then |
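As a rough illustration of how the bracketed tiers might be read, the following Python sketch applies a purely positional rule: a leading array is the input tier, a trailing array is the output tier, and bare strings in between are transforms. Both the rule and the `parse_tiers` name are assumptions made for this sketch, not a spec from the proposal.

```python
def parse_tiers(pipeline):
    """Split a Sub Proposal 2 style list into (inputs, transforms, outputs).

    Assumed positional rule: a leading array is the input tier, a trailing
    array is the output tier, and bare strings in between are transforms.
    """
    items = list(pipeline)
    inputs, outputs = [], []
    if items and isinstance(items[0], list):
        inputs = items.pop(0)
    if items and isinstance(items[-1], list):
        outputs = items.pop()
    for item in items:
        if not isinstance(item, str):
            raise ValueError(f"nested array in transform position: {item!r}")
    return inputs, items, outputs


p2 = [["src1", "p1"], "tfm3", ["snk1"]]
p3 = [["snk1", "snk2", "snk3"]]

print(parse_tiers(p2))  # (['src1', 'p1'], ['tfm3'], ['snk1'])
print(parse_tiers(p3))  # (['snk1', 'snk2', 'snk3'], [], [])
```

Under this assumed rule the lone array in `p3` is read as the input tier even though it lists sinks, so a sinks-only pipeline needs either knowledge of component types or an extra convention to be read correctly.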
Sub Proposal 3
Roughly the same as sub proposal 2 except in a structured format, with three fields:

[pipelines.p1]
inputs = [ "src1" ]
pipe = [ "tfm1", "tfm2" ]
outputs = [ "snk1" ]

The name

[pipelines]
p1.pipe = [ "tfm1", "tfm2" ]
p2.inputs = [ "src1" ]
p2.pipe = [ "tfm3", "tfm4" ]
p3.inputs = [ "src2", "p2" ]
p3.pipe = [ "p1", "tfm5" ]
p3.outputs = [ "snk1" ]

Other name candidates are

Advantages
This has all of the advantages of proposal 2 along with clear naming in order to distinguish the three tiers of the pipeline even further. A new user not necessarily familiar with pipeline syntax is likely able to fully comprehend the topology expressed here.

Remaining Issues
It's more words. |
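One way to picture what the three fields buy us is to derive the per-component inputs they imply. The Python sketch below does this for a single pipeline, assuming a simple linear chain: `inputs` feed the first entry of `pipe`, each `pipe` entry feeds the next, and the last stage feeds every entry in `outputs`. The `derive_inputs` helper is hypothetical, and it deliberately ignores nested pipeline names and the copy/reference question.

```python
def derive_inputs(pipeline):
    """Return {component: [upstream names]} for one structured pipeline.

    Assumes a linear chain: inputs -> pipe[0] -> pipe[1] -> ... -> outputs.
    """
    upstream = list(pipeline.get("inputs", []))
    derived = {}
    for name in pipeline.get("pipe", []):
        derived[name] = upstream
        upstream = [name]
    for name in pipeline.get("outputs", []):
        # Every output consumes from the last stage of the chain.
        derived[name] = list(upstream)
    return derived


p1 = {"inputs": ["src1"], "pipe": ["tfm1", "tfm2"], "outputs": ["snk1"]}
print(derive_inputs(p1))
# {'tfm1': ['src1'], 'tfm2': ['tfm1'], 'snk1': ['tfm2']}
```

Because each tier is named, the derivation needs no knowledge of component types, which is the readability gain this proposal is after.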
Sub Proposal 4
I'd like to throw another proposal into the mix. One that explicitly uses

[pipelines]
p1 = ["tfm1", ["tfm2", "tfm3"], "tfm4"]
p2 = ["&src1", "p1", "tfm3", "&snk1"]
p3 = ["&src2", "tfm1", "&snk2"]

Identifiers and observability
It's worth noting that a copied component will get a unique ID that is used in logs, metrics, etc.

Advantages

Remaining issues
I dislike exposing the pointer/copy syntax at all to the user, but these are developers and I don't think this concept is too advanced. Alternatively, we could just "make it work" by assuming users want to copy transforms and reference sources/sinks. |
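A small Python sketch of how the pointer/copy distinction could be resolved, reading the `&` prefix as "reference the shared global component" and a bare name as "make a copy for this pipeline". The `resolve` helper and the `<pipeline>.<component>` ID scheme are assumptions invented for illustration; the thread does not say what the unique IDs would actually look like. Nested arrays and pipeline names are left out of scope.

```python
def resolve(pipeline_name, items):
    """Resolve each flat entry to a (component_id, is_copy) pair."""
    resolved = []
    for item in items:
        if item.startswith("&"):
            # Reference: points at the shared, globally defined component.
            resolved.append((item[1:], False))
        else:
            # Copy: gets its own ID (hypothetical "<pipeline>.<name>" scheme).
            resolved.append((f"{pipeline_name}.{item}", True))
    return resolved


print(resolve("p3", ["&src2", "tfm1", "&snk2"]))
# [('src2', False), ('p3.tfm1', True), ('snk2', False)]
```

With copies, two pipelines that both name `tfm1` end up with distinct component IDs, so events flowing through one pipeline cannot leak into the other through a shared transform.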
These are all really interesting! I appreciate the time and effort spent trying to munge TOML into a useful graph language 😄 My biggest question around all of these proposals is whether we're making our TOML complex enough that we lose the benefits of using TOML in the first place (simplicity, familiarity, etc). Because if that's the case, we'll end up with the worst of both worlds: an awkward and unnatural language for expressing graphs, and a config format that's difficult for new users to pick up. I know writing our own config language is the nuclear option, but it is at least a valuable strawman to compare these proposals against. |
Having written a config language for something like this, I would strongly advise against "the nuclear option". I found it far preferable to embed a scripting language (like Lua or JS, since we already use them) and let it deal with the complexities. |
Ideally, I'd like to avoid conflating the format of our config (TOML, YAML, DOT, custom, etc) with the structure (

For example, we could explore DOT (#1699) as an alternative to pipelines. However, in terms of structure it actually puts us in the same situation as the original pipelines spec, where we need to add more syntax (or assumptions) on top in order to distinguish between references/copies of a subgraph, otherwise we can't support snippet reuse. This digression leads us into the exploration of syntax alone which I don't think is helpful unless we're committed to a certain structure.

Vector components aren't generalised nodes on a graph, they have different types (source, transform, sink), which each have their own rules. So when we create a structure for expressing chains of components we need to take that into account somehow. We also want to support snippet reuse without causing unexpected side effects.

If we can defer the decision of our config format then it allows us to choose the right structure for Vector, and then afterwards select a format that suits it well, instead of confusing the two and using one as a crutch for the other.

With that said I think it's worth doing a review of the structure concepts we currently have so that we're not comparing apples with oranges. I'm picking arbitrary names for these:

Flattened
This is what we currently have. Each component is defined globally and selects the global siblings it wishes to consume from. This results in a flat list of components where the way in which they interact isn't immediately clear, and changing that often requires editing multiple places, giving ample opportunity for errors.

The compose transform proposal (#1653) is an attempt to mitigate some of the pain points of writing and maintaining lists of transforms with this structure, but is a complement to the spec rather than a solution.

Advantages
Disadvantages

Pipelines
Stemming from the pipelines proposals, taking a lot of inspiration from graph syntaxes. Topologies are defined as linkable lists of component names. This allows the definition of complex graphs from linear arrays, making them easy to parse for both humans and machines.

Advantages
Disadvantages
Hierarchical
This is something we haven't really explored yet as it's pretty much the opposite of the existing flattened structure, and is therefore the most extreme change. In a hierarchical structure there aren't necessarily any global components, just pipelines themselves, where each one specifies its sources, transforms, and sinks:

pipeline:
  sources:
    - type: foo
      some: field
    - type: bar
      some_other: dumb_field
  transforms:
    - type: a_thing
      do_it: "like this"
    - type: a_fork
      if: "field.type in [ doc, article, comment ]"
      then:
        - type: do_this
          wat: "this is another transform"
      else:
        - type: do_this_instead
          huh: "this is yet another transform"
  sinks:
    - type: baz
    - type: shared_channel
      called: foo

Pipelines can be linked to each other, which is how we might decide to handle content based multiplexing:

pipeline:
  sources:
    - type: shared_channel
      which: foo
  transforms:
    - type: remove_stuff_i_dont_want
      like: "field.message contains 'nah m8'"
  sinks:
    - type: boo

Note that this may seem very similar to sub proposal 3, but in fact it also requires the ability to inline transforms in order to have forked processing. This also means transforms themselves as part of their spec need to be able to define their children, so in reality this is still a far cry from pipelines.

Advantages
Disadvantages
|
Just noting, that we've decided to defer this change, once again, because it is not obvious that this is a clear win. A couple of reasons:
It'll be obvious a few months from now if we want to do this. It should continue to pop up in conversations. |
After implementing a pipeline longer than "hello-world" (2 sources, 8 transforms), I can confirm that this proposal looks very promising. |
Thanks @anton-ryzhov, I'm curious, which one of the syntaxes would you prefer? Or do you have a different proposal? |
Closing via #4427 |
We've already had a proposal for a new pipelines directive (#1447) which allows a user to simply list component names in order to create a topology:
And based on these pipelines the inputs of each component would be implicitly populated in order to create the described topology.

Problems
Leaks
With the original spec the above snippet would create unexpected side effects:
src1 would leak into snk2.
src2 would leak into tfm2, tfm3 and snk1.

Multiplexing
One of the key strengths of our current spec is that it supports a wide range of topologies whilst retaining a flat configuration spec. If a new spec is to replace the current inputs way of life then it needs to support multiplexing sources and sinks.

However, expanding pipelines to support multiple inputs opens us up for situations where there's no clear behavior to expect. Given the following config:
It would make sense for Vector to construct the following topology:
But it could also look like this:
Or this:
Even if we have a clear spec, can we expect a user seeing this config for the first time to grok it? Multiple sinks have the same issue.
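The ambiguity can be illustrated with a small Python sketch. It takes a hypothetical pipeline with two sources and two transforms (not the config from this issue, which is not reproduced here) and builds the edge list under two equally plausible readings. Both readings and both helper names are assumptions chosen for illustration.

```python
def fan_in(sources, transforms):
    """Reading A: every source feeds the first transform, then a chain."""
    edges = [(s, transforms[0]) for s in sources]
    edges += list(zip(transforms, transforms[1:]))
    return edges


def pairwise(sources, transforms):
    """Reading B: the i-th source feeds the i-th transform."""
    return list(zip(sources, transforms))


sources, transforms = ["src1", "src2"], ["tfm1", "tfm2"]

print(fan_in(sources, transforms))
# [('src1', 'tfm1'), ('src2', 'tfm1'), ('tfm1', 'tfm2')]
print(pairwise(sources, transforms))
# [('src1', 'tfm1'), ('src2', 'tfm2')]
```

The same four names yield two different topologies, so unless the spec pins down one reading and the syntax makes that reading visible, a first-time reader has no way of telling which graph a config describes.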