
chore(schema): schema support RFC #9388

Closed

Conversation

JeanMertz (Contributor)

closes #3910

👀 RENDERED

Signed-off-by: Jean Mertz <[email protected]>
netlify bot commented Sep 30, 2021

✔️ Deploy Preview for vector-project canceled.

🔨 Explore the source changes: 67592b8

🔍 Inspect the deploy log: https://app.netlify.com/sites/vector-project/deploys/6155c91659cfb700079f0b23

Comment on lines +159 to +161
When a source is configured to drop invalid events, and an event does not match
the schema, the event won't be allowed to proceed to the next component in the
pipeline.
Contributor

Does validation only happen at the source layer? Could a user remap events to break the schema contract before events reach a sink that expects a guaranteed shape?

EDIT: I believe this is answered by "merging" of schemas as events flow through the topology.
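
To make the quoted drop behavior concrete, here is a minimal sketch of what such a source configuration might look like. The `schema.in.*` keys and the `on_mismatch` knob are assumptions for illustration, not syntax taken from the RFC:

```toml
# Hypothetical sketch: a source that validates incoming events against a
# schema and drops events that do not match, so they never proceed to the
# next component in the pipeline.
[sources.app_logs]
type = "http"
address = "0.0.0.0:8080"

# Assumed keys: attach a schema to events entering the topology here.
schema.in.type = "custom"
schema.in.file = "./my_schema.json"

# Assumed knob: drop invalid events rather than forwarding them.
schema.in.on_mismatch = "drop"
```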

Comment on lines +328 to +331
Currently, these transforms take either the `vrl` condition type, or the
deprecated `check_fields` type. After implementing this RFC, a new `schema` type
is added, which allows applying conditional logic based on the schema attached
to each individual event.
Contributor

Could a schema here be as generic as JSON?

@binarylogic (Contributor) commented Sep 30, 2021

No, that's not a schema and should be validated upstream, ideally with the codec option.
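
For illustration, a minimal sketch of what the proposed `schema` condition type could look like on the `route` transform; the `id` and `file` parameters are assumptions, not settled syntax:

```toml
# Hypothetical sketch of schema-based routing with the proposed `schema`
# condition type (alongside the existing `vrl` and deprecated
# `check_fields` types).
[transforms.by_schema]
type = "route"
inputs = ["app_logs"]

# Route events carrying a well-known schema one way...
route.syslog.type = "schema"
route.syslog.id = "syslog"              # assumed parameter

# ...and events matching a custom schema another way.
route.custom.type = "schema"
route.custom.file = "./my_schema.json"  # assumed parameter
```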

@jszwedko (Member) left a comment

Nice work, this seems like it'll solve some common user pain points. For the schema definition, did you have thoughts on what we would use? I saw JSON-schema mentioned below, do you mean https://json-schema.org/ ?

Comment on lines +185 to +193
#### Dead-Letter Pipeline

In the future, dropped events can be picked up by a [dead-letter pipeline][dlq],
but this is out-of-scope for this RFC.

In a sense, accepting invalid events, combined with the `route` transform,
already lets you manually implement a dead-letter pipeline.

[dlq]: https://github.com/vectordotdev/vector/issues/1772
Member

I think the work Luke is doing to enable transform dead-letter queues should be landing soon, so this RFC could simply use that concept as the preferred approach (extending it to sources). I realize it is likely out of scope for this particular RFC, but I don't think we should implement source schema support without it, only to quickly add it later and need to make a backwards-incompatible change.

We could list it as a dependency for this RFC and have a separate RFC for it.

Member

Yeah, this should fall out naturally from #9169. The features are mostly orthogonal, so I don't think it requires much discussion here (see #9417 for an example). The one exception would be having different outputs "typed" differently, which we don't currently support but plan to.

Contributor (Author)

Agreed. I wasn't aware that work was as far along as it is. I'll update the RFC to reference that work, and make sure the implementation itself ties into it as well.

Comment on lines +281 to +283
# Ensure any event produced by this transform adheres to my "custom" schema.
schema.out.type = "custom"
schema.out.file = "./my_schema.json"
Member

I think we could allow inline definition too. This might be related to a general user desire to "include" other files in a Vector config.
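
A minimal sketch of what an inline definition might look like, building on the file-based example above; the `schema.out.inline` key is an assumption:

```toml
[transforms.my_transform]
type = "remap"
inputs = ["app_logs"]
source = ". = parse_json!(string!(.message))"

# Ensure any event produced by this transform adheres to my "custom" schema,
# defined inline instead of in a separate file (hypothetical key).
schema.out.type = "custom"
schema.out.inline = '''
{
  "type": "object",
  "required": ["message", "timestamp"],
  "properties": {
    "message":   { "type": "string" },
    "timestamp": { "type": "string", "format": "date-time" }
  }
}
'''
```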

based on the schema. For example, any condition-based transform can use the
schema ID to pass or fail the condition (e.g. schema-based routing).

### Schema Conversion (TODO)
Member

I am wondering what this means exactly. It seems like it could be something we leave for later. I like what you've outlined above as an initial step.

Contributor (Author)

> I am wondering what this means exactly.

In what sense?

To give a bit more detail (which I'll add to the docs as well): we currently already do "schema conversions" by shaping certain known event fields to match whatever the system at the other end of a sink expects. For example, if a system expects timestamps in a "ts" field using a specific format, we do that conversion implicitly, using a hard-coded list of fields that we know are often required by sinks in differing formats.

With schemas and schema conversions, this would no longer have to be special-cased; instead, we explicitly define conversions from one schema to another and perform them within the sink. This lets us add more fields as part of the conversion, be more correct in whatever conversion needs to happen, and stop at boot time if the pipeline expects a conversion from schema A to B that the sink does not support.

Member

I agree that there's more to work out here, at least from my perspective. Simply having a JSON Schema for an event doesn't provide the same semantic meaning of "this is the timestamp field" as the existing conversions, so I imagine we'd still need some mechanism to identify those well-known fields. Maybe that's what you mean by "explicitly define schema conversions", but it's not totally clear.

Contributor (Author)

I agree, and I'll flesh this out more. Your hunch is also correct: there will be a way (to be determined) to write schema conversion logic, which would allow Vector to know that a field `ts` in one schema refers to the commonly used "event timestamp" field, while in another schema that field is named `@timestamp`. It would also embed information w.r.t. the type of the field, and how to convert from one to another (e.g. RFC 3339 to Unix timestamp).

There is more to be decided here, and it definitely needs more TLC in the RFC.
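
To sketch the idea under discussion: explicit conversion logic might declare that `ts` in one schema and `@timestamp` in another both denote the event timestamp, along with how to convert between representations. None of these keys exist; they only illustrate the shape such logic could take:

```toml
# Hypothetical sketch of explicit schema-conversion logic.
[schemas.conversions.app_to_elastic]
from = "app_schema"
to = "elastic_schema"

# Both fields denote the semantic "event timestamp"; rename the field and
# convert its representation (e.g. Unix timestamp to RFC 3339).
fields.timestamp.from = "ts"
fields.timestamp.to = "@timestamp"
fields.timestamp.convert = { from = "unix", to = "rfc3339" }
```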

@JeanMertz (Author)

> For the schema definition, did you have thoughts on what we would use? I saw JSON-schema mentioned below, do you mean json-schema.org?

Good point. Yeah, I had JSON schema in mind, but (1) I'm open to alternatives, and (2) we might want to allow other formats in the future regardless.

But either way, I should add a chapter on the choice we make here.

@StephenWakely (Contributor)

Nice.

It feels like schemas could very easily be a case of pulling the `type_def` work we have in VRL and adding it to Vector's components. It could be that by leveraging this, we wouldn't have much additional work to do.

Rather than having sources that check defined schemas, I wonder if we could just create a VRL function that validates against a given schema. This would be similar to the `string!`, `float!`, etc. functions that we already have, but would allow much more comprehensive specifications (maybe the function could even specify a file containing a JSON schema). This could be used to configure each route in the `route` transform. Since VRL has validated the schema, it would automatically be passed along the route.

So, a source that doesn't create a known schema for an event would just output `Any`. If the user doesn't use a sink or transform that requires a schema, they wouldn't need any additional configuration.
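
A minimal sketch of the suggestion above. The `validate_schema!` function does not exist in VRL; it's assumed here purely to illustrate how schema validation could ride on VRL's existing fallibility and type checking:

```toml
[transforms.enforce_schema]
type = "remap"
inputs = ["app_logs"]
source = '''
# Hypothetical fallible function, in the spirit of `string!` and `float!`:
# aborts when the event does not match the schema, so everything downstream
# of this transform can rely on the validated shape.
. = validate_schema!(., "./my_schema.json")
'''
```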


To do this, they can do one of two things.

1. For sources with a **known** schema, Vector guarantees the schema of each
Member

It might be implicit from your discussion of JSON Schema below, but just to be clear, does a "known" schema imply that we know every field and its type? Or just that we guarantee the presence of some keys with a certain type, and there may be additional unknown fields. It seems that the latter will be common, so probably worth supporting directly.

@JeanMertz (Author) commented Oct 7, 2021

Good question, and yes, I left that somewhat implicit based on the first iteration using JSON schema.

Whether the schema is "closed" (i.e. exact) or "open" should probably be left to the format we use to define such a schema. I'm starting with JSON schema because it's well known and commonly used, and in that schema the `additionalProperties` keyword would be used to define this.

However, I also envision us adding a simpler Vector-based schema format at some point, similar to our deprecated "check_fields" condition, in which you'd define `foo.bar.exists = true`, `foo.baz.type = ["string"]`, etc. In that case, we'd probably want the schema to remain open by default.
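
A sketch of what that simpler, `check_fields`-style format might look like, mirroring the examples in the comment; the `[schemas.*]` table and key names are assumptions, and fields not listed would be allowed since the schema stays open by default:

```toml
# Hypothetical Vector-native schema format, open by default.
[schemas.my_simple_schema]
"foo.bar".exists = true
"foo.baz".type = ["string"]
"timestamp".type = ["timestamp"]
```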


There are multiple ways this transform ties into the proposed schema support.

##### Type Checking Integration
Member

It might be worth noting that this is a logical extension of the existing `DataType`-based component-level typing we do today. Just significantly more precise than log vs. metric.


##### Output Schema Validation

Additionally, if the operator specified the required output schema of the
transform, Vector uses VRL's type checking to ensure that whatever the VRL
@vladimir-dd (Contributor) commented Oct 21, 2021

(Probably obvious, but maybe worth calling out.) I guess we can only check the types of the output event fields, but not their presence or absence (e.g. when we append a field based on some runtime conditional check)? Are we going to introduce the notion of required (with an optional default value if missing) vs. optional fields into the schema at all?
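
For illustration, one shape such required/optional support could take, extending the `schema.out` example from earlier in the thread; the `required` and `defaults` keys are assumptions raised by the question above, not part of the RFC:

```toml
[transforms.my_transform]
type = "remap"
inputs = ["app_logs"]
source = '''
# This field is only appended under a runtime condition, so static type
# checking alone cannot prove its presence.
if .level == "debug" { .verbose = true }
'''

schema.out.type = "custom"
schema.out.file = "./my_schema.json"

# Assumed keys: fail at boot if `message` cannot be proven present, and
# fill `verbose` with a default when it is missing at runtime.
schema.out.required = ["message"]
schema.out.defaults.verbose = false
```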

@tobz (Contributor) commented Oct 29, 2021

Likely related to #8494 as well.

bits-bot commented Nov 3, 2021

CLA assistant check
All committers have signed the CLA.

@tobz tobz added the RFC label Nov 4, 2021
@tobz tobz mentioned this pull request Nov 5, 2021
@jszwedko (Member) commented Feb 8, 2022

@JeanMertz Seemingly we should close this, yes? I know we ended up shifting direction after this was opened.

@jszwedko jszwedko closed this Apr 1, 2022
@jszwedko (Member) commented Apr 1, 2022

Closing since we used the brief instead.
