chore(schema): schema support RFC #9388
Conversation
> When a source is configured to drop invalid events, and an event does not match the schema, the event won't be allowed to proceed to the next component in the pipeline.
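For concreteness, source-side validation could look something like the sketch below. Everything here is an assumption: the `schema.in.*` namespace mirrors the `schema.out.*` options quoted later in this thread, and the `on_invalid` knob is invented for illustration.

```toml
[sources.my_source]
type = "http"
address = "0.0.0.0:8080"

# Hypothetical options: validate incoming events against a schema and
# drop the ones that fail, per the quoted behavior above.
schema.in.type = "custom"
schema.in.file = "./my_schema.json"
schema.in.on_invalid = "drop"
```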
Does validation only happen at the source layer? Could a user remap events to break the schema contract before events reach the sink which is expecting a guaranteed shape?

EDIT: I believe this is answered by "merging" of schemas as events flow through the topology.
> Currently, these transforms take either the `vrl` condition type, or the deprecated `check_fields` type. After implementing this RFC, a new `schema` type is added, which allows applying conditional logic based on the schema attached to each individual event.
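For illustration, a condition of the proposed `schema` type might look like the sketch below, e.g. in a `route` transform. The `id` parameter and the schema names are assumptions, not part of the RFC as quoted.

```toml
[transforms.by_schema]
type = "route"
inputs = ["my_source"]

# Today's condition types are `vrl` and the deprecated `check_fields`;
# `schema` is the proposed addition. Parameter names are assumed.
[transforms.by_schema.route]
apache = { type = "schema", id = "apache_common" }
syslog = { type = "schema", id = "syslog" }
```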
Could a schema here be as generic as `json`?
No, that's not a schema and should be validated upstream, ideally with the `codec` option.
Nice work, this seems like it'll solve some common user pain points. For the schema definition, did you have thoughts on what we would use? I saw JSON-schema mentioned below, do you mean https://json-schema.org/ ?
> #### Dead-Letter Pipeline
>
> In the future, dropped events can be picked up by a [dead-letter pipeline][dlq], but this is out-of-scope for this RFC.
>
> In a sense, accepting invalid events, combined with the `route` transform, already lets you manually implement a dead-letter pipeline.
>
> [dlq]: https://github.com/vectordotdev/vector/issues/1772
I think the work Luke is doing to enable transform dead-letter queues should be landing soon, so this RFC could just use that concept as the preferred approach (extending it to sources). I realize it is likely out of scope for this particular RFC, but I don't think we should implement source schema support without it, only to quickly add it later and need to make a backwards-incompatible change.

We could list it as a dependency for this RFC and have a separate RFC for it.
Agreed. I wasn't aware that work was as far along as it is. I'll update the RFC to reference that work, and make sure the implementation itself ties into it as well.
> ```toml
> # Ensure any event produced by this transform adheres to my "custom" schema.
> schema.out.type = "custom"
> schema.out.file = "./my_schema.json"
> ```
I think we could allow inline definition too. This might be related to a general user desire to "include" other files in a Vector config.
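A minimal sketch of what an inline definition could look like, assuming a hypothetical `schema.out.definition` key alongside the quoted `type`/`file` options (the key name is invented for illustration):

```toml
schema.out.type = "custom"
# Hypothetical inline alternative to `schema.out.file`.
schema.out.definition = '''
{
  "type": "object",
  "required": ["message", "timestamp"],
  "properties": {
    "message":   { "type": "string" },
    "timestamp": { "type": "string", "format": "date-time" }
  }
}
'''
```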
> based on the schema. For example, any condition-based transform can use the schema ID to pass or fail the condition (e.g. schema-based routing).
>
> ### Schema Conversion (TODO)
I am wondering what this means exactly. It seems like it could be something we leave for later. I like what you've outlined above as an initial step.
> I am wondering what this means exactly.
In what sense?
To give a bit more detail (which I'll add to the docs as well): currently, we already do "schema conversions" by shaping certain known event fields to match whatever the system at the other end of a sink expects. For example, if a system expects timestamps in a "ts" field using a specific format, we do that conversion implicitly. We do this using a list of hard-coded fields that we know are often required by sinks, and require different formats.

With schemas, and schema conversions, this would no longer have to be special-cased. Instead, we explicitly define schema conversions from one schema to another, and then do the conversion within the sink. This allows us to add more fields as part of the conversion, be more correct in whatever conversion needs to happen, and stop at boot time if conversion from schema A to B is expected by the pipeline but unsupported by the sink.
I agree that there's more to work out here, at least from my perspective. Simply having a JSON Schema for an event doesn't provide the same semantic meaning of "this is the timestamp field" as the existing conversions, so I imagine we'd still need some mechanism to identify those well-known fields. Maybe that's what you mean by "explicitly define schema conversions", but it's not totally clear.
I agree, I'll flesh this out more. Your hunch is also correct: there will be a way (to be determined) to write schema conversion logic, which would allow Vector to know that a field `ts` in one schema refers to the commonly used "event timestamp" field, while in another schema that field is named `@timestamp`. It would also embed information w.r.t. the type of the field, and how to convert from one to another (e.g. RFC3339 to Unix timestamp).
There is more to be decided here, and it definitely needs more TLC in the RFC.
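As a rough illustration of the idea, a conversion definition might eventually look something like the sketch below. The syntax, option names, and schema names are all assumptions, extrapolated purely from the `ts` / `@timestamp` example above:

```toml
# Hypothetical: declare how to convert events from one schema to another.
[schemas.conversions.syslog_to_datadog]
from = "syslog"
to   = "datadog_logs"

# "ts" in the source schema is the well-known event timestamp, which the
# target schema names "@timestamp" and expects as a Unix timestamp.
[[schemas.conversions.syslog_to_datadog.fields]]
from    = "ts"
to      = "@timestamp"
convert = "rfc3339_to_unix"
```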
Good point. Yeah, I had JSON Schema in mind, but (1) I am open to alternatives, and (2) we might want to allow other formats in the future regardless. Either way, I should add a chapter on the choice we make here.
Nice. It feels like schemas could very easily be a case of pulling the …

Rather than having sources that check defined schemas, I wonder if we could just create a VRL function that validates against a given schema. This would be similar to the …

So, a source that doesn't create a known schema for an event would just output `Any`. If the user doesn't use a sink or transform that requires a schema, they wouldn't even need to do any additional configuration.
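The VRL-function idea might look like the sketch below inside a `remap` transform. Note that `validate_schema` is the hypothetical function being floated here; it does not exist in VRL:

```toml
[transforms.validate]
type = "remap"
inputs = ["my_source"]
# `validate_schema` is hypothetical; `abort` is existing VRL and drops
# the event (subject to the transform's abort handling).
source = '''
if !validate_schema(., "./my_schema.json") {
  abort
}
'''
```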
> To do this, they can do one of two things.
>
> 1. For sources with a **known** schema, Vector guarantees the schema of each
It might be implicit from your discussion of JSON Schema below, but just to be clear: does a "known" schema imply that we know every field and its type, or just that we guarantee the presence of some keys with a certain type, and there may be additional unknown fields? It seems that the latter will be common, so probably worth supporting directly.
Good question, and yes, I left that somewhat implicit based on the first iteration using JSON Schema.

Whether the schema is "closed" (i.e. exact) or "open" should probably be left to the format we use to define such a schema. I'm starting with JSON Schema because it's known and commonly used, and in that schema, the `additionalProperties` keyword would be used to define this.

However, I also envision us adding a simpler Vector-based schema format at some point, similar to our deprecated "check_fields" condition, in which you'd define `foo.bar.exists = true`, `foo.baz.type = ["string"]`, etc. In that case, we'd probably want the schema to remain open by default.
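A sketch of what that simpler, open-by-default format could look like, extrapolating directly from the `check_fields`-style examples above (the table name and layout are assumptions):

```toml
# Hypothetical Vector-native schema format; fields not listed here
# would still be allowed, i.e. the schema stays open by default.
[schemas.my_simple_schema]
fields."foo.bar".exists = true
fields."foo.baz".type   = ["string"]
```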
> There are multiple ways this transform ties into the proposed schema support.
>
> ##### Type Checking Integration
It might be worth noting that this is a logical extension of the existing `DataType`-based component-level typing we do today. Just significantly more precise than log vs. metric.
> ##### Output Schema Validation
>
> Additionally, if the operator specified the required output schema of the transform, Vector uses VRL's type checking to ensure that whatever the VRL
(Probably obvious, but maybe worth calling out.) I guess we can only check types of the output event fields, but not their presence or absence (e.g. we append a field based on some runtime conditional check)? Are we going to introduce the notion of required (with optional default value if missing) / optional fields into the schema at all?

Likely related to #8494 as well.
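To make the quoted mechanism concrete, here is a sketch of a `remap` transform whose VRL program uses coercions so the type checker can prove the output types, paired with the `schema.out` options from earlier in the thread. The boot-time check described in the comments is the RFC's proposal, not shipped behavior:

```toml
[transforms.shape]
type = "remap"
inputs = ["my_source"]
# Coercion functions like `string!` and `to_int!` give VRL's type
# checker provable output types for these fields.
source = '''
.message = string!(.message)
.status  = to_int!(.status)
'''

# Per the RFC: with an output schema declared, Vector would verify the
# program's inferred types against it at boot time.
schema.out.type = "custom"
schema.out.file = "./my_schema.json"
```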
@JeanMertz Seemingly we should close this, yes? I know we ended up shifting direction after this was opened.

Closing since we used the brief instead.
closes #3910
👀 RENDERED