
chore(schema): schema support RFC #9388

Closed

Conversation

JeanMertz (Contributor)

closes #3910

👀 RENDERED

Signed-off-by: Jean Mertz <[email protected]>
netlify bot commented Sep 30, 2021

✔️ Deploy Preview for vector-project canceled.

🔨 Explore the source changes: 67592b8

🔍 Inspect the deploy log: https://app.netlify.com/sites/vector-project/deploys/6155c91659cfb700079f0b23

Comment on lines +159 to +161
When a source is configured to drop invalid events, and an event does not match
the schema, the event won't be allowed to proceed to the next component in the
pipeline.
Contributor

Does validation only happen at the source layer? Could a user remap events to break the schema contract before events reach a sink that expects a guaranteed shape?

EDIT: I believe this is answered by "merging" of schemas as events flow through the topology.
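
To make the quoted drop behavior concrete, here is a minimal sketch of what such a source configuration might look like. The `schema.in.*` keys and the `on_mismatch` knob are assumptions for illustration, not syntax taken from the RFC:

```toml
# Hypothetical sketch: a source that validates incoming events against a
# schema and drops events that do not match, so they never proceed to the
# next component in the pipeline.
[sources.app_logs]
type = "http"
address = "0.0.0.0:8080"

# Assumed keys: attach a schema to events entering the topology here.
schema.in.type = "custom"
schema.in.file = "./my_schema.json"

# Assumed knob: drop invalid events rather than forwarding them.
schema.in.on_mismatch = "drop"
```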

Comment on lines +328 to +331
Currently, these transforms take either the `vrl` condition type, or the
deprecated `check_fields` type. After implementing this RFC, a new `schema` type
is added, which allows applying conditional logic based on the schema attached
to each individual event.
Contributor

Could a schema here be as generic as JSON?

@binarylogic (Contributor) commented Sep 30, 2021

No, that's not a schema and should be validated upstream, ideally with the codec option.
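
For illustration, a minimal sketch of what the proposed `schema` condition type could look like on the `route` transform; the `id` and `file` parameters are assumptions, not settled syntax:

```toml
# Hypothetical sketch of schema-based routing with the proposed `schema`
# condition type (alongside the existing `vrl` and deprecated
# `check_fields` types).
[transforms.by_schema]
type = "route"
inputs = ["app_logs"]

# Route events carrying a well-known schema one way...
route.syslog.type = "schema"
route.syslog.id = "syslog"              # assumed parameter

# ...and events matching a custom schema another way.
route.custom.type = "schema"
route.custom.file = "./my_schema.json"  # assumed parameter
```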

@jszwedko (Member) left a comment

Nice work, this seems like it'll solve some common user pain points. For the schema definition, did you have thoughts on what we would use? I saw JSON-schema mentioned below, do you mean https://json-schema.org/ ?

Comment on lines +185 to +193
#### Dead-Letter Pipeline

In the future, dropped events can be picked up by a [dead-letter pipeline][dlq],
but this is out-of-scope for this RFC.

In a sense, accepting invalid events, combined with the `route` transform,
already lets you manually implement a dead-letter pipeline.

[dlq]: https://github.com/vectordotdev/vector/issues/1772
Member

I think the work Luke is doing to enable transform dead-letter queues should be landing soon, so this RFC could simply use that concept as the preferred approach (extending it to sources). I realize it is likely out of scope for this particular RFC, but I don't think we should implement source schema support without it, only to quickly add it later and need to make a backwards-incompatible change.

We could list it as a dependency for this RFC and have a separate RFC for it.

Member

Yeah, this should fall out naturally from #9169. The features are mostly orthogonal, so I don't think it requires much discussion here (see #9417 for an example). The one exception would be having different outputs "typed" differently, which we don't currently support but plan to.

Contributor (Author)

Agreed. I wasn't aware that work was as far along as it is. I'll update the RFC to reference that work, and make sure the implementation itself ties into it as well.

Comment on lines +281 to +283
# Ensure any event produced by this transform adheres to my "custom" schema.
schema.out.type = "custom"
schema.out.file = "./my_schema.json"
Member

I think we could allow inline definition too. This might be related to a general user desire to "include" other files in a Vector config.
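
A minimal sketch of what an inline definition might look like, building on the file-based example above; the `schema.out.inline` key is an assumption:

```toml
[transforms.my_transform]
type = "remap"
inputs = ["app_logs"]
source = ". = parse_json!(string!(.message))"

# Ensure any event produced by this transform adheres to my "custom" schema,
# defined inline instead of in a separate file (hypothetical key).
schema.out.type = "custom"
schema.out.inline = '''
{
  "type": "object",
  "required": ["message", "timestamp"],
  "properties": {
    "message":   { "type": "string" },
    "timestamp": { "type": "string", "format": "date-time" }
  }
}
'''
```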

based on the schema. For example, any condition-based transform can use the
schema ID to pass or fail the condition (e.g. schema-based routing).

### Schema Conversion (TODO)
Member

I am wondering what this means exactly. It seems like it could be something we leave for later. I like what you've outlined above as an initial step.

Contributor (Author)

> I am wondering what this means exactly.

In what sense?

To give a bit more detail (which I'll add to the docs as well): we currently already do "schema conversions" by shaping certain known event fields to match whatever the system at the other end of a sink expects. For example, if a system expects timestamps in a "ts" field using a specific format, we do that conversion implicitly, using a hard-coded list of fields that we know are often required by sinks in differing formats.

With schemas and schema conversions, this would no longer have to be special-cased; instead, we explicitly define conversions from one schema to another and perform them within the sink. This lets us add more fields as part of the conversion, be more correct in whatever conversion needs to happen, and stop at boot time if the pipeline expects a conversion from schema A to B that the sink does not support.

Member

I agree that there's more to work out here, at least from my perspective. Simply having a JSON Schema for an event doesn't provide the same semantic meaning of "this is the timestamp field" as the existing conversions, so I imagine we'd still need some mechanism to identify those well-known fields. Maybe that's what you mean by "explicitly define schema conversions", but it's not totally clear.

Contributor (Author)

I agree, and I'll flesh this out more. Your hunch is also correct: there will be a way (to be determined) to write schema conversion logic, which would allow Vector to know that a field `ts` in one schema refers to the commonly used "event timestamp" field, while in another schema that field is named `@timestamp`. It would also embed information w.r.t. the type of the field, and how to convert from one to another (e.g. RFC 3339 to Unix timestamp).

There is more to be decided here, and it definitely needs more TLC in the RFC.
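
To sketch the idea under discussion: explicit conversion logic might declare that `ts` in one schema and `@timestamp` in another both denote the event timestamp, along with how to convert between representations. None of these keys exist; they only illustrate the shape such logic could take:

```toml
# Hypothetical sketch of explicit schema-conversion logic.
[schemas.conversions.app_to_elastic]
from = "app_schema"
to = "elastic_schema"

# Both fields denote the semantic "event timestamp"; rename the field and
# convert its representation (e.g. Unix timestamp to RFC 3339).
fields.timestamp.from = "ts"
fields.timestamp.to = "@timestamp"
fields.timestamp.convert = { from = "unix", to = "rfc3339" }
```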

@JeanMertz (Author)

> For the schema definition, did you have thoughts on what we would use? I saw JSON-schema mentioned below, do you mean json-schema.org?

Good point. Yeah, I had JSON schema in mind, but (1) I'm open to alternatives, and (2) we might want to allow other formats in the future regardless.

But either way, I should add a chapter on the choice we make here.

@StephenWakely (Contributor)

Nice.

It feels like schemas could very easily be a case of pulling the `type_def` work we have in VRL and adding it to Vector's components. It could be that by leveraging this, we wouldn't have much additional work to do.

Rather than having sources that check defined schemas, I wonder if we could just create a VRL function that validates against a given schema. This would be similar to the `string!`, `float!`, etc. functions that we already have, but would allow much more comprehensive specifications (maybe the function could even specify a file containing a JSON schema). This could be used to configure each route in the `route` transform. Since VRL has validated the schema, it would automatically be passed along the route.

So, a source that doesn't create a known schema for an event would just output `Any`. If the user doesn't use a sink or transform that requires a schema, they wouldn't need any additional configuration.
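
A minimal sketch of the suggestion above. The `validate_schema!` function does not exist in VRL; it's assumed here purely to illustrate how schema validation could ride on VRL's existing fallibility and type checking:

```toml
[transforms.enforce_schema]
type = "remap"
inputs = ["app_logs"]
source = '''
# Hypothetical fallible function, in the spirit of `string!` and `float!`:
# aborts when the event does not match the schema, so everything downstream
# of this transform can rely on the validated shape.
. = validate_schema!(., "./my_schema.json")
'''
```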


To do this, they can do one of two things.

1. For sources with a **known** schema, Vector guarantees the schema of each
Member

It might be implicit from your discussion of JSON Schema below, but just to be clear, does a "known" schema imply that we know every field and its type? Or just that we guarantee the presence of some keys with a certain type, and there may be additional unknown fields. It seems that the latter will be common, so probably worth supporting directly.

@JeanMertz (Author) commented Oct 7, 2021

Good question, and yes, I left that somewhat implicit based on the first iteration using JSON schema.

Whether the schema is "closed" (i.e. exact) or "open" should probably be left to the format we use to define such a schema. I'm starting with JSON schema because it's well known and commonly used, and in that schema the `additionalProperties` keyword would be used to define this.

However, I also envision us adding a simpler Vector-based schema format at some point, similar to our deprecated "check_fields" condition, in which you'd define `foo.bar.exists = true`, `foo.baz.type = ["string"]`, etc. In that case, we'd probably want the schema to remain open by default.
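
A sketch of what that simpler, `check_fields`-style format might look like, mirroring the examples in the comment; the `[schemas.*]` table and key names are assumptions, and fields not listed would be allowed since the schema stays open by default:

```toml
# Hypothetical Vector-native schema format, open by default.
[schemas.my_simple_schema]
"foo.bar".exists = true
"foo.baz".type = ["string"]
"timestamp".type = ["timestamp"]
```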


There are multiple ways this transform ties into the proposed schema support.

##### Type Checking Integration
Member

It might be worth noting that this is a logical extension of the existing `DataType`-based component-level typing we do today. Just significantly more precise than log vs. metric.


##### Output Schema Validation

Additionally, if the operator specified the required output schema of the
transform, Vector uses VRL's type checking to ensure that whatever the VRL
@vladimir-dd (Contributor) commented Oct 21, 2021

(Probably obvious, but maybe worth calling out.) I guess we can only check the types of the output event fields, but not their presence or absence (e.g. when we append a field based on some runtime conditional check)? Are we going to introduce the notion of required (with an optional default value if missing) vs. optional fields into the schema at all?
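
For illustration, one shape such required/optional support could take, extending the `schema.out` example from earlier in the thread; the `required` and `defaults` keys are assumptions raised by the question above, not part of the RFC:

```toml
[transforms.my_transform]
type = "remap"
inputs = ["app_logs"]
source = '''
# This field is only appended under a runtime condition, so static type
# checking alone cannot prove its presence.
if .level == "debug" { .verbose = true }
'''

schema.out.type = "custom"
schema.out.file = "./my_schema.json"

# Assumed keys: fail at boot if `message` cannot be proven present, and
# fill `verbose` with a default when it is missing at runtime.
schema.out.required = ["message"]
schema.out.defaults.verbose = false
```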

@tobz (Contributor) commented Oct 29, 2021

Likely related to #8494 as well.

bits-bot commented Nov 3, 2021

CLA assistant check
All committers have signed the CLA.

@tobz tobz added the RFC label Nov 4, 2021
@tobz tobz mentioned this pull request Nov 5, 2021
@jszwedko (Member) commented Feb 8, 2022

@JeanMertz Seemingly we should close this, yes? I know we ended up shifting direction after this was opened.

@jszwedko jszwedko closed this Apr 1, 2022
@jszwedko (Member) commented Apr 1, 2022

Closing since we used the brief instead.
