chore: Remap v1 RFC #3134

Merged on Aug 3, 2020 (8 commits). Showing changes from 7 commits.

153 changes: 153 additions & 0 deletions rfcs/2020-07-21-2744-vicscript.md
@@ -0,0 +1,153 @@
# RFC #2744 - Vicscript

## Motivation

Simple transforms are easy to document, understand and recommend; powerful transforms are more likely to solve a use case in a concise way. We currently have two tiers of transform at opposite ends of this spectrum:

- Pure TOML (`add_fields`, `remove_fields`, `json_parser`)
- Turing complete runtime (`lua`, the new WASM transform)

This gives us decent coverage for very simple use cases where only one or two native transforms are needed, and also for the very advanced cases where a user basically just needs a framework for running a program they intend to write and maintain themselves.

However, there's a common middle ground where more complex and oftentimes conditional mappings are needed. This is where our native transforms become cumbersome to work with, and yet a full runtime would be heavy-handed and is difficult to provide support for (or often simply doesn't perform well enough).

The main motivation with Vicscript (temporary name) is to introduce a tool that strikes a better balance by offering a simple language that isn't Turing complete and runs as fast as our native transforms. Something that's easy to document, understand, use, diagnose and fix.

My proposal here is basically just a copy of the bits I think work well from [Bloblang](https://www.benthos.dev/docs/guides/bloblang/about), which itself is a derivative of [IDML](https://idml.io/). This is partly due to my familiarity with these languages, but I also have first-hand experience of seeing them in action and know that they are easy to implement and adopt.

### Out of Scope

Event expansion and reduction are within scope, but routing events is not. For example, it would be possible to express with Vicscript that an event should be expanded into several, and then at the sink level filter events (potentially using Vicscript again) so that they route to different destinations based on their contents. However, Vicscript itself will not have any awareness of sinks, only events.

Joining events is within scope, but aggregation of those events is not. For example, it would be possible to aggregate events using a transaction transform (or WASM, etc.) in a form where Vicscript can be used to express exactly how the fields of those events should be joined. However, Vicscript itself will not have the capability to temporarily store events and has no awareness of delivery guarantees or transactions.

## Guide Level Proposal

The `vicscript` transform allows you to mutate events by defining a sequence of mapping statements. Each mapping statement describes a read-only query operation on the right hand side, and on the left hand side a destination for the query result to be mapped within the resulting event:

```toml
[transforms.mapper]
inputs = [ "foo" ]
type = "vicscript"
mapping = """
.foo = "hello"
.bar = .bar + 10
.baz_is_large = .baz > 10
"""
```

> When executing a Vicscript mapping the source event is never mutated.
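
The statement model above can be approximated in a few lines of Python. This is only a sketch of the semantics, not the proposed Rust implementation: the `apply_mapping` helper and the representation of statements as `(target, query)` pairs are hypothetical, and it assumes that queries read from the untouched source event while assignments land in a copy of it.

```python
import copy

def apply_mapping(source, statements):
    """Apply (target, query) pairs; every query reads only from the source."""
    result = copy.deepcopy(source)  # the source event itself is never mutated
    for target, query in statements:
        result[target] = query(source)
    return result

# Python stand-ins for the three statements in the TOML example above.
statements = [
    ("foo", lambda e: "hello"),                 # .foo = "hello"
    ("bar", lambda e: e["bar"] + 10),           # .bar = .bar + 10
    ("baz_is_large", lambda e: e["baz"] > 10),  # .baz_is_large = .baz > 10
]

event = {"bar": 5, "baz": 20}
mapped = apply_mapping(event, statements)
```

Here `mapped` carries the new and updated fields while `event` keeps its original values.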

The most common query type is a dot separated path, describing how to reach a target field within the input event:

```coffee
.foo = .bar.baz.0.buz
```
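
A minimal Python sketch of how such a path might be resolved, assuming events behave like nested JSON objects and arrays and that a missing step yields null. The `resolve` helper is hypothetical; the RFC does not specify an implementation.

```python
def resolve(event, segments):
    """Walk a list of path segments; return None if any step is missing."""
    current = event
    for segment in segments:
        if isinstance(current, dict):
            current = current.get(segment)
        elif isinstance(current, list) and segment.isdigit():
            index = int(segment)
            current = current[index] if index < len(current) else None
        else:
            return None
        if current is None:
            return None
    return current

# Mirrors `.bar.baz.0.buz` from the example above.
event = {"bar": {"baz": [{"buz": "found"}]}}
value = resolve(event, ["bar", "baz", "0", "buz"])
missing = resolve(event, ["bar", "nope", "0"])
```

Numeric segments index into arrays, and any segment that cannot be followed ends the walk early.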

Path sections that include whitespace or other special characters (including `.`) can be quoted:

```coffee
.foo = .bar."baz.buz".bev

# Use slash escapes within quotes
.bar = .buz."fub\"\\fob".fab
```
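
To illustrate the quoting rules, here is a rough Python sketch of splitting a path into segments. The `split_path` function is hypothetical and assumes that only double quotes delimit a quoted section and that a backslash inside quotes escapes the next character.

```python
def split_path(path):
    """Split a dot separated path, honouring double quotes and backslash escapes."""
    segments, current, in_quotes = [], [], False
    chars = iter(path)
    for ch in chars:
        if in_quotes and ch == "\\":
            current.append(next(chars, ""))  # take the escaped character literally
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == "." and not in_quotes:
            if current:
                segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        segments.append("".join(current))
    return segments

simple = split_path('.bar."baz.buz".bev')
escaped = split_path(r'.buz."fub\"\\fob".fab')
```

A dot inside quotes stays part of its segment, so `"baz.buz"` is looked up as a single field name.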

Paths also support coalescing fields with the pipe operator `|`, where if the first value is null or missing then the next value is selected, and so on:

```coffee
.foo = .bar.(baz | buz).bev
```
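
The coalescing rule can be sketched in Python as follows. The `coalesce` helper is hypothetical, and it assumes that "null or missing" is the only reason to fall through to the next candidate, so values such as `0` or `false` are still selected.

```python
def coalesce(parent, candidates):
    """Return the first candidate field of parent that is present and not None."""
    for name in candidates:
        value = parent.get(name)
        if value is not None:
            return value
    return None

# Mirrors `.bar.(baz | buz).bev`: `baz` is missing, so `buz` is selected.
event = {"bar": {"buz": {"bev": 42}}}
selected = coalesce(event["bar"], ["baz", "buz"])
result = selected["bev"] if selected is not None else None
```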

Vicscript supports float, int, boolean, string, null, array and object literals:

```coffee
.first = 7
.second = false
.third = "string"
.fourth = null
```

Boolean operators and arithmetic galore:

```coffee
.is_big = .number > 100
.multiplied = .number * 7
```

Perform assignments conditionally with an `if` statement:

```coffee
.id = .id
.sorted_foo = if .foo.type() == "array" { .foo.sort() }
```

```text
In: {"id":"first","foo":"not an array"}
Out: {"id":"first","foo":"not an array"}

In: {"id":"second","foo":["c","a","d","b"]}
Out: {"id":"second","foo":["c","a","d","b"],"sorted_foo":["a","b","c","d"]}
```
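
The behaviour shown in the input and output pairs above can be modelled in Python. This sketch hard-codes the two statements rather than parsing them, and the `map_event` name is hypothetical; the point is that an `if` without an `else` simply skips the assignment when the condition is false.

```python
def map_event(event):
    result = dict(event)              # start from a copy of the input
    result["id"] = event["id"]        # .id = .id
    foo = event.get("foo")
    if isinstance(foo, list):         # .foo.type() == "array"
        result["sorted_foo"] = sorted(foo)
    return result

first = map_event({"id": "first", "foo": "not an array"})
second = map_event({"id": "second", "foo": ["c", "a", "d", "b"]})
```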

Remove fields by assigning them the `delete` keyword:

```coffee
.foo = delete
```

Contributor

Everything else here is great, but this still feels funky to me. I understand why it's this way, the left-hand side is the "target" and the right-hand side is the "action", but the traditional function syntax seems more intuitive. Going with our jq theme, this would just be del(.foo):

del(.foo)

Does this introduce significant complexity? I'm also curious to get @lukesteensen's opinion.

Member

Yeah, this is another case where I think we can try to avoid keywords and use something less ambiguous like a builtin function.

Contributor Author

We've got options here. Bloblang and IDML use foo = deleted(), so it's a function rather than a keyword, which we could copy. Implementing deletions as a right-hand query allows you to use it in all the same ways as regular mappings, which means they can be conditional:

.content = if .type == "bar" {
  .bar.body
} else {
  deleted()
}

And also mapped in an iterator in order to remove elements from arrays, etc:

.filtered_foos = .foo.map(if .ele.value < 10 { .ele.value + 10 } else { deleted() })

However, iterators aren't part of the RFC scope and if we're keeping the spec minimal then I think it's reasonable not to support those cases. It's easy to add a del(.foo) left-hand function, but if we want to enable conditional deletions the same as conditional mappings then we'd need to implement if as a statement you can put on the left-hand side:

if .type == "bar" {
  del(.bar.id)
}

At which point we need to decide whether we bother implementing it as a right-hand expression as they'd be parsed and implemented separately. Doing so is useful for expanding the language later, but not so much if we're keeping the language minimal.
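
One way to picture the right-hand `deleted()` approach is as a sentinel value that the assignment step and the iterator both recognise. The Python sketch below is purely illustrative: `DELETED`, `assign` and `map_elements` are hypothetical names, and nothing here reflects how Bloblang, IDML or Vector implement it.

```python
DELETED = object()  # hypothetical sentinel standing in for deleted()

def assign(result, key, value):
    """Assign a query result, or remove the key if the query yielded DELETED."""
    if value is DELETED:
        result.pop(key, None)
    else:
        result[key] = value

def map_elements(items, fn):
    """Map fn over items, dropping elements for which fn yields DELETED."""
    return [out for out in (fn(item) for item in items) if out is not DELETED]

event = {"type": "other", "content": "stale", "foo": [{"value": 3}, {"value": 12}]}
result = dict(event)

# .content = if .type == "bar" { .bar.body } else { deleted() }
assign(result, "content", event["bar"]["body"] if event["type"] == "bar" else DELETED)

# .filtered_foos = .foo.map(if .ele.value < 10 { .ele.value + 10 } else { deleted() })
assign(result, "filtered_foos", map_elements(
    event["foo"],
    lambda ele: ele["value"] + 10 if ele["value"] < 10 else DELETED,
))
```

Because deletion is just another query result, the same conditional machinery serves both mappings and removals, which is the trade-off discussed in this thread.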

Contributor

I see what you're saying, and it makes sense, but from a UX standpoint I find the following much more intuitive:

if .type == "bar" {
  del(.bar.id)
}

I feel like I might be suggesting something that is fundamentally at odds with the purpose of this language (performance). If that is the case, then I'd prefer to go with your proposal, otherwise, I find the above clearer. Performance is a key requirement of this language and I don't want to sacrifice that.

Contributor

Ok, apologies for the back and forth. I've been trying to build consensus offline on the syntax proposed here. Just confirming what we settled on for this RFC:

.foo = "bar"
del(.baz)

This would:

  1. Add a root foo key with the value "bar".
  2. Delete the root baz key.
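
For reference, the effect of those two statements on an event can be modelled in Python as below. The `apply` function is a hypothetical stand-in, and it assumes that deleting a key that is absent is a no-op, which the thread does not specify.

```python
def apply(event):
    result = dict(event)
    result["foo"] = "bar"     # .foo = "bar"
    result.pop("baz", None)   # del(.baz); assumed to be a no-op if baz is absent
    return result

out = apply({"baz": 1, "qux": 2})
```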


## Prior Art

### JQ

JQ is a great tool and rightfully gets a lot of love. It also basically solves the same problem as Vicscript. Unfortunately, it's common knowledge that JQ doesn't score high on readability, and therefore it doesn't scale well at all in relation to mapping size and complexity (imagine trying to solve [https://github.com/timberio/vector/issues/1965](https://github.com/timberio/vector/issues/1965) with it).

Member

There are two important factors here, readability and scalability.

Readability is a tricky one. I definitely agree that jq can be hard to understand, but that's largely a non-issue if the user is already familiar with jq from somewhere else. Given that it's pretty popular, it's hard to say how those balance out.

Scalability is one where we need to think about the target use cases. If we're expecting for this transform to handle mostly light reshaping, then scalability is less of an issue. Users will likely have the opportunity to master the basics before diving into something more complicated and may not mind spending a bit more time to figure out something they recognize is pretty complicated. On the other hand, if we expect a meaningful number of users to jump straight into larger mapping declarations, it's more important that we have something that scales smoothly to that size and level of complexity.


The modules syntax introduced in 1.5 helps a lot by allowing you to break your mapping down, but the syntax is still difficult to learn and awkward to read.

### Tremor Script

[Tremor Script](https://docs.tremor.rs/tremor-script/) is part of the [Tremor](https://docs.tremor.rs/) project and therefore is closely aligned with Vector. Of all existing alternatives this seems the most likely candidate for quick adoption as it's written in Rust and basically solves the same problem we have.

Tremor Script is (obviously) designed to work with records as they're modelled within Tremor. We therefore might struggle to get it working with our own event types, and the translations to/from may have a performance penalty.

It also seems like there are some key limitations with the language that we would need to contribute solutions for (or fork). The main one is simple coalescing expressions: I can't find anything in their docs to allow something like `foo = bar.(baz | buz).bev` without a nasty bunch of nested `match` expressions.

Similar to JQ, the syntax is also far enough removed from basic C-style declarations that there's a learning curve and more difficulty in maintaining syntax highlighters.

Contributor

Agree with all of the above regarding Tremor script. I think it's an impressive feature of Tremor, but would like to avoid it for the reasons you listed.


### IDML and Bloblang

For a bit of background, [IDML](https://idml.io/) was created at DataSift (now a Meltwater company) specifically as a language that customers could write and maintain themselves for translating their unstructured data into a common format. The goals of this language were therefore to be as powerful as needed for the overwhelming majority of mapping use cases, easy to support and quick to adopt.

Unfortunately, the only parser on offer is written in Scala and so we'd likely need to re-implement it anyway in order to get the performance we need. [Bloblang](https://www.benthos.dev/docs/guides/bloblang/about) is a spiritual cousin of IDML, but as it's written in Go it's in the same boat.

However, the language itself is simple, looks clean, and is very easy to pick up. This RFC is mostly going to be a copy of Bloblang, with the opportunity to omit or add features as we see fit.

## Drawbacks

### It's more to support

A key role for us as Vector maintainers is to assist users with their configs and thereby encourage adoption. It stands to reason that making Vector as a project larger would conflict with our ability to do this. However, we already have a range of transforms for mapping and after reviewing some large example configs ([https://github.com/timberio/vector/issues/1965](https://github.com/timberio/vector/issues/1965)) I think scrapping them in favor of a more condensed and readable language would be overall beneficial to us when we're sporting our support hats.

Contributor

Agree, I think this will aid us in support.


The spec for Vicscript doesn't need to be large, in fact the entire purpose of it is to remain minimal, but the reality is that this is something "more" we're going to need to learn and work with.

### It's a lot of effort

Vicscript is an entirely new language spec where we'll need to implement both the parser and executor, which is clearly going to be a significant chunk of work. The obvious mitigation to this is to simply try it out and put a time cap on it, then review how far along the project got. This is only a speculative drawback and it's only going to draw us back once (until we decide to rewrite it for fun).

Contributor

The value exceeds the effort in my opinion.


## Rationale

I think it's clear that there's a benefit to introducing a mapping language to Vector, and therefore the main point of contention here is whether to adopt something else that roughly fits, or to build something bespoke.

Mapping events is a core feature of Vector, and once we introduce a performant and more powerful alternative to our existing mapping transforms it's reasonable to expect it to become the standard for small or moderate sized pipelines. As such, I think it's important for us to have strong ownership of this language. It will allow us to build it into exactly the right solution for Vector users. It will also allow us to guarantee its performance and stability in the future, and provide first class support for it.

Given that most of the risk around this idea is simply how difficult it will be, I'd say the obvious first step is to test that out with a proof of concept. We can start with a quick implementation of a small subset of the spec and build a parser and execution engine for review. If we end up with nothing, or with a hostile codebase, then we can instead look into adopting other projects and compare.

Contributor

Agree with everything you've laid out as a rationale. I am a big fan of starting with a small experiment.


## Plan of Attack

1. Dip our toes and build a new transform with a parser using something like [nom](https://github.com/Geal/nom). This should only go as far as a tiny subset of the spec, where we allow basic operations that are roughly equivalent to our existing mapping transforms and work directly to/from our internal event type.
2. Review the parser and the codebase, and compare the performance of this transform with the existing mapping transforms. If it sucks then it goes in the bin and we investigate using other existing alternatives.
3. Write a full and formal specification of the language and its behavior. This does not need to include advanced features such as maps, match statements, etc. However, it would be wise to consider these advanced features when finalizing the spec.
4. Since the spec is mostly a copy of Bloblang it would be sensible to share efforts for community tooling such as syntax highlighters, linters, documentation, etc. Breaking our parser out as a reference Rust implementation would help build that community.
5. Gradually expand the language with more advanced features such as match and if statements, as a BETA transform, taking user feedback on board as we move.
6. Once we're happy, we remove the BETA flag and phase out the existing transforms that are no longer necessary. We could opt to leave the implementations in for backwards compatibility, but the docs should be removed.
7. Finally, we can start reusing Vicscript in other places such as at the source/sink level.
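
To give a feel for the size of the "tiny subset" in step 1, here is a rough Python prototype of a parser and executor for plain assignments. It is only a sketch under several assumptions: the real transform would be written in Rust (the plan suggests a parser combinator library such as nom), the names `compile_mapping`, `parse_query` and `run` are hypothetical, literals are parsed as JSON for convenience, and only unquoted object paths are supported.

```python
import json
import re

# One statement per line: `.target = <literal or path>`.
STATEMENT = re.compile(r"^\.(?P<target>\w+)\s*=\s*(?P<query>.+)$")

def parse_query(text):
    """Return a function of the source event for a literal or a simple path."""
    text = text.strip()
    if text.startswith("."):
        segments = text[1:].split(".")

        def read(event):
            current = event
            for segment in segments:
                if not isinstance(current, dict):
                    return None
                current = current.get(segment)
            return current

        return read
    literal = json.loads(text)  # assumption: JSON syntax doubles as literal syntax
    return lambda event: literal

def compile_mapping(source_text):
    """Parse a mapping into (target, query) pairs, skipping blanks and comments."""
    statements = []
    for line in source_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = STATEMENT.match(line)
        if match is None:
            raise ValueError(f"cannot parse statement: {line!r}")
        statements.append((match["target"], parse_query(match["query"])))
    return statements

def run(statements, event):
    """Execute compiled statements against a copy of the event."""
    result = dict(event)
    for target, query in statements:
        result[target] = query(event)
    return result

mapping = compile_mapping('.foo = "hello"\n.copy = .bar.baz\n.n = 7\n.flag = false')
output = run(mapping, {"bar": {"baz": "x"}})
```

Splitting parsing from execution like this means the mapping is compiled once at startup and then applied per event, which is the part whose performance step 2 proposes to measure.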