Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Remap v1 RFC #3134

Merged
merged 8 commits into from
Aug 3, 2020
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
177 changes: 177 additions & 0 deletions rfcs/2020-07-21-2744-vicscript.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,177 @@
# RFC #2744 - Vicscript

## Motivation

Simple transforms are easy to document, understand and recommend, powerful transforms are more likely to solve a use case in a concise way. We currently have two tiers of transform at opposite ends of this spectrum:

- Pure TOML (`add_fields`, `remove_fields`, `json_parser`)
- Turing complete runtime (`lua`, the new WASM transform)

This gives us decent coverage for very simple use cases where only one or two native transforms are needed, and also with the super advanced cases where a user basically just needs a framework for running a program they intend to write and maintain themselves.

However, there's a common middle ground where more complex and oftentimes conditional mappings are needed. This is where our native transforms become cumbersome to work with, and yet a full runtime would be heavy handed and is difficult to provide support for (or often simply doesn't perform well enough).

The main motivation with Vicscript (temporary name) is to introduce a tool that strikes a better balance by offering a simple language that isn't turing complete and runs as fast as our native transforms. Something that's easy to document, understand, use, diagnose and fix.

My proposal here is basically just a copy of the bits I think work well from [Bloblang](https://www.benthos.dev/docs/guides/bloblang/about), which itself is a derivative of [IDML](https://idml.io/). This is partly due to my familiarity with these languages, but I also have first hand experience of seeing them in action and know that they are easy to implement and adopt.

### Out of Scope

Event expansion and reduction are within the scope, but routing events is not. For example, it would be possible to express with vicscript that an event should be expanded into several, and then at the sink level filter events (potentially using vicscript again) so that they route to different destinations based on their contents. However, vicscript itself will not have any awareness of sinks, only events.

Joining events are within scope, but aggregation of those events is not. For example, it would be possible to aggregate events using a transaction transform (or wasm, etc) in a form where vicscript can be used to express exactly how the fields of those events should be joined. However, vicscript itself will not have the capability to temporarily store events and has no awareness of delivery guarantees or transactions.

## Guide Level Proposal

The `vicscript` transform allows you to map your events using Vicscript. A Vicscript statement has a left hand side target path and a right hand side query:

```toml
[transforms.mapper]
inputs = [ "foo" ]
type = "vicscript"
mapping = """
foo = "hello"
Copy link
Contributor

@binarylogic binarylogic Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is the first presentation of this syntax, the thing that immediately jumps out to me is the lack of clarity around event field assignment. It looks like a bunch of variables are being set with no real action. Following precedence set by our lua transform, I'm curious what you think about this alternative syntax?

event.log.foo = "hello"
event.log.bar = event.log.bar + 10
event.log.baz_is_large = event.log.baz_is_large > 10

I realize the intent of your syntax is to avoid maintaining a runtime environment. So things like this would not be possible:

# not possible
baz = 1
event.log.bar = baz

And we could simply throw an error saying that baz is "unknown". The advantage with my syntax is:

  1. It's obvious you are setting an event level field.
  2. The log namespace means we can use this with metrics (unless you were thinking of a better way to solve this).
  3. It's closer in syntax to other transforms that manipulate events (like lua and wasm).

Disadvantages:

  1. It suggests that this is a runtime. Users might be surprised that my baz = 1 example doesn't work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that'd be reasonable, one thing to note is that enforcing a path prefix would be stripped within a map definition, so you'd still get bare assignments like this:

map foo {
  type = (type.uppercase() | "UNKNOWN")
  name = name.titlecase()
}

event.log.metadata = event.log.metadata.apply("foo")

Another thing to note is that both IDML and Bloblang have an optional path prefix root (root.foo = "bar" would set the event field foo to "bar"), which is useful for when a path clashes with a keyword like if or match, but is also implicit when omitted as a convenience. It's assumed it usually won't be used unless you specifically want to assign directly to the root. On the right hand side we have a similar keyword this which represents the source value in the current context.

Adding them to the above example looks like this:

map foo {
  root.type = (this.type.uppercase() | "UNKNOWN")
  root.name = this.name.titlecase()
}

root.event.log.metadata = this.event.log.metadata.apply("foo")

I would opt to leave these in the spec as they are. We could choose to make them mandatory but I wouldn't rename root to something like event as it's somewhat misleading in the context of maps.

On the topic of variables, we can actually support them without a runtime or making the spec turing complete, it's just a mapping that isn't part of the output event. I originally omitted them from the RFC so we could consider them later as a more advanced feature, Bloblang has them and it's definitely useful, I already have an example here that uses them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would opt to leave these in the spec as they are.

Sounds good. I've come around to leaving out the prefix, especially as we start to incorporate it in other places where specifying event.log would be confusing. Such as dynamic configuration values.

map foo {
  type = (type.uppercase() | "UNKNOWN")
 name = name.titlecase()
}

The map syntax feels a little strange to me as well. As I noted in my other comment, we should agree on these details in the form of a spec so that we can obtain consensus. I realize that's the part of what this RFC aims to accomplish, but it still feels hand-wavy to me.

On the topic of variables, we can actually support them without a runtime or making the spec turing complete, it's just a mapping that isn't part of the output event.

Sounds good to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's a little unclear to have assignment look like free variables, but I'm not a fan of the verbosity of event.log..

The jq approach is interesting here, where assignment to a foo key is just .foo = "bar".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can do that. One thing that's open to confusion here is that . on the left hand side means something different from the right hand side.

Using root and this makes that distinction clear (even if it's not immediately obvious what they point to), whereas a very common pitfall of jq is that users will naturally assume in many places that a dot on the left and the right are the same thing. It means the docs for lots of features ends up needing to also explain that same concept continuously, making them quite verbose: https://stedolan.github.io/jq/manual/#Assignment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that's open to confusion here is that . on the left hand side means something different from the right hand side.

Yeah, this is interesting. It looks like in "normal" = assignment they are the same thing, but "most users will want to use modification assignment operators", where the RHS is a filter. I get the feeling there's some history there and I'm curious what led to the current state.

bar = bar + 10
baz_is_large = baz > 10
"""
```

The most common query type is a dot separated path, describing how to reach a target field within the input event:

```coffee
foo = bar.baz.0.buz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this differentiate between array and map indexing at all? Or would .0 work for both {"0": "foo"} and ["foo"]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's on a right hand side we have the reference object to look at, so this path follows the same rules as JSON pointers where if we approach an array and the path segment can be parsed as a integer then we use it as an index. However, this obviously can't always work on the left hand side unless the array already exists from a previous assignment, so there's room for confusion there.

We could instead opt to force array indexes to be explicit all over and foo.0.bar would always mean an object foo with a key 0, and indexes are expressed as foo[0].bar.

```

Path sections that include whitespace or other special characters (including `.`) can be quoted:

```coffee
foo = bar."baz.buz".bev

# Use slash escapes within quotes
bar = buz."fub\"\\fob".fab
```

And supports coalescing fields within a path with a pipe operator `|`, where if the first value is null or missing then the next value is selected, and so on:

```coffee
foo = bar.(baz | buz).bev
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 really like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same! This is really nice and concise.

```

You can assign a value to the root of your event by writing to the `root` keyword:

```coffee
root = nested.object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another example where it's clearer to write:

event.log = nested.object

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In both of these cases I worry about the user data having root, event, or log fields. If we can sidestep that whole problem it could save us a lot of headache.

```

Vicscript supports float, int, boolean, string, null, array and object literals:

```coffee
first = 7
second = false
third = "string"
fourth = null
```

As well as a timestamp type:

```coffee
# NOTE: The full API for the timestamp type is out of the scope of this RFC. The
# following example is for demonstrative purposes.

created_at = now().unix()
binarylogic marked this conversation as resolved.
Show resolved Hide resolved
bruceg marked this conversation as resolved.
Show resolved Hide resolved
# Sets `created_at` to an integer representing the current unix timestamp
# (in seconds).
```

Boolean operators and arithmetic galore:
bruceg marked this conversation as resolved.
Show resolved Hide resolved

```coffee
is_big = number > 100
multiplied = number * 7
```

Perform assignments conditionally with an `if` statement:

```coffee
sorted_foo = if foo.type() == "array" { foo.sort() } else { foo }
```

```
In: {"foo":"not an array"}
Out: {"sorted_foo":"not an array"}'

In: {"foo":["c","a","d","b"]}
Out: {"sorted_foo":["a","b","c","d"]}
```


Use a wealth of methods on values in order to perform common mutations on them:

```coffee
sorted = foo.sort()
uppercase = bar.uppercase()
```

```
In: {"foo":["c","a","d","b"],"bar":"hello world"}
Out: {"sorted":["a","b","c","d"],"uppercase":"HELLO WORLD"}
```

## Examples

I've written an implementation of [an embedded CSV parser](2020-07-21-2744-vicscript/example1.coffee) using the above proposal in order to demonstrate what it looks like. This example is currently runnable with Benthos, and I've used the `.coffee` suffix because the syntax highlighting for CoffeeScript basically gives us everything we need already.

## Prior Art

### JQ

JQ is a great tool and rightfully gets a lot of love. It also basically solves the same problem as Vicscript. Unfortunately, it's common knowledge that JQ doesn't score high on readability, and therefore it doesn't scale well at all in relation to mapping size and complexity (imagine trying to solve [https://github.com/timberio/vector/issues/1965](https://github.com/timberio/vector/issues/1965) with it).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two important factors here, readability and scalability.

Readability is a tricky one. I definitely agree that jq can be hard to understand, but that's largely a non-issue if the user is already familiar with jq from somewhere else. Given that it's pretty popular, it's hard to say how those balance out.

Scalability is one where we need to think about the target use cases. If we're expecting for this transform to handle mostly light reshaping, then scalability is less of an issue. Users will likely have the opportunity to master the basics before diving into something more complicated and may not mind spending a bit more time to figure out something they recognize is pretty complicated. On the other hand, if we expect a meaningful number of users to jump straight into larger mapping declarations, it's more important that we have something that scales smoothly to that size and level of complexity.


The modules syntax introduced in 1.5 helps a lot by allowing you to break your mapping down, but the syntax is still difficult to learn and awkward to read.

### Tremor Script

[Tremor Script](https://docs.tremor.rs/tremor-script/) is part of the [Tremor](https://docs.tremor.rs/) project and therefore is closely aligned with Vector. Of all existing alternatives this seems the most likely candidate for quick adoption as it's written in Rust and basically solves the same problem we have.

Tremor script is (obviously) designed to work with records as they're modelled within Tremor, we therefore might struggle to get it working with our own event types and the translations to/from may have a performance penalty.

It also seems like there are some key limitations with the language that we would need to contribute solutions for (or fork). The main one being simple coalescing expressions, I can't find anything in their docs to allow something like `foo = bar.(baz | buz).bev` without a nasty bunch of nested `match` expressions.

Similar to JQ the syntax is also far enough removed from basic c-style declarations that there's a learning curve and more difficulty in maintaining syntax highlighters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with all of the above regarding Tremor script. I think it's an impressive feature of Tremor, but would like to avoid it for the reasons you listed.


### IDML and Bloblang

For a bit of background, [IDML](https://idml.io/) was created at DataSift (now a Meltwater company) specifically as a language that customers could write and maintain themselves for translating their unstructured data into a common format. The goals of this language were therefore to be as powerful as needed for the overwhelming majority of mapping use cases, easy to support and quick to adopt.

Unfortunately, the only parser on offer is written in Scala and so we'd likely need to re-implement it anyway in order to get the performance we need. [Bloblang](https://www.benthos.dev/docs/guides/bloblang/about) is a spiritual cousin of IDML, but as it's written in Go it's in the same boat.

However, the language itself is simple, looks clean, and is very easy to pick up, this RFC is mostly going to be a copy of Bloblang, with the opportunity to omit or add features as we see fit.

## Drawbacks

### It's more to support

A key role for us as Vector maintainers is to assist users with their configs and thereby encourage adoption. It stands to reason that making Vector as a project larger would conflict with our ability to do this. However, we already have a range of transforms for mapping and after reviewing some large example configs ([https://github.com/timberio/vector/issues/1965](https://github.com/timberio/vector/issues/1965)) I think scrapping them in favor of a more condensed and readable language would be overall beneficial to us when we're sporting our support hats.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I think this will aid us in support.


The spec for Vicscript doesn't need to be large, in fact the entire purpose of it is to remain minimal, but the reality is that this is something "more" we're going to need to learn and work with.

### It's a lot of effort

Vicscript is an entirely new language spec where we'll need to implement both the parser and executor, which is clearly going to be a significant chunk of work. The obvious mitigation to this is to simply try it out and put a time cap on it, then review how far along the project got. This is only a speculative drawback and it's only going to draw us back once (until we decide to rewrite it for fun).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value exceeds the effort in my opinion.


## Rationale

I think it's clear that there's a benefit to introducing a mapping language to Vector, and therefore the main point of contention here is whether to adopt something else that roughly fits, or to build something bespoke.

Mapping events is a core feature of Vector, and once we introduce a performant and more powerful alternative to our existing mapping transforms it's reasonable to expect it to become the standard for small or moderate sized pipelines. As such, I think it's important for us to have strong ownership of this language. It will allow us to build it into exactly the right solution for Vector users. It will also allow us to guarantee its performance and stability in the future, and provide first class support for it.

Given most of the risk around this idea is simply how difficult it will be I'd say the obvious first step is to test that out with a proof of concept. We can start with a quick implementation of a small subset of the spec and build a parser and execution engine for review. If we end up with nothing, or with a hostile codebase, then we can instead look into adopting other projects and compare.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with everything you've laid out as a rationale. I am big fan of starting with a small experiment.


## Plan of Attack

1. Dip our toes and build a new transform with a parser using something like [nom](https://github.com/Geal/nom). This should only go as far as a tiny subset of the spec, where we allow basic operations that are roughly equivalent to our existing mapping transforms and work directly to/from our internal event type.
2. Review the parser and the codebase, and compare the performance of this transform with the existing mapping transforms. If it sucks then it goes in the bin and we investigate using other existing alternatives.
3. If instead we're happy with it then at this point we should consider breaking the project out of Vector. Since the spec is mostly a copy of Bloblang it would be sensible to share efforts for community tooling such as syntax highlighters, linters, documentation, etc. Breaking our parser out as a reference Rust implementation would help build that community.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a step before 3 that would outline an actual spec for the language. We need consensus on the details and priority, but this is not necessary for the initial experiment.

4. Gradually expand the language with more advanced features such as match and if statements, as a BETA transform, taking user feedback on board as we move.
5. Once we're happy we remove the BETA flag and phase out the existing transforms that are no longer necessary, we could opt to leave the implementations in for backwards compatibility but the docs should be removed.
6. Finally, we can start reusing vicscript in other places such as at the source/sink level.
47 changes: 47 additions & 0 deletions rfcs/2020-07-21-2744-vicscript/example1.coffee
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# NOTE: This example is intended to demonstrate larger applications for
# Vicscript in order to get a feel for the syntax. However, this uses features
# and concepts that aren't yet decided upon and aren't part of the RFC.
#
# Problem: My nan likes to send me CSV files embedded within JSON documents:
#
# {"items":"item,count\napples,10\noranges,2\n","doc":{"title":"shopping list","description":"get me this stuff"}}
#
# I want to parse and expand the csv doc within items like so:
#
# {
# "doc":{
# "title":"shopping list",
# "description":"get me this stuff",
# "items": [
# {"item":"apples","count":10},
# {"item":"oranges","count":2}
# ]
# }
# }
#
# Note: this example includes some of the more advanced mapping functions from
# Bloblang such as enumerated and map_each. In a real world scenario this
# example would be replaced with something bespoke like items.parse_csv().
#

# First, copy the unchanged contents of doc to our new event.
doc = doc

# Next, parse the csv out into an array of arrays to a temporary variable.
let rows = items.split("\n").map_each(match this.trim() {
this.length() == 0 => deleted(), # Remove empty lines
_ => this.split(","),
})

# The first row is column names
let column_names = $rows.0

# And here's the meaty part where we bring it all together. We walk each element
# of our array of arrays of values, and enumerate the value array. Then, using
# the index of the value, we create a temporary object with a key taken from the
# column_names variable and fold it.
doc.items = $rows.slice(1).map_each(
this.enumerated().fold({}, tally.merge({
bruceg marked this conversation as resolved.
Show resolved Hide resolved
$column_names.index(value.index): value.value
}))
)
40 changes: 40 additions & 0 deletions rfcs/2020-07-21-2744-vicscript/example2.coffee
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# NOTE: This example is intended to demonstrate larger applications for
# Vicscript in order to get a feel for the syntax. However, this uses features
# and concepts that aren't yet decided upon and aren't part of the RFC.
#
# Parses events of any structure:
#
# {
# "nulled": null,
# "basic": true,
# "list": [
# true,
# null,
# [true, null, true],
# {
# "basic": true,
# "buddy": 1.0
# }
# ],
# "map": {
# "basic": true,
# "list": [true, null, true],
# "map": {
# "basic": true,
# "buddy": -1
# }
# }
# }
#
# And returns an object containing and array of all keys and a separate array of all values:
#
# {"keys":["map.map.buddy","list.2.0","map.list.0","list.0","basic","map.list.1","list.2.2","list.3.buddy","map.list.2","map.basic","list.2.1","nulled","list.3.basic","map.map.basic","list.1"],"values":[-1,true,true,true,true,null,true,1,true,true,null,null,true,true,null]}
#

root = this.collapse().map_each({"keys":key,"values":value}).values().fold(
{"keys":[],"values":[]},
{
"keys": tally.keys.append(value.keys),
"values": tally.values.append(value.values)
}
)