Add vicscript RFC

vectordotdev · Jul 21, 2020 · cb93a96 · cb93a96
1 parent 129a861
commit cb93a96
Showing 2 changed files with 235 additions and 0 deletions.
diff --git a/rfcs/2020-07-21-2744-vicscript.md b/rfcs/2020-07-21-2744-vicscript.md
@@ -0,0 +1,192 @@
+RFC #2744 - Vicscript
+=====================
+
+## Motivation
+
+Simple transforms are easy to document, understand and recommend; powerful transforms are more likely to solve a use case concisely. We currently have two tiers of transform at opposite ends of this spectrum:
+
+- Pure TOML (`add_fields`, `remove_fields`, `json_parser`)
+- Turing complete runtime (`lua`, the new WASM transform)
+
+This gives us decent coverage of very simple use cases where only one or two native transforms are needed, and of very advanced cases where a user essentially needs a framework for running a program they intend to write and maintain themselves.
+
+However, there's a common middle ground where more complex and oftentimes conditional mappings are needed. This is where our native transforms become cumbersome to work with, and yet a full runtime would be heavy handed and is difficult to provide support for (or often simply doesn't perform well enough).
+
+The main motivation for Vicscript (temporary name) is to introduce a tool that strikes a better balance by offering a simple language that isn't Turing complete and runs as fast as our native transforms. Something that's easy to document, understand, use, diagnose and fix.
+
+My proposal here is basically just a copy of the bits I think work well from [Bloblang](https://www.benthos.dev/docs/guides/bloblang/about), which itself is a derivative of [IDML](https://idml.io/). This is partly due to my familiarity with these languages, but I also have first hand experience of seeing them in action and know that they are easy to implement and adopt.
+
+### Out of Scope
+
+Event expansion and reduction are within scope, but routing events is not. For example, it would be possible to express with Vicscript that an event should be expanded into several, and then at the sink level filter events (potentially using Vicscript again) so that they route to different destinations based on their contents. However, Vicscript itself will not have any awareness of sinks, only events.
+
+Joining events is within scope, but aggregation of those events is not. For example, it would be possible to aggregate events using a transaction transform (or WASM, etc.) in a form where Vicscript can be used to express exactly how the fields of those events should be joined. However, Vicscript itself will not have the capability to temporarily store events and has no awareness of delivery guarantees or transactions.
+
+## Guide Level Proposal
+
+The `vicscript` transform allows you to map your events using Vicscript. A Vicscript statement has a left hand side target path and a right hand side query:
+
+```
+[transforms.mapper]
+  inputs = [ "foo" ]
+  type = "vicscript"
+  mapping = """
+foo = "hello"
+bar = bar + 10
+baz_is_large = baz > 10
+"""
+```
+
+The most common query type is a dot separated path, describing how to reach a target field within the input event:
+
+```
+foo = bar.baz.0.buz
+```
+
+Path sections that include whitespace or other special characters (including `.`) can be quoted:
+
+```jsx
+foo = bar."baz.buz".bev
+
+# Use slash escapes within quotes
+bar = buz."fub\"\\fob".fab
+```
+
+Paths also support coalescing fields with the pipe operator `|`: if the first value is null or missing then the next value is selected, and so on:
+
+```
+foo = bar.(baz | buz).bev
+```
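To make the coalescing rule concrete, here is a minimal Python sketch of how a path such as `bar.(baz | buz).bev` could be resolved. It assumes the group `(baz | buz)` is evaluated against `bar` before `.bev` is applied, and that numeric segments index into arrays. The names `lookup` and `step` are hypothetical and not part of the proposal.

```python
from typing import Any, Sequence, Union

# A path segment is either one key or a tuple of alternative keys, so
# ("baz", "buz") stands for the coalescing group (baz | buz).
Segment = Union[str, Sequence[str]]


def step(value: Any, key: str) -> Any:
    """Descend one level; return None when the key or index is missing."""
    if isinstance(value, dict):
        return value.get(key)
    if isinstance(value, list) and key.isdigit():
        index = int(key)
        return value[index] if index < len(value) else None
    return None


def lookup(event: Any, path: Sequence[Segment]) -> Any:
    """Resolve a path, taking the first non-null alternative at each group."""
    current = event
    for segment in path:
        alternatives = [segment] if isinstance(segment, str) else list(segment)
        found = None
        for key in alternatives:
            found = step(current, key)
            if found is not None:
                break  # first value that is neither null nor missing wins
        if found is None:
            return None
        current = found
    return current
```

With `{"bar": {"baz": None, "buz": {"bev": 7}}}` as the event, `lookup(event, ["bar", ("baz", "buz"), "bev"])` falls through the null `baz` and reads `bev` from `buz`.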
+
+You can assign a value to the root of your event by writing to the `root` keyword:
+
+```
+root = nested.object
+```
+
+Vicscript supports float, int, boolean, string, null, array and object literals:
+
+```
+root = [
+  7, false, "string", null, {
+    "first": 11,
+    "second": {"foo":"bar"},
+    "third": """multiple
+lines on this
+string"""
+  }
+]
+```
+
+As well as a timestamp type:
+
+```jsx
+created_at = now().unix()
+```
+
+Boolean operators and arithmetic galore:
+
+```
+is_big = number > 100
+multiplied = number * 7
+```
+
+Perform assignments conditionally with an `if` statement:
+
+```
+sorted_foo = if foo.type() == "array" { foo.sort() }
+```
+
+And use a `match` statement for pattern matching:
+
+```
+new_doc = match doc {
+  type == "article" => article
+  type == "comment" => comment
+  _ => this
+}
+```
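One way to read the `match` example, sketched in Python under stated assumptions: the arms are tried in order, names inside the block (`type`, `article`, `this`) resolve against `doc`, and `_` always matches. The proposal does not say what happens when no arm matches, so returning `None` here is purely a placeholder. The helper names `match` and `new_doc` are hypothetical.

```python
from typing import Any, Callable, List, Optional, Tuple

# Each case pairs a predicate with a result function; a predicate of None
# plays the role of the `_` catch-all arm.
Case = Tuple[Optional[Callable[[Any], bool]], Callable[[Any], Any]]


def match(subject: Any, cases: List[Case]) -> Any:
    """Return the result of the first case whose predicate accepts subject."""
    for predicate, result in cases:
        if predicate is None or predicate(subject):
            return result(subject)
    return None  # assumption: behaviour without a matching arm is unspecified


def new_doc(doc: dict) -> Any:
    # Mirrors the three arms of the example above.
    return match(doc, [
        (lambda d: d.get("type") == "article", lambda d: d.get("article")),
        (lambda d: d.get("type") == "comment", lambda d: d.get("comment")),
        (None, lambda d: d),  # `_ => this`
    ])
```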
+
+Use a wealth of methods on values in order to perform common mutations on them:
+
+```
+sorted = foo.sort()
+uppercase = bar.uppercase()
+
+foo = bar.map_each(if this.description.contains("delete me") {
+    deleted()
+} else {
+    this
+})
+```
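The interaction between `map_each` and `deleted()` can be sketched in Python as follows. This assumes that an element mapped to `deleted()` is dropped from the resulting array; the `DELETED` sentinel and the function names are hypothetical stand-ins.

```python
from typing import Any, Callable, List

# Hypothetical sentinel standing in for the value produced by deleted().
DELETED = object()


def map_each(items: List[Any], fn: Callable[[Any], Any]) -> List[Any]:
    """Apply fn to every element, dropping elements mapped to DELETED."""
    results = []
    for item in items:
        mapped = fn(item)
        if mapped is not DELETED:
            results.append(mapped)
    return results


def drop_marked(item: dict) -> Any:
    # Mirrors: if this.description.contains("delete me") { deleted() } else { this }
    return DELETED if "delete me" in item["description"] else item
```

Calling `map_each(bar, drop_marked)` keeps every element whose description does not contain "delete me" and leaves the rest unchanged.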
+
+And, finally, create your own re-usable maps with the `map` keyword:
+
+```
+map things {
+  first  = thing_one
+  second = thing_two
+}
+
+foo = value_one.apply("things")
+bar = value_two.apply("things")
+```
+
+## Examples
+
+I've written an implementation of [an embedded CSV parser](2020-07-21-2744-vicscript/example1.coffee) using the above proposal in order to demonstrate what it looks like. This example is currently runnable with Benthos, and I've used the `.coffee` suffix because the syntax highlighting for CoffeeScript basically gives us everything we need already.
+
+## Prior Art
+
+### JQ
+
+JQ is a great tool and rightfully gets a lot of love. It also basically solves the same problem as Vicscript. Unfortunately, it's common knowledge that JQ doesn't score high on readability, and therefore it doesn't scale well at all in relation to mapping size and complexity (imagine trying to solve [https://github.com/timberio/vector/issues/1965](https://github.com/timberio/vector/issues/1965) with it).
+
+The modules syntax introduced in 1.5 helps a lot by allowing you to break your mapping down, but the syntax is still difficult to learn and awkward to read.
+
+### Tremor Script
+
+[Tremor Script](https://docs.tremor.rs/tremor-script/) is part of the [Tremor](https://docs.tremor.rs/) project and therefore is closely aligned with Vector. Of all existing alternatives this seems the most likely candidate for quick adoption as it's written in Rust and basically solves the same problem we have.
+
+Tremor Script is (obviously) designed to work with records as they're modelled within Tremor. We therefore might struggle to get it working with our own event types, and the translations to/from may have a performance penalty.
+
+It also seems like there are some key limitations with the language that we would need to contribute solutions for (or fork). The main one is simple coalescing expressions: I can't find anything in their docs to allow something like `foo = bar.(baz | buz).bev` without a nasty bunch of nested `match` expressions.
+
+As with JQ, the syntax is also far enough removed from basic C-style declarations that there's a learning curve and more difficulty in maintaining syntax highlighters.
+
+### IDML and Bloblang
+
+For a bit of background, [IDML](https://idml.io/) was created at DataSift (now a Meltwater company) specifically as a language that customers could write and maintain themselves for translating their unstructured data into a common format. The goals of this language were therefore to be as powerful as needed for the overwhelming majority of mapping use cases, easy to support and quick to adopt.
+
+Unfortunately, the only parser on offer is written in Scala and so we'd likely need to re-implement it anyway in order to get the performance we need. [Bloblang](https://www.benthos.dev/docs/guides/bloblang/about) is a spiritual cousin of IDML, but as it's written in Go it's in the same boat.
+
+However, the language itself is simple, looks clean, and is very easy to pick up. This RFC is therefore mostly going to be a copy of Bloblang, with the opportunity to omit or add features as we see fit.
+
+## Drawbacks
+
+### It's more to support
+
+A key role for us as Vector maintainers is to assist users with their configs and thereby encourage adoption. It stands to reason that making Vector as a project larger would conflict with our ability to do this. However, we already have a range of transforms for mapping and after reviewing some large example configs ([https://github.com/timberio/vector/issues/1965](https://github.com/timberio/vector/issues/1965)) I think scrapping them in favor of a more condensed and readable language would be overall beneficial to us when we're sporting our support hats.
+
+The spec for Vicscript doesn't need to be large, in fact the entire purpose of it is to remain minimal, but the reality is that this is something "more" we're going to need to learn and work with.
+
+### It's a lot of effort
+
+Vicscript is an entirely new language spec where we'll need to implement both the parser and executor, which is clearly going to be a significant chunk of work. The obvious mitigation to this is to simply try it out and put a time cap on it, then review how far along the project got. This is only a speculative drawback and it's only going to draw us back once (until we decide to rewrite it for fun).
+
+## Rationale
+
+I think it's clear that there's a benefit to introducing a mapping language to Vector, and therefore the main point of contention here is whether to adopt something else that roughly fits, or to build something bespoke.
+
+Mapping events is a core feature of Vector, and once we introduce a performant and more powerful alternative to our existing mapping transforms it's reasonable to expect it to become the standard for small or moderate sized pipelines. As such, I think it's important for us to have strong ownership of this language. It will allow us to build it into exactly the right solution for Vector users. It will also allow us to guarantee its performance and stability in the future, and provide first class support for it.
+
+Given that most of the risk around this idea is simply how difficult it will be, I'd say the obvious first step is to test that out with a proof of concept. We can start with a quick implementation of a small subset of the spec and build a parser and execution engine for review. If we end up with nothing, or with a hostile codebase, then we can instead look into adopting other projects and compare.
+
+## Plan of Attack
+
+1. Dip our toes and build a new transform with a parser using something like [nom](https://github.com/Geal/nom). This should only go as far as a tiny subset of the spec, where we allow basic operations that are roughly equivalent to our existing mapping transforms and work directly to/from our internal event type.
+2. Review the parser and the codebase, and compare the performance of this transform with the existing mapping transforms. If it sucks then it goes in the bin and we investigate using other existing alternatives.
+3. If instead we're happy with it then at this point we should consider breaking the project out of Vector. Since the spec is mostly a copy of Bloblang it would be sensible to share efforts for community tooling such as syntax highlighters, linters, documentation, etc. Breaking our parser out as a reference Rust implementation would help build that community.
+4. Gradually expand the language with more advanced features such as match and if statements, as a BETA transform, taking user feedback on board as we move.
+5. Once we're happy we remove the BETA flag and phase out the existing transforms that are no longer necessary. We could opt to leave the implementations in for backwards compatibility, but the docs should be removed.
+6. Finally, we can start reusing Vicscript in other places such as at the source/sink level.
diff --git a/rfcs/2020-07-21-2744-vicscript/example1.coffee b/rfcs/2020-07-21-2744-vicscript/example1.coffee
@@ -0,0 +1,43 @@
+# Problem: My nan likes to send me CSV files embedded within JSON documents:
+#
+# {"items":"item,count\napples,10\noranges,2\n","doc":{"title":"shopping list","description":"get me this stuff"}}
+#
+# I want to parse and expand the csv doc within items like so:
+#
+# {
+#   "doc":{
+#     "title":"shopping list",
+#     "description":"get me this stuff",
+#     "items": [
+#       {"item":"apples","count":10},
+#       {"item":"oranges","count":2}
+#     ]
+#   }
+# }
+#
+# Note: this example includes some of the more advanced mapping functions from
+# Bloblang such as enumerated and map_each. In a real world scenario this
+# example would be replaced with something bespoke like items.parse_csv().
+#
+
+# First, copy the unchanged contents of doc to our new event.
+doc = doc
+
+# Next, parse the csv out into an array of arrays to a temporary variable.
+let rows = items.split("\n").map_each(match this.trim() {
+    this.length() == 0 => deleted(), # Remove empty lines
+    _ => this.split(","),
+})
+
+# The first row is column names
+let column_names = $rows.0
+
+# And here's the meaty part where we bring it all together. We walk each element
+# of our array of arrays of values, and enumerate the value array. Then, using
+# the index of the value, we create a temporary object with a key taken from the
+# column_names variable and fold it. 
+doc.items = $rows.slice(1).map_each(
+  this.enumerated().fold({}, tally.merge({
+      $column_names.index(value.index): value.value
+  }))
+)
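For comparison, the mapping above can be traced in plain Python. This is only a sketch of the same steps (split, drop empty lines, zip each row with the header), with `expand_items` a hypothetical name. Note that the mapping as written never converts `count`, so this sketch yields strings; matching the numeric `10` and `2` in the header comment would presumably need an explicit conversion step.

```python
from typing import Any, Dict, List


def expand_items(event: Dict[str, Any]) -> Dict[str, Any]:
    # doc = doc: copy the unchanged contents of doc to the new event.
    doc = dict(event["doc"])

    # let rows = ...: trim each line, drop the empty ones, split on commas.
    trimmed = (line.strip() for line in event["items"].split("\n"))
    rows: List[List[str]] = [line.split(",") for line in trimmed if line]

    # let column_names = $rows.0: the first row holds the column names.
    column_names = rows[0]

    # doc.items = ...: pair each value with the column name at its index,
    # which is what the enumerated/fold/merge chain builds up step by step.
    doc["items"] = [
        {column_names[index]: value for index, value in enumerate(row)}
        for row in rows[1:]
    ]
    return {"doc": doc}
```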