chore: Remap RFC (vectordotdev#3134)
Signed-off-by: Brian Menges <[email protected]>
Jeffail authored and Brian Menges committed Dec 9, 2020
1 parent c7efb6a commit e9f8dc2
138 changes: 138 additions & 0 deletions rfcs/2020-07-21-2744-remapping-syntax.md
# RFC #2744 - Remapping Syntax

## Motivation

Simple transforms are easy to document, understand and recommend, while powerful transforms are more likely to solve a use case concisely. We currently have two tiers of transform at opposite ends of this spectrum:

- Pure TOML (`add_fields`, `remove_fields`, `json_parser`)
- Turing-complete runtimes (`lua`, the new WASM transform)

This gives us decent coverage for very simple use cases where only one or two native transforms are needed, and for very advanced cases where a user just needs a framework for running a program they intend to write and maintain themselves.

However, there's a common middle ground where more complex and often conditional mappings are needed. This is where our native transforms become cumbersome to work with, yet a full runtime would be heavy-handed and difficult to support (and often simply doesn't perform well enough).

The main motivation for the `remap` transform is to introduce a tool that strikes a better balance by offering a simple language that isn't Turing complete and runs as fast as our native transforms: something that's easy to document, understand, use, diagnose and fix.

My proposal here is basically just a copy of the bits I think work well from [Bloblang](https://www.benthos.dev/docs/guides/bloblang/about), which itself is a derivative of [IDML](https://idml.io/). This is partly due to my familiarity with these languages, but I also have first-hand experience of seeing them in action and know that they are easy to implement and adopt.

### Out of Scope

Event expansion and reduction are within the scope, but routing events is not. For example, it would be possible to express with `remap` that an event should be expanded into several, and then at the sink level filter events (potentially using `remap` again) so that they route to different destinations based on their contents. However, the `remap` transform itself will not have any awareness of sinks, only events.

Joining events is within scope, but aggregation of those events is not. For example, it would be possible to aggregate events using a transaction transform (or wasm, etc) in a form where `remap` can be used to express exactly how the fields of those events should be joined. However, `remap` itself will not have the capability to temporarily store events and has no awareness of delivery guarantees or transactions.

## Guide Level Proposal

The `remap` transform allows you to mutate events by defining a sequence of mapping statements. Each mapping statement describes a read-only query operation on the right hand side, and on the left hand side a destination for the query result to be mapped within the resulting event:

```toml
[transforms.mapper]
inputs = [ "foo" ]
type = "remap"
mapping = """
.foo = "hello"
.bar = .bar + 10
.baz_is_large = .baz > 10
"""
```

The most common mapping statement is an assignment of a query result to a dot separated path, and the most common query type is a dot separated path describing how to reach a target field within the input event:

```coffee
.foo = .bar.baz.0.buz
```
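As a rough illustration of what such a path query does (this is not part of the proposal, and Python is used only for brevity), the sketch below walks a pre-split list of path segments into a nested event. The helper name `get_path`, the representation of the event as plain dicts and lists, and the choice to return `None` for a missing field are all assumptions made for the example.

```python
from typing import Any


def get_path(event: Any, segments: list[str]) -> Any:
    """Walk `segments` into nested dicts/lists; return None if a step is missing."""
    current = event
    for segment in segments:
        if isinstance(current, dict):
            if segment not in current:
                return None
            current = current[segment]
        elif isinstance(current, list) and segment.isdecimal():
            # A numeric segment such as `0` indexes into an array.
            index = int(segment)
            if index >= len(current):
                return None
            current = current[index]
        else:
            # Tried to descend into a scalar, or used a non-numeric key on an array.
            return None
    return current


event = {"bar": {"baz": [{"buz": "found"}]}}

# Hypothetical equivalent of the query `.bar.baz.0.buz`
result = get_path(event, ["bar", "baz", "0", "buz"])
```

Here `result` holds the value stored under `buz`, and a query for a field that does not exist resolves to `None` rather than raising an error.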

Path sections that include whitespace or other special characters (including `.`) can be quoted:

```coffee
.foo = .bar."baz.buz".bev

# Use slash escapes within quotes
.bar = .buz."fub\"\\fob".fab
```
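To make the quoting rules concrete, here is a minimal sketch of how a path could be split into segments. It is written in Python purely for illustration; the function name `split_path` and its error handling are assumptions, and a real implementation would live in the parser proper. A backslash inside quotes is taken to mean "use the next character literally", which matches the escapes shown above.

```python
def split_path(path: str) -> list[str]:
    """Split a dot-separated path into segments, honouring quotes and escapes."""
    segments: list[str] = []
    i, n = 0, len(path)
    while i < n:
        if path[i] != ".":
            raise ValueError(f"expected '.' at offset {i}")
        i += 1
        chars: list[str] = []
        if i < n and path[i] == '"':
            i += 1  # skip the opening quote
            while i < n and path[i] != '"':
                if path[i] == "\\":
                    i += 1  # take the character after the backslash literally
                    if i >= n:
                        raise ValueError("dangling escape at end of path")
                chars.append(path[i])
                i += 1
            if i >= n:
                raise ValueError("unterminated quoted segment")
            i += 1  # skip the closing quote
        else:
            while i < n and path[i] != ".":
                chars.append(path[i])
                i += 1
        segments.append("".join(chars))
    return segments


plain = split_path('.bar."baz.buz".bev')
escaped = split_path(r'.buz."fub\"\\fob".fab')
```

The quoted segment `"baz.buz"` comes back as a single segment containing a dot, and the escaped example yields a middle segment containing a literal quote and a literal backslash.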

Paths also support coalescing fields with the pipe operator `|`: if the first value is null or missing then the next value is selected, and so on:

```coffee
.foo = .bar.(baz | buz).bev
```
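A minimal sketch of the coalescing behaviour, again in Python and only as an illustration. The names `resolve` and `Segment` are hypothetical, a coalesce group is modelled as a tuple of alternative keys, and array indexing is left out to keep the example short. One assumption worth flagging: an alternative is chosen based only on whether its own value is null or missing, not on whether the rest of the path resolves beneath it.

```python
from typing import Any, Union

# A plain key, or a tuple of alternative keys for a coalesce group.
Segment = Union[str, tuple]


def resolve(event: Any, segments: list) -> Any:
    """Resolve a path where a tuple segment picks the first non-null alternative."""
    current = event
    for segment in segments:
        alternatives = segment if isinstance(segment, tuple) else (segment,)
        selected = None
        for key in alternatives:
            selected = current.get(key) if isinstance(current, dict) else None
            if selected is not None:
                break  # first alternative that is neither null nor missing
        if selected is None:
            return None
        current = selected
    return current


# Hypothetical equivalent of `.bar.(baz | buz).bev`
path = ["bar", ("baz", "buz"), "bev"]

only_buz = resolve({"bar": {"buz": {"bev": 1}}}, path)
null_baz = resolve({"bar": {"baz": None, "buz": {"bev": 2}}}, path)
both = resolve({"bar": {"baz": {"bev": "a"}, "buz": {"bev": "b"}}}, path)
```

When `baz` is missing or null the lookup falls through to `buz`; when both are present, `baz` wins because it is listed first.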

The language supports float, int, boolean, string, null, array and object literals:

```coffee
.first = 7
.second = false
.third = "string"
.fourth = null
```

Comparison and arithmetic operators are supported:

```coffee
.is_big = .number > 100
.multiplied = .number * 7
```

Remove fields with the `del` function:

```text
del(.foo)
```

Each mapping line is executed sequentially, with the event result of each line fed into the next.
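
The sequential semantics can be sketched as follows. This is an illustration in Python rather than a proposed implementation: the helpers `assign`, `delete` and `run_mapping` are hypothetical, statements only touch top-level fields, and copying the input event before mutating it is an assumption of the sketch.

```python
import copy
from typing import Any, Callable

Event = dict
Statement = Callable[[Event], None]


def assign(field: str, query: Callable[[Event], Any]) -> Statement:
    """Build a statement that stores the query result under `field`."""
    def statement(event: Event) -> None:
        event[field] = query(event)
    return statement


def delete(field: str) -> Statement:
    """Build a statement that removes `field` if it is present."""
    def statement(event: Event) -> None:
        event.pop(field, None)
    return statement


def run_mapping(statements: list, event: Event) -> Event:
    """Apply each statement in order; each one sees the result of the previous."""
    result = copy.deepcopy(event)  # assumption: the input event is left untouched
    for statement in statements:
        statement(result)
    return result


mapping = [
    assign("foo", lambda e: "hello"),                 # .foo = "hello"
    assign("bar", lambda e: e["bar"] + 10),           # .bar = .bar + 10
    assign("bar_is_large", lambda e: e["bar"] > 15),  # reads the updated .bar
    delete("baz"),                                    # del(.baz)
]

source = {"bar": 7, "baz": 1}
output = run_mapping(mapping, source)
```

Because the third statement runs after the second, it compares the already-incremented value of `bar` (17) rather than the original (7), which is the behaviour the sentence above describes.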

## Prior Art

### JQ

JQ is a great tool and rightfully gets a lot of love. It also basically solves the same problem as `remap`. Unfortunately, it's common knowledge that JQ doesn't score high on readability, and therefore it doesn't scale well at all in relation to mapping size and complexity (imagine trying to solve [https://github.com/timberio/vector/issues/1965](https://github.com/timberio/vector/issues/1965) with it).

The modules syntax introduced in 1.5 helps a lot by allowing you to break your mapping down, but the syntax is still difficult to learn and awkward to read.

### Tremor Script

[Tremor Script](https://docs.tremor.rs/tremor-script/) is part of the [Tremor](https://docs.tremor.rs/) project and therefore is closely aligned with Vector. Of all existing alternatives this seems the most likely candidate for quick adoption as it's written in Rust and basically solves the same problem we have.

Tremor Script is (obviously) designed to work with records as they're modelled within Tremor. We therefore might struggle to get it working with our own event types, and the translations to/from them may carry a performance penalty.

It also seems like there are some key limitations with the language that we would need to contribute solutions for (or fork). The main one is simple coalescing expressions: I can't find anything in their docs to allow something like `.foo = .bar.(baz | buz).bev` without a nasty bunch of nested `match` expressions.

Similar to JQ, the syntax is also far enough removed from basic C-style declarations that there's a learning curve and more difficulty in maintaining syntax highlighters.

### IDML and Bloblang

For a bit of background, [IDML](https://idml.io/) was created at DataSift (now a Meltwater company) specifically as a language that customers could write and maintain themselves for translating their unstructured data into a common format. The goals of this language were therefore to be as powerful as needed for the overwhelming majority of mapping use cases, easy to support and quick to adopt.

Unfortunately, the only parser on offer is written in Scala and so we'd likely need to re-implement it anyway in order to get the performance we need. [Bloblang](https://www.benthos.dev/docs/guides/bloblang/about) is a spiritual cousin of IDML, but as it's written in Go it's in the same boat.

However, the language itself is simple, looks clean, and is very easy to pick up. This RFC is therefore mostly going to be a copy of Bloblang, with the opportunity to omit or add features as we see fit.

## Drawbacks

### It's more to support

A key role for us as Vector maintainers is to assist users with their configs and thereby encourage adoption. It stands to reason that making Vector as a project larger would conflict with our ability to do this. However, we already have a range of transforms for mapping and after reviewing some large example configs ([https://github.com/timberio/vector/issues/1965](https://github.com/timberio/vector/issues/1965)) I think scrapping them in favor of a more condensed and readable language would be overall beneficial to us when we're sporting our support hats.

The spec for the `remap` mapping language doesn't need to be large. In fact, the entire purpose of it is to remain minimal, but the reality is that this is something "more" we're going to need to learn and work with.

### It's a lot of effort

This is an entirely new language spec where we'll need to implement both the parser and executor, which is clearly going to be a significant chunk of work. The obvious mitigation to this is to simply try it out and put a time cap on it, then review how far along the project got. This is only a speculative drawback and it's only going to draw us back once (until we decide to rewrite it for fun).

## Rationale

I think it's clear that there's a benefit to introducing a mapping language to Vector, and therefore the main point of contention here is whether to adopt something else that roughly fits, or to build something bespoke.

Mapping events is a core feature of Vector, and once we introduce a performant and more powerful alternative to our existing mapping transforms it's reasonable to expect it to become the standard for small or moderate sized pipelines. As such, I think it's important for us to have strong ownership of this language. It will allow us to build it into exactly the right solution for Vector users. It will also allow us to guarantee its performance and stability in the future, and provide first class support for it.

Given that most of the risk around this idea is simply how difficult it will be, I'd say the obvious first step is to test that out with a proof of concept. We can start with a quick implementation of a small subset of the spec and build a parser and execution engine for review. If we end up with nothing, or with a hostile codebase, then we can instead look into adopting other projects and compare.

## Plan of Attack

1. Dip our toes and build a new transform with a parser using something like [nom](https://github.com/Geal/nom). This should only go as far as a tiny subset of the spec, where we allow basic operations that are roughly equivalent to our existing mapping transforms and work directly to/from our internal event type.
2. Review the parser and the codebase, and compare the performance of this transform with the existing mapping transforms. If it sucks then it goes in the bin and we investigate using other existing alternatives.
3. Write a full and formal specification of the language and its behavior. This does not need to include advanced features such as maps, match statements, etc. However, it would be wise to consider these advanced features when finalizing the spec.
4. Since the spec is mostly a copy of Bloblang it would be sensible to share efforts for community tooling such as syntax highlighters, linters, documentation, etc. Breaking our parser out as a reference Rust implementation would help build that community.
5. Gradually expand the language with more advanced features such as match and if statements, as a BETA transform, taking user feedback on board as we move.
6. Once we're happy, we remove the BETA flag and phase out the existing transforms that are no longer necessary. We could opt to leave the implementations in for backwards compatibility, but the docs should be removed.
7. Finally, we can start reusing the mapping language in other places such as at the source/sink level.
