We extend the Remap language so that it can determine at compile-time which value type(s) a program returns on completion, so that we have fewer runtime failure conditions when using Remap within Vector.
This RFC proposes extending the Expression trait. It also explains how we use these changes when using Remap in the context of a Vector Condition. It will hint at future uses, but does not consider them in-depth.
The proposal does not consider fallible vs infallible expressions; those are left for a future RFC.
It also does not consider using compile-time type checking for function arguments, although the plumbing introduced through this RFC will make such a thing possible in a follow-up RFC.
As we expand the use-case for the Remap language within Vector, we want to be able to guide users as much as possible as they have their first experience with the language in Vector.
Since the language is used in different contexts that expect different outcomes of a program, the chances of a program failing at runtime grow, which can result in undetected failures if the log output of Vector isn't observed.
We should strive for as many "boot-time" errors as possible, so that problems are detected early on, before any data processing starts.
Specifically, we support using the Remap language within Vector in any context
that takes a so-called "condition". These conditions allow components to perform
a check against an event, and return a boolean outcome. For example, in the
swimlanes transform, the check determines if an event should be included in a
specific swimlane.
Currently, any non-boolean return value of a remap program results in a failure of the condition, which can be confusing to users.
- source: The Remap source code written by users, to compile and execute.
- program: A list of expressions, compiled from source and a list of functions, ready to run against an object.
- expression: A piece of source code that resolves to a value.
- value: A concrete value type, such as string, integer, boolean, etc.
- object: A generic term for whatever outside data container the program manipulates (such as an "event").
Remap is an expression-based language. Everything, including statements, is an expression. Meaning, everything can return a value.
For example, the program .foo = "bar" consists of two expressions:
- The literal expression "bar", which resolves to a Value::String.
- The assignment expression .foo = …, which resolves to the value of the expression on the right side of the assignment operator.
This example program, when executed, always resolves to a String value.
Another example program is floor(.bar). This program again consists of two expressions:
- The query path expression .bar, which resolves to any value the bar object field contains.
- The function expression floor, which resolves to either an Integer or a Float.
Since the floor function is the final expression that runs in the program, the outcome of a successful execution of the program is either an Integer or a Float.
Expressions are implemented using the Expression trait:
pub trait Expression: Send + Sync {
    fn execute(&self, state: &mut State, object: &mut dyn Object) -> Result<Option<Value>>;
}
Expressions can be fallible (not covered in this RFC), can return None (not covered in-depth in this RFC), or return a resolved Value.
To allow the execution context of a program to determine the outcome of running the program, we extend the Expression trait as follows:
pub trait Expression: Send + Sync {
    fn execute(&self, state: &mut State, object: &mut dyn Object) -> Result<Option<Value>>;

    /// Describes which [`ValueKind`]s the expression can resolve to.
    ///
    /// bike-shedding of function and enum names welcome.
    fn resolves_to(&self) -> ResolveKind;
}
We define ResolveKind as follows:
pub enum ResolveKind {
    /// The expression can resolve to any value at runtime.
    Any,

    /// The expression resolves to one of the defined [`ValueKind`]s.
    OneOf(Vec<ValueKind>),

    /// If the expression succeeds, it might resolve to a value, but doesn't
    /// have to.
    Maybe(Box<ResolveKind>),
}
The ResolveKind enum allows us to determine if an expression resolves to a predefined list of values. It also defines if the expression can resolve to "nothing" (e.g. None).
ValueKind is defined as follows:
pub enum ValueKind {
    String,
    Integer,
    Float,
    Boolean,
    Timestamp,
    Map,
    Array,
}
As can be seen from the ValueKind definition, it is not possible to define the value kinds contained within a map or array. Only the top-level value kind of an expression can be defined.
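To illustrate the top-level-only limitation, here is a minimal sketch that maps a runtime value to its kind. The Value enum below is a simplified stand-in (it has no timestamp variant, and its payload types are assumptions), and the kind method is a hypothetical helper that this RFC does not propose.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ValueKind {
    String,
    Integer,
    Float,
    Boolean,
    Timestamp,
    Map,
    Array,
}

/// Simplified stand-in for the runtime value type (assumption: the real
/// type also has a timestamp variant and may use different payload types).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    String(String),
    Integer(i64),
    Float(f64),
    Boolean(bool),
    Map(BTreeMap<String, Value>),
    Array(Vec<Value>),
}

impl Value {
    /// Hypothetical helper: report only the top-level kind of a value.
    /// The contents of maps and arrays are deliberately not inspected.
    pub fn kind(&self) -> ValueKind {
        match self {
            Value::String(_) => ValueKind::String,
            Value::Integer(_) => ValueKind::Integer,
            Value::Float(_) => ValueKind::Float,
            Value::Boolean(_) => ValueKind::Boolean,
            Value::Map(_) => ValueKind::Map,
            Value::Array(_) => ValueKind::Array,
        }
    }
}
```

An array holding an integer and a string reports ValueKind::Array, exactly like an array of booleans would; the element kinds are not visible at this level.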
Some expressions don't have their own return value, but instead resolve to the return value of an expression they themselves execute.
One such example is the IfStatement expression, which resolves to either the true_expression or the false_expression.
Because expressions are initialized at compile-time, we are able to have such "transient expressions" return the values they resolve to based on whatever expressions they themselves hold:
impl Expression for IfStatement {
    fn resolves_to(&self) -> ResolveKind {
        let true_resolves = self.true_expression.resolves_to();
        let false_resolves = self.false_expression.resolves_to();

        // return the combined set of true and false resolves.
    }

    // …
}
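One possible way to compute "the combined set" of two branches is sketched below. The merge method is a hypothetical helper that this RFC does not define, and the rules it applies (Any absorbs everything, Maybe is kept if either branch has it) are assumptions rather than settled behavior.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ValueKind {
    String,
    Integer,
    Float,
    Boolean,
    Timestamp,
    Map,
    Array,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResolveKind {
    Any,
    OneOf(Vec<ValueKind>),
    Maybe(Box<ResolveKind>),
}

impl ResolveKind {
    /// Hypothetical helper: the resolve kind of an expression that resolves
    /// to either `self` or `other`, such as the two branches of an if-statement.
    pub fn merge(self, other: ResolveKind) -> ResolveKind {
        match (self, other) {
            // If either side can resolve to nothing, so can the combination.
            (ResolveKind::Maybe(a), ResolveKind::Maybe(b)) => {
                ResolveKind::Maybe(Box::new((*a).merge(*b)))
            }
            (ResolveKind::Maybe(a), b) => ResolveKind::Maybe(Box::new((*a).merge(b))),
            (a, ResolveKind::Maybe(b)) => ResolveKind::Maybe(Box::new(a.merge(*b))),
            // If either side can be any value, nothing more precise is known.
            (ResolveKind::Any, _) | (_, ResolveKind::Any) => ResolveKind::Any,
            // Otherwise take the union of both lists, without duplicates.
            (ResolveKind::OneOf(mut a), ResolveKind::OneOf(b)) => {
                for kind in b {
                    if !a.contains(&kind) {
                        a.push(kind);
                    }
                }
                ResolveKind::OneOf(a)
            }
        }
    }
}
```

With such a helper, the body of resolves_to for IfStatement could end in true_resolves.merge(false_resolves).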
The same applies to other expressions, such as Assignment, Block, and more.
To allow a program to define the expected outcome, we expand the Program struct.
This struct currently looks like this:
pub struct Program {
    pub(crate) expressions: Vec<Expr>,
}
To initialize a program, you call the new function:
pub fn new(source: &str, function_definitions: &[Box<dyn Function>]) -> Result<Self>;
The program then parses the source to a list of expressions, the final expression resolving to the result of the program.
We'll extend the program to allow defining which value kinds the final expression is allowed to resolve to:
(note that a Remap program always runs to completion, unless one of the earlier expressions returns an error, in which case there is no resolved value)
pub struct Program {
    pub(crate) resolves_to: ResolveKind,
    pub(crate) expressions: Vec<Expr>,
}
This allows us to check the ResolveKind of the final expression against the one configured for the program. If the former falls within the bounds of the latter, the program is considered valid, and no boot-time error occurs. Otherwise, we inform the user that they need to update their script to return a more precise value kind.
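A minimal sketch of such a bounds check follows. The is_within method and the check_final_expression function are hypothetical names, and the rules are assumptions: an expectation of Any accepts every definite value, and an expression that may resolve to nothing is only accepted where the expectation is a Maybe.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ValueKind {
    String,
    Integer,
    Float,
    Boolean,
    Timestamp,
    Map,
    Array,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResolveKind {
    Any,
    OneOf(Vec<ValueKind>),
    Maybe(Box<ResolveKind>),
}

impl ResolveKind {
    /// Hypothetical helper: true if every possible outcome of `self` is
    /// allowed by `expected`.
    pub fn is_within(&self, expected: &ResolveKind) -> bool {
        match (self, expected) {
            // An optional outcome is fine where an optional one is expected.
            (ResolveKind::Maybe(actual), ResolveKind::Maybe(allowed)) => actual.is_within(allowed),
            // A definite outcome is fine where an optional one is expected.
            (actual, ResolveKind::Maybe(allowed)) => actual.is_within(allowed),
            // "Might resolve to nothing" is rejected where a value is required.
            (ResolveKind::Maybe(_), _) => false,
            (_, ResolveKind::Any) => true,
            // An unknown kind cannot be proven to match a specific list.
            (ResolveKind::Any, ResolveKind::OneOf(_)) => false,
            (ResolveKind::OneOf(actual), ResolveKind::OneOf(allowed)) => {
                actual.iter().all(|kind| allowed.contains(kind))
            }
        }
    }
}

/// Hypothetical boot-time check of the final expression of a program.
pub fn check_final_expression(actual: &ResolveKind, expected: &ResolveKind) -> Result<(), String> {
    if actual.is_within(expected) {
        Ok(())
    } else {
        Err(format!(
            "program resolves to {:?}, but {:?} is expected",
            actual, expected
        ))
    }
}
```

Under these assumptions, a program ending in floor(.bar) would be rejected where a boolean is expected, because neither Integer nor Float is in the allowed list.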
One more example we need to consider is variable or path assignments, and how those influence compile-time type checking.
Given this example:
$foo = true
$foo
This produces three expressions:
- The Literal expression true, which is of type Boolean.
- The Assignment expression $foo = …, which returns a type Boolean.
- The Variable expression $foo, which returns type Boolean.
Given the above expressions, it would seem to follow that the compiler can infer that this program returns a Boolean value.
However, this is not the case.
The current implementation of the Assignment expression stores the target identifier (foo), and the expression to store (true) at compile-time. At runtime, it resolves the expression, and stores the resulting value in a global runtime store, either as a variable or path with the identifier foo.
The Variable expression also stores the target identifier (foo) at compile-time. At runtime it queries the global runtime store for a variable named foo, and retrieves the value stored by the assignment expression.
The problem is that the Variable expression on line 2 only knows the variable identifier at compile-time. Because the global runtime store is not available at compile-time, there is no way to infer which value kind the variable foo will resolve to at runtime.
To solve this problem we'll introduce a compile-time store to keep track of resolve kinds for targets (variables and paths):
struct CompilerState {
    variables: HashMap<String, ResolveKind>,
    // …
}
We'll also update the resolves_to function to take a reference to CompilerState.
pub trait Expression: Send + Sync {
    fn resolves_to(&self, state: &CompilerState) -> ResolveKind;
}
This allows (as an example) the Variable expression to fetch the resolve kind of a variable:
impl Expression for Variable {
    fn resolves_to(&self, state: &CompilerState) -> ResolveKind {
        state.variable(&self.ident).unwrap_or(ResolveKind::Any)
    }
}
The Assignment expression will store the resolve kind of a variable on initialization:
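impl Assignment {
    fn new(ident: String, expression: Box<dyn Expression>, state: &mut CompilerState) -> Self {
        // Record the resolve kind of the assigned expression at compile-time.
        let resolve = expression.resolves_to(state);
        // Clone the identifier, since it is also stored in the expression itself.
        state.variables_mut().insert(ident.clone(), resolve);

        Self { ident, expression }
    }
}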
The same principle applies to query paths.
After this change, using the original example:
$foo = true
$foo
The compiler now knows that the variable $foo always resolves to a Boolean kind.
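The pieces above can be put together in a small, self-contained sketch. It reduces the Expression trait to resolves_to only, and it introduces a BooleanLiteral type and a public variables field purely for illustration; none of these simplifications are part of the proposal.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ValueKind {
    String,
    Integer,
    Float,
    Boolean,
    Timestamp,
    Map,
    Array,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResolveKind {
    Any,
    OneOf(Vec<ValueKind>),
    Maybe(Box<ResolveKind>),
}

/// Compile-time store that tracks the resolve kind of each variable.
#[derive(Debug, Default)]
pub struct CompilerState {
    pub variables: HashMap<String, ResolveKind>,
}

/// Reduced trait for this sketch: `execute` is left out.
pub trait Expression: Send + Sync {
    fn resolves_to(&self, state: &CompilerState) -> ResolveKind;
}

/// Hypothetical literal expression for `true` or `false`.
pub struct BooleanLiteral;

impl Expression for BooleanLiteral {
    fn resolves_to(&self, _state: &CompilerState) -> ResolveKind {
        ResolveKind::OneOf(vec![ValueKind::Boolean])
    }
}

pub struct Assignment {
    pub ident: String,
    pub expression: Box<dyn Expression>,
}

impl Assignment {
    pub fn new(ident: String, expression: Box<dyn Expression>, state: &mut CompilerState) -> Self {
        // Record what the variable resolves to while compiling.
        let resolve = expression.resolves_to(state);
        state.variables.insert(ident.clone(), resolve);
        Self { ident, expression }
    }
}

impl Expression for Assignment {
    fn resolves_to(&self, state: &CompilerState) -> ResolveKind {
        // An assignment resolves to whatever its right-hand side resolves to.
        self.expression.resolves_to(state)
    }
}

pub struct Variable {
    pub ident: String,
}

impl Expression for Variable {
    fn resolves_to(&self, state: &CompilerState) -> ResolveKind {
        // Variables the compiler has not seen could hold anything.
        state
            .variables
            .get(&self.ident)
            .cloned()
            .unwrap_or(ResolveKind::Any)
    }
}
```

Compiling the assignment first and the variable second mirrors the order of the two lines in the example program, which is what lets the second expression look up the kind recorded by the first.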
If a script is used in a Remap function within a Condition, the Remap program is expected to return a Maybe(Boolean) resolve kind. A Some(true) value passes the condition; Some(false) and None fail the condition.
Any other resolve kind is a boot-time error. Users can fix it, among other ways, by explicitly checking for … == … (e.g. using the Equality expression), or by using a function such as to_bool.
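The runtime half of this rule can be sketched as follows. The Value enum is a reduced stand-in with only two variants, and condition_resolve_kind and condition_passes are hypothetical names. Treating a non-boolean value as a failed condition is an assumption that matches the current behavior; with the boot-time check in place, such a value should not reach this point.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ValueKind {
    String,
    Integer,
    Float,
    Boolean,
    Timestamp,
    Map,
    Array,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResolveKind {
    Any,
    OneOf(Vec<ValueKind>),
    Maybe(Box<ResolveKind>),
}

/// Reduced stand-in for the runtime value type (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Boolean(bool),
    String(String),
}

/// The resolve kind a condition would expect from its program.
pub fn condition_resolve_kind() -> ResolveKind {
    ResolveKind::Maybe(Box::new(ResolveKind::OneOf(vec![ValueKind::Boolean])))
}

/// Maps the outcome of a program to the outcome of the condition:
/// only `Some(true)` passes.
pub fn condition_passes(result: Option<Value>) -> bool {
    matches!(result, Some(Value::Boolean(true)))
}
```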
The belief is that the proposed implementation is straightforward enough, and causes so little churn, that the advantages (more boot-time checks) outweigh the cost of adding more complexity to the language implementation (there is no change in the usage or syntax of the language).
The biggest drawback is that each expression (and thus each new function) has to implement the new resolves_to method, which increases the work needed to add new functions to the language.
Other than that, the usual drawback of maintaining the code we write applies here.
We can forgo adding a compile-time check on the return value of a program. This means we can't guide users on how they should use the language within a given context, requiring them to check their Vector log output to know when a program fails.
The biggest downside to this is that the program can fail for one event, but not for another. Making sure things work for the first x events that are processed is therefore not enough: you need to add monitoring to be aware of script failures for events where a given event path did not hold the expected value kind.
An alternative solution is to force a program to use a specific final expression implementation (or one of several).
For example, if we want a program to always return a boolean, we can require the final expression to be one of Equality, Comparison, or the to_bool function.
The problems with this are:
- It puts a continuous burden on us to keep those program call-sites accurate. If we add a new my_func function which also always resolves to a boolean, it must be explicitly allowed to work within a given context.
- Because functions are provided when a program is compiled (the Remap language itself does not contain any functions), there are no guarantees that to_bool is available, which makes this more tedious to get right.
A program could be allowed to convert the final expression as it sees fit (through some kind of function that takes an Expression and returns a Value).
It could then match on whatever Value the final expression resolved to, and do whatever it needs to, to convert that to the acceptable value kind (such as a boolean) before returning the final value of the program.
This is probably the easiest to implement, but it will manipulate the outcome of a program without the knowledge of the user, which can result in unexpected situations. It can also still result in runtime failures if whatever value the final expression returns cannot be converted to whatever value the program is expected to resolve to.
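To make both drawbacks concrete, here is a sketch of what such a conversion might look like for a boolean expectation. The Value enum is a reduced stand-in and the conversion rules are invented purely for illustration; this RFC does not specify any.

```rust
/// Reduced stand-in for the runtime value type (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    String(String),
    Integer(i64),
    Boolean(bool),
}

/// Hypothetical conversion applied to the final value of a program.
/// The rules below are made up for this sketch.
pub fn coerce_to_boolean(value: Value) -> Result<bool, String> {
    match value {
        Value::Boolean(boolean) => Ok(boolean),
        // Silent manipulation: the user never asked for 0 to mean "false".
        Value::Integer(integer) => Ok(integer != 0),
        Value::String(string) => match string.as_str() {
            "true" => Ok(true),
            "false" => Ok(false),
            // Runtime failure: not every value has a sensible conversion.
            other => Err(format!("cannot convert string {:?} to a boolean", other)),
        },
    }
}
```

An integer result of 0 silently turns into false, while a string such as "maybe" still fails, and only once an event that produces it is processed.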
- None, pending review feedback.
- Push initial proof-of-concept demonstration of proposed solution.
- Convert all existing expressions to implement resolves_to.
- Update Condition Remap usage to expect programs to return a Boolean value.