
VRL is_string function does not imply data type in filter transform #9406

Closed
jerome-kleinen-kbc-be opened this issue Oct 1, 2021 · 6 comments
Labels
type: bug A code related bug.

Comments


Vector Version

version="0.16.1" arch="x86_64" build_id="739e878 2021-08-26"

Vector Configuration File

[sources.input]
  type = "stdin"

[transforms.filter]
  type = "filter"
  inputs = ["input"]
  condition = '''is_string(.message) && !contains(.message, s'test')'''

[sinks.output]
  type = "console"
  inputs = ["filter"]
  # Encoding
  encoding.codec = "text"
  encoding.only_fields = ["message"]

Debug Output

Oct 01 15:36:09.986 ERROR vector::topology: Configuration error. error=Transform "filter":
error[E100]: unhandled error
  ┌─ :1:1
  │
1 │ is_string(.message) && !contains(.message, s'test')
  │ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  │ │
  │ expression can result in runtime error
  │ handle the error case to ensure runtime success
  │
  = see documentation about error handling at https://errors.vrl.dev/#handling
  = learn more about error code 100 at https://errors.vrl.dev/100
  = see language documentation at https://vrl.dev

Expected Behavior

I believe this should work. The runtime error comes from the contains() function expecting .message to be a string.

Actual Behavior

To make this work, you need to use, for example:

[sources.input]
  type = "stdin"

[transforms.filter]
  type = "filter"
  inputs = ["input"]
  condition = '''!contains(string!(.message), s'test')'''

[sinks.output]
  type = "console"
  inputs = ["filter"]
  # Encoding
  encoding.codec = "text"
  encoding.only_fields = ["message"]

However, my config is a minified example. I use several contains() statements, and adding the string() function to all of them makes the condition harder to read and is potentially wasteful if Vector has to do the same check multiple times.
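
Another option the E100 hint points toward (a sketch, assuming VRL's error-coalescing operator ?? behaves as documented, not taken from the original report) is to handle the potential error instead of casting. Coalescing to true mirrors the is_string() guard, because events whose .message is not a string are then dropped by the filter:

[transforms.filter]
  type = "filter"
  inputs = ["input"]
  # If .message is not a string, contains() errors; `?? true` turns that error
  # into `true`, so the negation drops the event, matching the is_string() guard.
  condition = '''!(contains(.message, s'test') ?? true)'''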


jerome-kleinen-kbc-be added the type: bug label on Oct 1, 2021
spencergilbert (Contributor) commented Oct 1, 2021

I think this would work, though I'm not sure it's more readable than including string!() around each of the .message values.

EDIT: fix vrl 😅

condition = '''message = string!(.message); !contains(message, s'test')'''

@JeanMertz or @StephenWakely could confirm that.

Sidenote on your minified config: encoding.codec = "text" will already print only the .message key, so the only_fields setting isn't strictly needed 😃
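
As a usage sketch of that suggestion (the extra search terms here are hypothetical placeholders, not from the original config), the cast can be done once and the resulting variable reused across several contains() calls:

[transforms.filter]
  type = "filter"
  inputs = ["input"]
  # Cast .message once, then reuse the local variable; 'debug' and 'trace'
  # are placeholder terms added for illustration.
  condition = '''
    message = string!(.message)
    !contains(message, s'test') && !contains(message, s'debug') && !contains(message, s'trace')
  '''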

jerome-kleinen-kbc-be (Author) commented

Thanks for the quick response.

I guess you mean

condition = '''message = string!(.message); !contains(message, s'test')'''

otherwise message would be a boolean, but I catch your drift.

I'm curious about the performance impact of adding the string() function to each contains() call. I want to apply this filter to approximately 1 billion lines a day, with 5 contains conditions.

jszwedko (Member) commented Oct 1, 2021

I don't believe there should be much performance overhead to the casting (especially compared to calling is_string()).

I opened vectordotdev/vrl#91 to track this as an enhancement. It is something we've thought about before, but haven't fully discussed or decided on an approach. I'll close this issue in favor of that one, but please feel free to add any additional thoughts there.

Notably, we have some upcoming work to add the concept of "schemas" to Vector in #9388 which would also help address this by not requiring the type-cast.

jszwedko closed this as completed Oct 1, 2021
jszwedko (Member) commented Oct 1, 2021

Just to confirm, the overhead of calling string does seem a bit greater than is_string, but given the numbers you provided, either should be fine. If we had is_string add type information, it would likely slow down slightly.

Local benchmarks:

vrl_stdlib/functions/string/string
                        time:   [45.828 ns 46.901 ns 48.227 ns]
                        thrpt:  [20.735 Melem/s 21.322 Melem/s 21.821 Melem/s]
vrl_stdlib/functions/is_string/string
                        time:   [39.018 ns 39.081 ns 39.155 ns]
                        thrpt:  [25.539 Melem/s 25.588 Melem/s 25.629 Melem/s]
                 change:
                        time:   [+1.6262% +1.8527% +2.1100%] (p = 0.00 < 0.05)
                        thrpt:  [-2.0664% -1.8190% -1.6002%]
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

jeromekleinen commented Oct 1, 2021

@jszwedko I imagine those benches are for a single usage of either function, right? Considering I want to use 5 contains() conditions in the filter transform (with accompanying string() calls), I guess the actual throughput would be divided by 5, or will VRL be able to "cache" this?

Perhaps an idea: it would be cool if there were a way to benchmark your Vector configs end to end, for example using a generator or file source and a blackhole sink. This would basically benchmark your whole topology of intermediate transforms at once. I could see myself using this to optimize my configs.

jszwedko (Member) commented Oct 1, 2021

> @jszwedko I imagine those benches are for a single usage of either function, right? Considering I want to use 5 contains() conditions in the filter transform (with accompanying string() calls), I guess the actual throughput would be divided by 5, or will VRL be able to "cache" this?

Yeah, these benches are just for calling the individual function. For separate filter transforms, Vector will parallelize execution, but yes, if you were running on a single core, the throughput would be negatively affected (maybe divided by 5? I'm not sure with modern CPU tricks 😅). The schema support RFC I mentioned earlier (#9388) does aim to allow users to encode their schemas (as well as leverage known schemas) in a way that would only validate once per event. Currently this is uncached, as each transform that uses VRL has its own VRL runtime.

> Perhaps an idea: it would be cool if there were a way to benchmark your Vector configs end to end, for example using a generator or file source and a blackhole sink. This would basically benchmark your whole topology of intermediate transforms at once. I could see myself using this to optimize my configs.

Yeah, this would be cool. It is definitely possible right now; you just have to wire it up yourself (comment out all sources/sinks and replace them with a generator/blackhole). I could see value in making this easier, though, for sure.
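
For the wire-it-up-yourself approach, a rough sketch (the source and sink option names are assumptions based on Vector 0.16 and may differ between versions) could look like:

[sources.bench_input]
  # Synthetic input standing in for the real source.
  type = "generator"
  format = "shuffle"
  lines = ["a test line", "a line without the keyword"]

[transforms.filter]
  type = "filter"
  inputs = ["bench_input"]
  condition = '''message = string!(.message); !contains(message, s'test')'''

[sinks.bench_output]
  # Discard the output; throughput can then be observed via internal metrics.
  type = "blackhole"
  inputs = ["filter"]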
