
VRL is_string function does not imply data type in filter transform #9406

Closed
jerome-kleinen-kbc-be opened this issue Oct 1, 2021 · 6 comments
Labels
type: bug A code related bug.

Comments


Vector Version

version="0.16.1" arch="x86_64" build_id="739e878 2021-08-26"

Vector Configuration File

[sources.input]
  type = "stdin"

[transforms.filter]
  type = "filter"
  inputs = ["input"]
  condition = '''is_string(.message) && !contains(.message, s'test')'''

[sinks.output]
  type = "console"
  inputs = ["filter"]
  # Encoding
  encoding.codec = "text"
  encoding.only_fields = ["message"]

Debug Output

Oct 01 15:36:09.986 ERROR vector::topology: Configuration error. error=Transform "filter":
error[E100]: unhandled error
  ┌─ :1:1
  │
1 │ is_string(.message) && !contains(.message, s'test')
  │ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  │ │
  │ expression can result in runtime error
  │ handle the error case to ensure runtime success
  │
  = see documentation about error handling at https://errors.vrl.dev/#handling
  = learn more about error code 100 at https://errors.vrl.dev/100
  = see language documentation at https://vrl.dev

Expected Behavior

I believe this should work. The runtime error comes from the contains() function expecting .message to be a string.

Actual Behavior

To make this work, you need to use, for example:

[sources.input]
  type = "stdin"

[transforms.filter]
  type = "filter"
  inputs = ["input"]
  condition = '''!contains(string!(.message), s'test')'''

[sinks.output]
  type = "console"
  inputs = ["filter"]
  # Encoding
  encoding.codec = "text"
  encoding.only_fields = ["message"]

However, my config is a minified example. I use several contains() statements, and adding the string() function to all of them makes the condition harder to read and is potentially wasteful if Vector has to do the same check multiple times.
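
Another option the E100 hint points toward (a sketch, assuming VRL's error-coalescing operator ?? behaves as documented, not taken from the original report) is to handle the potential error instead of casting. Coalescing to true mirrors the is_string() guard, because events whose .message is not a string are then dropped by the filter:

[transforms.filter]
  type = "filter"
  inputs = ["input"]
  # If .message is not a string, contains() errors; `?? true` turns that error
  # into `true`, so the negation drops the event, matching the is_string() guard.
  condition = '''!(contains(.message, s'test') ?? true)'''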


jerome-kleinen-kbc-be added the type: bug label on Oct 1, 2021
spencergilbert (Contributor) commented Oct 1, 2021

I think this would work, though I'm not sure it's more readable than including string!() around each of the .message values.

EDIT: fix vrl 😅

condition = '''message = string!(.message); !contains(message, s'test')'''

@JeanMertz or @StephenWakely could confirm that.

Sidenote on your minified config: encoding.codec = "text" will already print only the .message key, so the only_fields setting isn't strictly needed 😃
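
As a usage sketch of that suggestion (the extra search terms here are hypothetical placeholders, not from the original config), the cast can be done once and the resulting variable reused across several contains() calls:

[transforms.filter]
  type = "filter"
  inputs = ["input"]
  # Cast .message once, then reuse the local variable; 'debug' and 'trace'
  # are placeholder terms added for illustration.
  condition = '''
    message = string!(.message)
    !contains(message, s'test') && !contains(message, s'debug') && !contains(message, s'trace')
  '''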

jerome-kleinen-kbc-be (Author) commented

Thanks for the quick response.

I guess you mean

condition = '''message = string!(.message); !contains(message, s'test')'''

otherwise message would be a boolean, but I catch your drift.

I'm curious about the performance impact of adding the string() function to each contains() call. I want to apply this filter to approximately 1 billion lines a day, with 5 contains conditions.

jszwedko (Member) commented Oct 1, 2021

I don't believe there should be much performance overhead to the casting (especially compared to calling is_string()).

I opened vectordotdev/vrl#91 to track this as an enhancement. It is something we've thought about before, but haven't fully discussed or decided on an approach. I'll close this issue in favor of that one, but please feel free to add any additional thoughts there.

Notably, we have some upcoming work to add the concept of "schemas" to Vector in #9388 which would also help address this by not requiring the type-cast.

jszwedko closed this as completed Oct 1, 2021
jszwedko (Member) commented Oct 1, 2021

Just to confirm, the overhead of calling string does seem a bit greater than is_string, but given the numbers you provided, either should be fine. If we had is_string add type information, it would likely slow down slightly.

Local benchmarks:

vrl_stdlib/functions/string/string
                        time:   [45.828 ns 46.901 ns 48.227 ns]
                        thrpt:  [20.735 Melem/s 21.322 Melem/s 21.821 Melem/s]
vrl_stdlib/functions/is_string/string
                        time:   [39.018 ns 39.081 ns 39.155 ns]
                        thrpt:  [25.539 Melem/s 25.588 Melem/s 25.629 Melem/s]
                 change:
                        time:   [+1.6262% +1.8527% +2.1100%] (p = 0.00 < 0.05)
                        thrpt:  [-2.0664% -1.8190% -1.6002%]
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

jeromekleinen commented Oct 1, 2021

@jszwedko I imagine those benches are for a single usage of either function, right? Considering I want to use 5 contains() conditions in the filter transform (with accompanying string() calls), I guess the actual throughput would be divided by 5, or will VRL be able to "cache" this?

Perhaps an idea: it would be cool if there were a way to benchmark your Vector configs end to end, for example using a generator or file source and a blackhole sink. This would basically benchmark your whole topology of intermediate transforms at once. I could see myself using this to optimize my configs.

jszwedko (Member) commented Oct 1, 2021

> @jszwedko I imagine those benches are for a single usage of either function, right? Considering I want to use 5 contains() conditions in the filter transform (with accompanying string() calls), I guess the actual throughput would be divided by 5, or will VRL be able to "cache" this?

Yeah, these benches are just for calling the individual function. For separate filter transforms, Vector will parallelize execution, but yes, if you were running on a single core, the throughput would be negatively affected (maybe divided by 5? I'm not sure with modern CPU tricks 😅). The schema support RFC I mentioned earlier (#9388) does aim to allow users to encode their schemas (as well as leverage known schemas) in a way that would only validate once per event. Currently this is uncached, as each transform that uses VRL has its own VRL runtime.

> Perhaps an idea: it would be cool if there were a way to benchmark your Vector configs end to end, for example using a generator or file source and a blackhole sink. This would basically benchmark your whole topology of intermediate transforms at once. I could see myself using this to optimize my configs.

Yeah, this would be cool. It is definitely possible right now; you just have to wire it up yourself (comment out all sources/sinks and replace them with a generator/blackhole). I could see value in making this easier, though, for sure.
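
For the wire-it-up-yourself approach, a rough sketch (the source and sink option names are assumptions based on Vector 0.16 and may differ between versions) could look like:

[sources.bench_input]
  # Synthetic input standing in for the real source.
  type = "generator"
  format = "shuffle"
  lines = ["a test line", "a line without the keyword"]

[transforms.filter]
  type = "filter"
  inputs = ["bench_input"]
  condition = '''message = string!(.message); !contains(message, s'test')'''

[sinks.bench_output]
  # Discard the output; throughput can then be observed via internal metrics.
  type = "blackhole"
  inputs = ["filter"]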
