Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore trailing spaces in CEF messages #17253

Merged
merged 3 commits into from
Mar 27, 2020

Conversation

adriansr
Copy link
Contributor

@adriansr adriansr commented Mar 25, 2020

What does this PR do?

This patch updates the Ragel state machine to skip trailing space at the end of CEF messages.

Some CEF exporters, Check Point for example, have been observed to add a trailing space to CEF messages:

"CEF:0:| [...] src=127.0.0.1 "

Currently, this space character is interpreted as part of the last field's value, which can cause decoding errors if the value is an integer or an IP address:

"error": {
    "message": "sourceAddress: value is not a valid IP address"
  }

For maximizing compatibility, we also want to ignore other kinds of white-space characters (newline, carriage return, tab). For example we can get a trailing newline when processing CEF messages from UDP input instead of syslog, which removes newlines.

Trailing space in other extensions' values is preserved, as the CEF standard permits (but discourages) it's use in non-final extensions.

Why is it important?

We've observed users experiencing this problem, trying to fix it (unsuccessfully) with a processor:

- dissect:
      when:
        equals:
          event.module: "cef"
      tokenizer: "%{sourceAddress} "
      field: "cef.extensions.sourceAddress"
      target_prefix: "cef.extensions"

Why a draft?

While the current solution is complete, I'm not satisfied of the changes that introduces to the Ragel SM definition.

Originally I wanted to get something more elegant like this to work:

extensions = (extension " ")* final_extension space*

So that we can keep the extension machine as-is, and add a specialized final_extension machine that will disallow trailing space. However, I didn't manage to get this kind of pattern to work. It requires rewriting a lot of the capture actions to allow for the necessary backtracking, and I wasn't confident enough to get that right without introducing more problems than I was solving, or dedicating too many hours to this fix, due to my limited experience with ragel.

The current solution works by accident. I decided to try a different approach starting by disallowing trailing space in all extensions, and found out that it works as I wanted and captures white-space as value in all extensions but the last. This is a side effect of how an extension value is captured differently in extension_key vs extension_eof. The former captures the previous value up to mark-1, that is the start of the current key minus the space separator. The later captures up to extValueEnd, which is not incremented for trailing space.

The current machine definition is an ugly mix of " " and space usage, because we want to have the space character as the only separator, but trim al trailing space: [\t\v\f\n\r ]

This patch updates the ragel state machine to skip trailing whitespace
at the end of CEF messages.

Some CEF exporters, Check Point for example, have been observed to add a
trailing whitespace to CEF messages:

> "CEF:0:| [...] src=127.0.0.1 "

Currently, this space character is interpreted as part of the last
field's value, which can cause decoding errors if the value is an
integer or an IP address.

For maximizing compatibility, we also want to ignore other kinds of
whitespace characters (new line, carriage return, tab). For example we
can get a trailing newline when processing CEF messages from UDP input
instead of syslog, which removes newlines.

Whitespace in non-final extensions is preserved, as the CEF standard
permits (but discourages) it's use in non-final extensions.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/siem (Team:SIEM)

Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see you got it working! Two things I'm curious about. 1) Maybe run benchcmp on this to see how/if the result changed. 2) How does this compare vs a TrimRight solution in a benchmark.

extensions = " "* extension (" " extension)*;
extension_value = (space* extension_value_chars_nospace @extension_value_mark)* >extension_value_start $err(extension_err);
extension = extension_key equal extension_value;
extensions = " "* extension (space* " " extension)* space* %/extension_eof;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised that %/extension_eof worked for each individual extension, rather than just the final one. I'm definitely no Ragel expert.

Copy link
Contributor Author

@adriansr adriansr Mar 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I guess I didn't explain myself properly. extension_eof worked as expected. It's just that it captures a value until the last point saved by @extension_value_mark while extension_key saves the previous value up to the start of the separator. This made no difference before, but now it helps to capture trailing spaces in non-final extensions and to drop it in the final extension.

I had to move it to a different place here to adjust for the case where a msg is padded and EOF is never reached by extension_value due to it not accepting trailing spaces.

@adriansr
Copy link
Contributor Author

adriansr commented Mar 26, 2020

@andrewkroh I ran some benchmarks with count=10 and comparing with benchstat.

All implementations are compared against the original state machine that doesn't trim.

Modified SM (this PR)

name                      old time/op  new time/op  delta
pkg:github.com/elastic/beats/v7/x-pack/filebeat/processors/decode_cef goos:darwin goarch:amd64
ProcessorRun/short_msg-8  5.29µs ± 0%  5.34µs ± 1%  +1.02%  (p=0.000 n=9+9)
ProcessorRun/long_msg-8   65.5µs ± 1%  65.7µs ± 0%    ~     (p=0.165 n=10+10)
pkg:github.com/elastic/beats/v7/x-pack/filebeat/processors/decode_cef/cef goos:darwin goarch:amd64
EventUnpack-8             2.17µs ± 1%  2.14µs ± 1%  -1.25%  (p=0.000 n=10+9)

strings.TrimRight [changes]

name                      old time/op  new time/op  delta
pkg:github.com/elastic/beats/v7/x-pack/filebeat/processors/decode_cef goos:darwin goarch:amd64
ProcessorRun/short_msg-8  5.29µs ± 0%  5.36µs ± 1%  +1.29%  (p=0.000 n=9+10)
ProcessorRun/long_msg-8   65.5µs ± 1%  64.3µs ± 1%  -1.94%  (p=0.000 n=10+10)
pkg:github.com/elastic/beats/v7/x-pack/filebeat/processors/decode_cef/cef goos:darwin goarch:amd64
EventUnpack-8             2.17µs ± 1%  2.17µs ± 1%    ~     (p=0.388 n=10+9)

Custom Loop [changes]

name                      old time/op  new time/op  delta
pkg:github.com/elastic/beats/v7/x-pack/filebeat/processors/decode_cef goos:darwin goarch:amd64
ProcessorRun/short_msg-8  5.29µs ± 0%  5.30µs ± 1%    ~     (p=0.660 n=9+10)
ProcessorRun/long_msg-8   65.5µs ± 1%  64.4µs ± 1%  -1.72%  (p=0.000 n=10+9)
pkg:github.com/elastic/beats/v7/x-pack/filebeat/processors/decode_cef/cef goos:darwin goarch:amd64
EventUnpack-8             2.17µs ± 1%  2.08µs ± 1%  -4.36%  (p=0.000 n=10+10)

Dedicated state-machine [changes]

Did this one just for the sake of it.

name                      old time/op  new time/op  delta
pkg:github.com/elastic/beats/v7/x-pack/filebeat/processors/decode_cef goos:darwin goarch:amd64
ProcessorRun/short_msg-8  5.29µs ± 0%  5.40µs ± 1%  +2.05%  (p=0.000 n=9+10)
ProcessorRun/long_msg-8   65.5µs ± 1%  66.1µs ± 1%  +0.91%  (p=0.001 n=10+10)
pkg:github.com/elastic/beats/v7/x-pack/filebeat/processors/decode_cef/cef goos:darwin goarch:amd64
EventUnpack-8             2.17µs ± 1%  2.16µs ± 1%  -0.61%  (p=0.003 n=10+10)

Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for running the comparison. I didn't know about benchstat 😄 .

LGTM, but needs a changelog.

@adriansr adriansr marked this pull request as ready for review March 26, 2020 14:10
@adriansr adriansr added the needs_backport PR is waiting to be backported to other branches. label Mar 26, 2020
@adriansr adriansr changed the title [DRAFT] Ignore trailing spaces in CEF messages Ignore trailing spaces in CEF messages Mar 26, 2020
@adriansr adriansr merged commit a1ec55a into elastic:master Mar 27, 2020
adriansr added a commit to adriansr/beats that referenced this pull request Mar 27, 2020
This patch updates the ragel state machine to skip trailing spaces
at the end of CEF messages.

Some CEF exporters, Check Point for example, have been observed to add a
trailing space to CEF messages:

> "CEF:0:| [...] src=127.0.0.1 "

Currently, this space character is interpreted as part of the last
field's value, which can cause decoding errors if the value is an
integer or an IP address.

For maximizing compatibility, we also want to ignore other kinds of
space characters (new line, carriage return, tab). For example we
can get a trailing newline when processing CEF messages from UDP 
input instead of syslog, which removes newlines.

Spaces in non-final extensions are preserved, as the CEF standard
permits (but discourages) it's use in non-final extensions.

(cherry picked from commit a1ec55a)
@adriansr adriansr added v7.8.0 and removed needs_backport PR is waiting to be backported to other branches. labels Mar 27, 2020
adriansr added a commit to adriansr/beats that referenced this pull request Mar 27, 2020
This patch updates the ragel state machine to skip trailing spaces
at the end of CEF messages.

Some CEF exporters, Check Point for example, have been observed to add a
trailing space to CEF messages:

> "CEF:0:| [...] src=127.0.0.1 "

Currently, this space character is interpreted as part of the last
field's value, which can cause decoding errors if the value is an
integer or an IP address.

For maximizing compatibility, we also want to ignore other kinds of
space characters (new line, carriage return, tab). For example we
can get a trailing newline when processing CEF messages from UDP 
input instead of syslog, which removes newlines.

Spaces in non-final extensions are preserved, as the CEF standard
permits (but discourages) it's use in non-final extensions.

(cherry picked from commit a1ec55a)
adriansr added a commit that referenced this pull request Apr 3, 2020
This patch updates the ragel state machine to skip trailing spaces
at the end of CEF messages.

Some CEF exporters, Check Point for example, have been observed to add a
trailing space to CEF messages:

> "CEF:0:| [...] src=127.0.0.1 "

Currently, this space character is interpreted as part of the last
field's value, which can cause decoding errors if the value is an
integer or an IP address.

For maximizing compatibility, we also want to ignore other kinds of
space characters (new line, carriage return, tab). For example we
can get a trailing newline when processing CEF messages from UDP 
input instead of syslog, which removes newlines.

Spaces in non-final extensions are preserved, as the CEF standard
permits (but discourages) it's use in non-final extensions.

(cherry picked from commit a1ec55a)
adriansr added a commit that referenced this pull request Apr 3, 2020
This patch updates the ragel state machine to skip trailing spaces
at the end of CEF messages.

Some CEF exporters, Check Point for example, have been observed to add a
trailing space to CEF messages:

> "CEF:0:| [...] src=127.0.0.1 "

Currently, this space character is interpreted as part of the last
field's value, which can cause decoding errors if the value is an
integer or an IP address.

For maximizing compatibility, we also want to ignore other kinds of
space characters (new line, carriage return, tab). For example we
can get a trailing newline when processing CEF messages from UDP 
input instead of syslog, which removes newlines.

Spaces in non-final extensions are preserved, as the CEF standard
permits (but discourages) it's use in non-final extensions.

(cherry picked from commit a1ec55a)
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
This patch updates the ragel state machine to skip trailing spaces
at the end of CEF messages.

Some CEF exporters, Check Point for example, have been observed to add a
trailing space to CEF messages:

> "CEF:0:| [...] src=127.0.0.1 "

Currently, this space character is interpreted as part of the last
field's value, which can cause decoding errors if the value is an
integer or an IP address.

For maximizing compatibility, we also want to ignore other kinds of
space characters (new line, carriage return, tab). For example we
can get a trailing newline when processing CEF messages from UDP 
input instead of syslog, which removes newlines.

Spaces in non-final extensions are preserved, as the CEF standard
permits (but discourages) it's use in non-final extensions.

(cherry picked from commit ee5d5bd)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants