Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Field Reference: handle special characters #14044

Merged
merged 12 commits into from
May 24, 2022

Conversation

yaauie
Copy link
Member

@yaauie yaauie commented Apr 27, 2022

Release notes

  • Fixes a long-standing issue where setting a field whose name includes characters that are reserved in the FieldReference syntax ([ and ]) had undefined behaviour that resulted in pipeline crashes or the silent truncation of the field name.
  • Introduces two experimental opt-in methods for referencing fields whose names include characters that are reserved in the FieldReference syntax

What does this PR do?

This PR has three commits that tell a story:

  1. Demonstrate previously-unvalidated edge-cases in which FieldReference-special characters in field names result in undesired behaviour:
    Category Behaviour Example
    Valid field reference truncate path, leaving only key [deeply][nested][square-brackets][field] -> field
    Invalid field reference rejected input, potentially crash log[WARN] -> CRASH!
  2. Fixes the Event API's ConvertedMap internals so that data structures with fields whose names include FieldReference-special characters (currently [ and ]) correctly behave as expected – that is, uses field names as-given. This has the down-side of allowing fields on an event that cannot be referenced using the FieldReference syntax.
  3. Introduces two EXPERIMENTAL opt-in ways of being able to reference fields with special characters in their names.

Why is it important/What is the impact to the user?

  • Allows plugins that handle as-given data-structures (like the JSON codec) to correctly create events from the events as-given, even if those events contain FieldReference-special characters.
  • Provides an experimental path forward for referencing fields whose names contain FieldReference-special characters

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works

How to test this PR locally

With the following pipeline.conf, we create an event from JSON data that contains a field named [bracketed][field], and attempt to address that field in a variety of ways (field reference literals, sprintf):

input { generator { count => 1 codec => json message => '{"[bracketed][field]": "okay"}' } }
filter {
  if [bracketed][field] {
    mutate { add_tag => "MATCH-literal" }
  }
  if [%5Bbracketed%5D%5Bfield%5D] {
    mutate { add_tag => "MATCH-percent" }
  }
  if [[bracketed][field]] {
    mutate { add_tag => "MATCH-ampersand" }
  }
  mutate {
    replace => {
      "sprintf" => "LITERAL:~%{[bracketed][field]}~ PERCENT:~%{%5Bbracketed%5D%5Bfield%5D}~ AMPERSAND:~%{[bracketed][field]}~"
    }
  }
}
output { stdout { codec => rubydebug } }

Execute each of the following to observe behaviour:

bin/logstash -f pipeline.conf
bin/logstash -f pipeline.conf --field-reference-escape-style=PERCENT
bin/logstash -f pipeline.conf --field-reference-escape-style=AMPERSAND

Related issues

Resolves: #13606
Resolves: #11608
Supersedes: #13479

Keys passed to most methods of `ConvertedMap`, based on `IdentityHashMap`
depend on identity and not equivalence, and therefore rely on the keys being
_interned_ strings. In order to avoid hitting the JVM's global String intern
pool (which can have performance problems), operations to normalize a string
to its interned counterpart have traditionally relied on the behaviour of
`FieldReference#from` returning a likely-cached `FieldReference`, that had
an interned `key` and an empty `path`.

This is problematic on two points.

First, when `ConvertedMap` was given data with keys that _were_ valid string
field references representing a nested field (such as `[host][geo][location]`),
the implementation of `ConvertedMap#put` effectively silently discarded the
path components because it assumed them to be empty, and only the key was
kept (`location`).

Second, when `ConvertedMap` was given a map whose keys contained what the
field reference parser considered special characters but _were NOT_
valid field references, the resulting `FieldReference.IllegalSyntaxException`
caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose
`key` and `path`-components are known to have been interned, we introduce an
internment helper on our `ConvertedMap` that is also backed by the global string
intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has
three net effects:

 - Our ConvertedMap operations still use strings
   from the global intern pool
 - We have a new, smaller cache of individual field
   names, improving lookup performance
 - Our FieldReference cache no longer is flooded
   with fragments and therefore is more likely to
   remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require
      a careful audit of the possible code-paths to `ConvertedMap#putInterned`.
      The new cache is limited to 10k strings, and when more are used only
      the FIRST 10k strings will be primed into the cache, leaving the
      remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we alow events to be created whose fields _CANNOT_
      be referenced with the existing FieldReference implementation.

Resolves: elastic#13606
Resolves: elastic#11608
Adds a `config.field_reference.escape_style` option and a companion
command-line flag `--field-reference-escape-style` allowing a user
to opt into one of two proposed escape-sequence implementations for field
reference parsing:

 - `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes
 - `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points

The default is `NONE`, which does _not_ proccess escape sequences.
With this setting a user effectively cannot reference a field whose name
contains FieldReference-reserved characters.

| ESCAPE STYLE | `[`     | `]`     |
| ------------ | ------- | ------- |
| `NONE`       | _N/A_   | _N/A_   |
| `PERCENT`    | `%5B`   | `%5D`   |
| `AMPERSAND`  | `&elastic#91;` | `&elastic#93;` |
@yaauie
Copy link
Member Author

yaauie commented Apr 28, 2022

docs/static/field-reference.asciidoc Outdated Show resolved Hide resolved
docs/static/settings-file.asciidoc Outdated Show resolved Hide resolved
@kares kares self-requested a review May 4, 2022 16:02
Copy link
Contributor

@kares kares left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇 👏 👏 👏

from my testing this seems to be working fine - only have minor comments that aren't blockers in terms of shipping the feature ...

looked into a very naive performance test of various escape styles, given the regular expression use on the new ones (obviously only matters when the cache gets full) :

  • the PERCENT style (un-cached) degrades by ~ 30% compared to NONE
  • the AMPERSAND is on par with NONE (the escaped.contains("&") seems to help)

@@ -87,6 +87,11 @@ class LogStash::Runner < Clamp::StrictCommand
:default => LogStash::SETTINGS.get_default("config.string"),
:attribute_name => "config.string"

option ["--field-reference-escape-style"], "STYLE",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got confused that this does not take effect when using irb e.g.
bin/logstash --field-reference-escape-style PERCENT -i irb but that's for a separate issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in d58447d by moving the setting's application from the agent (which isn't started for shell sessions) to the runner before shell sessions are invoked.

@github-actions
Copy link
Contributor

📃 DOCS PREVIEWhttps://logstash_14044.docs-preview.app.elstc.co/diff

@yaauie yaauie requested a review from karenzone May 18, 2022 18:04
@@ -87,6 +87,11 @@ class LogStash::Runner < Clamp::StrictCommand
:default => LogStash::SETTINGS.get_default("config.string"),
:attribute_name => "config.string"

option ["--field-reference-escape-style"], "STYLE",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in d58447d by moving the setting's application from the agent (which isn't started for shell sessions) to the runner before shell sessions are invoked.

docs/static/field-reference.asciidoc Outdated Show resolved Hide resolved
docs/static/field-reference.asciidoc Outdated Show resolved Hide resolved
docs/static/field-reference.asciidoc Outdated Show resolved Hide resolved
docs/static/field-reference.asciidoc Outdated Show resolved Hide resolved
@github-actions
Copy link
Contributor

📃 DOCS PREVIEWhttps://logstash_14044.docs-preview.app.elstc.co/diff

Copy link
Contributor

@karenzone karenzone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments inline, most notably about replacing "experimental" with "technical preview."

[[formal-grammar-escape-sequences]]
=== Escape Sequences

In order to reference a field whose name contains a character that has special meaning in the field reference grammar, it needs to be escaped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In order to reference a field whose name contains a character that has special meaning in the field reference grammar, it needs to be escaped.
For {ls} to reference a field whose name contains a character that has special meaning in the field reference grammar, the character must be escaped.


- `NONE` (default): no escape sequence processing is done. Fields containing literal square brackets cannot be referenced by the Event API.
- `PERCENT`: URI-style percent encoding of UTF-8 bytes. The left square bracket (`[`) is expressed as `%5B`, and the right square bracket (`]`) is expressed as `%5D`.
// NOTE: the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// NOTE: the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.
// Note that the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using regular words rather than the asciidoc admonition format. The comment treatment would keep it from getting formatted, but it keeps catching my eye.

* `AMPERSAND`: HTML-style `&#`{plus}`DD`{plus}`;` encoding of decimal Unicode code-points (`[` -> `&#91;`; `]` -> `&#91;`)
* `NONE`: field names containing special characters _cannot_ be referenced.

| `NONE`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this entry case sensitive? In this topic, we already have two instances of None and one instance of none.

Copy link
Member Author

@yaauie yaauie May 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. And there really is no reason for it to be upcase, so I have changed the implementation to be downcase throughout to match the other none.

Both other instances of "None" in the file should actually be "N/A" since they represent an absence of a default value instead of a default value that is the literal N+o+n+e

@@ -178,6 +178,17 @@ Values other than `disabled` are currently considered BETA, and may produce unin
| When set to `true`, quoted strings will process the following escape sequences: `\n` becomes a literal newline (ASCII 10). `\r` becomes a literal carriage return (ASCII 13). `\t` becomes a literal tab (ASCII 9). `\\` becomes a literal backslash `\`. `\"` becomes a literal double quotation mark. `\'` becomes a literal quotation mark.
| `false`

| `config.field_reference.escape_style`
a| _EXPERIMENTAL_ setting that provides a way to reference fields that contain <<formal-grammar-escape-sequences,field reference special characters>> `[` and `]`.
Copy link
Contributor

@karenzone karenzone May 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "Technical preview" instead of "Experimental." Reference: https://github.com/elastic/docs#using-the-technical-preview-admonition

I tried adding/formatting this nugget in a variety of ways. So far, I don't really like any of them. Here are two of several things I tried:

Screen Shot 2022-05-20 at 4 31 18 PM

Screen Shot 2022-05-20 at 4 39 51 PM

Admonitions are supposed to handle formatting, but I haven't hit on a combination that looks good and provides adequate info for the user. Only tagging the option with "Preview" doesn't convey the risk that the option might change or go away. I can sync with @gtback next week for ideas and design intent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish the admonitions contained a link to a better place where we define what we mean and add more detail. I've never seen these used in a table, but let me know what I can do to help.

Copy link
Contributor

@kares kares left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@github-actions
Copy link
Contributor

📃 DOCS PREVIEWhttps://logstash_14044.docs-preview.app.elstc.co/diff

@yaauie yaauie merged commit d845411 into elastic:main May 24, 2022
@yaauie yaauie deleted the field-reference-special-characters branch June 22, 2022 02:41
@dlouzan
Copy link

dlouzan commented Jun 28, 2022

@yaauie Did this make it to v8.3.0? The PR labels say so, but the tag diffs and release notes say otherwise.

We are experiencing this very error and would be really happy to put a fix in place 🙇

/cc @fh1ch

@yaauie
Copy link
Member Author

yaauie commented Jun 28, 2022

Yes, it is in 8.3.0.

@dlouzan
Copy link

dlouzan commented Jun 29, 2022

@yaauie I must be blind, I promise I checked the release notes but did not see it. Our initial tests show very promising results, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants