Field Reference: handle special characters #14044
Keys passed to most methods of `ConvertedMap`, which is based on `IdentityHashMap`, depend on identity and not equivalence, and therefore rely on the keys being _interned_ strings. To avoid hitting the JVM's global String intern pool (which can have performance problems), operations that normalize a string to its interned counterpart have traditionally relied on `FieldReference#from` returning a likely-cached `FieldReference` that had an interned `key` and an empty `path`.

This is problematic on two points. First, when `ConvertedMap` was given data with keys that _were_ valid string field references representing a nested field (such as `[host][geo][location]`), the implementation of `ConvertedMap#put` silently discarded the path components because it assumed them to be empty, and only the key was kept (`location`). Second, when `ConvertedMap` was given a map whose keys contained what the field reference parser considered special characters but _were NOT_ valid field references, the resulting `FieldReference.IllegalSyntaxException` caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose `key` and `path`-components are known to have been interned, we introduce an internment helper on our `ConvertedMap` that is also backed by the global string intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has three net effects:

- Our ConvertedMap operations still use strings from the global intern pool
- We have a new, smaller cache of individual field names, improving lookup performance
- Our FieldReference cache is no longer flooded with fragments and is therefore more likely to remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require a careful audit of the possible code-paths to `ConvertedMap#putInterned`.

The new cache is limited to 10k strings; when more are used, only the FIRST 10k strings are primed into the cache, leaving the remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we allow events to be created whose fields _CANNOT_ be referenced with the existing FieldReference implementation.

Resolves: elastic#13606
Resolves: elastic#11608
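To make the identity-versus-equality point concrete, here is a minimal sketch of a bounded interning helper in front of an `IdentityHashMap`. The class name `BoundedInterner`, its methods, and the use of `ConcurrentHashMap` are assumptions made for illustration; this is not the PR's actual `ConvertedMap` code, only the general idea of a small cache that is primed from the global intern pool and stops growing at a fixed size.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration; not Logstash's actual implementation.
final class BoundedInterner {
    // Cap taken from the commit message ("limited to 10k strings").
    private static final int MAX_ENTRIES = 10_000;
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Returns the canonical (interned) instance equal to the given string.
    String intern(final String value) {
        final String cached = cache.get(value); // lookup by equality
        if (cached != null) {
            return cached;
        }
        final String interned = value.intern(); // falls back to the global JVM pool
        // Only the first MAX_ENTRIES distinct strings are kept; under concurrency
        // this check may overshoot slightly, which is acceptable for a sketch.
        if (cache.size() < MAX_ENTRIES) {
            cache.putIfAbsent(interned, interned);
        }
        return interned;
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        final BoundedInterner interner = new BoundedInterner();
        final Map<String, Object> byIdentity = new IdentityHashMap<>();

        // The whole string is used as the key, brackets included.
        final String key = interner.intern("[host][geo][location]");
        byIdentity.put(key, "value");

        // An equal-but-distinct String instance is a different key to an IdentityHashMap.
        final String copy = new String("[host][geo][location]");
        System.out.println(byIdentity.containsKey(copy));                  // false
        System.out.println(byIdentity.containsKey(interner.intern(copy))); // true
    }
}
```

Because the helper never parses its input as a field reference, a key such as `[host][geo][location]` is kept whole rather than being reduced to `location`.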
Adds a `config.field_reference.escape_style` option and a companion command-line flag `--field-reference-escape-style`, allowing a user to opt into one of two proposed escape-sequence implementations for field reference parsing:

- `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes
- `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points

The default is `NONE`, which does _not_ process escape sequences. With this setting a user effectively cannot reference a field whose name contains FieldReference-reserved characters.

| ESCAPE STYLE | `[` | `]` |
| ------------ | ------- | ------- |
| `NONE` | _N/A_ | _N/A_ |
| `PERCENT` | `%5B` | `%5D` |
| `AMPERSAND` | `&#91;` | `&#93;` |
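As a rough illustration of the two encodings in the table, the sketch below decodes a single escaped name fragment in each style. The class and method names (`EscapeStyleSketch`, `unescapePercent`, `unescapeAmpersand`) are hypothetical, and the sketch says nothing about where in the field reference parser the real implementation applies the decoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical decoders for illustration; not the PR's actual parser code.
final class EscapeStyleSketch {
    // &#DD; where DD is a decimal Unicode code point (length capped to avoid int overflow).
    private static final Pattern AMPERSAND = Pattern.compile("&#([0-9]{1,7});");

    // PERCENT style: each %HH sequence stands for one UTF-8 byte.
    static String unescapePercent(final String input) {
        final byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        final ByteArrayOutputStream out = new ByteArrayOutputStream(bytes.length);
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '%' && i + 2 < bytes.length) {
                final int hi = Character.digit(bytes[i + 1] & 0xFF, 16);
                final int lo = Character.digit(bytes[i + 2] & 0xFF, 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            out.write(bytes[i]); // not a valid escape: keep the byte as-is
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // AMPERSAND style: each &#DD; sequence stands for one Unicode code point.
    // A number that is not a valid code point makes appendCodePoint throw.
    static String unescapeAmpersand(final String input) {
        final Matcher matcher = AMPERSAND.matcher(input);
        final StringBuilder result = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            result.append(input, last, matcher.start());
            result.appendCodePoint(Integer.parseInt(matcher.group(1)));
            last = matcher.end();
        }
        result.append(input, last, input.length());
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescapePercent("log%5BWARN%5D"));       // log[WARN]
        System.out.println(unescapeAmpersand("log&#91;WARN&#93;")); // log[WARN]
    }
}
```

The two styles differ in what a single escape denotes: a `PERCENT` escape is one byte, so a non-ASCII character needs several escapes, while an `AMPERSAND` escape is one whole code point.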
🥇 👏 👏 👏
from my testing this seems to be working fine - only have minor comments that aren't blockers in terms of shipping the feature ...
looked into a very naive performance test of various escape styles, given the regular expression use on the new ones (obviously only matters when the cache gets full):
- the `PERCENT` style (un-cached) degrades by ~30% compared to `NONE`
- the `AMPERSAND` is on par with `NONE` (the `escaped.contains("&")` seems to help)
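The short-circuit mentioned above can be sketched generically: check for the marker character with a cheap scan, and only run the more expensive decoder when it is present. The helper name `decodeIfMarked` and the counting decoder are assumptions for the demo; the only detail taken from the review is that a `contains("&")` check precedes the regular-expression work.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.UnaryOperator;

// Hypothetical helper for illustration; not the PR's actual code.
final class FastPathSketch {
    // Skips the (potentially regex-based) decoder when the marker cannot occur.
    static String decodeIfMarked(final String input, final char marker,
                                 final UnaryOperator<String> decoder) {
        return input.indexOf(marker) < 0 ? input : decoder.apply(input);
    }

    public static void main(String[] args) {
        final AtomicInteger slowPathCalls = new AtomicInteger();
        // Stand-in decoder that counts how often the slow path is taken.
        final UnaryOperator<String> decoder = s -> {
            slowPathCalls.incrementAndGet();
            return s.replace("&#91;", "[").replace("&#93;", "]");
        };

        System.out.println(decodeIfMarked("message", '&', decoder));           // message
        System.out.println(decodeIfMarked("log&#91;WARN&#93;", '&', decoder)); // log[WARN]
        System.out.println(slowPathCalls.get());                               // 1
    }
}
```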
@@ -87,6 +87,11 @@ class LogStash::Runner < Clamp::StrictCommand
    :default => LogStash::SETTINGS.get_default("config.string"),
    :attribute_name => "config.string"

  option ["--field-reference-escape-style"], "STYLE",
got confused that this does not take effect when using irb, e.g. `bin/logstash --field-reference-escape-style PERCENT -i irb`, but that's for a separate issue
Resolved in d58447d by moving the setting's application from the agent (which isn't started for shell sessions) to the runner before shell sessions are invoked.
Co-authored-by: Karol Bucek <[email protected]>
📃 DOCS PREVIEW ✨ https://logstash_14044.docs-preview.app.elstc.co/diff
Left some comments inline, most notably about replacing "experimental" with "technical preview."
docs/static/field-reference.asciidoc (outdated)
[[formal-grammar-escape-sequences]]
=== Escape Sequences

In order to reference a field whose name contains a character that has special meaning in the field reference grammar, it needs to be escaped.
Suggested change:
- In order to reference a field whose name contains a character that has special meaning in the field reference grammar, it needs to be escaped.
+ For {ls} to reference a field whose name contains a character that has special meaning in the field reference grammar, the character must be escaped.
docs/static/field-reference.asciidoc (outdated)

- `NONE` (default): no escape sequence processing is done. Fields containing literal square brackets cannot be referenced by the Event API.
- `PERCENT`: URI-style percent encoding of UTF-8 bytes. The left square bracket (`[`) is expressed as `%5B`, and the right square bracket (`]`) is expressed as `%5D`.
// NOTE: the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.
Suggested change:
- // NOTE: the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.
+ // Note that the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.
Suggest using regular words rather than the asciidoc admonition format. The comment treatment would keep it from getting formatted, but it keeps catching my eye.
docs/static/settings-file.asciidoc (outdated)
* `AMPERSAND`: HTML-style `&#`{plus}`DD`{plus}`;` encoding of decimal Unicode code-points (`[` -> `&#91;`; `]` -> `&#93;`)
* `NONE`: field names containing special characters _cannot_ be referenced.

| `NONE`
Is this entry case sensitive? In this topic, we already have two instances of `None` and one instance of `none`.
It is. And there really is no reason for it to be upcase, so I have changed the implementation to be downcase throughout to match the other `none`.
Both other instances of "None" in the file should actually be "N/A" since they represent an absence of a default value instead of a default value that is the literal `N`+`o`+`n`+`e`.
docs/static/settings-file.asciidoc (outdated)
@@ -178,6 +178,17 @@ Values other than `disabled` are currently considered BETA, and may produce unin
| When set to `true`, quoted strings will process the following escape sequences: `\n` becomes a literal newline (ASCII 10). `\r` becomes a literal carriage return (ASCII 13). `\t` becomes a literal tab (ASCII 9). `\\` becomes a literal backslash `\`. `\"` becomes a literal double quotation mark. `\'` becomes a literal quotation mark.
| `false`

| `config.field_reference.escape_style`
a| _EXPERIMENTAL_ setting that provides a way to reference fields that contain <<formal-grammar-escape-sequences,field reference special characters>> `[` and `]`.
Should be "Technical preview" instead of "Experimental." Reference: https://github.com/elastic/docs#using-the-technical-preview-admonition
I tried adding/formatting this nugget in a variety of ways. So far, I don't really like any of them. Here are two of several things I tried:
Admonitions are supposed to handle formatting, but I haven't hit on a combination that looks good and provides adequate info for the user. Only tagging the option with "Preview" doesn't convey the risk that the option might change or go away. I can sync with @gtback next week for ideas and design intent.
I wish the admonitions contained a link to a better place where we define what we mean and add more detail. I've never seen these used in a table, but let me know what I can do to help.
👍
Yes, it is in 8.3.0.
@yaauie I must be blind, I promise I checked the release notes but did not see it. Our initial tests show very promising results, thank you!
Release notes
(`[` and `]`) had undefined behaviour that resulted in pipeline crashes or the silent truncation of the field name.

What does this PR do?
This PR has three commits that tell a story:
- `[deeply][nested][square-brackets][field]` -> `field`
- `log[WARN]` -> CRASH!
- `ConvertedMap` internals so that data structures with fields whose names include FieldReference-special characters (currently `[` and `]`) correctly behave as expected; that is, they use field names as-given. This has the down-side of allowing fields on an event that cannot be referenced using the FieldReference syntax.

Why is it important/What is the impact to the user?

Checklist

How to test this PR locally
With the following `pipeline.conf`, we create an event from JSON data that contains a field named `[bracketed][field]`, and attempt to address that field in a variety of ways (field reference literals, sprintf):

Execute each of the following to observe behaviour:

Related issues
Resolves: #13606
Resolves: #11608
Supersedes: #13479