Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spacetime: escapable field references #13479

Closed

Conversation

yaauie
Copy link
Member

@yaauie yaauie commented Dec 7, 2021

This is a proof-of-concept that is NOT meant to be merged.

When this PR proves that what it is attempting is possible, a new PR will be opened to replace it.

What does this PR do?

This PR attempts to find a path forward for EventFactory#newEvent(Map) when the provided payload contains as map whose individual keys aren't valid field references.

  • decouple field name cache from the FieldReference cache, allowing these fields to be created.
  • Extend the FieldReference syntax to introduce a way to reference these fields in a pipeline, such as by using escape sequences. This may be a breaking change, which would require per-pipeline FieldReference parsers.
  • Adjust the Treetop grammar to properly capture field references that contain escape sequences EDIT: work carries forward the idea of using either percent-encoding or ampersand-encoding, which would be transparent to the treetop grammar.
  • Create a new EventFactory implementation or wrapper that emulates the "old" behaviour of treating the keys as field references, and allow it to be selected with a pipeline-level setting. This implementation should properly nest the keys with the field reference's path, and should raise an exception when encountering fields that are not valid. EDIT: the "old" behaviour was undefined and basically never worked. Either the field name was a valid FieldReference and we silently discarded the path, or the field name was not a valid FieldReference and we threw an error. The only case where it worked correctly was when a top-level field name was wrapped in square brackets.

@yaauie yaauie force-pushed the spacetime/escapable-field-references branch from a6bccbd to 81423c6 Compare April 13, 2022 17:22
@yaauie yaauie force-pushed the spacetime/escapable-field-references branch 2 times, most recently from 39ac2e3 to 8c88db3 Compare April 20, 2022 17:51
@yaauie yaauie force-pushed the spacetime/escapable-field-references branch from 8c88db3 to 7e98b4d Compare April 25, 2022 15:58
yaauie added 2 commits April 25, 2022 19:10
Keys passed to most methods of `ConvertedMap`, based on `IdentityHashMap`
depend on identity and not equivalence, and therefore rely on the keys being
_interned_ strings. In order to avoid hitting the JVM's global String intern
pool (which can have performance problems), operations to normalize a string
to its interned counterpart have traditionally relied on the behaviour of
`FieldReference#from` returning a likely-cached `FieldReference`, that had
an interned `key` and an empty `path`.

This is problematic on two points.

First, when `ConvertedMap` was given data with keys that _were_ valid string
field references representing a nested field (such as `[host][geo][location]`),
the implementation of `ConvertedMap#put` effectively silently discarded the
path components because it assumed them to be empty, and only the key was
kept (`location`).

Second, when `ConvertedMap` was given a map whose keys contained what the
field reference parser considered special characters but _were NOT_
valid field references, the resulting `FieldReference.IllegalSyntaxException`
caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose
`key` and `path`-components are known to have been interned, we introduce an
internment helper on our `ConvertedMap` that is also backed by the global string
intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has
three net effects:

 - Our ConvertedMap operations still use strings
   from the global intern pool
 - We have a new, smaller cache of individual field
   names, improving lookup performance
 - Our FieldReference cache no longer is flooded
   with fragments and therefore is more likely to
   remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require
      a careful audit of the possible code-paths to `ConvertedMap#putInterned`.
      The new cache is limited to 10k strings, and when more are used only
      the FIRST 10k strings will be primed into the cache, leaving the
      remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we alow events to be created whose fields _CANNOT_
      be referenced with the existing FieldReference implementation.

Resolves: elastic#13606
Resolves: elastic#11608
Adds a `config.field_reference.escape_style` option and a companion
command-line flag `--field-reference-escape-style` allowing a user
to opt into one of two proposed escape-sequence implementations for field
reference parsing:

 - `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes
 - `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points

The default is `NONE`, which does _not_ proccess escape sequences.
With this setting a user effectively cannot reference a field whose name
contains FieldReference-reserved characters.

| ESCAPE STYLE | `[`     | `]`     |
| ------------ | ------- | ------- |
| `NONE`       | _N/A_   | _N/A_   |
| `PERCENT`    | `%5B`   | `%5D`   |
| `AMPERSAND`  | `&elastic#91;` | `&elastic#93;` |
@yaauie yaauie force-pushed the spacetime/escapable-field-references branch from 7e98b4d to 179a530 Compare April 25, 2022 19:10
@yaauie yaauie closed this Apr 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants