[discuss] strict field reference parser rejects valid input data #11608
Comments
Hey, I'm also running into this. I found a workaround using a filter, but it kills my performance: it goes from 2k events/sec to 500~1000.
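The commenter's exact filter isn't shown; a minimal sketch of this kind of workaround, assuming the raw JSON arrives in the `message` field and only top-level keys need sanitizing:

```
filter {
  ruby {
    code => '
      require "json"
      begin
        parsed = JSON.parse(event.get("message"))
        # replace the reserved bracket characters in each top-level key
        parsed.each { |k, v| event.set(k.tr("[]", "_"), v) }
      rescue JSON::ParserError
        event.tag("_jsonparsefailure")
      end
    '
  }
}
```

Parsing the payload in Ruby like this, on top of the pipeline's own work, is exactly where the throughput drop comes from.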
@rafael-adcp Right. What kind of solution would work for you? We cannot allow the use of field reference syntax in a field key, so we have to come up with an idea for dealing with that kind of situation. Would replacing the brackets with another character work for you?
@colinsurprenant
Also, I'm not 100% familiar with the reasons why this behaviour changed from Logstash 6.x to 7+. It's also important to note that there are some well-known producers of such data, so I bet those (…)
@rafael-adcp Thanks, I am well aware of the filtering workaround and its performance & complexity consequences. This is a [discuss] issue (also tagged as such) to discuss how we can improve this behaviour. In my previous question «What kind of solution would work for you?» I was more interested in hearing what could be improved in Logstash in the future to deal with this situation, not your current workaround, but thanks for sharing.
I agree in principle, but practically speaking we still have to deal with the problem of not allowing a field key which is ambiguous with a field reference. Although the JSON is valid, we cannot allow such a key as-is.
Not really. We are focusing on the case where a valid JSON document (note that this could also happen with other inputs/codecs) has a field key that uses the field reference syntax, which cannot be allowed as-is. For example, imagine your JSON input is the following:
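The original sample was not preserved in this capture; a hypothetical document of the shape under discussion (key and values invented) would be:

```json
{
  "[ERROR]": "connection refused",
  "host": "db-01"
}
```

The key `[ERROR]` is perfectly valid JSON, but to Logstash it reads as a field reference.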
The event is being tagged with `_jsonparsefailure`.
The various JSON codecs handle this specific exception by creating a new event with the un-decoded payload and the `_jsonparsefailure` tag. This tag leads people to believe that we cannot parse the JSON (which we can), when the real problem is that we cannot create an event containing such a field key.
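In rubydebug form, the resulting fallback event looks roughly like this (payload carried over from the hypothetical sample above):

```
{
    "message" => "{\"[ERROR]\":\"connection refused\",\"host\":\"db-01\"}",
       "tags" => [ "_jsonparsefailure" ]
}
```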
@yaauie That's right, and I believe that this is the topic of the discussion here: specifically, finding/offering a way for users to be able to create events from JSON (or another format's decoding) that contains a field key which is invalid in our event structure but valid in the original format (JSON), without having to resort to exception handling via the `_jsonparsefailure` tag.
bump.
There are a few options for how this could be handled:
replace seems the most "universal" option; escape would be nice (the most user-friendly one out-of-the-box), but LS would need to make changes to the event API to not process escaped field references.
P.S. It's a bit annoying that there isn't a specific error type for this.
If we support the escaping of the `[` and `]` characters, a config author could still reference fields whose names contain brackets.
@kares Agreed, we should use a specific exception for Invalid FieldReference. @andsel I think that could make sense; I'll play with this idea to see how it could work. I like it because it would be completely independent of the actual parser used, would be consistent, would not require special configuration, and would shift the burden to the config author to correctly address field names with brackets by escaping them.
Pointing out here that this is how syslog-ng encodes JSON arrays, i.e. a log message containing an array is flattened into bracket-indexed keys.
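The sample was lost in this capture; a hypothetical illustration of the flattening described (names and values invented): an array value such as `"tags": ["alpha", "beta"]` would come out of syslog-ng as

```json
{
  "tags[0]": "alpha",
  "tags[1]": "beta"
}
```

so every array-producing syslog-ng deployment feeds Logstash keys containing brackets.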
My preferences:
Whichever solution is chosen, I do not believe a JSON provider should have to modify valid JSON output due to a Logstash limitation. And it definitely should not crash the pipeline. Although this may not be the correct place to discuss a workaround, I would expect that Elastic/Logstash would recommend official workaround(s) until this issue is addressed.
Has any progress been made on a resolution here?
If anyone comes across this issue, I have created a temporary json filter plugin: https://github.com/OpenSource-THG/logstash-filter-json/ Please note, this has not been performance tested, so your mileage may vary, but it is sufficient for my requirements.
I downgraded to Logstash 6.x and now the pipeline is not failing, but I get these warnings:
Logstash 7.x, by contrast, would crash with `LogStash::Json::ParserError: Invalid FieldReference`.
Hi, we are also seeing this error and it's currently stopping all our Logstash pipelines. We also can't figure out how to drop all those logs; we tried multiple ways, like the following:
error message:
Has anyone figured this out?
Also interested in a fix for this issue. I get this "broken" JSON from a Kafka input, so I can't really filter it.
We're having this same issue, with events coming through a RabbitMQ input plugin using a json codec. The JSON looks like this in our case:
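The actual payload wasn't captured here; a hypothetical message of the shape described (field names invented) might be:

```json
{
  "routing[key]": "orders.created",
  "body": "..."
}
```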
The limitation still stands: individual field names in an event, including in nested key/value maps, cannot themselves contain square brackets. This is an error condition in LS7+, but in LS5 & LS6 it merely had undefined and sometimes-surprising behaviour, including null-pointer exceptions and crashes. But we can't control what some vendors send to Logstash, and we need a way to handle valid JSON. Several others have suggested using the ruby filter to parse JSON strings into a key/value map, then recursively walk over the result to transform keys. This can be made to work, but a recursive walking transformation isn't particularly performant. I have made a tested script for use with the Ruby filter plugin that will do a zero-decode transformation of any valid JSON string, such that the result is still a valid JSON string but the object it represents does not have square brackets in its field names. It can be configured to replace them with underscores, with matching parens or curly brackets, or to strip them entirely. It does not apply the transformation to encoded field values or to JSON syntax elements. The idea is that you would accept the JSON-encoded data as text from your input, run this sanitizer first, then use the json filter to do the actual decoding into its object representation.

json-sanitize-field-names.logstash-filter-ruby.rb

Its use looks something like:
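The concrete invocation wasn't captured in this thread; a minimal sketch, assuming the script is saved alongside the pipeline config and the raw JSON arrives in `message` (the `script_params` name shown is illustrative, not necessarily the script's actual parameter name):

```
filter {
  ruby {
    path => "/etc/logstash/scripts/json-sanitize-field-names.logstash-filter-ruby.rb"
    # script_params => { "replacement" => "_" }  # hypothetical parameter name
  }
  json {
    source => "message"
  }
}
```

The key point of the design is ordering: the sanitizer rewrites field names inside the still-encoded JSON text, so by the time the json filter decodes it, no key can trip the field reference parser.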
This script has 3 parameters, all of which are optional and have sensible defaults:
This solution is also a good one.
Has anybody experienced major performance hits when working around this? I implemented the script by @yaauie and it works well. It only runs when a `[` appears in the payload, so the cost is conditional.
Keys passed to most methods of `ConvertedMap`, which is based on `IdentityHashMap`, depend on identity and not equivalence, and therefore rely on the keys being _interned_ strings. In order to avoid hitting the JVM's global String intern pool (which can have performance problems), operations to normalize a string to its interned counterpart have traditionally relied on the behaviour of `FieldReference#from` returning a likely-cached `FieldReference` that had an interned `key` and an empty `path`.

This is problematic on two points. First, when `ConvertedMap` was given data with keys that _were_ valid string field references representing a nested field (such as `[host][geo][location]`), the implementation of `ConvertedMap#put` effectively silently discarded the path components because it assumed them to be empty, and only the key was kept (`location`). Second, when `ConvertedMap` was given a map whose keys contained what the field reference parser considered special characters but _were NOT_ valid field references, the resulting `FieldReference.IllegalSyntaxException` caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose `key` and `path`-components are known to have been interned, we introduce an internment helper on our `ConvertedMap` that is also backed by the global string intern pool, and ensure that our field references are primed through this pool. In addition to fixing the `ConvertedMap#newFromMap` functionality, this has three net effects:

- Our ConvertedMap operations still use strings from the global intern pool
- We have a new, smaller cache of individual field names, improving lookup performance
- Our FieldReference cache is no longer flooded with fragments and is therefore more likely to remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require a careful audit of the possible code-paths to `ConvertedMap#putInterned`. The new cache is limited to 10k strings, and when more are used only the FIRST 10k strings will be primed into the cache, leaving the remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we allow events to be created whose fields _CANNOT_ be referenced with the existing FieldReference implementation.

Resolves: elastic#13606
Resolves: elastic#11608
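The commit above hinges on the difference between identity and equivalence for map keys. A Ruby analogy of why an identity-keyed map needs interned (deduplicated) keys; this is an illustration, not Logstash code:

```ruby
a = "geoip"
b = "geoip".dup       # equivalent content, distinct object

a == b                # => true  (equivalence: what an ordinary Hash compares)
a.equal?(b)           # => false (identity: what an IdentityHashMap compares)

# Interning both strings collapses them to one shared frozen object
# (Ruby's unary minus dedups frozen strings), so an identity-based
# lookup now succeeds:
(-a).equal?(-b)       # => true
```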
* add failing tests for Event.new with fields that look like field references
* fix: correctly handle FieldReference-special characters in field names (full commit message above)
* field_reference: support escape sequences

Adds a `config.field_reference.escape_style` option and a companion command-line flag `--field-reference-escape-style`, allowing a user to opt into one of two proposed escape-sequence implementations for field reference parsing:

- `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes
- `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points

The default is `NONE`, which does _not_ process escape sequences. With this setting a user effectively cannot reference a field whose name contains FieldReference-reserved characters.
| ESCAPE STYLE | `[`     | `]`     |
| ------------ | ------- | ------- |
| `NONE`       | _N/A_   | _N/A_   |
| `PERCENT`    | `%5B`   | `%5D`   |
| `AMPERSAND`  | `&#91;` | `&#93;` |

* fixup: no need to double-escape HTML-ish escape sequences in docs
* Apply suggestions from code review

Co-authored-by: Karol Bucek <[email protected]>

* field-reference: load escape style in runner
* docs: sentences over semicolons
* field-reference: faster shortcut for PERCENT escape mode
* field-reference: escape mode control downcase
* field_reference: more s/experimental/technical preview/
* field_reference: still more s/experimental/technical preview/

Co-authored-by: Karol Bucek <[email protected]>
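A sketch of how this would be used, assuming a Logstash release that ships the setting; the field name `[ERROR]` is carried over from the earlier hypothetical sample, and the placement in `logstash.yml` is inferred from the option name:

```
# logstash.yml
config.field_reference.escape_style: PERCENT
```

```
filter {
  mutate {
    # With PERCENT style, the brackets in a field literally named "[ERROR]"
    # are percent-encoded inside the reference (per the table above):
    rename => { "[%5BERROR%5D]" => "error_message" }
  }
}
```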
You should update to version 8.3.2; I can confirm that this version does the job! 👍
Many issues have been reported, mainly related to JSON decoding either in a codec or filter, where a valid JSON document contains keys that start with a `[`, which is interpreted as a Logstash field reference and results in a `LogStash::Json::ParserError: Invalid FieldReference` error.

To reproduce:
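The original repro steps were not captured here; a minimal sketch that exercises the same failure on a 7.x Logstash (the sample key is invented):

```
bin/logstash -e '
  input  { generator { message => "{\"[foo]\":\"bar\"}" count => 1 } }
  filter { json { source => "message" } }
  output { stdout { codec => rubydebug } }
'
```

This should surface the Invalid FieldReference failure described above instead of a decoded event.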
The problem we have is that the keys are in fact valid JSON but are not parsable by Logstash, and this results in a bad user experience.
I believe we should offer some way to mitigate that, maybe by allowing the user to specify some replacement character for the brackets that denote a field reference? Open to suggestions.
This relates to the FieldReference strict mode introduced in #9543
WDYT?