[discuss] strict field reference parser rejects valid input data #11608
Comments
Hey, I'm also running into this. I found a workaround using a filter, but it kills my performance: it goes from 2k events/sec to 500~1000.
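The commenter's exact filter isn't shown; a minimal sketch of this kind of workaround, assuming the raw JSON arrives in the `message` field and only top-level keys need sanitizing:

```
filter {
  ruby {
    code => '
      require "json"
      begin
        parsed = JSON.parse(event.get("message"))
        # replace the reserved bracket characters in each top-level key
        parsed.each { |k, v| event.set(k.tr("[]", "_"), v) }
      rescue JSON::ParserError
        event.tag("_jsonparsefailure")
      end
    '
  }
}
```

Parsing the payload in Ruby like this, on top of the pipeline's own work, is exactly where the throughput drop comes from.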
@rafael-adcp Right. What kind of solution would work for you? We cannot allow the use of field reference syntax in a field key, so we have to come up with an idea for dealing with that kind of situation. Would replacing the brackets with another character work for you?
@colinsurprenant
Also, I'm not 100% familiar with the reasons why this behaviour changed from Logstash 6.x to 7+. It's also important to note that there are some well-known producers of such data, so I bet those (…)
@rafael-adcp Thanks, I am well aware of the filtering workaround and its performance & complexity consequences. This is a [discuss] issue (also tagged as such) to discuss how we can improve this behaviour. In my previous question «What kind of solution would work for you?» I was more interested in hearing what could be improved in Logstash in the future to deal with this situation, not your current workaround, but thanks for sharing.
I agree in principle, but practically speaking we still have to deal with the problem of not allowing a field key which is ambiguous with a field reference. Although the JSON is valid, we cannot allow such a key as-is.
Not really. We are focusing on the case where a valid JSON document (note that this could also happen with other inputs/codecs) has a field key that uses the field reference syntax, which cannot be allowed as-is. For example, imagine your JSON input is the following:
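The original sample was not preserved in this capture; a hypothetical document of the shape under discussion (key and values invented) would be:

```json
{
  "[ERROR]": "connection refused",
  "host": "db-01"
}
```

The key `[ERROR]` is perfectly valid JSON, but to Logstash it reads as a field reference.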
The event is being tagged with `_jsonparsefailure`.
The various JSON codecs handle this specific exception by creating a new event with the un-decoded payload and the `_jsonparsefailure` tag. This tag leads people to believe that we cannot parse the JSON (which we can), when the real problem is that we cannot create an event containing such a field key.
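In rubydebug form, the resulting fallback event looks roughly like this (payload carried over from the hypothetical sample above):

```
{
    "message" => "{\"[ERROR]\":\"connection refused\",\"host\":\"db-01\"}",
       "tags" => [ "_jsonparsefailure" ]
}
```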
@yaauie That's right, and I believe that this is the topic of the discussion here: specifically, finding/offering a way for users to be able to create events from JSON (or another format's decoding) that contains a field key which is invalid in our event structure but valid in the original format (JSON), without having to resort to exception handling via the `_jsonparsefailure` tag.
bump.
There are a few options for how this could be handled:
replace seems the most "universal" option; escape would be nice (the most user-friendly one out-of-the-box), but LS would need to make changes to the event API to not process escaped field references.
P.S. It's a bit annoying that there isn't a specific error type for this.
If we support the escaping of the `[` and `]` characters, a config author could still reference fields whose names contain brackets.
@kares Agreed, we should use a specific exception for Invalid FieldReference. @andsel I think that could make sense; I'll play with this idea to see how it could work. I like it because it would be completely independent of the actual parser used, would be consistent, would not require special configuration, and would shift the burden to the config author to correctly address field names with brackets by escaping them.
Pointing out here that this is how syslog-ng encodes JSON arrays, i.e. a log message containing an array is flattened into bracket-indexed keys.
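The sample was lost in this capture; a hypothetical illustration of the flattening described (names and values invented): an array value such as `"tags": ["alpha", "beta"]` would come out of syslog-ng as

```json
{
  "tags[0]": "alpha",
  "tags[1]": "beta"
}
```

so every array-producing syslog-ng deployment feeds Logstash keys containing brackets.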
My preferences:
Whichever solution is chosen, I do not believe a JSON provider should have to modify valid JSON output due to a Logstash limitation. And it definitely should not crash the pipeline. Although this may not be the correct place to discuss a workaround, I would expect that Elastic/Logstash would recommend official workaround(s) until this issue is addressed.
Has any progress been made on a resolution here?
If anyone comes across this issue, I have created a temporary json filter plugin: https://github.com/OpenSource-THG/logstash-filter-json/ Please note, this has not been performance tested, so your mileage may vary, but it is sufficient for my requirements.
I downgraded to Logstash 6.x and now the pipeline is not failing, but I get these warnings:
Logstash 7.x, by contrast, would crash with `LogStash::Json::ParserError: Invalid FieldReference`.
Hi, we are also seeing this error and it's currently stopping all our Logstash pipelines. We also can't figure out how to drop all those logs; we tried multiple ways, like the following:
error message:
Has anyone figured this out?
Also interested in a fix for this issue. I get this "broken" JSON from a Kafka input, so I can't really filter it.
We're having this same issue, with events coming through a RabbitMQ input plugin using a json codec. The JSON looks like this in our case:
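The actual payload wasn't captured here; a hypothetical message of the shape described (field names invented) might be:

```json
{
  "routing[key]": "orders.created",
  "body": "..."
}
```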
The limitation still stands: individual field names in an event, including in nested key/value maps, cannot themselves contain square brackets. This is an error condition in LS7+, but in LS5 & LS6 it merely had undefined and sometimes-surprising behaviour, including null-pointer exceptions and crashes. But we can't control what some vendors send to Logstash, and we need a way to handle valid JSON. Several others have suggested using the ruby filter to parse JSON strings into a key/value map, then recursively walk over the result to transform keys. This can be made to work, but a recursive walking transformation isn't particularly performant. I have made a tested script for use with the Ruby filter plugin that will do a zero-decode transformation of any valid JSON string, such that the result is still a valid JSON string but the object it represents does not have square brackets in its field names. It can be configured to replace them with underscores, with matching parens or curly brackets, or to strip them entirely. It does not apply the transformation to encoded field values or to JSON syntax elements. The idea is that you would accept the JSON-encoded data as text from your input, run this sanitizer first, then use the json filter to do the actual decoding into its object representation.

json-sanitize-field-names.logstash-filter-ruby.rb

Its use looks something like:
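The concrete invocation wasn't captured in this thread; a minimal sketch, assuming the script is saved alongside the pipeline config and the raw JSON arrives in `message` (the `script_params` name shown is illustrative, not necessarily the script's actual parameter name):

```
filter {
  ruby {
    path => "/etc/logstash/scripts/json-sanitize-field-names.logstash-filter-ruby.rb"
    # script_params => { "replacement" => "_" }  # hypothetical parameter name
  }
  json {
    source => "message"
  }
}
```

The key point of the design is ordering: the sanitizer rewrites field names inside the still-encoded JSON text, so by the time the json filter decodes it, no key can trip the field reference parser.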
This script has 3 parameters, all of which are optional and have sensible defaults:
This solution is also a good one.
Has anybody experienced major performance hits when working around this? I implemented the script by @yaauie and it works well. It only runs when a `[` appears in the payload, so the cost is conditional.
Keys passed to most methods of `ConvertedMap`, which is based on `IdentityHashMap`, depend on identity and not equivalence, and therefore rely on the keys being _interned_ strings. In order to avoid hitting the JVM's global String intern pool (which can have performance problems), operations to normalize a string to its interned counterpart have traditionally relied on the behaviour of `FieldReference#from` returning a likely-cached `FieldReference` that had an interned `key` and an empty `path`.

This is problematic on two points. First, when `ConvertedMap` was given data with keys that _were_ valid string field references representing a nested field (such as `[host][geo][location]`), the implementation of `ConvertedMap#put` effectively silently discarded the path components because it assumed them to be empty, and only the key was kept (`location`). Second, when `ConvertedMap` was given a map whose keys contained what the field reference parser considered special characters but _were NOT_ valid field references, the resulting `FieldReference.IllegalSyntaxException` caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose `key` and `path`-components are known to have been interned, we introduce an internment helper on our `ConvertedMap` that is also backed by the global string intern pool, and ensure that our field references are primed through this pool. In addition to fixing the `ConvertedMap#newFromMap` functionality, this has three net effects:

- Our ConvertedMap operations still use strings from the global intern pool
- We have a new, smaller cache of individual field names, improving lookup performance
- Our FieldReference cache is no longer flooded with fragments and is therefore more likely to remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require a careful audit of the possible code-paths to `ConvertedMap#putInterned`. The new cache is limited to 10k strings, and when more are used only the FIRST 10k strings will be primed into the cache, leaving the remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we allow events to be created whose fields _CANNOT_ be referenced with the existing FieldReference implementation.

Resolves: elastic#13606
Resolves: elastic#11608
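The commit above hinges on the difference between identity and equivalence for map keys. A Ruby analogy of why an identity-keyed map needs interned (deduplicated) keys; this is an illustration, not Logstash code:

```ruby
a = "geoip"
b = "geoip".dup       # equivalent content, distinct object

a == b                # => true  (equivalence: what an ordinary Hash compares)
a.equal?(b)           # => false (identity: what an IdentityHashMap compares)

# Interning both strings collapses them to one shared frozen object
# (Ruby's unary minus dedups frozen strings), so an identity-based
# lookup now succeeds:
(-a).equal?(-b)       # => true
```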
* add failing tests for Event.new with fields that look like field references
* fix: correctly handle FieldReference-special characters in field names (full commit message above)
* field_reference: support escape sequences

Adds a `config.field_reference.escape_style` option and a companion command-line flag `--field-reference-escape-style`, allowing a user to opt into one of two proposed escape-sequence implementations for field reference parsing:

- `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes
- `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points

The default is `NONE`, which does _not_ process escape sequences. With this setting a user effectively cannot reference a field whose name contains FieldReference-reserved characters.
| ESCAPE STYLE | `[`     | `]`     |
| ------------ | ------- | ------- |
| `NONE`       | _N/A_   | _N/A_   |
| `PERCENT`    | `%5B`   | `%5D`   |
| `AMPERSAND`  | `&#91;` | `&#93;` |

* fixup: no need to double-escape HTML-ish escape sequences in docs
* Apply suggestions from code review

Co-authored-by: Karol Bucek <[email protected]>

* field-reference: load escape style in runner
* docs: sentences over semicolons
* field-reference: faster shortcut for PERCENT escape mode
* field-reference: escape mode control downcase
* field_reference: more s/experimental/technical preview/
* field_reference: still more s/experimental/technical preview/

Co-authored-by: Karol Bucek <[email protected]>
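A sketch of how this would be used, assuming a Logstash release that ships the setting; the field name `[ERROR]` is carried over from the earlier hypothetical sample, and the placement in `logstash.yml` is inferred from the option name:

```
# logstash.yml
config.field_reference.escape_style: PERCENT
```

```
filter {
  mutate {
    # With PERCENT style, the brackets in a field literally named "[ERROR]"
    # are percent-encoded inside the reference (per the table above):
    rename => { "[%5BERROR%5D]" => "error_message" }
  }
}
```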
You should update to version 8.3.2; I can confirm that this version does the job! 👍
Many issues have been reported, mainly related to JSON decoding either in a codec or filter, where a valid JSON document contains keys that start with a `[`, which is interpreted as a Logstash field reference and results in a `LogStash::Json::ParserError: Invalid FieldReference` error.

To reproduce:
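The original repro steps were not captured here; a minimal sketch that exercises the same failure on a 7.x Logstash (the sample key is invented):

```
bin/logstash -e '
  input  { generator { message => "{\"[foo]\":\"bar\"}" count => 1 } }
  filter { json { source => "message" } }
  output { stdout { codec => rubydebug } }
'
```

This should surface the Invalid FieldReference failure described above instead of a decoded event.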
The problem we have is that the keys are in fact valid JSON but are not parsable by Logstash, and this results in a bad user experience.
I believe we should offer some way to mitigate that, maybe by allowing the user to specify some replacement character for the brackets that denote a field reference? Open to suggestions.
This relates to the FieldReference strict mode introduced in #9543
WDYT?