JSON codec doesn't handle special chars in field names #13606

Closed
garethhumphriesgkc opened this issue Jan 12, 2022 · 7 comments · Fixed by #14044

garethhumphriesgkc commented Jan 12, 2022

Logstash information:

$ docker run --rm docker.elastic.co/logstash/logstash:7.16.2 --version
Using bundled JDK: /usr/share/logstash/jdk
logstash 7.16.2

OS version (uname -a if on a Unix-like system):

$ docker --version
Docker version 20.10.7, build 20.10.7-0ubuntu5.1
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=21.10
DISTRIB_CODENAME=impish
DISTRIB_DESCRIPTION="Ubuntu 21.10"
$ uname -a
Linux workstation 5.13.0-22-generic #22-Ubuntu SMP Fri Nov 5 13:21:36 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behaviour:
When sending JSON data through the json filter or input codec, if one of the field names contains a special char (e.g. '['), the parser fails.
Note that '[' is a legal char in a field name according to the JSON spec, and such documents are parsed successfully by other parsers (e.g. jq and ruby).
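For example, jq accepts the same key without complaint:

$ echo '{"str[substr]": true}' | jq '."str[substr]"'
true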

Steps to reproduce:

  1. Start a very simple logstash instance: docker run --rm -p 6996:6996 docker.elastic.co/logstash/logstash:7.16.2 -e 'input {tcp { port => 6996 } }filter { json { source => "message" } }output { stdout {} }'
  2. Send in some sample data - when special chars are present, a _jsonparsefailure occurs: echo '{"str[substr]": true}' | nc -w1 127.0.0.1 6996
[2022-01-12T04:45:11,758][WARN ][logstash.filters.json    ][main][a700f2c38ad0ec2c2223fd9c1342d70b4cac73422170d6d5c123552c5bca662b] Exception caught in json filter {:source=>"message", :raw=>"{\"str[substr]\": true}", :exception=>#<RuntimeError: Invalid FieldReference: `str[substr]`>}
{
    "@timestamp" => 2022-01-12T04:45:11.615Z,
       "message" => "{\"str[substr]\": true}",
          "tags" => [
        [0] "_jsonparsefailure"
    ],
          "host" => "gateway",
          "port" => 58704,
      "@version" => "1"
}

Notes:

  • skip_on_invalid_json => true doesn't change this behaviour
  • Identical behaviour when config is: 'input {tcp { port => 6996 codec => json } }output { stdout {} }'
  • When the input is '{"str\[substr": true}', we don't get the exception, but the message isn't parsed. The same is true for any normal char with a preceding \ (e.g. '{"str\a": true}'). While this is invalid according to the spec, most other parsers treat an escaped non-special char as just the char itself.
  • Thus far only [ and ] seem to provoke the exception.
  • When special chars are absent, the message is parsed fine.
  • When the json filter and input codec are both absent, the message is parsed fine.
andsel (Contributor) commented Apr 11, 2022

Thank you @garethhumphriesgkc for reporting this.
This is a long-standing issue due to the way the Logstash event references fields.

More info at #11608

There is a workaround suggested in a comment on that issue that simply translates square brackets to something else.
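For illustration, here is a minimal sketch of that kind of workaround (an assumed config, not quoted from the linked comment): rewrite the reserved characters in the raw message before the json filter parses it. Note that this blanket substitution also rewrites brackets appearing inside values, not just in field names.

filter {
  mutate {
    # replace any literal `[` or `]` in the raw JSON with `_` so that no
    # parsed field name contains FieldReference-reserved characters
    gsub => [ "message", "[\\[\\]]", "_" ]
  }
  json { source => "message" }
}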

garethhumphriesgkc (Author) commented:

Hi Andsel,

The workarounds there seem to be either downgrading to Logstash 6 or adding a filter or script with significant performance impact - are these what you're suggesting?

The obvious fix seems to be to escape the [ on ingest, treat it as escaped throughout the pipeline, and unescape it on output once processing is complete - is there a reason that wouldn't work? Is something more permanent in the works? If so, is there any indication of a timeline?

andsel (Contributor) commented Apr 12, 2022

> The workarounds there seem to be either downgrading to Logstash 6 or adding a filter or script with significant performance impact - are these what you're suggesting?

Sorry for the confusion; I suggest using scripts. We don't have any measurements of the performance impact. The Ruby script uses gsub with a regex, so we should expect some impact, but it depends on how long and complicated the strings to process are.
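For illustration, a minimal sketch of such a script (assumptions: a flat, single-level JSON object, and `_` as an arbitrary replacement character) - it parses the JSON inside a ruby filter and sanitizes each key before it reaches the event, so the Invalid FieldReference error cannot be raised:

filter {
  ruby {
    code => '
      require "json"
      begin
        parsed = JSON.parse(event.get("message"))
        parsed.each do |key, value|
          # strip the FieldReference-reserved brackets from each key
          event.set(key.gsub(/[\[\]]/, "_"), value)
        end
      rescue JSON::ParserError
        event.tag("_jsonparsefailure")
      end
    '
  }
}

Because the parse happens inside the ruby filter, the json filter is bypassed entirely; the cost is interpreted Ruby plus a regex gsub per key for every event, which is the performance impact in question.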

> The obvious fix seems to be to escape the [ on ingest, treat it as escaped throughout the pipeline, and unescape it on output once processing is complete - is there a reason that wouldn't work?

The problem sits in the way the Logstash pipeline parser works, so it's something deep in the code base.
How do you intend to escape on input and unescape on output? By modifying the code in the plugins?

> Is something more permanent in the works? If so, is there any indication of a timeline?

We have some ideas and are working on it, but there's no timeline yet.

garethhumphriesgkc (Author) commented:

The follow-up comment on that thread mentions throughput dropping from 4k events per second to 150 - that's quite dramatic. It also permanently changes the data, so it can never be the same before and after Logstash.

There are plenty of other JSON parsers that use square brackets in their language, so I'd look at what they do - as you suggest, though, "field[string]" seems like the most obvious choice.

It's only when parsing events that the issue happens, so Logstash can clearly handle the embedded chars under normal operation; it's just a matter of having the json filter/codec handle them and being able to represent them in the DSL.

garethhumphriesgkc (Author) commented:

I stand corrected - other filters/codecs don't handle it either. I got thrown by the CSV filter persisting detected headers across TCP connections - when you start a brand-new CSV input and include [] in the header row, I see "tags" => [[0] "_csvparsefailure"].

yaauie added a commit to yaauie/logstash that referenced this issue Apr 19, 2022
Keys passed to most methods of `ConvertedMap`, based on `IdentityHashMap`
depend on identity and not equivalence, and therefore rely on the keys being
_interned_ strings. In order to avoid hitting the JVM's global String intern
pool (which can have performance problems), operations to normalize a string
to its interned counterpart have traditionally relied on the behaviour of
`FieldReference#from` returning a likely-cached `FieldReference`, that had
an interned `key` and an empty `path`.

This is problematic on two points.

First, when `ConvertedMap` was given data with keys that _were_ valid string
field references representing a nested field (such as `[host][geo][location]`),
the implementation of `ConvertedMap#put` effectively silently discarded the
path components because it assumed them to be empty, and only the key was
kept (`location`).

Second, when `ConvertedMap` was given a map whose keys contained what the
field reference parser considered special characters but _were NOT_
valid field references, the resulting `FieldReference.IllegalSyntaxException`
caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose
`key` and `path`-components are known to have been interned, we introduce an
internment helper on our `ConvertedMap` that is also backed by the global string
intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has
three net effects:

 - Our ConvertedMap operations still use strings
   from the global intern pool
 - We have a new, smaller cache of individual field
   names, improving lookup performance
 - Our FieldReference cache no longer is flooded
   with fragments and therefore is more likely to
   remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require
      a careful audit of the possible code-paths to `ConvertedMap#putInterned`.
      The new cache is limited to 10k strings, and when more are used only
      the FIRST 10k strings will be primed into the cache, leaving the
      remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we allow events to be created whose fields _CANNOT_
      be referenced with the existing FieldReference implementation.

Resolves: elastic#13606
Resolves: elastic#11608
yaauie added a commit to yaauie/logstash that referenced this issue Apr 20, 2022 (same commit message as above)
yaauie added a commit to yaauie/logstash that referenced this issue Apr 25, 2022 (same commit message as above)
yaauie added a commit to yaauie/logstash that referenced this issue Apr 25, 2022 (same commit message as above)
yaauie added a commit that referenced this issue May 24, 2022
* add failing tests for Event.new with fields that look like field references

* fix: correctly handle FieldReference-special characters in field names (same commit message as above).

* field_reference: support escape sequences

Adds a `config.field_reference.escape_style` option and a companion
command-line flag `--field-reference-escape-style` allowing a user
to opt into one of two proposed escape-sequence implementations for field
reference parsing:

 - `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes
 - `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points

The default is `NONE`, which does _not_ process escape sequences.
With this setting a user effectively cannot reference a field whose name
contains FieldReference-reserved characters.

| ESCAPE STYLE | `[`     | `]`     |
| ------------ | ------- | ------- |
| `NONE`       | _N/A_   | _N/A_   |
| `PERCENT`    | `%5B`   | `%5D`   |
| `AMPERSAND`  | `&#91;` | `&#93;` |
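
As a hedged illustration (assuming these escape styles apply wherever a field reference is parsed in pipeline configuration), a field literally named `str[substr]` would be addressed like this under each style:

# with --field-reference-escape-style percent
filter { mutate { add_field => { "copy" => "%{[str%5Bsubstr%5D]}" } } }

# with --field-reference-escape-style ampersand
filter { mutate { add_field => { "copy" => "%{[str&#91;substr&#93;]}" } } }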

* fixup: no need to double-escape HTML-ish escape sequences in docs

* Apply suggestions from code review

Co-authored-by: Karol Bucek <[email protected]>

* field-reference: load escape style in runner

* docs: sentences over semiciolons

* field-reference: faster shortcut for PERCENT escape mode

* field-reference: escape mode control downcase

* field_reference: more s/experimental/technical preview/

* field_reference: still more s/experimental/technical preview/

Co-authored-by: Karol Bucek <[email protected]>
garethhumphriesgkc (Author) commented:

I just tested this, and the bug does not appear to be fixed. Running 8.5.3, the above repro still results in a _jsonparsefailure:

$ docker run --rm -p 6996:6996 docker.elastic.co/logstash/logstash:8.5.3 -e 'input {tcp { port => 6996 } }filter { json { source => "message" } }output { stdout {} }' --field-reference-escape-style ampersand
.
.
.
[2022-12-19T20:17:12,486][WARN ][logstash.filters.json    ][main][a700f2c38ad0ec2c2223fd9c1342d70b4cac73422170d6d5c123552c5bca662b] Exception caught in json filter {:source=>"message", :raw=>"{\"str[substr]\": true}", :exception=>#<RuntimeError: Invalid FieldReference: `str[substr]`>}
{
      "@version" => "1",
       "message" => "{\"str[substr]\": true}",
    "@timestamp" => 2022-12-19T20:17:12.344138503Z,
          "tags" => [
        [0] "_jsonparsefailure"
    ],
         "event" => {
        "original" => "{\"str[substr]\": true}"
    }
}

andsel (Contributor) commented Dec 21, 2022

I tried a couple of pipelines, and the problem seems to lie in the json filter for some reason.

Feeding the data

{"str[substr]": true}

to

logstash-8.3.3/bin/logstash -e 'input {tcp { port => 1234 } }filter { json { source => "message" } }output { stdout {} }' --field-reference-escape-style=percent

fails with

[2022-12-21T15:21:02,919][WARN ][logstash.filters.json    ][main][dacc140c452c0d09a3d6f5a419982229df4afbd6a1ec8113d5c8f800b4df40b9] Exception caught in json filter {:source=>"message", :raw=>"{\"str[substr]\": true}", :exception=>#<RuntimeError: Invalid FieldReference: `str[substr]`>}
{
    "@timestamp" => 2022-12-21T14:21:02.786132Z,
       "message" => "{\"str[substr]\": true}",
      "@version" => "1",
         "event" => {
        "original" => "{\"str[substr]\": true}"
    },
          "tags" => [
        [0] "_jsonparsefailure"
    ]
}

but using the json codec in the input

logstash-8.3.3/bin/logstash -e 'input {tcp { port => 1234 codec => json } }output { stdout {} }' --field-reference-escape-style=percent

it succeeds with

[2022-12-21T15:21:44,272][INFO ][logstash.codecs.jsonlines][main][609c80422695297123b69074205871c4ed126bca9b52cc36541056e9aaa0a50d] ECS compatibility is enabled but `target` option was not specified. This may cause fields to be set at the top-level of the event where they are likely to clash with the Elastic Common Schema. It is recommended to set the `target` option to avoid potential schema conflicts (if your data is ECS compliant or non-conflicting, feel free to ignore this message)
{
       "@version" => "1",
     "@timestamp" => 2022-12-21T14:21:48.929152Z,
    "str[substr]" => true
}
