Multiple problems parsing JSON, escaping, encoding JSON #1278

Closed
ekimekim opened this issue Apr 15, 2019 · 83 comments
@ekimekim

I'd like to preface all the following with a note to the developers: Thank you for your work on fluent-bit, and please take the following as an attempt to help others / constructive criticism.
These issues have been hounding me for a while and apologies if some of that frustration peeks through in my language. 🙂

I was investigating various problems with JSON parsing in fluent-bit.
There are a lot of other issues about this, but I felt it'd be useful to get a minimal reproduction against latest master, with all the issues I found in one place.

Issues I believe are related, at least based on a quick scan: #278, #580, #615, #617, #673, #831, #907, #965, #1021, #1044, #1114, #1130, #1170

If you came here looking for a quick fix

As far as I can tell, there's no way currently to configure fluent-bit to correctly parse a JSON string value.

You can get most of the way there with a config that applies the escaped_utf8 decoder followed by the escaped decoder.
This will work for everything except strings that contain literal backslashes. See below for detail.
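
Concretely, for a field named log, that pair of decoders looks like this (the same pattern as Option 3 below):

Decode_Field_As escaped_utf8 log do_next
Decode_Field_As escaped log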

Alternately, if you want to parse nested JSON, you should apply the json decoder only.
Note that any string values contained in this nested JSON will then ALSO need to be decoded as above.

My setup

All the following is based on commit 00dd804, which is master at time of writing.

Here is my config file. I am using the tail input and file output for testing purposes and to keep the tests free of other distractions.

[SERVICE]
	Log_Level debug
	Parsers_File parsers.conf
	Daemon off

[INPUT]
	Name tail
	Tag test
	Path test.in
	Parser test

[OUTPUT]
	Name file
	Path test.out
	Format out_file

and parsers.conf:

[PARSER]
	Name test
	Format json
	# We will be adding decoders here

Parsing JSON

The crux of the whole problem is with how fluent-bit parses JSON values that contain strings.
This is done by flb_pack_json(), which converts the incoming buffer to a list of tokens using the jsmn library. jsmn returns tokens corresponding to objects, arrays, strings and primitives.
These tokens are then handled by tokens_to_msgpack which converts this list of tokens to a msgpack-encoded string.

The issue lies in the JSMN_STRING case:

        flen = (t->end - t->start);
        switch (t->type) {
...
        case JSMN_STRING:
            msgpack_pack_str(&pck, flen);
            msgpack_pack_str_body(&pck, js + t->start, flen);
            break;

The string tokens returned by jsmn point to the section of the buffer that is between the opening and closing quotes of the string. For example, in the JSON buffer {"foo": "two\nlines"}, the token corresponding to the "two\nlines" value points to:

{"foo": "two\nlines"}
         ^start   ^end

Note how this substring still contains the escaped newline as a backslash then an n, not as an unescaped actual newline character.

fluent-bit does no further processing on this value, but copies it into the new msgpack value verbatim.
The result is a msgpack value which contains the literal characters 't', 'w', 'o', '\', 'n', 'l', 'i', 'n', 'e', 's'.

Hence, if no other decoding is done, this string would be converted back into JSON as {"foo": "two\\nlines"}, as the encoder will see the backslash and escape it.
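
To make the failure concrete, here is a tiny standalone C demo (illustrative only, not fluent-bit code) of packing the raw token bytes verbatim:

    #include <stdio.h>

    /* For the input {"foo": "two\nlines"}, jsmn reports the token bounds
     * shown in the diagram above; fluent-bit copies those raw bytes,
     * backslash included, straight into msgpack. */
    int main(void)
    {
        const char *js = "{\"foo\": \"two\\nlines\"}";
        int start = 9, end = 19;    /* token bounds between the quotes */

        printf("%.*s\n", end - start, js + start);  /* prints: two\nlines */

        /* A correct parser would instead pack the 9 decoded bytes
         * 't','w','o','\n','l','i','n','e','s'. */
        return 0;
    }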

Now, I would argue that the program is already badly behaved, because its JSON parser does not, in fact, correctly parse JSON.
However, the documentation (somewhat obliquely) indicates that you are meant to use a decoder to rectify this situation.

If you know in advance exactly which fields you need to apply this to, then you can attempt to use decoders to improve this.
However, I'd like to point out that at this point we are already no longer able to process arbitrary JSON objects, since there's no way to apply a decoder to "all string fields" without hard-coding that set of fields in your config file.

Let's now consider the behaviour if you attempt to use various decoders on this string value.

Option 1: Use escaped decoder

The escaped decoder will convert the escapes \n, \a, \b, \t, \v, \f and \r to their respective standard interpretations (eg. \n becomes a newline).

Any other escaped characters (eg. \", \\ or \w) will be converted to the escaped character, eg. ", \ and w respectively.

This is problematic for un-escaping a JSON string because it lacks support for the \uXXXX escape, converting the \u to a u. For example, {"foo": "Pok\u00e9mon"}, when decoding the foo key in this way, becomes {"foo": "Poku00e9mon"}.

It also happens to handle escapes that JSON does not support, such as \v, but this is fine, since any valid JSON input won't have those escapes, and jsmn checks for validity when parsing.
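
For reference, the missing \uXXXX handling is not much code. Here is a minimal sketch, assuming BMP code points only (decode_u_escape is a hypothetical helper, not a fluent-bit function):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: parse the four hex digits of a \uXXXX escape and emit the
     * code point as UTF-8. BMP only; surrogate pairs are discussed below. */
    static int decode_u_escape(const char *hex, char *out)
    {
        char buf[5] = { hex[0], hex[1], hex[2], hex[3], 0 };
        unsigned cp = (unsigned) strtoul(buf, NULL, 16);

        if (cp < 0x80) {                         /* 1-byte sequence */
            out[0] = (char) cp;
            return 1;
        }
        if (cp < 0x800) {                        /* 2-byte sequence */
            out[0] = (char) (0xC0 | (cp >> 6));
            out[1] = (char) (0x80 | (cp & 0x3F));
            return 2;
        }
        out[0] = (char) (0xE0 | (cp >> 12));     /* 3-byte sequence */
        out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char) (0x80 | (cp & 0x3F));
        return 3;
    }

    int main(void)
    {
        char utf8[4];
        int n = decode_u_escape("00e9", utf8);   /* the escape in \u00e9 */
        printf("%.*s\n", n, utf8);               /* prints: é */
        return 0;
    }

With this in place, {"foo": "Pok\u00e9mon"} would decode to Pokémon instead of Poku00e9mon.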

Option 2: Use escaped_utf8 decoder

This decoder will successfully convert \uXXXX escapes. However, this line:

        if (*in_buf == '\\' && !is_json_escape(next)) {

means that any other JSON escapes will be ignored.

In addition, this decoder has a somewhat subtle bug. It doesn't handle \\ correctly.

The code loops through each character, checking if it is a backslash and the next char is not in the list of chars that would make it a JSON escape. This includes \\, so it treats the first backslash as not being part of any escape, then moves on to the next character.
The next character, of course, is another backslash. This also gets interpreted as an escape.
So, for example, "\\v" will be interpreted as \, then \v, which gets converted to a vertical tab, and the output will be \<VT>.

Amusingly, if \\ is the last thing in the string, then the final backslash is instead joined with the terminating nul character of the string, ie. \, \, NUL becomes \, NUL;
the nul terminator is skipped over, and one byte past the end of the buffer is dereferenced before the length check kicks in and halts the loop.
Turning this minor buffer overflow into a segfault or worse is left as an exercise for the reader.

Since fluent-bit keeps string lengths alongside all its strings, the fact that we have an extra NUL terminator doesn't seem to affect things too much. It shows up in the output, eg.
{"foo": "\\\u0000"}.

Option 3: Both

If you use a config like this:

Decode_Field_As escaped_utf8 MY_FIELD do_next
Decode_Field_As escaped MY_FIELD

then all escapes will be decoded. You will still have to deal with the escaped-backslash issues above, further filtered through the second layer of escaping, for example \\h -> h, \\u002f -> /.

This is the least broken option I've been able to find.

The 'json' decoder (nested json)

For fields which contain strings intended to be parsed as JSON, eg. {"log": "{\"inner\": \"foo\"}"}, the json decoder is provided.

One important aspect of this decoder (and not documented anywhere, as far as I can tell) is that it first does an escaped decode on the string before attempting to parse it.

ie. Continuing the above example of {"log": "{\"inner\": \"foo\"}"}, the log field is initially parsed as {\"inner\": \"foo\"}, then the json decoder unescapes it to {"inner": "foo"}, which is then parsed as JSON.

However, note that as discussed above, the escaped decoder does not correctly parse unicode escapes.
This is not generally an issue, since there are no characters in valid JSON which would need to be escaped using a unicode escape (\ is escaped as \\, " is escaped as \", etc).
But there are valid (albeit pathological) JSON documents which will therefore fail to parse, as any character may be replaced with a unicode-escaped version even if it's otherwise valid.
For example if the same example as above was written as {"log": "{\u0022inner\u0022: \u0022foo\u0022}"}, the escaped decoder would result in the string {u0022inneru0022: u0022foou0022}, which will then fail to parse as JSON.

You can mitigate this by passing the field through the escaped_utf8 decoder first, but that then exposes you to the \\ bugs noted above, which is likely worse, since any escaped characters inside the nested JSON will cause \\s to appear in the string. For example, under this scheme {"log": "{\"ctcp\": \"\\u0001PING\\u0001\"}"} will be first decoded by the escaped_utf8 decoder to {\"ctcp\": \"\<SOH>PING\<SOH>\"}, then by the escaped decoder to {"ctcp": "<SOH>PING<SOH>"} (where in both cases <SOH> indicates a literal 0x01 character). This is not valid JSON due to the unescaped control code and will be rejected by the JSON decoder.

Finally, it should be noted that the implementation currently also has a check that the first character of the JSON string is a { or [:

    /* It must be a map or array */
    if (dec->buffer[0] != '{' && dec->buffer[0] != '[') {
        return -1; 
    }   

Hence any JSON document with leading whitespace will not be parsed despite being a valid object.
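
A sketch of the obvious fix, mirroring the quoted fragment above (dec->buffer as in that code): skip JSON whitespace before testing the first significant character.

    /* Sketch: skip leading JSON whitespace before the map/array check,
     * so an input like "   {\"k\": 123}" is still recognized. */
    char *p = dec->buffer;
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') {
        p++;
    }
    if (*p != '{' && *p != '[') {
        return -1;
    }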

Note that any fields parsed into JSON using this decoder will themselves need to have the escaped_utf8 and escaped decoders applied to them in order for escapes in them to be processed.

Output encoding issues

In addition to all the above issues with parsing JSON fields, there are also issues with outputting data in JSON format.

The code which encodes a record to JSON uses non-JSON escapes \a and \v. For example, the input {"vtab": "\u000b"}, properly decoded, should produce the same output, but instead produces {"vtab": "\v"}, which is not valid JSON.
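
For comparison, here is a sketch (not the fluent-bit code) of what a JSON-safe encoder has to do with control characters: anything below 0x20 without one of JSON's short escapes must be emitted as \u00XX.

    #include <stdio.h>

    /* Sketch: emit one byte of a string value as valid JSON. */
    static void emit_json_char(FILE *f, unsigned char c)
    {
        switch (c) {
        case '"':  fputs("\\\"", f); break;
        case '\\': fputs("\\\\", f); break;
        case '\n': fputs("\\n", f);  break;
        case '\r': fputs("\\r", f);  break;
        case '\t': fputs("\\t", f);  break;
        case '\b': fputs("\\b", f);  break;
        case '\f': fputs("\\f", f);  break;
        default:
            if (c < 0x20) {
                /* JSON has no \a or \v: 0x0B must become \u000b */
                fprintf(f, "\\u%04x", c);
            }
            else {
                fputc(c, f);
            }
        }
    }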

The code is also not capable of outputting non-BMP unicode characters.
This could arguably be considered a parsing bug, not an encoding bug.

Non-BMP unicode chars are encoded in JSON as a surrogate pair in \u notation (RFC4627).
For example, 🙂 (U+1F642 SLIGHTLY SMILING FACE) is encoded in JSON as \ud83d\ude42.

Fluent-bit does not handle this case specially; instead (assuming you've used the escaped_utf8 decoder) it decodes the two surrogate escapes individually and stores them internally in utf-8.
Surrogate pairs are not valid in utf-8 strings, so when fluent-bit attempts to encode these characters, the implementation (arguably correctly) reports [pack] invalid UTF-8 bytes, skipping and drops those characters.
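
The missing step is small; here is a standalone demo (not fluent-bit code) of combining the two escapes into one code point and emitting a single 4-byte UTF-8 sequence:

    #include <stdio.h>

    int main(void)
    {
        /* \ud83d\ude42 as it appears in JSON */
        unsigned int high = 0xD83D, low = 0xDE42;

        /* Combine the surrogate pair into the real code point... */
        unsigned int cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        printf("U+%X\n", cp);                      /* prints: U+1F642 */

        /* ...and encode THAT as one 4-byte UTF-8 sequence, rather than
         * encoding each surrogate half separately (invalid UTF-8). */
        printf("%c%c%c%c\n",
               0xF0 |  (cp >> 18),
               0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6)  & 0x3F),
               0x80 |  (cp        & 0x3F));        /* prints: 🙂 */
        return 0;
    }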

Suggested fixes

The following are obviously bugs (ie. not intended behaviour) and should be fixed:

  • The behaviour of \\ in escaped_utf8
  • The use of non-JSON escapes (\a, \v) in JSON output
  • The inability for nested JSON fields to begin with whitespace

After that, things get trickier. I'm not sure why escaped_utf8 does not parse JSON escapes (and in fact this directly contradicts the examples in its documentation), but it appears to be intentional.
I suggest fixing this to bring it in line with its documentation, so that users need only decode with escaped_utf8 in order to get correct JSON string decoding.
As part of this change, the json decoder should also use this decoding step instead of escaped, so it can correctly interpret unicode-escaped characters.

Decoding surrogate pairs during escaped_utf8 decoding (and then in turn encoding them back into surrogate pairs when encoding as JSON) should also be done.

However, none of this deals with the case where the user cannot hard-code in their config file the full list of fields that need to be decoded. Not to mention that the current scheme of splitting the full JSON parsing process over two independent steps (JSON parser + escaped_utf8 decoder) is extremely confusing to users.

Unfortunately at this point fixing that behaviour would introduce breaking changes.

I suggest either:

  • Deprecate the existing JSON parser and introduce a new parser which correctly decodes escapes when parsing the JSON (ie. in tokens_to_msgpack()). Call it json2 or json_full or json_decoded or something.

  • Add a flag that affects the behaviour of the JSON parser.

Another option would be to add a new form of decoder specification which applies to all fields, eg. by allowing a wildcard for the field name or via a new directive. This would allow all fields to be decoded after being parsed, but it breaks in cases (such as when parsing nested JSON) where it is unknown which fields are "new" and need decoding and which are not. (Double-decoding is a bug, as it will eg. remove literal backslashes or turn \\n into a newline instead of \n.)

In any case, the documentation should be updated to very clearly instruct the user on how to configure the parser to correctly parse all valid JSON objects.

@fpytloun

fpytloun commented May 24, 2019

We are facing the same issue.

I tried to use a combination of decoders with try_next, like this:

        Decode_Field_As   escaped_utf8   log   do_next
        Decode_Field_As   json           log   try_next
        Decode_Field_As   escaped        log

This should unescape only if JSON parsing fails (eg. the event is not JSON), but it doesn't seem to be working.
When I use do_next instead, it handles JSON logs fine, but regular logs won't get parsed (the events look fine but regex parsers won't match).

@edsiper
Member

edsiper commented Jun 3, 2019

FYI: working on this.

@kiich

kiich commented Jun 3, 2019

Super excited to hear that you are looking at this @edsiper - I have run into several issues when trying to fix all of our logging issues with fluentbit, and I have come up with a decent list of issues and their impact when using different decoders, which I will file as a new issue.

It will be great to be able to solve all/most of these soon!

BTW, great catching up with you at KubeCon Europe!

@edsiper
Member

edsiper commented Jun 5, 2019

I am making some progress here. Here are the results of a simple modification where a plain 'json' parser, without decoders, does the work.

I am using these 3 simple lines as examples (if you have others please feel free to share)

{"id": 0, "log": "utf-8 code message: Fluent Bit \u00a9"}
{"id": 1, "log": "escaped \"message\""}
{"id": 2, "log": {"nested": {"key": "\\\u00a9"}}}

stdout Msgpack output:

[0] tail.0: [1559706413.733776628, {"id"=>0, "log"=>"utf-8 code message: Fluent Bit ©"}]
[1] tail.0: [1559706413.733791898, {"id"=>1, "log"=>"escaped "message""}]
[2] tail.0: [1559706413.733801106, {"id"=>2, "log"=>{"nested"=>{"key"=>"\©"}}}]

stdout JSON output:

{"date":1559706413.733777, "id":0, "log":"utf-8 code message: Fluent Bit \u00a9"}
{"date":1559706413.733792, "id":1, "log":"escaped \"message\""}
{"date":1559706413.733801, "id":2, "log":{"nested":{"key":"\\\u00a9"}}} 

I have not tested "stringified JSONs" yet... work in progress.

@gitfool
Contributor

gitfool commented Jun 5, 2019

Are nested escaped messages covered by the above too? Our Kubernetes log is structured like:

{"Timestamp":"2019-06-05T05:23:51.7911585+00:00","MessageTemplate":"{StatusCode} {StatusCodeName}","RenderedMessage":"404 \"NotFound\""}

i.e. strings in the RenderedMessage property value are always quoted and so escaped.

@kiich

kiich commented Jun 5, 2019

Great to hear about your progress @edsiper!
I'll be super interested to hear what happens with stringified JSON, which I assume means logs produced by Docker, right?

Also +1 to what @gitfool mentioned about the nested escaped messages, as that is one issue I found where the resulting log sent to Splunk breaks Splunk's ability to display/recognise it as JSON. (It gets fixed if I use the escaped_utf8 decoder, but that then breaks many other things, as I mentioned in #1370.)

Also, could we test escaped newlines, i.e. \\n, as part of this? That is also breaking/not showing correctly in Splunk for me.

@edsiper
Member

edsiper commented Jun 5, 2019

Update

(thanks again for the feedback!)

It looks like we are seeing some light here. Please review the following behavior and results carefully and share feedback:

1. Parsers configuration file parsers.conf

[PARSER]
    Name   json
    Format json
    Decode_Field_As  json  log

2. Fluent Bit configuration file

[SERVICE]
    Flush        1
    Log_Level    info
    Parsers_File parsers.conf
    Daemon       off

[INPUT]
    Name   tail
    Path   test.log
    Parser json

[OUTPUT]
    Name   stdout
    Match  *

[OUTPUT]
    Name   stdout
    Match  *
    Format json_lines

3. Log file test.log with different JSON lines (with special characters):

{"id": 0, "log": "utf-8 code message: Fluent Bit \u00a9"}
{"id": 1, "log": "escaped \"message\""}
{"id": 2, "log": {"nested": {"key": "\\\u00a9"}}}
{"id": 3, "Timestamp":"2019-06-05T05:23:51.7911585+00:00","MessageTemplate":"{StatusCode} {StatusCodeName}","RenderedMessage":"404 \"NotFound\""}
{"id": 4, "message": "\"firstline\"\n\"second line\"\nliteral breakline \\n"}
{"id": 5, "log":"{\"message\": \"1. first line\\n2. A \\\"quoted\\\" word\\n3. literal breakline \\\\n\\n4. Copyright char: \\u00a9\\n5. Backslash: \\\\\\n\", \"second_key\": 123}\n","stream":"stdout","time":"2019-06-05T17:27:48.854223332Z"}

Comments about the file content:

  • On record 5 (id 5), the original message was generated by reading a text file and then creating a JSON map with it:

    • Raw text file content
      1. first line
      2. A "quoted" word
      3. literal breakline \n
      4. Copyright char: ©
      5. Backslash: \
      
    • A Python script reads the file content above and composes an outgoing map with two keys; the generated JSON output is:
      {"message": "1. first line\n2. A \"quoted\" word\n3. literal breakline \\n\n4. Copyright char: \u00a9\n5. Backslash: \\\n", "second_key": 123}
    • A Docker container runs a script that prints out the JSON message from the previous step, so the final Docker log content is:
      {"log":"{\"message\": \"1. first line\\n2. A \\\"quoted\\\" word\\n3. literal breakline \\\\n\\n4. Copyright char: \\u00a9\\n5. Backslash: \\\\\\n\", \"second_key\": 123}\n","stream":"stdout","time":"2019-06-05T17:27:48.854223332Z"}

4. Running Fluent Bit and outputs in msgpack and JSON:

  • 4.1: msgpack output:
    [0] tail.0: [1559759324.245235220, {"id"=>0, "log"=>"utf-8 code message: Fluent Bit ©"}]
    [1] tail.0: [1559759324.245251275, {"id"=>1, "log"=>"escaped "message""}]
    [2] tail.0: [1559759324.245262709, {"id"=>2, "log"=>{"nested"=>{"key"=>"\©"}}}]
    [3] tail.0: [1559759324.245270281, {"id"=>3, "Timestamp"=>"2019-06-05T05:23:51.7911585+00:00", "MessageTemplate"=>"{StatusCode} {StatusCodeName}", "RenderedMessage"=>"404 "NotFound""}]
    [4] tail.0: [1559759324.245274444, {"id"=>4, "message"=>""firstline"
    "second line"
    literal breakline \n"}]
    [5] tail.0: [1559759324.245296339, {"id"=>5, "log"=>{"message"=>"1. first line
    2. A "quoted" word
    3. literal breakline \n
    4. Copyright char: ©
    5. Backslash: \
    ", "second_key"=>123}, "stream"=>"stdout", "time"=>"2019-06-05T17:27:48.854223332Z"}]
    
  • 4.2: JSON output:
    {"date":1559759324.245235, "id":0, "log":"utf-8 code message: Fluent Bit \u00a9"}
    {"date":1559759324.245251, "id":1, "log":"escaped \"message\""}
    {"date":1559759324.245263, "id":2, "log":{"nested":{"key":"\\\u00a9"}}}
    {"date":1559759324.245270, "id":3, "Timestamp":"2019-06-05T05:23:51.7911585+00:00", "MessageTemplate":"{StatusCode} {StatusCodeName}", "RenderedMessage":"404 \"NotFound\""}
    {"date":1559759324.245275, "id":4, "message":"\"firstline\"\n\"second line\"\nliteral breakline \\n"}
    {"date":1559759324.245296, "id":5, "log":{"message":"1. first line\n2. A \"quoted\" word\n3. literal breakline \\n\n4. Copyright char: \u00a9\n5. Backslash: \\\n", "second_key":123}, "stream":"stdout", "time":"2019-06-05T17:27:48.854223332Z"} 

5. Extra test using the Stream Processor

This is a new feature landing in the upcoming Fluent Bit v1.2.0:

Query all streams from tail and select the record where the map log['second_key'] equals 123:

$ fluent-bit -R parsers.conf -i tail -p path=test.log -p parser=json \
  -T "SELECT * FROM STREAM:tail.0 WHERE log['second_key'] = 123;" -o null -f 1
Fluent Bit v1.2.0
Copyright (C) Treasure Data

[2019/06/05 12:36:22] [ info] [storage] initializing...
[2019/06/05 12:36:22] [ info] [storage] in-memory
[2019/06/05 12:36:22] [ info] [storage] normal synchronization mode, checksum disabled
[2019/06/05 12:36:22] [ info] [engine] started (pid=5026)
[2019/06/05 12:36:22] [ info] [sp] stream processor started
[2019/06/05 12:36:22] [ info] [sp] registered task: flb-console:0
[0] [1559759782.051455, {"id"=>5, "log"=>{"message"=>"1. first line
2. A "quoted" word
3. literal breakline \n
4. Copyright char: ©
5. Backslash: \
", "second_key"=>123}, "stream"=>"stdout", "time"=>"2019-06-05T17:27:48.854223332Z"}]

At this point, please share your feedback!

note: what I am digging into now is the special characters in the key names.

@gitfool
Contributor

gitfool commented Jun 5, 2019

I should clarify: the example output I specified was raw stdout from the app, using structured-log middleware that writes JSON output to stdout; it's also the same as shown by kubectl logs:

{"Timestamp":"2019-06-05T05:23:51.7911585+00:00","MessageTemplate":"{StatusCode} {StatusCodeName}","RenderedMessage":"404 \"NotFound\""}

So by the time Fluent Bit sees it, the above would be in the nested log element in the tailed Docker log, presumably with more escaping, along with the usual stream and time elements.

@edsiper
Member

edsiper commented Jun 5, 2019

@gitfool Your app JSON output becomes the following content in Docker log:

{"log":"{\"Timestamp\":\"2019-06-05T05:23:51.7911585+00:00\",\"MessageTemplate\":\"{StatusCode} {StatusCodeName}\",\"RenderedMessage\":\"404 \\\"NotFound\\\"\"}\n","stream":"stdout","time":"2019-06-05T19:29:26.885083823Z"}

Processing with Fluent Bit

1. msgpack output:

[1559763024.832672445, {"id"=>6, "log"=>{"Timestamp"=>"2019-06-05T05:23:51.7911585+00:00", "MessageTemplate"=>"{StatusCode} {StatusCodeName}", "RenderedMessage"=>"404 "NotFound""}, "stream"=>"stdout", "time"=>"2019-06-05T19:29:26.885083823Z"}]

2. JSON output:

 {"date":1559763024.832672, "id":6, "log":{"Timestamp":"2019-06-05T05:23:51.7911585+00:00", "MessageTemplate":"{StatusCode} {StatusCodeName}", "RenderedMessage":"404 \"NotFound\""}, "stream":"stdout", "time":"2019-06-05T19:29:26.885083823Z"}

It looks good to me; please confirm whether this is the expected output on your side.

@gitfool
Contributor

gitfool commented Jun 5, 2019

LGTM. Also matches what I see in Datadog logs - which uses its own parser and bypasses Fluent Bit.

@kiich

kiich commented Jun 5, 2019

Wow, thanks for the update @edsiper!
Taking a look now, but I can confirm that the Docker log file created in your test

{"log":"{\"message\": \"1. first line\\n2. A \\\"quoted\\\" word\\n3. literal breakline \\\\n\\n4. Copyright char: \\u00a9\\n5. Backslash: \\\\\\n\", \"second_key\": 123}\n","stream":"stdout","time":"2019-06-05T17:27:48.854223332Z"}

is not quite the same as mine in that Docker log generated by my app looks like:

{"log":"{\"timestamp\":\"2019-06-03T20:41:27.292+00:00\",\"level\":\"INFO\",\"message\":\"org.xml.sax.SAXParseException: Premature end of file.\\n\\tat org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)\\n\\tat org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)\\n\\tat org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)\\n\\tat org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)\\n\\tat org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)\\n\\tat org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)\"}\n", "stream":"stdout","time":"2019-06-03T20:41:27.292925629Z"}

i.e. with only two backslashes for the newline (and for the tab as well).
So the app is logging a stack trace, which the app's logger escapes as \n, and which Docker then stringifies as \\n, I believe.

I can confirm, though, that the escaping of quotes inside quotes looks the same as mine, so that looks promising!

Happy to test it out as well in my cluster.

@gitfool
Contributor

gitfool commented Jun 5, 2019

I'd also be happy to test given a prerelease Docker image for Fluent Bit.

@kiich

kiich commented Jun 5, 2019

After re-reading the Docker log line and the 4.2 JSON output, I realised it is the same as my app's Docker log file, and the JSON output looks like something Splunk will be able to display!

Will try and send the JSON output from your test @edsiper to Splunk and see how it looks!

edsiper added a commit that referenced this issue Jun 6, 2019
The current implementation only decoded/unescaped \u characters; this
patch extends its support to: \", \\, \/, \n, \a, \b, \t, \v, \f and \r.

Signed-off-by: Eduardo Silva <[email protected]>
edsiper added a commit that referenced this issue Jun 6, 2019
When converting from JSON to MessagePack, a string was packed without
any processing leading to many issues. This has been the root cause of
many 'escape sequence' problems when reading and generating JSON data.

This problem has been really well documented and explained on #1278 by
Mike Lang (@ekimekim), public kudos!

This patch makes sure that, before packing any string, a proper
unescape of utf-8 and special characters is applied to every string.

Signed-off-by: Eduardo Silva <[email protected]>
edsiper added a commit that referenced this issue Jun 6, 2019
Due to recent improvements in the JSON -> Msgpack encoder, it's no
longer necessary to run a separate 'unescape' routine, since JSON
strings are now properly converted.

This patch works around cases where the stringified JSON map
has leading empty spaces, e.g:

  {"log": "   {\"k\": 123}"}

So when decoding the 'log' value as JSON, it will be converted
properly to a map.

note: issue described on #1278.

Signed-off-by: Eduardo Silva <[email protected]>
edsiper added a commit that referenced this issue Jun 6, 2019
When the Merge_Log option is set, Filter Kubernetes aims to work with
or without the help of a previous parser/decoder step on the original
record.

This patch aims to fix unescaping with the new core routine and also
improve the data handling in different situations, such as:

 - Is the original record's log a string or a map?
   - If it is a map, keep it as-is and register its entries under a new root key.
   - If it is a string, unescape it properly.
 - Upon parsing (by JSON or Annotation parser):
   - No merge is possible: converting the log value to JSON was impossible;
     in this case just pack the unescaped version of the string.
   - Otherwise, pack the unescaped string version and merge the newly discovered keys.

Signed-off-by: Eduardo Silva <[email protected]>
edsiper added a commit that referenced this issue Jun 6, 2019
The new improvements in JSON -> Msgpack encoding reduce the need for
the custom decoders used before. From now on, a simple 'json' decoder
is enough to get the expected results in outgoing strings.

Signed-off-by: Eduardo Silva <[email protected]>
edsiper added a commit that referenced this issue Jun 6, 2019
…ults (#1278)

Some test cases were expecting extra backslashes that came from
previous decoding/encoding issues in Fluent Bit. This patch adjusts
the expected outputs.

In short: you should no longer see spurious '\\\' in your JSON strings :)

Signed-off-by: Eduardo Silva <[email protected]>
@edsiper
Member

edsiper commented Jun 6, 2019

Update: we are ready to test out the new changes

Please use the following new Docker image:

edsiper/fluent-bit-1.2-next:0

To make sure it works properly, your Docker parser must look like this:

[PARSER]
    Name         docker
    Format       json
    Time_Key     time
    Time_Format  %Y-%m-%dT%H:%M:%S.%L
    Time_Keep    On
    # Command      |  Decoder | Field | Optional Action
    # =============|==================|=================
    Decode_Field_As    json     log

Note: it's fine to skip the Decode_Field_As line too, since the kubernetes filter will do the parsing if it has not been done before (when the Merge_Log option is set).

If you want to see the recent changes in the code base refer to the following link:

https://github.com/fluent/fluent-bit/commits/issue-1278

Please test it and share some feedback!

@mitchellmaler

@edsiper I just tested out the new image without a decoder, and now log messages are parsed into fields when they are JSON. Nice work!

@edsiper
Member

edsiper commented Jun 14, 2019

All,

First of all, thanks to @ekimekim for taking the time to report the issue, troubleshoot, provide insights and suggest the proper fixes; this ticket was a very valuable contribution, thanks!

I would say "most" of the problems were addressed; the remaining encoding issue is the handling of surrogates (re: emojis and other chars). That implementation will be handled separately and is already on the roadmap.

The fixes will be part of the Fluent Bit v1.2 release, which will be available during the next week. At this point, and after 11k pulls of the test image, I am closing this ticket as fixed.

Thanks again to everyone who helped out by providing feedback and testing.

@edsiper edsiper added the bug and fixed labels and removed the waiting-for-user (Waiting for more information, tests or requested changes) and work-in-process labels Jun 14, 2019
@rafaelmagu

@edsiper I'm seeing an issue with tag 5 of your test Docker image: an escaped string inside the JSON map generated by an app running on Kubernetes doesn't get unescaped correctly, even though the rest of the map is parsed fine.

Relevant config bits:

# For Java stack traces
[PARSER]
    Name        first_line
    Format      regex
    Regex       ^{"log":"(?!\\u0009)(?<log>\S(?:(\\")|[^"]){9}(?:(\\")|[^"])*)"

[PARSER]
    Name        nested_json
    Format      json
    Time_Keep   true
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Decode_Field_As json log do_next
    Decode_Field_As escaped message

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

[INPUT]
    Name             tail
    Path             /var/log/containers/*.log
    Parser           nested_json
    Tag              kube.*
    Refresh_Interval 5
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Multiline        On
    Multiline_Flush  5
    Parser_Firstline first_line

Original log message from container stdout:

{"@timestamp":"2019-06-20T02:04:20.698Z","@version":"1","message":"172.22.25.128 - - [2019-06-20T02:04:20.698Z] \"GET /healthcheck HTTP/1.1\" 200 -1","method":"GET","protocol":"HTTP/1.1","status_code":200,"requested_url":"GET /healthcheck HTTP/1.1","requested_uri":"/healthcheck","remote_host":"172.22.25.128","remote_user":"-","content_length":-1,"elapsed_time":0}

Object as displayed in Kibana:

{"@timestamp":"2019-06-20T02:14:03.214Z","@version":"1","message":"172.22.1.68 - - [2019-06-20T02:14:03.214Z] \"GET /healthcheck HTTP/1.1\" 200 -1","method":"GET","protocol":"HTTP/1.1","status_code":200,"requested_url":"GET /healthcheck HTTP/1.1","requested_uri":"/healthcheck","remote_host":"172.22.1.68","remote_user":"-","content_length":-1,"elapsed_time":1}

What is the recommended way to unescape message? Should I apply escaped before json in the list of decoders?

@edsiper
Copy link
Member

edsiper commented Jun 21, 2019

@rafaelmagu as stated above you should not use decoders

@rafaelmagu

Thanks, @edsiper. I removed the decoders completely (so it's just using default docker parser) and the string is still encoded:

172.22.1.68 - - [2019-06-20T02:14:03.214Z] \"GET /healthcheck HTTP/1.1\" 200 -1

@gjcarneiro

To be honest, this looks correct. With the Python 3 json module, you can see that the message decodes correctly (using the "Object as displayed in Kibana" from the comment further up):

>>> import json
>>> print(json.loads(rb'{"@timestamp":"2019-06-20T02:14:03.214Z","@version":"1","message":"172.22.1.68 - - [2019-06-20T02:14:03.214Z] \"GET /healthcheck HTTP/1.1\" 200 -1","method":"GET","protocol":"HTTP/1.1","status_code":200,"requested_url":"GET /healthcheck HTTP/1.1","requested_uri":"/healthcheck","remote_host":"172.22.1.68","remote_user":"-","content_length":-1,"elapsed_time":1}')["message"])
172.22.1.68 - - [2019-06-20T02:14:03.214Z] "GET /healthcheck HTTP/1.1" 200 -1

@rafaelmagu

@gjcarneiro That's the output I'd expect to see in Kibana. But I'm still getting the quotes.

@qingling128
Collaborator

The fixes will be part of Fluent Bit v1.2 release that will be available during the next week.

@edsiper - Is the release timeline still relevant, or has it been postponed given that we are still investigating this issue?

@edsiper
Member

edsiper commented Jun 24, 2019

@qingling128 this ticket is already fixed. The release should be this week, depending on a pending feature we are implementing in the stream processor.

@kiich

kiich commented Jun 24, 2019 via email

@qingling128
Collaborator

@edsiper - Glad to hear that. Thanks!

@edsiper
Member

edsiper commented Jun 24, 2019

@kiich

  • 1.1.x -> minor release of 1.1 series (latest is 1.1.3)
  • 1.2.x -> next major release

@edsiper
Member

edsiper commented Jun 27, 2019

Fluent Bit v1.2.0 is out: https://fluentbit.io/announcements/v1.2.0/

@edsiper edsiper closed this as completed Jun 27, 2019
@rafaelmagu

FYI I've upgraded my cluster to use 1.2.0, and now no JSON logs are being parsed (I have removed the decoders as per the upgrade notes).

@edsiper
Member

edsiper commented Jun 27, 2019

@rafaelmagu please open a new ticket and share your configmap

@jraby
Contributor

jraby commented Jul 3, 2019

@edsiper Am I correct in thinking that with those fixes, v1.2 will still output (invalid) \v in HTTP JSON output? (Meaning the item named "Output encoding issues" in the bug description is still open?)

For example, a docker container running a command like this: echo -e one\\vtwo will yield a docker log file with: {"log":"one\u000btwo\n","stream":"stdout","time":"2019-07-03T15:40:19.840758223Z"}

and fluent-bit sends out a payload like this: [{"date":"2019-07-03T15:40:19.840758Z", "log":"one\vtwo\n", "stream":"stdout", "time":"2019-07-03T15:40:19.840758223Z"}]

EDIT: (add config)

[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep   On

[PARSER]
    Name        ns_k8s_parser
    Format      regex
    # The cache used by the k8s filter uses the namespace:podname:containername as a key
    # The regex has been modified to include the container id as part of the container_name
    # so that we do not use the same meta data for same ns:podname:containername  (different run == different meta)
    # <tag>.ns-job-b393033d-7191-423a-a060-07a20794067a_ns_ns-job-b393033d-7191-423a-a060-07a20794067a-1-41be60b64fd943743d5b5b9e09b20a8ab50e360f461637c1d20aea8d52749662.log
    Regex       (?<tag>[^.]+)?\.?(?<pod_name>[^_]+)_(?<namespace_name>[^_]+)_(?<container_name>.+-(?<docker_id>[a-z0-9]{64}))\.log$

[INPUT]
    Name              tail
    Tag               ns.*
    Path              /var/log/containers/*_ns_*.log
    Parser            docker
    DB                /var/log/flb_kube_ns.db
    Mem_Buf_Limit     64MB
    Skip_Long_Lines   On
    Refresh_Interval  1
[FILTER]
    Name           kubernetes
    Match          ns.*
    Kube_URL       https://kubernetes.default.svc:443
    Regex_Parser   ns_k8s_parser
    Merge_JSON_Log On
[OUTPUT]
    Name            http
    Match           ns.*
    Host            127.0.0.1
    Port            9011
    URI             /logit
    Format          json
    json_date_format iso8601
    Retry_Limit     False


edsiper added a commit that referenced this issue Jul 3, 2019
edsiper added a commit that referenced this issue Jul 3, 2019
@edsiper
Member

edsiper commented Jul 3, 2019

@jraby thanks for reporting the issue.

In order to work on specific remaining problems found after the v1.2 release, we are opening new tickets. For this specific case, I've opened #1415.

FYI: the following commits address the issue:

@edsiper
Member

edsiper commented Jul 3, 2019

This thread will be locked/muted; please feel free to open new tickets and reference this one for similar problems found.

@fluent fluent locked as resolved and limited conversation to collaborators Jul 3, 2019