Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

decode_json_field: process objects and arrays only #11312

Merged
merged 12 commits into from
Mar 26, 2019

Conversation

michalpristas
Copy link
Contributor

Problem
Processor unmarshals fields incorrectly causing invalid decoding
E.g:
"2017 some string" = > 2017
"2016-09-28T01:40:26.760+0000" => 2016
"123" => 123

Solution
Process only objects and arrays to avoid data loss.

Fixes #3534

libbeat/processors/actions/decode_json_fields_test.go Outdated Show resolved Hide resolved
@@ -146,6 +146,10 @@ func unmarshal(maxDepth int, text string, fields *interface{}, processArray bool
return v, false
}

if !strings.HasPrefix(str, "[") && !strings.HasPrefix(str, "{") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You check here for [ but couldn't find an example of it the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a bug with arrays solved in PR #11318 which also includes tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it, the mentioned bug is about processing when it should not process, so it should be fine to test it with processing enabled

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a quite simple heuristic, but should be fine. It still can fail, though. But decodeJSON should have failed in the first place. For example given the string "2016-20-10" the lexer should find that - is an invalid character. The dec.More() call only returns false (as there is no more valid data in the stream), but More also returns false if there was an error. We can improve decodeJSON by checking the dec.Token() error code after dec.More().

For the heuristics (we can still guard decodeJSON using some heuristics), we should remove whitespace when testing as well:

func isStructured(s string) bool {
  s = strings.TrimSpace(s)
  end = len(s) - 1
  return end > 0 && (
    (s[0] == '[' && s[end] == ']') ||
    (s[0] == '{' && s[end] == '}')
  )
}

@ruflin
Copy link
Contributor

ruflin commented Mar 21, 2019

@michalpristas Should #11318 be merged before this one?

@michalpristas
Copy link
Contributor Author

michalpristas commented Mar 21, 2019

@ruflin i added arrays with explicit processArray true, so it is not influenced by #11318
so there's no need to favor one

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change LGTM.

@michalpristas I see you marked it as breaking change. I would rather think of it as a bug fix even though I see your point that for some users it potentially change the data format. But my thinking is that before it was not as expected.

@urso @andrewkroh Could you chime in here as you were working / looking into the initial issue?

@michalpristas
Copy link
Contributor Author

@ruflin thanks, a good eye with the changelog. i must admit it was not intentional

@@ -146,6 +146,10 @@ func unmarshal(maxDepth int, text string, fields *interface{}, processArray bool
return v, false
}

if !strings.HasPrefix(str, "[") && !strings.HasPrefix(str, "{") {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a quite simple heuristic, but should be fine. It still can fail, though. But decodeJSON should have failed in the first place. For example given the string "2016-20-10" the lexer should find that - is an invalid character. The dec.More() call only returns false (as there is no more valid data in the stream), but More also returns false if there was an error. We can improve decodeJSON by checking the dec.Token() error code after dec.More().

For the heuristics (we can still guard decodeJSON using some heuristics), we should remove whitespace when testing as well:

func isStructured(s string) bool {
  s = strings.TrimSpace(s)
  end = len(s) - 1
  return end > 0 && (
    (s[0] == '[' && s[end] == ']') ||
    (s[0] == '{' && s[end] == '}')
  )
}

@michalpristas michalpristas merged commit 6bff9a6 into elastic:master Mar 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants