Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for parsers in filestream input #24763

Merged
merged 9 commits into from
Apr 22, 2021

Conversation

kvch
Copy link
Contributor

@kvch kvch commented Mar 25, 2021

This PR adds support for pasers in the filestream input.

Example configuration that aggregates fives lines into a single event and parses the JSON contents:

- type: filestream
  enabled: true
  paths:
  - test.log
  parsers:
  - multiline:
      type: count
      count_lines: 5
      skip_newline: true
  - ndjson:
      fields_under_root: true

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Mar 25, 2021
@kvch kvch added in progress Pull request is currently in progress. Team:Elastic-Agent Label for the Agent team labels Mar 25, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Mar 25, 2021
@kvch kvch marked this pull request as draft March 25, 2021 13:36
@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 25, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #24763 updated

  • Start Time: 2021-04-20T16:17:13.878+0000

  • Duration: 76 min 39 sec

  • Commit: 0664c8b

Test stats 🧪

Test Results
Failed 0
Passed 47239
Skipped 5134
Total 52373

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 47239
Skipped 5134
Total 52373

Copy link

@urso urso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the approach looks good to me. I assume syslog and CRI-O will be added next?

Have you tried JSON in JSON or ndjson -> multiline, as is common in the case of containers?

filebeat/input/filestream/config.go Outdated Show resolved Hide resolved
filebeat/input/filestream/config.go Outdated Show resolved Hide resolved
filebeat/input/filestream/input.go Show resolved Hide resolved
filebeat/input/filestream/parser.go Show resolved Hide resolved
@urso
Copy link

urso commented Mar 29, 2021

it works most of the time

Most of the time? You see any potential limitation that don't exist in comparison to the logs input?

@kvch
Copy link
Contributor Author

kvch commented Mar 30, 2021

it works most of the time

Most of the time? You see any potential limitation that don't exist in comparison to the logs input?

JSON parsing is tricky in the log input, and I first wanted to see what you think about the basics of the refactoring than talk about the painful details. ATM I believe I can get JSON parsing working similarly to log input.

The main problem is that the JSON reader only parses the json fields and puts them under json key. A few steps of JSON processing is done after the reader pipeline is done and a new beat.Event is created. If keys_under_root is enabled, a separate function merges the fields of a beat.Event and the keys under json. Also, it sets the fields under @meta if document_id is set. Obviously, a Reader/parser in this state cannot do it because they work on Messages not Events.

@urso
Copy link

urso commented Mar 31, 2021

The main problem is that the JSON reader only parses the json fields and puts them under json key. A few steps of JSON processing is done after the reader pipeline is done and a new beat.Event is created. If keys_under_root is enabled, a separate function merges the fields of a beat.Event and the keys under json. Also, it sets the fields under @meta if document_id is set. Obviously, a Reader/parser in this state cannot do it because they work on Messages not Events.

I see. Not sure how much postprocessing will be required, but maybe we can introduce a generic helper function/method that can convert the Message to a beat.Event or extend a beat.Event from a Message object.
Regarding fiels under root, maybe we can pass a path prefix to the JSON parser, such that fields are automatically put under that prefix. If the prefix is empty, fields will be put under root. By default we can pass "json" to the transformer/parser. That way we can keep the current functionality, but have the message already in the right format. Plus, in case we add syslog or xml parsing for example, we will also publish fields. In that case we should not use the "json" namespace.

Maybe we also need to enhance Message to create a set of meta fields, such that post processing of meta does not differ from normal fields?

@kvch kvch force-pushed the feature-filebeat-filestream-parsers branch 2 times, most recently from acda1f2 to b560e7e Compare April 8, 2021 16:48
@kvch kvch marked this pull request as ready for review April 8, 2021 16:50
@kvch kvch force-pushed the feature-filebeat-filestream-parsers branch from b560e7e to 231d20c Compare April 8, 2021 16:51
@kvch kvch changed the title Proof of concenpt of parsers in filestream Add support for parsers in filestream input Apr 8, 2021
@kvch kvch force-pushed the feature-filebeat-filestream-parsers branch 3 times, most recently from 74e7393 to 3b0d514 Compare April 12, 2021 16:42
@kvch kvch requested a review from urso April 12, 2021 18:06
go.mod Outdated Show resolved Hide resolved
filebeat/input/filestream/parser.go Show resolved Hide resolved
}
fields["message"] = string(m.Content)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the complete function body up until here I wonder, if this could be a separate parser we just append to the list of parsers? In that case the beat.Event could be extracted from the message directly.


for _, proc := range inp.msgPostProc {
proc.PostProcess(&m)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to build and apply the post processors separately? Couldn't we just ensure that the parser correctly modifies the message? Are there fields that must be 'set' before the post processing is run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the fields log.offset, log.file.path are set before the input merges the JSON fields with the existing fields.

@@ -359,16 +368,24 @@ func (inp *filestream) eventFromMessage(m reader.Message, path string) beat.Even
},
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can/should inject a processor at the beginning? E.g. if we have filebeat collect the output of another filebeat, shall we really overwrite the existing fields with the new ones we define here, or should we make an attempt to keep the original fields from the original log file intact if they are present?

// specific language governing permissions and limitations
// under the License.

// +build integration
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, these are quite a few integration tests. Having a few integration tests is indeed important, as we combine quite a few components.

But as the focus is on the parsers, it would be nice to have more unit tests with parsers only, instead of full fledged integration tests that operate on the disk.

I understand that most of these tests are copied from the existing system tests and it is better to keep them as is. But maybe we can create a follow up issue to clean the tests up a little.

// The message key might have been modified by multiline
if len(pp.cfg.MessageKey) > 0 && len(msg.Content) > 0 {
jsonFields[pp.cfg.MessageKey] = string(msg.Content)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which value has MessageKey and what do we store in here if we didn't run multiline?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MessageKey is only needed if multiline is configured before the JSON parser.

@mergify
Copy link
Contributor

mergify bot commented Apr 14, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature-filebeat-filestream-parsers upstream/feature-filebeat-filestream-parsers
git merge upstream/master
git push upstream feature-filebeat-filestream-parsers

@kvch kvch force-pushed the feature-filebeat-filestream-parsers branch from 01e84d5 to dd077b9 Compare April 15, 2021 13:56
@kvch kvch added needs_backport PR is waiting to be backported to other branches. and removed in progress Pull request is currently in progress. labels Apr 20, 2021
@kvch kvch requested a review from urso April 20, 2021 10:48
var config readjson.Config
cfg := ns.Config()
cfg.Unpack(&config)
pp = append(pp, readjson.NewJSONPostProcessor(&config))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens to post processing if I have multiple JSON parsers? E.g. json => multiline => json?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, multiple post processors are added. It means that e.g. if keys_under_root or overwrite_keys is enabled for both json parsers, the last parser "wins".

Copy link

@urso urso Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm more curious what happens to the event if one has keys_under_root disable and one has it enabled.

e.g. let's assume I have the configuration:

parsers:
  - ndjson.fields_under_root: true
  - ndjson.fields_under_root: false

with this event:

{"log.level": "debug", "message": "{\"id\": 5}"}

then given the parsers I would assume my event will look like:

{
  "log.level": "debug",
  "json": { "id": 5 }
}

What what will actually be produced is:

{
  "log.level": "debug",
  "id": 5,
}

Because the post processor removes json from the event, it is not the first one that wins, but the last one... in this case.

@kvch kvch requested a review from urso April 20, 2021 15:11
@urso
Copy link

urso commented Apr 21, 2021

Looks good for now. But unfortunately we are not off the hook yet. In order to support docker json logs or syslog parsing in the future, we have to try to make the parsers more self-contained without the need for post processing. Maybe we have to 'fork' the current readers and create new implementations for the filestream input. The current readers we have are bound to the logs inputs only, anyways.

@kvch kvch merged commit 30331bc into elastic:master Apr 22, 2021
@kvch kvch added v7.13.0 and removed needs_backport PR is waiting to be backported to other branches. labels Apr 22, 2021
kvch added a commit to kvch/beats that referenced this pull request Apr 22, 2021
This PR adds support for pasers in the `filestream` input.

Example configuration that aggregates fives lines into a single event and parses the JSON contents:
```yaml
- type: filestream
  enabled: true
  paths:
  - test.log
  parsers:
  - multiline:
      type: count
      count_lines: 5
      skip_newline: true
  - ndjson:
      fields_under_root: true
```

(cherry picked from commit 30331bc)
kvch added a commit to kvch/beats that referenced this pull request Apr 22, 2021
This PR adds support for pasers in the `filestream` input.

Example configuration that aggregates fives lines into a single event and parses the JSON contents:
```yaml
- type: filestream
  enabled: true
  paths:
  - test.log
  parsers:
  - multiline:
      type: count
      count_lines: 5
      skip_newline: true
  - ndjson:
      fields_under_root: true
```

(cherry picked from commit 30331bc)
v1v added a commit to v1v/beats that referenced this pull request Apr 22, 2021
…-github-pr-comment-template

* upstream/master:
  Add support for parsers in filestream input (elastic#24763)
  Skip flaky test TestFilestreamTruncate (elastic#25218)
  backport: Add 7.13 branch (elastic#25189)
  Update decode_json_fields.asciidoc (elastic#25056)
  [Elastic Agent] Fix status and inspect command to work inside running container (elastic#25204)
v1v added a commit to v1v/beats that referenced this pull request Apr 22, 2021
…ng-versions-stack

* upstream/master: (28 commits)
  Add support for parsers in filestream input (elastic#24763)
  Skip flaky test TestFilestreamTruncate (elastic#25218)
  backport: Add 7.13 branch (elastic#25189)
  Update decode_json_fields.asciidoc (elastic#25056)
  [Elastic Agent] Fix status and inspect command to work inside running container (elastic#25204)
  Check native environment before starting (elastic#25186)
  Change event.code and winlog.event_id type (elastic#25176)
  [Ingest Manager] Proxy processes/elastic-agent to stats (elastic#25193)
  Update mergify backporting to 7.x and 7.13 (elastic#25196)
  [Heartbeat]: ensure synthetics version co* [Heartbeat]: ensure synthetics version compatability for suites  * address review and fix notice  * fix lowercase struct  * fix version conflict and rebase  * update go.* stuff to master  * fix notice.txt  * move validate inside sourcempatability for suites (elastic#24777)
  [Filebeat] Ensure Kibana audit `event.category` and `event.type` are still processed as strings. (elastic#25101)
  Update replace.asciidoc (elastic#25055)
  Fix nil panic when overwriting metadata (elastic#24741)
  [Filebeat] Add Malware Bazaar to Threat Intel Module (elastic#24570)
  Fix k8s svc selectors mapping (elastic#25169)
  [Ingest Manager] Make agent retry values for bootstraping configurable (elastic#25163)
  [Metricbeat] Remove elasticsearc.index.created from the SM code (elastic#25113)
  [Ingest Manager] Keep http and logging config during enroll (elastic#25132)
  Refactor kubernetes autodiscover to avoid skipping short-living pods (elastic#24742)
  [libbeat] New decode xml wineventlog processor (elastic#25115)
  ...
kvch added a commit that referenced this pull request Apr 22, 2021
This PR adds support for pasers in the `filestream` input.

Example configuration that aggregates fives lines into a single event and parses the JSON contents:
```yaml
- type: filestream
  enabled: true
  paths:
  - test.log
  parsers:
  - multiline:
      type: count
      count_lines: 5
      skip_newline: true
  - ndjson:
      fields_under_root: true
```

(cherry picked from commit 30331bc)
kvch added a commit that referenced this pull request Apr 22, 2021
This PR adds support for pasers in the `filestream` input.

Example configuration that aggregates fives lines into a single event and parses the JSON contents:
```yaml
- type: filestream
  enabled: true
  paths:
  - test.log
  parsers:
  - multiline:
      type: count
      count_lines: 5
      skip_newline: true
  - ndjson:
      fields_under_root: true
```

(cherry picked from commit 30331bc)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants