
Filebeat modules: keep raw message #8083

Closed
urso opened this issue Aug 24, 2018 · 10 comments

urso commented Aug 24, 2018

Filebeat modules parse and remove the original message. When the original content is JSON, the raw message is not published by Filebeat at all. For debugging, re-processing, or simply displaying the original logs, Filebeat should be able to publish the original unprocessed content as well.

  • Add the raw contents to log.message
  • Add an option to modules to keep the original message in log.message, enabled by default (document this as a backwards-incompatible change; see the sketch below)
  • Update the JSON reader to also report the original unparsed contents
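A hypothetical sketch of what such a module option might look like (the flag name keep_original_message and its placement are assumptions for illustration; this syntax was never shipped as-is):

filebeat.modules:
  - module: nginx
    access:
      enabled: true
      # hypothetical option: also publish the raw line in log.message
      keep_original_message: true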
ruflin commented Aug 27, 2018

@urso should we continue with this PR, or do you want to open a separate one? #7207

urso commented Aug 27, 2018

I think we can continue with the PR.

kvch commented Nov 19, 2018

In #8448 many people provided feedback regarding this feature. I am collecting my thoughts here because I am proposing a different solution, so I might close that PR. I consider #8950 a related issue; we can solve both problems with the same solution.

I think keeping original messages can be turned on by default, because the field containing the raw message will not be indexed. If a field is not indexed, it can be compressed better. Ref: #8448 (comment)
Regardless, I will do some benchmarking to see the size requirements.
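For illustration, a minimal sketch of how such a field could be mapped in Elasticsearch so that it is stored in _source but neither indexed nor given doc values (the index name and the log.original field name are assumptions taken from the discussion below):

PUT filebeat-example
{
  "mappings": {
    "properties": {
      "log": {
        "properties": {
          "original": {
            "type": "keyword",
            "index": false,
            "doc_values": false
          }
        }
      }
    }
  }
}

With index: false and doc_values: false, the field cannot be searched or aggregated on, but the raw value remains available in _source for display and reprocessing.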

I think keeping raw messages is relevant in two cases in Filebeat: JSON messages and modules. Both are handled by Filebeat itself, and there is no other way to reprocess the input once Filebeat has done its own processing. The first case, JSON parsing, is straightforward: the raw string is copied and added to the event.

The second case is a bit more complex, as there are at least two possible solutions. So far we have tried copying the raw message when it is processed by Filebeat. However, I think it would be a better solution to keep the raw message on the ES side (or possibly the LS side). All module pipelines contain a processor which removes the "message" field from every event. This field contains the raw message which is then processed by ES. I propose to make deleting that field configurable for modules. I would even add the option to rename it to something more meaningful, e.g. "log.original". So I would introduce two new options for modules: keep_original_message and original_message_key.

If keep_original_message is set, the module's pipeline does not contain the remove processor. If original_message_key is configured, a rename processor is added to the end of each module pipeline instead (see the sketch below). This means that if users change these options, they need to reload the pipelines.
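A minimal sketch of the two processor variants, using standard Elasticsearch ingest processor syntax (the surrounding pipeline is omitted; the target field name follows the proposal above). Today, module pipelines end by dropping the raw message:

{ "remove": { "field": "message" } }

Under this proposal, configuring original_message_key would append a rename instead:

{ "rename": { "field": "message", "target_field": "log.original" } }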

We already have minimal control over pipelines via templates (e.g. convert_timezone). But I would use a different approach here, as it would be error-prone to add these directives to every module. I would have Filebeat read the configuration and add a remove or a rename processor as needed before loading the pipeline into ES.

In other cases, for example user-defined pipelines, the responsibility of keeping the raw message can be delegated to our users. As those users know enough to create their own pipelines, they can take care of keeping unprocessed messages themselves.

WDYT? @urso @ruflin @ph

webmat commented Nov 19, 2018

Just dropping by, wearing my ECS hat. We've called this log.original in ECS, so +1 on log.original for the default field name. And of course totally fine to make it configurable on top of this :-)

ruflin commented Nov 23, 2018

If we decide to move the handling of keeping the raw message to ingest pipelines, it means no processing like dissect is allowed to happen on the edge node. I personally see creating log.original as a feature that spans beyond modules. We don't have all the modules today, but we keep adding them. Someone ingesting data for service foo will not have a module today, but if there is one tomorrow and we have log.original, the data can be reprocessed based on log.original.

Let's not offer users the option to store the data in anything other than log.original. This will allow us to always know log.original is there, and we can rely on it. If someone really wants to change it and break the system, they can use the rename processor. Let's make it as hard as possible to change it.

Proposal for moving this forward:

  1. Introduce a config option on the prospector / input level to keep the original message around, off by default (sketched below)
  2. Follow up PR to discuss if we should enable it by default.

Like this we can get the functionality and then focus the discussion on if it should be on or off by default.
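A hypothetical sketch of what the input-level option from step 1 might look like (the flag name keep_original is an assumption for illustration; no such flag shipped in this form):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/app.log
    # hypothetical flag from this proposal, off by default
    keep_original: true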

kvch commented Mar 18, 2019

In our internal discussion we decided to add two new processors: one for copying and one for truncating messages. Exposing the functionality as processors is more flexible and gives more control to users.

kvch commented Apr 5, 2019

Closing as both copy_fields and truncate_fields are merged to master:

* #11303

* #11297

Example configuration to keep raw message:

processors:
  - copy_fields:
      fields:
        - from: message
          to: event.original
      fail_on_error: false
      ignore_missing: true
  - truncate_fields:
      fields:
        - event.original
      max_bytes: 1024
      fail_on_error: false
      ignore_missing: true
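For illustration, given an incoming line {"level":"info","msg":"hello"} read by a plain log input, the published event would then carry the raw line twice, with event.original capped at 1024 bytes (field values here are illustrative):

{
  "message": "{\"level\":\"info\",\"msg\":\"hello\"}",
  "event": {
    "original": "{\"level\":\"info\",\"msg\":\"hello\"}"
  }
}

Note that copy_fields only works if a message field actually exists on the event; as the comments below show, JSON inputs that decode the line in Filebeat may not produce one.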

goranmicic commented Apr 9, 2020

(Quoting @kvch's closing comment above.)

Hi,

I have tried keeping the original line (JSON) from a JSON log with the suggested processors, but the result is the same: the message field doesn't exist at all, so there is nothing to copy.

I tried running the processor against the json field together with json.keys_under_root: false, and the result is the same as with json.*, as expected.

When using json.keys_under_root: false, Filebeat logs that it is trying to push the JSON line (block) from the JSON log under a "json" parent key.

When using json.keys_under_root: true, Filebeat logs that it is trying to push the JSON line (block) from the JSON log without a parent key.

Again, expected, but what I can't figure out is: at what moment should the processor copy the original message and write it as the value of the "message" key while parsing a JSON log?

This is exactly case no. 1 from above, if I am not mistaken.

Maybe I am missing something; please offer a suggestion if possible.

I am pasting my filebeat.yml:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/somejsonstructured.log
  json.keys_under_root: true
  json.overwrite_keys: false
  json.add_error_key: true
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
setup.template.settings:
  index.number_of_shards: 1
setup.kibana:
  host: "localhost:5601"
output.elasticsearch:
  hosts: ["localhost:9200"]
processors:
  - copy_fields:
      fields:
        - from: message
          to: rawlog
      fail_on_error: false
      ignore_missing: true
  - truncate_fields:
      fields:
        - rawlog
      max_bytes: 1024
      fail_on_error: false
      ignore_missing: true

Thank you!

daskanu commented Jun 3, 2022

We parse our JSON logs under the parent key 'json' and use the following script processor to restore the raw message:

- script:
    when:
      and:
        - has_fields: ["json"]
        - not:
            has_fields: ["message"]
    lang: javascript
    id: json_stringify
    source: |
      // Re-serialize the parsed "json" object back into a raw string
      // and store it under "message".
      function process(event) {
        var jsons = event.Get("json");
        if (jsons != null) {
          event.Put("message", JSON.stringify(jsons));
        }
      }

Hope this helps, and hopefully the Elastic folks will implement a flag for the inputs to keep raw messages when parsing JSON.

Fondaz commented Aug 17, 2022

(Quoting @goranmicic's comment above.)

Any news? Same issue here: I don't get the original log even with the copy_fields processor.
