
Keep original messages in case of Filebeat modules #8448

Closed

Conversation

kvch
Contributor

@kvch kvch commented Sep 26, 2018

It is now possible to keep the original messages of Filebeat modules. This makes it possible to reprocess the raw message events later if required.

Configuration

This is an input level configuration, so it must be configured in the input section of each module.

# ...module configuration...
  input:
    keep_original_message: true

This also means that it is valid to configure this option on inputs directly. However, it is ignored if an event does not come from a Filebeat module.

This is the first step of implementing #8083 and the alternative to #7207.

@kvch kvch added review Filebeat Filebeat labels Sep 26, 2018
@kvch kvch requested review from urso and ruflin September 26, 2018 15:30
@kvch
Contributor Author

kvch commented Sep 26, 2018

Added an entry to our Changelog as a breaking change, as it is going to have a big impact on users' Elasticsearch instances.

Contributor

@ruflin ruflin left a comment


We renamed this to log.original in ECS. I think it also fits quite well with your code, as you often use original: https://github.com/elastic/ecs#log

@@ -81,6 +93,9 @@ filebeat.modules:
# Input configuration (advanced). Any input configuration option
# can be added under this section.
#input:
# Keeps the original message, so the data can be processed again on Ingest Node.
# It requires increased storage size, because the sizes of events are approximately doubled.
Contributor


Let's not make statements about doubling the size without measuring it. I also think it's not needed to have this in the config file, but we could add a note in the docs that there is increased storage use.

@ruflin
Contributor

ruflin commented Sep 27, 2018

Thanks for taking this on. I'm closing #7207 in favor of this one.

Questions:

  • What will the original message look like in the case of truncated messages?
  • What will the original message look like in the case of generic JSON?
  • What will the original message look like in the case of Docker multiline?

@@ -19,6 +19,8 @@ https://github.com/elastic/beats/compare/v6.4.0...master[Check the HEAD diff]

*Filebeat*

- Keep original messages in case of Filebeat modules. {pull}8448[8448]
Contributor


I would not put it under breaking changes as it's an addition, but we should definitely have a note in the migration guide about the additional storage use.

@kvch
Contributor Author

kvch commented Sep 28, 2018

The contents of log.original are the same as in your PR: "full multiline message, is encoded and the new line at the end is stripped". The copying of the original message is done by a processor at the beginning, so whatever comes out of the readers of log.Harvester is going to be stored in log.original.
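The processor-at-the-front approach can be sketched roughly like this (illustrative Go, not the actual libbeat code; `Event` and `keepOriginalMessage` are simplified stand-ins invented for this sketch):

```go
package main

import "fmt"

// Event is a simplified stand-in for a libbeat event: a flat map of fields.
type Event map[string]interface{}

// keepOriginalMessage copies the raw message into "log.original" so the event
// can be reprocessed later, even after parsing mutates "message".
// It runs before any other processor touches the event.
func keepOriginalMessage(e Event) Event {
	if msg, ok := e["message"]; ok {
		e["log.original"] = msg
	}
	return e
}

func main() {
	e := Event{"message": "Sep 26 15:30:00 host sshd[42]: session opened"}
	e = keepOriginalMessage(e)
	fmt.Println(e["log.original"])
}
```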

Contributor

@ruflin ruflin left a comment


I think there is a difference between the two implementations. As the original message was copied here (https://github.com/elastic/beats/pull/7207/files#diff-8c78b3728628b1d2ecbd6a3a066629e8R47), it is not affected afterwards by the limit reader. Same reason why I added it in multiline.

I haven't tested it, but I assume that if MaxBytes is set to 10 and the line is 20 bytes long, with the current code the original will also be 10 bytes, which makes it impossible to reprocess.
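The ordering concern can be shown with a toy sketch (`limitLine` is a hypothetical stand-in for Filebeat's limit reader, invented for illustration):

```go
package main

import "fmt"

// limitLine emulates a limit reader: it truncates a line to maxBytes.
// This is an illustration only, not Filebeat's actual reader code.
func limitLine(line string, maxBytes int) string {
	if len(line) > maxBytes {
		return line[:maxBytes]
	}
	return line
}

func main() {
	line := "a line that is longer than ten bytes"

	// If the original is captured AFTER the limit reader runs,
	// it is already truncated and cannot be reprocessed in full:
	late := limitLine(line, 10)

	// Capturing it BEFORE the limit reader keeps the full message,
	// while the emitted message is still bounded:
	original := line
	emitted := limitLine(line, 10)

	fmt.Printf("captured after limit:  %q\n", late)
	fmt.Printf("captured before limit: %q (emitted: %q)\n", original, emitted)
}
```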

@kvch
Contributor Author

kvch commented Oct 1, 2018

Yes, that's exactly how it works. I'll do the refactoring as requested.

For the record, in this case it's even riskier to enable this feature by default, because users cannot limit the size of the log.original field.

@ruflin
Contributor

ruflin commented Oct 1, 2018

We should definitely mention that in the docs to make users aware of it.

@kvch kvch requested a review from ph October 11, 2018 10:35
@ph
Contributor

ph commented Oct 16, 2018

@kvch I like that it is just another processor that runs before any user-defined processors.

For the record, in this case it's even riskier to enable this feature by default, because users cannot limit the size of log.original field.

I agree with you, we should not set it to true by default. Let's take a common scenario, reading a Java log file: by default, when nothing bad happens, a document will be at least double the original size, which has the following direct impact:

  • More data to serialize.
  • More data to compress; since we have duplicate values we should get a good compression ratio.
  • More data to send over the wire.
  • More memory usage after deserialization.

The above should be under control when the system is in a happy state, but as soon as a stack trace happens it will generate really big multiline events, and this is where things will start to explode.

In that case, I would prefer that we don't set that value by default, and maybe we could add a section about reprocessing of events in our documentation that points to this option but also exposes any caveats we may have, like the possibility of truncated values.

because users cannot limit the size of log.original field.

This makes me think: what if we provided a truncate processor? I believe this could be handy and useful to allow users to truncate a field, e.g. stack traces in the multiline context, or limiting content after doing a dissect.

Note: I am aware of the default limit of 500 lines for multiline, but maybe we could provide more flexibility.
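A truncate processor along the lines suggested above might look roughly like this (hypothetical sketch; `truncateField` and its signature are invented for illustration, no such processor existed in Beats at the time of this discussion):

```go
package main

import "fmt"

// truncateField limits a named string field to maxBytes. This sketches the
// suggested truncate processor; the name and behavior are hypothetical.
func truncateField(e map[string]interface{}, field string, maxBytes int) {
	if v, ok := e[field].(string); ok && len(v) > maxBytes {
		e[field] = v[:maxBytes]
	}
}

func main() {
	e := map[string]interface{}{"log.original": "a very long stack trace ..."}
	truncateField(e, "log.original", 11)
	fmt.Println(e["log.original"]) // a very long
}
```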

var defaultConfig = inputOutletConfig{
	KeepOriginalMsg: true,
}

Contributor


I strongly believe this should be an opt-in feature.

@kvch
Contributor Author

kvch commented Oct 17, 2018

@ph I enabled the option by default for modules, because we agreed to do so. I think this solution is an acceptable middle ground, because it still keeps the original message read by Filebeat before it is parsed by an Ingest node, but users still have control over the message sizes.
Maybe we could add a new option, e.g. original_message_length. We could store the original untruncated message produced by the reader pipeline, but right before it leaves the pipeline it could be truncated to the limit the user configured.

@ph @urso @ruflin do we need to have another conversation regarding this change?

@ph
Contributor

ph commented Oct 17, 2018

@ruflin Do you also agree to have it on by default for modules?

@ruflin
Contributor

ruflin commented Oct 18, 2018

For me the goal of keeping the original message around is to allow reprocessing. If we truncate it by default, we lose this capability.

For on by default: What if we release this option in 6.x but off by default so we can start playing around with it. Then we can still have a discussion again for 7.0 if we should enable it by default or not. This buys us time to also do some size tests.

@ph
Contributor

ph commented Oct 18, 2018

For me the goal of keeping the original message around is to allow reprocessing. If we truncate it by default, we lose this capability.

We still need that upper bound check for memory control, but I would say that in the majority of cases that should be fine.

For on by default: What if we release this option in 6.x but off by default so we can start playing around with it. Then we can still have a discussion again for 7.0 if we should enable it by default or not. This buys us time to also do some size tests.

This looks more reasonable to me. I +1 that strategy.

@tbragin
Contributor

tbragin commented Oct 24, 2018

I agree -- not setting this on by default in 6.x. I'd consider it a breaking change, as it could change the size of the data sent dramatically.

Before setting it on by default in the future, I think we'd need to do benchmarking to see what the impact is (including increase to size on disk) and have a conversation about it.

cc @cdahlqvist

@cdahlqvist

If the main reason for keeping it is reprocessing, we can have it not be indexed. If it is just in the source it should compress quite well.

@ruflin
Contributor

ruflin commented Oct 26, 2018

@cdahlqvist The indexing is off by default: https://github.com/elastic/beats/pull/8448/files#diff-a0e7c7d7619e2c513393f5c49a980d11R119
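Disabling indexing for such a field in an Elasticsearch mapping looks roughly like this (illustrative fragment only, not the exact lines from the linked diff):

```json
{
  "log": {
    "properties": {
      "original": {
        "type": "keyword",
        "index": false,
        "doc_values": false
      }
    }
  }
}
```

With index and doc_values disabled, the field is kept in _source for reprocessing but adds no index structures of its own.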

@kvch
Contributor Author

kvch commented Apr 2, 2019

Closing in favour of new processor PRs:
