
Keep original messages in case of Filebeat modules #8448

Closed

Conversation

kvch
Contributor

@kvch kvch commented Sep 26, 2018

It is now possible to keep the original messages of Filebeat modules. This makes it possible to reprocess the raw message events later if required.

Configuration

This is an input level configuration, so it must be configured in the input section of each module.

# ...module configuration...
  input:
    keep_original_message: true

This also means that it is valid to configure this option on inputs directly. However, it is ignored if an event does not come from a Filebeat module.

This is the first step of implementing #8083 and the alternative to #7207.

@kvch kvch added review Filebeat Filebeat labels Sep 26, 2018
@kvch kvch requested review from urso and ruflin September 26, 2018 15:30
@kvch
Contributor Author

kvch commented Sep 26, 2018

Added an entry to our Changelog as a breaking change, as it is going to have a big impact on users' Elasticsearch instances.

Contributor

@ruflin ruflin left a comment


We renamed this to log.original in ECS. I think it also fits quite well with your code, as you often use original: https://github.com/elastic/ecs#log

@@ -81,6 +93,9 @@ filebeat.modules:
# Input configuration (advanced). Any input configuration option
# can be added under this section.
#input:
# Keeps the original message, so the data can be processed again on Ingest Node.
# It requires increased storage size, because the sizes of events are approximately doubled.
Contributor


Let's not make statements about doubling the size without measuring it. I also think it's not needed to have this in the config file, but we could add a note in the docs that there is increased storage use.

@ruflin
Contributor

ruflin commented Sep 27, 2018

Thanks for taking this on. I'm closing #7207 in favor of this one.

Questions:

  • What will the original message look like in the case of truncated messages?
  • What will the original message look like in the case of generic JSON?
  • What will the original message look like in the case of Docker multiline?

@@ -19,6 +19,8 @@ https://github.com/elastic/beats/compare/v6.4.0...master[Check the HEAD diff]

*Filebeat*

- Keep original messages in case of Filebeat modules. {pull}8448[8448]
Contributor


I would not put it under breaking changes as it's an addition, but we should definitely have a note in the migration guide about the additional storage use.

@kvch
Contributor Author

kvch commented Sep 28, 2018

The contents of log.original are the same as in your PR: "full multiline message, is encoded and the new line at the end is stripped". The copying of the original message is done by a processor at the beginning, so whatever comes out of the readers of log.Harvester is going to be stored in log.original.
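The processor-at-the-front approach can be sketched roughly like this (illustrative Go, not the actual libbeat code; `Event` and `keepOriginalMessage` are simplified stand-ins invented for this sketch):

```go
package main

import "fmt"

// Event is a simplified stand-in for a libbeat event: a flat map of fields.
type Event map[string]interface{}

// keepOriginalMessage copies the raw message into "log.original" so the event
// can be reprocessed later, even after parsing mutates "message".
// It runs before any other processor touches the event.
func keepOriginalMessage(e Event) Event {
	if msg, ok := e["message"]; ok {
		e["log.original"] = msg
	}
	return e
}

func main() {
	e := Event{"message": "Sep 26 15:30:00 host sshd[42]: session opened"}
	e = keepOriginalMessage(e)
	fmt.Println(e["log.original"])
}
```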

Contributor

@ruflin ruflin left a comment


I think there is a difference between the two implementations. As the original message was copied here (https://github.com/elastic/beats/pull/7207/files#diff-8c78b3728628b1d2ecbd6a3a066629e8R47), it is not affected afterwards by the limit reader. Same reason why I added it in multiline.

I haven't tested it, but I assume that if MaxBytes is set to 10 and the line is 20 bytes long, with the current code the original will also be 10 bytes, which makes it impossible to reprocess.
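The ordering concern can be shown with a toy sketch (`limitLine` is a hypothetical stand-in for Filebeat's limit reader, invented for illustration):

```go
package main

import "fmt"

// limitLine emulates a limit reader: it truncates a line to maxBytes.
// This is an illustration only, not Filebeat's actual reader code.
func limitLine(line string, maxBytes int) string {
	if len(line) > maxBytes {
		return line[:maxBytes]
	}
	return line
}

func main() {
	line := "a line that is longer than ten bytes"

	// If the original is captured AFTER the limit reader runs,
	// it is already truncated and cannot be reprocessed in full:
	late := limitLine(line, 10)

	// Capturing it BEFORE the limit reader keeps the full message,
	// while the emitted message is still bounded:
	original := line
	emitted := limitLine(line, 10)

	fmt.Printf("captured after limit:  %q\n", late)
	fmt.Printf("captured before limit: %q (emitted: %q)\n", original, emitted)
}
```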

@kvch
Contributor Author

kvch commented Oct 1, 2018

Yes, that's exactly how it works. I'll do the refactoring as requested.

For the record, in this case it's even riskier to enable this feature by default, because users cannot limit the size of the log.original field.

@ruflin
Contributor

ruflin commented Oct 1, 2018

We should definitely mention that in the docs to make users aware of it.

@kvch kvch requested a review from ph October 11, 2018 10:35
@ph
Contributor

ph commented Oct 16, 2018

@kvch I like that it is just another processor that runs before any user-defined processors.

For the record, in this case it's even riskier to enable this feature by default, because users cannot limit the size of log.original field.

I agree with you, we should not set it to true by default. Let's take a common scenario, reading a Java log file: by default, when nothing bad happens, a document will be at least double the original size, which has the following direct impact:

  • More data to serialize.
  • More data to compress; since we have duplicate values we should get a good compression ratio.
  • More data to send over the wire.
  • More memory usage after deserialization.

The above should be under control when the system is in a happy state, but as soon as a stack trace happens it will generate really big multiline events, and this is where things will start to explode.

In that case, I would prefer that we don't set that value by default, and maybe we could add a section about reprocessing of events in our documentation that points to this option but also exposes any caveats we may have, like the possibility of truncated values.

because users cannot limit the size of log.original field.

This makes me think: what if we provided a truncate processor? I believe this could be handy and useful to allow users to truncate a field, e.g. stack traces in the multiline context, or limiting content after doing a dissect.

Note: I am aware of the default limit of 500 lines for multiline, but maybe we could provide more flexibility.
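A truncate processor along the lines suggested above might look roughly like this (hypothetical sketch; `truncateField` and its signature are invented for illustration, no such processor existed in Beats at the time of this discussion):

```go
package main

import "fmt"

// truncateField limits a named string field to maxBytes. This sketches the
// suggested truncate processor; the name and behavior are hypothetical.
func truncateField(e map[string]interface{}, field string, maxBytes int) {
	if v, ok := e[field].(string); ok && len(v) > maxBytes {
		e[field] = v[:maxBytes]
	}
}

func main() {
	e := map[string]interface{}{"log.original": "a very long stack trace ..."}
	truncateField(e, "log.original", 11)
	fmt.Println(e["log.original"]) // a very long
}
```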

var defaultConfig = inputOutletConfig{
	KeepOriginalMsg: true,
}

Contributor


I strongly believe this should be an opt-in feature.

@kvch
Contributor Author

kvch commented Oct 17, 2018

@ph I enabled the option by default for modules, because we agreed to do so. I think this solution is an acceptable middle ground, because it still keeps the original message read by Filebeat before it is parsed by an Ingest node, but users still have control over the message sizes.
Maybe we could add a new option, e.g. original_message_length. We could store the original untruncated message produced by the reader pipeline, but right before it leaves the pipeline it could be truncated to the limit the user configured.

@ph @urso @ruflin do we need to have another conversation regarding this change?

@ph
Contributor

ph commented Oct 17, 2018

@ruflin Do you also agree to have it on by default for modules?

@ruflin
Contributor

ruflin commented Oct 18, 2018

For me the goal of keeping the original message around is to allow reprocessing. If we truncate it by default, we lose this capability.

For on by default: What if we release this option in 6.x but off by default so we can start playing around with it. Then we can still have a discussion again for 7.0 if we should enable it by default or not. This buys us time to also do some size tests.

@ph
Contributor

ph commented Oct 18, 2018

For me the goal of keeping the original message around is to allow reprocessing. If we truncate it by default, we lose this capability.

We still need that upper bound check for memory control, but I would say that in the majority of cases that should be fine.

For on by default: What if we release this option in 6.x but off by default so we can start playing around with it. Then we can still have a discussion again for 7.0 if we should enable it by default or not. This buys us time to also do some size tests.

This looks more reasonable to me. I +1 that strategy.

@tbragin
Contributor

tbragin commented Oct 24, 2018

I agree -- not setting this on by default in 6.x. I'd consider it a breaking change, as it could change the size of the data sent dramatically.

Before setting it on by default in the future, I think we'd need to do benchmarking to see what the impact is (including increase to size on disk) and have a conversation about it.

cc @cdahlqvist

@cdahlqvist

If the main reason for keeping it is reprocessing, we can have it not be indexed. If it is just in the source it should compress quite well.

@ruflin
Contributor

ruflin commented Oct 26, 2018

@cdahlqvist The indexing is off by default: https://github.com/elastic/beats/pull/8448/files#diff-a0e7c7d7619e2c513393f5c49a980d11R119
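Disabling indexing for such a field in an Elasticsearch mapping looks roughly like this (illustrative fragment only, not the exact lines from the linked diff):

```json
{
  "log": {
    "properties": {
      "original": {
        "type": "keyword",
        "index": false,
        "doc_values": false
      }
    }
  }
}
```

With index and doc_values disabled, the field is kept in _source for reprocessing but adds no index structures of its own.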

@kvch
Contributor Author

kvch commented Apr 2, 2019

Closing in favour of new processor PRs:
