Generic Multiline Filter Re-Design #4309
Closed · PettitWesley opened this issue Nov 10, 2021 · 25 comments

@PettitWesley (Contributor) commented Nov 10, 2021

New Fluent Bit Multiline Filter Design

Background

In this section, you will learn the following key background information which is necessary to understand the plan and design:

  • Refresher on how logs are processed in our different container architectures
  • The different types of multiline log use cases
  • The new Fluent Bit multiline user interface and its support in different plugins

Understanding this design also requires a basic understanding of Fluent Bit and its log pipeline and routing and how customers configure plugins. Please read the Fluent Bit docs for that background.

How Container Logs are processed in K8s

In k8s, container stdout logs are written to files. Here is the data-flow when a customer uses Fluent Bit:

App container stdout/stderr => container runtime => log file 
  => Fluent Bit Tail Input

The key take-away here is that logs do not go straight to a log file, they pass through the container runtime first.

How Container Logs are processed with fluentd docker log driver

When the fluentd log driver is used (ex: Amazon ECS FireLens), the following is the data-flow:

App container stdout/stderr => container runtime => fluentd log driver => unix socket 
  => Fluent Bit Forward input

The key take-away here is that logs do not go straight to Fluent Bit, they pass through the container runtime and log driver first.

Multiline Log Use Case 1: Stack traces and logs with newlines

The main use case people mean when they talk about multiline is concatenating log events that were produced with newlines: the application emits a single log statement that contains newlines. There are many cases where this can happen; one important example is stack traces, which are almost always split across a series of lines.

For example, a stack trace might look like this:

Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting!
    at com.myproject.module.MyProject.badMethod(MyProject.java:22)
    at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)
    at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)
    at com.myproject.module.MyProject.someMethod(MyProject.java:10)
    at com.myproject.module.MyProject.main(MyProject.java:6)

When a user views the stack trace in a monitoring destination like Amazon CloudWatch, they want a single log event containing the entire stack trace.

Remember from above that in container use cases, logs always pass through the container runtime first. The runtime uses newlines as the delimiter for logs: logs are split into lines. In k8s, each line is written, line by line, to the log file. With the fluentd log driver, each line is a new event sent by the log driver across the TCP or unix socket connection.

Remember that Fluent Bit does not understand what a user’s logs mean. When it reads a log file, it just sees a series of lines, and when it receives events over the FireLens unix socket, it likewise sees a series of events. Thus, without a multiline log concatenation feature, Fluent Bit treats each line as a separate event.

Here is an example of what the above stack trace might look like after processing with the Fluentd Docker Log Driver (some default fields, like container name and source, omitted for brevity):

{ "container_id": "1", "log": "Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting!" }
{ "container_id": "1", "log": "at com.myproject.module.MyProject.badMethod(MyProject.java:22)" }
{ "container_id": "1", "log": "at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)" }
{ "container_id": "1", "log": "at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)" }
{ "container_id": "1", "log": "at com.myproject.module.MyProject.someMethod(MyProject.java:10)" }
{ "container_id": "1", "log": "at com.myproject.module.MyProject.main(MyProject.java:6)" }

AWS for Fluent Bit issue: aws/aws-for-fluent-bit#100

Multiline Log Use Case 2: Long Log Lines

The other major multiline logging use case is long log lines. Monitoring is critical to modern containerized applications, and we have seen many customers that produce very large and verbose log events. Structured logging, where the app generates logs with a set schema/format (usually JSON), is also very common. Many customers log huge amounts of information in a single JSON log event; 1 MB log events are not unheard of.

These large logs are emitted by the application in a single line of code, a single print statement. However, the container runtime must process them, and most container runtimes, including Docker and containerd, split log lines greater than 16KB. In the case of log files, each 16KB chunk of a split log is written as a new log line. In the case of the fluentd Docker log driver, each 16KB chunk of data is a separate event.

For the Fluentd Docker Log Driver, a key is set to note that the message is partial. Below are real split Docker logs. A large log line was sent to both stdout and stderr, which are separate pipes, so each is split and creates its own series of messages. Notice that stderr and stdout have different values for partial_id.

{"source": "stdout", "log": "...", "partial_message": "true", "partial_id": "dc37eb08b4242c41757d4cd995d983d1cdda4589193755a22fcf47a638317da0", "partial_ordinal": "1", "partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig"}]
{"container_name": "/hopeful_taussig", "source": "stderr", "log": "...", "partial_message": "true", "partial_id": "ecccce95711776e6a06d631af8e9227686446814eba7a87cb59b36bbaaad8b58", "partial_ordinal": "1", "partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2"}]
{"partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stdout", "log": "...", "partial_message": "true", "partial_id": "dc37eb08b4242c41757d4cd995d983d1cdda4589193755a22fcf47a638317da0", "partial_ordinal": "2"}]
{"log": "...", "partial_message": "true", "partial_id": "dc37eb08b4242c41757d4cd995d983d1cdda4589193755a22fcf47a638317da0", "partial_ordinal": "3", "partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stdout"}]
{"source": "stderr", "log": "...", "partial_message": "true", "partial_id": "ecccce95711776e6a06d631af8e9227686446814eba7a87cb59b36bbaaad8b58", "partial_ordinal": "2", "partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig"}]
{"partial_id": "ecccce95711776e6a06d631af8e9227686446814eba7a87cb59b36bbaaad8b58", "partial_ordinal": "3", "partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stderr", "log": "...", "partial_message": "true"}]
{"container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stdout", "log": "...", "partial_message": "true", "partial_id": "dc37eb08b4242c41757d4cd995d983d1cdda4589193755a22fcf47a638317da0", "partial_ordinal": "4", "partial_last": "false"}]
{"partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stderr", "log": "...", "partial_message": "true", "partial_id": "ecccce95711776e6a06d631af8e9227686446814eba7a87cb59b36bbaaad8b58", "partial_ordinal": "4"}]
{"container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stdout", "log": "...", "partial_message": "true", "partial_id": "dc37eb08b4242c41757d4cd995d983d1cdda4589193755a22fcf47a638317da0", "partial_ordinal": "5", "partial_last": "true"}]
{"log": "...", "partial_message": "true", "partial_id": "ecccce95711776e6a06d631af8e9227686446814eba7a87cb59b36bbaaad8b58", "partial_ordinal": "5", "partial_last": "true", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stderr"}]

The following fields should be present in split docker logs:

  • partial_message: boolean signaling whether the message is partial.
  • partial_id: all parts of a split message get the same unique ID.
  • partial_ordinal: counts up from 1, ordering the parts of the split message.
  • partial_last: whether this is the last part of the split message.

AWS for Fluent Bit issue: aws/aws-for-fluent-bit#25

Fluent Bit Multiline Support

Fluent Bit currently has partial support for multiline log use cases.

There is a new multiline API: https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/multiline-parsing

There is support in key plugins:

  • in_tail: supports both Use Case 1 and Use Case 2 with the new multiline API (tailing/reading log files).
  • generic multiline filter: the goal is to support all multiline use cases in a generic way, so that customers can have multiline support no matter which input they use. However (explained in detail in the technical design section), the filter is currently entirely broken for most use cases: #4173.

Technical Design: Fluent Bit support for multiline log use cases

Technical Background: how filters work

To understand why the generic multiline filter is broken, we need to first understand how filters fit into the Fluent Bit log pipeline.
(diagram: the Fluent Bit log pipeline)

This is how the log pipeline is presented to a user. However, internally, from the point of view of Fluent Bit’s concurrency model, filters are implemented as an extension of inputs: they are invoked serially, in the same context as the input that is ingesting a log record. Filters cannot take advantage of concurrency or perform asynchronous operations, and they are expected to return quickly when sent log records.

(diagram: the event loop pipeline)

As shown in the diagram above, log records are processed by a filter as soon as the input ingests them. This is important, because what records the filter sees will be determined by the behavior of the input.

Some inputs (Tail) ingest many records all at once, and thus a filter that comes after the tail input will see an entire chunk of records. However, other inputs (Forward), especially inputs that use network connections, will only ingest a single record at a time. This means the filter will receive each record one by one.

This can be seen in the call chain for an input to ingest records (a simplified sketch follows the list):

  1. Input calls flb_input_chunk_append_raw
  2. The append function invokes flb_filter_do
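
To make this coupling concrete, here is a hypothetical Python model of that call chain (the real functions are C internals; the names are borrowed from the list above and the signatures are simplified for illustration):

chunk_store = []

def flb_filter_do(records):
    # A filter must return its output in this same call, for exactly the
    # records it was handed; there is no way to defer work or flush later.
    return [dict(r) for r in records]  # pass-through "filter"

def flb_input_chunk_append_raw(records):
    # The input appends data; the filter runs inline, before the chunk
    # is stored, in the same context as the input.
    chunk_store.append(flb_filter_do(records))

# A network input like forward ingests records one at a time, so the
# filter also sees them one at a time:
flb_input_chunk_append_raw([{"log": "line 1 of a stack trace"}])
flb_input_chunk_append_raw([{"log": "line 2 of a stack trace"}])

If flb_filter_do stashed a record away and returned nothing, nothing in this chain would ever call back later to flush it; that is the limitation the design below works around.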

Technical Background: how the multiline core library works

The new multiline core library is fairly simple and straightforward to use. Code is here.

Here’s how it works:

  1. The client code creates a multiline stream, which is an identifier for logs that can be buffered and parsed together as multilines. This allows client code to process multiple separate streams of data at the same time.
  2. The client code appends records one by one to the stream.
  3. The multiline library will apply multiline parsers to the incoming records. If a parser matches and the record is part of a multiline log, then the library will buffer that line until it has received all other lines that complete it to make a single multiline log record.
  4. When a multiline log record is complete (or when a record is not part of a multiline log), the library invokes the flush callback provided by the client code. The client flush callback is free to do whatever it wants with the emitted record.
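
As an illustration, here is a toy Python model of this flow; the class and method names are invented for this sketch and are not the actual C API:

class MultilineStream:
    def __init__(self, flush_cb, is_cont):
        self.buf = []              # lines of the in-progress multiline record
        self.flush_cb = flush_cb   # step 4: client-provided flush callback
        self.is_cont = is_cont     # the "parser": is this line a continuation?

    def append(self, line):        # step 2: records are appended one by one
        if self.buf and self.is_cont(line):
            self.buf.append(line)  # step 3: buffer continuation lines
        else:
            self.flush()           # previous record is only now known complete
            self.buf = [line]

    def flush(self):
        if self.buf:
            self.flush_cb("\n".join(self.buf))  # step 4: emit the joined record
            self.buf = []

# Continuation lines are indented, as in the stack trace example above.
stream = MultilineStream(print, lambda l: l.startswith(" "))
for l in ['Exception in thread "main" ...',
          "    at com.myproject.module.MyProject.badMethod(MyProject.java:22)",
          "another line..."]:
    stream.append(l)
# "another line..." is still buffered: it can only be flushed once a later
# line proves it is complete. This is the problem discussed below.

Note that the final non-continuation line stays buffered until a later line arrives; this is exactly the failure mode analyzed in the next section.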

Why the multiline filter is broken: It can’t do asynchronous buffering

The multiline filter is broken because filters are expected to be serial code that performs its work exclusively on the records it was sent in a single invocation. This means that when a filter is sent records one by one, it must return something for each record. However, if a record is part of a multiline log, a single record on its own is useless. The filter needs to receive all records that compose a single multiline log, concatenate them together, and then return a single new record. This is not possible with the current architecture.

A reasonable question to ask is: why can’t the filter simply buffer each line until it has a completed multiline record? Why can’t it simply receive each record one by one, returning nothing each time and buffering until it has all records composing a multiline message, and then when the multiline is complete, return the whole concatenated message?

The filter could be implemented with this simple buffering. However, there is a problem that would make it fail in some cases. A multiline cannot be known to be complete until a new record is received that is not part of the currently buffered multiline. That is how the multiline library works: it looks at each incoming record, applying its parsers, and the multiline is only known to be complete once a new line fails to match a continuation parser.

This is easy to understand with an example:

Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting!
    at com.myproject.module.MyProject.badMethod(MyProject.java:22)
    at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)
    at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)
    at com.myproject.module.MyProject.someMethod(MyProject.java:10)
    at com.myproject.module.MyProject.main(MyProject.java:6)
another line...

Imagine processing this input line by line. Until the “another line...” line is received, you have no way of knowing whether the next line will be another part of the stack trace.

So, for the simple buffering design in the filter, imagine that only the stack trace is received, and after that no new log records arrive and Fluent Bit is shut down. The filter would never know the multiline was completed, and thus could never flush it. Furthermore, filters do not have any way to asynchronously flush data, and they have no way of flushing buffered data on shutdown. A filter should not keep context shared between invocations; it is just a simple callback that receives data, processes it serially, and returns.

Thus, the filter could get “stuck” and never emit some records if it used a simple buffering design.

Solution: Use in_emitter and a flush timer

The solution is to use the in_emitter plugin. This is an internal plugin introduced when the rewrite_tag filter was created. It is an internal API that allows a plugin to register an input plugin that it can emit records to; that input then emits the records into the beginning of the Fluent Bit pipeline. This gets around the limitation that filters cannot asynchronously flush data outside of the synchronous filter callback. The multiline library also supports an asynchronous flush timer, which can be used to set a max time that it will buffer data.

Together, these can be used to fix the multiline filter. When the filter callback runs with some data, it returns no records. Instead, the records are ingested into the multiline library and buffered. Once the multiline record is complete and concatenated, or the flush time expires, the multiline flush callback is triggered, which writes to the in_emitter instance. The new concatenated records are re-ingested into the log pipeline under the same tag.

To prevent a cycle in the log pipeline, the filter will not process/ingest records from its own emitter input. (Remember, they are re-ingested with the same tag, so the filter’s match pattern would match the new records too.) Changes will be made to the filter callback to pass it the input instance that ingested the log records, allowing the filter to check whether records came from its own in_emitter input. If they came from the emitter instance, processing has already completed and the records are passed down the pipeline as-is.

This has a side effect that other filters that come before the multiline filter will be applied twice. Thus, the documentation will call out that customers should always configure the multiline filter to be the first filter.
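
For illustration, a minimal configuration following this guidance; multiline.key_content and multiline.parser are existing options of the filter, and the grep filter here is just a placeholder for any other filter:

[INPUT]
    Name   forward
    Listen 0.0.0.0
    Port   24224

# The multiline filter comes first: records re-emitted by its in_emitter
# skip it, so the grep below runs exactly once per concatenated record.
[FILTER]
    Name                  multiline
    Match                 *
    multiline.key_content log
    multiline.parser      java

[FILTER]
    Name   grep
    Match  *
    Regex  log .

[OUTPUT]
    Name  stdout
    Match *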

Multiline filter Stream Design

Remember that the multiline library supports multiple streams of data. Each stream is basically a buffer for data that can be processed together. The filter will use the combination of the full unique input instance name + the tag to uniquely identify a stream of data.

Partial Message Key Support

The design we have discussed so far only covers multiline Use Case 1. There is also Use Case 2: long messages that were split and must be re-joined. The generic filter only needs to support the type of partial messages ingested by the forward input; the tail plugin already supports partial messages written to log files. In the forward input case, partial messages are denoted by a key in the record.

Recall the example from earlier:

{"source": "stdout", "log": "...", "partial_message": "true", "partial_id": "dc37eb08b4242c41757d4cd995d983d1cdda4589193755a22fcf47a638317da0", "partial_ordinal": "1", "partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig"}]
{"container_name": "/hopeful_taussig", "source": "stderr", "log": "...", "partial_message": "true", "partial_id": "ecccce95711776e6a06d631af8e9227686446814eba7a87cb59b36bbaaad8b58", "partial_ordinal": "1", "partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2"}]
{"partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stdout", "log": "...", "partial_message": "true", "partial_id": "dc37eb08b4242c41757d4cd995d983d1cdda4589193755a22fcf47a638317da0", "partial_ordinal": "2"}]
{"log": "...", "partial_message": "true", "partial_id": "dc37eb08b4242c41757d4cd995d983d1cdda4589193755a22fcf47a638317da0", "partial_ordinal": "3", "partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stdout"}]
{"source": "stderr", "log": "...", "partial_message": "true", "partial_id": "ecccce95711776e6a06d631af8e9227686446814eba7a87cb59b36bbaaad8b58", "partial_ordinal": "2", "partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig"}]
{"partial_id": "ecccce95711776e6a06d631af8e9227686446814eba7a87cb59b36bbaaad8b58", "partial_ordinal": "3", "partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stderr", "log": "...", "partial_message": "true"}]
{"container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stdout", "log": "...", "partial_message": "true", "partial_id": "dc37eb08b4242c41757d4cd995d983d1cdda4589193755a22fcf47a638317da0", "partial_ordinal": "4", "partial_last": "false"}]
{"partial_last": "false", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stderr", "log": "...", "partial_message": "true", "partial_id": "ecccce95711776e6a06d631af8e9227686446814eba7a87cb59b36bbaaad8b58", "partial_ordinal": "4"}]
{"container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stdout", "log": "...", "partial_message": "true", "partial_id": "dc37eb08b4242c41757d4cd995d983d1cdda4589193755a22fcf47a638317da0", "partial_ordinal": "5", "partial_last": "true"}]
{"log": "...", "partial_message": "true", "partial_id": "ecccce95711776e6a06d631af8e9227686446814eba7a87cb59b36bbaaad8b58", "partial_ordinal": "5", "partial_last": "true", "container_id": "a96998303938eab6087a7f8487ca40350f2c252559bc6047569a0b11b936f0f2", "container_name": "/hopeful_taussig", "source": "stderr"}]

The multiline library does not support this multiline use case. Instead, the filter will implement new partial-message-key buffering code. The code will use a design similar to the one explained for the multiline library: the same in_emitter instance will be used to re-ingest concatenated records into the pipeline, and a timer callback will implement a max timeout for buffered data.

The algorithm for buffering and emitting partial records is very simple (a sketch follows the list):

  1. If the partial_message key is present and its value is “true” (case insensitive), create a new buffer for the log keyed by its partial_id if one does not already exist; if one does, append the data to that buffer. Each buffer stores a timestamp for when it was last added to, and is flushed when that timestamp is older than the configured flush timeout.
  2. When the partial_last key is found to be true for a unique partial_id, the log is complete and will be flushed.
  3. If the flush timeout is reached for a buffer, flush its data.
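
A minimal Python sketch of these three steps (illustrative only; the real implementation will be C inside the filter, the key names are the proposed defaults, and emit() stands in for writing to the in_emitter instance):

import time

FLUSH_TIMEOUT = 2.0  # seconds; plays the role of the proposed flush_ms
buffers = {}         # partial_id -> {"first": record, "logs": [...], "touched": ts}

def emit(record):
    print(record)    # stand-in for re-ingesting via in_emitter

def is_true(value):
    return str(value).lower() == "true"  # values are strings, case insensitive

def on_record(record):
    if not is_true(record.get("partial_message")):
        emit(record)                     # not partial: pass straight through
        return
    pid = record["partial_id"]
    buf = buffers.setdefault(pid, {"first": record, "logs": []})  # step 1
    buf["logs"].append(record["log"])
    buf["touched"] = time.monotonic()
    if is_true(record.get("partial_last")):
        flush(pid)                       # step 2: last part seen, flush now

def flush(pid):
    buf = buffers.pop(pid)
    merged = dict(buf["first"], log="".join(buf["logs"]))
    for key in ("partial_message", "partial_id", "partial_ordinal", "partial_last"):
        merged.pop(key, None)            # drop partial metadata from the joined record
    emit(merged)

def on_timer():                          # step 3: flush buffers past the timeout
    now = time.monotonic()
    for pid in [p for p, b in buffers.items() if now - b["touched"] > FLUSH_TIMEOUT]:
        flush(pid)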

New Configuration Options and Filter CX

Please see the existing filter configuration options as a refresher: https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace#configuration-parameters

The following new configuration options will be added.

  • flush_ms (integer, default: 2000; must be greater than 0): the max time in milliseconds to buffer pending multiline records.
  • mode (string, default: parser; valid values: "parser" or "partial_message"): a single filter instance can only solve one of the multiline use cases; to solve both, two filter instances are required.
  • partial_message_key (string, default: partial_message; not compatible with "parser" mode): the key that denotes partial records. Records are partial when this key is "true" (case insensitive).
  • partial_id_key (string, default: partial_id; not compatible with "parser" mode): the key that carries the ID for partial records.
  • partial_last_key (string, default: partial_last; not compatible with "parser" mode): the key that denotes whether this is the last part of the split message.
  • emitter_name (string, default: multiline_emitter): name of the emitter input. Option copied from the rewrite_tag filter.
  • emitter_storage.type (string, default: memory; "memory" or "filesystem"): storage type of the emitter input. Option copied from the rewrite_tag filter.
  • emitter_mem_buf_limit (size, default: 10M): memory buffer limit to restrict memory usage of the emitter. Option copied from the rewrite_tag filter.
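
For illustration, a hedged sketch of how these options might be combined to handle both use cases with two filter instances; placing the partial_message instance first, so split chunks are re-joined before multiline parsing, is an assumption of this sketch:

[SERVICE]
    # Keep Grace well above the flush timeouts so pending buffers can drain
    # on shutdown (see Design Limitations below).
    Grace 30

[FILTER]
    Name     multiline
    Match    *
    mode     partial_message
    flush_ms 1000

[FILTER]
    Name                  multiline
    Match                 *
    mode                  parser
    multiline.key_content log
    multiline.parser      java
    flush_ms              1000
    emitter_storage.type  filesystem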

Design Limitations

  1. As already noted, the filter will re-ingest records with the same tag. So we must call out in the documentation that this means users should configure the multiline filter as the first filter, otherwise other filters would be applied twice.
  2. The multiline filter still has no way to guarantee that it flushes pending data on shutdown. This scenario will be investigated in the final testing phase. Users should be safe if they configure a flush_ms that is well below the service Grace setting, which controls the grace period for Fluent Bit to send all logs on shutdown.

Multiline Filter Launch Proposal

These changes will fix the multiline filter and make it function as users expect. However, this is also a very significant change to how the filter currently works, and the change is visible to users: they will see the new in_emitter input and will incur some slight overhead from its buffering. Thus, this change should ideally be introduced in v1.9.0, the next minor version.

A discussion will need to be had with the community on whether this “fixed” mode should be the only experience post-launch, or whether it should be gated behind a configuration option that would allow users to return to the old (broken) behavior if desired. Our recommendation is that no such option be added: this is a bug fix to a broken component, not a new feature, and there is no reason a user would prefer the old broken behavior.

Prototype

A fully functional prototype has been created and shared with customers; it can solve Multiline Use Case 1: aws/aws-for-fluent-bit#100 (comment)

@PettitWesley (Contributor Author)

@nokute78 This is my design to fix the generic multiline filter, which I wrote to help my team understand. I have discussed most of this design with @edsiper, except for the partial message key part.

My main question for the community is whether this new design should be gated behind an option or not. My opinion is that we should introduce the fixed filter in v1.9.0 and should not gate the new version behind an option. This is not a new feature; it’s a fix to the existing plugin, which is entirely broken for most use cases.

What do you guys think? What is the rough date for 1.9.0?

@tsalzinger commented Nov 11, 2021

For the AWS ECS FireLens use case, the partial_key alone would not be sufficient. The messages actually contain more metadata that should be considered:

  • partial_message - "true" if part of a partial message, omitted otherwise
  • partial_id - a unique id for each message split into multiple partial messages
  • partial_ordinal - a sorting key for partials, unique per partial_id
  • partial_last - "true" if the message is the last message for a given partial_id

The main difference here is that the last message is denoted by partial_last, and not by the first subsequent message having partial_message set to "false".

So with this metadata in mind, I can define the following test cases that would not work with the proposed design.

Partial followed by non partial

messages = [
  { "container_id" => "1", "log" => "start partial -", "partial_message" => "true", "partial_id" => "1", "partial_last" => "false", "partial_ordinal" => "1" },
  { "container_id" => "1", "log" => " end partial", "partial_message" => "true", "partial_id" => "1", "partial_last" => "true", "partial_ordinal" => "2" },
  { "container_id" => "1", "log" => "non partial" },
]
expected = [
  { "container_id" => "1", "log" => "start partial - end partial" },
  { "container_id" => "1", "log" => "non partial" },
]

Multiple interspersed partials

messages = [
  { "container_id" => "1", "log" => "start partial 1 -", "partial_message" => "true", "partial_id" => "1", "partial_last" => "false", "partial_ordinal" => "1" },
  { "container_id" => "1", "log" => "start partial 2 -", "partial_message" => "true", "partial_id" => "2", "partial_last" => "false", "partial_ordinal" => "1" },
  { "container_id" => "1", "log" => " end partial 1", "partial_message" => "true", "partial_id" => "1", "partial_last" => "true", "partial_ordinal" => "2" },
  { "container_id" => "1", "log" => " end partial 2", "partial_message" => "true", "partial_id" => "2", "partial_last" => "true", "partial_ordinal" => "2" },
]
expected = [
  { "container_id" => "1", "log" => "start partial 1 - end partial 1" },
  { "container_id" => "1", "log" => "start partial 2 - end partial 2" },
]

Out of order partials

messages = [
  { "container_id" => "1", "log" => " end partial 1", "partial_message" => "true", "partial_id" => "1", "partial_last" => "true", "partial_ordinal" => "3" },
  { "container_id" => "1", "log" => "start partial 1 -", "partial_message" => "true", "partial_id" => "1", "partial_last" => "false", "partial_ordinal" => "1" },
  { "container_id" => "1", "log" => " middle partial 1 -", "partial_message" => "true", "partial_id" => "1", "partial_last" => "false", "partial_ordinal" => "2" },
]
expected = [
  { "container_id" => "1", "log" => "start partial 1 - middle partial 1 - end partial 1" },
]

So, in addition to your configuration options, I could imagine the following:

  • partial_end_key (string): if provided, the last message of a partial is expected to have "true" set in both partial_key and partial_end_key; otherwise the last message is assumed to be the first message where partial_key is not present or is "false".
  • partial_id_key (string): if provided, partials are concatenated only if the records contain the same ID.
  • partial_sort_key (string): if provided, partials are sorted based on this key.

To be honest, I don't know if 'Multiple interspersed partials' and 'Out of order partials' are actually valid use cases (I haven't encountered them so far), but the last partial of a message is definitely denoted by the specific partial_last key, as opposed to the design that was outlined above.

@nokute78 (Collaborator)

@PettitWesley Thank you for summarizing.
In my understanding, the points are:

  • Using in_emitter for asynchronous flushing.
  • Extending the filter callback so the filter can get the input instance and determine whether it came from in_emitter or another input.
  • partial_key for Docker metadata.

FYI: we should take care with the intermediate buffer described in "the records are ingested into the multiline library and buffered".
in_emitter has an intermediate buffer, and it caused #4049: flooding records exhausted the intermediate buffer.

> My opinion is that we should introduce the fixed filter in v1.9.0 and should not gate the new version behind an option.

I agree if there is no breaking change.

@nokute78 (Collaborator)

@tsalzinger Thank you for the information.

@PettitWesley
The buffering for this is complicated. If we support it, we would need to extend the multiline library.

Note:
The Docker source is
https://github.com/moby/moby/blob/v20.10.10/daemon/logger/fluentd/fluentd.go#L116
All metadata types are strings.
Below is the definition of partial_id:
https://github.com/moby/moby/blob/v20.10.10/api/types/backend/backend.go#L34

@edsiper (Member) commented Nov 13, 2021

Some important notes:

The original multiline filter aims to work at the chunk-level context only. So I'd say this is a feature extension to use buffers, which is OK. If there are perf penalties from using buffering (which is expected), we might consider offering the buffering functionality under a new configurable option.

> For the AWS ECS FireLens use case, the partial_key alone would not be sufficient. The messages actually contain more metadata that should be considered:
>
> partial_message - "true" if part of a partial message, omitted otherwise
> partial_id - a unique id for each message split into multiple partial messages
> partial_ordinal - a sorting key for partials, unique per partial_id
> partial_last - "true" if the message is the last message for a given partial_id

I got confused here: is this way of splitting messages "AWS specific"? Reassembling this kind of message while considering out-of-order delivery might be problematic; it should be solved "before" hitting the Fluent Bit pipeline.

@PettitWesley (Contributor Author)

@edsiper

> If there are perf penalties from using buffering (which is expected), we might consider offering the buffering functionality under a new configurable option.

But if there's no valid use case for the non-buffered mode, then what's the point of being able to disable it? It will only confuse users.

I am willing to add a config option, but I want it to default to buffered mode.

> I got confused here: is this way of splitting messages "AWS specific"? Reassembling this kind of message while considering out-of-order delivery might be problematic; it should be solved "before" hitting the Fluent Bit pipeline.

No, this is not AWS-specific, nor does the splitting happen in an AWS-owned component. Container runtimes that process container stdout/stderr logs buffer them in chunks and split lines over 16KB; Docker does this. This is a well known use case, which the fluent-plugin-concat filter can solve: https://github.com/fluent-plugins-nursery/fluent-plugin-concat

@Juliahahaha

Is there a time estimate for solving this issue?
Thanks in advance!

@PettitWesley (Contributor Author)

@Juliahahaha I can't provide any hard estimate publicly but I am actively working on implementing this design.

Please see here for a fully working prototype that you can test: aws/aws-for-fluent-bit#100 (comment)

@mdraijer

Regarding the configuration of partial_key: given the example of CRI-O output (see below), there should be an extra option, e.g. partial_key_value, that might default to true. For CRI-O there is also a key that denotes whether a line is just a part of a message or the last/only line of the message; the values, however, are P (part) and F (final line) instead of false and true.

Example output (the prefix with the log id and n/m was added by me; the CRI-O output starts with the timestamp):

A 1/1 - 2021-11-18T14:47:34.730832241+00:00 stdout F some short log message
B 1/2 - 2021-11-18T14:47:34.730832241+00:00 stdout P some longer log message part one
B 2/2 - 2021-11-18T14:47:34.730908438+00:00 stdout F end of the longer message
C 1/3 - 2021-11-18T14:47:34.730908438+00:00 stdout P another longer log message part one
C 2/3 - 2021-11-18T14:47:34.730939216+00:00 stdout P continued in part two
D 1/2 - 2021-11-18T14:47:34.735454549+00:00 stderr F some log message in stderr part one
C 3/3 - 2021-11-18T14:47:34.735482854+00:00 stdout F still continued after a line from a different stream
D 2/2 - 2021-11-18T14:47:34.735482854+00:00 stderr F and the end of the stderr message

This illustrates:

  • Lines have an indication in field 3: F for a one-line message and for the concluding line of a multi-line message; P for parts other than the final part of a multi-line message.
  • Log messages from different streams (stdout, stderr) can be mixed up (examples C and D).

Hope to see this working out of the box in the new solution.

@PettitWesley (Contributor Author)

@mdraijer My understanding was that I do not need to cover the CRI use case in the filter, since that is already solved in the tail plugin with the new multiline feature and built-in CRI parser: https://docs.fluentbit.io/manual/pipeline/inputs/tail#multiline-core-v1.8

CRI logs would always be sent to log files, right? And this is a pattern in the lines, not a separate key, right?

The filter mainly targets the fluentd Docker log driver and other log streaming use cases. If you have log files, you should always just use the multiline support in tail; it will be more performant.

@mdraijer commented Dec 2, 2021

@PettitWesley I checked again, and for many log messages it is indeed working fine. However, there is one specific use case that breaks the concatenation:

Lines in different streams (stdout, stderr) are mixed in the CRI-O logfiles.

See my example for messages C and D: the parts of C are not printed consecutively in the CRI-O log because one line from D is in between. This can only happen because C is printed to a different stream than D.

@mdraijer commented Dec 2, 2021

@PettitWesley I understand now. The text in the issue description has a lot about the parsing of log files, which made me miss the point in these lines:

> in_tail: supports both Use Case 1 and Use Case 2 with the new multiline API (tailing/reading log files)
> generic multiline filter: the goal is to support all multiline use cases in a generic way, so that customers can have multiline support no matter which input they use. However (explained in detail in the technical design section), the filter is currently entirely broken for most use cases: Multiline Filter example not working with Forward input #4173

which is, of course, that the problem exists only in the 'multiline' filter and not in the 'tail' input with a multiline parser.

I have now filed a bug for this specific issue: #4387 Sorry for the confusion.

@PettitWesley (Contributor Author)

@mdraijer Thank you for this thorough investigation. This is also useful, since I think the mixed ordering of messages coming from stderr/stdout is what @tsalzinger was bringing up. (Thanks for that post too; I am behind on diving deep into it myself, but will get to it soon.)

@linuxxin commented Dec 9, 2021

I have now set up multiline, and the output printed to the command line is concatenated, but when I look at it in Kibana it shows as follows: (screenshot omitted)

@linuxxin

I have solved the multi-line log problem, but now it appears that the multi-line log does not belong to any service. How can I solve this problem? Thank you. (screenshot omitted)

@PettitWesley (Contributor Author)

@linuxxin We need more info than this picture to help you (like configuration and deployment details). Also, this issue is for the multiline filter re-design discussion, not for debugging user issues, so please open a new issue.

@linuxxin commented Dec 14, 2021

> @linuxxin We need more info than this picture to help you (like configuration and deployment details). Also, this issue is for the multiline filter re-design discussion, not for debugging user issues, so please open a new issue.

#4448 (comment) submitted; please help check, thank you.

@alexku7 commented Jan 2, 2022

Hello,
As part of the multiline improvements, it could be very useful to add a negation option to the regex evaluation, something very similar to what exists in Logstash's multiline codec:
input {
  file {
    path => "/var/log/someapp.log"
    codec => multiline {
      pattern => "^%{TIMESTAMP_ISO8601} "
      negate => true
      what => previous
    }
  }
}

It might simplify many scenarios, for example by specifying the first-line pattern with a timestamp format and a second statement that simply defines that everything that isn't the timestamp belongs to the previous line.
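
For reference, Fluent Bit's multiline parser rules can already approximate this negate behavior: make the start_state rule match the timestamped first line, and make the continuation rule match any line that does not start with a timestamp. A sketch (the regexes here are illustrative, not tested):

[MULTILINE_PARSER]
    name          multiline_ts
    type          regex
    flush_timeout 1000
    # rules |  state name   | regex pattern               | next state
    rule      "start_state"   "/^\d{4}-\d{2}-\d{2}/"        "cont"
    rule      "cont"          "/^(?!\d{4}-\d{2}-\d{2}).*/"  "cont"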

@PettitWesley (Contributor Author)

@tsalzinger I have updated the design for the partial_message support, which I am working on implementing now. Feel free to check the update, and thank you for your feedback.

@github-actions (bot)

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions bot added the Stale label May 13, 2022
@aslafy-z

No stale please!

@PettitWesley (Contributor Author)

@aslafy-z Both the buffer option and the partial_message support have been released: https://fluentbit.io/announcements/v1.9.3/

github-actions bot removed the Stale label May 14, 2022
@github-actions (bot)

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions bot added the Stale label Aug 13, 2022
@github-actions (bot)

This issue was closed because it has been stalled for 5 days with no activity.

@ashishmodi7

The fluentbit_filter_drop_records_total metric increases when using the multiline parser. The records flow properly to Splunk, but the filter drop metric still increases. This issue is reproducible in Fluent Bit v3.0.6.
Any update on this?

