
Proposal: Allow Output Plugins to optionally control their own OK/Retry/Error metrics #6141

Open
PettitWesley opened this issue Oct 5, 2022 · 5 comments
Labels
AWS Issues with AWS plugins or experienced by users running on AWS exempt-stale long-term Long term issues (exempted by stale bots)

Comments

@PettitWesley
Contributor

Background

Fluent Bit Flushing and Metrics Mechanism

In Fluent Bit, logs are ingested from inputs and then batched into chunks of roughly 2MB. These are buffered and then sent to outputs. An output flush event is supposed to either succeed or fail to send the entire chunk. Flush events are ideally not supposed to be dependent on each other, enabling concurrency/workers to send multiple chunks at a time.

So for each chunk, an output must either return FLB_OK, FLB_RETRY, or FLB_ERROR. These returns are counted and become the Fluent Bit prometheus metrics: https://docs.fluentbit.io/manual/administration/monitoring

S3 Output Buffering

The S3 plugin has its own buffering mechanism, because it serves a unique use case. Customers expect large files, potentially several GB in size in their S3 bucket. Tons of little 2 MB files, one created for each chunk, would be a poor user experience.

Thus, AWS implemented custom buffering in out_s3. The plugin accepts data from flush events and buffers it on the filesystem until it can build a large file. Because most flush events do not actually send any data, the plugin always returns FLB_OK (unless there was an error writing to the buffer files). This means that the prometheus metrics for S3 are meaningless.

Problem Statement

Because S3 has its own buffering and retry strategy, the prometheus output metrics are useless for customers. Actually, they are worse than useless: they are misleading. In most cases, the metrics for S3 will show only success, since the plugin only ever returns FLB_OK. Customers may therefore believe their uploads are all succeeding even when they are not.

Goal

S3 output metrics in the prometheus endpoint should be meaningful and useful to users. S3 customers do not think in terms of failed chunks; they think in terms of failed file uploads.

Proposal: Allow Output Plugins to optionally control their own OK/Retry/Error metrics

Currently, there is engine code that increments the error, retry, and success metrics when flush events are completed.

The proposal is that Fluent Bit support an output flag, FLB_OUTPUT_OMIT_METRICS, which, if set, bypasses the metric counter code described above.

Instead, the plugin will have the option of incrementing the cmetrics counters itself. This would allow the S3 output to increment the counters based on file uploads.

Usefulness for other plugins

This could potentially be useful for other outputs as well. Many outputs sometimes have to make multiple requests to send a single chunk. For example, if an API (ex: AWS CloudWatch PutLogEvents) has a 1 MB max payload size, then each ~2MB chunk could take 2+ requests to upload.

Output plugin developers may prefer to define the retry/success metrics for their plugin in terms of API calls, not chunks. Essentially, the argument is that end users do not think in terms of chunks, which are an internal Fluent Bit concept, so the prometheus metrics are most useful when they are tied to end-user needs.

@github-actions github-actions bot added the Stale label Jan 4, 2023
@fluent fluent deleted a comment from github-actions bot Jan 5, 2023
@PettitWesley PettitWesley added long-term Long term issues (exempted by stale bots) exempt-stale AWS Issues with AWS plugins or experienced by users running on AWS and removed Stale labels Jan 5, 2023
@jeongukjae

+1

It would be even more useful if the metrics could also be controlled by Golang output plugins.

@jmcarp

jmcarp commented Aug 24, 2023

What's the status of this proposal? Given that metrics aren't meaningful for the s3 output plugin, do you all have a recommendation for monitoring it?

@PettitWesley
Contributor Author

@jmcarp unfortunately, I don't really have any good ideas besides that you can monitor for the rates of different error and warn messages, including these: aws/aws-for-fluent-bit#702

Sorry!

@jmcarp

jmcarp commented Aug 24, 2023

I'm not sure how far along this proposal is, but what do you think about adding new metrics to the s3 output plugin, like the custom metrics in the stackdriver plugin (https://github.com/fluent/fluent-bit/blob/master/plugins/out_stackdriver/stackdriver_conf.c#L524-L568)? That wouldn't fix the problem of the generic metrics being misleading for the s3 plugin, but at least it would make it possible to monitor the plugin using metrics.

@PettitWesley
Contributor Author

@jmcarp yea that has been my plan for some time, unfortunately I have not gotten time to implement it, and won't soon. If you or anyone else would like to try, let me know, and I can try to help.
