
Proposal: Allow Output Plugins to optionally control their own OK/Retry/Error metrics #6141

Open
PettitWesley opened this issue Oct 5, 2022 · 5 comments
Labels
AWS Issues with AWS plugins or experienced by users running on AWS exempt-stale long-term Long term issues (exempted by stale bots)

Comments

@PettitWesley
Contributor

Background

Fluent Bit Flushing and Metrics Mechanism

In Fluent Bit, logs are ingested from inputs and then batched into chunks of roughly 2MB. These are buffered and then sent to outputs. An output flush event is supposed to either succeed or fail to send the entire chunk. Flush events are ideally not supposed to be dependent on each other, enabling concurrency/workers to send multiple chunks at a time.

So for each chunk, an output must either return FLB_OK, FLB_RETRY, or FLB_ERROR. These returns are counted and become the Fluent Bit prometheus metrics: https://docs.fluentbit.io/manual/administration/monitoring

S3 Output Buffering

The S3 plugin has its own buffering mechanism, because it serves a unique use case. Customers expect large files, potentially several GB in size in their S3 bucket. Tons of little 2 MB files, one created for each chunk, would be a poor user experience.

Thus, AWS implemented custom buffering in out_s3. The plugin accepts data from flush events and buffers it on the filesystem until it can build a large file. Because most flush events do not actually send any data, the plugin always returns FLB_OK (unless there was an error writing to the buffer files). This means that the prometheus metrics for S3 are meaningless.

Problem Statement

Because S3 has its own buffering and retry strategy, the prometheus output metrics are useless for customers. Actually, they are worse than useless: they are misleading. In most cases, the metrics for S3 will show only success, since the plugin only ever returns FLB_OK. Customers may therefore believe their uploads are all succeeding even when they are not.

Goal

S3 output metrics in the prometheus endpoint should be meaningful and useful to users. S3 customers do not think in terms of failed chunks; they think in terms of failed file uploads.

Proposal: Allow Output Plugins to optionally control their own OK/Retry/Error metrics

Currently, there is engine code that increments the error, retry, and success metrics when flush events are completed.

The proposal is that Fluent Bit support an output flag, FLB_OUTPUT_OMIT_METRICS, which, if set, bypasses the metric counter code described above.

Instead, the plugin will have the option of incrementing the cmetrics counters itself. This would allow the S3 output to increment the counters based on file uploads.

Usefulness for other plugins

This could potentially be useful for other outputs as well. Many outputs sometimes have to make multiple requests to send a single chunk. For example, if an API (ex: AWS CloudWatch PutLogEvents) has a 1 MB max payload size, then each ~2MB chunk could take 2+ requests to upload.

Output plugin developers may prefer to define the retry/success metrics for their plugin in terms of API calls, not chunks. Essentially, the argument is that end users do not think in terms of chunks, which are an internal Fluent Bit concept, so the prometheus metrics are most useful when they are tied to end-user needs.

@github-actions github-actions bot added the Stale label Jan 4, 2023
@fluent fluent deleted a comment from github-actions bot Jan 5, 2023
@PettitWesley PettitWesley added long-term Long term issues (exempted by stale bots) exempt-stale AWS Issues with AWS plugins or experienced by users running on AWS and removed Stale labels Jan 5, 2023
@jeongukjae

+1

It would be even more useful if the metrics could also be controlled by Golang output plugins.

@jmcarp

jmcarp commented Aug 24, 2023

What's the status of this proposal? Given that metrics aren't meaningful for the s3 output plugin, do you all have a recommendation for monitoring it?

@PettitWesley
Contributor Author

@jmcarp unfortunately, I don't really have any good ideas besides that you can monitor for the rates of different error and warn messages, including these: aws/aws-for-fluent-bit#702

Sorry!

@jmcarp

jmcarp commented Aug 24, 2023

I'm not sure how far along this proposal is, but what do you think about adding new metrics to the s3 output plugin, like the custom metrics in the stackdriver plugin (https://github.com/fluent/fluent-bit/blob/master/plugins/out_stackdriver/stackdriver_conf.c#L524-L568)? That wouldn't fix the problem of the generic metrics being misleading for the s3 plugin, but at least it would make it possible to monitor the plugin using metrics.

@PettitWesley
Contributor Author

@jmcarp yea that has been my plan for some time, unfortunately I have not gotten time to implement it, and won't soon. If you or anyone else would like to try, let me know, and I can try to help.
