Tracking Issues with Fluent Bit TCP Input #294

Open
PettitWesley opened this issue Feb 3, 2022 · 9 comments
@PettitWesley

PettitWesley commented Feb 3, 2022

We have seen the following issue reports from customers:

Of these, I have only been able to reproduce throughput limitations. I will post my findings in this ticket.

@PettitWesley

In all of these tests I used log4j to send to the TCP input with code based on what you can find here: https://github.com/aws/aws-for-fluent-bit/tree/mainline/troubleshooting/tools/log4j-tcp-app

The only real finding is that the TCP input can ingest logs at a higher rate when all outputs have workers enabled. This makes sense: without workers, Fluent Bit is effectively a single-threaded program, and all inputs, filters, and outputs contend for the same thread (see https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md).

But with workers, each output gets its own thread pool, which frees up the main thread to only focus on the inputs and filters.
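
For reference, enabling a worker on an output looks roughly like the fragment below. This is a minimal sketch in Fluent Bit's classic config format, not the configuration used in these tests: the region, log group name, and stream prefix are placeholder assumptions, and only the `workers` line is the point.

```
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    # Placeholder values; substitute your own region and log group.
    region            us-west-2
    log_group_name    example-log-group
    log_stream_prefix example-
    auto_create_group true
    # Run this output's flushes on a dedicated worker thread.
    workers           1
```

The same `workers` key would be added to each of the other `[OUTPUT]` sections so that no output flushes on the main thread.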

@PettitWesley

I got more scientific and thorough with each test I did. Here was the first:

The top legend is the log size. The metric evaluated is the number of logs sent per millisecond, i.e. the rate at which the TCP appender can append to the Fluent Bit socket. Higher numbers are better and mean logs are accepted faster.

| Configuration | 100 bytes | 500 bytes | 1000 bytes | 2000 bytes | 5000 bytes |
| --- | --- | --- | --- | --- | --- |
| 1024 CPU shares and File output | 534.05 | 406.2 | 348.65 | 206.9 | 115.35 |
| 1024 CPU shares and S3, Firehose, and CW Output | 505.3 | 410.6 | 328.4 | 206.55 | 6.9 |
| 1024 CPU shares and S3, Firehose, and CW Outputs with 1 worker each | 444.5 | 441.05 | 347.2 | 207.7 | 101.65 |
| No CPU limit and S3, Firehose, and CW Outputs with 1 worker each | 529.4 | 429.85 | 347.7 | 208.55 | 108.4 |

@PettitWesley

PettitWesley commented May 16, 2022

In all of the graphs below, the metric obtained is the number of log events that could be sent per millisecond. So higher is better.

To be clear: in all of these tests, the metric is taken on the log4j side. It's based on the amount of time that log4j spent sending these logs. As soon as the TCP socket returned, that event is "done". So this is the input-side throughput, not the output-side throughput. IIRC, I used the default settings in the app, which means I sent half a million events in the test: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/tools/log4j-tcp-app/src/main/java/com/mycompany/app/App.java
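
To make the sender-side measurement concrete, here is a rough Python sketch of the same idea. It is not the log4j app used in these tests: the function names `send_events` and `summarize` are hypothetical, and it assumes a TCP input that accepts newline-delimited JSON (5170 is the documented default port for the Fluent Bit `tcp` input). The clock covers only the sender's writes, so the result is an input-side rate in events per millisecond.

```python
import json
import socket
import statistics
import time


def send_events(host: str, port: int, count: int, size: int) -> float:
    """Send `count` JSON log lines and return the sender-side rate in events/ms.

    Each line carries a `size`-byte message, so the line on the wire is
    slightly larger than `size`.
    """
    payload = (json.dumps({"log": "x" * size}) + "\n").encode("utf-8")
    with socket.create_connection((host, port)) as sock:
        start = time.perf_counter()
        for _ in range(count):
            # sendall returns once the kernel has accepted the bytes; that is
            # the point at which this sketch counts the event as "done".
            sock.sendall(payload)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
    return count / elapsed_ms if elapsed_ms > 0 else float("inf")


def summarize(rates):
    """Return (mean, sample standard deviation) of per-run rates."""
    return statistics.mean(rates), statistics.stdev(rates)
```

A run against a local listener might call `send_events("127.0.0.1", 5170, 10_000, 1000)` several times and pass the resulting list to `summarize` to get an average and a standard deviation per log size.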

Average Log Throughput Rate (Higher is better)

| Version | Limited to 1024 CPU Shares | Workers? | 1000 bytes | 2500 bytes | 5000 bytes |
| --- | --- | --- | --- | --- | --- |
| 2.21.3 | | | 7 | 2 | 1 |
| 2.21.3 | | | 334 | 195 | 113 |
| 2.21.3 | | | 4 | 2 | 1 |
| Matt's Priority Event Loop | | | 8 | 3 | 1 |
| Matt's Priority Event Loop | | | 345 | 189 | 116 |
| Matt's Priority Event Loop Input Prioritized | | | 121 | 74 | 8 |
| Matt's Priority Event Loop Input Prioritized | | | 357 | 191 | 119 |

Log Throughput Rate Standard Deviation (Shows variance in the performance)

| Version | Limited to 1024 CPU Shares | Workers? | 1000 bytes | 2500 bytes | 5000 bytes |
| --- | --- | --- | --- | --- | --- |
| 2.21.3 | | | 154.3 | 97.5 | 57.8 |
| 2.21.3 | | | 78.8 | 34 | 17 |
| 2.21.3 | | | 177.8 | 102.8 | 54.7 |
| Matt's Priority Event Loop | | | 21.7 | 9.2 | 45.8 |
| Matt's Priority Event Loop | | | 75.2 | 31.7 | 16.8 |
| Matt's Priority Event Loop Input Prioritized | | | 60.9 | 17.7 | 21.2 |
| Matt's Priority Event Loop Input Prioritized | | | 74.2 | 28.8 | 16.7 |

Conclusion: Workers and the Priority Event Loop Change improve TCP Input Performance

"Matt's Priority Event Loop" refers to a change that @matthewfala made here: fluent/fluent-bit#4869

He did a talk on this at FluentCon, which will go up on YouTube soon. Basically, the change makes the Fluent Bit scheduler use priority-based scheduling instead of first-come, first-served scheduling. The vanilla version of the change was merged and released in the 1.9 series, which corresponds to "Matt's Priority Event Loop" here. "Matt's Priority Event Loop Input Prioritized" is an experiment we did in giving input events higher priority, which interestingly didn't do much.
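
As a toy illustration of the difference between the two scheduling policies (this is not Fluent Bit's event loop code, and the event names and priority numbers are made up), compare draining the same pending events in arrival order versus by priority:

```python
import heapq
from collections import deque


def run_fifo(events):
    """Process (priority, name) events strictly in arrival order."""
    queue = deque(events)
    order = []
    while queue:
        _, name = queue.popleft()
        order.append(name)
    return order


def run_priority(events):
    """Process the lowest priority number first; ties keep arrival order."""
    # The sequence number breaks ties so equal priorities stay first-come, first-served.
    heap = [(priority, seq, name) for seq, (priority, name) in enumerate(events)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


# Hypothetical pending events; a lower number means more urgent.
pending = [(2, "flush-output"), (1, "read-tcp-input"), (2, "retry-output"), (0, "timer")]
print(run_fifo(pending))      # ['flush-output', 'read-tcp-input', 'retry-output', 'timer']
print(run_priority(pending))  # ['timer', 'read-tcp-input', 'flush-output', 'retry-output']
```

Under the first-come, first-served policy the input read waits behind an output flush that arrived earlier; under the priority policy it is handled as soon as anything more urgent has run.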

@LucasHantz

Hello, this comment is to let you know that we have upgraded our configuration following your recommendation and added all the metrics needed for more insight, but this will only reach production at the beginning of June. We will let you know then if we still have the problem. Thanks!

@PettitWesley

I have written this guide on our current recommendations/mitigations for log4j TCP socket appender failures: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#log4j-tcp-appender-write-failure

@PettitWesley

This little script is an easier way to test the TCP input than the example log4j app: #502

@yashaswi90

@PettitWesley Since you have provided mitigations, I would like to know: is there no resolution for this TCP input issue, and do we need to rely on mitigations only?

@PettitWesley

@yashaswi90 Yes, unfortunately at this time I do not have a full root cause and thus no complete fix for these issues, just mitigations.

@PettitWesley

Failures of the TCP health check outlined here: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/health-check#not-recommended-tcp-input-health-check

I got a user report that they saw this health check repeatedly fail when CPU was almost completely saturated. Fluent Bit appeared to keep sending some logs (though I can't confirm this with debug logs), but the TCP health check input stopped responding and the health check failed. Log throughput was several MB/s in this case. The health check succeeded in a non-saturated CPU case even though Fluent Bit was only given 48/1024 CPU shares. The use case was ECS FireLens with 4 cloudwatch_logs outputs, 3 TCP inputs (1 for the health check, 2 for logs), and one tail input.
