Tracking Issues with Fluent Bit TCP Input #294
In all of these tests I used log4j to send to the TCP input, with code based on what you can find here: https://github.com/aws/aws-for-fluent-bit/tree/mainline/troubleshooting/tools/log4j-tcp-app

The only real finding is that the TCP input can ingest logs at a higher rate when all outputs have workers enabled. This makes sense: without workers, Fluent Bit is actually a single-threaded program, and all inputs, filters, and outputs contend for the same thread: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md With workers, each output gets its own thread pool, which frees the main thread to focus only on the inputs and filters.
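To illustrate, enabling workers is a single `workers` setting on each output section. This is only a minimal sketch with assumed values (region, log group, worker count), not the exact configuration used in these tests:

```
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    my-log-group
    log_stream_prefix fluent-bit-
    # Each output with workers gets its own thread pool, leaving the main
    # thread free for inputs and filters.
    workers           2
```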
I got more scientific and thorough with each test I did; here was the first. The top legend is the log size. The metric evaluated is the number of logs sent per millisecond, i.e., the rate at which the TCP appender can append to the Fluent Bit socket. So higher numbers are better and mean that logs are accepted faster.
In all of the graphs below, the metric obtained is the number of log events that could be sent per millisecond, so higher is better. To be clear: in all of these tests, the metric is taken on the log4j side; it is the amount of time that log4j spent sending these logs. As soon as the TCP socket write returned, that event is "done". So this is the input-side throughput, not the output-side throughput. IIRC, I used the default settings in the app, which means I sent half a million events in each test: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/tools/log4j-tcp-app/src/main/java/com/mycompany/app/App.java

[Chart: Average Log Throughput Rate (higher is better)]

[Chart: Log Throughput Rate Standard Deviation (shows variance in the performance)]
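For clarity on how this input-side metric is obtained, here is a minimal sketch (not the actual log4j-tcp-app; the event count, payload size, and logger setup are illustrative assumptions). It times how long log4j spends handing events to a synchronous TCP socket appender and divides the event count by the elapsed milliseconds:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class TcpThroughputSketch {
    // Assumes a log4j2 configuration with a synchronous Socket appender
    // pointed at the Fluent Bit TCP input (e.g. localhost:5170).
    private static final Logger LOGGER = LogManager.getLogger(TcpThroughputSketch.class);

    public static void main(String[] args) {
        final int events = 500_000;               // roughly the default event count used in the tests
        final String payload = "x".repeat(1000);  // log size under test

        long start = System.nanoTime();
        for (int i = 0; i < events; i++) {
            LOGGER.info(payload);                 // an event is "done" once the socket write returns
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Input-side throughput: how fast Fluent Bit's TCP input accepted the data.
        System.out.printf("Sent %d events in %d ms -> %.2f events/ms%n",
                events, elapsedMs, (double) events / elapsedMs);
    }
}
```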
Conclusion: Workers and the Priority Event Loop Change improve TCP Input Performance

"Matt's Priority Event Loop" refers to a change that @matthewfala made here: fluent/fluent-bit#4869 He gave a talk on this at FluentCon, which will go up on YouTube soon. Basically, the change makes the Fluent Bit scheduler use priority-based scheduling instead of first-come-first-served scheduling. The vanilla version of the change was merged and released in the 1.9 series, which corresponds to "Matt's Priority Event Loop" here. "Matt's Priority Event Loop Input Prioritized" is an experiment we did in giving input events higher priority, which interestingly didn't do much.
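As a purely conceptual sketch of the scheduling difference (written in Java for illustration; Fluent Bit's actual event loop is C and works quite differently), first-come-first-served dispatch processes pending events strictly in arrival order, while priority-based dispatch always picks the highest-priority pending event next:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Queue;

public class SchedulingSketch {
    // Hypothetical event type: lower number = higher priority.
    record Event(String name, int priority) {}

    public static void main(String[] args) {
        Queue<Event> fifo = new ArrayDeque<>();
        Queue<Event> prioritized =
                new PriorityQueue<>((a, b) -> Integer.compare(a.priority(), b.priority()));

        for (Queue<Event> queue : List.of(fifo, prioritized)) {
            // Same arrival order in both cases.
            queue.add(new Event("flush output chunk", 2));
            queue.add(new Event("read input socket", 1));
            queue.add(new Event("run filter", 2));

            System.out.println(queue == fifo ? "First-come-first-served order:" : "Priority order:");
            while (!queue.isEmpty()) {
                System.out.println("  " + queue.poll().name());
            }
        }
    }
}
```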
Hello, this comment is to let you know that we have upgraded our configuration per your recommendation and also added all the metrics needed to get more insight, but this will only reach production at the beginning of June. Will let you know by then whether we still have the problem. Thanks!
I have written this guide on our current recommendations/mitigations for log4j TCP socket appender failures: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#log4j-tcp-appender-write-failure
This little script is an easier way to test the TCP input than the example log4j app: #502
@PettitWesley Since you provided the mitigations, I would like to know: is there no resolution for this TCP input issue, and do we need to work with the mitigations only?
@yashaswi90 Yes, unfortunately at this time I do not have a full root cause, and thus no complete fix for these issues, just mitigations.
I got a user report that they saw this health check repeatedly fail when the CPU was almost completely saturated. Fluent Bit appeared to keep sending some logs (though I can't confirm this with debug logs), but the TCP health check input stopped responding and the health check failed. Log throughput was several MB/s in this case. It succeeded in a non-saturated CPU case even though Fluent Bit was only given 48/1024 CPU shares. The use case was ECS FireLens with 4 cloudwatch_logs outputs, 3 TCP inputs (1 for the health check, 2 for logs), and one tail input.
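For reference, a topology like that might look roughly like the following sketch (ports, paths, tags, and region are assumptions, not the user's actual configuration): a dedicated TCP input whose only job is the health check, separate TCP and tail inputs for logs, and workers enabled on the outputs:

```
[INPUT]
    Name   tcp
    Tag    health-check
    Listen 127.0.0.1
    Port   8877

[INPUT]
    Name   tcp
    Tag    app-logs
    Listen 0.0.0.0
    Port   5170

[INPUT]
    Name   tail
    Tag    file-logs
    Path   /var/log/app/*.log

# Health check traffic is routed to the null output and discarded.
[OUTPUT]
    Name  null
    Match health-check

[OUTPUT]
    Name              cloudwatch_logs
    Match             app-logs
    region            us-east-1
    log_group_name    my-app-logs
    log_stream_prefix fluent-bit-
    workers           1
```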
We have seen the following issue reports from customers:
Of these, I have only been able to reproduce throughput limitations. I will post my findings in this ticket.