Tracking Issues with Fluent Bit TCP Input #294
In all of these tests I used log4j to send to the TCP input, with code based on what you can find here: https://github.com/aws/aws-for-fluent-bit/tree/mainline/troubleshooting/tools/log4j-tcp-app

The only real finding is that the TCP input can ingest logs at a higher rate when all outputs have workers enabled. This makes sense: without workers, Fluent Bit is actually a single-threaded program, and all inputs, filters, and outputs contend for the same thread: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md With workers, each output gets its own thread pool, which frees the main thread to focus only on the inputs and filters.
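To illustrate, enabling workers is a single `workers` setting on each output section. This is only a minimal sketch with assumed values (region, log group, worker count), not the exact configuration used in these tests:

```
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    my-log-group
    log_stream_prefix fluent-bit-
    # Each output with workers gets its own thread pool, leaving the main
    # thread free for inputs and filters.
    workers           2
```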
I got more scientific and thorough with each test I did; here was the first. The top legend is the log size. The metric evaluated is the number of logs sent per millisecond, i.e., the rate at which the TCP appender can append to the Fluent Bit socket. So higher numbers are better and mean that logs are accepted faster.
In all of the graphs below, the metric obtained is the number of log events that could be sent per millisecond, so higher is better. To be clear: in all of these tests, the metric is taken on the log4j side; it is the amount of time that log4j spent sending these logs. As soon as the TCP socket write returned, that event is "done". So this is the input-side throughput, not the output-side throughput. IIRC, I used the default settings in the app, which means I sent half a million events in each test: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/tools/log4j-tcp-app/src/main/java/com/mycompany/app/App.java

[Chart: Average Log Throughput Rate (higher is better)]

[Chart: Log Throughput Rate Standard Deviation (shows variance in the performance)]
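For clarity on how this input-side metric is obtained, here is a minimal sketch (not the actual log4j-tcp-app; the event count, payload size, and logger setup are illustrative assumptions). It times how long log4j spends handing events to a synchronous TCP socket appender and divides the event count by the elapsed milliseconds:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class TcpThroughputSketch {
    // Assumes a log4j2 configuration with a synchronous Socket appender
    // pointed at the Fluent Bit TCP input (e.g. localhost:5170).
    private static final Logger LOGGER = LogManager.getLogger(TcpThroughputSketch.class);

    public static void main(String[] args) {
        final int events = 500_000;               // roughly the default event count used in the tests
        final String payload = "x".repeat(1000);  // log size under test

        long start = System.nanoTime();
        for (int i = 0; i < events; i++) {
            LOGGER.info(payload);                 // an event is "done" once the socket write returns
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Input-side throughput: how fast Fluent Bit's TCP input accepted the data.
        System.out.printf("Sent %d events in %d ms -> %.2f events/ms%n",
                events, elapsedMs, (double) events / elapsedMs);
    }
}
```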
Conclusion: Workers and the Priority Event Loop Change improve TCP Input Performance

"Matt's Priority Event Loop" refers to a change that @matthewfala made here: fluent/fluent-bit#4869 He gave a talk on this at FluentCon, which will go up on YouTube soon. Basically, the change makes the Fluent Bit scheduler use priority-based scheduling instead of first-come-first-served scheduling. The vanilla version of the change was merged and released in the 1.9 series, which corresponds to "Matt's Priority Event Loop" here. "Matt's Priority Event Loop Input Prioritized" is an experiment we did in giving input events higher priority, which interestingly didn't do much.
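As a purely conceptual sketch of the scheduling difference (written in Java for illustration; Fluent Bit's actual event loop is C and works quite differently), first-come-first-served dispatch processes pending events strictly in arrival order, while priority-based dispatch always picks the highest-priority pending event next:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Queue;

public class SchedulingSketch {
    // Hypothetical event type: lower number = higher priority.
    record Event(String name, int priority) {}

    public static void main(String[] args) {
        Queue<Event> fifo = new ArrayDeque<>();
        Queue<Event> prioritized =
                new PriorityQueue<>((a, b) -> Integer.compare(a.priority(), b.priority()));

        for (Queue<Event> queue : List.of(fifo, prioritized)) {
            // Same arrival order in both cases.
            queue.add(new Event("flush output chunk", 2));
            queue.add(new Event("read input socket", 1));
            queue.add(new Event("run filter", 2));

            System.out.println(queue == fifo ? "First-come-first-served order:" : "Priority order:");
            while (!queue.isEmpty()) {
                System.out.println("  " + queue.poll().name());
            }
        }
    }
}
```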
Hello, this comment is to let you know that we have upgraded our configuration per your recommendation and also added all the metrics needed to get more insight, but this will only reach production at the beginning of June. Will let you know by then whether we still have the problem. Thanks!
I have written this guide on our current recommendations/mitigations for log4j TCP socket appender failures: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#log4j-tcp-appender-write-failure
This little script is an easier way to test the TCP input than the example log4j app: #502
@PettitWesley Since you provided the mitigations, I would like to know: is there no resolution for this TCP input issue, and do we need to work with the mitigations only?
@yashaswi90 Yes, unfortunately at this time I do not have a full root cause, and thus no complete fix for these issues, just mitigations.
I got a user report that they saw this health check repeatedly fail when the CPU was almost completely saturated. Fluent Bit appeared to keep sending some logs (though I can't confirm this with debug logs), but the TCP health check input stopped responding and the health check failed. Log throughput was several MB/s in this case. It succeeded in a non-saturated CPU case even though Fluent Bit was only given 48/1024 CPU shares. The use case was ECS FireLens with 4 cloudwatch_logs outputs, 3 TCP inputs (1 for the health check, 2 for logs), and one tail input.
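For reference, a topology like that might look roughly like the following sketch (ports, paths, tags, and region are assumptions, not the user's actual configuration): a dedicated TCP input whose only job is the health check, separate TCP and tail inputs for logs, and workers enabled on the outputs:

```
[INPUT]
    Name   tcp
    Tag    health-check
    Listen 127.0.0.1
    Port   8877

[INPUT]
    Name   tcp
    Tag    app-logs
    Listen 0.0.0.0
    Port   5170

[INPUT]
    Name   tail
    Tag    file-logs
    Path   /var/log/app/*.log

# Health check traffic is routed to the null output and discarded.
[OUTPUT]
    Name  null
    Match health-check

[OUTPUT]
    Name              cloudwatch_logs
    Match             app-logs
    region            us-east-1
    log_group_name    my-app-logs
    log_stream_prefix fluent-bit-
    workers           1
```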
We have seen the following issue reports from customers:
Of these, I have only been able to reproduce throughput limitations. I will post my findings in this ticket.