OOMs/crashes when using fluent-bit 1.8.11 #278

Closed
dylanlingelbach opened this issue Dec 14, 2021 · 27 comments
@dylanlingelbach

Describe the question/issue

We are seeing constant OOMKills after upgrading to the fluent-bit 1.8 series - most recently tried on fluent-bit 1.8.11

These OOMs happen within minutes of the fluent-bit pod starting and we see them without significant log traffic - our app is just logging health checks and status messages.

It does not appear to be a slow memory leak - we do not see memory climb slowly and then reach our pod limit. Rather, memory is low and then the pod receives an OOMKill.

When downgrading to fluent-bit 1.7.9 we do not see these OOMKills and fluent-bit runs fine.

If I remove the cloudwatch_logs output, I do not see the OOMKills.

However, fluent-bit CPU usage is significantly higher on v1.8.11 than on v1.7.9, both with and without the cloudwatch_logs output.

1.7.9

~$ kubectl top pods -n logging --containers --use-protocol-buffers
POD                NAME         CPU(cores)   MEMORY(bytes)
fluent-bit-49gpm   fluent-bit   6m           17Mi
fluent-bit-59kcv   fluent-bit   7m           15Mi
fluent-bit-62wzg   fluent-bit   7m           12Mi
fluent-bit-djnbn   fluent-bit   6m           11Mi
fluent-bit-frfnx   fluent-bit   6m           15Mi
fluent-bit-tg7ng   fluent-bit   6m           12Mi
fluent-bit-zngds   fluent-bit   5m           16Mi

1.8.11

~$ kubectl top pods -n logging --containers --use-protocol-buffers
POD                NAME         CPU(cores)   MEMORY(bytes)
fluent-bit-czkhw   fluent-bit   944m         23Mi
fluent-bit-frfl7   fluent-bit   678m         16Mi
fluent-bit-nzz7m   fluent-bit   1075m        16Mi
fluent-bit-t26j4   fluent-bit   717m         26Mi

See fluent/fluent-bit#4192 for more discussion

Logs sent by our system during a few minute stretch can be found here: logs.log

Configuration

Config map: cm.yaml.txt
Pod: pod.yaml.txt

Fluent Bit Log Output

fluent-bit debug logs are here: fluent-bit.log

Fluent Bit Version Info

v1.8.11

We do not see the issue on v1.7.9

Cluster Details

  • do you use App Mesh or a service mesh?
    No

  • do you use VPC endpoints in a network restricted VPC?
    No

  • Is throttling from the destination part of the problem? Please note that occasional transient network connection errors are often caused by exceeding limits. For example, CW API can block/drop Fluent Bit connections when throttling is triggered.
    Not sure, but it seems unlikely given how consistent it is and how low the log volume is.

  • ECS or EKS
    EKS

  • Fargate or EC2
    EC2

  • Daemon or Sidecar deployment for Fluent Bit
    Daemon

Application Details

Steps to reproduce issue

  1. Run fluent-bit 1.8.11 with the provided config and wait
  2. OOMKills on the fluent-bit pod

Related Issues

@PettitWesley
Contributor

I do see 3 instances of throttling exceptions in the logs:

[2021/12/14 19:44:45] [error] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] PutLogEvents API responded with error='ThrottlingException', message='Rate exceeded for logStreamName k8s-fluent-bit.fluent-bit-ghhjw'

So you are sending lots of logs... this looks like it's the log stream for Fluent Bit itself? Did you turn on debug logging in both cases?

@dylanlingelbach
Author

@PettitWesley the log stream for our application is here: https://github.com/aws/aws-for-fluent-bit/files/7714393/logs.log

There is not a lot being sent.

I think the throttling exceptions are coming from the fluent-bit debug logs? We see the OOMKills even without fluent-bit debug logging turned on.

@PettitWesley
Contributor

PettitWesley commented Dec 15, 2021

Yeah, it looks like the vast majority of logs are coming from Fluent Bit itself; I think if you remove debug logging there should be a noticeable change:

[2021/12/14 19:44:43] [debug] [output:cloudwatch_logs:cloudwatch_logs_kube] Sent 6 events to CloudWatch
[2021/12/14 19:44:43] [debug] [output:cloudwatch_logs:cloudwatch_logs_kube] Sent 2 events to CloudWatch
[2021/12/14 19:44:43] [debug] [output:cloudwatch_logs:cloudwatch_logs_aws_routed_eni] Sent 18 events to CloudWatch
[2021/12/14 19:44:43] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:44] [debug] [output:cloudwatch_logs:cloudwatch_logs_pgbouncer] Sent 132 events to CloudWatch
[2021/12/14 19:44:44] [debug] [output:cloudwatch_logs:cloudwatch_logs_kube] Sent 5 events to CloudWatch
[2021/12/14 19:44:44] [debug] [output:cloudwatch_logs:cloudwatch_logs_aws_routed_eni] Sent 88 events to CloudWatch
[2021/12/14 19:44:44] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 163 events to CloudWatch
[2021/12/14 19:44:44] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1941 events to CloudWatch
[2021/12/14 19:44:45] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:45] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:45] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1938 events to CloudWatch
[2021/12/14 19:44:45] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:47] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:47] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:47] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1519 events to CloudWatch
[2021/12/14 19:44:47] [debug] [output:cloudwatch_logs:cloudwatch_logs_containerd] Sent 11 events to CloudWatch
[2021/12/14 19:44:48] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1934 events to CloudWatch
[2021/12/14 19:44:48] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:49] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:49] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1941 events to CloudWatch
[2021/12/14 19:44:49] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:50] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:50] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:51] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:51] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:51] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:51] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:52] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 1942 events to CloudWatch
[2021/12/14 19:44:52] [debug] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Sent 504 events to CloudWatch
[2021/12/14 19:44:52] [debug] [output:cloudwatch_logs:cloudwatch_logs_kubelet] Sent 1 events to CloudWatch

I'm working on another bug report right now, but after that I will do some memory leak and performance testing on the CW output across a few different versions...

@PettitWesley
Contributor

@dylanlingelbach Wait, are you using the upstream release rather than the AWS for Fluent Bit release? https://github.com/aws/aws-for-fluent-bit/tags

Please use our release if you aren't. Lately we have hit some bugs that impact performance and stability, and we have had to add some custom patches (which we are working on eventually getting upstream in some form).

I'm not saying this will fix the issue, and I am currently working on running some benchmarks, but please use our release.

@dylanlingelbach
Author

@PettitWesley yes, we are using the upstream release, sorry if that wasn't clear from the prior thread. Trying aws-for-fluent-bit now

@AlexSc

AlexSc commented Dec 15, 2021

I've been working on diagnosing an issue that looks very similar to this (N.B. also using upstream fluentbit). Here's a sample from trace logs:

[2021/12/15 00:46:30] [debug] [output:cloudwatch_logs:cloudwatch_logs.6] Sending 3 events
[2021/12/15 00:46:30] [debug] [output:cloudwatch_logs:cloudwatch_logs.6] Sending log events to log stream ping.us-east-1a-ip-10-0-14-63
[2021/12/15 00:46:30] [trace] [upstream] get new connection for logs.us-east-1.amazonaws.com:443, net setup:
net.connect_timeout        = 10 seconds
net.source_address         = any
net.keepalive              = enabled
net.keepalive_idle_timeout = 30 seconds
[2021/12/15 00:46:30] [debug] [upstream] KA connection #117 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/12/15 00:46:30] [debug] [http_client] not using http_proxy for header
[2021/12/15 00:46:30] [debug] [aws_credentials] Requesting credentials from the EC2 provider..
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_write] trying 1588 bytes
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_write] ret=0 total=1588/1588
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_write] trying 1851 bytes
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_write] ret=0 total=1851/1851
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_read] try up to 4095 bytes
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_read] ret=6
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_read] try up to 4089 bytes
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_read] ret=6
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_read] try up to 4083 bytes
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_read] ret=6
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_read] try up to 4077 bytes
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_read] ret=6
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_read] try up to 4071 bytes
[2021/12/15 00:46:30] [trace] [io coro=0x7f619fc26cc8] [net_read] ret=6

This continues until the process OOMs or the disk fills with fluent bit logs. I haven't tested different versions of fluent bit yet, but I suspect that prior to fluent bit 1.8.7 the same thing would still happen, except that, since the AWS HTTP response buffer was only 4k, instead of OOMing we'd just fill the 4k buffer and then error out on a malformed HTTP response.

@dylanlingelbach If you can, I'd try running fluent bit with trace logs or seeing if 1.7.9 logs [http_client] malformed HTTP response.
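For reference, a minimal sketch of where that log level lives, assuming the usual [SERVICE] section in the ConfigMap (trace output generally requires a build compiled with trace support, so it may not show up on every image):

[SERVICE]
    # valid levels: error, warn, info, debug, trace
    # "trace" is one step more verbose than "debug" and includes the net_read/net_write lines shown above
    Log_Level    trace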

Meanwhile, I'm going to spend my day with strace.

@AlexSc

AlexSc commented Dec 15, 2021

Turns out my issue is fluent/fluent-bit#4098. Specifically, SSL_get_error() is indicating SSL_ERROR_ZERO_RETURN, which is 6.

@PettitWesley
Contributor

@AlexSc @dylanlingelbach Yea, and this is why you should use the AWS release: we created a patch commit to fix that issue, and it is included in our release. However, it is not in the upstream release yet due to ongoing discussions with the upstream community about the best way to solve it.

@PettitWesley
Contributor

I will have an update on this tomorrow from the results from some performance tests I ran.

@dylanlingelbach
Author

@PettitWesley I've confirmed no OOMKills when using aws-for-fluent-bit in place of upstream fluent-bit 1.8.11.

I am now seeing errors in the fluent-bit logs like:

[2021/12/16 01:56:11] [error] [http_client] broken connection to logs.us-east-1.amazonaws.com:443 ?
[2021/12/16 01:56:11] [error] [output:cloudwatch_logs:cloudwatch_logs_pushbot_api] Failed to send log events
[2021/12/16 01:56:11] [error] [output:cloudwatch_logs:cloudwatch_logs_pushbot_api] Failed to send log events
[2021/12/16 01:56:11] [error] [output:cloudwatch_logs:cloudwatch_logs_pushbot_api] Failed to send events

That looks similar to what was initially reported in #274

I am going to close this issue since switching to aws-for-fluent-bit fixes it, and it looks like there is a PR up to fix it in fluent-bit: fluent/fluent-bit#4100

If you have any ideas on the "Failed to send" errors, let me know. I am doing some more investigation this morning and will open a new bug if needed.

FWIW, I am really confused about the difference between fluent-bit and aws-for-fluent-bit. This statement in the README seems incorrect?

Does this mean you are diverging from fluent/fluent-bit?

No. We continue to consume Fluent Bit from its main repository. We are not forking Fluent Bit.

@PettitWesley
Contributor

@dylanlingelbach Our plan is still not to fork Fluent Bit. We haven't changed that text because we do not want to be in the current state we are in, with special patches just for the AWS release. Ideally, eventually, everything will get accepted upstream and we will be back to just distributing builds of the upstream versions.

As far as the broken connection issue goes, we have a number of reports of increased frequency of those errors. This is the actual main issue to track it (the one you linked is just another customer report that I am still following up on): fluent/fluent-bit#4332

We are working on the broken connection issue and have made some progress in our understanding of it, but we also think it's going to be a little bit of time before we get a fix out, partly because the holidays are coming and partly because it seems to be quite complicated.

@dylanlingelbach
Author

@PettitWesley got it, thanks for the info. Looking at fluent/fluent-bit#4332, I think that since we are on version 1.7.9 of the upstream we aren't impacted, but anything based on the 1.8.x branch (including aws-for-fluent-bit) will have this issue. Does that seem right to you?

@PettitWesley
Contributor

@dylanlingelbach Yea we think that somewhere in the 1.8.x series this problem began

@PettitWesley
Contributor

@dylanlingelbach One thing I notice is that your config is huge... with multiple usages of rewrite_tag (each of which acts like its own input and gets an additional input-side buffer) and multiple outputs. I have been doing some performance testing recently, and basically, with a config like that, it is very believable that you're hitting memory spikes sometimes. With so many plugins configured, it's easy for a scenario to occur that causes that.

Long term we might try to see if there's a way to make Fluent Bit limit its own memory better (mem_buf_limit partly does this, but only for input buffers) via some sort of user configuration.
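As a rough sketch of what those per-buffer limits can look like today (the tag names, paths, and limit values below are made up for illustration; Mem_Buf_Limit caps an input's in-memory buffer, and the records re-emitted by rewrite_tag go through their own emitter, which has its own Emitter_Mem_Buf_Limit option):

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    # pause this input once its in-memory buffer reaches the limit
    Mem_Buf_Limit     25MB

[FILTER]
    Name                   rewrite_tag
    Match                  kube.*
    # hypothetical rule: re-tag records from a container named "pushbot-api"
    Rule                   $kubernetes['container_name'] ^pushbot-api$ service.pushbot_api false
    # the re-emitted records get their own input-side buffer, capped here
    Emitter_Name           re_emitted_pushbot_api
    Emitter_Mem_Buf_Limit  10MB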

@dylanlingelbach
Author

@PettitWesley yes, we have 18 different services configured that have different log formats that we are parsing and sending off to different cloud watch streams. Is that a higher number than most people run? I assumed there were customers running with many more services with different log formats.

To be clear, after switching to aws-for-fluent-bit we are no longer seeing high memory usage or OOMKills. We are exclusively seeing fluent/fluent-bit#4332 now.

@PettitWesley
Contributor

@dylanlingelbach that's not unheard of, but it's high; your config is among the longest I've ever seen. Fluent Bit can handle it (as you have seen), my point is just that you might not be able to be as aggressive with memory limits as other folks are. With that many plugins, you may need to give Fluent Bit extra room from time to time.

@dylanlingelbach
Author

@PettitWesley got it, thank you, we will keep an eye on it. We are seeing memory usage with aws-for-fluent-bit under 100Mi and our limit is set at 200Mi, so we have a lot of room as is.

@PettitWesley
Contributor

@dylanlingelbach FYI that we have made some progress in understanding fluent/fluent-bit#4332 and might soon have a proposed fix that you can beta test if you want.

@dylanlingelbach
Author

@PettitWesley sorry for the delay, I have been focused on something else for the past few weeks. I'd love to beta test a fix when you have it. Thanks for following up!

@PettitWesley
Contributor

@matthewfala FYI @dylanlingelbach is interested in testing the proposed patch for the connection issue

@matthewfala
Contributor

I'm sorry, but it seems like the patch will most likely not help in this case, since the patch is for async networking and the cloudwatch plugin only uses sync networking. With that said, I can post a patch soon and you can test it if you would like, but the patch is not intended to fix the cloudwatch plugin's errors (I actually don't know how broken connection errors can come up on sync networking).

The only thing I can think of is that recycled connections are going stale and somehow not being detected, maybe because connections to CloudWatch aren't supposed to be reused across several log streams.

Could you try adding the following to your cloudwatch plugin's output?

[OUTPUT]
    Name             cloudwatch_logs
    # Add the following line
    net.keepalive    off

This may decrease network performance, but I'm curious if it will resolve your broken pipe issues. This will make a new network connection to cloudwatch each time a log is sent rather than reusing connections.

If that works, then to increase performance, maybe try removing the net.keepalive configuration and adding:
net.keepalive_max_recycle <10 or some other number; try to find the max that works without errors>
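Something like the sketch below, for example (net.keepalive is back on, which is also the default; the 10 is just the starting point mentioned above and is meant to be tuned upward until errors reappear):

[OUTPUT]
    Name                       cloudwatch_logs
    # reuse connections, but retire each one after at most N uses
    net.keepalive              on
    net.keepalive_max_recycle  10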

@PettitWesley
Contributor

For CW, broken connections can also indicate throttling, as the CW frontend will often block connections from the same IP instead of returning a full throttling error response.

@matthewfala
Contributor

Hi @dylanlingelbach,
If you would like to test the patch which may somehow unexpectedly resolve your case, please use the following image: 826489191740.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-for-fluent-bit:1.8.11-1.8-event-loop-priority-r2

@dylanlingelbach
Author

@matthewfala I will try to test that image and turning off keep alive later this week. Will let you know what I find!

@dylanlingelbach
Author

@matthewfala @PettitWesley I haven't had a chance to test the image, but in getting ready to I found some interesting things:

  1. The fluent/fluent-bit:1.8.12 image fixes the OOMs I was seeing (maybe "out_cloudwatch_logs: Fix invalid memory access bug #4425 backport to v1.8" (fluent/fluent-bit#4604) fixed it?)
  2. With fluent/fluent-bit:1.8.12 AND keep alive still on I did see a lot of broken connections to CloudWatch like:
[2022/01/27 23:00:17] [error] [http_client] broken connection to logs.us-east-1.amazonaws.com:443 ?
[2022/01/27 23:00:17] [error] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Failed to send log events
[2022/01/27 23:00:17] [error] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Failed to send log events
[2022/01/27 23:00:17] [error] [output:cloudwatch_logs:cloudwatch_logs_fluent_bit] Failed to send events
  3. With fluent/fluent-bit:1.8.12 and keep alive off, I no longer see broken connections but see a lot of these informational messages:
[2022/01/28 02:16:11] [ info] [filter:throttle:throttle.24] 1643336171: limit is 100.00 per 1m with window size of 5, current rate is: 0 per interval
[2022/01/28 02:17:11] [ info] [filter:throttle:throttle.24] 1643336231: limit is 100.00 per 1m with window size of 5, current rate is: 0 per interval
[2022/01/28 02:18:11] [ info] [filter:throttle:throttle.24] 1643336291: limit is 100.00 per 1m with window size of 5, current rate is: 0 per interval
[2022/01/28 02:19:11] [ info] [filter:throttle:throttle.24] 1643336351: limit is 100.00 per 1m with window size of 5, current rate is: 0 per interval
[2022/01/28 02:20:11] [ info] [filter:throttle:throttle.24] 1643336411: limit is 100.00 per 1m with window size of 5, current rate is: 0 per interval

I can't tell if those messages are purely informational or indicating an issue.

I think I shared our config earlier but we have 29 [OUTPUT] sections that send to CloudWatch.

I am going to try to spend some more time today looking at metrics and seeing if fluent/fluent-bit:1.8.12 will work for us.

@matthewfala
Contributor

Hi @dylanlingelbach. Cool! Keep alive off may impact network performance, so you may want to also try with keep alive on and:
net.keepalive_max_recycle <10 or 100 or some other number; try to find the max that works without errors>.

As for the new informational messages, those logs are coming from the [FILTER] throttle plugin, which I don't see in your config. Did you happen to add that recently?
https://docs.fluentbit.io/manual/pipeline/filters/throttle

This doesn't seem like an error.... Glad 1.8.12 + keep alive config is resolving most of your problems.

@dylanlingelbach
Author

@matthewfala ah, yes we did. We added a throttle config on the fluent-bit logs to handle the case where a misconfiguration in fluent-bit led to fluent-bit logs, which hit the misconfiguration, which caused more fluent-bit logs, and so on.

Our throttle config looks like:

[FILTER]
    Name                throttle
    Match               fluent-bit.*

    # this should only ever be eclipsed if Fluent-Bit is messed up
    Rate                100

    Window              5
    Interval            1m

    # When Print_Status is enabled, once per `Interval` FB will print a message like the following:
    #   [2021/12/17 23:07:28] [ info] [filter:throttle:throttle.24] 1639782448: limit is 100.00 per 1m with window size of 5, current rate is: 99 per interval
    # This can be useful to check if throttling is active, and as a means to gauge the volume of FB logs (if it's high,
    # that usually means something is wrong).
    # If you're subject to throttling, you won't actually see this message in your output, since it will itself be
    # throttled. But you can use the lack of this message to identify that throttling has occurred.
    Print_Status        On

Completely forgot about the interval logs. This is great, thank you! We should be set with v1.8.12.
