OOMs/crashes when using fluent-bit 1.8.11 #278
I do see 3 instances of throttling exceptions in the logs:
So you are sending lots of logs... this looks like it's the log stream for Fluent Bit itself? Did you turn on debug logging in both cases?
@PettitWesley the log stream for our application is here: https://github.com/aws/aws-for-fluent-bit/files/7714393/logs.log There is not a lot being sent. I think the throttling exceptions are coming from the fluent-bit log stream itself.
Yeah, it looks like the vast majority of logs are coming from Fluent Bit itself. I think if you remove debug logging there should be a noticeable change:
I'm working on another bug report right now, but after that I will do some memory leak and performance testing on the CW output on a few different versions...
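For reference, Fluent Bit's own verbosity is controlled by the Log_Level key in the [SERVICE] section (or the -v/-vv CLI flags). A minimal sketch of turning debug logging back down, assuming the config currently sets it to debug:

```
[SERVICE]
    Flush        1
    # info is the default; debug/trace make Fluent Bit emit large volumes
    # of its own logs, which here were also being shipped to CloudWatch
    Log_Level    info
```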
@dylanlingelbach Wait, are you using the upstream release, not the AWS for Fluent Bit release? https://github.com/aws/aws-for-fluent-bit/tags Please use our release if you aren't. Lately there have been some bugs that impact performance and stability, and we have had to add some custom patches (which we are working on eventually getting upstream in some form). I'm not saying this will fix the issue, and I am currently working on running some benchmarks, but please use our release.
@PettitWesley yes, we are using the upstream release, sorry if that wasn't clear from the prior thread. Trying the AWS release now.
I've been working on diagnosing an issue that looks very similar to this (N.B. also using upstream fluentbit). Here's a sample from trace logs:
This continues until the process OOMs or the disk fills with Fluent Bit logs. I haven't tested different versions of Fluent Bit yet, but I suspect that prior to fluent bit 1.8.7 the same thing would still happen, except that since the AWS HTTP response buffer is only 4k, instead of OOMing we'd just fill the 4k buffer and then error out on a malformed HTTP response. @dylanlingelbach If you can, I'd try running Fluent Bit with trace logs, or seeing whether 1.7.9 logs the same thing. Meanwhile, I'm going to spend my day with strace.
Turns out my issue is fluent/fluent-bit#4098. Specifically, SSL_get_error() is indicating
@AlexSc @dylanlingelbach Yeah, and this is why you should use the AWS release: we created a patch commit to fix that issue, which is added to our release. However, it is not in the upstream release yet due to ongoing discussions with the upstream community about the best way to solve it.
I will have an update on this tomorrow with the results of some performance tests I ran.
@PettitWesley I've confirmed no OOMKills when using the AWS release. However, I am now seeing errors in the fluent-bit logs:
That looks similar to what was initially reported in #274. I am going to close this issue, since switching to the AWS release resolved the OOMKills. If you have any ideas on the broken connection errors, I'd appreciate it. FWIW, I am really confused about the difference between the upstream Fluent Bit release and the AWS for Fluent Bit release.
@dylanlingelbach Our plan is still not to fork Fluent Bit. We haven't changed that text because we do not want to be in the current state we are in, with special patches just for the AWS release. Ideally, eventually, everything will get accepted upstream and we will be back to just distributing builds of the upstream versions. As far as the broken connection issue goes, we have a number of reports of increased frequency of those errors. This is the main issue tracking it (the one you linked is just another customer report that I am still following up on): fluent/fluent-bit#4332. We are working on the broken connection issue and have made some progress in our understanding of it, but we also think it's going to be a little while before we get a fix out, partly because the holidays are coming and partly because it seems to be quite complicated.
@PettitWesley got it, thanks for the info. Looking at fluent/fluent-bit#4332, I think this affects us since we are on a 1.8.x version.
@dylanlingelbach Yeah, we think that somewhere in the 1.8.x series this problem began.
@dylanlingelbach One thing I notice is that your config is huge, with multiple usages of rewrite_tag (each of which is effectively its own input and gets an additional input-side buffer) and multiple outputs. I have been doing some performance testing recently, and with a config like that it is very believable that you're hitting memory spikes sometimes. With so many plugins configured, it's easy for a scenario to occur that causes that. Long term we might try to see if there's a way to make Fluent Bit limit its own memory better via some sort of user configuration (mem_buf_limit partly does this, but only for input buffers).
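To illustrate the input-side buffer point, a minimal sketch (not the reporter's actual config; the paths, tags, rules, and limits are assumptions): each rewrite_tag filter carries its own emitter buffer in addition to the tail input's buffer, and each can be capped separately.

```
[INPUT]
    Name                  tail
    Path                  /var/log/containers/*.log
    Tag                   kube.*
    # Caps only this input's in-memory buffer
    Mem_Buf_Limit         10MB

[FILTER]
    Name                  rewrite_tag
    Match                 kube.*
    # Hypothetical rule: re-tag records whose "service" key is "payments"
    Rule                  $service ^payments$ payments.$TAG false
    # Each rewrite_tag filter owns an emitter, effectively an extra input
    # with its own buffer; cap it as well
    Emitter_Name          re_emitted_payments
    Emitter_Mem_Buf_Limit 10MB
```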
@PettitWesley yes, we have 18 different services configured, each with a different log format, that we are parsing and sending off to different CloudWatch streams. Is that a higher number than most people run? I assumed there were customers running many more services with different log formats. To be clear, after upgrading to the AWS release we are no longer seeing OOMKills.
@dylanlingelbach That's not unheard of, but it's high; your config is among the longest I've ever seen. Fluent Bit can handle it (as you have seen); my point is just that you might not be able to be as aggressive with memory limits as other folks are. With that many plugins, you may need to give Fluent Bit extra room from time to time.
@PettitWesley got it, thank you, we will keep an eye on it. We are seeing memory usage with
@dylanlingelbach FYI, we have made some progress in understanding fluent/fluent-bit#4332 and might soon have a proposed fix that you can beta test if you want.
@PettitWesley sorry for the delay, I have been focused on something else for the past few weeks. I'd love to beta test a fix when you have it. Thanks for following up!
@matthewfala FYI, @dylanlingelbach is interested in testing the proposed patch for the connection issue.
I'm sorry, but it seems like the patch will most likely not help in this case, since the patch is for async networking and the cloudwatch plugin only uses sync networking. With that said, I can post a patch soon and you can test it if you would like, but the patch does not intend to fix the cloudwatch plugin's errors (I actually don't know how broken connection errors can come up on sync networking). The only thing I can think of is that recycled connections are going stale and somehow not being detected, maybe because connections to CloudWatch aren't supposed to be reused across several log streams. Could you try adding the following to your cloudwatch plugin's output section? This may decrease network performance, but I'm curious whether it will resolve your broken pipe issues. It will make a new network connection to CloudWatch each time a log is sent rather than reusing connections. If that works, then to increase performance, maybe try removing the net.keepalive configuration and adding a keepalive tuning option instead.
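A minimal sketch of a cloudwatch_logs output with connection reuse disabled via the standard upstream net.keepalive networking option (the region, group, and stream names are placeholders, not values from this thread):

```
[OUTPUT]
    Name              cloudwatch_logs
    Match             app.*
    region            us-east-1
    log_group_name    my-log-group
    log_stream_prefix my-app-
    auto_create_group On
    # Open a fresh connection to CloudWatch on every flush instead of
    # recycling connections; may cost some throughput but avoids stale sockets
    net.keepalive     Off
```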
For CW, broken connections can also indicate throttling, as the CW frontend will often block connections from the same IP instead of returning a full throttling error response.
Hi @dylanlingelbach,
@matthewfala I will try to test that image and try turning off keep alive later this week. Will let you know what I find!
@matthewfala @PettitWesley I haven't had a chance to test the image, but in getting ready to I found some interesting things:
I can't tell if those messages are truly informational or indicating an issue. I think I shared our config earlier, but we have 29. I am going to try to spend some more time today looking at metrics and seeing if anything stands out.
Hi @dylanlingelbach. Cool! Keep alive off may impact network performance, so you may want to also try with keep alive on plus the tuning option mentioned above. As for the new informational messages, those logs are coming from the [FILTER] throttle plugin, which I don't see in your config. Did you happen to add that recently? This doesn't seem like an error. Glad 1.8.12 + the keep alive config is resolving most of your problems.
@matthewfala ah, yes we did. We added a throttle config for the fluent-bit logs to handle the case where a misconfiguration in fluent-bit generated fluent-bit error logs, which were themselves run through the misconfiguration and produced more fluent-bit logs, and so on. Our throttle config looks like:
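A representative throttle filter scoped to Fluent Bit's own log stream (an illustrative sketch, not their actual config; the match pattern, rate, window, and interval are assumptions):

```
[FILTER]
    Name         throttle
    # Hypothetical tag under which Fluent Bit's own logs are collected
    Match        fluent-bit.*
    # Allow roughly Rate messages per Interval, averaged over Window periods
    Rate         100
    Window       5
    Interval     1m
    # Print_Status emits the periodic informational status lines
    # discussed in this thread
    Print_Status true
```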
Completely forgot about the interval logs. This is great, thank you! We should be set with v1.8.12. |
Describe the question/issue
We are seeing constant OOMKills when upgrading to fluent-bit 1.8.x (most recently tried on 1.8.11). These OOMs happen within minutes of the fluent-bit pod starting, and we see them without significant log traffic: our app is just logging health checks and status messages.
It does not appear to be a slow memory leak. We do not see memory climb slowly and then reach our pod limit; rather, memory is low and then the pod receives an OOMKill.
When downgrading to fluent-bit 1.7.9 we do not see these OOMKills and fluent-bit is able to run fine.
If I remove the cloudwatch_logs output, I do not see the OOMKills. However, fluent-bit CPU usage is significantly higher in v1.8.11 vs v1.7.9, both with and without cloudwatch_logs:
CPU usage charts: 1.7.9 vs 1.8.11
See fluent/fluent-bit#4192 for more discussion
Logs sent by our system during a few minute stretch can be found here: logs.log
Configuration
Config map: cm.yaml.txt
Pod: pod.yaml.txt
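For context, a config like this fans records out to per-service CloudWatch log groups along roughly these lines, with one such output per service (an illustrative sketch; the tag, region, and group name are placeholders rather than values from cm.yaml.txt):

```
# Repeated once per service (the real config has many of these),
# each matching a different tag and targeting a different log group
[OUTPUT]
    Name              cloudwatch_logs
    Match             app.service-a.*
    region            us-east-1
    log_group_name    /eks/service-a
    log_stream_prefix fluent-bit-
    auto_create_group On
```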
Fluent Bit Log Output
fluent-bit debug logs are here: fluent-bit.log
Fluent Bit Version Info
v1.8.11
We do not see the issue on v1.7.9
Cluster Details
Do you use App Mesh or a service mesh?
No
Do you use VPC endpoints in a network-restricted VPC?
No
Is throttling from the destination part of the problem? Please note that occasional transient network connection errors are often caused by exceeding limits. For example, CW API can block/drop Fluent Bit connections when throttling is triggered.
Not sure, but it seems unlikely given how consistent it is and how low the log volume is.
ECS or EKS
EKS
Fargate or EC2
EC2
Daemon or Sidecar deployment for Fluent Bit
Daemon
Application Details
Steps to reproduce issue
Run fluent-bit 1.8.11 with the provided config and wait; the fluent-bit pod will be OOMKilled within minutes.
Related Issues