Losing logs with S3 plugin #495
Apologies for this issue. Currently, the S3 output has a hard-coded 5 retries for chunks, but it doesn't output any messages to tell you that it is retrying. In addition, the timestamp in the S3 key of the file in S3 is not actually the timestamp of any log record in the file; it is just the time when the S3 output created the buffer file. How do you search for the logs? Is the Kibana search based on the timestamp in the log or the timestamp of the S3 file? If it's the S3 file, then you unfortunately need to widen your search time range. The team, and especially I, apologize for these bugs. We have been working on fixing them this quarter and are hoping to get a release out very soon with the fixes:
One immediate mitigation step you can take is to enable our ... Additionally, we could build you a custom image with our fixes; would you be interested in that?
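For context on the key timestamp behavior described above, here is a minimal sketch of an S3 output using the publicly documented s3_key_format option; the bucket name is hypothetical, and the time placeholders in the key are resolved when the buffer file is created, not from any record timestamp.

```
[OUTPUT]
    Name           s3
    Match          *
    bucket         my-log-bucket      # hypothetical bucket name
    region         us-west-2
    # The %Y/%m/%d/%H_%M_%S placeholders are filled in when the buffer
    # file is created, so the key time can differ from the timestamps
    # of the records inside the object.
    s3_key_format  /$TAG/%Y/%m/%d/%H_%M_%S
```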
@Claych FYI
These two PRs will fix the bugs with the S3 timestamp and will let you set the number of retries with retry_limit in the configuration file. The docs will be updated when these two PRs are released, and you can see the details for s3 there later.
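As a rough sketch of what that could look like once the PRs land, assuming the option is exposed as retry_limit on the S3 output section as described above (the released docs may use a different name or placement) and a hypothetical bucket:

```
[OUTPUT]
    Name         s3
    Match        *
    bucket       my-log-bucket    # hypothetical bucket name
    region       us-west-2
    # Assumed per the comment above: replaces the previously
    # hard-coded 5 retry attempts with a configurable limit.
    retry_limit  3
```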
Thank you very much for the reply. We are indexing the log file entries not on the S3 timestamp but on timestamps from either the application or the container runtime logger, so the timestamp is not a problem. We know logs are missing because our searches using an application timestamp sometimes come up empty, but we can find the entries when we search the container logs. Just a bit confused by the ... And sure, I would be interested in a custom image; I have several non-production clusters that also show the error where I could try it out. Thanks again.
Sorry @fhitchen for this issue you are experiencing.
My bad, yea you do not need to make any config update to enable it.
Can you elaborate on what this means? Do you mean that when you check the actual log files on disk or use ... Also, we have some guidance on investigating log loss here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#log-loss One thing that would be super useful is if you can provide us with details on any correlations you see in the lost logs, as the above guide notes. For example, do they all come in a series, as if one chunk/buffer file was lost, or are they very mixed over time?
Hi @PettitWesley, my bad this time. What I meant about knowing we had missing log entries had nothing to do with timestamps. All our log entries have a traceId in them. We could not find the traceId in OpenSearch, but we could when we did a ... I have found a pattern in the distribution of the ...
If I look at the input bytes/second on the nodes with the highest number of occurrences, it is much busier than the nodes with the lowest. [Grafana screenshots: highest occurrences vs. lowest occurrences.] Here is the actual distribution by time of the error message from the worst fluent-bit POD.
From what I can see, it looks as though it happens with no noticeable pattern, just 3-8 times an hour or so. One strange thing: why does the seconds value of the error timestamp always end in 9? I checked another fluent-bit POD, and there it always ended in 5. Must be some internal flush interval for logs?
So the S3 plugin is unfortunately very complicated, much more so than other plugins. In most outputs, the Fluent Bit engine sends chunks of records via the output's "flush callback function", which sends the records before that callback returns. S3 is more complicated because it has to buffer data locally so that it can build large files in S3. A Fluent Bit chunk is targeted to be around 2 megabytes, which is too small for most S3 use cases. So when the S3 "flush callback function" gets a chunk of logs, most of the time it just buffers it to the file system so a larger file can accumulate.

S3 also has a "timer callback function" which runs at a regular interval and checks whether the pending buffer files are ready to send. The regular interval at which you see these errors possibly indicates that the failures are all happening in the "timer callback function".

Fluent Bit is a concurrent log routing tool: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency Recently, @matthewfala pointed out to me that S3 has a potential concurrency bug that affects the "timer callback function" execution. I am working on fixing this. The pending fix is built into these images:
One is for init, one is the normal image without init. I think you are currently not using our init tag, so use the second one: https://github.com/aws/aws-for-fluent-bit#using-the-init-tag Code source:
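To make the buffering behavior described above more concrete, here is a minimal sketch using the publicly documented S3 output options that decide when a pending buffer file is ready to send; the bucket name and values are placeholders, not a recommended configuration.

```
[OUTPUT]
    Name             s3
    Match            *
    bucket           my-log-bucket      # hypothetical bucket name
    region           us-west-2
    # Chunks handed to the flush callback are buffered here on disk
    # until they can be combined into a larger object.
    store_dir        /tmp/fluent-bit/s3
    # The timer callback uploads a buffer file once it reaches this size...
    total_file_size  50M
    # ...or once it has waited this long, whichever happens first.
    upload_timeout   10m
```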
@fhitchen There is also one more S3 issue I need to mention. Apologies for how many S3 issues there are right now. Currently, the retry and error metrics for S3 are broken. I have opened this issue for the design of the fix: fluent/fluent-bit#6141
I switched the image to 144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.29.0-s3-sync-io and immediately noticed the number of errors increased. So instead of going back to the 2.23.4 version, I went to the 2.29.0 image to see if there was a difference, and then forgot I had made the change. When I looked 24 hours later I could see that with this fluent-bit version the number of errors had gone from about 180 a day to 3,500. Going back to 2.23.4 now; I will let you know how it goes.
@fhitchen Can you confirm you mean 2.23.4? That version is quite old: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.23.4 It's on Fluent Bit 1.8.15 and was released in April. Many bugs have been fixed since then.
I can confirm that I switched back to 2.23.4. I am already seeing a big drop in error count, down from about 50 per 30 minutes to 8 in the same time period. The original issue I reported was from our production cluster, which is on 2.28.4. I will make them the same.
@PettitWesley what is the expected behaviour of the plugin if the hard retry count is exhausted? I am seeing the following log:
Describe the question/issue
Have had team members complaining that log file entries are missing when they go to search for them in Kibana. The fluent-bit logs do show failures and retries happening quite frequently, but there appears to be one particular log sequence that never seems to get retried. On our 64 node EKS cluster I can see the error
Could not send chunk with tag
happening around 1,600 times a day in OpenSearch. I have gone through all the steps in the debugging issues guide and nothing has helped so far, so I am hoping for more guidance here.
Configuration
Fluent Bit Log Output
Have tried to turn on debug logging, but it was so verbose that I could not see the error happening.
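For reference, a minimal sketch of how debug logging can be turned on via the standard service-level option (the rest of the configuration is omitted here):

```
[SERVICE]
    # debug (and trace) produce very high log volume, which is why the
    # error sequence below was hard to spot.
    Log_Level  debug
```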
This is the error sequence that does not appear to be subject to any kind of retry.
[2022/12/13 15:58:59] [error] [http_client] broken connection to s3.us-west-2.amazonaws.com:443 ?
[2022/12/13 15:58:59] [error] [http_client] broken connection to s3.us-west-2.amazonaws.com:443 ?
[2022/12/13 15:58:59] [error] [output:s3:s3.1] PutObject request failed
[2022/12/13 15:58:59] [error] [output:s3:s3.1] Could not send chunk with tag kube.var.log.containers.x.log
[2022/12/13 15:58:59] [error] [http_client] broken connection to s3.us-west-2.amazonaws.com:443 ?
[2022/12/13 15:58:59] [error] [http_client] broken connection to s3.us-west-2.amazonaws.com:443 ?
[2022/12/13 15:58:59] [error] [output:s3:s3.1] PutObject request failed
[2022/12/13 15:58:59] [error] [output:s3:s3.1] Could not send chunk with tag kube.var.log.containers.y.log
[2022/12/13 15:58:59] [error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[2022/12/13 15:58:59] [error] [src/flb_http_client.c:1189 errno=25] Inappropriate ioctl for device
Fluent Bit Version Info
AWS for Fluent Bit Container Image Version 2.28.4
Fluent Bit v1.9.9
Cluster Details
Application Details
We are shipping logs from around 1,800 pods. The log activity varies dramatically from pod to pod. Each day we are creating around 60 million documents in OpenSearch, and the size of the indexes is around 50 GB.
Input and output rates are in the attached Grafana screenshots.
Steps to reproduce issue
From the fluent-bit logs I see that we successfully upload around 250,000 objects to S3 in a 24 hour period, so the failure rate is quite low, but still it is getting noticed.
Added the worker setting but it made no difference. Have also tried to increase the upload timeout from 3 to 5 minutes and that also did not make any difference.
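A hedged sketch of those two changes, using the documented workers and upload_timeout options with a hypothetical bucket (values mirror the description above):

```
[OUTPUT]
    Name            s3
    Match           *
    bucket          my-log-bucket    # hypothetical bucket name
    region          us-west-2
    # Added a worker thread for this output (made no difference).
    workers         1
    # Raised the upload timeout from 3 to 5 minutes (also no difference).
    upload_timeout  5m
```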
There does not seem to be any correlation between the failure rate and the input byte rate. The failure rate remains more or less constant across the day, between 60 and 100 failures an hour.
Related Issues