
Sporadic log error once switch to cloudwatch_logs #135

Closed
bowliang opened this issue Jan 15, 2021 · 10 comments

@bowliang

Hi,

I recently switched to using cloudwatch_logs in my project, and I've found that I sporadically get the following errors in the FireLens container logs.

2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [error] [tls] SSL error: SSL - The peer notified us that the connection is going to be closed
2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [error] [http_client] broken connection to logs.us-west-2.amazonaws.com:443 ?
2021-01-15 05:32:55.016,"[2021/01/15 05:32:55] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0"
2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [debug] [upstream] KA connection #61 to logs.us-west-2.amazonaws.com:443 is now available
2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send events
2021/01/15 04:47:44,[2021/01/15 04:47:44] [error] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents API responded with error='ThrottlingException', message='Rate exceeded'

My hunch is that this doesn't matter much, since my containers keep running and the output plugin will retry anyway. But I want to make sure, and to see whether I need to tune some parameters. Also, the fact that the log level is [error] is concerning to me.

For context, I get the image from 906394416424.dkr.ecr.us-east-1.amazonaws.com/aws-for-fluent-bit:latest and my config looks like this:

[INPUT]
    Name             tcp
    Tag              ApplicationLogs
    Listen           0.0.0.0
    Port             5170
    Chunk_Size       256
    Buffer_Size      256
    Format           None
    storage.type     filesystem
    Mem_Buf_Limit    800MB

[OUTPUT]
    Name               cloudwatch_logs
    Match              ApplicationLogs
    region             ${LOG_REGION}
    log_group_name     ${SERVICE_NAME}/application_log
    log_stream_name    application_log_${LOG_STREAM_SUFFIX}
    log_key            log
    auto_create_group  true

Thanks!

@PettitWesley
Contributor

I would ignore the SSL errors; I get those occasionally too. Fluent Bit unfortunately has a custom HTTP client, and I think for these connection-dropped errors other clients would just be silent and retry. I haven't seen evidence that occasional error messages like that are a sign of anything actually abnormal going on.

The throttling exception though is something you should fix.

Check the logs some more; there should be an info message every flush that tells you how many events were sent in that batch. How many logs are you sending per Fluent Bit instance, i.e. what is your rough rate? How many Fluent Bit instances do you have, and do they send to unique log streams or to the same log stream? Based on your config it looks like they might all send to the same log stream. CloudWatch has ingestion limits per log stream: 5 requests per second.
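
If those per-flush messages aren't visible, one way to surface them is to raise the log level in the Service section (a minimal sketch; debug is verbose, so it's best left on only while investigating):

[SERVICE]
    Log_Level debug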

@bowliang
Author

Thanks for your quick response. Here are the log messages I found around that error.

2021-01-18 17:45:05.594,[2021/01/18 17:45:05] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending log events to log stream service_log_1e0f59e8d59644d8a0ca1d86ea7faad0
2021-01-18 17:45:05.594,[2021/01/18 17:45:05] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending 47 events
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [debug] [task] destroy task=0x7fdfa2e518c0 (task_id=0)
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send events
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send log events
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send log events
2021-01-18 17:45:05.599,"[2021/01/18 17:45:05] [error] [output:cloudwatch_logs:cloudwatch_logs.2] PutLogEvents API responded with error='ThrottlingException', message='Rate exceeded for logStreamName service_log_1e0f59e8d59644d8a0ca1d86ea7faad0'"
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] PutLogEvents http status=400
2021-01-18 17:45:05.599,"[2021/01/18 17:45:05] [debug] [aws_client] logs.us-east-1.amazonaws.com: http_do=0, HTTP Status: 400"
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [debug] [http_client] server logs.us-east-1.amazonaws.com:443 will close connection #129
2021-01-18 17:45:05.600,"[2021/01/18 17:45:05] [ warn] [engine] failed to flush chunk '1-1610991904.642253673.flb', retry in 8 seconds: task_id=1, input=tail.4 > output=cloudwatch_logs.2"
2021-01-18 17:45:05.600,[2021/01/18 17:45:05] [debug] [retry] new retry created for task_id=1 attempts=1

2021-01-18 17:45:13.548,[2021/01/18 17:45:13] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending log events to log stream service_log_1e0f59e8d59644d8a0ca1d86ea7faad0
2021-01-18 17:45:13.548,[2021/01/18 17:45:13] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending 47 events
2021-01-18 17:45:13.633,[2021/01/18 17:45:13] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sent events to service_log_1e0f59e8d59644d8a0ca1d86ea7faad0
2021-01-18 17:45:13.633,[2021/01/18 17:45:13] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] PutLogEvents http status=200

It looks like it works after the retry, and I see this behavior once every 20 minutes or so. I'm sending logs to various log streams based on application_log_${LOG_STREAM_SUFFIX}, where ${LOG_STREAM_SUFFIX} is extracted from the Fargate task ARN and corresponds to the task ID. So basically, logs from the same task are sent to the same log stream, and different tasks have different log streams.

Any suggestions on how to improve this? Or should the log level be adjusted to warning? Thanks!

@PettitWesley
Contributor

The main advice to prevent throttling issues is to send to a unique log stream per task. If you are already doing that, then there is one more, lower-impact recommendation I can give:

Experiment with increasing the Flush setting in the Service section to be greater than 5 seconds (the default); this might increase the number of events sent per flush, which could decrease the total number of calls to CloudWatch:

[Service]
    Flush 10

https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/configuration-file

@bowliang
Author

Thanks, I'll play with the Flush parameter. For now I've used Flush 1, as suggested in your post: https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/

Does increasing the flush time period also increase the possibility of log loss?

@PettitWesley
Contributor

Does increasing the flush time period also increase the possibility of log loss?

@BowenRaymone It does slightly increase the risk, yes. If Fluent Bit suddenly quits, there will be more logs still sitting in the buffer: if it crashes, more will be lost; if it exits gracefully, there's a greater risk it won't be able to send everything in time. Realistically, graceful shutdown is the most common case, and the 30-second shutdown window ECS gives is enough time to send a lot of logs, as shown in that post.
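
One related knob, separate from Flush: if I'm reading the config right, the [INPUT] section sets storage.type filesystem, but chunks only get persisted to disk when a storage.path is also set in the service section; without it the input falls back to in-memory buffering. A minimal sketch (the path here is just an assumption for illustration):

[SERVICE]
    Flush           10
    # filesystem buffering needs storage.path; this path is an example
    storage.path    /var/log/flb-storage/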

And with the new cloudwatch_logs plugin, you can send logs a bit faster, so I bet that post's numbers don't fully apply, especially once Fluent Bit 1.7 comes out: it will have a multi-worker/multithreaded mode that only core plugins will be able to take advantage of.
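
For reference, once 1.7 is out, the multi-worker mode is expected to be enabled per output with a workers property, roughly like this (a sketch; check the exact property and a sensible value against the 1.7 docs when it lands):

[OUTPUT]
    Name       cloudwatch_logs
    Match      ApplicationLogs
    region     ${LOG_REGION}
    workers    1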

@bowliang
Author

I see, that makes sense. I'll try increasing the flush time and see if it resolves the throttling issue. Thanks for your advice!

@lgg42

lgg42 commented Apr 26, 2021

I see, that makes sense. I'll try increasing the flush time and see if it resolves the throttling issue. Thanks for your advice!

Hello! I was experiencing a very similar issue to yours and I'd like to know if the flush time increase did the trick?

Thanks.

@k-vit

k-vit commented Nov 30, 2021

Hello! I was experiencing a very similar issue to yours and I'd like to know if the flush time increase did the trick?

Worked for me

@Sangeethavisa

Hi All,

I'm facing a similar issue with forwarding my application logs to CloudWatch. It does send logs, but after some time the connection gets lost between the container and CloudWatch. Flush is already set to 10 in the fluentbit.conf file. Is there anything I should check?

[output:cloudwatch_logs:cloudwatch_logs.3] Sending log events to log stream from-us-east-1b-xray-indexer-service.log
[2022/11/14 18:58:46] [debug] [http_client] not using http_proxy for header
[2022/11/14 18:58:46] [debug] [aws_credentials] Retrieving credentials from the HTTP provider..
[2022/11/14 18:58:46] [debug] [http_client] server logs.us-east-1.amazonaws.com:443 will close connection #55
[2022/11/14 18:58:46] [debug] [aws_client] logs.us-east-1.amazonaws.com: http_do=0, HTTP Status: 400
[2022/11/14 18:58:46] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] PutLogEvents http status=400
[2022/11/14 18:58:46] [ info] [output:cloudwatch_logs:cloudwatch_logs.3] Got DataAlreadyAcceptedException, a previous retry must have succeeded asychronously
[2022/11/14 18:58:46] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sent 2 events to CloudWatch
[2022/11/14 18:58:46] [debug] [out flush] cb_destroy coro_id=1
[2022/11/14 18:58:46] [debug] [socket] could not validate socket status for #55 (don't worry)
[2022/11/14 18:58:46] [ info] [engine] flush chunk '4910-1668452314.841010650.flb' succeeded at retry 1: task_id=4, input=tail.1 > output=cloudwatch_logs.3 (out_id=3)
[2022/11/14 18:58:46] [debug] [task] destroy task=0x7fecf5ebad90 (task_id=4)
[2022/11/14 18:58:53] [debug] [upstream] KA connection #75 to logs.us-east-1.amazonaws.com:443 has been disconnected by the remote service
[2022/11/14 18:58:53] [debug] [socket] could not validate socket status for #75 (don't worry)

Appreciate any help on this!!

Thanks!!

@PettitWesley
Contributor
