
Sporadic log error once switch to cloudwatch_logs #135

Closed
bowliang opened this issue Jan 15, 2021 · 10 comments

@bowliang

Hi,

I recently switched to using cloudwatch_logs in my project, and I've found that I sporadically get the following errors in the FireLens container logs.

2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [error] [tls] SSL error: SSL - The peer notified us that the connection is going to be closed
2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [error] [http_client] broken connection to logs.us-west-2.amazonaws.com:443 ?
2021-01-15 05:32:55.016,"[2021/01/15 05:32:55] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0"
2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [debug] [upstream] KA connection #61 to logs.us-west-2.amazonaws.com:443 is now available
2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
2021-01-15 05:32:55.016,[2021/01/15 05:32:55] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send events
2021/01/15 04:47:44,[2021/01/15 04:47:44] [error] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents API responded with error='ThrottlingException', message='Rate exceeded'

My hunch is that this doesn't matter much, since my containers keep running and the output plugin will retry anyway. But I want to make sure, and to see whether I need to tune some parameters. Also, the fact that the log level is [error] is concerning to me.

For context, I get the image from 906394416424.dkr.ecr.us-east-1.amazonaws.com/aws-for-fluent-bit:latest and my config looks like this:

[INPUT]
    Name             tcp
    Tag              ApplicationLogs
    Listen           0.0.0.0
    Port             5170
    Chunk_Size       256
    Buffer_Size      256
    Format           None
    storage.type     filesystem
    Mem_Buf_Limit    800MB

[OUTPUT]
    Name               cloudwatch_logs
    Match              ApplicationLogs
    region             ${LOG_REGION}
    log_group_name     ${SERVICE_NAME}/application_log
    log_stream_name    application_log_${LOG_STREAM_SUFFIX}
    log_key            log
    auto_create_group  true

Thanks!

@PettitWesley
Contributor

I would ignore the SSL errors; I get those occasionally too. Fluent Bit unfortunately has a custom HTTP client, and I think for these connection-dropped errors other clients would just be silent and retry. I haven't seen evidence that occasional error messages like that are a sign of anything actually abnormal going on.

The throttling exception though is something you should fix.

Check the logs some more; there should be an info message every flush that tells you how many events were sent in that batch. How many logs are you sending per Fluent Bit instance, i.e. what is your rough rate? How many Fluent Bit instances do you have, and do they send to unique log streams or to the same log stream? Based on your config it looks like they might all send to the same log stream. CloudWatch has ingestion limits per log stream: 5 requests per second.
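
If those per-flush messages aren't visible, one way to surface them is to raise the log level in the Service section (a minimal sketch; debug is verbose, so it's best left on only while investigating):

[SERVICE]
    Log_Level debug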

@bowliang
Author

Thanks for your quick response. Here are the log messages I found around that error.

2021-01-18 17:45:05.594,[2021/01/18 17:45:05] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending log events to log stream service_log_1e0f59e8d59644d8a0ca1d86ea7faad0
2021-01-18 17:45:05.594,[2021/01/18 17:45:05] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending 47 events
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [debug] [task] destroy task=0x7fdfa2e518c0 (task_id=0)
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send events
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send log events
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send log events
2021-01-18 17:45:05.599,"[2021/01/18 17:45:05] [error] [output:cloudwatch_logs:cloudwatch_logs.2] PutLogEvents API responded with error='ThrottlingException', message='Rate exceeded for logStreamName service_log_1e0f59e8d59644d8a0ca1d86ea7faad0'"
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] PutLogEvents http status=400
2021-01-18 17:45:05.599,"[2021/01/18 17:45:05] [debug] [aws_client] logs.us-east-1.amazonaws.com: http_do=0, HTTP Status: 400"
2021-01-18 17:45:05.599,[2021/01/18 17:45:05] [debug] [http_client] server logs.us-east-1.amazonaws.com:443 will close connection #129
2021-01-18 17:45:05.600,"[2021/01/18 17:45:05] [ warn] [engine] failed to flush chunk '1-1610991904.642253673.flb', retry in 8 seconds: task_id=1, input=tail.4 > output=cloudwatch_logs.2"
2021-01-18 17:45:05.600,[2021/01/18 17:45:05] [debug] [retry] new retry created for task_id=1 attempts=1

2021-01-18 17:45:13.548,[2021/01/18 17:45:13] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending log events to log stream service_log_1e0f59e8d59644d8a0ca1d86ea7faad0
2021-01-18 17:45:13.548,[2021/01/18 17:45:13] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending 47 events
2021-01-18 17:45:13.633,[2021/01/18 17:45:13] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sent events to service_log_1e0f59e8d59644d8a0ca1d86ea7faad0
2021-01-18 17:45:13.633,[2021/01/18 17:45:13] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] PutLogEvents http status=200

It looks like it works after the retry, and I see this behavior once every 20 minutes or so. I'm sending logs to various log streams based on application_log_${LOG_STREAM_SUFFIX}, where ${LOG_STREAM_SUFFIX} is extracted from the Fargate task ARN and corresponds to the task ID. So basically, logs from the same task are sent to the same log stream, and different tasks have different log streams.

Any suggestions on how to improve this? Or should the log level be adjusted to warning? Thanks!

@PettitWesley
Contributor

The main advice to prevent throttling issues is to send to a unique log stream per task. If you are already doing that, then there is one more, lower-impact recommendation I can give:

Experiment with increasing the Flush setting in the Service section to be greater than 5 seconds (the default); this might increase the number of events sent per flush, which could decrease the total number of calls to CloudWatch:

[Service]
    Flush 10

https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/configuration-file

@bowliang
Author

Thanks, I'll play with the Flush parameter. For now I've used Flush 1, as suggested in your post: https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/

Does increasing the flush time period also increase the possibility of log loss?

@PettitWesley
Contributor

Does increasing the flush time period also increase the possibility of log loss?

@BowenRaymone It does slightly increase the risk, yes. If Fluent Bit suddenly quits, there will be more logs still sitting in the buffer: if it crashes, more will be lost; if it exits gracefully, there's a greater risk it won't be able to send everything in time. Realistically, graceful shutdown is the most common case, and the 30-second shutdown window ECS gives is enough time to send a lot of logs, as shown in that post.
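
One related knob, separate from Flush: if I'm reading the config right, the [INPUT] section sets storage.type filesystem, but chunks only get persisted to disk when a storage.path is also set in the service section; without it the input falls back to in-memory buffering. A minimal sketch (the path here is just an assumption for illustration):

[SERVICE]
    Flush           10
    # filesystem buffering needs storage.path; this path is an example
    storage.path    /var/log/flb-storage/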

And with the new cloudwatch_logs plugin, you can send logs a bit faster, so I bet that post's numbers don't fully apply, especially once Fluent Bit 1.7 comes out: it will have a multi-worker/multithreaded mode that only core plugins will be able to take advantage of.
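
For reference, once 1.7 is out, the multi-worker mode is expected to be enabled per output with a workers property, roughly like this (a sketch; check the exact property and a sensible value against the 1.7 docs when it lands):

[OUTPUT]
    Name       cloudwatch_logs
    Match      ApplicationLogs
    region     ${LOG_REGION}
    workers    1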

@bowliang
Author

I see, that makes sense. I'll try increasing the flush time and see if it resolves the throttling issue. Thanks for your advice!

@lgg42

lgg42 commented Apr 26, 2021

I see, that makes sense. I'll try increasing the flush time and see if it resolves the throttling issue. Thanks for your advice!

Hello! I was experiencing a very similar issue to yours and I'd like to know if the flush time increase did the trick?

Thanks.

@k-vit

k-vit commented Nov 30, 2021

Hello! I was experiencing a very similar issue to yours and I'd like to know if the flush time increase did the trick?

Worked for me

@Sangeethavisa

Hi All,

I'm facing a similar issue with forwarding my application logs to CloudWatch. It does send logs, but after some time the connection gets lost between the container and CloudWatch. Flush is already set to 10 in the fluentbit.conf file. Is there anything I should check?

[output:cloudwatch_logs:cloudwatch_logs.3] Sending log events to log stream from-us-east-1b-xray-indexer-service.log
[2022/11/14 18:58:46] [debug] [http_client] not using http_proxy for header
[2022/11/14 18:58:46] [debug] [aws_credentials] Retrieving credentials from the HTTP provider..
[2022/11/14 18:58:46] [debug] [http_client] server logs.us-east-1.amazonaws.com:443 will close connection #55
[2022/11/14 18:58:46] [debug] [aws_client] logs.us-east-1.amazonaws.com: http_do=0, HTTP Status: 400
[2022/11/14 18:58:46] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] PutLogEvents http status=400
[2022/11/14 18:58:46] [ info] [output:cloudwatch_logs:cloudwatch_logs.3] Got DataAlreadyAcceptedException, a previous retry must have succeeded asychronously
[2022/11/14 18:58:46] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sent 2 events to CloudWatch
[2022/11/14 18:58:46] [debug] [out flush] cb_destroy coro_id=1
[2022/11/14 18:58:46] [debug] [socket] could not validate socket status for #55 (don't worry)
[2022/11/14 18:58:46] [ info] [engine] flush chunk '4910-1668452314.841010650.flb' succeeded at retry 1: task_id=4, input=tail.1 > output=cloudwatch_logs.3 (out_id=3)
[2022/11/14 18:58:46] [debug] [task] destroy task=0x7fecf5ebad90 (task_id=4)
[2022/11/14 18:58:53] [debug] [upstream] KA connection #75 to logs.us-east-1.amazonaws.com:443 has been disconnected by the remote service
[2022/11/14 18:58:53] [debug] [socket] could not validate socket status for #75 (don't worry)

Appreciate any help on this!!

Thanks!!

@PettitWesley
Contributor
