[Datadog output] Version 2.29.0 Causing Task to stop #491
Comments
Same for us. We retrieved this in our logs:
Oh, this made my day. Reverting to 2.28.4 helps.
Same error since 2.29.0. Platform: arm (Graviton). Quick fix: switch to the stable tag.
Same here, also on ECS/Fargate with datadog-agent 7.40.1.
Same problem, also on Fargate with datadog-agent.
We are experiencing the same problem. We are not using datadog-agent, but the Fluent Bit task seems to stop randomly after 15-60 minutes. Switching back to 2.28.4 did resolve the issue.
We are running into this issue as well; it seems to be related to fluent/fluent-bit#6512.
Got burned hard by this today swapping to
Same for us. We had an hour-long outage trying to diagnose this. Using
+1, also seeing this. Running it sidecar-style in the same task definition (Fargate) with the datadog-agent container.
+1. Any update?
Are you all facing this error? We are actively investigating this issue. In the meantime, please check out our relevant guides on high memory usage:
Are you all using the Fluent Bit datadog output? Can you please share your Fluent Bit configuration files?
Has someone from Datadog looked into this? Has anyone engaged the Datadog team?
I have sent a note to the Datadog public Slack channel - https://datadoghq.slack.com/archives/C8PV5LVDX/p1670625992901619 - for someone to take a look.
Can you share your ECS task definition pertaining to FireLens / Datadog log routing? This will help us isolate the problem.
We consume Fluent Bit via the CDK. This was the offending code in our case; it worked fine until we bumped to 2.29.0:

    this.taskDefinition.addFirelensLogRouter('LogRouter', {
      image: ecs.obtainDefaultFluentBitECRImage(this.taskDefinition),
      essential: true,
      memoryLimitMiB: 256,
      firelensConfig: {
        type: ecs.FirelensLogRouterType.FLUENTBIT,
        options: {
          enableECSLogMetadata: true,
          configFileType: ecs.FirelensConfigFileType.FILE,
          configFileValue: '/fluent-bit/configs/parse-json.conf',
        },
      },
      logging: new ecs.FireLensLogDriver({
        options: {
          Name: 'datadog',
          Host: 'http-intake.logs.datadoghq.com',
          dd_service: 'log-router',
          dd_source: 'fluentbit',
          dd_tags: `env:production`,
          dd_message_key: 'log',
          TLS: 'on',
          provider: 'ecs',
        },
        secretOptions: {
          apikey: this.datadogToken,
        },
      }),
    });
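For context, FireLens should translate awsfirelens options like these into an [OUTPUT] section for the core C datadog plugin. The sketch below is an approximation of that generated section, not captured from a running task; the Match pattern (based on a placeholder application container name) and the apikey placeholder are illustrative:

    [OUTPUT]
        Name           datadog
        Match          app-firelens-*
        Host           http-intake.logs.datadoghq.com
        TLS            on
        apikey         <injected from the apikey secret>
        dd_service     log-router
        dd_source      fluentbit
        dd_tags        env:production
        dd_message_key log
        provider       ecs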
This ticket title has been changed to reference Datadog specifically, but @besbes above mentioned seeing it without Datadog in the mix. Has there been more confirmation that it's only Datadog-related?
Could someone share the Fluent Bit configuration file that is in use when the crash occurs?
In our case it's the one that ships in the image:
Apologies, it's the weekend here in Australia; I'll talk to our DevOps about sharing the task definition on Monday.
Here's the ECS configuration that gave us problems:

    {
      "name": "LogRouter",
      "image": "906394416424.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.29.0",
      "cpu": 0,
      "memory": 256,
      "links": [],
      "portMappings": [],
      "essential": true,
      "entryPoint": [],
      "command": [],
      "environment": [],
      "environmentFiles": [],
      "mountPoints": [],
      "volumesFrom": [],
      "secrets": [],
      "user": "0",
      "dnsServers": [],
      "dnsSearchDomains": [],
      "extraHosts": [],
      "dockerSecurityOptions": [],
      "dockerLabels": {},
      "ulimits": [],
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Host": "http-intake.logs.datadoghq.com",
          "Name": "datadog",
          "TLS": "on",
          "dd_message_key": "log",
          "dd_service": "log-router",
          "dd_source": "fluentbit",
          "dd_tags": "env:production",
          "provider": "ecs"
        },
        "secretOptions": [
          {
            "name": "apikey",
            "valueFrom": "[REDACTED]"
          }
        ]
      },
      "systemControls": [],
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "config-file-type": "file",
          "config-file-value": "/fluent-bit/configs/parse-json.conf",
          "enable-ecs-log-metadata": "true"
        }
      }
    }
Changing the version back to 2.28.4 resolves the issue with no other modifications.
Thank you for the information, @paulrosania.
Unfortunately, since Fluent Bit routes our logs straight from ECS, I don't think we have a capture of the log contents from the time of the issue. 😞
AWS is actively working on reproducing this bug report. You can help us out by providing us with more information. We will try to fix it as quickly as possible, but please see: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#i-reported-an-issue-how-long-will-it-take-to-get-fixed
If possible, can anyone please share example log messages emitted by their apps when experiencing the issue?
I've put it on my list for next week :)
Ah sorry, I wasn't aware of that!
We are. I didn't realise that, thanks for correcting me! We've only just recently started dual-writing to Datadog, and none of the changes relating to that are in production yet, which is still using 2.29.0 directly. So we're only using the workaround I shared above in pre-production load tests at the moment. I'll make sure we don't inadvertently switch production back to the go plugin now that I know - thanks :)
@matthewfala Is there a
Oh, that's a good point - our service runs on Graviton 2 task instances, so I would also need arm64 images (I hadn't checked availability, but it sounds like there might not be any).
Thanks @PettitWesley - I watched https://www.youtube.com/watch?v=F73MgV_c2MM when I first set our service up with Fluent Bit and was researching the options. The performance section was very informative :) I just completely missed the significance of the names when I put that Dockerfile patch together. We are using the core C plugin, and for now we're using 2.28.4, which I realise means we don't benefit from fluent/fluent-bit#6339.
@dhpiggott For this reason we just put out a new release, which should be running the same datadog code as 2.28.4 but also includes the 2.29.0 cloudwatch_logs synchronous task scheduler fix that you mentioned: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.29.1 For everyone in this issue, please try out the new release and let us know your findings.
Please note that 2.29.1 and 2.30.0 have been officially released; they should be at least as stable as 2.28.4 for Datadog, but add the cloudwatch hang fix. Thank you @dhpiggott! I'll hopefully build an arm version of the prerelease 2.31.1 image today. So, in summary of all the versions, to help us understand the potential problems:
That would be great! At the moment we're a bit stuck with our service - we want to ship our changes that dual-write to Datadog and CloudWatch Logs. We're currently running 2.29.0 in prod and it writes to CloudWatch Logs great. The Datadog output fails if I try to enable it with 2.29.0 or 2.29.1, so we'd have to switch back to 2.28.4, but given the message about that having a CloudWatch hang issue, that doesn't seem safe either. So it sounds like the 2.31.1 prerelease is what we need. If the arm64 build passes all our tests once it's available, my plan will be to ship our dual-write change to prod using that.
@dhpiggott 2.30.0 should be at least as stable as 2.28.4 in terms of datadog, but more stable in terms of cloudwatch_logs. 2.30.0 has been released officially! Feel free to use that one in production while 2.31.1 is in testing. Thank you for waiting for the arm prerelease! My arm machine wasn't working, so I had to load up a new one. Here are the arm prerelease images:
The prerelease should be more stable than 2.28.4 in terms of datadog and cloudwatch_logs. Hope this helps.
@matthewfala I've run the following images for about 15 min each and have not seen the issue arise.
@aidant For now, if an official release image is needed, please use 2.30.0, as it is as stable as 2.28.4 for Datadog and resolves the cloudwatch hang issues. 2.31.1 will be coming out soon! I'll keep everyone posted here. If anyone else wants to help validate the 2.31.1 image and check it for segfaults in your workflow, it would be greatly appreciated.
@dhpiggott, @besbes, @tw-sarah, @atlantis, @paulrosania, @MikeWise01, would any of you be willing to help test the 2.31.1 prerelease image? We would like to be sure that new segfaults are not introduced in unexpected workflows. AWS is not actively maintaining the DataDog plugin except in rare circumstances, like the recent segfaults found in 2.28.4 and prior which heavily impact customers, so we don't have as many test cases for this plugin, and we want to make sure that it gets community validation across as many workflows as possible before release.
Hey @matthewfala - thanks for the arm64 build. I've been OoO today so I haven't had an opportunity to test it yet, but I do plan to try it next week when I'm back in.
I just tried 2.31.1 but saw segfaults within a minute of container startup. Unfortunately that's the only detail I saw in the Fluent Bit container logs:
@dhpiggott,
Just wanted to add my own results... We had the issue of Fargate Tasks recycling over and over because the 'aws-for-fluent-bit:2.29.0' container was marked as essential and would crash with the 'Cannot allocate memory' line. We are configured to send logs to DataDog. Just updated to 2.30.0 and all is well... container starts, forwards the logs, and Tasks have been healthy for over an hour.
@matthewfala I just tried 2.30.0 on ECS Fargate (no Datadog involved but we are running on Graviton) and got a segfault a few minutes after starting the container:
No other logs, unfortunately.
I tried 2.29.1 a couple of weeks ago but did get segfaults (#491 (comment)).
I think when I used 2.28.4 it ran OK, but I saw in #491 (comment) and #491 (comment) that you mentioned 2.28.4 has a hang issue with the CloudWatch Logs output. At the moment our service uses CloudWatch Logs as the source of truth, and we're in the process of transitioning to Datadog logs. We've never run 2.28.4 in production (prior to that we were using the awslogs Docker log driver directly), so I'm very hesitant to downgrade production from 2.29 (with Datadog disabled, since we don't depend on it yet) to 2.28.4 (which we've never run in prod). In other words, we need our CloudWatch Logs delivery to be solid until we switch to Datadog (which will happen gradually).
@besbes (and others) this issue is for segfaults associated with using the datadog output. If you are not using datadog and you see a crash, please open a new issue for that and give us your config, task def/pod yaml, etc.
@besbes We'll be releasing aws-for-fluent-bit v2.31.1 shortly, which should contain the most stable version of the Datadog output, as it has fixes for both of the recently identified (non-arm) segfaults related to the datadog plugin.
To update from my end, I've been trying each new release as I've seen them, and saw 2.31.2 was released yesterday. In all the tests I've done so far it works great! Hopefully we can go to prod with it next week.
And that fits my observations - our config has a CloudWatch and a Datadog output that both match all tags.
This is great to hear!
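In rough terms, that kind of dual-output setup looks like the sketch below; the region, log group, service, and tag values are placeholders rather than the actual configuration from this thread:

    [OUTPUT]
        Name                cloudwatch_logs
        Match               *
        region              us-west-2
        log_group_name      /ecs/my-service
        log_stream_prefix   firelens-
        auto_create_group   true

    [OUTPUT]
        Name                datadog
        Match               *
        Host                http-intake.logs.datadoghq.com
        TLS                 on
        apikey              <redacted>
        dd_service          my-service
        dd_source           fluentbit
        dd_tags             env:production
        provider            ecs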
Closing this old issue |
When updating to version 2.29.0 (previously 2.28.4) of aws-observability/aws-for-fluent-bit, we are observing one of our task definitions entering a cycle of provisioning and de-provisioning.
We are running ECS with Fargate and aws-observability/aws-for-fluent-bit plus datadog/agent version 7.40.1 as sidecars.
We have not had an opportunity to look into the cause of this. Hopefully, you can provide some insights into how we can debug this further. Our next steps will likely be to try the FLB_LOG_LEVEL=debug environment variable and report back.
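For reference, a minimal sketch of setting that variable on the FireLens container in the ECS task definition; the container name is assumed for illustration, and unrelated fields are omitted:

    {
      "name": "LogRouter",
      "image": "906394416424.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.29.0",
      "essential": true,
      "environment": [
        { "name": "FLB_LOG_LEVEL", "value": "debug" }
      ],
      "firelensConfiguration": {
        "type": "fluentbit"
      }
    }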