WIP: ECS Fluent Bit Daemon Collector #499
Conversation
region ${REGION_TASK}
upload_timeout 2m
total_file_size 10M
s3_key_format /aws/ecs/${CLUSTER_NAME}/$TAG[0]/$TAG[1]/$TAG[2]/$TAG[3]/$TAG[4]/%Y/%m/%d/%H/%M/%S
Matt: do we really want pseudo-directories with / in the timestamp? Maybe only go down to the day, not below that.
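A sketch of what day-only granularity could look like. The [OUTPUT] header and Match pattern here are assumed for context; only the key-format and tuning lines come from the diff above:

```
[OUTPUT]
    Name             s3
    Match            task.*
    region           ${REGION_TASK}
    upload_timeout   2m
    total_file_size  10M
    # Stop at day granularity to avoid deep per-second pseudo-directories
    s3_key_format    /aws/ecs/${CLUSTER_NAME}/$TAG[0]/$TAG[1]/$TAG[2]/$TAG[3]/$TAG[4]/%Y/%m/%d/
```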
region ${REGION_TASK}
upload_timeout 2m
total_file_size 10M
s3_key_format /aws/ecs/${CLUSTER_NAME}/task/failed-to-retrieve-metadata-for/$TAG/%Y/%m/%d/%H/%M/%S
If I have two rewrite_tag filters, one for the fallback logs, I could still add the instance ID to them, to make it still somewhat possible to figure out where the failed logs came from.
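A rough, untested sketch of that fallback idea, assuming the aws filter has already injected an ec2_instance_id key into the record (the Match pattern and tag layout are illustrative only):

```
[FILTER]
    Name   rewrite_tag
    Match  task.*
    # Fold the instance ID into the fallback tag so failed logs stay traceable
    Rule   $ec2_instance_id ^(.+)$ fallback.$ec2_instance_id.$TAG false
```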
Need to update timestamp here as well
log_group_name /aws/ecs/${CLUSTER_NAME}/application/fallback-group-for-failed-metadata-templating
# fallback stream to log_stream_prefix + tag (log file name) if templating fails
log_stream_prefix json-log-driver-file-
region ${REGION_TASK}
bingfeng: So this enables cross-region logging, but what if customers want cross-account logging?
wesley: That would require a different output file.
Cross-account requires you to add a new parameter in the config:
log_stream_prefix json-log-driver-file-
region ${REGION_TASK}
role_arn ${ROLE_ARN_TASK}
which then can't be empty or it will fail validation.
We can use the EKS Helm chart as inspiration:
log_stream_prefix json-log-driver-file-
region ${REGION_TASK}
${EXTRA_CLOUDWATCH_TASK}
It's safe for EXTRA_CLOUDWATCH_TASK to be empty (need to check if this really works). For cross-account, set:
EXTRA_CLOUDWATCH_TASK=`role_arn my-role`
Need to think about the user experience.
log_stream_prefix json-log-driver-file-
region ${REGION_TASK}
${OPTIONAL_ROLE_ARN}
${OPTIONAL_LOG_KEY}
${OPTIONAL_LOG_RETENTION}
All of these, I THINK (but need to check), can be empty and it won't break.
need to check.
Basically, init-time templating of the config.
Templating Documentation
One option discussed in today's meeting, which I can put here for documentation, is as follows:
Have a single environment variable called "init-config" which can be set to a JSON string. JSON is great because any JSON can be reduced to a one-line string, and thus one environment variable.
It may take some more thinking to settle on a good JSON schema, but the following would work:
Configuration JSON
{
/* These are global configuration key value pairs */
"config": {
"param1": "val"
},
"options": [
/* Here is a list of options that the customer can configure separately. Name describes the option they are opting into; config is the option's settings */
{
"name": "host",
/* Each config has a default setting, so config is completely optional */
"config": {
"destinations": ["s3", "cloudwatch"]
}
},
{
"name": "host",
"config": {
"destination": "cloudwatch",
"region": "auto" /* Each config gets translated into a template model which gets translated to the template. Non trivial config options should be considered */
}
}
]
}
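To make the single-env-var idea concrete, here is a minimal sketch of how a startup process could split that JSON into global config and per-option settings. The variable name and schema are assumptions taken from the proposal above, not a finalized interface:

```python
import json

def parse_init_config(raw: str):
    """Split the proposed "init-config" JSON string into
    (global config, options). Schema is the draft from the discussion."""
    parsed = json.loads(raw)
    return parsed.get("config", {}), parsed.get("options", [])

# Any JSON collapses to one line, so it fits in one environment variable:
example = '{"config": {"param1": "val"}, "options": [{"name": "host"}]}'
global_config, options = parse_init_config(example)
for option in options:
    # "config" on each option is optional; fall back to the defaults
    settings = option.get("config", {})
    print(option["name"], settings)
```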
Fluent Bit Config Generators
Instead of having the init process stitch together raw config files, we introduce the concept of data models and config file templates.
On startup, a process parses the JSON config and generates the set of configuration files used to start Fluent Bit (similar to the init process, but with the added step of generating the config files from templates).
Templates
We can use a widely used templating engine to generate our Fluent Bit config files adaptively from our config file.
I recommend Mustache: https://mustache.github.io/
It's flexible and allows for conditional and looped blocks of template content, which will no doubt come in handy for future features. See the demo: https://mustache.github.io/#demo
Data Models
A Mustache template needs a data model to render into its result. Data models are just objects. We can generate a data model for each template.
Generator Structure
This could be how we store the template rendering logic and organize our generator scripts
{
"options": [
/* template providers for the options we can enable */
{
"name": "<OptionName>" /* this option name corresponds to the options that can be enabled */
"synthesizeDataModel": function(globalConfig, optionConfig) => someObject, /* a function that makes our template data model */
"templates": [
/* A list of templates that get processed with our data model and sent to the output destination conf file */
{
"path": "/anyPathToOurTemplate",
"destination": "someDestinationConfigFile" /* append to destination file, maybe /parser.conf or /fluent-bit.conf */
},
...
]
}
]
}
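The generator loop described above could be sketched like this. All names here (synthesize_data_model, the template body, destinations) are illustrative, and str.format_map stands in for a real Mustache render:

```python
import os
import tempfile

def synthesize_data_model(global_config, option_config):
    """Merge global and per-option settings into one template data model."""
    model = dict(global_config)
    model.update(option_config)
    return model

def generate(options, enabled, global_config, out_dir):
    for option in options:
        if option["name"] not in enabled:
            continue
        model = synthesize_data_model(global_config, option.get("config", {}))
        for tpl in option["templates"]:
            # A real generator would run a Mustache render here;
            # str.format_map stands in for it in this sketch.
            rendered = tpl["body"].format_map(model)
            dest = os.path.join(out_dir, tpl["destination"])
            with open(dest, "a") as f:  # append to the destination conf file
                f.write(rendered)

options = [{
    "name": "host",
    "templates": [{"body": "[OUTPUT]\n    region {region}\n",
                   "destination": "fluent-bit.conf"}],
}]
out_dir = tempfile.mkdtemp()
generate(options, {"host"}, {"region": "us-west-2"}, out_dir)
with open(os.path.join(out_dir, "fluent-bit.conf")) as f:
    content = f.read()
print(content)
```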
(I highly recommend TypeScript, because I love TS and could code this efficiently, but adding Node to our image could introduce security-scan vulnerabilities that we may have to patch, which is unfortunate.)
Note: I actually don't like the proposed schemas and structures. It's just a general idea of some thoughts (initial thoughts).
Just having the log_key option, either via this or just another config file, for the task logs, is something I kind of want to have even just for the first launch 🤔
I think the log_key log thing is what a lot of users want for their container logs. They don't want them as JSON.
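What that could look like on the task output (the surrounding lines are assumed from the snippets elsewhere in this PR; only log_key is the new part):

```
[OUTPUT]
    Name               cloudwatch_logs
    Match              task.*
    region             ${REGION_TASK}
    log_group_name     /aws/ecs/${CLUSTER_NAME}/application
    log_stream_prefix  json-log-driver-file-
    # Emit only the raw log line instead of the full JSON record
    log_key            log
```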
@@ -0,0 +1,12 @@
[INPUT]
I keep forgetting my own guidance here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#tail-input-duplicates-logs-because-of-rotation
I need to double check and consider all of these. For each, there's a trade-off:
- If we match rotated log files, then we will get duplicates
- If we do not match all of the existing rotated log files at the time when Fluent Bit starts, then we won't get all of those old logs. Probably this is okay, but it'd be cool if there was a solution for it to get all the old logs. The ECS log collector script will grab all old logs IIRC: https://github.com/aws/amazon-ecs-logs-collector
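Taking the first horn of the trade-off above, one sketch is to tail only the live file and accept losing pre-rotation logs (paths assumed from the agent config later in this PR):

```
[INPUT]
    Name  tail
    Tag   ecs.*
    # No trailing glob: rotated files (ecs-agent.log.*) are never matched,
    # so no duplicates, at the cost of missing logs rotated before startup
    Path  /var/log/ecs/ecs-agent.log
```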
I think this is something to just note in the debugging/DIY part of the guide
ecs/daemon/inputs/host/all-host.conf
Name tail
Tag host.*
Path /var/log/dmesg*
Parser syslog
My comment here is applicable: aws-samples/amazon-cloudwatch-container-insights#108 (comment)
I think for this daemon service template, putting the dmesg log inside of a key called "log" is all we want... so just removing the parser here.
Signed-off-by: Wesley Pettit <[email protected]>
Force-pushed from af78f68 to 4901138
ecs/daemon/inputs/task.conf
Docker_Mode On
Docker_Mode_Flush 5
Docker_Mode_Parser container_firstline
Parser docker
As I note here, this is the old parser config, we can/should use the new one which supports docker: aws-samples/amazon-cloudwatch-container-insights#115 (comment)
ecs/daemon/inputs/ecs/agent.conf
[INPUT]
Name tail
Tag ecs.*
Path /var/log/ecs/ecs-agent.log*
As noted elsewhere, need to not match rotated agent log files
…uplicates Signed-off-by: Wesley Pettit <[email protected]>
Name tail
Tag host.*
Path /var/log/messages
Parser syslog
TODO: Check if this parser is correct and actually most ideal for all of these logs
aws-samples/amazon-cloudwatch-container-insights#108 (comment)
Signed-off-by: Wesley Pettit <[email protected]>
@@ -0,0 +1,23 @@
[PARSER]
TODO: I think we can just use the built-in docker parser. Only syslog may be needed here. Testing this.
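If the built-in parser covers it, the tail input could lean on the built-in multiline docker parser and the custom parser file could shrink to just syslog. A sketch (path and tag assumed from the other inputs in this PR):

```
[INPUT]
    Name              tail
    Tag               task.*
    Path              /var/lib/docker/containers/*/*.log
    # Built-in multiline parser: decodes docker JSON log lines and rejoins
    # entries split by the runtime at the 16KB limit
    multiline.parser  docker
```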
"logConfiguration": { | ||
"logDriver": "awslogs", | ||
"options": { | ||
"awslogs-group": "/aws/ecs/{INSERT CLUSTER_NAME}/collector", |
should add cache-disabled true here: https://docs.docker.com/config/containers/logging/dual-logging/
and in all other templates
(Assuming ECS accepts this option and doesn't validate it against an outdated list of options that doesn't include it...)
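With the dual-logging cache turned off, the logConfiguration could look like the following. This merges a couple of the suggestions in this thread (no create-group, no trailing dash on the prefix), so treat it as a sketch rather than the final template:

```json
"logConfiguration": {
    "logDriver": "awslogs",
    "options": {
        "awslogs-group": "/aws/ecs/{INSERT CLUSTER_NAME}/collector",
        "awslogs-region": "{INSERT REGION}",
        "awslogs-stream-prefix": "stdout-stderr",
        "cache-disabled": "true"
    }
}
```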
The new `multiline.parser docker` does everything we need. I checked that it handles logs split by the runtime at the 16KB limit, and also that it parses timestamps correctly from the Docker log file. Signed-off-by: Wesley Pettit <[email protected]>
{
"sourceVolume": "fluent-bit-buffer-directory",
"containerPath": "/var/fluent-bit",
"readOnly": true
Need to fix this everywhere; the buffer and state dirs should not be read-only.
"awslogs-group": "/aws/ecs/{INSERT CLUSTER_NAME}/collector", | ||
"awslogs-region": "{INSERT REGION}", | ||
"awslogs-create-group": "true", | ||
"awslogs-stream-prefix": "fluent-bit-stdout-stderr-" |
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html
prefix-name/container-name/ecs-task-id
So I need to remove the dash here
And maybe just make the prefix stdout-stderr, since the container name is fluent-bit-daemon-collector.
"options": { | ||
"awslogs-group": "/aws/ecs/{INSERT CLUSTER_NAME}/collector", | ||
"awslogs-region": "{INSERT REGION}", | ||
"awslogs-create-group": "true", |
I should remove this / set it to false. Log groups are normally managed by infrastructure-as-code outside of tasks so that they can be easily cleaned up.
Also, the AmazonECSTaskExecutionRolePolicy does not have create-group permissions, so if I include this then I require customers to have more permissions on their execution or instance role.
Instead, the log group can be created with the AWS CLI or in the CFN template.
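For example, the log group could be declared alongside the service in the CFN template. The resource name, parameter, and retention value here are illustrative, not part of this PR:

```yaml
CollectorLogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: !Sub /aws/ecs/${ClusterName}/collector
    # Retention is a placeholder; pick whatever the customer needs
    RetentionInDays: 30
```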
"awslogs-stream-prefix": "fluent-bit-stdout-stderr-" | ||
} | ||
}, | ||
"environment": [ |
Need to add an env var here for IMDS version used by AWS Filter.
Actually, since we use host mode, it's always safe to use V2 with the default hop limit of one, so I'll just use that. No need to worry about this. V2 is always enabled by default now.
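Hard-coding v2 per that reasoning would look like this (Match pattern taken from the filter below; the rest is a sketch):

```
[FILTER]
    Name          aws
    Match         task.var.lib.docker.containers.*
    # Host network mode: IMDSv2 works with the default hop limit of 1,
    # so no env var is needed
    imds_version  v2
```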
[FILTER]
Name aws
Match task.var.lib.docker.containers.*
imds_version v1
In all config files, IMDS version used by AWS filters should be configured by the same single env var setting which defaults to v1.
Actually, since we use host mode, it's always safe to use V2 with the default hop limit of one, so I'll just use that. No need to worry about this. V2 is always enabled by default now.
log_group_name /aws/ecs/${CLUSTER_NAME}/application/fallback-group-for-failed-metadata-templating
# fallback stream to log_stream_prefix + tag (log file name) if templating fails
log_stream_prefix json-log-driver-file-
region ${REGION_TASK}
I think these should be LOG_REGION_X or all just LOG_REGION
[INPUT]
Name tail
Tag app.*
Path /var/lib/docker/containers/*/*.log
I've tested that if I add:
Exclude_Path *container-cached*
this will ignore the container cache log files, so we only pick up containers that explicitly configure the json-file log driver.
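Put into the input above, that exclusion would look like the following (the tested part is Exclude_Path; the surrounding lines are taken from the diff):

```
[INPUT]
    Name          tail
    Tag           app.*
    Path          /var/lib/docker/containers/*/*.log
    # Skip dual-logging cache files so only containers that explicitly
    # configure the json-file log driver are picked up
    Exclude_Path  *container-cached*
```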
I did a simplified test where I scanned one container sub-directory on my test instance ^
I checked the agent thing on another instance; it seems that the agent typically has json-file log driver files. This is strange and unexpected.
I should look into this / ask the ECS Agent team if it's on purpose. I don't think it is.
[FILTER]
Name aws
Match ecs.*
imds_version v1
TODO: IMDS version needs to be an env var everywhere.
Actually, since we use host mode, it's always safe to use V2 with the default hop limit of one, so I'll just use that. No need to worry about this. V2 is always enabled by default now.
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.