Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: ECS Fluent Bit Daemon Collector #499

Closed

Conversation

PettitWesley
Copy link
Contributor

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@PettitWesley PettitWesley requested a review from a team as a code owner December 15, 2022 00:02
@PettitWesley PettitWesley marked this pull request as draft December 15, 2022 00:02
region ${REGION_TASK}
upload_timeout 2m
total_file_size 10M
s3_key_format /aws/ecs/${CLUSTER_NAME}/$TAG[0]/$TAG[1]/$TAG[2]/$TAG[3]/$TAG[4]/%Y/%m/%d/%H/%M/%S
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matt: do we really want pseudo directories with / in the timestamp. May be only do it do day, not below that.

region ${REGION_TASK}
upload_timeout 2m
total_file_size 10M
s3_key_format /aws/ecs/${CLUSTER_NAME}/task/failed-to-retrieve-metadata-for/$TAG/%Y/%m/%d/%H/%M/%S
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I have two rewrite_tag filters, one for the fallback logs, I could still add instance ID probably to them, to make it still somewhat possible to figure where the failed logs came from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update timestamp here as well

log_group_name /aws/ecs/${CLUSTER_NAME}/application/fallback-group-for-failed-metadata-templating
# fallback stream to log_stream_prefix + tag (log file name) if templating fails
log_stream_prefix json-log-driver-file-
region ${REGION_TASK}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bingfeng: So this enables cross region log, but what if customers want cross account log?

wesley: That would require a different output file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cross account requires you to add a new parameter in teh config:

log_stream_prefix      json-log-driver-file-
region                 ${REGION_TASK}
role_arn              ${ROLE_ARN_TASK}

which then can't be empty or it will fail validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can use the EKS helm chart as an inspiration:

log_stream_prefix      json-log-driver-file-
region                 ${REGION_TASK}
${EXTRA_CLOUDWATCH_TASK}

its safe for EXTRA_CLOUDWATCH_TASK

Need to check if this really works.

set EXTRA_CLOUDWATCH_TASK=role_arn my-role`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to think about the experience

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log_stream_prefix      json-log-driver-file-
region                 ${REGION_TASK}
${OPTIONAL_ROLE_ARN}
${OPTIONAL_LOG_KEY}
${OPTIONAL_LOG_RETENTION}

all of these I THINK but need to check, can be empty and it won't break.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically init templating of config

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Templating Documentation

One option discussed in today's meeting which I can put here for documentation is as follows:
Have a single environment variable called "init-config" which can be set to a JSON string. JSON is great because any JSON can be reduced to a one line string, thus one environment variable.

May take some more thinking to settle on a good JSON schema but the following would work

Configuration JSON

{
    /* These are global configuration key value pairs */
    "config": {
        "param1": "val"
    },
    "options": [
         /* Here are a list of options that the customer can configure separately. Name describes the option they are opting into, config is the option's setting */
         {
            "name": "host",
            /* Each config has a default setting, so config is completely optional */
            "config": {
                "destinations": ["s3", "cloudwatch"]
            }
         },
         {
            "name": "host",
            "config": {
                "destination": "cloudwatch",
                "region": "auto" /* Each config gets translated into a template model which gets translated to the template. Non trivial config options should be considered */
            }
         }
    ]
}

Fluent Bit Config Generators

Instead of having init process stitch together raw config files, we introduce a concept of data models and config file templates.

A process on startup parses the JSON config and generates a set of configuration files which are used to startup fluent bit (similar to the init process but with an added step of generating the config files from template)

Templates
We can use a widely used templating engine to generate our fluent bit config files adaptively from our config file.
I recommend Mustache render https://mustache.github.io/
It's great, super flexible and allows for conditional and looped blocks of template content which will no doubt come in handy for some future features. See Demo: https://mustache.github.io/#demo

Data Models
A mustache template needs a data model to render into it's result. Datamodels are just object. We can generate a Data Model for each template.

Generator Structure
This could be how we store the template rendering logic and organize our generator scripts

{
    "options": [
        /* template providers for the options we can enable */
        {
            "name": "<OptionName>" /* this option name corresponds to the options that can be enabled */
            "synthesizeDataModel": function(globalConfig, optionConfig) => someObject, /* a function that makes our template data model */
            "templates": [
                /* A list of templates that get processed with our data model and sent to the output destination conf file */
                {
                    "path": "/anyPathToOurTemplate",
                    "destination": "someDestinationConfigFile" /* append to destination file, maybe /parser.conf or /fluent-bit.conf */
                },
               ...
            ]
        }
    ]
}

(I highly recommend Typescript, because I love TS and could code this efficiently, but adding node to our image could introduce security scan vulnerabilities that we may have to patch which is unfortunate)

Note: I actually don't like the proposed schemas and structures. It's just a general idea of some thoughts (initial thoughts).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just having the log_key option, either via this or just another config file, for the task logs, is something I kind of want to have even just for the first launch 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doing the log_key log thing is what a lot of users want for their container logs. They don't want them as JSONs.

@@ -0,0 +1,12 @@
[INPUT]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep forgetting my own guidance here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#tail-input-duplicates-logs-because-of-rotation

I need to double check and consider all of these. For each, there's a trade-off:

  1. If we match rotated log files, then we will get duplicates
  2. If we do not match all of the existing rotated log files at the time when Fluent Bit starts, then we won't get all of those old logs. Probably this is okay, but it'd be cool if there was a solution for it to get all the old logs. The ECS log collector script will grab all old logs IIRC: https://github.com/aws/amazon-ecs-logs-collector

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is something to just note in the debugging/DIY part of the guide

Name tail
Tag host.*
Path /var/log/dmesg*
Parser syslog
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for this daemon service template, putting the dmesg log inside of a key called "log" is all we want... so just removing the parser here.

@PettitWesley PettitWesley force-pushed the ecs-fluent-bit-daemon-collector branch from af78f68 to 4901138 Compare January 4, 2023 01:49
Comment on lines 5 to 8
Docker_Mode On
Docker_Mode_Flush 5
Docker_Mode_Parser container_firstline
Parser docker
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I note here, this is the old parser config, we can/should use the new one which supports docker: aws-samples/amazon-cloudwatch-container-insights#115 (comment)

[INPUT]
Name tail
Tag ecs.*
Path /var/log/ecs/ecs-agent.log*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted elsewhere, need to not match rotated agent log files

Name tail
Tag host.*
Path /var/log/messages
Parser syslog
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Check if this parser is correct and actually most ideal for all of these logs

aws-samples/amazon-cloudwatch-container-insights#108 (comment)

@@ -0,0 +1,23 @@
[PARSER]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: I think we can just use the built-in docker parser. Only syslog may be needed here. Testing this.

"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/aws/ecs/{INSERT CLUSTER_NAME}/collector",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and in all other templates

(Assuming ECS accepts this option and doesn't validate it against a set of options that is old and doesn't include it...)

the new `multiline.parsers docker` does everything we need
I checked that it handles logs split by runtime at the 16KB limit
and also that it parses timestamps correctly from the docker log file

Signed-off-by: Wesley Pettit <[email protected]>
{
"sourceVolume": "fluent-bit-buffer-directory",
"containerPath": "/var/fluent-bit",
"readOnly": true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to fix this everywhere, the buffer and state dirs should not be readonly.

"awslogs-group": "/aws/ecs/{INSERT CLUSTER_NAME}/collector",
"awslogs-region": "{INSERT REGION}",
"awslogs-create-group": "true",
"awslogs-stream-prefix": "fluent-bit-stdout-stderr-"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html

prefix-name/container-name/ecs-task-id

So I need to remove the dash here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and may be just make he prefix be stdout-stderr since container name is fluent-bit-daemon-collector

"options": {
"awslogs-group": "/aws/ecs/{INSERT CLUSTER_NAME}/collector",
"awslogs-region": "{INSERT REGION}",
"awslogs-create-group": "true",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should remove this/set it to false. Log Groups are normally managed by infra as code outside of tasks so that they can be easily cleaned up.

Also, the AmazonECSTaskExecutionRolePolicy does not have create group permissions so if I include this then I require customers to have more perms on their execution or instance role.

Instead, log group can be created with AWS CLI or in the CFN template.

"awslogs-stream-prefix": "fluent-bit-stdout-stderr-"
}
},
"environment": [
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add an env var here for IMDS version used by AWS Filter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since we use host mode, its always safe to use V2 with default hop limit of one, so will just use that. No need to worry about this. V2 is always enabled by default now.

[FILTER]
Name aws
Match task.var.lib.docker.containers.*
imds_version v1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all config files, IMDS version used by AWS filters should be configured by the same single env var setting which defaults to v1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since we use host mode, its always safe to use V2 with default hop limit of one, so will just use that. No need to worry about this. V2 is always enabled by default now.

log_group_name /aws/ecs/${CLUSTER_NAME}/application/fallback-group-for-failed-metadata-templating
# fallback stream to log_stream_prefix + tag (log file name) if templating fails
log_stream_prefix json-log-driver-file-
region ${REGION_TASK}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be LOG_REGION_X or all just LOG_REGION

[INPUT]
Name tail
Tag app.*
Path /var/lib/docker/containers/*/*.log
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested that if I add:

    Exclude_Path        *container-cached*

This will ignore the container cache log files, so we only pick up containers that explicitly configure json-file log driver.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except, I saw on my instance that the ECS Agent seems to somehow have non-cache log files, I am not sure how this happened:

Screen Shot 2023-01-21 at 9 28 18 PM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shows that it does indeed work for real cache files: see the excluded debug message:
Screen Shot 2023-01-21 at 9 33 27 PM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a simplified test where I scanned one container sub-directory on my test instance ^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the agent thing on another instance, it seems that the agent typically has json file log driver files. This is strange and unexpected.

I should look into this/ask ECS Agent team if this is on purpose. I don't think it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[FILTER]
Name aws
Match ecs.*
imds_version v1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: IMDS version needs to be an env var everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since we use host mode, its always safe to use V2 with default hop limit of one, so will just use that. No need to worry about this. V2 is always enabled by default now.

@sparrc sparrc closed this Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants