Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter_ecs: fix multiple bugs and make retries configurable #6589

Closed

Conversation

PettitWesley
Copy link
Contributor


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@PettitWesley
Copy link
Contributor Author

@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 07:49 — with GitHub Actions Inactive
{
FLB_CONFIG_MAP_INT, "agent_endpoint_retries", FLB_ECS_FILTER_METADATA_RETRIES,
0, FLB_TRUE, offsetof(struct flb_filter_ecs, agent_endpoint_retries),
"Number of retries for failed metadata requsts to ECS Agent Introspection "
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling

@PettitWesley PettitWesley force-pushed the ecs-filter-failed-tag-bug branch from 6f6b06f to e207bba Compare December 20, 2022 19:48
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 19:49 — with GitHub Actions Inactive
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 19:49 — with GitHub Actions Inactive
@matthewfala
Copy link
Contributor

matthewfala commented Dec 20, 2022

It appears that this code is segfaulting during unit tests:
https://github.com/fluent/fluent-bit/actions/runs/3738425166/jobs/6344498020

@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 20:04 — with GitHub Actions Inactive
Copy link
Contributor

@matthewfala matthewfala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config map default value must be of type string.

This PR's code currently segfaults here:

fluent-bit/src/flb_env.c

Lines 156 to 179 in eceeb5f

/*
* Given a 'value', lookup for variables, if found, return a new composed
* sds string.
*/
flb_sds_t flb_env_var_translate(struct flb_env *env, const char *value)
{
int i;
int len;
int v_len;
int e_len;
int pre_var;
int have_var = FLB_FALSE;
const char *env_var = NULL;
char *v_start = NULL;
char *v_end = NULL;
char tmp[64];
flb_sds_t buf;
flb_sds_t s;
if (!value) {
return NULL;
}
len = strlen(value);

@@ -1694,6 +1694,15 @@ static struct flb_config_map config_map[] = {
"Defaults to 51678"
},

{
FLB_CONFIG_MAP_INT, "agent_endpoint_retries", FLB_ECS_FILTER_METADATA_RETRIES,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config_map entry def_value must be a string

flb_sds_t def_value; /* default value */

See NightFall for an example

FLB_CONFIG_MAP_INT, "tls.debug", "0",

The string value may be autoconverted to one of the FLB_CONFIG_MAP_INT type here:

* As an additional step, it populate the 'value' field using the given default
* value if any. Note that default values are strings so they are processed
* to fit into the proper data type of 'value'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested edit: Revise FLB_ECS_FILTER_METADATA_RETRIES from 2 to "2"

#define FLB_ECS_FILTER_METADATA_RETRIES 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

@PettitWesley PettitWesley force-pushed the ecs-filter-failed-tag-bug branch from e207bba to cc79103 Compare December 20, 2022 20:56
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 20:57 — with GitHub Actions Inactive
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 20:57 — with GitHub Actions Inactive
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 21:17 — with GitHub Actions Inactive
@PettitWesley PettitWesley force-pushed the ecs-filter-failed-tag-bug branch from cc79103 to f0401f6 Compare December 20, 2022 23:05
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:06 — with GitHub Actions Inactive
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:06 — with GitHub Actions Inactive
@PettitWesley PettitWesley force-pushed the ecs-filter-failed-tag-bug branch from f0401f6 to c5e1a7c Compare December 20, 2022 23:09
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:09 — with GitHub Actions Inactive
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:09 — with GitHub Actions Inactive
@@ -1369,14 +1369,14 @@ static int is_tag_marked_failed(struct flb_filter_ecs *ctx,
const char *tag, int tag_len)
{
int ret;
int val = 0;
int *val = 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initializing to 0 instead of NULL...

@PettitWesley PettitWesley force-pushed the ecs-filter-failed-tag-bug branch from c5e1a7c to aa173e0 Compare December 20, 2022 23:11
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:12 — with GitHub Actions Inactive
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:12 — with GitHub Actions Inactive
@@ -1393,7 +1393,7 @@ static void mark_tag_failed(struct flb_filter_ecs *ctx,

ret = flb_hash_get(ctx->failed_metadata_request_tags,
tag, tag_len,
(void *) val, &val_size);
(void **) &val, &val_size);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code below is still buggy, need to alloc memory for hash table to copy over in all cases

@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:27 — with GitHub Actions Inactive
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:27 — with GitHub Actions Inactive
@PettitWesley PettitWesley force-pushed the ecs-filter-failed-tag-bug branch from c8a3490 to 0f05bce Compare December 20, 2022 23:29
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:31 — with GitHub Actions Inactive
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:31 — with GitHub Actions Inactive
@PettitWesley
Copy link
Contributor Author

Ran tests with valgrind and that's how I came up with the other fix... flb_hash doesn't work the way I thought it would/I didn't read that code closely enough:

[100%] Built target flb-it-stream_processor
[ec2-user@ip-10-192-11-106 build]$ valgrind ./bin/flb-rt-filter_ecs
==11335== Memcheck, a memory error detector
==11335== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==11335== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==11335== Command: ./bin/flb-rt-filter_ecs
==11335==
Test flb_test_ecs_filter_mark_tag_failed...     [2022/12/20 23:30:59] [ info] [fluent bit] version=1.9.10, commit=cc7910323e, pid=11341
[2022/12/20 23:30:59] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/12/20 23:30:59] [ info] [cmetrics] version=0.3.7
[2022/12/20 23:30:59] [ info] [sp] stream processor started
[2022/12/20 23:30:59] [ warn] [filter:ecs:ecs.0] Failed to get metadata from /v1/tasks?dockerid=79c796ed2a7f, will retry
[2022/12/20 23:30:59] [ info] [filter:ecs:ecs.0] Requesting metadata from ECS Agent introspection endpoint failed for tag testprefix-79c796ed2a7f
[2022/12/20 23:30:59] [ info] [filter:ecs:ecs.0] Failed to get ECS Task metadata for testprefix-79c796ed2a7f, falling back to process cluster metadata only. If this is intentional, set `Cluster_Metadata_Only On`
==11341== Warning: client switching stacks?  SP change: 0x910a768 --> 0x842e2f0
==11341==          to suppress, use: --max-stackframe=13485176 or greater
==11341== Warning: client switching stacks?  SP change: 0x842e268 --> 0x910a768
==11341==          to suppress, use: --max-stackframe=13485312 or greater
==11341== Warning: client switching stacks?  SP change: 0x910a998 --> 0x842e268
==11341==          to suppress, use: --max-stackframe=13485872 or greater
==11341==          further instances of this message will not be shown.
[2022/12/20 23:31:00] [ warn] [filter:ecs:ecs.0] Failed to get metadata from /v1/tasks?dockerid=79c796ed2a7f, will retry
[2022/12/20 23:31:00] [ info] [filter:ecs:ecs.0] Requesting metadata from ECS Agent introspection endpoint failed for tag testprefix-79c796ed2a7f
[2022/12/20 23:31:00] [ info] [filter:ecs:ecs.0] Failed to get ECS Task metadata for testprefix-79c796ed2a7f, falling back to process cluster metadata only. If this is intentional, set `Cluster_Metadata_Only On`
[2022/12/20 23:31:00] [ info] [filter:ecs:ecs.0] Failed to get ECS Metadata for tag testprefix-79c796ed2a7f 2 times. This might be because the logs for this tag do not come from an ECS Task Container. This plugin will retry metadata requests at most 2 times total for this tag.
[2022/12/20 23:31:04] [ warn] [engine] service will shutdown in max 1 seconds
[2022/12/20 23:31:05] [ info] [engine] service has stopped (0 pending tasks)
[ OK ]
==11341==
==11341== HEAP SUMMARY:
==11341==     in use at exit: 96 bytes in 1 blocks
==11341==   total heap usage: 5,928 allocs, 5,927 frees, 1,946,368 bytes allocated
==11341==
==11341== LEAK SUMMARY:
==11341==    definitely lost: 0 bytes in 0 blocks
==11341==    indirectly lost: 0 bytes in 0 blocks
==11341==      possibly lost: 0 bytes in 0 blocks
==11341==    still reachable: 96 bytes in 1 blocks
==11341==         suppressed: 0 bytes in 0 blocks
==11341== Rerun with --leak-check=full to see details of leaked memory
==11341==
==11341== For lists of detected and suppressed errors, rerun with: -s
==11341== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test flb_test_ecs_filter...                     [2022/12/20 23:31:05] [ info] [fluent bit] version=1.9.10, commit=cc7910323e, pid=11407
[2022/12/20 23:31:05] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/12/20 23:31:05] [ info] [cmetrics] version=0.3.7
[2022/12/20 23:31:06] [ info] [sp] stream processor started
==11407== Warning: client switching stacks?  SP change: 0x910a768 --> 0x8438e00
==11407==          to suppress, use: --max-stackframe=13441384 or greater
==11407== Warning: client switching stacks?  SP change: 0x8438d78 --> 0x910a768
==11407==          to suppress, use: --max-stackframe=13441520 or greater
==11407== Warning: client switching stacks?  SP change: 0x910a998 --> 0x8438d78
==11407==          to suppress, use: --max-stackframe=13442080 or greater
==11407==          further instances of this message will not be shown.
[2022/12/20 23:31:08] [ warn] [engine] service will shutdown in max 1 seconds
[2022/12/20 23:31:08] [ info] [engine] service has stopped (0 pending tasks)
[ OK ]
==11407==
==11407== HEAP SUMMARY:
==11407==     in use at exit: 96 bytes in 1 blocks
==11407==   total heap usage: 5,829 allocs, 5,828 frees, 895,869 bytes allocated
==11407==
==11407== LEAK SUMMARY:
==11407==    definitely lost: 0 bytes in 0 blocks
==11407==    indirectly lost: 0 bytes in 0 blocks
==11407==      possibly lost: 0 bytes in 0 blocks
==11407==    still reachable: 96 bytes in 1 blocks
==11407==         suppressed: 0 bytes in 0 blocks
==11407== Rerun with --leak-check=full to see details of leaked memory
==11407==
==11407== For lists of detected and suppressed errors, rerun with: -s
==11407== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test flb_test_ecs_filter_no_prefix...           [2022/12/20 23:31:08] [ info] [fluent bit] version=1.9.10, commit=cc7910323e, pid=11428
[2022/12/20 23:31:08] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/12/20 23:31:08] [ info] [cmetrics] version=0.3.7
[2022/12/20 23:31:09] [ info] [sp] stream processor started
==11428== Warning: client switching stacks?  SP change: 0x910a768 --> 0x8438da0
==11428==          to suppress, use: --max-stackframe=13441480 or greater
==11428== Warning: client switching stacks?  SP change: 0x8438d18 --> 0x910a768
==11428==          to suppress, use: --max-stackframe=13441616 or greater
==11428== Warning: client switching stacks?  SP change: 0x910a998 --> 0x8438d18
==11428==          to suppress, use: --max-stackframe=13442176 or greater
==11428==          further instances of this message will not be shown.
[2022/12/20 23:31:11] [ warn] [engine] service will shutdown in max 1 seconds
[2022/12/20 23:31:11] [ info] [engine] service has stopped (0 pending tasks)
[ OK ]
==11428==
==11428== HEAP SUMMARY:
==11428==     in use at exit: 96 bytes in 1 blocks
==11428==   total heap usage: 5,829 allocs, 5,828 frees, 895,792 bytes allocated
==11428==
==11428== LEAK SUMMARY:
==11428==    definitely lost: 0 bytes in 0 blocks
==11428==    indirectly lost: 0 bytes in 0 blocks
==11428==      possibly lost: 0 bytes in 0 blocks
==11428==    still reachable: 96 bytes in 1 blocks
==11428==         suppressed: 0 bytes in 0 blocks
==11428== Rerun with --leak-check=full to see details of leaked memory
==11428==
==11428== For lists of detected and suppressed errors, rerun with: -s
==11428== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test flb_test_ecs_filter_cluster_metadata_only... [2022/12/20 23:31:11] [ info] [fluent bit] version=1.9.10, commit=cc7910323e, pid=11448
[2022/12/20 23:31:11] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/12/20 23:31:11] [ info] [cmetrics] version=0.3.7
[2022/12/20 23:31:12] [ info] [sp] stream processor started
==11448== Warning: client switching stacks?  SP change: 0x910a768 --> 0x842de30
==11448==          to suppress, use: --max-stackframe=13486392 or greater
==11448== Warning: client switching stacks?  SP change: 0x842dda8 --> 0x910a768
==11448==          to suppress, use: --max-stackframe=13486528 or greater
==11448== Warning: client switching stacks?  SP change: 0x910a998 --> 0x842dda8
==11448==          to suppress, use: --max-stackframe=13487088 or greater
==11448==          further instances of this message will not be shown.
[2022/12/20 23:31:14] [ warn] [engine] service will shutdown in max 1 seconds
[2022/12/20 23:31:14] [ info] [engine] service has stopped (0 pending tasks)
[ OK ]
==11448==
==11448== HEAP SUMMARY:
==11448==     in use at exit: 96 bytes in 1 blocks
==11448==   total heap usage: 5,812 allocs, 5,811 frees, 851,950 bytes allocated
==11448==
==11448== LEAK SUMMARY:
==11448==    definitely lost: 0 bytes in 0 blocks
==11448==    indirectly lost: 0 bytes in 0 blocks
==11448==      possibly lost: 0 bytes in 0 blocks
==11448==    still reachable: 96 bytes in 1 blocks
==11448==         suppressed: 0 bytes in 0 blocks
==11448== Rerun with --leak-check=full to see details of leaked memory
==11448==
==11448== For lists of detected and suppressed errors, rerun with: -s
==11448== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test flb_test_ecs_filter_cluster_error...       [2022/12/20 23:31:14] [ info] [fluent bit] version=1.9.10, commit=cc7910323e, pid=11466
[2022/12/20 23:31:14] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/12/20 23:31:14] [ info] [cmetrics] version=0.3.7
[2022/12/20 23:31:14] [ warn] [filter:ecs:ecs.0] Failed to get metadata from /v1/metadata, will retry
[2022/12/20 23:31:14] [ info] [sp] stream processor started
[2022/12/20 23:31:15] [ warn] [filter:ecs:ecs.0] Failed to get metadata from /v1/metadata, will retry
[2022/12/20 23:31:15] [ warn] [filter:ecs:ecs.0] Could not retrieve cluster metadata from ECS Agent
==11466== Warning: client switching stacks?  SP change: 0x910a768 --> 0x841e9d0
==11466==          to suppress, use: --max-stackframe=13548952 or greater
==11466== Warning: client switching stacks?  SP change: 0x841e948 --> 0x910a768
==11466==          to suppress, use: --max-stackframe=13549088 or greater
==11466== Warning: client switching stacks?  SP change: 0x910a998 --> 0x841e948
==11466==          to suppress, use: --max-stackframe=13549648 or greater
==11466==          further instances of this message will not be shown.
[2022/12/20 23:31:17] [ warn] [engine] service will shutdown in max 1 seconds
[2022/12/20 23:31:17] [ info] [engine] service has stopped (0 pending tasks)
[ OK ]
==11466==
==11466== HEAP SUMMARY:
==11466==     in use at exit: 96 bytes in 1 blocks
==11466==   total heap usage: 5,782 allocs, 5,781 frees, 791,430 bytes allocated
==11466==
==11466== LEAK SUMMARY:
==11466==    definitely lost: 0 bytes in 0 blocks
==11466==    indirectly lost: 0 bytes in 0 blocks
==11466==      possibly lost: 0 bytes in 0 blocks
==11466==    still reachable: 96 bytes in 1 blocks
==11466==         suppressed: 0 bytes in 0 blocks
==11466== Rerun with --leak-check=full to see details of leaked memory
==11466==
==11466== For lists of detected and suppressed errors, rerun with: -s
==11466== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Test flb_test_ecs_filter_task_error...          [2022/12/20 23:31:17] [ info] [fluent bit] version=1.9.10, commit=cc7910323e, pid=11488
[2022/12/20 23:31:17] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/12/20 23:31:17] [ info] [cmetrics] version=0.3.7
[2022/12/20 23:31:18] [ info] [sp] stream processor started
[2022/12/20 23:31:18] [ warn] [filter:ecs:ecs.0] Failed to get metadata from /v1/tasks?dockerid=79c796ed2a7f, will retry
[2022/12/20 23:31:18] [ info] [filter:ecs:ecs.0] Requesting metadata from ECS Agent introspection endpoint failed for tag 79c796ed2a7f
[2022/12/20 23:31:18] [ info] [filter:ecs:ecs.0] Failed to get ECS Task metadata for 79c796ed2a7f, falling back to process cluster metadata only. If this is intentional, set `Cluster_Metadata_Only On`
==11488== Warning: client switching stacks?  SP change: 0x910a768 --> 0x842e280
==11488==          to suppress, use: --max-stackframe=13485288 or greater
==11488== Warning: client switching stacks?  SP change: 0x842e1f8 --> 0x910a768
==11488==          to suppress, use: --max-stackframe=13485424 or greater
==11488== Warning: client switching stacks?  SP change: 0x910a998 --> 0x842e1f8
==11488==          to suppress, use: --max-stackframe=13485984 or greater
==11488==          further instances of this message will not be shown.
[2022/12/20 23:31:20] [ warn] [engine] service will shutdown in max 1 seconds
[2022/12/20 23:31:20] [ info] [engine] service has stopped (0 pending tasks)
[ OK ]
==11488==
==11488== HEAP SUMMARY:
==11488==     in use at exit: 96 bytes in 1 blocks
==11488==   total heap usage: 5,820 allocs, 5,819 frees, 852,509 bytes allocated
==11488==
==11488== LEAK SUMMARY:
==11488==    definitely lost: 0 bytes in 0 blocks
==11488==    indirectly lost: 0 bytes in 0 blocks
==11488==      possibly lost: 0 bytes in 0 blocks
==11488==    still reachable: 96 bytes in 1 blocks
==11488==         suppressed: 0 bytes in 0 blocks
==11488== Rerun with --leak-check=full to see details of leaked memory
==11488==
==11488== For lists of detected and suppressed errors, rerun with: -s
==11488== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
SUCCESS: All unit tests have passed.
==11335==
==11335== HEAP SUMMARY:
==11335==     in use at exit: 0 bytes in 0 blocks
==11335==   total heap usage: 6 allocs, 6 frees, 2,853 bytes allocated
==11335==
==11335== All heap blocks were freed -- no leaks are possible
==11335==
==11335== For lists of detected and suppressed errors, rerun with: -s
==11335== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

@PettitWesley PettitWesley marked this pull request as ready for review December 20, 2022 23:33
@PettitWesley PettitWesley temporarily deployed to pr December 20, 2022 23:52 — with GitHub Actions Inactive
Copy link
Contributor

@matthewfala matthewfala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@PettitWesley PettitWesley changed the title filter_ecs: fix metadata retry bug and make retries configurable filter_ecs: fix multiple bugs and make retries configurable Dec 23, 2022
@PettitWesley
Copy link
Contributor Author

These commits were merged with: https://github.com/fluent/fluent-bit/pull/6614/commits

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants