data_dog: partially revert recent datadog PR to avoid provider ecs segfault #6785

matthewfala · 2023-02-04T01:40:10Z

We noticed that the recent Datadog PR triggers a segfault when the provider option is set to ecs. After a lot of investigation we were unable to find the root cause, but we discovered that the segfault occurs during a seemingly random network call that has nothing to do with the error handling code added in the PR. Even so, removing that error handling code resolves the segfault.

As a solution, we partially revert the recent Datadog PR that mysteriously triggers this segfault. The reverted portion is simple error handling code that was recently added. We also add the data buffer resize fix from #6570.

This partially reverts the provider ecs code from the recent Datadog PRs that trigger the segfault:
#5930
#5929

See the segfault reports in aws-for-fluent-bit repo here:
aws/aws-for-fluent-bit#491

Signed-off-by: Matthew Fala [email protected]

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

Example configuration file for the change
Debug log output from testing the change

Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

Run local packaging test showing all targets (including any new ones) build.
Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

Documentation required for this feature

Backporting

Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Signed-off-by: Matthew Fala <[email protected]>

nokute78 · 2023-02-05T00:25:52Z

aws/aws-for-fluent-bit#491 (comment)

[2022/12/07 10:11:00] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
[2022/12/07 10:11:00] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device

I think this issue is caused by the following condition:
https://github.com/fluent/fluent-bit/blob/v1.9.10/plugins/out_datadog/datadog.c#L181-L182
flb_sds_len(remapped_tags) - byte_cnt is negative because byte_cnt is greater than flb_sds_len(remapped_tags).
As a result, flb_sds_increase fails.
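The failure mode described above can be sketched in isolation. This is a minimal illustration, assuming both operands end up as size_t; the helper names delta_wrong_order, delta_fixed_order and demo_delta are hypothetical and are not part of Fluent Bit. Under that assumption the difference is not stored as a negative number: it wraps to a value close to SIZE_MAX, which would be consistent with the "Cannot allocate memory" error in the log above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Fluent Bit: the operand order from the
 * old line. With size_t operands the result cannot be negative; it wraps
 * around when needed > cur_len. */
size_t delta_wrong_order(size_t cur_len, size_t needed)
{
    return cur_len - needed;
}

/* Hypothetical helper: the operand order from the fixed line, guarded so
 * that it returns 0 instead of wrapping when no growth is needed. */
size_t delta_fixed_order(size_t cur_len, size_t needed)
{
    if (needed <= cur_len) {
        return 0;
    }
    return needed - cur_len;
}

/* Usage example: 64 bytes in the buffer, 100 bytes needed. */
void demo_delta(void)
{
    /* 64 - 100 wraps to (SIZE_MAX + 1) - 36, an enormous request. */
    assert(delta_wrong_order(64, 100) == SIZE_MAX - 35);
    /* The fixed order asks for the 36 missing bytes. */
    assert(delta_fixed_order(64, 100) == 36);
    /* Nothing to grow when the buffer is already large enough. */
    assert(delta_fixed_order(100, 64) == 0);
}
```

The guard in delta_fixed_order is an addition for the sketch; in the diff under discussion the surrounding `flb_sds_len(remapped_tags) < byte_cnt` check plays that role.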

On current master this was fixed by #6750, which has not been released yet.

PettitWesley · 2023-02-06T18:46:06Z

plugins/out_datadog/datadog.c

@@ -179,7 +179,7 @@ static int datadog_format(struct flb_config *config,
                    return -1;
                }
            } else if (flb_sds_len(remapped_tags) < byte_cnt) {
-                tmp = flb_sds_increase(remapped_tags, flb_sds_len(remapped_tags) - byte_cnt);
+                tmp = flb_sds_increase(remapped_tags, byte_cnt - flb_sds_len(remapped_tags));


Wait, the 2nd arg to flb_sds_increase is the new size, so shouldn't that be byte_cnt? And we shouldn't subtract anything from it?
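Whether to pass byte_cnt or a difference depends on what the second argument means. The sketch below models only one reading, namely that the argument is a number of additional bytes rather than a target size. That reading is an assumption here and should be checked against flb_sds.c for the release in use. The toy_buf type and its functions are hypothetical stand-ins, not the real flb_sds_t API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable buffer for illustration only; this is not the
 * real flb_sds_t layout. */
struct toy_buf {
    size_t alloc;  /* bytes allocated for data */
    char *data;
};

/* Allocate a buffer with the given capacity. Returns 0 or -1. */
int toy_buf_init(struct toy_buf *b, size_t alloc)
{
    b->data = malloc(alloc > 0 ? alloc : 1);
    if (b->data == NULL) {
        return -1;
    }
    b->alloc = alloc;
    return 0;
}

/* ASSUMED semantics: grow the capacity BY `extra` bytes, not TO `extra`. */
int toy_buf_increase(struct toy_buf *b, size_t extra)
{
    char *tmp;

    if (extra == 0) {
        return 0;
    }
    if (extra > SIZE_MAX - b->alloc) {
        return -1;  /* the new capacity would overflow size_t */
    }
    tmp = realloc(b->data, b->alloc + extra);
    if (tmp == NULL) {
        return -1;  /* the original allocation is still valid */
    }
    b->data = tmp;
    b->alloc += extra;
    return 0;
}

/* Make room for at least `needed` bytes by passing only the difference. */
int toy_buf_reserve(struct toy_buf *b, size_t needed)
{
    if (needed <= b->alloc) {
        return 0;
    }
    return toy_buf_increase(b, needed - b->alloc);
}

void toy_buf_free(struct toy_buf *b)
{
    free(b->data);
    b->data = NULL;
    b->alloc = 0;
}

/* Usage example: a 64-byte buffer that must hold 100 bytes. */
void demo_reserve(void)
{
    struct toy_buf b;
    int rc;

    rc = toy_buf_init(&b, 64);
    assert(rc == 0);
    rc = toy_buf_reserve(&b, 100);  /* passes 100 - 64 = 36 */
    assert(rc == 0 && b.alloc == 100);
    rc = toy_buf_increase(&b, 100); /* treating 100 as a target adds 100 more */
    assert(rc == 0 && b.alloc == 200);
    toy_buf_free(&b);
}
```

With "grow by" semantics, passing the difference yields exactly the needed capacity, while passing byte_cnt itself still works but over-allocates. With "grow to" semantics the opposite would hold, so the question can only be settled by reading the function's implementation.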

PettitWesley · 2023-02-06T18:49:12Z

plugins/out_datadog/datadog_remap.c

-static int dd_remap_container_name(const char *tag_name,
-                                    msgpack_object attr_value, flb_sds_t *dd_tags_buf)
+static void dd_remap_container_name(const char *tag_name,
+                                    msgpack_object attr_value, flb_sds_t dd_tags)


Help me understand again: why are we reverting these changes, which are probably still a real fix? Also, if we are going to revert them, why not put that in a separate revert commit? Why in the same commit as this new fix?

matthewfala · 2023-02-04T01:40:26Z

plugins/out_datadog/datadog.c

@@ -179,7 +179,7 @@ static int datadog_format(struct flb_config *config,
                    return -1;
                }
            } else if (flb_sds_len(remapped_tags) < byte_cnt) {
-                tmp = flb_sds_increase(remapped_tags, flb_sds_len(remapped_tags) - byte_cnt);
+                tmp = flb_sds_increase(remapped_tags, byte_cnt - flb_sds_len(remapped_tags));


This is the data buffer resize fix from here: #6570

matthewfala · 2023-02-06T19:12:58Z

plugins/out_datadog/datadog_remap.c

-static int dd_remap_container_name(const char *tag_name,
-                                    msgpack_object attr_value, flb_sds_t *dd_tags_buf)
+static void dd_remap_container_name(const char *tag_name,
+                                    msgpack_object attr_value, flb_sds_t dd_tags)


I tested the individual parts of the Datadog PR, and it turns out that completely unrelated error handling code was triggering a segfault somewhere seemingly random in the code (for example, on a network call).

We decided that this PR would keep only the essential portions of the recent PR and drop the lower-priority ones to avoid introducing the segfault.

github-actions · 2023-06-16T02:02:31Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

data_dog: partially revert recent datadog PR to avoid segfault

828a83f

Signed-off-by: Matthew Fala <[email protected]>

matthewfala requested review from nokute78 and edsiper as code owners February 4, 2023 01:40

github-actions bot added the docs-required label Feb 4, 2023

matthewfala mentioned this pull request Feb 4, 2023: data_dog: partially revert recent datadog PR to avoid segfault #6786

PettitWesley reviewed Feb 6, 2023

matthewfala commented Mar 17, 2023

github-actions bot added the Stale label Jun 16, 2023
