out_datadog: fix/add error handling for all flb_sds calls #5929

Merged: 1 commit, Nov 16, 2022

PettitWesley
Contributor

Signed-off-by: Wesley Pettit [email protected]


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@PettitWesley
Contributor Author

Valgrind passes:

==26291==
==26291== HEAP SUMMARY:
==26291==     in use at exit: 110,402 bytes in 3,763 blocks
==26291==   total heap usage: 33,404 allocs, 29,641 frees, 6,283,769 bytes allocated
==26291==
==26291== LEAK SUMMARY:
==26291==    definitely lost: 0 bytes in 0 blocks
==26291==    indirectly lost: 0 bytes in 0 blocks
==26291==      possibly lost: 0 bytes in 0 blocks
==26291==    still reachable: 110,402 bytes in 3,763 blocks
==26291==         suppressed: 0 bytes in 0 blocks
==26291== Rerun with --leak-check=full to see details of leaked memory
==26291==
==26291== For lists of detected and suppressed errors, rerun with: -s
==26291== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

With config:


[SERVICE]
    Log_Level debug

[INPUT]
    Name dummy
    Tag dummy


[OUTPUT]
    Name        datadog
    Match       *
    Host        http-intake.logs.datadoghq.com
    TLS         on
    compress    gzip
    apikey      SECRET
    dd_service  wesley
    dd_source   test
    dd_tags     team:logs,foo:bar

@@ -321,7 +358,7 @@ static void cb_datadog_flush(struct flb_event_chunk *event_chunk,
     ret = datadog_format(config, i_ins,
                          ctx, NULL,
                          event_chunk->tag, flb_sds_len(event_chunk->tag),
-                         event_chunk->data, event_chunk->size,
+                         event_chunk->data, event_chunk->size, event_chunk->total_events,
Contributor Author

This change is causing the build to give a warning... will fix it... this is used in the test_formatter.callback

@PettitWesley force-pushed the flb_sds_cat_datadog branch 2 times, most recently from 5a0655d to 0685187 (August 22, 2022 23:43)
Comment on lines +248 to +261
ret = remapping[ind].remap_to_tag(remapping[ind].remap_tag_name, v,
                                  &remapped_tags);
if (ret < 0) {
    flb_plg_error(ctx->ins, "Failed to remap tag: %s, skipping",
                  remapping[ind].remap_tag_name);
}
Contributor Author

If someone from datadog could comment on this bit specifically, that'd be awesome.

So this code adds the ECS task metadata (cluster name, task ARN, etc.) to the Datadog tags. It's very unlikely for this code to fail; that only happens if there is an allocation failure. But if there is a failure, what should we do?

I was thinking that just skipping and applying the next tag is best. Technically I think the tag string has to be in a nice format like key:val,key2:val2. When it fails, we don't know if it failed in the middle of adding a tag, so at the continue here you could have an incomplete string like key:val,key2:. I do not know how bad this is.

My guess was that continuing and risking that the tags are mis-formatted is better than just discarding the tags or discarding the record.
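
One way to sidestep the half-written `key:val,key2:` case described above is to remember the last well-formed length before each tag and truncate back to it when any append fails. The following is only a sketch of that idea: `struct tag_buf`, `tag_buf_cat`, and `tag_buf_add` are hypothetical helpers written for illustration, not Fluent Bit or flb_sds APIs, and a full fixed-size buffer stands in for an allocation failure.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-capacity tag buffer; not a Fluent Bit type. */
struct tag_buf {
    char   data[64];
    size_t len;
};

/* Append n bytes; fails when full, standing in for an allocation failure. */
static int tag_buf_cat(struct tag_buf *b, const char *s, size_t n)
{
    if (b->len + n >= sizeof(b->data)) {
        return -1;
    }
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}

/* Append "key:val" (comma-separated from earlier tags) as a single unit. */
static int tag_buf_add(struct tag_buf *b, const char *key, const char *val)
{
    size_t mark = b->len;   /* last length at which the string was well formed */

    if ((mark > 0 && tag_buf_cat(b, ",", 1) < 0) ||
        tag_buf_cat(b, key, strlen(key)) < 0 ||
        tag_buf_cat(b, ":", 1) < 0 ||
        tag_buf_cat(b, val, strlen(val)) < 0) {
        /* Roll back the partial tag so the caller can skip it safely. */
        b->len = mark;
        b->data[mark] = '\0';
        return -1;
    }
    return 0;
}
```

With this shape, skipping a failed tag and continuing with the next one never leaves a dangling key or separator in the string. Whether the extra bookkeeping is worth it for a failure that only occurs when allocation fails is a judgment call.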

Contributor

@matthewfala left a comment

Looks good! The code looks much safer now. Made some small comments.

It seems like the convention you are following for flb_errno() is to call that only on failed allocations and reallocations, not on failed frees.


     /* Count number of records */
-    array_size = flb_mp_count(data, bytes);
+    array_size = (int) total_events;
Contributor

Why downcast from a size_t to an int? Could you make array_size a size_t, or total_events an int?

@@ -110,13 +111,14 @@ static int datadog_format(struct flb_config *config,
msgpack_object k;
msgpack_object v;
struct flb_out_datadog *ctx = plugin_context;
struct flb_event_chunk *event_chunk = flush_ctx;

array_size = (int) event_chunk->total_events;
Contributor

Is it possible to make this a size_t rather than an int to avoid downcasting issues?

Contributor Author

Yeah, this is because I thought the msgpack API wanted an int, but it actually wants a size_t, so this cast is very silly...
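
For context on the downcasting concern: in C, converting a `size_t` value larger than `INT_MAX` to `int` gives an implementation-defined result, so an unchecked `(int)` cast can silently corrupt a count. Where an `int` really is required, a range check along these lines is one option. This is a minimal sketch, and `size_to_int_checked` is a hypothetical helper, not part of Fluent Bit or msgpack.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Convert a size_t count to int only when it fits.
 * Returns 0 and writes *out on success; returns -1 and leaves *out
 * untouched when the value would not fit in an int.
 */
static int size_to_int_checked(size_t in, int *out)
{
    if (in > (size_t) INT_MAX) {
        return -1;  /* refuse instead of truncating */
    }
    *out = (int) in;
    return 0;
}
```

When the receiving API already takes a `size_t`, as noted in the reply above, the simpler fix is to keep the variable as `size_t` and drop the cast entirely.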

Comment on lines 164 to +190
if (!remapped_tags) {
    remapped_tags = flb_sds_create_size(byte_cnt);
    if (!remapped_tags) {
        flb_errno();
        msgpack_sbuffer_destroy(&mp_sbuf);
        msgpack_unpacked_destroy(&result);
        return -1;
    }
} else if (flb_sds_len(remapped_tags) < byte_cnt) {
    /* grow by the shortfall; len - byte_cnt would underflow in this branch */
    tmp = flb_sds_increase(remapped_tags, byte_cnt - flb_sds_len(remapped_tags));
    if (!tmp) {
        flb_errno();
        flb_sds_destroy(remapped_tags);
        msgpack_sbuffer_destroy(&mp_sbuf);
        msgpack_unpacked_destroy(&result);
        return -1;
    }
    remapped_tags = tmp;
Contributor

So, to confirm:
This new section of code addresses the following problem:

  1. A remapped_tags buffer is created for the first log event, based on its size estimate.
  2. The remapped_tags buffer is cleared and reused for the second log event, disregarding the second log event's tag size estimate.
  3. The remapped_tags buffer is too small, which triggers a segfault.

The solution is as follows:
Reuse the remapped_tags buffer unless the new size estimate is larger, in which case reallocate it.
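
The reuse-or-grow pattern summarized above can be sketched in isolation as follows. This is an illustration only, using plain `realloc` rather than the flb_sds API; `struct scratch` and `scratch_reserve` are hypothetical names. The key detail is that the growth amount is the shortfall (needed minus current), since subtracting in the other order would underflow an unsigned size.

```c
#include <stdlib.h>

/* Hypothetical reusable buffer; stands in for the plugin's tag string. */
struct scratch {
    char  *data;
    size_t alloc;   /* bytes currently allocated */
};

/*
 * Ensure at least `needed` bytes are allocated, reusing the existing
 * allocation when it is already large enough. Returns 0 on success and
 * -1 on allocation failure, in which case the old buffer is left intact.
 */
static int scratch_reserve(struct scratch *s, size_t needed)
{
    size_t shortfall;
    char *tmp;

    if (s->alloc >= needed) {
        return 0;                       /* big enough: reuse as-is */
    }

    shortfall = needed - s->alloc;      /* safe: needed > alloc here */
    tmp = realloc(s->data, s->alloc + shortfall);
    if (!tmp) {
        return -1;
    }
    s->data = tmp;
    s->alloc += shortfall;
    return 0;
}
```

Calling `scratch_reserve` once per record with that record's size estimate means the buffer is reallocated only when a record needs more room than any earlier one.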

Contributor

It seems, though, that this section of code is an optimization and not strictly necessary.

If there is too little space for the contents of the string, then the flb_sds_cat code will reallocate the buffer with more space for all future use. Reallocating the memory upfront is more efficient, which is why this solution seems like a good idea.

Contributor Author

> It seems, though, that this section of code is an optimization and not strictly necessary.
>
> If there is too little space for the contents of the string, then the flb_sds_cat code will reallocate the buffer with more space for all future use. Reallocating the memory upfront is more efficient, which is why this solution seems like a good idea.

Yeah, this is what I'm going for... trying to make sure this buffer is reallocated as few times as possible.

@PettitWesley added this to the Fluent Bit v1.9.9 milestone Sep 16, 2022
@rajeev-netomi

Can we please prioritise this PR? Because of this issue, we have been seeing regular Fluent Bit crashes while sending logs to Datadog.

@PettitWesley
Contributor Author

@edsiper CI passes except for CIFuzz. This is a bug fix for an issue that was causing crashes for a customer. Can we please merge?

@edsiper merged commit 300206a into fluent:1.9 Nov 16, 2022