Prometheus Exporter metrics with different tags should have only one HELP and TYPE comment line #5465

Closed
yingchen0706v opened this issue May 17, 2022 · 19 comments

Comments

@yingchen0706v

yingchen0706v commented May 17, 2022

Bug Report

Prometheus metrics exported with the same name but different tags have duplicate HELP and TYPE comment lines.
According to https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details, only one HELP line and one TYPE line may exist for any given metric name.
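
A minimal sketch of that rule, assuming the scraped metrics text is available as a string. The helper name `find_duplicate_metadata` is hypothetical and this is not how any Prometheus parser is implemented; it only counts HELP and TYPE lines per metric name and reports the ones seen more than once. The sample lines are taken from the exporter output quoted later in this thread.

```python
from collections import Counter


def find_duplicate_metadata(text):
    """Return {(kind, metric_name): count} for HELP/TYPE lines seen more than once."""
    seen = Counter()
    for line in text.splitlines():
        # A metadata line looks like: "# HELP <name> <docstring>" or "# TYPE <name> <type>"
        parts = line.split(None, 3)
        if len(parts) >= 3 and parts[0] == "#" and parts[1] in ("HELP", "TYPE"):
            seen[(parts[1], parts[2])] += 1
    return {key: count for key, count in seen.items() if count > 1}


sample = """\
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.0"} 3146
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.1"} 3146
"""

duplicates = find_duplicate_metadata(sample)
```

An empty result means every metric name has at most one HELP and one TYPE line.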

To Reproduce

  • Rubular link if applicable: NA
  • Example log message if applicable: NA
  • Steps to reproduce the problem: set up a fluent-bit configuration with several different outputs, as shown in the Configuration section below.

Expected behavior
Only one TYPE line and one HELP line should be generated for each fluent-bit output metric, but there are duplicates, as shown in the screenshot.

Screenshots
Screen Shot 2022-04-08 at 5 59 19 PM
As highlighted above, there are duplicate TYPE/HELP comment lines.

Your Environment

  • Version used: 1.9.3

  • Configuration:

    [OUTPUT]
    Name http
    Alias confiant
    Match bids
    ...
    [OUTPUT]
    Name s3
    Alias s3
    Match bids
    region {{ .Values.s3RegionForBids }}
    bucket {{ .Values.s3BucketForBids }}
    ...
    [OUTPUT]
    Name prometheus_exporter
    Alias exporter
    match internal_metrics
    ...

  • Environment name and version (e.g. Kubernetes? What version?): K8S

  • Server type and version: EKS

  • Operating System and version: x86_64 Linux 5.4

  • Filters and plugins: no filters, output plugin as in configuration.

Additional context
It causes issues when we try to feed those metrics to our monitoring system because, according to https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details, only one HELP/TYPE line is allowed for any given metric.

@patrick-stephens
Contributor

I cannot seem to reproduce this on 1.9.3 with this config as a test case:

[SERVICE]
  Http_server On

[INPUT]
  name dummy
  tag dummy1

[INPUT]
  name dummy
  tag dummy2
  
[OUTPUT]
  name stdout
  match dummy1

[OUTPUT]
  name stdout
  match dummy2

[OUTPUT]
  Name http
  match nothing

Run up the container and curl the output:

$ docker run --rm -d -p 2020:2020 -v $PWD/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf fluent/fluent-bit:1.9.3
$ curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.0"} 468 1652771931407
fluentbit_input_bytes_total{name="dummy.1"} 468 1652771931407
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="dummy.0"} 18 1652771931407
fluentbit_input_records_total{name="dummy.1"} 18 1652771931407
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="http.2"} 0 1652771931407
fluentbit_output_dropped_records_total{name="stdout.0"} 0 1652771931407
fluentbit_output_dropped_records_total{name="stdout.1"} 0 1652771931407
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="http.2"} 0 1652771931407
fluentbit_output_errors_total{name="stdout.0"} 0 1652771931407
fluentbit_output_errors_total{name="stdout.1"} 0 1652771931407
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="http.2"} 0 1652771931407
fluentbit_output_proc_bytes_total{name="stdout.0"} 416 1652771931407
fluentbit_output_proc_bytes_total{name="stdout.1"} 416 1652771931407
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="http.2"} 0 1652771931407
fluentbit_output_proc_records_total{name="stdout.0"} 16 1652771931407
fluentbit_output_proc_records_total{name="stdout.1"} 16 1652771931407
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="http.2"} 0 1652771931407
fluentbit_output_retried_records_total{name="stdout.0"} 0 1652771931407
fluentbit_output_retried_records_total{name="stdout.1"} 0 1652771931407
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="http.2"} 0 1652771931407
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1652771931407
fluentbit_output_retries_failed_total{name="stdout.1"} 0 1652771931407
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="http.2"} 0 1652771931407
fluentbit_output_retries_total{name="stdout.0"} 0 1652771931407
fluentbit_output_retries_total{name="stdout.1"} 0 1652771931407
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime 18
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1652771913
# HELP fluentbit_build_info Build version information.
# TYPE fluentbit_build_info gauge
fluentbit_build_info{version="1.9.3",edition="Community"} 1

@patrick-stephens patrick-stephens added the waiting-for-user label and removed the status: waiting-for-triage label May 17, 2022
@patrick-stephens
Contributor

Ah, this seems to be an issue with the Prometheus Exporter itself: if we use it with the recent Fluent Bit metrics input plugin, it generates invalid output:

[SERVICE]
  Http_server On

[INPUT]
  name dummy
  tag dummy1

[INPUT]
  name dummy
  tag dummy2
  
[OUTPUT]
  name stdout
  match dummy1

[OUTPUT]
  name stdout
  match dummy2

[OUTPUT]
  Name http
  match nothing

[INPUT]
  name            fluentbit_metrics
  tag             internal_metrics

[OUTPUT]
  name            prometheus_exporter
  match           internal_metrics
  port            2021

Run it and check the result to see the incorrect output. Make sure to expose port 2021 as well now:

$ docker run --rm -d -p 2020:2020 -p 2021:2021 -v $PWD/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf fluent/fluent-bit:1.9.3
$ curl -s http://127.0.0.1:2021/metrics
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime{hostname="653a853d661c"} 121
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.0"} 3146
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="dummy.0"} 121
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.1"} 3146
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="dummy.1"} 121
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="fluentbit_metrics.2"} 520260
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="fluentbit_metrics.2"} 60
# HELP fluentbit_input_metrics_scrapes_total Number of total metrics scrapes
# TYPE fluentbit_input_metrics_scrapes_total counter
fluentbit_input_metrics_scrapes_total{name="fluentbit_metrics.2"} 61
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="stdout.0"} 120
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="stdout.0"} 3120
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="stdout.0"} 0
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="stdout.0"} 0
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="stdout.0"} 0
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="stdout.0"} 0
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="stdout.0"} 0
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="stdout.1"} 120
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="stdout.1"} 3120
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="stdout.1"} 0
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="stdout.1"} 0
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="stdout.1"} 0
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="stdout.1"} 0
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="stdout.1"} 0
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="http.2"} 0
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="http.2"} 0
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="http.2"} 0
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="http.2"} 0
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="http.2"} 0
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="http.2"} 0
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="http.2"} 0
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="prometheus_exporter.3"} 60
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="prometheus_exporter.3"} 520260
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="prometheus_exporter.3"} 0
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="prometheus_exporter.3"} 0
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="prometheus_exporter.3"} 0
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="prometheus_exporter.3"} 0
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="prometheus_exporter.3"} 0
# HELP fluentbit_process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE fluentbit_process_start_time_seconds gauge
fluentbit_process_start_time_seconds{hostname="653a853d661c"} 1652772686
# HELP fluentbit_build_info Build version information.
# TYPE fluentbit_build_info gauge
fluentbit_build_info{hostname="653a853d661c",version="1.9.3",os="linux"} 1652772686

The webserver output is fine:

$ curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.0"} 3926 1652772837034
fluentbit_input_bytes_total{name="dummy.1"} 3926 1652772837034
fluentbit_input_bytes_total{name="fluentbit_metrics.2"} 650325 1652772837034
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="dummy.0"} 151 1652772837034
fluentbit_input_records_total{name="dummy.1"} 151 1652772837034
fluentbit_input_records_total{name="fluentbit_metrics.2"} 75 1652772837034
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="http.2"} 0 1652772837034
fluentbit_output_dropped_records_total{name="prometheus_exporter.3"} 0 1652772837034
fluentbit_output_dropped_records_total{name="stdout.0"} 0 1652772837034
fluentbit_output_dropped_records_total{name="stdout.1"} 0 1652772837034
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="http.2"} 0 1652772837034
fluentbit_output_errors_total{name="prometheus_exporter.3"} 0 1652772837034
fluentbit_output_errors_total{name="stdout.0"} 0 1652772837034
fluentbit_output_errors_total{name="stdout.1"} 0 1652772837034
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="http.2"} 0 1652772837034
fluentbit_output_proc_bytes_total{name="prometheus_exporter.3"} 641654 1652772837034
fluentbit_output_proc_bytes_total{name="stdout.0"} 3874 1652772837034
fluentbit_output_proc_bytes_total{name="stdout.1"} 3874 1652772837034
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="http.2"} 0 1652772837034
fluentbit_output_proc_records_total{name="prometheus_exporter.3"} 74 1652772837034
fluentbit_output_proc_records_total{name="stdout.0"} 149 1652772837034
fluentbit_output_proc_records_total{name="stdout.1"} 149 1652772837034
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="http.2"} 0 1652772837034
fluentbit_output_retried_records_total{name="prometheus_exporter.3"} 0 1652772837034
fluentbit_output_retried_records_total{name="stdout.0"} 0 1652772837034
fluentbit_output_retried_records_total{name="stdout.1"} 0 1652772837034
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="http.2"} 0 1652772837034
fluentbit_output_retries_failed_total{name="prometheus_exporter.3"} 0 1652772837034
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1652772837034
fluentbit_output_retries_failed_total{name="stdout.1"} 0 1652772837034
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="http.2"} 0 1652772837034
fluentbit_output_retries_total{name="prometheus_exporter.3"} 0 1652772837034
fluentbit_output_retries_total{name="stdout.0"} 0 1652772837034
fluentbit_output_retries_total{name="stdout.1"} 0 1652772837034
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime 151
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1652772686
# HELP fluentbit_build_info Build version information.
# TYPE fluentbit_build_info gauge
fluentbit_build_info{version="1.9.3",edition="Community"} 1

@patrick-stephens patrick-stephens changed the title from "Prometheus metrics with different tags should have only one HELP and TYPE comment line" to "Prometheus Exporter metrics with different tags should have only one HELP and TYPE comment line" May 17, 2022
@patrick-stephens patrick-stephens added the bug label and removed the waiting-for-user label May 17, 2022
@yingchen0706v
Author

yingchen0706v commented May 17, 2022

@patrick-stephens it works with the default configuration, but the metrics are exported on the endpoint /api/v1/metrics/prometheus instead of /metrics. Is there a way to make it use /metrics?

@patrick-stephens
Contributor

I don't think so, as those routes are part of the web server. A scrape config should handle it fine though: you just need to configure the metrics path on the Prometheus side so it doesn't use the default.

@yingchen0706v
Author

Thanks @patrick-stephens. I'll work around it with another solution and close the ticket for now. Thank you for the help.

@patrick-stephens
Contributor

I've re-opened this as it is a legitimate bug that will prevent use of the exporter. @leonardo-albertovich can you take a look?

The issue seems to be that the exporter output does not group samples of the same metric together, whereas the web server output does.
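
Until the exporter groups samples itself, one possible client-side workaround is to regroup the text before handing it to a strict parser. This is a minimal sketch, assuming only counters and gauges whose sample name equals the metric name (histogram and summary suffixes such as `_bucket` are not handled). `regroup_exposition` is a hypothetical helper, not part of Fluent Bit or Prometheus.

```python
def regroup_exposition(text):
    """Group samples by metric name, keeping the first HELP/TYPE line of each."""
    order = []    # metric names in first-seen order
    meta = {}     # name -> {"HELP": line, "TYPE": line}
    samples = {}  # name -> list of sample lines

    def slot(name):
        # Create the bookkeeping entries the first time a name is seen.
        if name not in meta:
            order.append(name)
            meta[name] = {}
            samples[name] = []
        return name

    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split(None, 3)
        if len(parts) >= 3 and parts[0] == "#" and parts[1] in ("HELP", "TYPE"):
            # Keep only the first HELP and the first TYPE line per metric.
            meta[slot(parts[2])].setdefault(parts[1], line)
        elif not line.startswith("#"):
            # Sample line: the metric name ends at "{" or at the first space.
            name = line.split("{", 1)[0].split(" ", 1)[0]
            samples[slot(name)].append(line)

    out = []
    for name in order:
        out.extend(meta[name][kind] for kind in ("HELP", "TYPE") if kind in meta[name])
        out.extend(samples[name])
    return "\n".join(out) + "\n"


raw = """\
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.0"} 3146
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="dummy.0"} 121
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.1"} 3146
"""

grouped = regroup_exposition(raw)
```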

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added Stale and removed Stale labels Aug 22, 2022

@a-thaler

It seems that Prometheus and VictoriaMetrics can handle that situation well. However, there are providers like Dynatrace which scrape only the first entry of every metric and drop the rest.

As the new mechanism using the fluentbit_metrics as input seems to be the future-safe solution (storage metrics are available here in prometheus format using the prometheus-exporter and it is much more flexible), it will be great if the problem could get solved so that the new mechanism can be adopted more widely.

@ccampo133

ccampo133 commented Aug 25, 2023

Ran into this bug today and confirmed that it is due to the fluentbit_metrics input plugin. Strangely, as @a-thaler mentioned, Prometheus itself has no issue parsing this malformed metrics text (despite the format violating its own spec). The Prometheus Go parser however fails with an error, as expected (see: https://github.com/prometheus/common/blob/main/expfmt/text_parse.go#L500)

It would be great if the fluentbit_metrics plugin could be used with the Prometheus exporter output plugin and produce properly formatted output. This would at the very least allow me to add additional labels to the metrics, which the monitoring API does not. That being said, the monitoring API metrics endpoint (mentioned here #5465 (comment)) is a sufficient workaround for now, at least for my use case.

@randvoorhies

I'm trying to pull in the fluentbit v2 metrics from fluentbit 3.1.3 into telegraf which uses the Go parser and am stuck because of this issue. It seems to only be with v2, which I need so that I can plot my storage buffer usages.

@evgfitil

I’ve encountered this issue as well, specifically the inability to use the storage metrics that are available exclusively in the v2 API. Resolving this would significantly improve our monitoring capabilities. I hope this issue can be prioritized in the future.

@bwplotka

👋🏽

Looks like more ppl want to use the new Prometheus endpoint, but can't, due to a broken exposition format implementation. Any updates on this, or at least pointers on what the challenges are?

(Not that it helps here, but I'm a Prometheus maintainer, open to feedback on how we can make this easier for C codebases on our side)

@braydonk
Contributor

braydonk commented Aug 14, 2024

I looked into this today, here's what I found.

The problem

When all the metrics are collected from each plugin, the cmt_cat function is used to append an entire cmt context into the single one that will eventually get sent down the line. This is done because each plugin gets its own separate cmt context, because each plugin has the opportunity to register its own metrics. However, each input, filter, and output plugin also sets up a set of default metrics separately in their own contexts.
Let’s use fluentbit_input_records_total as an example. This metric is registered for every input plugin using the tag in the name label. The registration happens independently in each cmt context for every new input plugin. This counter’s map contains one metric, the counter for this name label for this input plugin. When this context is collected, each counter gets appended to the context.
Imagine there are 3 input plugins, and each one has its own metric context with a registered input_records_total. The problem is that Fluent Bit does not actually recognize that in the full cmt context that these metrics will be added to, there is already an input_records_total, and thus each will be registered as 3 different metrics. Once this gets to the process for encoding individual metrics, there will be a HELP and TYPE banner produced for each one separately, because they aren’t considered by cmetrics to be the same metric. In reality, what we would like is in the overall cmetrics payload with all metrics, there would be one metric representing input_records_total with 3 different metrics in its map for each of the 3 input plugins.
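
To make the behaviour described above concrete, here is a toy model in Python. It is a sketch under stated assumptions, not the cmetrics code: `new_plugin_context`, `naive_cat` and `encode` are hypothetical names, and a context is modelled as a plain list of metric families. Each plugin registers the same metric in its own context, the contexts are appended without looking at names, and the encoder then emits one HELP/TYPE banner per appended family.

```python
def new_plugin_context(plugin_name, records):
    """Each plugin builds its own context holding one family with one sample."""
    return [{
        "name": "fluentbit_input_records_total",
        "help": "Number of input records.",
        "type": "counter",
        "samples": {plugin_name: records},
    }]


def naive_cat(dst, src):
    """Append every family from src without checking whether dst already has it."""
    dst.extend(src)


def encode(context):
    """Emit one HELP/TYPE banner per family, then that family's samples."""
    lines = []
    for family in context:
        lines.append(f"# HELP {family['name']} {family['help']}")
        lines.append(f"# TYPE {family['name']} {family['type']}")
        for label, value in family["samples"].items():
            lines.append(f'{family["name"]}{{name="{label}"}} {value}')
    return "\n".join(lines)


combined = []
for i, records in enumerate([4, 3, 3]):
    naive_cat(combined, new_plugin_context(f"cpu.{i}", records))

output = encode(combined)
```

In this model `combined` ends up holding three separate families with the same name, so `output` contains three HELP lines and three TYPE lines for `fluentbit_input_records_total`.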

Solution

I began looking at this in an assistive capacity for another team; it isn't something that directly affects my work at this time. As such, it is unlikely I will be able to dedicate the time to develop and shepherd a fix myself. However, I've outlined what I think would be the two best possible ways to resolve this which someone else could take on.

Proposal 1: Shared metrics context for each plugin type

One path forward that I see is for all input plugins to share one metric context. This would be the same for filter and output plugins. In this case, the shared metrics context would be wrapped in a struct that also includes the addresses for each of the shared metrics, and when this shared context is passed into the initialization procedure of a new plugin instance, it simply records new values in the existing metrics.

I wrote a proof of concept for this just for input plugins: #9231
Much of the code is a mess, but if you pull it down and build it, then use the following config:

[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

[INPUT]
    Name cpu

[INPUT]
    Name cpu

[INPUT]
    Name cpu

[OUTPUT]
    Name  stdout
    Match * 

You will see the resulting metrics being correctly grouped as the Prometheus Exposition Format specifies.

braydonk@bk:~/Documents/test_flb$ curl localhost:2020/api/v2/metrics/prometheus
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime{hostname="bk.c.googlers.com"} 3
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="cpu.0"} 6628
fluentbit_input_bytes_total{name="cpu.1"} 4971
fluentbit_input_bytes_total{name="cpu.2"} 4971
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="cpu.0"} 4
fluentbit_input_records_total{name="cpu.1"} 3
fluentbit_input_records_total{name="cpu.2"} 3

If this approach would make sense, I would hand it off to @shuaich, my coworker from the team trying to tackle this problem. I think it is a straightforward enough implementation that I could provide the guidance to do relatively simply.

The only open question is thread safety; if using threaded input plugins, I'm not sure if the cmt context is designed for thread safety. Seems to work fine for threaded output plugins so I'm guessing it's okay, but never tried with threaded input plugins.
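
The shape of Proposal 1 can be modelled roughly as follows. This is a hypothetical Python illustration, not the proof-of-concept code from #9231: the class names `SharedInputMetrics` and `InputInstance` are invented for the sketch, and it ignores the thread-safety question raised above.

```python
class SharedInputMetrics:
    """One metric family shared by every input instance (hypothetical model)."""

    def __init__(self):
        self.records_total = {
            "name": "fluentbit_input_records_total",
            "help": "Number of input records.",
            "type": "counter",
            "samples": {},
        }


class InputInstance:
    def __init__(self, name, shared):
        self.name = name
        self.shared = shared
        # Register a new label value in the existing family, not a new family.
        shared.records_total["samples"][name] = 0

    def ingest(self, count):
        self.shared.records_total["samples"][self.name] += count


shared = SharedInputMetrics()
inputs = [InputInstance(f"cpu.{i}", shared) for i in range(3)]
for instance, count in zip(inputs, [4, 3, 3]):
    instance.ingest(count)
```

Because there is only ever one family object per metric name, an encoder walking `shared` would naturally emit a single HELP/TYPE banner followed by all three samples.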

Proposal 2: Adjust cmt_cat to account for metrics that already exist

I'm not sure this is a great path forward, but I'll include it here. The other way I see to accomplish this is for cmt_cat to account for metrics that already exist in the destination context. i.e. if I'm appending a context that contains fluentbit_input_records_total, cmt_cat would need to recognize that fluentbit_input_records_total already exists, and instead of copying the entire metric add it as a value to the existing metric's cmt_map.

This would be a much harder implementation. I think this should only be considered if the thread safety of cmt isn't solid enough for Proposal 1.

If this were the direction chosen, I'd recommend a Fluent Bit maintainer take it on as it is not straightforward and has nuances deep in the library code that aren't straightforward to come up with as a standard community contributor.
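
For comparison, the idea behind Proposal 2 can be modelled like this. Again this is only a Python sketch with hypothetical names (`merging_cat`, `context_for`), with a context modelled as a list of families. The real `cmt_cat` would also have to deal with label keys, metric types and the library's internal data structures, none of which are represented here.

```python
def merging_cat(dst, src):
    """Merge src families into dst, reusing a family when the name already exists."""
    by_name = {family["name"]: family for family in dst}
    for family in src:
        existing = by_name.get(family["name"])
        if existing is None:
            # First time this metric name is seen: copy the whole family.
            copied = dict(family, samples=dict(family["samples"]))
            dst.append(copied)
            by_name[copied["name"]] = copied
        else:
            # Same metric name: add the samples to the existing family's map.
            existing["samples"].update(family["samples"])


def context_for(plugin_name, records):
    """Per-plugin context, as each plugin would build it on its own."""
    return [{
        "name": "fluentbit_input_records_total",
        "help": "Number of input records.",
        "type": "counter",
        "samples": {plugin_name: records},
    }]


combined = []
for i, records in enumerate([4, 3, 3]):
    merging_cat(combined, context_for(f"cpu.{i}", records))
```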

@braydonk
Contributor

CC @edsiper @leonardo-albertovich to look over my proposals

@bbkfhq

bbkfhq commented Aug 27, 2024

I'm also affected by this issue. The presence of duplicate "TYPE" lines breaks Telegraf's parsing.

decoding response failed: text format parsing error in line 10: second HELP line for metric name "fluentbit_input_bytes_total"


@edsiper
Member

edsiper commented Sep 5, 2024

I've pushed a draft PR to CMetrics to fix this: fluent/cmetrics#222

For testing purposes, I created a test branch of Fluent Bit here:

Folks, would you mind giving the test branch a try? Any help is appreciated.

@lecaros
Contributor

lecaros commented Sep 6, 2024

Hi @edsiper
I was able to reproduce the issue with Telegraf and Fluent Bit 3.1.7.

podman run --rm -v $(PWD)/telegraf.config:/etc/telegraf/telegraf.conf:ro --entrypoint=telegraf telegraf
2024-09-06T21:21:48Z I! Loading config: /etc/telegraf/telegraf.conf
2024-09-06T21:21:48Z I! Starting Telegraf 1.31.3 brought to you by InfluxData the makers of InfluxDB
2024-09-06T21:21:48Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-09-06T21:21:48Z I! Loaded inputs: prometheus
2024-09-06T21:21:48Z I! Loaded aggregators: 
2024-09-06T21:21:48Z I! Loaded processors: 
2024-09-06T21:21:48Z I! Loaded secretstores: 
2024-09-06T21:21:48Z I! Loaded outputs: exec
2024-09-06T21:21:48Z I! Tags enabled: host=8e2205bd85d8
2024-09-06T21:21:48Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"8e2205bd85d8", Flush Interval:10s
2024-09-06T21:21:48Z I! [inputs.prometheus] Using the label selector:  and field selector: 
2024-09-06T21:21:50Z E! [inputs.prometheus] Error in plugin: error reading metrics for "http://192.168.100.61:2020/api/v2/metrics/prometheus": decoding response failed: text format parsing error in line 10: second HELP line for metric name "fluentbit_input_bytes_total"
2024-09-06T21:22:00Z E! [inputs.prometheus] Error in plugin: error reading metrics for "http://192.168.100.61:2020/api/v2/metrics/prometheus": decoding response failed: text format parsing error in line 10: second HELP line for metric name "fluentbit_input_bytes_total"
^C2024-09-06T21:22:03Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-09-06T21:22:03Z I! [agent] Stopping running outputs

I've also used the branch from #9360 to validate the fix.

 podman run --rm -v $(PWD)/telegraf.config:/etc/telegraf/telegraf.conf:ro --entrypoint=telegraf telegraf
2024-09-06T21:28:04Z I! Loading config: /etc/telegraf/telegraf.conf
2024-09-06T21:28:04Z I! Starting Telegraf 1.31.3 brought to you by InfluxData the makers of InfluxDB
2024-09-06T21:28:04Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-09-06T21:28:04Z I! Loaded inputs: prometheus
2024-09-06T21:28:04Z I! Loaded aggregators: 
2024-09-06T21:28:04Z I! Loaded processors: 
2024-09-06T21:28:04Z I! Loaded secretstores: 
2024-09-06T21:28:04Z I! Loaded outputs: file
2024-09-06T21:28:04Z I! Tags enabled: host=519cbe732e08
2024-09-06T21:28:04Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"519cbe732e08", Flush Interval:10s
2024-09-06T21:28:04Z I! [inputs.prometheus] Using the label selector:  and field selector: 
fluentbit_uptime,host=519cbe732e08,hostname=chronolap.local,url=http://192.168.100.61:2020/api/v2/metrics/prometheus counter=152 1725658090000000000
fluentbit_output_proc_records_total,host=519cbe732e08,name=stdout.0,url=http://192.168.100.61:2020/api/v2/metrics/prometheus counter=1207 1725658090000000000
fluentbit_storage_fs_chunks_down,host=519cbe732e08,url=http://192.168.100.61:2020/api/v2/metrics/prometheus gauge=0 1725658090000000000
fluentbit_input_bytes_total,host=519cbe732e08,name=dummy.0,url=http://192.168.100.61:2020/api/v2/metrics/prometheus counter=27324 1725658090000000000
fluentbit_input_bytes_total,host=519cbe732e08,name=dummy.1,url=http://192.168.100.61:2020/api/v2/metrics/prometheus counter=16416 1725658090000000000

Given that @shuaich already tested the Prometheus Golang scraper, I'd say the fix works.

@edsiper edsiper added this to the Fluent Bit v3.1.8 milestone Sep 16, 2024
@edsiper
Member

edsiper commented Sep 16, 2024

fixed with #9392 (master) and #9393 (3.1)
