Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve prometheus metric mapping to internal metrics / line protocol #4415

Closed
danielnelson opened this issue Jul 12, 2018 · 14 comments · Fixed by #6703
Closed

Improve prometheus metric mapping to internal metrics / line protocol #4415

danielnelson opened this issue Jul 12, 2018 · 14 comments · Fixed by #6703
Labels
area/agent area/prometheus feature request Requests for new plugin and for new features to existing plugins
Milestone

Comments

@danielnelson
Copy link
Contributor

danielnelson commented Jul 12, 2018

Updated April 5, 2019. Check the edited ^^ history for old revisions.

Proposal

Example Input:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

Current behavior:

rpc_duration_seconds 0.01=3102,0.05=3272,0.5=4773,0.9=9001,0.99=76656,count=2693,sum=17560473
http_request_duration_seconds sum=53423,0.05=24054,0.1=33444,0.2=100392,0.5=129389,1=133988,+Inf=144320,count=144320
http_requests_total,code=400,method=post counter=3
http_requests_total,code=200,method=post counter=1027

Desired behavior:

prometheus rpc_duration_seconds_count=2693,rpc_duration_seconds_sum=17560473
prometheus,quantile=0.01 rpc_duration_seconds=3102
prometheus,quantile=0.05 rpc_duration_seconds=3272
prometheus,quantile=0.5 rpc_duration_seconds=4773
prometheus,quantile=0.9 rpc_duration_seconds=9001
prometheus,quantile=0.99 rpc_duration_seconds=76656

prometheus http_request_duration_seconds_sum=53423,http_request_duration_seconds_count=144320
prometheus,le=0.05 http_request_duration_seconds_bucket=24054
prometheus,le=0.1 http_request_duration_seconds_bucket=33444
prometheus,le=0.2 http_request_duration_seconds_bucket=100392
prometheus,le=0.5 http_request_duration_seconds_bucket=129389
prometheus,le=1.0 http_request_duration_seconds_bucket=133988
prometheus,le=+Inf http_request_duration_seconds_bucket=144320

prometheus,code=400,method=post http_requests_total=3
prometheus,code=200,method=post http_requests_total=1027

Use case:

This will make histograms from prometheus use the same format as the histogram aggregator and also help with roundtripping back to the prometheus output.

See #4354

@danielnelson danielnelson added feature request Requests for new plugin and for new features to existing plugins area/prometheus labels Jul 12, 2018
@russorat russorat added this to the 1.11.0 milestone Feb 15, 2019
@danielnelson danielnelson changed the title Improve prometheus mapping for histogram/summary Improve prometheus metric mapping to internal metrics / line protocol Feb 21, 2019
@danielnelson
Copy link
Contributor Author

Update proposal with a method for splitting/joining metric names.

@trevorwhitney I believe if we use the transformations above in the prometheus input and output, it will allow us to add the histogram TYPE in the histogram aggregator and have the output be usable by Prometheus.

@danielnelson
Copy link
Contributor Author

What about points with a simple metric name (without underscores)?

something{foo="bar} 42

@pianohacker
Copy link
Contributor

pianohacker commented Mar 13, 2019

What about points with a simple metric name (without underscores)?

something{foo="bar} 42

This seems like it could be solved by the current approach in the prometheus output plugin: https://github.com/influxdata/telegraf/blob/master/plugins/outputs/prometheus_client/prometheus_client.go#L552-L556

We'd lose roundtripping of metrics ending in value, but I think that's a reasonable price to pay.

@vishiy
Copy link
Contributor

vishiy commented Apr 5, 2019

Can't we just keep the prometheus timeseries as is and have prometheus tags/buckets as tags? Renaming them as they move thru telegraf would change the originality of the metric and would require users to understand this transformation to map to their original prom metric.

# TYPE cpu_usage_user gauge
cpu_usage_user{cpu="cpu0"} 1.4112903225816156

Expected behavior:
prometheus,cpu=cpu0,url=blah cpu_usage_user=1.4112903225816156 1505776751000000000

@danielnelson
Copy link
Contributor Author

danielnelson commented Apr 6, 2019

I think @vishiy's method is the better approach:

  • It elegantly handles the simple metric case.
  • Doesn't have the potential to creating conflicting metrics with common Telegraf plugins
  • No handling of the metric name means we can't guess wrong on how to split it.
  • Same names will reduce friction when working with both Prometheus and InfluxDB.

For users who want metric name splitting, it would be possible to handle this in a separate processor plugin.

I have updated the original post to reflect this style.

@vishiy
Copy link
Contributor

vishiy commented Apr 25, 2019

@danielnelson - Sent a PR out for this (prometheus input changes)
#5767

@Menthalion
Copy link

Menthalion commented Apr 29, 2019

@vishiy @danielnelson The problem with bucket upper bounds as tags is that functions as histogramQuantile that would be needed to calculate quantiles over bucket histograms take a (numeric) field as upper bound expression, not a set of tags. How would you work around that / is there a way to convert tag ('le') values to typed (numeric + Inf) values in a (synthetic) field ?

Without functionality like that it's still impossible to use efficient Prometheus histogram queries for dashboards.

@danielnelson
Copy link
Contributor Author

Thanks for pointing this out @Menthalion. This isn't actually a new method for encoding histograms, it is already in use in some places of Telegraf such as the histogram aggregator. @nathanielc Is there something we can do to make these compatible with Flux functions?

@juliusv
Copy link

juliusv commented May 15, 2019

Hi everyone, I'm currently building the PromQL->Flux transpiler for InfluxDB and already touched on this topic in influxdata/influxdb#13893. For any PromQL-ish use case, we don't want to do any transformations that are special to the metric type (which is a very loose concept on the Prometheus side to begin with) or metric name. The mapping rules (in Flux / InfluxDBv2 terminology) just have to be dumb like this:

  • metric name -> _measurement column value
  • label name -> column name (but escape _value -> ~_value etc.)
  • label value -> column value (always string type)
  • sample value -> _value column value

The reason is that metric types are also not tracked or special-cased in Prometheus's own storage, so everything is just stored as flat labels and sample values. E.g. the only place where an le label is interpreted specially is when you feed a set of series to the histogram_quantile() PromQL function. Sending data out of Prometheus (most commonly via its remote write feature) also does not preserve metric type metadata, and the le label does not necessarily have to belong to a histogram or contain numeric values in all cases. So making it float-typed and treated separately would be wrong in some cases, and in the other cases it'd just introduce a lot of extra complexity for the PromQL use case. Same for the quantile label. Arbitrarily splitting metric names on underscores and doing guesswork on them is also always going to be unreliable, as the names are based on convention on the Prometheus side, but could really be anything.

In my transpiler work I also introduced a single field f64 to signal that the row stores the Prometheus sample value (and as there needs to some field), but it doesn't really matter what that field is named. Now there's the suggestion here to store the Prometheus metric name as a field name instead of in the _measurement column (and then that f64 field wouldn't be needed). I think this would be fine for the PromQL use case as well if people prefer it. It should only require minor changes to the transpiler.

@danielnelson
Copy link
Contributor Author

Hi Julius. Don't worry, this proposal has no special handling based on prom type, and any ideas of splitting the name have also been dropped (one can always do this in a processor instead).

I feel like our handling of le here is also the consistent with Prometheus and will provide good interop, as well as working well on with InfluxDB. I strongly think we should continue to use tags for histogram buckets in general, as it doesn't require a field glob and parsing bucket bounds out of field name like the alternative I have seen does. I would like to see the Flux histogramQuantile be adapted to work with this style.

The _measurement and _value concepts are query side only, on write we need to use line protocol which has more structure than just columns. I am strongly in favor of using metric_name -> fieldkey because it provides:

  • Better support for InfluxQL
  • Better support for Kapacitor
  • Better support for Telegraf processors and aggregators.
  • More performant behavior within Telegraf and when sending to InfluxDB.
  • Most importantly, consistent naming with other Telegraf metrics with respect to what the measurement and field names represent.

@juliusv
Copy link

juliusv commented May 16, 2019

@danielnelson Thanks! If I understand you correctly you are agreeing with the simple mapping I proposed, plus the variant where you store the metric name in such a way that it would appear as the value of the _field column in Flux. That sounds good.

And the le and quantile labels will become normal string-typed columns? The current histogramQuantile() in Flux expects float-typed le values, but I made a copy of that function and changed it to expect string-typed le values and also behave differently in many other small ways so it behaves exactly like PromQL's histogram_quantile() function. See juliusv/flux@ac8e300. Whether to drop the existing universe.histogramQuantile() or keep it with different behavior is yet to be decided, but for my work I needed something to work exactly like PromQL.

@danielnelson
Copy link
Contributor Author

If I understand you correctly you are agreeing with the simple mapping I proposed, plus the variant where you store the metric name in such a way that it would appear as the value of the _field column in Flux.

I don't actually know how the columns are named in flux, but the field name in the database would be the prometheus metric_name, for example: http_request_duration_seconds_bucket. Check the "desired behavior" in the top comment for examples of how it will look.

And the le and quantile labels will become normal string-typed columns?

Yes, I think they must be written as tags in order for this style of encoding to work and all tags are string typed.

@juliusv
Copy link

juliusv commented May 16, 2019

That sounds good then, thanks.

all tags are string typed.

Ah ok, wasn't 100% sure of this in the line protocol, since tags translate into columns in Flux, but columns can also have non-string types (float, int, string, bool). Good!

@danielnelson
Copy link
Contributor Author

Closed in #5767

lanzafame added a commit to filecoin-project/sentinel that referenced this issue Sep 3, 2020
This updates the metric_version of the prometheus input to version 2,
which brings the telegraf mapping more inline with how prometheus
natively outputs metrics, especially histograms and summaries.

See influxdata/telegraf#4415 for more
information.
placer14 pushed a commit to filecoin-project/sentinel that referenced this issue Sep 18, 2020
This updates the metric_version of the prometheus input to version 2,
which brings the telegraf mapping more inline with how prometheus
natively outputs metrics, especially histograms and summaries.

See influxdata/telegraf#4415 for more
information.
placer14 added a commit to filecoin-project/sentinel that referenced this issue Oct 14, 2020
* feat(telefraf): change the prometheus->telegraf metrics mapping (#113)

This updates the metric_version of the prometheus input to version 2,
which brings the telegraf mapping more inline with how prometheus
natively outputs metrics, especially histograms and summaries.

See influxdata/telegraf#4415 for more
information.

* feat(dashboard): Update dashboards for Grafana 7.1 (#135)

* feat(dashboard): Update dashboards for Grafana 7.1

* fix(docker): Set specific version of grafana for docker-compose

* feat: Rename Telegraf to Sentinel Drone (#131)

* feat: Add visor to Makefile; CI build (#149)

* feat: add visor to makefile

* Update Makefile

Co-authored-by: Oli Evans <[email protected]>

* chore: Remove sentinel- prefix in CLI usage

* chore: Add rust deps to CI build env

* chore: Add latest visor

Co-authored-by: Oli Evans <[email protected]>
Co-authored-by: Mike Greenberg <[email protected]>

* dep: Update to [email protected] (#155)

* feat(grafana): Update dashboards for latest Visor schema (#156)

* chore: Fixup Makefile and docker-compose; Update submodules (#157)

* chore: Remove redis docker-compose svcs; Create jaeger config

* chore: Update docker-compose timescale svc to pg12

* chore: visor-indexer and -processor make directives use new subcommand

* deps: Update to [email protected] and [email protected]

* deps: Update to [email protected]

* chore: Remove chainwatch services; Fix visor services

* fix(docs): Correct documentation for visor and naming

* Apply suggestions: Service desc edits; log output adj in service

Co-authored-by: Adrian Lanzafame <[email protected]>

* Update dashboards chain_power -> chain_powers

Co-authored-by: Adrian Lanzafame <[email protected]>

Co-authored-by: Adrian Lanzafame <[email protected]>
Co-authored-by: Ian Davis <[email protected]>
Co-authored-by: Oli Evans <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/agent area/prometheus feature request Requests for new plugin and for new features to existing plugins
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants