Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync with Prometheus best practices #1

Closed
brian-brazil opened this issue Aug 6, 2016 · 17 comments
Closed

Sync with Prometheus best practices #1

brian-brazil opened this issue Aug 6, 2016 · 17 comments

Comments

@brian-brazil
Copy link

brian-brazil commented Aug 6, 2016

Hi,

It's great to see you're looking to use Prometheus monitoring. I noticed a few things that are out of line with Prometheus best practice, which will cause friction in future and limit the benefits you'll get.

In https://github.com/docker/go-metrics/blob/master/suffix.go: The _count suffix is used by Summary and Histogram metrics, it should not be be used elsewhere. Seconds are the standard, no other units should be offered. We also generally go for seconds rather than s as it tends to be more readable.

In https://github.com/docker/go-metrics/blob/master/namespace.go: The initial comment says "allows const labels to be added to all metrics created in this namespace and are commonly used for data like application version and git commit". This is an anti-pattern as it makes it much harder to work with the metrics, you should have a single metric with the value 1 and all the labels you need. See http://www.robustperception.io/how-to-have-labels-for-machine-roles/ for the general approach.

In
https://github.com/docker/go-metrics/blob/master/namespace.go#L53 you're putting a _count suffix on Counters. The convention is _total for Counters.

There's also a variety of other issues that limit what the user should be able to do with a Prometheus client library, per https://prometheus.io/docs/instrumenting/writing_clientlibs/ Rather than losing this functionality and semantics, I'd recommend using the Prometheus client library directly and filing bugs for anything you find lacking (the lack of timing utilities in Summary/Histogram being the obvious one).

@juliusv
Copy link

juliusv commented Aug 6, 2016

Yeah, super excited to see this as well - does this mean we can expect Docker to export native Prometheus metrics? :)

Anyways, I agree it would be beneficial to be more in line with upstream Prometheus conventions if possible. Upstream docs for that: https://prometheus.io/docs/practices/naming/

@crosbymichael
Copy link
Contributor

@brian-brazil thanks. I'll try to update with some of the conventions that you mentioned. We are just getting started so some things will be a little off.

A couple of questions/comments:

For const labels, we don't always have a static cluster of machines, things come and go, especially on our CI so we want the applications to report their version and not have some human to input these. I don't understand why this makes it harder to work with a metric, can you explain why? We also want to see performance over different versions to identify regressions in start times ,etc, so this label information is valuable for our use cases. Its not a machine role for us but an attribute of the metric.

We don't want to expose the full client lib because we want to add metrics to all of our projects and the hardest part about metrics is not the implementation, it is enforcing the naming conventions. This package is meant to allow everyone working on docker projects to have our naming, suffixes, types, etc be consistent.

@juliusv yes, that is the plan.

@brian-brazil
Copy link
Author

For const labels, we don't always have a static cluster of machines, things come and go, especially on our CI so we want the applications to report their version and not have some human to input these. I don't understand why this makes it harder to work with a metric, can you explain why?

It's fine to report your version, but not okay to put it on all metrics. What you're doing here means that every person using your metrics now needs to be aware that that label is present, and that it will get in the way when doing aggregation.

Its not a machine role for us but an attribute of the metric.

What do you mean when you say it's an attribute of the metric? It's an attribute of the binary/process, not of the few lines of code that specify the metric.

The technique for handling these is the same as for machine roles. Look at prometheus_build_info on Prometheus itself for example.

We don't want to expose the full client lib because we want to add metrics to all of our projects and the hardest part about metrics is not the implementation, it is enforcing the naming conventions.

I can understand where you're coming from, however you're losing a lot by removing functionality and changing the API. For example you're encouraging all metrics to have labels, whereas in fact most metrics tend not to have labels, and losing the ability to perform related important performance optimisations.

@crosbymichael
Copy link
Contributor

How do you correlate something like prometheus_build_info for the other metrics collected? One of the use cases is that we want to make sure when we are working on a new version or a PR/feature was merged, did it change the container start from from the previous version.

@brian-brazil
Copy link
Author

You can use group_left to copy over the label when you need it, without getting in the way of more general use cases.

@aluzzardi
Copy link

@brian-brazil @juliusv Awesome, thanks a lot for the feedback!

(@juliusv: long time no see!)

Yeah, super excited to see this as well - does this mean we can expect Docker to export native Prometheus metrics? :)

Indeed :)

I'm really looking forward to get naming feedback and such. I'm trying to remember the conventions from Borgmon but it's been a while. I do remember however what a pain it was to not have strong conventions and the tragic errors that occurred when people misunderstood a metric's intent.

@brian-brazil
Copy link
Author

I'm really looking forward to get naming feedback and such.

I actually just wrote http://www.robustperception.io/on-the-naming-of-things/ which covers the basic syntactic issues, in addition to what's on the project website such as https://prometheus.io/docs/instrumenting/writing_exporters/.

I'm trying to remember the conventions from Borgmon but it's been a while. I do remember however what a pain it was to not have strong conventions and the tragic errors that occurred when people misunderstood a metric's intent.

Misunderstandings will always happen, but we can try to reduce them. My guideline is that a metric name should at least let someone make a good guess as to what it means.

Two of the biggest pains I found were more around not understanding instrumentation's intent by not having metrics inline, and using wrapper libraries around the core instrumentation libraries to make things "simpler". Both make it frustrating to use metrics to debug real world problems, as they can easily take what should have been 2 simple lines of code that's right where you're looking at and turn it into 200 spread across 10 files that produce less-valuable instrumentation.

@crosbymichael
Copy link
Contributor

cc @stevvooe

@stevvooe
Copy link
Contributor

stevvooe commented Aug 9, 2016

Thanks @brian-brazil @juliusv for the interest! We are currently exploring better Docker metrics across the board. For the most part, we are looking at maintaining solid types and consistent naming across Docker projects.

For the most part, the prometheus export model is sufficiently flexible to support a large number of metrics use cases. We have had issues in other areas.

In particular, the client has a lot of room for error. I'm not sure how we would go about flowing these upstream without asking for a rewrite.

@brian-brazil

You can use group_left to copy over the label when you need it, without getting in the way of more general use cases.

Do you have example?

We want to be able to compare the performance of different versions of docker running in a cluster of machines. This is an extremely common general question and I am not quite seeing how group_left can work with histogram summaries.

@brian-brazil
Copy link
Author

brian-brazil commented Aug 9, 2016

In particular, the client has a lot of room for error. I'm not sure how we would go about flowing these upstream without asking for a rewrite.

I'm not sure it's really possible to prevent errors as to be useful you have to be flexible. Encouraging good practices is about the most that's practical in my opinion. It's my experience that wrappers like this harm more than they help.

We want to be able to compare the performance of different versions of docker running in a cluster of machines.

Offhand its some_metric group_left(version) on(instance) my_version_metric

@stevvooe
Copy link
Contributor

Offhand it's some_metric group_left(version) on(instance) my_version_metric

Doesn't this assume that a single instance uses the same version? Won't this fall over if an instance is running multiple versions of the same software?

I'm not sure it's really possible to prevent errors as to be useful you have to be flexible. Encouraging good practices is about the most that's practical in my opinion. It's my experience that wrappers like this harm more than they help.

Unfortunately, the prometheus/client_golang has a number of design issues that will lead to future errors. Leaving this to "encouraging good practices" for a project the size of Docker will lead to a mess.

Here is non-exhaustive list of problems encountered already:

  1. The type-aliasing of Opts is a very weird way of trying to represent inheritance. This is especially awkward when only two other metrics have a few metrics-specific options. It would be a lot easier if these were just one type and types with specific options required an extra argument or something. Having separate types here makes automating namespace/subsystem impossible since each metric has a different option type. This is probably the biggest reason for wrapping the API, as it makes creating a controlled namespace/subsystem hierarchy impossible.
  2. The interface and method bloat makes approaching the package challenging. After some reading around the prometheus docs, it makes some sense, but concrete types would make it easier to work with.
  3. The design is not goal-focused. Users of the API should be able to say, "I need a timer", "I need to report this value" and know exactly what to do.
  4. The lack of units declaration will lead to mistakes, especially when there are such strong and subtle opinions about where to use _seconds, _count or _total. We believe we can make the package ensure the right thing if there with a bit of sugar: NewGauge("active", Units("foos")).
  5. float64, everywhere. How can this possibly make sense when most metric values are fixed-point (bytes, timers, counts, etc.)? Will this cause errors when aggregating large metrics?
  6. Foo and FooVec all over the place. What is a Vec? O, that is how you label things. And for some reason, FooVec has four methods that do the same thing.

I'm afraid the changes we'd suggest to get the client package under control would be akin to a re-write. Is the prometheus project willing to acknowledge and fix these issues? Are you willing to make the Go API compatibility breaks necessary to address these problems?

While some of these seem like nits, these kinds of issues are not easily addressable across hundreds of pull requests. We can solve these now and avoid thousands of hours of work. We can also address this issues without affecting the prometheus project or your API users.

@juliusv
Copy link

juliusv commented Aug 10, 2016

Thanks for those details!

/cc @beorn7 for the client library questions. He's the master of the Go client library and currently reworking parts of it, so this might be of interest to him, and he's the best person to give answers about it. However, he's also on vacation this week, so he probably won't see this before next week.

As for the float64 question, we have an FAQ about why Prometheus stores everything as float64. We haven't had any problems with that in practice, and thus the client libraries also expose this type everywhere. Note that in our storage, float values compress really well if they are just integers.

@brian-brazil
Copy link
Author

Doesn't this assume that a single instance uses the same version?

Yes, this is a very reasonable assumption.

Won't this fall over if an instance is running multiple versions of the same software?

That shouldn't be possible as an instance is a process, but you should still be able to model that.

The type-aliasing of Opts is a very weird way of trying to represent inheritance

That's more an objection to Go itself.

It would be a lot easier if these were just one type and types with specific options required an extra argument or something.

I'm not sure that'd be of overall benefit. You'd be making it harder to create half the metrics and limiting the ability to add options in the future to the other half.

Having separate types here makes automating namespace/subsystem impossible since each metric has a different option type.

That sounds like procedurally generating metric names, which frustrates debugging in my experience. It's important to be able to go from a metric name to a piece of code, preferably with a standard find (I'm not the biggest fan of namespace/subsystem for this reason, they make grepping harder).

I'd tend towards keeping things simple, a linter may be easier to do.

The interface and method bloat makes approaching the package challenging. After some reading around the prometheus docs, it makes some sense, but concrete types would make it easier to work with.

To what exactly are you referring?

The design is not goal-focused. Users of the API should be able to say, "I need a timer", "I need to report this value" and know exactly what to do.

This is something that can be improved, the Go client is currently lacking various utility methods it should have for common use cases like timing.

We believe we can make the package ensure the right thing if there with a bit of sugar: NewGauge("active", Units("foos")).

I'm unsure how much that'd help. How would you handle things where the unit isn't immediately obvious?

The approach we're taking is to standardise on seconds, bytes etc. for units.

float64, everywhere. How can this possibly make sense when most metric values are fixed-point (bytes, timers, counts, etc.)? Will this cause errors when aggregating large metrics?

The math works out that it generally takes over a century to be a problem as a float64 has a 53 bit mantissa.

Foo and FooVec all over the place. What is a Vec? O, that is how you label things. And for some reason, FooVec has four methods that do the same thing.

Other clients chose to try and hide this distinction. The 4 methods are different, and can't be reduced due to Go not permitting function overloading and how it handles errors.

I'm afraid the changes we'd suggest to get the client package under control would be akin to a re-write. Is the prometheus project willing to acknowledge and fix these issues? Are you willing to make the Go API compatibility breaks necessary to address these problems?

There's currently a partial re-working in progress (see prometheus/client_golang#214), but several of these are fundamental to Go and the problem domain. If you'd like to try write a better client, the guidelines are at https://prometheus.io/docs/instrumenting/writing_clientlibs/ but there's a lot of subtleties here and attractive nuisances to be avoided. API design is hard.

@beorn7
Copy link

beorn7 commented Aug 15, 2016

Thanks @stevvooe for the feedback. While the client library was written with a lot of feedback from Go programmers in mind, additional feedback is very much appreciated, especially to get an idea what matters most out there. That's important because the requirements quite often contradict each other, so that I had to choose the side I'll satisfy and the side that will feel left out. Sometimes, I tried to satisfy both sides of an argument, which sometimes works, sometimes stirs up even more controversy.

The type-aliasing of Opts is a very weird way of trying to represent inheritance.

It has nothing to do with inheritance. It's merely convenience because four out of six option structs happen to look the same. From all I have heard, the community pretty much settled on option structs as the preferred way to implement named parameters with default values in Go and the option to add more parameters later without breaking the API, after some other approaches didn't work out.

This is especially awkward when only two other metrics have a few metrics-specific options. It would be a lot easier if these were just one type and types with specific options required an extra argument or something.

Pulling out the arguments specific for Summaries and Histograms would defeat the reason for having option types in the first place (namely to not have a large number of arguments on each function, to provide meaningful default values and argument naming that is visible in the call, and to allow adding more arguments later without breaking the API).

We could have one shared options struct for Counter and Gauge, but that would be first of all confusing (and some would find that quite awkward), and it would break once the options struct would behave differently (which is likely to happen during my current rework).

In conclusion, the current approach seems to be the least bad.

Having separate types here makes automating namespace/subsystem impossible since each metric has a different option type. This is probably the biggest reason for wrapping the API, as it makes creating a controlled namespace/subsystem hierarchy impossible.

The namespace/subsystem is a well-intended attempt of the previous author of the client library to guide users into structuring their metric naming. All those components are simply joined with _ to create the final metric name. It's nothing like "real" namespacing (which is a much more involved effort to be tackled throughout the stack in the mid or long term). When I took over the client library development, I wanted to keep the spirit of my predecessor, but I carefully designed things to keep namespace/subsystem optional. Thus, if you wish to automatically generate metric names, feel free to only use the Name parameter and generate the string it is set to by any means of your choice. (But please take @brian-brazil 's concerns into account for an informed decision.)

The interface and method bloat makes approaching the package challenging. After some reading around the prometheus docs, it makes some sense, but concrete types would make it easier to work with.

Could you elaborate on this? Counter, Gauge, Summary, and Histogram are mostly interfaces to be able to provide different implementations to swap in (e.g. for testing purposes, but also for the planned "no-op" implementations, see prometheus/client_golang#196 ). Also, in the current implementation, all the "single value metrics" are backed by the same implementation (which is more implementation convenience than a design goal). If there is one thing that itches me, then it is the fact that the various Vec types are concrete and not interfaces.

Could you explain why interfaces are more difficult to work with? Go is designed to make working with interface types seamless.

The design is not goal-focused.

Unsurprisingly, I have to disagree.

Users of the API should be able to say, "I need a timer", "I need to report this value" and know exactly what to do.

Even without the (planned) utility methods @brian-brazil mentioned above, all those things are straight forward and easy to accomplish. A metric in the Prometheus universe has a name, a doc string, and potentially labels. You have to specify those somewhere. And timing is essentially a defered call to Observer or Add. There is very little room left to streamline this.

In this context, I see three things that need improvement:

  1. Documentation (currently only godoc, we either have to make it way more comprehensive, or provide an instrumentation guide separately).
  2. The utility functions @brian-brazil mentioned (I'm working on those).
  3. The creation of "instant" metrics (with NewConstMetric and friends) is historically a late addition to the library and needs quite a bit of streamlining (which is also something I work on as we are speaking).

The lack of units declaration will lead to mistakes, especially when there are such strong and subtle opinions about where to use _seconds, _count or _total. We believe we can make the package ensure the right thing if there with a bit of sugar: NewGauge("active", Units("foos")).

Not much to add to what @brian-brazil said. In my perfect world, the unit would be part of the metric meta-data (same as the type or the doc string), and the Prometheus server would actually make use of the meta-data. That would be closer to your idea, but neither is true right now. So we recommend to have an applicable unit in the metric name. We want to keep the metric name transparent throughout the stack, i.e. the name you pick when instrumenting the code should be recognizable when you run queries. (I acknowledge that this is not entirely true at the moment as there are "magic" suffixes like _sum, _count, _bucket, triggered by the inability of the current Prometheus server to handle complex metric types as first class citizens. But we don't want to obfuscate the metric name generation even more. From that perspective, the namespace/subsystem is problematic, too. But as said, feel free to ignore and consider it legacy.)

Naming conventions are acceptable, even if they are not programatically enforced. (We could have a Prometheus linter one day. ;)

float64, everywhere. How can this possibly make sense when most metric values are fixed-point (bytes, timers, counts, etc.)? Will this cause errors when aggregating large metrics?

This has nothing to do with the client library. It's a deliberate design decision of Prometheus. If you wrapped this away, you would simply obfuscate the Prometheus data model.

Foo and FooVec all over the place. What is a Vec? O, that is how you label things.

Yes, vector is a pretty common term in the Prometheus ecosystem. When the naming in the client library was discussed, Vec turned out to be both short and meaningful. Without a certain level of understanding of the Prometheus data model, you won't be able to come up with meaningful instrumentation anyway.

And for some reason, FooVec has four methods that do the same thing.

That's a result of trying to satisfy different groups of users. There is the "pretty and safe" method of picking labels:
someVec.With(prometheus.Labels{"foo": "bar", "dings": "bums"})

And then there is the quick and dirty:
someVec.WithLabelValues("bar", "bums")

The latter has the problem that you have to know the order of labels and you can easily get it wrong. The former has quite a performance cost as it involves creating a map. Instrumentation is often in your hot code path, so performance might matter a lot. Therefore, me have to provide the latter way. We could drop the former, but it would remove the option to play safe outside of the hot code path.

That leaves us with two ways of picking labels. The other two are the result of a common Go idiom: To have one version of a method that returns an error, and another one that panics instead. The latter usually begins with a Must.... After careful consideration, I decided to not use the Must... naming in this case because the Must... version is the one used in 99% of cases and should be the short and easy-to-read one. However, since there are a few cases where you need to handle the error instead of panicking, I also had to provide the versions that return an error. Ideally, it would work like a native map access or a type assertion in Go, where the same syntax gives you both, depending on the number of return parameters. However, that's only possible as a built-in in Go.

The conclusion here is: Goal-focused design lead to the (unfortunate) necessity to have four methods for the same thing.

Hope that help to provide context and to get an idea where I'm planning improvements and where the current state is a deliberate design decision.

@crosbymichael
Copy link
Contributor

We are interested in prom for metrics in our projects because of the output format. It is very simple and robust and I can create a naive but working client in a few lines of code.

package main

import (
    "fmt"
    "sync/atomic"
    "time"
        "net/http"
)

var counter int64

func main() {
    http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "my_metric %d\n", counter)
    })
    go func() {
        for i := 0; i < 500; i++ {
            atomic.AddInt64(&counter, 1)
            time.Sleep(10 * time.Second)
        }
    }()
    http.ListenAndServe(":8080", nil)
}

We are still in the very beginning steps but the approach to client libs/abstractions with prom and the prom way or the highway makes us want to reconsider using it and look for something else.

We have some goals that we want to hit and we are still experimenting but arguing over how to populate that fprintf in an http handler is not helping us as we have barely started experimenting with things.

Who knows, maybe we will find out that we are totally wrong and use the underlying prom client directly but it too soon to tell, we just know that its not doing the job of enforcing namespaces/subsystems for us and that is why are wrapping some of it.

@brian-brazil
Copy link
Author

I don't believe that is a correct client, and likely has undefined behaviour due to accessing counter non-atomically. This stuff is tricky to get right, and we've already solved these problems in an efficient manner.

Choosing to use a different client is up to you, but may result in your code and metrics being harder to use for Prometheus users generally due to being non-standard.

@crosbymichael
Copy link
Contributor

crosbymichael commented Aug 15, 2016

That wasn't the point in my 20 line example but lets all go back to doing something productive. Thanks for the feedback all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants