-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
node-artedi histograms should be compatible with Prometheus #17
Conversation
README.md
Outdated
@@ -40,7 +40,10 @@ collector.collect(artedi.FMT_PROM, function (err, metrics) { | |||
|
|||
var histogram = collector.histogram({ | |||
name: 'http_request_latency_ms', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to change these examples to use '_seconds' where they're set to '_ms.' Since we're trying to do things better now we may as well stop misleading users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me!
Thanks! Great analysis. I just posted a couple small notes in the code. Before we merge this it would be nice to make sure this works with an existing service - a Triton or Manta service. It looks good to me, but if we're going to do a major rev. we may as well make sure we don't need a bug fix patch right out the gate. |
On the release management front here, I think we should get at least one bucket generator function into the library before doing the new release. If we do a release without adding any, services using this are going to reinvent their own code for generating buckets and we'll be stuck with almost-copies of that code around the place forever. |
Ok, sure. |
…tions, including one compatible with previous artedi
Just so it doesn't get lost, for reference here are the possible buckets that it is possible to generate with node-artedi 1.x:
There's no generator function in this PR that generates buckets using this pattern (there was briefly, but that was subsequently removed so that we don't encourage its use) but I'm including them here because I think they might be helpful as a reference when thinking about how to update software. |
README.md
Outdated
// http_request_latency_seconds{method="putobjectdir",le="10"} 1 | ||
// http_request_latency_seconds{le="+Inf",method="putobjectdir"} 1 | ||
// http_request_latency_seconds_count{method="putobjectdir"} 1 | ||
// http_request_latency_seconds_sum{method="putobjectdir"} 998 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this _sum value would be .998, right?
lib/collector.js
Outdated
@@ -344,9 +345,13 @@ function processTriggers(opts, cb) { | |||
} | |||
|
|||
module.exports = { | |||
artedi1Buckets: lib_buckets.artedi1Buckets, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function gone now?
lib/buckets.js
Outdated
assert.ok(min > 0, 'min must be > 0'); | ||
|
||
// TODO | ||
// - limit on factor? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you're still working on this, but I think it would be good to get some of these TODOs in. They would be breaking changes in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm still working on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KodyKantor do you have suggestions for what sanity limits we should set on these generator functions? From the prometheus source, it seems the observed value is a float64 which from this: prometheus/prometheus#3031 seems like we should probably avoid numbers > 2^53? Other than that limit, I'm not sure what to check here. We probably should put some bound just to prevent people from creating more buckets than is ever reasonable. 100 buckets sounds like a lot to me, but 1000 seems like it would always be a mistake. But I don't know what others think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joshwilsdon, yes I think limiting to (2^53)-1 to keep the buckets within the node.js limit is a good idea.
I think stopping at 100 buckets across the board sounds good. 100 is more than enough buckets to do the sort of things we do today. The only reason I would say we should have 100 and not 75 (or another arbitrary number) is that someone might want to observe percentage as integers and keep one percent value per bucket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http://linuxczar.net/blog/2016/12/31/prometheus-histograms/#closing-thoughts suggests that case of over 100 buckets might be reasonable. He discusses an example with ~600 buckets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My personal preference would be to not place arbitrary limits like 100 or 1000 on bucket count or value. Instead, I think it is reasonable to only limit it to buckets that will be valid. With that in mind, the only invalid bucket count and value would be anything bigger than 2^31 - 1
, which is the max safe signed int as specified in the golang histogram that Prometheus uses. That said, it might be worth logging warn for anything over say 1000 buckets with a note about how it may create potential performance problems for the collecting endpoint.
docs/API.md
Outdated
// returns [ 1, 2, 4, 8, 16 ] | ||
``` | ||
|
||
#### logLinearBuckets(base, lowPower, highPower, bucketsPerMagnitude) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're missing the artedi.
namespace preface here
Thanks for the detailed write-up! Am I understanding correctly that the root issue here is that the buckets are dynamically generated and that we leave out larger buckets when we haven't seen data points that large? And that means that the range covered by the last bucket may vary depending on what measurements have already been observed? i.e., if we continued to dynamically generate the buckets, but ensured that we always generated all of them for each set of labels, that would work? (Not saying we should do that. The proposed solution makes sense to me.) I'm wondering about the upgrade and rollout part. (Sorry if this is just my lack of really understanding the problem or solution here.) Being a major bump, I assume we'd want to update CHANGES.md with instructions for a consumer about what they need to know and what changes they would make when they update their software to the new version. It seems like they could be in a situation where they weren't affected by this before (e.g., if they weren't using many labels, or if they tended to have the same distribution of values that filled out all the same buckets quickly after startup). Now, if they update to v2, their programs would be generating a different set of buckets. In that case, during a phased rollout of the new version of their program, they'd actually be introducing the same problem we're trying to solve here (namely, different instances returning different buckets). Here's an idea:
Alternatively, use a separate type, like "histogram_legacy"? I'm hoping this is relatively easy to do (if you were already going to support the generator function). Again, sorry if this is just my lack of understanding about the problem. If the old data is truly non-sensical, maybe this isn't worthwhile. But in practice it has matched the behavior we'd expect in a lot of cases, and it would be pretty painful if the intermediate state (some muskies updated before others) made the data worse. It's likely to be weeks or even months between updating the first and last muskies covered by some of our production dashboards. |
@davepacheco thanks for taking the time to read it and provide feedback!
The dynamic buckets are indeed the biggest problem here. However it's not just the larger buckets. Smaller buckets are also a problem. The real problem is that everything you compare really needs to be using the same set of buckets. If you have a service with multiple instances and instanceA starts up and its first observed value is 0.850, with artedi 1.x you'll get the buckets:
then if instanceB starts up and gets a different value (say 0.800) you get a totally different set of buckets:
now if you tried to do a query like:
you would be getting an invalid result. (In this case I believe you'd get a P50, P95 and P99 of 8.1 seconds since 8.1 is the largest bucket) The reason you get bad data is because you're comparing two different sets of buckets and instanceB is missing data in almost all the buckets used by instanceA (all but 1). The only way you can get valid results is when the query you're using for histogram_quantile only involves metrics with exactly the same buckets. In this case if for this service, we had buckets:
for both instances from the start, we would be able to include both in a histogram_quantile without getting incorrect data. To continue to dynamically generate the buckets, we'd have to come up with a way to ensure that each instance of a service always generates the same set of buckets. Which is basically the same thing as using a function to generate a fixed set of buckets up-front and passing that in when creating the histogram. Also note that if you were dynamically creating in the sense that we're adding the bucket when the program first observed a value that needed that bucket, we'd need to both:
failure to do either of those would be very likely to result in invalid data when trying to do a histogram_quantile. Also note that any buckets any instance of any service has used should be included in all future results. Permanently adding additional buckets to all instances seems (I haven't dug into this too deeply but from my experiments so far) to be fine, but removing buckets will break things. That's why on startup its important for your application that has been dynamically generating buckets to load from disk the set of all buckets it (or any other instances) has used and start with that set of buckets. From my perspective, it's easier just to change such that the set of buckets is fixed in the code so that there's no need for coordination between instances or storage of which buckets have been generated so far.
I can try to add some more detailed instructions/explanation to CHANGES.md. That's a good point.
I suppose it's possible. If your application only ever produced a very narrow range of values such that you always have exactly the same buckets for every instance without ever having any outliers and if you had never restarted your service such that it started out with fewer buckets, or you get requests across your whole range of values immediately on startup... That seems like a pretty unlikely case to me though. But it is possible.
Yeah, my recommendation in the upgrade notes will be that you really should either:
I did have an artedi.artedi1Buckets that generated buckets compatible with artedi 1.x but that was removed because it was suggested that this would just be extending people's problems rather than encouraging them to solve them. You can see that implementation here: which basically just gives you the option to select a portion of the 1.x bucket space:
If you mean having an If you mean having it work the way the current histogram function works (dynamically generating buckets) I think that's probably not a good idea since that would be generating bad data that people will likely eventually need to reset their prometheus to remove. But it would be possible to do if we make it clear to people what the consequences are for using that.
I guess it sounds here like you meant for histogram_legacy to continue to do exactly the same thing that the 1.x version did?
The data itself is not nonsensical, it's just that it's incompatible with an actual Prometheus histogram if you've ever had outliers or if you're ever using different buckets between instances of a service, or really any combination of labels. The way it is right now, if you have a label for status_code (which we do in Triton) you can have a different set of buckets for status_code=500 from status_code=200 even for the same service and instance. This fact also makes histogram_quantile incorrect for that service. Basically when using |
@davepacheco A potential way one could upgrade a lot of, say, muskies to an artedi v2 that no longer supports the v1 bucket generation is perhaps this:
|
If we can't just drop the artedi v1 buckets then trentm's idea sounds good. It would be nice to revisit our metric names and units. Do we actually need to provide backwards compat with v1 buckets? It seems like having bucket mismatches is what this issue is all about. If a subset of artedi consumers moved right from v1 to v2 buckets (creating a guaranteed mismatch) without changing metric names/labels/units, maybe the problem wouldn't be any worse than it already is. I think that's something we could test. |
fwiw: while searching for something else I just found that another project has had a similar problem because they did dynamic bucket creation (they removed support due to this issue) and one of the prometheus developers chimed in confirming that this is not supported. |
This is now a CR in Gerrit: https://cr.joyent.us/#/c/5024/ |
TRITON-928 node-artedi histograms should be compatible with Prometheus Reviewed by: Kody A Kantor <[email protected]> Approved by: Kody A Kantor <[email protected]>
Implemented with 54b21b7 |
node-artedi histograms should be compatible with Prometheus
TL;DR
Problem
If you use node-artedi histograms, any time you try to do a
histogram_quantile
across:you are only ever going to get correct results by accident.
Cause
node-artedi has a different mechanism for handling histogram buckets from other Prometheus clients and a mechanism that's in fact incompatible with Prometheus.
Proposed solution
New major version of node-artedi with histograms compatible with Prometheus.
What's this about?
While doing some investigations into some dashboards, I found some very strange behavior for PromQL queries like:
where that data comes from node-triton-metrics which uses a node-artedi histogram.
This is an attempt to explain what happened, and what I think we should do about it.
To explain the problems properly, we have to gain some understanding of how Prometheus histograms work. So we'll go through that first.
Digression: What's a Prometheus histogram?
Let's start with a digression.
In Prometheus, a histogram is a single metric type that takes observed values and places those values in "configurable buckets". It also exposes a "sum" and a "count" for all values that have been observed. The buckets themselves are just monotonic counters. In a simple case, imagine we were trying to keep track of request times (a common use case) and we created a histogram with buckets for second values (these are the default buckets that all libraries I'm aware of other than node-artedi use, more on that later):
There's also a special bucket which always exists called
+Inf
which gets incremented for every observation.Also note that with the exception of a few values that were removed, these default values are actually log-linear with a base of 10, low exponent of -3, high exponent of 1 and 4 linear steps per magnitude.
To update a histogram, you call a function like:
and for that histogram what we would do is increment the counter for every bucket that's less than or equal to the observed value. In this case, since the value observed is 0.016, we'd increment the counter in the buckets:
but we would not update the buckets:
since these do not satisfy the less than or equal condition when the value is 0.016. The way Prometheus exposes this is through a special "label". In Prometheus, labels are displayed as key-value pairs in curly brackets like:
and there's a special label that gets added for histograms called
le
which is used to indicate the bucket. Assuming the observation above was our first ever call tomyHistogram.observe
and the actual metric name wasmyHistogram
, we'd end up with something like the following state if we were to query the /metrics endpoint of our collector:since we will have incremented the counter for every bucket greater than or equal to our value. If we were then to observe a much slower request that took 0.08 seconds... The resulting data would look like:
It is a fundamental property of Prometheus histograms that if there is a counter value x in one bucket, all larger buckets will have a value greater than or equal to x. It's not a legal situation for example if we took the latest data above but instead of 2 placed a value 1 for
myHistogram{le="5"}
, because ifmyHistogram{le="2.5"}
has a value of 2,le=5
must be at least 2.This will be important later.
Disecting an actual query
To make this more concrete, lets look again at the query that led me down this path in the first place:
we'll work from inside out. If you have a recent Triton setup and have also setup triton-prometheus you should be able to follow along in your own Prometheus instance and see similar results.
What's http_request_duration_seconds?
The innermost part of this is:
this histogram is created in node-triton-metrics with:
and then labels are generated which we pass in when we observe the value. The labels look like:
so in Prometheus we will have observations like:
you can execute just the metric name as a query in Prometheus and you should see something like:
which shows a whole bunch of results (in my case 1201 total, a lot of scrolling).
So that's not actually that helpful, what we want to do is filter just for CNAPI, so we add a label and our query becomes:
which will look something like:
which is better, but still has a lot of results. The reason we have so many different results here is that we've added so many labels and this means that each combination of:
we'll have a separate counter in our metrics output on every scrape. This makes things pretty messy. But why don't we make them messier before we clean them up?
Since the values here are just counters they're not so helpful to observe how things are changing which is what we're usually interested in when we're looking at metrics. As such we're going to use the
rate
function to turn the values into something indicating how they're changing.To use rate you need a
range vector
which represents a range of samples instead of a single sample like we had in the last query. To turn the query above into a range vector, we can what's called a range vector selector (side note: Vector Selector could be a good band name) which looks like:the
[5m]
part there is the selector and tells Prometheus to grab all the samples it has for each individual metric/labels set for the last 5 minutes. Since our Prometheus instances typically scrape every 15 seconds, if we didn't miss any scrapes this should be around 20 data points for each metric. So, if you run that query in Prometheus, you'll see something like:which is pretty hard to look at, but shows what's going on here. If you scrolled down, you'd see 266 of these entries each as an array of ~20 values.
Now that we've made it more complex, lets finally start simplifying again. What we'll do is use the rate function. This function takes all those numbers from our range vector and turns them into a single value representing the per-second rate over those times. It does some fancy things to adjust for missed scrapes, bad alignment, etc. But that's unimportant for the sake of our current discussion. The important thing, is that this takes all the values from the last 5 minutes for each metrics/labels pair and gets us back to a single number. This time instead of representing a count, this represents how much the value is changing per-second.
The query now is:
and you should see results something like:
which shows some interesting things that you couldn't tell from the counter itself. Like the fact that we saw no requests with the combination of
status_code=200
anduser_agent="restify/1.6.0"
for theroute="serverget"
route in the last 5 minutes (this is the first 0 on the list). We know this because if there had been any, the+Inf
counter would have been incremented at least once and the rate would not be 0.As yet another slight digression, lets look at a bit more specific query here which just breaks this up a little more. We'll look at just the
route="servereventheartbeat
andmethod="POST"
(the only method supported by this endpoint) entries that also havestatus_code="204"
anduser_agent="cn-agent/2.3.2
with the query:this gets us a much more manageable list of only 36 results. Which almost even fits on one screen:
Note: if you're running this in the future, the exact query might not work (user_agent for example might need to be adjusted for the new version of cn-agent, but you should be able to figure out what to query based on the last query).
What you'll notice with this data, is that the
+Inf
rate is the largest value and as thele
value decreases, the rate also decreases. This becomes even more obvious if we apply 2 additional functions:what we've added here first is
sum(...) by (le)
. What that does is add up each unique set of labels + le, and combine into a single result that's the sum of the values. For example, if we had:and these were the only values for
le="0.01"
the results would include:which is the sum of the value for all
le="0.01"
regardless of color. In our query here, we already had only 1 value for eachle
because everything else is unique, so this just serves to make the Element column a bit more readable in this example.The sort function just sorts the elements by value in ascending order.
So the result should look something like:
which still is in the correct form of a Prometheus histogram, in that for each we see the rate always goes up or stays the same as the
le
value increases. It never decreases.Now we're ready for a query closer to what we're here to talk about:
histo ... what what?
The histogram_quantile function interpolates data and does some magic to give you a single value that represents a "φ-quantile" value from the buckets of a histogram. Basically this allows you to get the N-th percentile value from the histogram based on the buckets. What the above should output is something like:
This output is telling us that 95% of requests (P95) for this endpoint completed within 0.0009982834008097166 seconds based on the per-second rate calculated by the rate function over the last 5 minute's worth of data.
If you scroll up and look at the previous output, you'll see this makes sense. In that data you can see that the rate for
le="0.0009"
was ~1.53 but the rate forle="0.001"
was ~5.94. And the highest rate (+Inf) was only 6.19. So the vast majority of requests completed between 0.0009 and 0.001 seconds. And 95% of all requests completed in less that 0.001 seconds.We can also graph this data now (same query, just hit the graph button):
and we can see that it's tending to stay around that value (0.0009) though there are occasional spikes where the P95 has gone up to about 0.00134 seconds.
Hopefully by this point you understand basically how these functions work, so lets go back to the console and do something crazy.
Imagine you want to know what the P95 request time is across multiple endpoints, or even all the endpoints of your service at once? That seems like a reasonable thing to be curious about. Lets just take 2 endpoints to start:
This introduces one more thing which is that labels for your queries can have
=~
instead of just=
and then we can use a regular expression to include just two of the routes.The output from this query should look something like:
which looks just fine right? It's got entries for both serverget and server list, and we have numbers... So lets go back to doing a sum of the rate again like we did before:
Note: You always want to use rate before sum.
This new query should make data that looks like:
which is starting to look a bit suspicious, if you look carefully. If we sort it the problem becomes more obvious:
the issue here is that we no longer have what seems to be a valid set of data. Whereas before we had a properly behaved histogram where higher
le
values were always greater or equal to lowerle
values. Now we've got:le="6.48"
having a value of 0, butle="0.18"
having a higher rate, of 0.039.le="0.08"
having a value of 0.0035 butle="7.29"
having a rate of 0.So what's going on here?
If we look at a histogram_quantile of this data, it doesn't make a lot of sense:
The heart of the problem here is that we're combining two sets of buckets. The buckets for serverlist look like:
whereas the buckets for serverget look like:
And you can see that even within those there are some anomalies. For example, why is the counter for
le="0.007"
higher than forle="7.29"
for serverget?The problem here is twofold:
The combination of these two factors means that the only way you can actually use artedi histograms and compare them accurately, is to compare using all the labels such that you're always dealing with same set of buckets.
For the example we've been using thus far, that means that:
which we used above is safe, because we've fully specified all the labels. If we fail to specify all the labels and a given metric happens to have more than one value for that label, you will be using different sets of buckets, so your results will only be accurate if you get lucky.
Note that the data here up to the point of histogram_quantile is correct. It just no longer looks like valid Prometheus histogram data. So trying to do histogram_quantile on data that's not valid, is not going to give you valid results.
You can get lucky though. For example, if most of your requests happen to be in the same general range, and you end up with exactly the same buckets. If the buckets are the same between two different combination of metric+labels, you could combine the two and things would be fine. But if any of the buckets are different (which seems to be common at least in Triton) comparing across them is not going to give you what you think you're getting. And doing histogram_quantile will just be confusing.
I'm sure at this point you must be asking yourself: why can't Prometheus handle figuring out some way to do a histogram_quantile histograms with
le
values all over the place? And I'm glad you asked. At least part of the reason is that since PR 2610, when Prometheus sees a histogram that looks like:it assumes that the values are "nonsense" and as it walks up the le numbers it "fixes" the data such that each higher le has the max value. So in this case, the bucketQuantile function runs code that ensures the values are monotonic.
So the data above becomes:
which is a big problem when you're combining multiple endpoints with different characteristics. If you go back to our serverlist and serverget example, you can see that this happened there when we ran histogram_quantile on it. The values there, from the query:
looked like (trimmed to just the relevant bits):
so you can see here, that we're clearly mixing at least 2 different series. One had no value at
le=1.62
and presumably did at some point later (maybe between 8.1 and 684?), and at least one other series seems to have started having values at 0.08 and peaked at 0.18.With what we've described though, the data from 1.62-684 will all be replaced with 0.0386 since those are considered "nonsense" values. In fact nothing matters after 0.18. Whatever the values were after that, if it's less than 0.0386 it would be replaced with 0.0386.
This is why when we ran a histogram_quantile across this data above, we got a value 0.1701... Because as far as the histogram_quantile is concerned, 95% (in fact 100%) of values were less than or equal to 0.18 (since that was the first
bucket with the highest value).
Problem Number 2
Now that we've gone through that problem, there's another issue that we've observed with histograms while looking at the Triton data. Though this one has a similar cause (mixing unrelated buckets) the outcome is much different.
I created a small program (see "Appendix 1: histogram-breaker.js") that uses node-artedi which you can point your Prometheus at by adding a section like:
(replacing the IP with the IP wherever you're running it, it listens on all IPs)
When you run this program, what it will do is:
every 5 minutes, this will produce 900 (300 seconds * 3 values each second) values that are less than 1, and 3 values that are outliers and greater than 10^6. So the outliers will be 0.333...% of the values.
Example data would look something like:
which shows that we're getting the proper distribution of 100 of each of the stream="10^-3", stream="10^-2", stream="10^-1", for every one stream="outlier" value.
If we look at the values as a rate, we see:
which is unsurprising for the 10^* streams, but somewhat surprising for the outlier stream. Remember the rate value here is the per-second value. Since we're reporting 1 of each of the 10^* streams the correct value for those is 1 and Prometheus does a reasonable job of figuring that out. But for the outlier stream there are only at most 3 data points in a given 5m window (300 seconds, and these get reported once every 100s) so it seems to be more than 100% off here (I've seen it be more than 200% off with a value around 0.01). Which is kindof annoying, but shouldn't actually impact even the P99 since we're still talking about less than 1% of the values even in that case.
Since we expect the values of the outlier stream to all be between 1,000,000 and 9,999,999 the following distribution seems reasonable:
since it tells us that all values were less than 13,463,172 and there were no values less than 498,636 and only 1 between 498,636 and 1,495,908. This all seems sane.
The other streams also look sane, for example:
here we'd expect all values to be 0.01 - 0.099 which matches the data here.
What happens then when we combine them? This simulates what we'd do for example if we were talking about http requests across different endpoints where one endpoint might be fast and called frequently and other endpoints which are called infrequently but take a long time. Which we've actually seen in Triton.
What we see when we sum across all of the streams, we see:
The first thing you might notice here is that the values and buckets are all over the place due to the combination of the buckets from the outliers and the other streams. This data looks pretty weird from a Prometheus histogram perspective, since if we try to histogram_quantile over it, we already know (as discussed previously) that the higher le values will get run over with higher values so that the value for each le is >= all those below it.
When we try to take the P95 quantile however, we end up with an even more strange result:
And even if we go down to P60 we have the same result:
which seems surprising since the value there is what we'd expect for an outlier, and yet those represent far less than 1% of the actual data.
What's going on here?
The first thing to note is that the 40389516 value is actually an exact bucket value. And more interestingly it's the highest bucket (other than +Inf).
Looking again at:
so that we can see what the values/buckets are that we're trying to put into quantiles, we see:
The value for the bucket is 0.010526352724044648 which is tied for the lowest non-zero value. The value for +Inf is 3.0000105263527246. So there's a big gap between these two.
In order to illustrate the problem here further, I built a modified version of Prometheus which output some state as it was executing the bucketQuantile function and I ran the query:
and then reformatted the results as follows:
the pairs of numbers here are the
le
and thevalue
and the set of numbers in the left column is what came in as a result of the:query, which hopefully looks reasonable to you by now based on the data we're talking about here. Note that I ran this soon after restarting Prometheus and that's why the large buckets have no samples in the last 5 minutes. This actually is important but we'll discuss that shortly.
In the right column we're showing what the value for each bucket was when we used it. Specifically immediately following the ensureMonotonic() call. We've discussed this above, but basically this is just designed to ensure that each
le
as we go down this list is >= the previous value. If it's not, it's set to the previous value. Lines with a-
in the right column mean that we did not modify the original data. So you can see here that the majority of the values were modified because of the lack of monotonicity. Including those buckets which actually had a value 0.The stated intention of the ensureMonotonic function is to deal with the fact that there aren't atomic scrapes. So it's possible data might still be coming in while we're loading it. And if the values decreased they'd be invalid anyway, so for a valid histogram, this does make sense. Unfortunately since we've mixed multiple sets of buckets from different labels, we no longer have something that's a valid Prometheus histogram so this mangles it pretty badly.
So that explains the data in the table, except for the last value which is actually the key to understanding where the 40389516 came from when we tried to take the P60 and P95 histogram_quantile before.
One last digression and we'll get there.
Once it has ensured the values are montonic the next thing Prometheus does is determine which bucket has the specified quantile value. To figure that out, it takes the value of the +Inf bucket (which is expected to contain the largest value seen) and multiplies that by the
q
quantile value (0.95 here). In the code this is called therank
. The rank from the above data was 0.28893247566991065 (I modified Prometheus to output this too) which matches:then it looks for which of the buckets has a value >= that rank which in our case is only the +Inf bucket.
From that point it normally does some calculations to determine what value to return based on where the rank value sits in relation between the found bucket and the bucket previous to that.
But there is a special case...
And now we're getting to the actual heart of this second problem with the histograms. In the special case, if the rank value is found in the last bucket (i.e. +Inf) instead of determining where the value should be between the penultimate bucket and infinity, it will just return the le value from the previous bucket! So this explains the mystery of 40389516. The rank value calculated for 0.95 * the +Inf count put us above the 40389516 bucket and in the +Inf bucket. So Prometheus returned the value 40389516. But if you recall, in this case the original value for the 40389516 le was 0 (because there are no samples with this value in the time specified).
So basically here we've ended up in a really gross state. If you have one label (here {stream="outlier"}) that has ever had a huge outlier value and has thus generated a bad bucket... you've ruined your histogram forever. Even if that bad value is never seen again, and even if that bucket is never used again. At least, if you're mixing that set of buckets with any other set of buckets.
There's one final piece here. Why was +Inf so much larger than the largest bucket if we had no observations that were larger than 8.1 in this window? That's what forced us to be past the last bucket and into +Inf when we calculated the rank. And the reason is that we've combined a bunch of separate sets of histogram buckets into a single thing that we're trying to pass off as a single histogram. To illustrate, a simplified example... Say we had 3 different histograms each with only 2 buckets that when called with:
had the following values:
and
and
when we're combining these together with say:
we'd get:
so while each of the individual metrics makes sense, doing a sum across them breaks everything because the buckets are not the same and so the +Inf no longer actually makes any sense. And doing a histogram_quantile on this bad data will give us something that makes no sense. Everything from P34 up will give us a result of 0.03 (the le for the largest bucket before +Inf).
If we took exactly the same metrics and values, and instead put them with the same set of buckets for all 3 histograms: [0.001, 0.002, 0.003, 0.01, 0.02, 0.03]... (we're simplifying here!) with the metrics doing exactly the same thing, we'd end up with:
and
and
now when we're combining these together with:
we'd get:
which is still (even after the sum) a valid Prometheus histogram. And we can do histogram_quantile on this and P99 would be less than 0.03 (and greater than 0.02) which makes much more sense.
Two problems one real cause
So both of the above problems really have the same cause. Specifically, the fact that unlike every other Prometheus library, we dynamically generate buckets and tie them to the histogram+label combination.
If
broken_histogram{stream="10^-3"}
andbroken_histogram{stream="outlier"}
were both using the same set of buckets, we could not have this situation. The occasional outlier would show up as a value that matches the highest bucket, but the histogram_quantile would go back to normal once the outlier value had gone back out of the[5m]
window. Which is probably what you'd want. You'd get a spike there, then things would settle back down.And for the first problem we discussed (values getting run over because of the ensureMonotonic) that also only happens because we're combining a different set of buckets. If we had the same buckets, that would not be an issue either.
Solutions
Change node-artedi to be compatible with Prometheus
The first and perhaps most obvious solution (also my preferred solution) is to change node-artedi to be fully compatible with Prometheus by making buckets work like they do in all the supported libraries.
Specifically, change node-artedi so that:
If we did this, we'd be passing the same set of buckets in node-triton-metrics so for the
http_request_duration_seconds
histogram, we'd always have the same buckets and be able to combine and histogram_quantile all day long.If we still wanted log-linear buckets, we could provide an:
which could generate an array of log-linear buckets that could be passed in when creating a histogram.
This is the only of these proposed solutions that actually solves all the problems discussed above.
The PR this is attached to implements this solution with a new major version for node-artedi but without adding an artedi.logLinearBuckets() function.
Tying buckets to histogram rather than histogram+labels
Another related option would be that instead of passing in the buckets, we still compute the buckets dynamically... but we compute them for the whole histogram within a service, rather than per histogram+label combination. If the
http_request_duration_seconds
histogram for the{service="cnapi"}
used the same buckets, regardless ofuser_agent
,method
, etc. we'd still be able to at least get valid histogram_quantile results for a single service and all its routes. But we'd still need to store the resulting buckets somewhere so I don't think this solution is really great.Stop using labels
Another possibly more drastic option here would be top stop using any labels other than
{service="cnapi"}
with the histograms in node-triton-metrics. This would solve most of the problem since the buckets would be the same as long as the CNAPI instance is running. We do still have the problem that different buckets could be used when CNAPI restarts however (see next section: Problems with solutions).Problems with solutions
Having the same set of buckets every time at least within a service does require us to think a bit more about how the buckets should be setup for a service, but this seems preferable to having useless histograms.
The other thing that has been mentioned is that changing the buckets (e.g. for a new version of the software) would invalidate all previous data. That is true insofar as you can't do a histogram_quantile across the data with different buckets... But that's already true with the current node-artedi implementation. When you start your program and create a histogram and then expose those with a /metrics endpoint... Nothing stores the set of buckets that were used. So if you stop your program and start it up again, it's already possible you're going to have a different set of buckets for the first call to /metrics than you had when your application was running previously.
It's easy to verify this by just running:
with different arguments:
vs
I think if we're going to keep generating the buckets in node-artedi, we need to implement some mechanism to write out the bucket configuration somehow so that on restart of the service we end up with the same buckets. Either that or the set of buckets artedi generates needs to be the same for every application in every instance. And I think that just using fixed buckets is a much better solution.
Conclusion
It seems that dynamically generating buckets rather than using fixed buckets for Prometheus histograms is untenable. As such, we should figure out a way to have fixed buckets for Triton services. One option would be implementing the attached PR and then making the appropriate modifications to node-triton-metrics to use this , probably by modifying node-artedi to only support fixed buckets.
(Somewhat) Related Reading
Appendix 1: histogram-breaker.js