prometheus output can return partial results if a collect request is made in-between writing multiple batches #1775
Comments
Hi @banks, good sleuthing, but the previous behavior of continuing to report metrics that are no longer being reported was actually a bug that was fixed in 1.0: #1334. My understanding is that this is the intended prometheus behavior. Why? Because prometheus endpoints are supposed to work by gathering a set of metrics at the time that the http request is made. If the metrics don't exist at the time that the request is made, then they are not exposed. I'll re-open this if you can find some documentation that explains otherwise, but my conversations with prometheus devs led me to believe that the 1.0 behavior is the intended behavior.
I'm pretty sure this is a bug and not a valid fix for #1334. I followed that issue too as we were also bitten by that bug on other hosts. In fact 0.13.2 fixed that bug for us - I assume the fix was back-ported from the 1.0 branch. This report is different though:
In other words, as of 1.0.0 the prometheus output can't report on more than 1000 distinct metrics. Not only that, but the failure mode when you try is a random sampling of metrics that drops an unbounded number of them on the floor.
Btw. if there was no intentional backport of the fix for #1334 in 0.13.2 then it's very interesting, because we saw it go away. Sadly it was too long ago to still have the data in prometheus to show, but I may be able to dig out a graph that showed how we dropped about 25% of our ingested-metrics-per-second over the 30 mins it took to roll out, due to the deleted containers' metrics going away as expected. If that happened without an intentional change in telegraf, perhaps it was a prometheus client bug that was fixed in that version or something, and in fact the "fix" in 1.0.0 is creating this whole new problem? Please re-open this; the prometheus output is unusable in production on 1.0.0 for anyone who might have more than 1000 metrics on a telegraf instance, as far as I can see.
For the record here: that was the number of distinct containers we had metrics for over the time we rolled out 0.13.1 -> 0.13.2. I checked, and none of the live containers I could find were missing them, and all recently-deleted containers had their metrics stop as expected. I didn't look further and just assumed that the #1334 fix had been backported and released. If that's not the case, then I think #1334 didn't really get at the correct issue and its "fix" potentially caused this new one.
I see what you mean, I'll reopen
I can see exactly what's wrong here, and this should actually be a fairly simple fix, so I'm going to bump it up to 1.0.1. This map: https://github.com/influxdata/telegraf/blob/master/plugins/outputs/prometheus_client/prometheus_client.go#L95 should be getting remade on calls to Collect, not on Write. In the meantime, you should be able to bump up metric_batch_size in the [agent] config section to work around this: https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L35
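For reference, those settings live in the [agent] table of telegraf.conf; something along these lines (the numbers are just placeholders - pick a batch size above the number of metrics you expect per collection):

```toml
[agent]
  interval = "10s"
  flush_interval = "10s"

  ## Raise this above the expected metric count so a whole collection pass
  ## fits in a single write; the default is much smaller.
  metric_batch_size = 10000

  ## The buffer limit should be at least as large as the batch size.
  metric_buffer_limit = 100000
```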
Sounds exactly like the issue :) Thanks for the quick response.
Incidentally, my assumption that it was caused by having more metrics than the buffer seems to be wrong. In the process of trying to patch hosts I found one host reporting only 300 metrics, and it still showed them inconsistently, sometimes being blank and sometimes having all of them, even in just the few seconds I observed:
After downgrading to 0.13.2 this stopped too. Not sure whether or not that is explained by your find yet... (maybe a timing issue even if writes are not larger than the buffer size?)
Hi @sparrc, I've finally gotten around to testing 1.0.1 ready to roll out to our servers, and I've discovered that this issue seems NOT to be resolved in the latest 1.0.1 binary. On my hosts before (0.13.2) I could run my little for loop from the original issue description:
On my canary host (I wouldn't expect an identical number of metrics, but they should be equally consistent) I see:
The log output on the new host is just as before and shows no errors:
So we expect 4122 metrics per scrape in this case but never get it... so it seems the fix was not the full story. I think this is the same issue as before not having been fixed as we expected, though it's possible it's the same symptom from a different cause. @sparrc would you prefer I reopen this issue or create a new one with a link back?
FWIW my wild guess is that although the output is no longer clearing the map on each Write, it is clearing it at the end of each collection, so there is a race where prometheus might (and often does) scrape before it's done with all batches and so sees the result of only some of the completed Writes... i.e. it's still not seeing everything collected since the last scrape. This is almost as bad as the original issue here if I'm correct, but less simple to fix.
Doesn't this add up to 4122 metrics, which is what you were expecting?
I think that the consistency can be fixed by making your output batch size larger than the number of metrics you expect to receive, i.e. set metric_batch_size to something larger than that. The problem is that Telegraf can only "Lock" the output on each write call, which gets called once for every batch. So if your batch size is smaller than the number of metrics then the metrics will be split across multiple writes (and multiple locks). If you then make an HTTP request in-between multiple writes you will get split data. I'm not sure there's any other way to get around this other than having some very special handling of the prometheus output plugin coded into the core of telegraf.
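A deliberately simplified toy model of that situation (not the real plugin code - the type and values here are invented): each Write only holds the lock for its own batch, so a scrape that lands between two batches of the same pass sees partial data.

```go
package main

import (
	"fmt"
	"sync"
)

type toyOutput struct {
	mu      sync.Mutex
	metrics map[string]float64
}

// Write registers one batch; in Telegraf it is called once per batch, so a
// pass larger than the batch size arrives as several separate calls.
func (o *toyOutput) Write(batch map[string]float64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	for name, value := range batch {
		o.metrics[name] = value
	}
}

// Scrape stands in for the prometheus HTTP handler.
func (o *toyOutput) Scrape() int {
	o.mu.Lock()
	defer o.mu.Unlock()
	return len(o.metrics)
}

func main() {
	o := &toyOutput{metrics: map[string]float64{}}
	o.Write(map[string]float64{"cpu": 1, "mem": 2}) // first batch of a pass
	fmt.Println(o.Scrape())                         // 2 - a scrape here misses the rest of the pass
	o.Write(map[string]float64{"disk": 3})          // second batch of the same pass
	fmt.Println(o.Scrape())                         // 3
}
```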
Yes sorry, that's confusion on my part. I expected 4122 because I added those up - my point is that those log messages look correct and are the same for 1.0.1 as 0.13.2, despite the output (as shown in the for loop above) being different. I can't explain why we saw 3988 in the output though - it was consistently less than 100%, but it's also not a simple combination of the batches reported in the logs. No filtering is configured on the output; it's possible that I have duplicate metric names that are being de-duped by the prometheus output, but that doesn't obviously happen with 0.13.2. I need to be more scientific with my testing to draw any conclusions.
I imagine that would work but it feels like a hack. My issues with it are:
If my understanding of the issue is correct, then the clean solution would be closer to the old prometheus output, where metrics are registered and the current value (the last one written by any batch) is reported on every scrape. Then, to fix the issue of old metrics never being unregistered, you'd need something more elaborate that kept a map of the last write time for each metric and swept through to unregister any that were not written for X collection cycles or seconds. I may well be able to contribute this fix or an alternative if you have a preference.
You need to understand that the prometheus output is a bit different than any other telegraf output. "Writing" a batch to the prometheus output is more like just registering metrics on the http endpoint. Batching is really intended more for the sake of the more traditional write endpoints (InfluxDB, graphite, openTSDB, MQTT, AMQP, Kafka, etc).
I think that you should be able to come up with a max batch size that you reasonably can expect to not exceed.
I'm afraid I can't really speak to that, but as I said before, if you're only using the prometheus output then I wouldn't worry too much about the batch size. That being said, I would caution against your planned architecture... exposing 450,000 metrics over an http interface, then scraping and parsing that endpoint every 10s, would have an enormous overhead and I'm not entirely sure how well the prometheus client library performs at that scale.
It might be a documentation issue; I wouldn't be opposed if you submitted a PR explaining that users of the prometheus output may want to increase their batch size to an arbitrarily large number, if that is what works best.
I don't think that keeping a map of all values is a good solution, especially not at large scale (that map will be in memory and can grow quite large). I haven't looked into it too much, but I think my preference would be to have a special RunningOutput just for the prometheus output, which can be locked while it is writing a "set" of batches.
Thanks for that @sparrc. Batching not being relevant for prometheus is useful to know.
Yeah if that is a viable solution then this would be necessary. I'll see what happens in our tests.
Interesting. The way I see it is that if we do end up with ~100k metrics per host, which is not impossible, then exposing them all through a single scrape in Telegraf is less overhead than running hundreds of prometheus clients per host and having to scrape the same amount of data overall using hundreds of separate http requests... Obviously we have more research to do to verify how well the components scale and at what cost.
That was a sketch suggestion for a "correct" (matching Prometheus' expectations) option; in practice, with a bit of design, I think it could be done with very little overhead compared to the maps the output already keeps in memory. Prometheus' expectation is that hitting the endpoint should just return the most recent value of all the metrics we care about. It used to do this. In trying to work around the never-unregistering-metrics issue (which is real and bites us too) we've kinda broken that basic assumption: now hitting the collect endpoint actually causes a state change in the metrics/lastMetrics maps - potentially wiping older metrics that have not yet had a chance to be written for the most recent collection period. An (effectively) infinite batch size masks that problem only because it ensures everything arrives in a single Write. To put it another way:
Do you agree with that analysis or have I missed something again?
Interesting. That's certainly cleaner than relying on Prometheus users noticing the warning about batch size in the docs and never combining with another output...
You might be right that collecting the prometheus metrics in one place is better, especially if the machine exposing the http endpoint with that many metrics isn't expected to perform other heavy duties. I will reopen the issue because it's probably worth revisiting how we handle the prometheus output. Basically the problem is that prometheus expects the metrics to be collected at the time that the http request is made. Telegraf obviously has a different model, where it collects metrics on a set interval and then just gives prometheus its most recent metrics when asked. Another solution that I think might be a lot simpler: every time a "Write" request comes in we increment a counter, and decrement it when the request finishes. The "HTTP" requests are then only allowed to grab the lock when the Write counter == 0. Go has a nice sync.WaitGroup object designed for this. I.e. every "Write" goes straight to the front of the line, and "HTTP" requests need to wait for all "Write" requests to finish before they are processed.
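A rough sketch of that idea (not the actual plugin code - the type, port, and output format below are invented). It uses an explicit counter with sync.Cond rather than sync.WaitGroup, which behaves the same way but avoids WaitGroup's restriction that Add must not race with Wait:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"
)

type waitingOutput struct {
	mu       sync.Mutex
	cond     *sync.Cond
	inFlight int // Write calls currently in progress
	metrics  map[string]float64
}

func newWaitingOutput() *waitingOutput {
	o := &waitingOutput{metrics: map[string]float64{}}
	o.cond = sync.NewCond(&o.mu)
	return o
}

func (o *waitingOutput) Write(batch map[string]float64) {
	o.mu.Lock()
	o.inFlight++
	o.mu.Unlock()

	// ... the real plugin would convert the batch here, outside the lock ...

	o.mu.Lock()
	for name, value := range batch {
		o.metrics[name] = value
	}
	o.inFlight--
	o.cond.Broadcast()
	o.mu.Unlock()
}

func (o *waitingOutput) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	o.mu.Lock()
	defer o.mu.Unlock()
	// Block the scrape until no Writes are in flight. Note this can't tell
	// whether more Writes for the same pass are still on their way, which is
	// the question raised in the reply below.
	for o.inFlight > 0 {
		o.cond.Wait()
	}
	for name, value := range o.metrics {
		fmt.Fprintf(w, "%s %v\n", name, value) // crude stand-in for the exposition format
	}
}

func main() {
	out := newWaitingOutput()
	out.Write(map[string]float64{"cpu": 1})      // the agent would call this once per batch
	log.Fatal(http.ListenAndServe(":9126", out)) // example port only
}
```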
Cool, thanks. I didn't quite follow your suggestion. It sounds like something I was considering, but I think it would need a change to how Telegraf calls Write, to indicate whether or not there are more metrics still to be collected in the current collection cycle. How do you propose informing the output that there are outstanding "Write" requests on the lock? Given a mechanism like I described (and like I think you are implying would be needed), the solution is pretty easy.
My hope was that the prometheus plugin could service this lock internally, delaying HTTP requests until all metrics are ready.
Giving this a second thought, I'm actually not sure there is anything we can do to guarantee that this won't happen. Telegraf deals with metrics individually as they stream to the output plugins, so there isn't really any concept of "groups" of metrics that should or shouldn't get exposed together (in the prometheus sense). As to the behavior in 0.13.2, I think that was even worse in some ways: in that case you were missing the same data from the current interval, but the plugin was just mixing data from the current & previous intervals as if they were collected at the same time. One thing we could do is set some sort of arbitrary waiting period before servicing the HTTP requests. Basically we would set a certain waiting period, and if no new metrics arrive within that period then we can respond to the HTTP request; otherwise we would wait for any incoming metrics to finish arriving.
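A sketch of that waiting-period idea (type names and durations below are invented examples, not plugin settings):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type quietOutput struct {
	mu        sync.Mutex
	lastWrite time.Time
	metrics   map[string]float64
}

func (o *quietOutput) Write(batch map[string]float64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	for name, value := range batch {
		o.metrics[name] = value
	}
	o.lastWrite = time.Now()
}

// waitForQuiet blocks until no new metrics have arrived for `quiet`, or until
// `maxWait` elapses so a busy stream of writes can't stall scrapes forever.
func (o *quietOutput) waitForQuiet(quiet, maxWait time.Duration) {
	deadline := time.Now().Add(maxWait)
	for time.Now().Before(deadline) {
		o.mu.Lock()
		idle := time.Since(o.lastWrite)
		o.mu.Unlock()
		if idle >= quiet {
			return
		}
		time.Sleep(10 * time.Millisecond) // crude polling keeps the sketch short
	}
}

func main() {
	o := &quietOutput{metrics: map[string]float64{}}
	o.Write(map[string]float64{"cpu": 1})
	o.waitForQuiet(200*time.Millisecond, 2*time.Second)
	fmt.Println(len(o.metrics)) // the scrape would be served at this point
}
```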
@sparrc thanks for the thought. It was my assumption, based on logs and config etc., that Telegraf has distinct collection "passes" - i.e. it does have a concept of a group of metrics collected together that is just hidden from the outputs currently. If that is really not the case then you're right, it's pretty hard to reconcile the model with Prometheus's assumptions. I'm going to think some more about this and dig into the code. If I come up with any better ideas I'll report back. The timing option might be the best we can do, but it feels like a hack to me. Almost like we'd be better off just saying outright in the docs that the prometheus output doesn't really work very well and explaining the issue, rather than fudging it to make it sorta work mostly kinda. In my mind the old behaviour of reporting the last known value of a metric, even if it was collected in a previous pass, is "correct" at least in the sense of how prometheus expects things to work. You have to accept that counter values might end up incrementing effectively one sample later than otherwise, but if you use Telegraf or any other agent then there is some delay/time smearing going on anyway, and in Prometheus' case there isn't really a better option. Anyway, I'll return if I have further insight to share.
I have encountered this same issue. Reading through your notes, the behaviour in V0.13.2 would be much more useful to me than the V1.0.0 behaviour. In my use old data (V0.13.2) is less harmful than lost data (V1.0.0). However it does cause a problem where the lack of new data from a host is masked indefinitely. The suggestion from @sparrc of delaying http responses from the prometheus client until no new metrics have arrived for a configurable amount of time would work well for my use, as I see many metrics coming in batches from the poll and then a gap until the next interval. @banks what was your workaround with V1.0.1?
@trastle increasing the batch size is a theoretical workaround with a few unpleasant caveats as noted. Honestly I've not even tried it yet, although it should work. The more I look at this, the more I see fundamental mismatches between what telegraf does and what Prometheus expects. I've not finished looking yet, but I'm sadly leaning towards replacing telegraf in our setup with a combination of Prometheus' node_exporter and pushgateway, and/or a custom, extremely simple collector daemon that runs our custom collection scripts synchronously on scrapes. One of the reasons I'm not fully decided is that the other solutions all require subtle design decisions to avoid the same issues telegraf has/had (old metrics not being dropped implicitly etc.) and will end up being very similar to telegraf + the batch-size workaround in practice. I still feel like that has too many gotchas to be a long-term solution though. I'll update again when we've made a choice here.
@trastle it's true that using telegraf for prometheus output is somewhat of a shoehorn. The typical workflow of a prometheus exporter is to collect metrics on an http request, which means that there is always a "batch" of metrics that are collected & exposed together. We could presumably modify the core of telegraf to have a "prometheus mode" where it collects metrics on request, rather than on an interval. There actually was a PR to do that in the early days of telegraf that we were resistant to, but it could possibly be revisited, especially given the popularity of prometheus.
@banks I'd appreciate it if you could test with a very large batch size setting no matter which option you decide on. I'm thinking that the simplest solution to this would actually be to just set the batch size for the prometheus output to a very high number (10,000,000?) when the output is created (ignoring the configured metric_batch_size). From what I can tell, this can be set to an almost arbitrarily high number because the memory is not allocated until it is actually filled with metrics (though it may allocate some "virtual" memory before that point).
I've had a few other duties to attend to this week which have limited my time for figuring out the best option here, beyond discovering the issue in the first place and some out-of-hours reading. It's on my TODO list though to test with the hack fix and to review the telegraf code base more thoroughly to see if there is a cleaner solution. To be honest, your suggestion of hard-coding it for the Prometheus output alone might not be a bad idea, if it's possible without affecting other outputs. I'll look further. My mind flip-flops - it turns out the official Prometheus node_exporter (or some of its collectors) essentially works the same way as telegraf anyway: gathering metrics on an interval, caching them, and reporting them on the next scrape. This is also more-or-less the behaviour you'd get from running pushgateway and PUTing the results of a custom script to it on an interval. If telegraf can behave that way without hacks/caveats and out of the box then it's actually fine, and with much more flexibility over inputs and over pruning metrics you don't care about on the host rather than centrally on the server (which matters if you care about bandwidth as well as just Prometheus storage etc.). Thanks for your patience, I'll try to produce something constructive for Telegraf here either way.
FWIW, in my mind I wouldn't call increasing the batch size a "hack"; I think it's simply the correct behavior for a prometheus exporter. The fact that it needs to be increased manually at the moment is inconvenient, but that's because the setting makes more sense from a more standard "push" output perspective.
@sparrc I agree: the "hack" is doing it manually in a global config. This is a hack because:
If we can find a clean way to bump that batch size for the Prometheus output without manual config, and without affecting other outputs, then I agree it's (theoretically) a viable solution and not a hack. I'll look into this. From a first glance it looks like we'd need to change how the batch size gets decided for each output. Off the top of my head, we could add a new interface like:
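Something along these lines - the method names are purely illustrative, nothing like this exists in the code today:

```go
type BatchSizer interface {
	// Outputs that implement this can constrain the batch size the agent
	// uses for them; the prometheus output could return a very large
	// minimum so a whole pass always arrives in a single Write.
	MinBatchSize() int
	MaxBatchSize() int
}
```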
Those could return whatever bounds make sense for a given output, and outputs that don't implement the interface would keep the current behaviour. That would seem like a clean, general-purpose solution that would allow Prometheus to clamp the minimum to a very high number and so work in all cases out of the box without affecting other outputs. Added benefit: no other Output code nor core code would need to change to accommodate it. If you have any thoughts on that, feel free to share - this was based on a few minutes of looking at the code on GitHub so may not be ideal! If we did this, we could effectively revert the last PR on this issue and go back to assuming each Write carries everything that should be exposed. I also need to verify that bumping the batch size is correct/works before I get too ahead of myself with code...
TBH I was thinking something quite a bit simpler: after calling NewRunningOutput we can just check if we are creating a prometheus output:

```go
if strings.HasPrefix(name, "prometheus") {
	if batchSize < 1000000 {
		// log that we are auto-increasing the prometheus batch size
		log.Println("I! blahblah")
		batchSize = 1000000
	}
}
```

This would just need to get inserted at https://github.com/influxdata/telegraf/blob/master/internal/models/running_output.go#L44
Hmm, thinking more: is this even a correct solution? Given that flushing right now doesn't happen in strict collection passes, even if we make the batch size arbitrarily large, it will only work as expected if all inputs are on the same interval and collect at exactly the same time, so that all the metrics arrive in one batch, right? Nothing guarantees this even if they are all on the same interval now, as far as I can see - especially given any jitter in when collections actually run. If you were to have 2 different inputs with different intervals, you'd never have all the metrics in one batch regardless of batch size. I think that is the fundamental issue rather than the batching. Am I wrong about that? This also makes me nervous about testing big batches at all, because it may seem to work for us now but may not be a good solution, or may have harder-to-spot but still present cases where it drops stuff...
What this does guarantee is that when the metrics are getting written (on flush_interval), whatever metrics have been reported at that time will get written together (and thus can never be split across multiple prometheus requests). Any metrics that arrive later would get added to the next flush, which to me seems like the logical thing to happen.
You would have the metrics together whenever they arrive within the same window. Let's say pluginA reports every 10s and pluginB every 13s. Then let's say that you make an HTTP request every 10s. What you'd get is this:
(NOTE that in this example I am also assuming that flush_interval = "10s".) So, from my perspective this is correct, because the only alternative would be to potentially keep metrics which no longer exist (i.e. prometheus metrics are only "registered" when they've actually been collected in the previous interval).
I think there are three intervals at play: the input plugin collection interval, the telegraf flush/write interval, and the prometheus scrape interval. To me the core difficulty is the interplay between telegraf flush and prometheus scrape. If telegraf flushes too quickly, then it's possible for prometheus to miss metrics. For example:
I don't think we ever actually get a plugin B reading in prometheus in that case, unless I'm missing something. Now, it's possible that with some good documentation this could be ameliorated, but it's a lot of moving pieces to keep track of at once. Personally, I think Paul's suggestion earlier in the thread of keeping the most recent observation of each metric in the map for a longer period of time is a more user-friendly solution to the issue.
Heh, I was just writing up an elaborate example that said the same as Mike. Even if you consider that the flush interval is set to the minimum of the input intervals, the collections and flushes take time too, and there are clearly races that can happen which would mean some plugin's output might be missed entirely on some scrapes, due to being overwritten by another flush before the scrape can happen. As a minimal example, assume the input timings in your example @sparrc, but have Prometheus only scrape every 30 seconds: if the scrapes occurred at 00:20 and 00:50, the second scrape would have missed all of the outputs from B since the previous scrape, even though there were some... I think going back to keeping all the metrics in memory in the output, plus enough additional data to be able to remove things that have not reported for X seconds, is the only option that really works. There is overhead for sure, but I think it could be minimised to acceptable levels with careful design (especially since you could just clean up while iterating the metrics on a scrape, where we already visit every one anyway, so no additional work is needed to implement the timeouts). Any other correct solution I can think of would change too much about Telegraf to be worth considering IMO - even if it was optional, it would probably remove a lot of the benefits Telegraf has by making features like separate intervals per input unusable etc.
@ragalie Alright! I think the solution will be to have an "expiration" setting for the prometheus output. Every metric that gets written will have its expiration time set to the write time plus that interval. Would this be an OK solution for everyone?
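A sketch of how that could look (the details here are assumed for illustration, not taken from the eventual implementation):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type expiringMetric struct {
	value     float64
	expiresAt time.Time
}

type expiringRegistry struct {
	mu         sync.Mutex
	expiration time.Duration // e.g. a few collection intervals
	metrics    map[string]expiringMetric
}

// Write stamps each metric with a deadline instead of wiping the map.
func (r *expiringRegistry) Write(batch map[string]float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	deadline := time.Now().Add(r.expiration)
	for name, value := range batch {
		r.metrics[name] = expiringMetric{value: value, expiresAt: deadline}
	}
}

// Collect returns the current values, pruning expired entries while it
// iterates so no separate sweep pass is needed.
func (r *expiringRegistry) Collect() map[string]float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	now := time.Now()
	out := make(map[string]float64, len(r.metrics))
	for name, m := range r.metrics {
		if now.After(m.expiresAt) {
			delete(r.metrics, name) // deleting during range is safe in Go
			continue
		}
		out[name] = m.value
	}
	return out
}

func main() {
	r := &expiringRegistry{expiration: 100 * time.Millisecond, metrics: map[string]expiringMetric{}}
	r.Write(map[string]float64{"cpu": 1})
	fmt.Println(len(r.Collect())) // 1
	time.Sleep(150 * time.Millisecond)
	fmt.Println(len(r.Collect())) // 0: the metric has expired
}
```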
Sounds good to me! Thanks for your effort on this @sparrc, we really appreciate the time spent in this conversation and putting up with a lot of verbosity and mind-changing from me :). Either Mike or I would probably be able to contribute a PR for that proposal if that would help.
I have found the same thing as @ragalie: the flush_interval is key. I need to make sure metrics collection has completed before flushing to the prometheus output, or metrics are lost. In my environment I am collecting some 50k metrics per interval. This takes about 350 seconds. I am allowing 600 seconds so I have plenty of head room. I was testing this today using V1.0.0 with the following config:
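Roughly the shape of the agent settings involved - the values below are inferred from the description above rather than copied from the actual file:

```toml
[agent]
  ## Collecting ~50k metrics takes about 350s, so both intervals sit well
  ## above that to leave head room.
  interval = "600s"
  flush_interval = "600s"
```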
This is working quite well for me.
I am running 3 Telegraf servers, each publishing roughly the same number of metrics each interval. Prometheus is scraping each Telegraf server every 60s. As best I can tell this has resolved the loss I was seeing with a smaller flush_interval. An expiration for each metric would be much cleaner.
@sparrc I have a PR on this, just need to get it cleared and sign the CLA and I'll open it up. Thanks again!
Closed by #2016
Bug report
We upgraded some servers to the latest deb release in repos.influxdata.com, only because that repo does not keep old versions (filing a separate ticket about that).
The latest version seems to fail in a way I can't diagnose: the log output is the same as before, yet the response from the prometheus output metrics endpoint intermittently contains only a limited subset of the actual data.
It seems to be reporting only the last batch of metrics written, not all metrics since the last scrape.
In the config below the exec input pulls cgroups memory stats for each container running on the host. There are ~3000-4000 distinct metrics returned by it every time, and this number is stable for a given number of running containers.

Relevant telegraf.conf:
/etc/telegraf/telegraf.conf
/etc/telegraf/telegraf.d/default_inputs.conf
/etc/telegraf/telegraf.d/default_outputs.conf
System info:
telegraf 1.0.0 (old version 0.13.2)
Steps to reproduce:
(This is an assumption, based on the cases that do exhibit it)
Use an exec input that returns > 1000 metrics.

Expected behavior:

Should consistently see the correct number of metrics. In our case, this is what happened when scraping every second and outputting the number of metrics scraped, using 0.13.2:
Actual behavior:
After upgrading to 1.0.0 we see:
These numbers match the log messages about the number of metrics in each batch, so the bug appears to be that the prometheus output is only reporting the last batch of metrics recorded, not all metrics since the last scrape.