Adding running parameter to stats #2687
Conversation
This change is intended to give information in the metrics if our process is not running or if we are just missing metrics from it.
Can you explain briefly why you need a positive assertion that the process is not running? I think most people simply alert when points stop being created.
The shortest explanation I can give is to point you to this link: https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes
Thanks for the link, I really like how prometheus has this guidance. One issue I see with doing this in procstat is that we may not be able to produce the same tagset unless the process is found. The example I'm thinking of involves searching by process name (via the exe option) with pid_tag true; normally you would get something like:
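A rough sketch of what that series could look like in line protocol, with illustrative tag values and a couple of typical procstat fields (the exact field set depends on configuration):

    procstat,exe=nginx,pid=1234,host=server01 cpu_usage=0.5,memory_rss=10485760i 1527638400000000000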
But if the process is not running you can't fill out the pid. This would mean we would create a new series. I'm also not sure how well this works if the query normally selects multiple processes. @desa What do you think about this?
I understand the desire to positively assert that data is no longer reporting, but the reasoning from the prometheus docs doesn't quite apply (unless I'm missing something). What stats could be provided that previously would not have been? Additionally, it's now possible with subqueries to write a query that will return the list of all series that are not currently reporting data. Something to the effect of:

    SELECT * FROM (
      SELECT last(*) FROM (
        SELECT count(<field>) FROM <measurement>
        WHERE time > now() - <duration>
        GROUP BY time(<collection interval>), * FILL(0)
      )
      GROUP BY *
    )
    WHERE last_count = 0

By playing with …
We are talking about "pull-based" metrics, as opposed to your Influx "push-based" model.
Ah, that makes sense. I forgot about the possibility of using pull-based metrics; in that context it makes sense. The issue that @danielnelson brings up about PID is still valid though. Being able to set the PID and this feature should be mutually exclusive so that you don't create an additional series.
Not exactly, it informs about a pretty different thing. "*_up" metrics say whether everything was OK or not on that host. We don't care about the process id or anything like that: if we failed to gather metrics from that process, we want information that it failed. PID is something extra in that case.
The issue is that it creates another series, kind of unnecessarily. I'm not saying that it shouldn't be implemented, but rather that it just rubs me the wrong way that data pertaining to one thing is getting written to two different series, e.g. it just doesn't feel right that the normal procstat series and a separate series indicating whether the process is running can be generated from the same plugin. The more I think of it, though, the less it bothers me.
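For illustration, the two series shapes being discussed might look roughly like this; the tag and field names here are assumptions for the sake of the example, not the final implementation:

    procstat,exe=nginx,pid=1234,host=server01 cpu_usage=0.5,memory_rss=10485760i
    procstat,exe=nginx,host=server01 running=0i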
If you are using the prometheus output, what about lowering the expiration_interval? The general point about old points is true. Does that mean we should always send something every interval for all measurements? This is something we recently did for http_response after a little persuasion: #2784 (comment). The multiple series issue is probably not a hard rule, just an indicator that something might not be right. In some respects the fact that pid is added as a tag is the real mistake, but something was needed to prevent collisions in the case of multiple process matches. Also, it is going to cause problems for the prometheus output if the tagset changes, until this issue is properly fixed: #2822.
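For reference, the setting mentioned above lives on the prometheus_client output and is configured roughly like this (values are illustrative):

    [[outputs.prometheus_client]]
      listen = ":9273"
      expiration_interval = "60s"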
We hit #2822, so we are still using 1.2 ;) expiration_interval only partly solves the problem. In practice you get a lag of at most expiration_interval*2 + interval. You need expiration_interval to be bigger than the longest interval you have, and something will pull the metrics on some interval of its own, which is the lower bound for expiration_interval. So yes and no: it doesn't solve the problem, it just gives you a workaround that works in some cases and not in others, but is still far from perfect.
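To put rough numbers on that bound: with expiration_interval = 60s and a 10s collection interval, the worst case would be 60*2 + 10 = 130 seconds before a stale series stops being exposed (illustrative values, using the bound as stated above).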
Actually yes, but rather for each series/plugin. Multiple series exist in every big plugin... you cannot avoid them, it is just a matter of how big the plugin is. Think about an OpenStack plugin.
@danielnelson I think a use case we have is related: using the latest Telegraf with Prometheus, we can't tell if a host is down without using … (I know this PR is not about those input plugins, but it definitely covers the absence of metrics.) To explain a little more, if … So now in Prometheus if we check …

I want to suggest (although I don't know if these are technically the best solutions) that it would be very helpful to either have an aggregator to map fields to numbers (maybe the histogram aggregator can autobucket strings based on hash code), or to have a way to map the …
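One way the suggested field-to-number mapping could be sketched is with an enum-style processor that maps string values to numeric codes. The processor name, option names, and value set below are assumptions for illustration, not something established in this thread:

    [[processors.enum]]
      [[processors.enum.mapping]]
        field = "result_type"
        dest = "result_code"
        [processors.enum.mapping.value_mappings]
          success = 0
          timeout = 1
          connection_failed = 2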
@tzz I think these pull requests might be of interest.
@russorat @danielnelson Adding another use case for this. We find it hard (or impossible) to differentiate between a process having stopped running and the Telegraf agent having stopped sending metrics (for whatever reason). As an example, with Kapacitor deadman alerting this causes one alarm to be sent for each process that was monitored on a host when the Telegraf agent stops reporting. Also, we cannot really confirm whether the lack of data is due to the process not running or simply due to a lack of data in the graphs.
The best way to monitor if Telegraf is operational is by using the internal plugin. You can also use this plugin to check if procstat collected any processes:
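A sketch of such a check, assuming the internal plugin is enabled and querying the internal_gather measurement by its input tag and metrics_gathered field (names as I understand the internal plugin to report them; verify against your Telegraf version):

    SELECT non_negative_derivative(last("metrics_gathered"), 1m)
    FROM "internal_gather"
    WHERE "input" = 'procstat' AND time > now() - 10m
    GROUP BY time(1m)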
I'm still uneasy with how this change would work when matching multiple processes. Here is a proposal for a slight modification which I would feel more comfortable with: #4237. If you are watching this issue, please take a look and let me know if it would work for your needs.
Hi @danielnelson, thanks for the feedback. Yes, checking whether Telegraf is operational or not may be done with the "internal" plugin; we currently do it by checking "uptime" in "system", as Chronograf does. I cannot see that the "internal_gather" "metrics_gathered" metric would help in our situation, at least.

In our case we would like to use Kapacitor to send alarms when a process on a host has stopped running, meaning we require one alarm for each process that has stopped on each server. Using "deadman" for this works in theory, but is quite iffy in practice: "deadman" alerting sends alarms due to a lack of metrics within a certain time, which might be due to a multitude of reasons other than the process having stopped running. We do currently run with "deadman" for this, but we experience storms of alarms due to sudden latency fluctuations in the platform and other reasons why the metrics might not come through within the interval. We are likely to stop using "deadman" for this, as the sheer number of false alarms overwhelms the real ones.

Several other inputs expose a "status" field, such as "result_type" / "result_code" for http_response/net_response or "exists" for filestat; it seems fitting, in my opinion at least, that procstat follows the same pattern.
I would use this pattern if procstat only matched a single process, but I think the idea in #4237 is fairly similar and mostly just drops the notion that this information is part of the procstat series.
Superseded by #4307 |