Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding running parameter to stats #2687

Closed
wants to merge 1 commit into from
Closed

Conversation

szuki
Copy link

@szuki szuki commented Apr 19, 2017

This change is intended to give information in metrics
if our process is not running or we are missing just a
metrics from it.

Required for all PRs:

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

This change is intended to give information in metrics
if our process is not running or we are missing just a
metrics from it.
@danielnelson danielnelson modified the milestones: 1.3.0, 1.4.0 Apr 19, 2017
@danielnelson
Copy link
Contributor

Can you explain briefly why you need to a positive assertion that the process is not running? I think most are simply alerting if points stop being created.

@szuki
Copy link
Author

szuki commented Jun 7, 2017

Most briefly I can get is to put you link https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes

@danielnelson
Copy link
Contributor

Thanks for the link, I really like how prometheus has this guidance.

One issue I see with doing this in procstat is we may not be able to produce the same tagset unless the process is found. The example I'm thinking of involves searching by process name (via the exe option) and with pid_tag true, normally you would get something like:

procstat,exe=dnsmasq,pid=44979 cpu_user=0.14,cpu_system=0.07

But if the process is not running you can't fill out the pid. This would mean we would create a new series.

I'm also not sure if how well this works if the query normally selects multiple processes.

@desa What do you think about this?

@desa
Copy link
Contributor

desa commented Jun 7, 2017

I understand the desire to positively assert that data is no longer reporting, but the reasoning from the prometheus

...
The seconds is to have an myexporter_up (e.g. haproxy_up) variable that’s 0/1 depending on whether the scrape worked.

The latter is better where there’s still some useful metrics you can get even with a failed scrape, such as the haproxy exporter providing process stats.
...

doesn't quite apply (unless I'm missing something). What stats could be provided that previously would not have been?

Additionally, its possible now with subqueries write a query that will return the list of all series that are not currently reporting data.

Something the the effect of

SELECT * FROM (
    SELECT last(*) FROM (
        SELECT count(<field>) FROM <measurement>
        WHERE time > now() - <duration>
        GROUP BY time(<collection interval>), * FILL(0)
     )
     GROUP BY *
)
WHERE last_count = 0

By playing with WHERE last_count ... and the various time intervals, the type of deadman assertions you could do are fairly sophisticated. I struggle to see why this isn't sufficient.

@szuki
Copy link
Author

szuki commented Jun 8, 2017

We are talking about "pull-based" metrics instead of your influx model of "push-based".
That means that you don't get missing metrics if something failed after a while. You get them after expiration occures. Meanwhile you are scraping "old" metrics from

@desa
Copy link
Contributor

desa commented Jun 8, 2017

Ah that makes sense. Forgot about the possibility of using pull based metrics. In that context it makes sense.

The issue that @danielnelson brings up about PID is still valid though. Being able to set the PID and this feature should be mutually exclusive so that you don't create an additional series.

@szuki
Copy link
Author

szuki commented Jun 8, 2017

Not exactly, It informs about pretty different thing. "*_up" metrics are to say if everything was ok or not from that host. We don't care about process id or whatever. As we failed to gather metrics from that process we want information about it failed. PID is something extra in that case.

@desa
Copy link
Contributor

desa commented Jun 8, 2017

The issue is that it creates another series, kind of unnecessarily. I'm not saying that it shouldn't be implemented, but rather that it just rubs me the wrong way that data pertaining to one thing is getting written to two different series. e.g. It just doesn't feel right that

procstat,host=myhost,exe=dnsmasq,pid=44979 cpu_user=0.14,cpu_system=0.07,running=0

and

procstat,host=myhost,exe=dnsmasq running=1

can be generated from the same plugin. The more I think of it the less it bothers me.

@danielnelson
Copy link
Contributor

If you are using the prometheus output, what about lowering the expiration_interval to a level that you are comfortable with?

The general point about old points is true. Does that mean we should always send something every interval for all measurements? This is something we recently did for http_response after a little persuasion #2784 (comment)

The multiple series issue is probably not a hard rule, just an indicator that something might not be right. In some respects the fact that pid is added as a tag is the real mistake, but something was needed to prevent collisions in the case of multiple process matches.

Also, it is going to cause problems for the prometheus output if the tagset changes until this issue is properly fixed #2822.

@szuki
Copy link
Author

szuki commented Jun 9, 2017

We encounter #2822 so we are still using 1.2 ;)

expiration_interval resolves problem only partly. Actually you get lag which is a at most expiration_interval*2 + interval. You need expiration_interval to be bigger than longest interval you have and something will pull with metrics according to some interval which is the lowest setting for expiration_interval. So yes and no -> It doesn't resolve problem it just gives you workaround which may in some cases work in other no, but still is far from perfect.

Does that mean we should always send something every interval for all measurements?

Actually yes, but rather for each series/plugin.

Multiple series are in every big plugin... You cannot avoid them. It just a matter of how big plugin will be. Think about OpenStack plugin.

@danielnelson danielnelson modified the milestones: 1.4.0, 1.5.0 Aug 14, 2017
@danielnelson danielnelson added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Aug 24, 2017
@danielnelson danielnelson modified the milestones: 1.5.0, 1.6.0 Nov 30, 2017
@russorat russorat modified the milestones: 1.6.0, 1.7.0 Jan 26, 2018
@tzz
Copy link
Contributor

tzz commented Mar 7, 2018

@danielnelson I think a use case we have is related: using the latest Telegraf with Prometheus, we can't tell if a host is down without using absent(), which is an awkward workaround individually and really hard to do in a big group of hosts. This affects http_response and net_response inputs.

(I know this PR is not about those input plugins, but it definitely covers the absence of metrics.)

To explain a little more, if http_response fails due to a string mismatch, we get all the stats. But if the host is not reachable, the only stats are the result_code which is a string. But Prometheus doesn't collect string metrics.

So now in Prometheus if we check absent(...one host...) that works. But we can't check all our hosts at once because even 1 of them will make the check pass. I don't know of a workaround.

I want to suggest (although I don't know if these are technically the best solutions) that it would be very helpful to either have an aggregator to map fields to numbers (maybe the histogram aggregator can autobucket strings based on hash code), or to have a way to map the result_code from http_response and net_response to a number in the plugin itself. Or a field rewriter plugin. Let me know the best way to proceed so I can open a issue on the right components.

@danielnelson
Copy link
Contributor

@tzz I think these pull requests might be of interest.

@Anderen2
Copy link

@russorat @danielnelson Adding another usecase for this. We see that it is hard (or impossible) to differentiate between that a process has stopped running or that the Telegraf agent has stopped sending metrics (for whatever reason).

As an example, with the usage of Kapacitor deadman alerting this causes one alarm for each process that was monitored on a host to be sent when the Telegraf agent stops reporting. Also, we cannot really confirm that the lack of data is due to the process not running or simply due to lack of data in graphs.

@danielnelson danielnelson modified the milestones: 1.7.0, 1.8.0 Jun 3, 2018
@danielnelson
Copy link
Contributor

We see that it is hard (or impossible) to differentiate between that a process has stopped running or that the Telegraf agent has stopped sending metrics (for whatever reason).

The best way to monitor if Telegraf is operational is by using the internal plugin. The internal_agent measurement is probably the best place to start:

internal_agent gather_errors=0i,metrics_dropped=0i,metrics_gathered=1i,metrics_written=0i 

You can also use this plugin to check if procstat collected any processes:

internal_gather,input=procstat metrics_gathered=0i

@danielnelson
Copy link
Contributor

I'm still uneasy with how this change would work when matching multiple processes. Here is a proposal for a slight modification which I would feel more comfortable with: #4237

If you are watching this issue, please take a look and let me know if it would work for your needs.

@Anderen2
Copy link

Anderen2 commented Jun 5, 2018

Hi @danielnelson, thanks for the feedback.

Yes, checking whether Telegraf is operational or not may be done with the "internal" plugin. We currently do it by checking "uptime" in "system", as Chronograf does it.

I cannot see that using the "internal_gather" "metrics_gathered" metric may help in our situation at least. In our case we would like to use Kapacitor to send alarms when a process on a host has stopped running, meaning we require one alarm for each process that has stopped for each server.

Using "deadman" for this works in theory yes but is quite iffy in practice. The fact is that "deadman" alerting sends alarms due to lack of metrics within a certain time, which might be due to a multitude of other reasons than that the process has stopped running.

We do currently run with "deadman" for this, however we experience issues with storms of alarms due to sudden latency fluctuations in the platform and other reasons for why the metrics might not be coming through within the interval. We are likely to stop using "deadman" for this as the sheer amount of false alarms overwhelm the ones that are real.

There are several other inputs that do expose a "status" field such as "result_type / result_code" for http_response/net_response or "exists" for filestat, it seems fitting that procstat also follows the same pattern in my opinion at least.

@danielnelson
Copy link
Contributor

There are several other inputs that do expose a "status" field such as "result_type / result_code" for http_response/net_response or "exists" for filestat, it seems fitting that procstat also follows the same pattern in my opinion at least.

I would use this pattern if procstat only matched a single process, but I think the idea in #4237 is fairly similar and mostly just drops the notion that this information is part of the procstat series.

@danielnelson
Copy link
Contributor

Superseded by #4307

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/procstat feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants