-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add process summary metrics #4231
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on the fence if this should be it's own metricset or not. I feel like it should be possible for a user to only enable the summary which speaks for its own metricset. At the same time I see it valuable inside the process metricset as it can be "calculated" during the fetch.
If we put it in it's own metricset we could put it under the namespace processes
. My worry is if we put it under process
is that it mixes the summary with a single event.
Thinking more about it I tend to go in the direction of a separate metricset which is enabled by default. This also allows to set different periods for the summary.
@@ -487,3 +487,33 @@ | |||
description: > | |||
Total number of I/O operations performed on all devices | |||
by processes in the cgroup as seen by the throttling policy. | |||
|
|||
- name: summaries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make this singular as it is only 1 summary and not multiple.
@@ -49,6 +49,7 @@ func New(base mb.BaseMetricSet) (mb.MetricSet, error) { | |||
CPUTicks bool `config:"cpu_ticks"` | |||
CacheCmdLine bool `config:"process.cmdline.cache.enabled"` | |||
IncludeTop includeTopConfig `config:"process.include_top_n"` | |||
Summaries bool `config:"process.summaries"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
process.summary.enabled
? In case we add more options to it in the future
@@ -373,6 +383,11 @@ func (procStats *ProcStats) GetProcStats() ([]common.MapStr, error) { | |||
} | |||
procStats.ProcsMap = newProcs | |||
|
|||
var summaries *processSummaries | |||
if procStats.Summaries { | |||
summaries = procStats.getSummaries(processes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little concerned that users might expect the summary to unfiltered. As written it looks like the summary only includes processes that matched the given regular expressions.
If we keep it this way, I feel like we should include info about what we matched against in the summary. This would be needed to distinguish between two separate summary events if the system/process
metricset is used more than once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you are right, it should be unfiltered, I didn't realize that the regexps are already applied at that point.
@ruflin I agree that a separate metricset would be somehow cleaner, although I don't like it when two metricsets have very similar names (process and processes) because it's unclear for the user which one does what. We already have that situation with Another concern is that with a separate metricset we'd need to poll the process data twice, from both metricsets. Or is there a way for two metricsets to work on the same data? |
Ok, I have tried the approach of having its own metricset and I have to admit it feels a lot simpler, so I'm now also leaning towards that option. I named the metricset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having a separate metricset is simpler w.r.t. the overall implementation.
state := sigar.ProcState{} | ||
err = state.Get(pid) | ||
if err != nil { | ||
return nil, fmt.Errorf("Error getting process state for pid=%d: %v", pid, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One bad apple will spoil the bunch. This will cause problems on Windows, because pid=0 and csrss.exe
processes cannot be accessed. But this could also affect other OSes if a process exits in between the Pids()
and the state.Get(pid)
calls.
One solution would be to treat these processes as a new state, unknown
, in the state summary.
|
||
pids, err := process.Pids() | ||
if err != nil { | ||
logp.Warn("Getting the list of pids: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the error should only be handled once. I'm advocating against logging the error also returning it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that was copy pasting from the process
metricset :) I'll fix it in both places.
case 'Z': | ||
summary.zombie += 1 | ||
default: | ||
logp.Err("Unknown state <%v> for process with pid %d", state.State, pid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be another reason for having an unknown
state in the summary. Without it, the sum of the counts of each state will not equal the total
in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I thought that we don't need unknown
because it can be computed on the UI side, but I think we can be explicit about it.
|
||
config := struct{}{} | ||
|
||
if err := base.Module().UnpackConfig(&config); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be removed since there is no config.
pids, err := process.Pids() | ||
if err != nil { | ||
logp.Warn("Getting the list of pids: %v", err) | ||
return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend wrapping the returned error to provide some addition context, like errors.Wrap(err, "failed to fetch list of PIDs")
.
Agree on the naming part, it's somehow ugly. I like the suggestion of |
Talking about naming: In case we agree on |
I'd say no, for that people should use the |
jenkins, retest it |
dfd5229
to
48c8fb7
Compare
"sleeping": 0, | ||
"stopped": 0, | ||
"total": 355, | ||
"unknown": 130, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a lot of unknowns. Is that because of permissions on /proc/pid/*
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, it was generated on my mac without sudo.
@@ -0,0 +1,3 @@ | |||
=== system process_summary MetricSet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
system
should be capitalized and I think we should change process_summary
to "Process Summary". I think it would fit in better on the metricset listing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// init registers the MetricSet with the central registry. | ||
// The New method will be called after the setup of the module and before starting to fetch data | ||
func init() { | ||
if err := mb.Registry.AddMetricSet("system", "process_summary", New, parse.EmptyHostParser); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the metricset name were process.summary
instead of process_summary
, would that alleviate some of the concerns about the name differing from the data model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewkroh I like the idea. Would this cause any issues somewhere else?
And interesting part of this could be that one could also enable process.*
metricsets in the future. This becomes especially interesting in the kubernetes case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting idea, but that currently doesn't work:
2017/05/11 10:25:00.264381 metricbeat.go:31: INFO Register [ModuleFactory:[docker, mongodb, mysql, postgresql, system], MetricSetFactory:[apache/status, audit/kernel, ceph/cluster_disk, ceph/cluster_health, ceph/monitor_health, ceph/pool_disk, couchbase/bucket, couchbase/cluster, couchbase/node, docker/container, docker/cpu, docker/diskio, docker/healthcheck, docker/image, docker/info, docker/memory, docker/network, dropwizard/collector, elasticsearch/node, elasticsearch/node_stats, golang/expvar, golang/heap, haproxy/info, haproxy/stat, http/json, jolokia/jmx, kafka/consumergroup, kafka/partition, kibana/status, kubernetes/container, kubernetes/node, kubernetes/pod, kubernetes/system, kubernetes/volume, memcached/stats, mongodb/dbstats, mongodb/status, mysql/status, nginx/stubstatus, php_fpm/pool, postgresql/activity, postgresql/bgwriter, postgresql/database, prometheus/collector, prometheus/stats, redis/info, redis/keyspace, system/core, system/cpu, system/diskio, system/filesystem, system/fsstat, system/load, system/memory, system/network, system/process, system/process.summary, vsphere/datastore, vsphere/host, vsphere/virtualmachine, zookeeper/mntr]]
panic: name process already used
I think this brings up the question (again), if we want to support metricset hierarchies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would separate the two things. I think we should support dots in metricset names which does not necessarly mean there is a hierachy in the code structure. That it happens on the data side is nice of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the panic is caused by a naming collision the happens when registering the metrics key for metricbeat.system.process.summary
with libbeat monitoring. Since the process
metricset is already registered we end up with a collision. We could work around the issue with a change to this line to remove dots from the key name. This would "flatten" the metrics reported about the metricsets and simplify consuming those metrics in any sort of UI.
key := fmt.Sprintf("metricbeat.%s.%s", module, strings.Replace(name, ".", "_", -1))
But @tsg's concern that the usage of dots is overloaded was spot on.
93f2798
to
f6a79fe
Compare
Updated to use |
@tsg I think you have to use ["..."] in the tests there:
|
As we're enabling `include_top_n` by default, a few visualizations from the Metricbeat-processes list were no longer correct (as they aggregate only a sample of the data). This adds a new `process_summary` metricset that adds these metrics. The fields are namespaced under `process.summary`. This PR adds summary metrics for the total number of processes and their state, as an extra document created by the `system.process` metricset.
f6a79fe
to
fbac667
Compare
jenkins, retest it |
state := sigar.ProcState{} | ||
err = state.Get(pid) | ||
if err != nil { | ||
summary.unknown += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[golint] reported by reviewdog 🐶
should replace summary.unknown += 1 with summary.unknown++
|
||
switch byte(state.State) { | ||
case 'S': | ||
summary.sleeping += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[golint] reported by reviewdog 🐶
should replace summary.sleeping += 1 with summary.sleeping++
case 'S': | ||
summary.sleeping += 1 | ||
case 'R': | ||
summary.running += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[golint] reported by reviewdog 🐶
should replace summary.running += 1 with summary.running++
case 'R': | ||
summary.running += 1 | ||
case 'D': | ||
summary.idle += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[golint] reported by reviewdog 🐶
should replace summary.idle += 1 with summary.idle++
case 'D': | ||
summary.idle += 1 | ||
case 'T': | ||
summary.stopped += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[golint] reported by reviewdog 🐶
should replace summary.stopped += 1 with summary.stopped++
case 'T': | ||
summary.stopped += 1 | ||
case 'Z': | ||
summary.zombie += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[golint] reported by reviewdog 🐶
should replace summary.zombie += 1 with summary.zombie++
summary.zombie += 1 | ||
default: | ||
logp.Err("Unknown state <%v> for process with pid %d", state.State, pid) | ||
summary.unknown += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[golint] reported by reviewdog 🐶
should replace summary.unknown += 1 with summary.unknown++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
assert.Contains(t, event, "idle") | ||
assert.Contains(t, event, "stopped") | ||
assert.Contains(t, event, "zombie") | ||
assert.Contains(t, event, "unknown") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing total
.
And how about a test to ensure the sum of the various states equals total
?
assert isinstance(summary["idle"], int) | ||
assert isinstance(summary["stopped"], int) | ||
assert isinstance(summary["zombie"], int) | ||
assert isinstance(summary["unknown"], int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing total
.
This addresses comments in elastic#4231 that came just before the PR got merged.
As we're enabling
include_top_n
by default, a few visualizations from theMetricbeat-processes list were no longer correct (as they aggregate only a sample of the
data).
This PR adds a new
process_summary
metricset that adds these metrics. The fields are namespaced underprocess.summary
.Remaining TODOs:
This is required for #4112.