Add process summary metrics #4231

tsg · 2017-05-05T14:08:34Z

As we're enabling include_top_n by default, a few visualizations from the
Metricbeat-processes list were no longer correct (as they aggregate only a sample of the
data).

This PR adds a new process_summary metricset that adds these metrics. The fields are namespaced under process.summary.

Remaining TODOs:

system test for the new metricset.
changelog
update Kibana dashboards

This is required for #4112.

ruflin

I'm on the fence about whether this should be its own metricset or not. I feel like it should be possible for a user to enable only the summary, which speaks for its own metricset. At the same time, I see value in having it inside the process metricset, as it can be "calculated" during the fetch.

If we put it in its own metricset, we could put it under the namespace processes. My worry with putting it under process is that it mixes the summary with a single event.

Thinking more about it, I tend toward a separate metricset that is enabled by default. This also allows setting different periods for the summary.

ruflin · 2017-05-09T07:42:31Z

metricbeat/module/system/process/_meta/fields.yml

@@ -487,3 +487,33 @@
              description: >
                Total number of I/O operations performed on all devices
                by processes in the cgroup as seen by the throttling policy.
+
+    - name: summaries


I would make this singular as it is only 1 summary and not multiple.

ruflin · 2017-05-09T07:43:22Z

metricbeat/module/system/process/process.go

@@ -49,6 +49,7 @@ func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
 		CPUTicks     bool             `config:"cpu_ticks"`
 		CacheCmdLine bool             `config:"process.cmdline.cache.enabled"`
 		IncludeTop   includeTopConfig `config:"process.include_top_n"`
+		Summaries    bool             `config:"process.summaries"`


process.summary.enabled? In case we add more options to it in the future

andrewkroh · 2017-05-09T17:57:06Z

metricbeat/module/system/process/helper.go

@@ -373,6 +383,11 @@ func (procStats *ProcStats) GetProcStats() ([]common.MapStr, error) {
 	}
 	procStats.ProcsMap = newProcs

+	var summaries *processSummaries
+	if procStats.Summaries {
+		summaries = procStats.getSummaries(processes)


I'm a little concerned that users might expect the summary to be unfiltered. As written, it looks like the summary only includes processes that matched the given regular expressions.

If we keep it this way, I feel like we should include info about what we matched against in the summary. This would be needed to distinguish between two separate summary events if the system/process metricset is used more than once.

Ah, you are right, it should be unfiltered, I didn't realize that the regexps are already applied at that point.
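
To make the filtered-versus-unfiltered distinction concrete, here is a minimal sketch of counting states over every process before the name filters are applied. It is not the code from this PR: `proc`, `collect`, `sampleProcs`, and `nginxOnly` are hypothetical names, and the sample data is illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// proc is a hypothetical, minimal process record.
type proc struct {
	name  string
	state byte
}

// collect counts states over every process and, in the same pass, keeps
// only the processes whose name matches one of the filters. The counts
// therefore do not depend on the filters.
func collect(all []proc, filters []*regexp.Regexp) (counts map[byte]int, matched []proc) {
	counts = make(map[byte]int)
	for _, p := range all {
		counts[p.state]++ // the summary sees every process
		for _, re := range filters {
			if re.MatchString(p.name) {
				matched = append(matched, p)
				break
			}
		}
	}
	return counts, matched
}

// sampleProcs returns illustrative data only.
func sampleProcs() []proc {
	return []proc{{"nginx", 'S'}, {"bash", 'S'}, {"metricbeat", 'R'}}
}

// nginxOnly returns a filter list that matches a single process name.
func nginxOnly() []*regexp.Regexp {
	return []*regexp.Regexp{regexp.MustCompile(`^nginx$`)}
}

func main() {
	counts, matched := collect(sampleProcs(), nginxOnly())
	fmt.Println(len(matched), counts['S'], counts['R'])
}
```

With this ordering, two differently filtered instances of the metricset would report the same summary counts.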

tsg · 2017-05-09T21:11:21Z

@ruflin I agree that a separate metricset would be somewhat cleaner, although I don't like it when two metricsets have very similar names (process and processes) because it's unclear to the user which one does what. We already have that situation with fsstat and filesystem, and I find it confusing.

Another concern is that with a separate metricset we'd need to poll the process data twice, from both metricsets. Or is there a way for two metricsets to work on the same data?

tsg · 2017-05-09T22:18:15Z

Ok, I have tried the approach of having its own metricset and I have to admit it feels a lot simpler, so I'm now also leaning towards that option. I named the metricset process_summary, but I'm open to suggestions :).

andrewkroh

I think having a separate metricset is simpler w.r.t. the overall implementation.

andrewkroh · 2017-05-09T22:38:01Z

metricbeat/module/system/process_summary/process_summary.go

+		state := sigar.ProcState{}
+		err = state.Get(pid)
+		if err != nil {
+			return nil, fmt.Errorf("Error getting process state for pid=%d: %v", pid, err)


One bad apple will spoil the bunch. This will cause problems on Windows, because pid=0 and csrss.exe processes cannot be accessed. But this could also affect other OSes if a process exits in between the Pids() and the state.Get(pid) calls.

One solution would be to treat these processes as a new state, unknown, in the state summary.
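
A minimal sketch of that idea, under stated assumptions: `stateFunc` is a hypothetical stand-in for the per-process lookup (`sigar.ProcState.Get` in the diff), and `summary`, `summarize`, and `fakeState` are illustrative names rather than the PR's implementation. A failed lookup increments `unknown` instead of aborting the whole fetch.

```go
package main

import (
	"errors"
	"fmt"
)

// summary holds per-state process counts. The field names follow the
// states discussed in this PR; the type itself is hypothetical.
type summary struct {
	total, sleeping, running, idle, stopped, zombie, unknown int
}

// stateFunc is a hypothetical stand-in for the per-process state lookup.
type stateFunc func(pid int) (byte, error)

// summarize counts processes by state. A lookup error or an unrecognized
// state byte is counted as "unknown" rather than returned as an error.
func summarize(pids []int, getState stateFunc) summary {
	s := summary{total: len(pids)}
	for _, pid := range pids {
		state, err := getState(pid)
		if err != nil {
			s.unknown++
			continue
		}
		switch state {
		case 'S':
			s.sleeping++
		case 'R':
			s.running++
		case 'D':
			s.idle++
		case 'T':
			s.stopped++
		case 'Z':
			s.zombie++
		default:
			s.unknown++
		}
	}
	return s
}

// fakeState simulates a system where pid 0 cannot be inspected.
func fakeState(pid int) (byte, error) {
	if pid == 0 {
		return 0, errors.New("access denied")
	}
	return 'S', nil
}

func main() {
	fmt.Printf("%+v\n", summarize([]int{0, 1, 2}, fakeState))
}
```

Because every pid lands in exactly one bucket, the per-state counts always add up to `total`.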

andrewkroh · 2017-05-09T22:40:27Z

metricbeat/module/system/process_summary/process_summary.go

+
+	pids, err := process.Pids()
+	if err != nil {
+		logp.Warn("Getting the list of pids: %v", err)


I think the error should only be handled once. I'm advocating against both logging the error and returning it.

Oh, that was copy pasting from the process metricset :) I'll fix it in both places.

andrewkroh · 2017-05-09T22:43:15Z

metricbeat/module/system/process_summary/process_summary.go

+		case 'Z':
+			summary.zombie += 1
+		default:
+			logp.Err("Unknown state <%v> for process with pid %d", state.State, pid)


This could be another reason for having an unknown state in the summary. Without it, the sum of the counts of each state will not equal the total in this case.

Initially I thought that we don't need unknown because it can be computed on the UI side, but I think we can be explicit about it.

andrewkroh · 2017-05-09T22:44:14Z

metricbeat/module/system/process_summary/process_summary.go

+
+	config := struct{}{}
+
+	if err := base.Module().UnpackConfig(&config); err != nil {


These can be removed since there is no config.

andrewkroh · 2017-05-09T22:47:20Z

metricbeat/module/system/process_summary/process_summary.go

+	pids, err := process.Pids()
+	if err != nil {
+		logp.Warn("Getting the list of pids: %v", err)
+		return nil, err


I recommend wrapping the returned error to provide some additional context, like errors.Wrap(err, "failed to fetch list of PIDs").
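
A self-contained sketch of that pattern follows. Assumptions: the `errors.Wrap` in the comment is presumably the one from `github.com/pkg/errors`; to keep the example dependency-free it uses the standard library's `fmt.Errorf` with `%w` (Go 1.13+) instead. `pidLister`, `fetchPIDs`, `failingLister`, and `errNoProc` are hypothetical names, with `pidLister` standing in for `process.Pids()`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoProc is an illustrative failure used by the fake lister below.
var errNoProc = errors.New("cannot read process table")

// pidLister is a hypothetical stand-in for process.Pids().
type pidLister func() ([]int, error)

// fetchPIDs returns the PID list. On failure it adds context and returns
// the error without logging it, so the caller handles it exactly once.
func fetchPIDs(list pidLister) ([]int, error) {
	pids, err := list()
	if err != nil {
		return nil, fmt.Errorf("failed to fetch list of PIDs: %w", err)
	}
	return pids, nil
}

// failingLister always fails, to exercise the error path.
func failingLister() ([]int, error) { return nil, errNoProc }

func main() {
	_, err := fetchPIDs(failingLister)
	fmt.Println(err)
	// The original cause stays reachable through the wrapper.
	fmt.Println(errors.Is(err, errNoProc))
}
```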

ruflin · 2017-05-10T06:16:26Z

Agree on the naming part, it's somewhat ugly. I like the suggestion of process_summary, and we could still put it under process.summary as the docs namespace. Will we allow the process_summary to also be filtered or to exclude processes?

ruflin · 2017-05-10T06:17:12Z

Talking about naming: In case we agree on process_summary, we should probably rename fsstat to filesystem_summary for consistency. Agree that it is suboptimal / confusing at the moment.

tsg · 2017-05-10T10:51:36Z

Will we allow the process_summary to also be filtered or exclude processes?

I'd say no, for that people should use the process metricset and do the analytics in Kibana.

tsg · 2017-05-10T12:50:05Z

jenkins, retest it

andrewkroh · 2017-05-10T15:18:31Z

metricbeat/module/system/process_summary/_meta/data.json

+            "sleeping": 0,
+            "stopped": 0,
+            "total": 355,
+            "unknown": 130,


That's a lot of unknowns. Is that because of permissions on /proc/pid/*?

I think so, it was generated on my mac without sudo.

andrewkroh · 2017-05-10T15:21:22Z

metricbeat/module/system/process_summary/_meta/docs.asciidoc

@@ -0,0 +1,3 @@
+=== system process_summary MetricSet


system should be capitalized and I think we should change process_summary to "Process Summary". I think it would fit in better on the metricset listing.

andrewkroh

LGTM

andrewkroh · 2017-05-10T22:19:37Z

metricbeat/module/system/process_summary/process_summary.go

+// init registers the MetricSet with the central registry.
+// The New method will be called after the setup of the module and before starting to fetch data
+func init() {
+	if err := mb.Registry.AddMetricSet("system", "process_summary", New, parse.EmptyHostParser); err != nil {


What if the metricset name were process.summary instead of process_summary, would that alleviate some of the concerns about the name differing from the data model?

@andrewkroh I like the idea. Would this cause any issues somewhere else?

An interesting part of this could be that one could also enable process.* metricsets in the future. This becomes especially interesting in the kubernetes case.

Interesting idea, but that currently doesn't work:

2017/05/11 10:25:00.264381 metricbeat.go:31: INFO Register [ModuleFactory:[docker, mongodb, mysql, postgresql, system], MetricSetFactory:[apache/status, audit/kernel, ceph/cluster_disk, ceph/cluster_health, ceph/monitor_health, ceph/pool_disk, couchbase/bucket, couchbase/cluster, couchbase/node, docker/container, docker/cpu, docker/diskio, docker/healthcheck, docker/image, docker/info, docker/memory, docker/network, dropwizard/collector, elasticsearch/node, elasticsearch/node_stats, golang/expvar, golang/heap, haproxy/info, haproxy/stat, http/json, jolokia/jmx, kafka/consumergroup, kafka/partition, kibana/status, kubernetes/container, kubernetes/node, kubernetes/pod, kubernetes/system, kubernetes/volume, memcached/stats, mongodb/dbstats, mongodb/status, mysql/status, nginx/stubstatus, php_fpm/pool, postgresql/activity, postgresql/bgwriter, postgresql/database, prometheus/collector, prometheus/stats, redis/info, redis/keyspace, system/core, system/cpu, system/diskio, system/filesystem, system/fsstat, system/load, system/memory, system/network, system/process, system/process.summary, vsphere/datastore, vsphere/host, vsphere/virtualmachine, zookeeper/mntr]]
panic: name process already used

I think this brings up the question (again), if we want to support metricset hierarchies.

I would separate the two things. I think we should support dots in metricset names, which does not necessarily mean there is a hierarchy in the code structure. That it happens on the data side is nice of course.

So the panic is caused by a naming collision that happens when registering the metrics key for metricbeat.system.process.summary with libbeat monitoring. Since the process metricset is already registered, we end up with a collision. We could work around the issue with a change to this line to remove dots from the key name. This would "flatten" the metrics reported about the metricsets and simplify consuming those metrics in any sort of UI.

key := fmt.Sprintf("metricbeat.%s.%s", module, strings.Replace(name, ".", "_", -1))

But @tsg's concern that the usage of dots is overloaded was spot on.
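
For illustration, the one-line workaround quoted above can be wrapped in a small runnable program. `metricsKey` is a hypothetical helper name; only the `fmt.Sprintf` and `strings.Replace` expression comes from the comment.

```go
package main

import (
	"fmt"
	"strings"
)

// metricsKey builds the monitoring key for a metricset. Dots in the
// metricset name are replaced with underscores so that "process.summary"
// does not nest under the already registered "process" key.
func metricsKey(module, name string) string {
	return fmt.Sprintf("metricbeat.%s.%s", module, strings.Replace(name, ".", "_", -1))
}

func main() {
	fmt.Println(metricsKey("system", "process"))         // metricbeat.system.process
	fmt.Println(metricsKey("system", "process.summary")) // metricbeat.system.process_summary
}
```

The two keys no longer share a prefix path, which is what avoids the "name process already used" panic in this sketch of the registry's behavior.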

tsg · 2017-05-12T12:21:06Z

Updated to use process.summary in fields, but keep process_summary as the metricset name. Rebased and ready for new reviews/merging.

ruflin · 2017-05-12T13:27:15Z

@tsg I think you have to use ["..."] in the tests there:

Traceback (most recent call last):
  File "/go/src/github.com/elastic/beats/metricbeat/tests/system/test_system.py", line 318, in test_process_summary
    summary = evt["system"]["process.summary"]
KeyError: 'process.summary'
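
The failure above comes from treating a dotted field name as a single key. A minimal Go sketch of the same point, assuming the event nests `summary` under `process` as the suggested fix implies: `lookup` and `sampleEvent` are hypothetical helpers, and the nested map stands in for the published event.

```go
package main

import "fmt"

// lookup walks nested maps one key at a time, the equivalent of
// evt["system"]["process"]["summary"] in the Python test.
func lookup(m map[string]interface{}, keys ...string) (interface{}, bool) {
	var cur interface{} = m
	for _, k := range keys {
		node, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false
		}
		cur, ok = node[k]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

// sampleEvent returns an illustrative event; the total is the value
// shown in the data.json hunk in this thread.
func sampleEvent() map[string]interface{} {
	return map[string]interface{}{
		"system": map[string]interface{}{
			"process": map[string]interface{}{
				"summary": map[string]interface{}{"total": 355},
			},
		},
	}
}

func main() {
	evt := sampleEvent()
	_, ok := lookup(evt, "system", "process.summary")
	fmt.Println(ok) // false: there is no single key named "process.summary"
	_, ok = lookup(evt, "system", "process", "summary")
	fmt.Println(ok) // true: each path segment is its own key
}
```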

As we're enabling `include_top_n` by default, a few visualizations from the Metricbeat-processes list were no longer correct (as they aggregate only a sample of the data). This adds a new `process_summary` metricset that adds these metrics. The fields are namespaced under `process.summary`. This PR adds summary metrics for the total number of processes and their state, as an extra document created by the `system.process` metricset.

ruflin · 2017-05-12T14:49:30Z

jenkins, retest it

elasticmachine · 2017-05-12T16:05:50Z

metricbeat/module/system/process_summary/process_summary.go

+		state := sigar.ProcState{}
+		err = state.Get(pid)
+		if err != nil {
+			summary.unknown += 1


[golint] _{reported by reviewdog 🐶}
should replace summary.unknown += 1 with summary.unknown++

elasticmachine · 2017-05-12T16:05:50Z

metricbeat/module/system/process_summary/process_summary.go

+
+		switch byte(state.State) {
+		case 'S':
+			summary.sleeping += 1


[golint] _{reported by reviewdog 🐶}
should replace summary.sleeping += 1 with summary.sleeping++

elasticmachine · 2017-05-12T16:05:50Z

metricbeat/module/system/process_summary/process_summary.go

+		case 'S':
+			summary.sleeping += 1
+		case 'R':
+			summary.running += 1


[golint] _{reported by reviewdog 🐶}
should replace summary.running += 1 with summary.running++

elasticmachine · 2017-05-12T16:05:50Z

metricbeat/module/system/process_summary/process_summary.go

+		case 'R':
+			summary.running += 1
+		case 'D':
+			summary.idle += 1


[golint] _{reported by reviewdog 🐶}
should replace summary.idle += 1 with summary.idle++

elasticmachine · 2017-05-12T16:05:50Z

metricbeat/module/system/process_summary/process_summary.go

+		case 'D':
+			summary.idle += 1
+		case 'T':
+			summary.stopped += 1


[golint] _{reported by reviewdog 🐶}
should replace summary.stopped += 1 with summary.stopped++

elasticmachine · 2017-05-12T16:05:50Z

metricbeat/module/system/process_summary/process_summary.go

+		case 'T':
+			summary.stopped += 1
+		case 'Z':
+			summary.zombie += 1


[golint] _{reported by reviewdog 🐶}
should replace summary.zombie += 1 with summary.zombie++

elasticmachine · 2017-05-12T16:05:50Z

metricbeat/module/system/process_summary/process_summary.go

+			summary.zombie += 1
+		default:
+			logp.Err("Unknown state <%v> for process with pid %d", state.State, pid)
+			summary.unknown += 1


[golint] _{reported by reviewdog 🐶}
should replace summary.unknown += 1 with summary.unknown++

andrewkroh

LGTM

andrewkroh · 2017-05-12T16:13:57Z

metricbeat/module/system/process_summary/process_summary_test.go

+	assert.Contains(t, event, "idle")
+	assert.Contains(t, event, "stopped")
+	assert.Contains(t, event, "zombie")
+	assert.Contains(t, event, "unknown")


Missing total.

And how about a test to ensure the sum of the various states equals total?
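
One way such a check could look, as a standalone sketch rather than the test that was eventually added: `checkSummary` and `stateKeys` are hypothetical names, the event is simplified to a `map[string]int`, and the numbers in `main` are illustrative.

```go
package main

import "fmt"

// stateKeys lists the per-state counters named in this review thread.
var stateKeys = []string{"sleeping", "running", "idle", "stopped", "zombie", "unknown"}

// checkSummary verifies that "total" and every state key are present and
// that the per-state counts add up to "total".
func checkSummary(event map[string]int) error {
	total, ok := event["total"]
	if !ok {
		return fmt.Errorf("missing key %q", "total")
	}
	sum := 0
	for _, key := range stateKeys {
		count, ok := event[key]
		if !ok {
			return fmt.Errorf("missing key %q", key)
		}
		sum += count
	}
	if sum != total {
		return fmt.Errorf("states sum to %d, total is %d", sum, total)
	}
	return nil
}

func main() {
	// Illustrative values only.
	event := map[string]int{
		"sleeping": 200, "running": 5, "idle": 10,
		"stopped": 0, "zombie": 0, "unknown": 130, "total": 345,
	}
	if err := checkSummary(event); err != nil {
		fmt.Println("invalid summary:", err)
		return
	}
	fmt.Println("summary is consistent")
}
```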

andrewkroh · 2017-05-12T16:14:21Z

metricbeat/tests/system/test_system.py

+            assert isinstance(summary["idle"], int)
+            assert isinstance(summary["stopped"], int)
+            assert isinstance(summary["zombie"], int)
+            assert isinstance(summary["unknown"], int)


Missing total.

tsg added the Metricbeat, review, and in progress labels May 5, 2017

ruflin reviewed May 9, 2017

andrewkroh reviewed May 9, 2017

tsg force-pushed the add_process_summaries branch from dfd5229 to 48c8fb7 Compare May 10, 2017 13:07

andrewkroh reviewed May 10, 2017

tsg mentioned this pull request May 10, 2017

Metricbeat: reduce disk usage in default configuration #4112

Closed

9 tasks

andrewkroh approved these changes May 10, 2017

andrewkroh reviewed May 10, 2017

tsg force-pushed the add_process_summaries branch 2 times, most recently from 93f2798 to f6a79fe Compare May 12, 2017 12:19

tsg removed the in progress label May 12, 2017

ruflin approved these changes May 12, 2017

tsg force-pushed the add_process_summaries branch from f6a79fe to fbac667 Compare May 12, 2017 13:38

elasticmachine reviewed May 12, 2017

andrewkroh approved these changes May 12, 2017

ruflin merged commit 86932ca into elastic:master May 12, 2017

tsg pushed a commit to tsg/beats that referenced this pull request May 12, 2017

Follow up on comments from elastic#4231

4f55207

This addresses comments in elastic#4231 that came just before the PR got merged.

tsg mentioned this pull request May 12, 2017

Follow up on comments from #4231 #4305

Merged

andrewkroh pushed a commit that referenced this pull request May 13, 2017

Follow up on comments from #4231 (#4305)

0593f67

This addresses comments in #4231 that came just before the PR got merged.

