Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add process summary metrics #4231

Merged
merged 1 commit into from
May 12, 2017
Merged

Conversation

tsg
Copy link
Contributor

@tsg tsg commented May 5, 2017

As we're enabling include_top_n by default, a few visualizations from the
Metricbeat-processes list were no longer correct (as they aggregate only a sample of the
data).

This PR adds a new process_summary metricset that adds these metrics. The fields are namespaced under process.summary.

Remaining TODOs:

  • system test for the new metricset.
  • changelog
  • update Kibana dashboards

This is required for #4112.

@tsg tsg added Metricbeat Metricbeat review in progress Pull request is currently in progress. labels May 5, 2017
Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence if this should be it's own metricset or not. I feel like it should be possible for a user to only enable the summary which speaks for its own metricset. At the same time I see it valuable inside the process metricset as it can be "calculated" during the fetch.

If we put it in it's own metricset we could put it under the namespace processes. My worry is if we put it under process is that it mixes the summary with a single event.

Thinking more about it I tend to go in the direction of a separate metricset which is enabled by default. This also allows to set different periods for the summary.

@@ -487,3 +487,33 @@
description: >
Total number of I/O operations performed on all devices
by processes in the cgroup as seen by the throttling policy.

- name: summaries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this singular as it is only 1 summary and not multiple.

@@ -49,6 +49,7 @@ func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
CPUTicks bool `config:"cpu_ticks"`
CacheCmdLine bool `config:"process.cmdline.cache.enabled"`
IncludeTop includeTopConfig `config:"process.include_top_n"`
Summaries bool `config:"process.summaries"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process.summary.enabled? In case we add more options to it in the future

@@ -373,6 +383,11 @@ func (procStats *ProcStats) GetProcStats() ([]common.MapStr, error) {
}
procStats.ProcsMap = newProcs

var summaries *processSummaries
if procStats.Summaries {
summaries = procStats.getSummaries(processes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little concerned that users might expect the summary to unfiltered. As written it looks like the summary only includes processes that matched the given regular expressions.

If we keep it this way, I feel like we should include info about what we matched against in the summary. This would be needed to distinguish between two separate summary events if the system/process metricset is used more than once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you are right, it should be unfiltered, I didn't realize that the regexps are already applied at that point.

@tsg
Copy link
Contributor Author

tsg commented May 9, 2017

@ruflin I agree that a separate metricset would be somehow cleaner, although I don't like it when two metricsets have very similar names (process and processes) because it's unclear for the user which one does what. We already have that situation with fststat and filesystem, and I find it confusing.

Another concern is that with a separate metricset we'd need to poll the process data twice, from both metricsets. Or is there a way for two metricsets to work on the same data?

@tsg
Copy link
Contributor Author

tsg commented May 9, 2017

Ok, I have tried the approach of having its own metricset and I have to admit it feels a lot simpler, so I'm now also leaning towards that option. I named the metricset process_summary, but I'm open to suggestions :).

Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having a separate metricset is simpler w.r.t. the overall implementation.

state := sigar.ProcState{}
err = state.Get(pid)
if err != nil {
return nil, fmt.Errorf("Error getting process state for pid=%d: %v", pid, err)
Copy link
Member

@andrewkroh andrewkroh May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One bad apple will spoil the bunch. This will cause problems on Windows, because pid=0 and csrss.exe processes cannot be accessed. But this could also affect other OSes if a process exits in between the Pids() and the state.Get(pid) calls.

One solution would be to treat these processes as a new state, unknown, in the state summary.


pids, err := process.Pids()
if err != nil {
logp.Warn("Getting the list of pids: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the error should only be handled once. I'm advocating against logging the error also returning it.

Copy link
Contributor Author

@tsg tsg May 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that was copy pasting from the process metricset :) I'll fix it in both places.

case 'Z':
summary.zombie += 1
default:
logp.Err("Unknown state <%v> for process with pid %d", state.State, pid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be another reason for having an unknown state in the summary. Without it, the sum of the counts of each state will not equal the total in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I thought that we don't need unknown because it can be computed on the UI side, but I think we can be explicit about it.


config := struct{}{}

if err := base.Module().UnpackConfig(&config); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be removed since there is no config.

pids, err := process.Pids()
if err != nil {
logp.Warn("Getting the list of pids: %v", err)
return nil, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend wrapping the returned error to provide some addition context, like errors.Wrap(err, "failed to fetch list of PIDs").

@ruflin
Copy link
Contributor

ruflin commented May 10, 2017

Agree on the naming part, it's somehow ugly. I like the suggestion of process_summary and we could still put it under process.summary as the docs namespace. Will we allow the process_summary to also be filtered or exclude processes?

@ruflin
Copy link
Contributor

ruflin commented May 10, 2017

Talking about naming: In case we agree on process_summary, we should probably rename fsstats to filesystem_summary for consitency. Agree that it is suboptimal / confusing at the moment.

@tsg
Copy link
Contributor Author

tsg commented May 10, 2017

Will we allow the process_summary to also be filtered or exclude processes?

I'd say no, for that people should use the process metricset and do the analytics in Kibana.

@tsg
Copy link
Contributor Author

tsg commented May 10, 2017

jenkins, retest it

@tsg tsg force-pushed the add_process_summaries branch from dfd5229 to 48c8fb7 Compare May 10, 2017 13:07
"sleeping": 0,
"stopped": 0,
"total": 355,
"unknown": 130,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of unknowns. Is that because of permissions on /proc/pid/*?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, it was generated on my mac without sudo.

@@ -0,0 +1,3 @@
=== system process_summary MetricSet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

system should be capitalized and I think we should change process_summary to "Process Summary". I think it would fit in better on the metricset listing.

Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// init registers the MetricSet with the central registry.
// The New method will be called after the setup of the module and before starting to fetch data
func init() {
if err := mb.Registry.AddMetricSet("system", "process_summary", New, parse.EmptyHostParser); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the metricset name were process.summary instead of process_summary, would that alleviate some of the concerns about the name differing from the data model?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewkroh I like the idea. Would this cause any issues somewhere else?

And interesting part of this could be that one could also enable process.* metricsets in the future. This becomes especially interesting in the kubernetes case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea, but that currently doesn't work:

2017/05/11 10:25:00.264381 metricbeat.go:31: INFO Register [ModuleFactory:[docker, mongodb, mysql, postgresql, system], MetricSetFactory:[apache/status, audit/kernel, ceph/cluster_disk, ceph/cluster_health, ceph/monitor_health, ceph/pool_disk, couchbase/bucket, couchbase/cluster, couchbase/node, docker/container, docker/cpu, docker/diskio, docker/healthcheck, docker/image, docker/info, docker/memory, docker/network, dropwizard/collector, elasticsearch/node, elasticsearch/node_stats, golang/expvar, golang/heap, haproxy/info, haproxy/stat, http/json, jolokia/jmx, kafka/consumergroup, kafka/partition, kibana/status, kubernetes/container, kubernetes/node, kubernetes/pod, kubernetes/system, kubernetes/volume, memcached/stats, mongodb/dbstats, mongodb/status, mysql/status, nginx/stubstatus, php_fpm/pool, postgresql/activity, postgresql/bgwriter, postgresql/database, prometheus/collector, prometheus/stats, redis/info, redis/keyspace, system/core, system/cpu, system/diskio, system/filesystem, system/fsstat, system/load, system/memory, system/network, system/process, system/process.summary, vsphere/datastore, vsphere/host, vsphere/virtualmachine, zookeeper/mntr]]
panic: name process already used

I think this brings up the question (again), if we want to support metricset hierarchies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would separate the two things. I think we should support dots in metricset names which does not necessarly mean there is a hierachy in the code structure. That it happens on the data side is nice of course.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the panic is caused by a naming collision the happens when registering the metrics key for metricbeat.system.process.summary with libbeat monitoring. Since the process metricset is already registered we end up with a collision. We could work around the issue with a change to this line to remove dots from the key name. This would "flatten" the metrics reported about the metricsets and simplify consuming those metrics in any sort of UI.

key := fmt.Sprintf("metricbeat.%s.%s", module, strings.Replace(name, ".", "_", -1))

But @tsg's concern that the usage of dots is overloaded was spot on.

@tsg tsg force-pushed the add_process_summaries branch 2 times, most recently from 93f2798 to f6a79fe Compare May 12, 2017 12:19
@tsg tsg removed the in progress Pull request is currently in progress. label May 12, 2017
@tsg
Copy link
Contributor Author

tsg commented May 12, 2017

Updated to use process.summary in fields, but keep process_summary as the metricset name. Rebased and ready for new reviews/merging.

@ruflin
Copy link
Contributor

ruflin commented May 12, 2017

@tsg I think you have to use ["..."] in the tests there:

Traceback (most recent call last):
  File "/go/src/github.com/elastic/beats/metricbeat/tests/system/test_system.py", line 318, in test_process_summary
    summary = evt["system"]["process.summary"]
KeyError: 'process.summary'

As we're enabling `include_top_n` by default, a few visualizations from the
Metricbeat-processes list were no longer correct (as they aggregate only a
sample of the data).

This adds a new `process_summary` metricset that adds these metrics. The fields
are namespaced under `process.summary`.

This PR adds summary metrics for the total number of processes and their state, as an
extra document created by the `system.process` metricset.
@tsg tsg force-pushed the add_process_summaries branch from f6a79fe to fbac667 Compare May 12, 2017 13:38
@ruflin
Copy link
Contributor

ruflin commented May 12, 2017

jenkins, retest it

state := sigar.ProcState{}
err = state.Get(pid)
if err != nil {
summary.unknown += 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[golint] reported by reviewdog 🐶
should replace summary.unknown += 1 with summary.unknown++


switch byte(state.State) {
case 'S':
summary.sleeping += 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[golint] reported by reviewdog 🐶
should replace summary.sleeping += 1 with summary.sleeping++

case 'S':
summary.sleeping += 1
case 'R':
summary.running += 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[golint] reported by reviewdog 🐶
should replace summary.running += 1 with summary.running++

case 'R':
summary.running += 1
case 'D':
summary.idle += 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[golint] reported by reviewdog 🐶
should replace summary.idle += 1 with summary.idle++

case 'D':
summary.idle += 1
case 'T':
summary.stopped += 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[golint] reported by reviewdog 🐶
should replace summary.stopped += 1 with summary.stopped++

case 'T':
summary.stopped += 1
case 'Z':
summary.zombie += 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[golint] reported by reviewdog 🐶
should replace summary.zombie += 1 with summary.zombie++

summary.zombie += 1
default:
logp.Err("Unknown state <%v> for process with pid %d", state.State, pid)
summary.unknown += 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[golint] reported by reviewdog 🐶
should replace summary.unknown += 1 with summary.unknown++

Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

assert.Contains(t, event, "idle")
assert.Contains(t, event, "stopped")
assert.Contains(t, event, "zombie")
assert.Contains(t, event, "unknown")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing total.

And how about a test to ensure the sum of the various states equals total?

assert isinstance(summary["idle"], int)
assert isinstance(summary["stopped"], int)
assert isinstance(summary["zombie"], int)
assert isinstance(summary["unknown"], int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing total.

@ruflin ruflin merged commit 86932ca into elastic:master May 12, 2017
tsg pushed a commit to tsg/beats that referenced this pull request May 12, 2017
This addresses comments in elastic#4231 that came just before the PR got merged.
andrewkroh pushed a commit that referenced this pull request May 13, 2017
This addresses comments in #4231 that came just before the PR got merged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants