Add per-task gearman metrics #2672
Conversation
nyanshak commented Jul 8, 2016
- Add metrics to collect data on each individual task. This lets you see how many of each task is queued, so you can catch problems with any individual queue's processing.
- Each new metric is tagged by task:<task_name>
- New metrics: gearman.queued_by_task, gearman.running_by_task, gearman.workers_by_task
- Here is an example in the metrics explorer with this new data: (screenshot not preserved)
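For context, here is a minimal sketch of how such per-task gauges might be submitted from a Datadog check. The stat structure and the helper name are assumptions for illustration, not the PR's exact code (the actual submission lines appear in the diff further down):

```python
# Illustrative sketch only: the method name and the shape of `data`
# are assumptions, not the exact code from this PR.
def _collect_per_task_metrics(self, data, tags):
    # `data` is assumed to be a list of per-task dicts parsed from
    # gearmand's admin `status` output, e.g.:
    #   {'task': 'email_send', 'queued': 12, 'running': 3, 'workers': 5}
    for stat in data:
        task_tags = list(tags or [])
        task_tags.append("task:{}".format(stat['task']))
        self.gauge("gearman.queued_by_task", stat['queued'], tags=task_tags)
        self.gauge("gearman.running_by_task", stat['running'], tags=task_tags)
        self.gauge("gearman.workers_by_task", stat['workers'], tags=task_tags)
```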
Failing to install dependencies:

```
[2016-07-08T16:23:23Z] >>>>>>>>>>>>>> INSTALL STAGE
pip install --upgrade pip setuptools
Requirement already up-to-date: pip in /home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages
Requirement already up-to-date: setuptools in /home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages
pip install -r requirements.txt --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log
pip install pycurl==7.19.5.1 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install pycurl==7.19.5.1' 2>&1 >> /tmp/ci.log
pip install psutil==3.3.0 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install psutil==3.3.0' 2>&1 >> /tmp/ci.log
pip install pysnmp-mibs==0.1.4 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install pysnmp-mibs==0.1.4' 2>&1 >> /tmp/ci.log
pip install pysnmp==4.2.5 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install pysnmp==4.2.5' 2>&1 >> /tmp/ci.log
pip install pymongo==3.2 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install pymongo==3.2' 2>&1 >> /tmp/ci.log
pip install kazoo==1.3.1 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install kazoo==1.3.1' 2>&1 >> /tmp/ci.log
pip install winrandom-ctypes --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install winrandom-ctypes' 2>&1 >> /tmp/ci.log
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-YwxK3s/winrandom-ctypes/
pip install paramiko==1.15.2 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install paramiko==1.15.2' 2>&1 >> /tmp/ci.log
pip install psycopg2==2.6 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install psycopg2==2.6' 2>&1 >> /tmp/ci.log
pip install wmi==1.4.9 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install wmi==1.4.9' 2>&1 >> /tmp/ci.log
pip install scandir==1.2 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install scandir==1.2' 2>&1 >> /tmp/ci.log
pip install --upgrade -r requirements-test.txt --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log
[2016-07-08T16:23:29Z] >>>>>>>>>>>>>> BEFORE_SCRIPT STAGE
```
* Adds gearman.{queued_by_task, running_by_task, workers_by_task} metrics to collect data on each individual task. This lets you see how many of each task are queued, to catch problems with any individual queue not being processed.
* Each new metric is tagged by task:<task_name>
Hi @nyanshak, thanks for this addition! Your PR looks good overall! One concern I have is that, depending on the number of different tasks in the gearman job server, the check may create a lot of metrics (i.e. the cardinality of the unique tag combinations on these metrics could get high).

In your experience, how many different tasks would generally live on the job server? I think the check can reasonably collect metrics on ~100 different tasks, but if this number can be higher in some environments I'd rather the check had a way of limiting the number of tasks it collects metrics on.

Let me know what you think, thanks!
I hadn't thought of that. In our environment we're generally looking at maybe 10 or so tasks. I can look into limiting the number of tasks.

I took a first shot at integrating your feedback. Let me know what you think.
```python
for stat in data:
    if len(specified_tasks) > MAX_NUM_TASKS:
        raise Exception(
            "The maximum number of tasks you can specify is %d.".format(MAX_NUM_TASKS))
```
You should use `{}` instead of `%d` with `format`. Also, keep an eye on the `default` flavor in the test matrix on Travis: https://travis-ci.org/DataDog/dd-agent/builds/147788400
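The fix is mechanical; the corrected line would presumably read:

```python
# str.format substitutes {} placeholders; the printf-style %d in the
# original string would have been left untouched by .format().
raise Exception(
    "The maximum number of tasks you can specify is {}.".format(MAX_NUM_TASKS))
```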
@masci updated that line

Thanks @nyanshak, all green! Waiting for a thumbs up from @olivielpeau
```python
task_tags.append("task:{}".format(stat['task']))
self.gauge("gearman.running_by_task", running, tags=task_tags)
self.gauge("gearman.queued_by_task", queued, tags=task_tags)
self.gauge("gearman.workers_by_task", workers, tags=task_tags)
```
I'm not very familiar with gearmand's admin status command output, but do all the tasks listed in the output have a different `task`? (i.e. is each `task` field in the `tasks` list unique?)

If that's the case this works fine, but if not we need to use a different type of metric submission than `gauge` (probably `increment`). The reason is that with a `gauge`, multiple values submitted during the same run with the same metric name and the same tags overwrite one another, so only the last value is sent. With an `increment`, the values are summed instead.
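A self-contained simulation of that difference (the dict-based "backend" here is purely illustrative, not the agent's actual aggregator):

```python
# Minimal stand-ins for gauge/increment submission within one check run.
gauges, counters = {}, {}

def gauge(name, value, tags):
    # Last write wins for the same (name, tags) pair within a run.
    gauges[(name, tuple(tags))] = value

def increment(name, value, tags):
    # Values accumulate for the same (name, tags) pair.
    key = (name, tuple(tags))
    counters[key] = counters.get(key, 0) + value

# Two stats entries sharing the same task name (the non-unique case):
for queued in (3, 5):
    tags = ["task:email_send"]
    gauge("gearman.queued_by_task", queued, tags)
    increment("gearman.queued_by_task", queued, tags)

print(gauges)    # {('gearman.queued_by_task', ('task:email_send',)): 5}
print(counters)  # {('gearman.queued_by_task', ('task:email_send',)): 8}
```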
Thanks @nyanshak for your changes! I've added in a few comments, let us know if you can work on addressing them.
Split the check into two functions, one to collect the aggregate metrics and one to collect per-task metrics. I'm glad I did, because I found a bug or two in the process.
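A plausible skeleton of that split, runnable against fake status output. The function names, the `MAX_NUM_TASKS` value, and the parsing details are illustrative assumptions, not the PR's exact code:

```python
# Hypothetical skeleton of the aggregate/per-task split described above.
MAX_NUM_TASKS = 100  # assumed cap (~100, per the review discussion)

def get_aggregate_metrics(tasks):
    # One pass over all tasks for the pre-existing fleet-wide gauges.
    totals = {'queued': 0, 'running': 0, 'workers': 0}
    for t in tasks:
        for k in totals:
            totals[k] += t[k]
    return totals

def get_per_task_metrics(tasks, specified_tasks):
    # Separate pass for the new per-task gauges, honoring the task cap.
    if len(specified_tasks) > MAX_NUM_TASKS:
        raise Exception(
            "The maximum number of tasks you can specify is {}.".format(MAX_NUM_TASKS))
    wanted = set(specified_tasks)
    return [t for t in tasks if t['task'] in wanted]

# Example run against fake gearmand status output:
status = [{'task': 'email_send', 'queued': 3, 'running': 1, 'workers': 2},
          {'task': 'image_resize', 'queued': 7, 'running': 2, 'workers': 4}]
print(get_aggregate_metrics(status))              # {'queued': 10, ...}
print(get_per_task_metrics(status, ['email_send']))
```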
Each task field is unique.
Looks good, thanks @nyanshak! Merging, we'll include this in the next release.
* Adds gearman.{queued_by_task, running_by_task, workers_by_task} metrics to collect data on each individual task. This lets you see how many of each task are queued, to catch problems with any individual queue not being processed.
* Each new metric is tagged by task:<task_name>
* Limits the maximum number of tasks on which per-task metrics are collected