Add per-task gearman metrics #2672
Conversation
nyanshak commented Jul 8, 2016
- Add metrics to collect data on each individual task. This lets you see how many of each task is queued, so you can catch problems with any individual queue's processing.
- Each new metric is tagged by task:<task_name>
- New metrics: gearman.queued_by_task, gearman.running_by_task, gearman.workers_by_task
- Here is an example in the metrics explorer with this new data: (screenshot not preserved)
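For context, here is a minimal sketch of how such per-task gauges might be submitted from a Datadog check. The stat structure and the helper name are assumptions for illustration, not the PR's exact code (the actual submission lines appear in the diff further down):

```python
# Illustrative sketch only: the method name and the shape of `data`
# are assumptions, not the exact code from this PR.
def _collect_per_task_metrics(self, data, tags):
    # `data` is assumed to be a list of per-task dicts parsed from
    # gearmand's admin `status` output, e.g.:
    #   {'task': 'email_send', 'queued': 12, 'running': 3, 'workers': 5}
    for stat in data:
        task_tags = list(tags or [])
        task_tags.append("task:{}".format(stat['task']))
        self.gauge("gearman.queued_by_task", stat['queued'], tags=task_tags)
        self.gauge("gearman.running_by_task", stat['running'], tags=task_tags)
        self.gauge("gearman.workers_by_task", stat['workers'], tags=task_tags)
```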
Failing to install dependencies:

```
[2016-07-08T16:23:23Z] >>>>>>>>>>>>>> INSTALL STAGE
pip install --upgrade pip setuptools
Requirement already up-to-date: pip in /home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages
Requirement already up-to-date: setuptools in /home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages
pip install -r requirements.txt --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log
pip install pycurl==7.19.5.1 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install pycurl==7.19.5.1' 2>&1 >> /tmp/ci.log
pip install psutil==3.3.0 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install psutil==3.3.0' 2>&1 >> /tmp/ci.log
pip install pysnmp-mibs==0.1.4 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install pysnmp-mibs==0.1.4' 2>&1 >> /tmp/ci.log
pip install pysnmp==4.2.5 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install pysnmp==4.2.5' 2>&1 >> /tmp/ci.log
pip install pymongo==3.2 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install pymongo==3.2' 2>&1 >> /tmp/ci.log
pip install kazoo==1.3.1 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install kazoo==1.3.1' 2>&1 >> /tmp/ci.log
pip install winrandom-ctypes --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install winrandom-ctypes' 2>&1 >> /tmp/ci.log
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-YwxK3s/winrandom-ctypes/
pip install paramiko==1.15.2 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install paramiko==1.15.2' 2>&1 >> /tmp/ci.log
pip install psycopg2==2.6 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install psycopg2==2.6' 2>&1 >> /tmp/ci.log
pip install wmi==1.4.9 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install wmi==1.4.9' 2>&1 >> /tmp/ci.log
pip install scandir==1.2 --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log || echo 'Unable to install scandir==1.2' 2>&1 >> /tmp/ci.log
pip install --upgrade -r requirements-test.txt --cache-dir /home/travis/.cache/pip 2>&1 >> /tmp/ci.log
[2016-07-08T16:23:29Z] >>>>>>>>>>>>>> BEFORE_SCRIPT STAGE
```
* Adds gearman.{queued_by_task, running_by_task, workers_by_task} metrics to collect data on each individual task. This lets you see how many of each task are queued, to catch problems with any individual queue not being processed.
* Each new metric is tagged by task:<task_name>
Hi @nyanshak, thanks for this addition! Your PR looks good overall! One concern I have is that, depending on the number of different tasks in the gearman job server, the check may create a lot of metrics (i.e. the cardinality of the unique tag combinations on these metrics could get high).

In your experience, how many different tasks would generally live on the job server? I think the check can reasonably collect metrics on ~100 different tasks, but if this number can be higher in some environments I'd rather the check had a way of limiting the number of tasks it collects metrics on.

Let me know what you think, thanks!
I hadn't thought of that. In our environment we're generally looking at maybe 10 or so tasks. I can look into limiting the number of tasks.

I took a first shot at integrating your feedback. Let me know what you think.
```python
for stat in data:
    if len(specified_tasks) > MAX_NUM_TASKS:
        raise Exception(
            "The maximum number of tasks you can specify is %d.".format(MAX_NUM_TASKS))
```
You should use `{}` instead of `%d` with `format`. Also, keep an eye on the `default` flavor in the test matrix on Travis: https://travis-ci.org/DataDog/dd-agent/builds/147788400
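The fix is mechanical; the corrected line would presumably read:

```python
# str.format substitutes {} placeholders; the printf-style %d in the
# original string would have been left untouched by .format().
raise Exception(
    "The maximum number of tasks you can specify is {}.".format(MAX_NUM_TASKS))
```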
@masci updated that line

Thanks @nyanshak, all green! Waiting for a thumbs up from @olivielpeau
```python
task_tags.append("task:{}".format(stat['task']))
self.gauge("gearman.running_by_task", running, tags=task_tags)
self.gauge("gearman.queued_by_task", queued, tags=task_tags)
self.gauge("gearman.workers_by_task", workers, tags=task_tags)
```
I'm not very familiar with gearmand's admin status command output, but do all the tasks listed in the output have a different `task`? (i.e. is each `task` field in the `tasks` list unique?)

If that's the case this works fine, but if not we need to use a different type of metric submission than `gauge` (probably `increment`). The reason is that with a `gauge`, multiple values submitted during the same run with the same metric name and the same tags overwrite one another, so only the last value is sent. With an `increment`, the values are summed instead.
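A self-contained simulation of that difference (the dict-based "backend" here is purely illustrative, not the agent's actual aggregator):

```python
# Minimal stand-ins for gauge/increment submission within one check run.
gauges, counters = {}, {}

def gauge(name, value, tags):
    # Last write wins for the same (name, tags) pair within a run.
    gauges[(name, tuple(tags))] = value

def increment(name, value, tags):
    # Values accumulate for the same (name, tags) pair.
    key = (name, tuple(tags))
    counters[key] = counters.get(key, 0) + value

# Two stats entries sharing the same task name (the non-unique case):
for queued in (3, 5):
    tags = ["task:email_send"]
    gauge("gearman.queued_by_task", queued, tags)
    increment("gearman.queued_by_task", queued, tags)

print(gauges)    # {('gearman.queued_by_task', ('task:email_send',)): 5}
print(counters)  # {('gearman.queued_by_task', ('task:email_send',)): 8}
```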
Thanks @nyanshak for your changes! I've added in a few comments, let us know if you can work on addressing them.
Split the check into two functions, one to collect the aggregate metrics and one to collect per-task metrics. I'm glad I did, because I found a bug or two in the process.
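A plausible skeleton of that split, runnable against fake status output. The function names, the `MAX_NUM_TASKS` value, and the parsing details are illustrative assumptions, not the PR's exact code:

```python
# Hypothetical skeleton of the aggregate/per-task split described above.
MAX_NUM_TASKS = 100  # assumed cap (~100, per the review discussion)

def get_aggregate_metrics(tasks):
    # One pass over all tasks for the pre-existing fleet-wide gauges.
    totals = {'queued': 0, 'running': 0, 'workers': 0}
    for t in tasks:
        for k in totals:
            totals[k] += t[k]
    return totals

def get_per_task_metrics(tasks, specified_tasks):
    # Separate pass for the new per-task gauges, honoring the task cap.
    if len(specified_tasks) > MAX_NUM_TASKS:
        raise Exception(
            "The maximum number of tasks you can specify is {}.".format(MAX_NUM_TASKS))
    wanted = set(specified_tasks)
    return [t for t in tasks if t['task'] in wanted]

# Example run against fake gearmand status output:
status = [{'task': 'email_send', 'queued': 3, 'running': 1, 'workers': 2},
          {'task': 'image_resize', 'queued': 7, 'running': 2, 'workers': 4}]
print(get_aggregate_metrics(status))              # {'queued': 10, ...}
print(get_per_task_metrics(status, ['email_send']))
```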
Each task field is unique.
Looks good, thanks @nyanshak! Merging, we'll include this in the next release.
* Adds gearman.{queued_by_task, running_by_task, workers_by_task} metrics to collect data on each individual task. This lets you see how many of each task are queued, to catch problems with any individual queue not being processed.
* Each new metric is tagged by task:<task_name>
* Limits the maximum number of tasks on which per-task metrics are collected