Removing `mesos_tasks` metrics #1686

Conversation
@harnash I think we should actually completely remove that tag. Why would we want to expose the option to create massive cardinality? And especially why would we default that to true?
@sparrc I'll look into it first thing in the morning tomorrow (UTC+0200). I'll probably include those values as standard fields. Tagging values by `executor_id` can create quite a lot of data series in InfluxDB, so we should stick to `framework_id` and `server`.
@sparrc Is this what you had in mind?
please don't, Telegraf is meant to send well-behaved data. Sending massive-cardinality tags is a very bad practice and I wouldn't call eliminating those "intrusive"
@sparrc Your call. I'll leave it as it is. We can still use continuous queries to group them.
Not having the task name as an InfluxDB tag has some impact on how that data can be used; e.g. in Grafana we can use the group-by-tag feature to create graphs per task.
@r0bj you'll need to work around that if it's a high-cardinality series. Crashing the database is not an acceptable trade-off for easier graphing.
The main idea of the tasks part of the Mesos plugin is to be able to collect metrics per task, so we can monitor them and troubleshoot if needed.
I think we're only talking about removing that one tag.
Well, it is problematic since we want to collect and analyze statistics per task.
forgive my ignorance, as I'm not very familiar with mesos, but some of you are talking about executor_id and others are talking about "tasks", is there a relationship between the two? Also @harnash what do you mean by "normalizing"? Removing the tag? |
@sparrc Sorry for creating confusion. As far as I know, a Mesos executor can spawn multiple tasks; for our discussion we can treat the two as equivalent. By normalizing I mean stripping the UUIDs from the task IDs.
OK, I see, I think I actually agree with what @r0bj suggested. You should strip the UUID off of the `task_id` and push the metric with a tag called `task_name`. From what I can tell you should also be removing the `framework_id` tag? What is the usefulness of that UUID as a tag?
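The UUID-stripping idea being discussed could look roughly like this. This is a minimal sketch, not the plugin's actual code; the function name and regex are illustrative, assuming Marathon-style task IDs of the form `name.<UUID>`:

```go
package main

import (
	"fmt"
	"regexp"
)

// uuidSuffix matches a trailing ".<UUID>" as Marathon appends to task IDs,
// e.g. "hello_world.a9897537-7437-11e6-8492-56847afe9799".
var uuidSuffix = regexp.MustCompile(`\.[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// stripUUID removes the UUID suffix from a task ID, leaving a
// low-cardinality task name suitable for use as a tag value.
// IDs without a UUID suffix are returned unchanged.
func stripUUID(taskID string) string {
	return uuidSuffix.ReplaceAllString(taskID, "")
}

func main() {
	fmt.Println(stripUUID("hello_world.a9897537-7437-11e6-8492-56847afe9799"))
}
```

This keeps the series count proportional to the number of distinct task names rather than the number of task instances ever launched.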
Right now I'm not setting `framework_id`.
This seems related and can save us some trouble: d2iq-archive/marathon#1242 |
@sparrc @tuier After looking at the documentation and other resources, I'm inclined to leave the current solution as it is.
is there really no way to get the task name from mesos without the UUID? |
@harnash @tuier @r0bj if anyone could, it would be helpful if we could come up with a metric schema design that doesn't include any UUIDs but is still capable of usefully separating metrics. If the metrics can't be usefully separated without the UUIDs, then I think we should consider whether those metrics should even be collected at all.
Yeah. We are still brainstorming a solution.
Alright. There is a way to do it in a framework-agnostic way. We should call the master's state endpoint, which returns something like:

```json
{
  "frameworks": [
    {
      "id": "20150323-183301-505808906-5050-14434-0001",
      "name": "marathon",
      "pid": "[email protected]:53051",
      "used_resources": {
        "disk": 139264.0,
        "mem": 170346.0,
        "gpus": 0.0,
        "cpus": 66.3,
        "ports": "[31026-31027, 31035-31036, 31059-31060, 31078-31078, 31096-31096, 31118-31120, 31134-31135, 31139-31139, 31167-31168, 31179-31180, 31189-31190, 31208-31209, 31215-31215, 31230-31231, 31258-31259, 31291-31291, 31296-31297, 31309-31310, 31312-31313, 31320-31320, 31323-31324, 31344-31345, 31401-31401, 31420-31420, 31436-31437, 31439-31440, 31456-31457, 31477-31478, 31490-31491, 31547-31547, 31558-31558, 31579-31579, 31594-31595, 31607-31608, 31614-31617, 31638-31639, 31664-31664, 31705-31706, 31709-31709, 31714-31715, 31720-31721, 31732-31733, 31745-31746, 31748-31751, 31765-31766, 31797-31798, 31813-31813, 31821-31821, 31835-31836, 31864-31865, 31869-31870, 31886-31887, 31899-31900, 31936-31938, 31950-31951, 31972-31973, 31993-31994, 31996-31997]"
      },
      "offered_resources": {
        "disk": 0.0,
        "mem": 0.0,
        "gpus": 0.0,
        "cpus": 0.0
      },
      "capabilities": [],
      "hostname": "mesos-s2",
      "webui_url": "http://mesos-s2:8080",
      "active": true,
      "user": "marathon",
      "failover_timeout": 604800.0,
      "checkpoint": true,
      "role": "*",
      "registered_time": 1471960678.14146,
      "unregistered_time": 0.0,
      "resources": {
        "disk": 139264.0,
        "mem": 170346.0,
        "gpus": 0.0,
        "cpus": 66.3,
        "ports": "[31026-31027, 31035-31036, 31059-31060, 31078-31078, 31096-31096, 31118-31120, 31134-31135, 31139-31139, 31167-31168, 31179-31180, 31189-31190, 31208-31209, 31215-31215, 31230-31231, 31258-31259, 31291-31291, 31296-31297, 31309-31310, 31312-31313, 31320-31320, 31323-31324, 31344-31345, 31401-31401, 31420-31420, 31436-31437, 31439-31440, 31456-31457, 31477-31478, 31490-31491, 31547-31547, 31558-31558, 31579-31579, 31594-31595, 31607-31608, 31614-31617, 31638-31639, 31664-31664, 31705-31706, 31709-31709, 31714-31715, 31720-31721, 31732-31733, 31745-31746, 31748-31751, 31765-31766, 31797-31798, 31813-31813, 31821-31821, 31835-31836, 31864-31865, 31869-31870, 31886-31887, 31899-31900, 31936-31938, 31950-31951, 31972-31973, 31993-31994, 31996-31997]"
      },
      "tasks": [
        {
          "id": "hello_world.a9897537-7437-11e6-8492-56847afe9799",
          "name": "hello_world.production",
          "framework_id": "20150323-183301-505808906-5050-14434-0001",
          "executor_id": "",
          "slave_id": "78791579-8172-4509-8160-a5c76e0540af-S7",
          "state": "TASK_RUNNING",
          "resources": {
            "disk": 0.0,
            "mem": 550.0,
            "gpus": 0.0,
            "cpus": 0.2
          },
          "statuses": [
            {
              "state": "TASK_RUNNING",
              "timestamp": 1473169310.86516,
              "container_status": {
                "network_infos": [
                  {
                    "ip_addresses": [
                      {
                        "ip_address": "10.8.40.30"
                      }
                    ]
                  }
                ]
              }
            }
          ],
          "discovery": {
            "visibility": "FRAMEWORK",
            "name": "hello_world.production",
            "ports": {}
          }
        }
      ]
    }
  ]
}
```

Now from this we can map task IDs to task names.
why not just use tasks[i][name]? |
Yup. We should find the matching task and take its `name`.
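The mapping step could be sketched like this. It decodes only the subset of the state payload shown above; the struct and function names are illustrative, not the plugin's real code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// state models the minimal subset of the master state payload we need:
// each framework's tasks, with their IDs and human-readable names.
type state struct {
	Frameworks []struct {
		ID    string `json:"id"`
		Tasks []struct {
			ID   string `json:"id"`
			Name string `json:"name"`
		} `json:"tasks"`
	} `json:"frameworks"`
}

// taskNames builds a task-ID-to-task-name map from a raw state document.
func taskNames(raw []byte) (map[string]string, error) {
	var s state
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	names := make(map[string]string)
	for _, fw := range s.Frameworks {
		for _, t := range fw.Tasks {
			names[t.ID] = t.Name
		}
	}
	return names, nil
}

func main() {
	raw := []byte(`{"frameworks":[{"id":"20150323-183301-505808906-5050-14434-0001",
		"tasks":[{"id":"hello_world.a9897537-7437-11e6-8492-56847afe9799",
		"name":"hello_world.production"}]}]}`)
	names, err := taskNames(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(names["hello_world.a9897537-7437-11e6-8492-56847afe9799"])
}
```

With such a map, per-task samples could be tagged with the stable `name` instead of the UUID-bearing `id`.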
@sparrc I have one issue with this solution: when we have 4 instances, all their metrics will be sent as one data series, which makes it problematic to represent this data or do sensible alerting based on it. One solution would be to enumerate the tasks per `task_name` and assign them unique tags.
@harnash please try to spell it out a bit more simply, I truly have no experience with mesos, drawing out some diagrams or condensing what you're saying into bullet-points would help I think. If we can't find a solution soon, I'm going to have to remove the mesos plugin from Telegraf. I don't like having plugins within Telegraf that send badly formed data, as I'm sure many users have already shot themselves in the foot with this plugin. A short-term solution might be to remove the per-task metrics for the time being. We can also revisit per-task metrics when InfluxDB has fixed influxdata/influxdb#7151 |
@sparrc Alright. So we have the following situation: in Mesos there is a task (named, say, `hello_world`) running in three instances. Telegraf will gather metrics for those three instances; to simplify, after normalization they all share the same task name. Now if we send this data to InfluxDB with a tag specifying only the task name, the three instances collapse into one series, and it will be hard to do statistical analysis based on the tag value since we would have data from three instances mixed together. I'm not sure how InfluxDB would handle this data (there might be a slight time offset between those data points, but that is a guess right now). The idea is to map those instances to unique tag values, and of course you would also get the task name as a tag. I hope that clears the picture a bit. As for removing functionality, I would like to keep the master/slave stats and just disable the task metrics, which are the problematic part.
got it, crystal clear, thanks @harnash. My recommendation for now would be to remove task metrics, but leave the code as stubs as-is & in-place (still collecting task UUIDs). Once influxdata/influxdb#7151 has been implemented, we could provide a configuration option to gather task metrics, but with very clear warnings that collecting per-task metrics would lead to infinitely-increasing cardinality, and recommend that this data be kept in a very short-term retention-policy. |
…h them. Due to the very real problem of generating a vast number of data series through mesos tasks metrics, this feature is disabled until a better solution is found.
@sparrc Alright. As decided, I have removed the `mesos_tasks` metrics.
thanks @harnash, you can update the changelog now and I'll merge this change |
Required for all PRs:

Tagging metrics with `executor_id` can cause thousands of series in InfluxDB, which can easily exceed the default limit of 1M series per DB. This commit adds the missing `framework_id` tag and sends `executor_id` as a normal field. Tests were improved and documentation was also updated.
ping @tuier
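The tag/field split described in the PR body can be sketched as follows. The function and parameter names here are hypothetical, not the plugin's real accumulator API; they only show which identifiers end up where:

```go
package main

import "fmt"

// splitSeries illustrates the PR's rule for one task sample: low-cardinality
// identifiers (framework_id, server) become tags, which define the series,
// while the high-cardinality executor_id is demoted to a regular field, so
// it no longer multiplies the series count.
func splitSeries(frameworkID, server, executorID string) (map[string]string, map[string]interface{}) {
	tags := map[string]string{
		"framework_id": frameworkID,
		"server":       server,
	}
	fields := map[string]interface{}{
		"executor_id": executorID,
	}
	return tags, fields
}

func main() {
	tags, fields := splitSeries(
		"20150323-183301-505808906-5050-14434-0001",
		"mesos-s2",
		"hello_world.a9897537-7437-11e6-8492-56847afe9799",
	)
	fmt.Println(len(tags), len(fields))
}
```

Since InfluxDB series cardinality is driven by tag values, not field values, keeping `executor_id` as a field preserves the information for inspection without growing the index unboundedly.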