Removing mesos_tasks metrics. #1686

Merged
merged 3 commits into from
Sep 28, 2016
Conversation

@harnash
Contributor

harnash commented Aug 30, 2016

Required for all PRs:

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

Tagging metrics with executor_id can create thousands of series in InfluxDB, which can easily exceed the default limit of 1M series per database. This commit adds the missing framework_id tag and sends executor_id as a normal field.

Tests were improved and documentation was also updated.
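
For illustration only, here is a minimal sketch of that scheme, assuming Telegraf's Accumulator.AddFields(measurement, fields, tags) call; the helper name, its parameter list, and the cpus_user_time_secs field are hypothetical, not the actual code in this PR:

package mesos

import "github.com/influxdata/telegraf"

// emitTaskMetrics (hypothetical helper) keeps only low-cardinality
// identifiers as tags and demotes executor_id to a regular field.
func emitTaskMetrics(acc telegraf.Accumulator, server, frameworkID, executorID string, cpuUsage float64) {
	tags := map[string]string{
		"server":       server,
		"framework_id": frameworkID,
		// executor_id is deliberately NOT a tag, to avoid series explosion
	}
	fields := map[string]interface{}{
		"executor_id":         executorID, // still available, just not indexed
		"cpus_user_time_secs": cpuUsage,   // illustrative numeric field
	}
	acc.AddFields("mesos_tasks", fields, tags)
}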

ping @tuier

@sparrc
Contributor

sparrc commented Aug 30, 2016

@harnash I think we should actually completely remove that tag. Why would we want to expose the option to create massive cardinality? And especially why would we default that to true?

@harnash
Contributor Author

harnash commented Aug 30, 2016

@sparrc I'll look into it first thing in the morning tomorrow (UTC+0200 time). I'll probably include those values as standard fields and leave only framework_id and server.

Tagging values by executor_id can create quite a lot of data series
in InfluxDB, so we should stick to framework_id and server.
@harnash
Contributor Author

harnash commented Aug 31, 2016

@sparrc Is this what you had in mind?

@r0bj

r0bj commented Aug 31, 2016

@harnash @sparrc what about just normalizing this tag by default? That would avoid the intrusive behaviour while keeping the functionality.

@harnash changed the title from "Added option to normalize tags on mesos_tasks metrics." to "Fix tags on mesos_tasks metrics." Aug 31, 2016
@sparrc
Contributor

sparrc commented Aug 31, 2016

Please don't; Telegraf is meant to send well-behaved data. Sending massive-cardinality tags is a very bad practice, and I wouldn't call eliminating them "intrusive".

@harnash
Contributor Author

harnash commented Sep 1, 2016

@sparrc Your call. I'll leave it as it is. We can still use continuous queries to group them by executor_id if needed.

@r0bj

r0bj commented Sep 1, 2016

Not having the task name as an InfluxDB tag has some impact on using that kind of data, e.g. in Grafana we use the group-by-tag feature to create graphs per task.

@sparrc
Contributor

sparrc commented Sep 1, 2016

@r0bj you'll need to work around that if it's a high-cardinality series. Crashing the database is not an acceptable trade-off for easier graphing.

@jwilder added the bug (unexpected problem or unintended behavior) label Sep 1, 2016
@tuier
Contributor

tuier commented Sep 5, 2016

The main idea of the Mesos plugin's tasks part is to be able to collect metrics per task, so we can monitor them and troubleshoot if needed.
I agree that in some cases this could create a lot of metrics, and because of that the normalization should be enabled by default, but removing that part completely will remove half of the value of this output.
@sparrc @harnash

@sparrc
Contributor

sparrc commented Sep 6, 2016

I think we're only talking about removing executor_id as a tag, does that present a huge problem to people?

@harnash
Contributor Author

harnash commented Sep 6, 2016

Well, it is problematic, since we want to collect and analyze statistics per executor_id and InfluxDB can group data by tag. We can group it using continuous queries, but then we are facing the cardinality problem, since we can't normalize this data inside InfluxDB. It is doable but problematic, and it adds an additional step to the workflow to process executor_id values (a third-party tool, a Kapacitor UDF, etc.).
Normalizing the executor_id tag would be a compromise and should be enough in most cases, I guess. If there is a better workaround, I'm not aware of it at the moment.

@sparrc @tuier @r0bj

@sparrc
Contributor

sparrc commented Sep 6, 2016

forgive my ignorance, as I'm not very familiar with mesos, but some of you are talking about executor_id and others are talking about "tasks", is there a relationship between the two? Also @harnash what do you mean by "normalizing"? Removing the tag?

@harnash
Contributor Author

harnash commented Sep 6, 2016

@sparrc Sorry for creating confusion. As far as I know, a Mesos executor can spawn multiple tasks. For our discussion we can assume that executor_id identifies each task running on Mesos.

By normalizing I mean stripping the UUIDs from executor_ids, effectively grouping series by task name.
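
To make "normalizing" concrete, here is a small self-contained sketch, assuming executor IDs of the form task_name.UUID (the Marathon-style IDs used as examples later in this thread); the function name is illustrative:

package main

import (
	"fmt"
	"regexp"
)

// uuidSuffix matches a trailing ".<UUID>" (assumption: the framework appends
// a dot-separated UUID to the task name, as Marathon does).
var uuidSuffix = regexp.MustCompile(`\.[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// normalizeExecutorID strips the UUID, effectively grouping series by task name.
func normalizeExecutorID(executorID string) string {
	return uuidSuffix.ReplaceAllString(executorID, "")
}

func main() {
	// "hello_world.a9897537-7437-11e6-8492-56847afe9799" -> "hello_world"
	fmt.Println(normalizeExecutorID("hello_world.a9897537-7437-11e6-8492-56847afe9799"))
}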

@sparrc
Contributor

sparrc commented Sep 6, 2016

OK, I see, I think I actually agree with what @r0bj suggested. You should strip the UUID off of the task_id and push the metric with a tag called "task_name".

From what I can tell you should also be removing the framework_id tag? What is the usefulness of that UUID as a tag?

@harnash
Contributor Author

harnash commented Sep 6, 2016

Right now I'm not setting executor_id as a tag, just as a plain field. I'm digging through the Mesos documentation to make sure that if we want to group series by executor_id, it will work for all frameworks launched on Mesos... From my understanding, generating task_id/executor_id is done at the framework level, so assuming we always get it in the format some_task_name.UUID might be wrong.

@harnash
Contributor Author

harnash commented Sep 6, 2016

This seems related and can save us some trouble: d2iq-archive/marathon#1242

@harnash
Contributor Author

harnash commented Sep 6, 2016

@sparrc @tuier After looking at the documentation and other resources I'm inclined to leave the current solution as it is: executor_id as a plain field.
In a complex setup we can have multiple frameworks running on one Mesos cluster, and each framework can assign task IDs and executor IDs to its own liking, so we cannot assume the format of this field and process it properly.
One solution would be to create a regex substitution formula which would normalize this field, but then we would probably need a separate formula for each framework type/version, and we would somehow have to get the list of running frameworks and match their names (which are arbitrary) with formulas in Telegraf's config.
So including this logic in Telegraf seems like a bad decision; better to leave the processing to the users via Telegraf's exec plugin, a Kapacitor UDF, or another third-party tool.

@sparrc
Contributor

sparrc commented Sep 6, 2016

is there really no way to get the task name from mesos without the UUID?

@sparrc
Contributor

sparrc commented Sep 6, 2016

@harnash @tuier @r0bj if anyone could, it would be helpful to come up with a metric schema that doesn't include any UUIDs but is still capable of usefully separating metrics.

If the metrics can't be usefully separated without the UUID, then I think we should consider whether those metrics should even be collected at all.

@harnash
Contributor Author

harnash commented Sep 6, 2016

Yeah. We are still brainstorming a solution. The problem is that executor_id is generated by the framework. Either way, we would probably need to transform those executor_ids into something that can identify the application/task being executed by the framework.

@harnash
Contributor Author

harnash commented Sep 6, 2016

Alright. There is a framework-agnostic way to do it. We should call the /master/tasks API endpoint to check what tasks are running on a given cluster. We should get something like this:

{
    "frameworks": [
        {
            "id": "20150323-183301-505808906-5050-14434-0001",
            "name": "marathon",
            "pid": "[email protected]:53051",
            "used_resources": {
                "disk": 139264.0,
                "mem": 170346.0,
                "gpus": 0.0,
                "cpus": 66.3,
                "ports": "[31026-31027, 31035-31036, 31059-31060, 31078-31078, 31096-31096, 31118-31120, 31134-31135, 31139-31139, 31167-31168, 31179-31180, 31189-31190, 31208-31209, 31215-31215, 31230-31231, 31258-31259, 31291-31291, 31296-31297, 31309-31310, 31312-31313, 31320-31320, 31323-31324, 31344-31345, 31401-31401, 31420-31420, 31436-31437, 31439-31440, 31456-31457, 31477-31478, 31490-31491, 31547-31547, 31558-31558, 31579-31579, 31594-31595, 31607-31608, 31614-31617, 31638-31639, 31664-31664, 31705-31706, 31709-31709, 31714-31715, 31720-31721, 31732-31733, 31745-31746, 31748-31751, 31765-31766, 31797-31798, 31813-31813, 31821-31821, 31835-31836, 31864-31865, 31869-31870, 31886-31887, 31899-31900, 31936-31938, 31950-31951, 31972-31973, 31993-31994, 31996-31997]"
            },
            "offered_resources": {
                "disk": 0.0,
                "mem": 0.0,
                "gpus": 0.0,
                "cpus": 0.0
            },
            "capabilities": [],
            "hostname": "mesos-s2",
            "webui_url": "http://mesos-s2:8080",
            "active": true,
            "user": "marathon",
            "failover_timeout": 604800.0,
            "checkpoint": true,
            "role": "*",
            "registered_time": 1471960678.14146,
            "unregistered_time": 0.0,
            "resources": {
                "disk": 139264.0,
                "mem": 170346.0,
                "gpus": 0.0,
                "cpus": 66.3,
                "ports": "[31026-31027, 31035-31036, 31059-31060, 31078-31078, 31096-31096, 31118-31120, 31134-31135, 31139-31139, 31167-31168, 31179-31180, 31189-31190, 31208-31209, 31215-31215, 31230-31231, 31258-31259, 31291-31291, 31296-31297, 31309-31310, 31312-31313, 31320-31320, 31323-31324, 31344-31345, 31401-31401, 31420-31420, 31436-31437, 31439-31440, 31456-31457, 31477-31478, 31490-31491, 31547-31547, 31558-31558, 31579-31579, 31594-31595, 31607-31608, 31614-31617, 31638-31639, 31664-31664, 31705-31706, 31709-31709, 31714-31715, 31720-31721, 31732-31733, 31745-31746, 31748-31751, 31765-31766, 31797-31798, 31813-31813, 31821-31821, 31835-31836, 31864-31865, 31869-31870, 31886-31887, 31899-31900, 31936-31938, 31950-31951, 31972-31973, 31993-31994, 31996-31997]"
            },
            "tasks": [
                {
                    "id": "hello_world.a9897537-7437-11e6-8492-56847afe9799",
                    "name": "hello_world.production",
                    "framework_id": "20150323-183301-505808906-5050-14434-0001",
                    "executor_id": "",
                    "slave_id": "78791579-8172-4509-8160-a5c76e0540af-S7",
                    "state": "TASK_RUNNING",
                    "resources": {
                        "disk": 0.0,
                        "mem": 550.0,
                        "gpus": 0.0,
                        "cpus": 0.2
                    },
                    "statuses": [
                        {
                            "state": "TASK_RUNNING",
                            "timestamp": 1473169310.86516,
                            "container_status": {
                                "network_infos": [
                                    {
                                        "ip_addresses": [
                                            {
                                                "ip_address": "10.8.40.30"
                                            }
                                        ]
                                    }
                                ]
                            }
                        }
                    ],
                    "discovery": {
                        "visibility": "FRAMEWORK",
                        "name": "hello_world.production",
                        "ports": {}
                    }
                },
            ]
        }    
    ]
}

Now from this we can map task_name to executor_id (which in my case is the same as tasks.id in the JSON above). The tricky part is that each Telegraf instance which gathers metrics from slaves needs to know which master is currently elected.
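
As a rough sketch of that mapping step (hypothetical helper, parsing only the id/name fields shown in the response above; the master URL and error handling are illustrative):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// masterTasks mirrors just the parts of the response we need.
type masterTasks struct {
	Frameworks []struct {
		Tasks []struct {
			ID   string `json:"id"`
			Name string `json:"name"`
		} `json:"tasks"`
	} `json:"frameworks"`
}

// taskNamesByID fetches the task list from the elected master and returns a
// map from task id (== executor_id in this discussion) to task name.
func taskNamesByID(masterURL string) (map[string]string, error) {
	resp, err := http.Get(masterURL + "/master/tasks")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var mt masterTasks
	if err := json.NewDecoder(resp.Body).Decode(&mt); err != nil {
		return nil, err
	}

	names := make(map[string]string)
	for _, fw := range mt.Frameworks {
		for _, t := range fw.Tasks {
			names[t.ID] = t.Name
		}
	}
	return names, nil
}

func main() {
	names, err := taskNamesByID("http://mesos-master:5050") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(names)
}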

@sparrc
Contributor

sparrc commented Sep 6, 2016

"tasks": [
                {
                    "id": "hello_world.a9897537-7437-11e6-8492-56847afe9799",
                    "name": "hello_world.production",

why not just use tasks[i][name]?

@harnash
Contributor Author

harnash commented Sep 6, 2016

Yup. We should find the executor_id in the current metrics, match it with id from the JSON above, and use name as a tag for the metric.
We can detect the leader using information we already pull from the master metrics, but it means the Telegraf config should always have the full list of master nodes, so we can look for the leader among them and pass that information to the slave collector.
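
A hypothetical sketch of that leader-detection step, assuming each configured master exposes a numeric master/elected gauge in its /metrics/snapshot output (the same endpoint the master collector already reads); host names and error handling are illustrative:

package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"log"
	"net/http"
)

// findLeader asks every configured master for its metrics snapshot and
// returns the one reporting master/elected == 1.
func findLeader(masters []string) (string, error) {
	for _, m := range masters {
		resp, err := http.Get(m + "/metrics/snapshot")
		if err != nil {
			continue // unreachable master, try the next one
		}
		var snapshot map[string]float64 // assumption: all values are numeric
		err = json.NewDecoder(resp.Body).Decode(&snapshot)
		resp.Body.Close()
		if err != nil {
			continue
		}
		if snapshot["master/elected"] == 1 {
			return m, nil
		}
	}
	return "", errors.New("no elected master found among configured masters")
}

func main() {
	leader, err := findLeader([]string{"http://mesos-m1:5050", "http://mesos-m2:5050"}) // placeholder hosts
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("elected master:", leader)
}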

@harnash
Contributor Author

harnash commented Sep 7, 2016

@sparrc I have one issue with this solution: when we have, say, 4 instances, then all their metrics will be sent as one data series, which will be problematic for representing this data or doing sensible alerting on it. One solution would be to enumerate tasks per task_name and assign them unique tags like task_name_1, task_name_2, etc. There would be at most twice as many tags per task as the number of instances running (depending on the upgrade policy), and a given tag won't always point to the same instance.

@sparrc
Contributor

sparrc commented Sep 7, 2016

@harnash please try to spell it out a bit more simply; I truly have no experience with Mesos. Drawing some diagrams or condensing what you're saying into bullet points would help, I think.

If we can't find a solution soon, I'm going to have to remove the mesos plugin from Telegraf. I don't like having plugins within Telegraf that send badly formed data, as I'm sure many users have already shot themselves in the foot with this plugin.

A short-term solution might be to remove the per-task metrics for the time being. We can also revisit per-task metrics when InfluxDB has fixed influxdata/influxdb#7151

@harnash
Contributor Author

harnash commented Sep 7, 2016

@sparrc Alright. So the situation is as follows:

In Mesos there is a task named hello_world which has three instances running:

  • hello_world.a927cc04-1630-400a-8b95-a39fe192ce8e
  • hello_world.06d66c19-4a89-482a-985e-cfe6b1d23b5c
  • hello_world.d0a6a0b2-254f-47fd-b30e-bb4a3923a8ae

Telegraf will gather metrics for those three instances. To simplify, I'll use cpu_usage as an example.
These are the metric snapshots at a given point in time:

  • executor_id=hello_world.a927cc04-1630-400a-8b95-a39fe192ce8e cpu_usage=0.3
  • executor_id=hello_world.06d66c19-4a89-482a-985e-cfe6b1d23b5c cpu_usage=0.1
  • executor_id=hello_world.d0a6a0b2-254f-47fd-b30e-bb4a3923a8ae cpu_usage=0.9

Now if we send this data to InfluxDB with a tag specifying only task_name, we would get something like this:

  • task_name=hello_world cpu_usage=0.3
  • task_name=hello_world cpu_usage=0.1
  • task_name=hello_world cpu_usage=0.9

and it will be hard to do statistical analysis based on the tag value since we would have data from three instances, I think. I'm not sure how InfluxDB would handle this data (there might be a slight time offset between those data points, but that is a guess right now).

The idea is to map those executor_ids to a limited set of tags in such a way that we would get metrics like this:

  • task_name=hello_world_1 cpu_usage=0.3
  • task_name=hello_world_2 cpu_usage=0.1
  • task_name=hello_world_3 cpu_usage=0.9

Of course you would also get executor_id as a normal field if you want to pinpoint the exact instance, but on metrics and graphs it would show as the n-th instance of the given task, and the order of those instances is determined by their executor_ids (they can move around when an instance in the middle gets killed).

I hope that clears the picture a bit.

As for removing functionality, I would like to keep the master/slave stats and just disable the task metrics, which are the problematic part.
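
A sketch of the enumeration idea described earlier in this comment (hypothetical helper, not part of this PR): group executor IDs by their normalized task name, sort them so the numbering is deterministic within one scrape, and tag each instance as task_name_<n>:

package main

import (
	"fmt"
	"sort"
	"strings"
)

// instanceTags maps each executor_id to an enumerated tag value such as
// "hello_world_1". The normalize function is expected to strip the UUID
// suffix (see the earlier normalizeExecutorID sketch).
func instanceTags(executorIDs []string, normalize func(string) string) map[string]string {
	byTask := make(map[string][]string)
	for _, id := range executorIDs {
		name := normalize(id)
		byTask[name] = append(byTask[name], id)
	}

	tags := make(map[string]string)
	for name, ids := range byTask {
		sort.Strings(ids) // order by executor_id, as described above
		for i, id := range ids {
			tags[id] = fmt.Sprintf("%s_%d", name, i+1)
		}
	}
	return tags
}

func main() {
	ids := []string{
		"hello_world.d0a6a0b2-254f-47fd-b30e-bb4a3923a8ae",
		"hello_world.a927cc04-1630-400a-8b95-a39fe192ce8e",
		"hello_world.06d66c19-4a89-482a-985e-cfe6b1d23b5c",
	}
	normalize := func(id string) string { return strings.SplitN(id, ".", 2)[0] } // crude stand-in
	fmt.Println(instanceTags(ids, normalize))
}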

@sparrc
Contributor

sparrc commented Sep 7, 2016

got it, crystal clear, thanks @harnash.

My recommendation for now would be to remove task metrics, but leave the code as stubs as-is & in-place (still collecting task UUIDs).

Once influxdata/influxdb#7151 has been implemented, we could provide a configuration option to gather task metrics, but with very clear warnings that collecting per-task metrics would lead to infinitely-increasing cardinality, and recommend that this data be kept in a very short-term retention-policy.

…h them.

Due to the quite real problem of generating a vast number of data series through
mesos tasks metrics, this feature is disabled until a better solution is found.
@harnash changed the title from "Fix tags on mesos_tasks metrics." to "Removing mesos_tasks metrics." Sep 23, 2016
@harnash
Contributor Author

harnash commented Sep 23, 2016

@sparrc Alright. As decided I have removed mesos-tasks metrics from the plugin but I've left most of the code commented out.

@sparrc added this to the 1.1.0 milestone Sep 27, 2016
@sparrc
Contributor

sparrc commented Sep 27, 2016

thanks @harnash, you can update the changelog now and I'll merge this change

@sparrc merged commit 32268fb into influxdata:master Sep 28, 2016
@harnash deleted the mesos_tag_fixes branch October 4, 2016