Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added input plugin for VMware vSphere. #4141

Merged
merged 118 commits into from
Sep 11, 2018
Merged

Added input plugin for VMware vSphere. #4141

merged 118 commits into from
Sep 11, 2018

Conversation

prydin
Copy link
Contributor

@prydin prydin commented May 12, 2018

closes #1420

Added full support for vSphere monitoring. Supports vms, hosts, clusters and datastores. Allows filtering of metrics and resources. Written by Pontus Rydin, VMware and Pierre Tessier, VMware.

Required for all PRs:

  • [ X ] Signed CLA.
  • [ X ] Associated README.md updated.
  • [ X ] Has appropriate unit tests.

prydin and others added 30 commits September 30, 2017 16:07
Fixed bug with missing newlines in Wavefront plugin.
changed Interval properties to durations
@danielnelson
Copy link
Contributor

Can you add some example output to the README? https://github.com/influxdata/telegraf/pull/4141/files#r213154334

@prydin
Copy link
Contributor Author

prydin commented Aug 30, 2018

Added sample output.

Copy link
Contributor

@danielnelson danielnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried out the simulator, I think it will work well for integration testing for now. I did run into a couple issues:

I put in a bogus address https://localhost:1234/sdk, but there was nothing printed. We should return an error either from gather and start if the connection cannot be established.

I fixed the address but still didn't see any error, simulator logged remote error: tls: bad certificate. This should also display an error if possible., after setting insecure_skip_verify everything is working great.

plugins/inputs/vsphere/client.go Show resolved Hide resolved
vsphere_host_disk,disk=/var/folders/rf/txwdm4pj409f70wnkdlp7sz80000gq/T/govcsim-DC0-LocalDS_0-367088371@folder-5,esxhostname=DC0_H0,host=host.example.com,moid=host-19,os=Mac,source=DC0_H0,vcenter=localhost:8989 write_average=2635i,read_average=30i 1535660339000000000
vsphere_host_mem,esxhostname=DC0_H0,host=host.example.com,moid=host-19,os=Mac,source=DC0_H0,vcenter=localhost:8989 usage_average=98.5 1535660339000000000
vsphere_host_net,esxhostname=DC0_H0,host=host.example.com,moid=host-19,os=Mac,source=DC0_H0,vcenter=localhost:8989 usage_average=1887i,bytesRx_average=662i,bytesTx_average=251i 1535660339000000000
vsphere_host_net,esxhostname=DC0_H0,host=host.example.com,interface=vmnic0,moid=host-19,os=Mac,source=DC0_H0,vcenter=localhost:8989 usage_average=1481i,bytesTx_average=899i,bytesRx_average=992i 1535660339000000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two lines above, L335:336, is one of them meant to be an aggregate metric of all interfaces on the host? If so it would be ideal if we could, at least by default, collect only the total usage or the per interface metrics.

Copy link
Contributor Author

@prydin prydin Sep 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what the [host|vm|cluster|datastore]_instances flag does. If you set it to false, you'll get the behavior you're asking for. By default, they're set to "true", except for datastores. This reflects some kind of "best practice" of what people usually want to collect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should add support for collecting only the instance metrics? If you collect instance metrics then there is less reason to store the totals since it can be computed at query time. Perhaps the instance variables could be something like: `vm_metric_mode = "instance|aggregate|all".

One thing that is a moderate issue with metrics tagged like this is selecting only the instance without the totals. For example say we have these metrics:

vsphere_host_net,interface=vmnic0 bytesTx_average=1i
vsphere_host_net,interface=vmnic1 bytesTx_average=3i
vsphere_host_net bytesTx_average=2i

To average only the instance versions you need to know the secret method to remove series containing the total aggregation:

select mean(bytesTx_average) from vsphere_host_net where interface =~ /./

What we usually do is try to rename the field by appending the aggregation type, the new field name ensures the values do not get aggregated together:

vsphere_host_net,interface=vmnic0 bytesTx_average=1i
vsphere_host_net,interface=vmnic1 bytesTx_average=3i
vsphere_host_net bytesTx_average_average=2i

So that would be very ugly, maybe something like this would be better?:

vsphere_host_net,interface=vmnic0 bytesTx_average=1i
vsphere_host_net,interface=vmnic1 bytesTx_average=3i
vsphere_host_net total_bytesTx_average=2i

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point that it's hard to search to things that are missing a certain point tag. The problem is that it's tricky to reliably determine whether a metric in vCenter will ever show up with an instance metric, so you'd have to add the "total" or "average" to anything that's not a metric on an instance. You never know if that same metric is going to show up in the future with an instance. @puckpuck what do you think?

Copy link
Contributor Author

@prydin prydin Sep 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. cpu.blablabla doesn't have an instance metric in vSphere version N, so all metrics are reported as just cpu.blablabla. In version N+1, the metric is given an instance and all of a sudden cpu.blablabla becomes cpu.blablabla.total. I really don't like that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would be a downside, you don't want to switch your dashboards because you change this minor setting. It seems to me that the best solution would be to collect only one type or the other, at least by default. Don't have to worry about excluding data from the query if you don't have it, and I think having both forms is redundant anyway.

Copy link
Contributor Author

@prydin prydin Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best solution is to always emit the instance tag and send it as instance-total for the aggregate metrics (i.e. metrics without an instance tag).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be similar to the way e.g. input.cpu does it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Commit coming shortly.

vsphere_host_cpu,cpu=1,esxhostname=DC0_H0,host=host.example.com,moid=host-19,os=Mac,source=DC0_H0,vcenter=localhost:8989 coreUtilization_average=25.92,utilization_average=18.72,used_summation=39790i,usage_average=40.42,idle_summation=69457i 1535660339000000000
vsphere_host_net,clustername=DC0_C0,esxhostname=DC0_C0_H0,host=host.example.com,interface=vmnic0,moid=host-30,os=Mac,source=DC0_C0_H0,vcenter=localhost:8989 usage_average=1246i,bytesTx_average=673i,bytesRx_average=781i 1535660339000000000
vsphere_host_cpu,clustername=DC0_C0,esxhostname=DC0_C0_H0,host=host.example.com,moid=host-30,os=Mac,source=DC0_C0_H0,vcenter=localhost:8989 coreUtilization_average=33.8,idle_summation=77121i,ready_summation=15857i,readiness_average=0.39,used_summation=29554i,costop_summation=2i,wait_summation=4338417i,utilization_average=17.87,latency_average=0.44,usage_average=28.78 1535660339000000000
vsphere_host_cpu,clustername=DC0_C0,cpu=0,esxhostname=DC0_C0_H0,host=host.example.com,moid=host-30,os=Mac,source=DC0_C0_H0,vcenter=localhost:8989 idle_summation=86610i,coreUtilization_average=34.36,utilization_average=19.03,used_summation=28766i,usage_average=23.72 1535660339000000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe the relationship between cluster, esxhostname, source, moid, and vcenter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A vCenter has clusters, clusters have esxhosts (the physical machines) identified by the esxhostname, hosts have vms. The source field is the name of whatever object is collected (vm, host, cluster or datastore). A moid is the unique internal id of any object in a vCenter. It is sometimes useful for uniquely identifying an object when e.g. it has been renamed.

Do you think we need this in the README? My assumption was that most people using thisa plugin would already be familiar with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(MOID stands for Managed Object ID and is a well-known property of vSphere resources)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we need this in the README?

No probably not, I agree that anyone monitoring will be more up to speed than I am.

## Sample output

```
vsphere_vm_cpu,esxhostname=DC0_H0,guest=other,host=host.example.com,moid=vm-35,os=Mac,source=DC0_H0_VM0,vcenter=localhost:8989,vmname=DC0_H0_VM0 run_summation=2608i,ready_summation=129i,usage_average=5.01,used_summation=2134i,demand_average=326i 1535660299000000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a couple collections, lets remove it down to just one interval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Will check on that.

"virtualDisk.totalReadLatency.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.write.average",
"virtualDisk.writeOIO.latest",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the metrics in the default config are not documented in METRICS.md

plugins/inputs/vsphere/README.MD Show resolved Hide resolved
# metrics_per_query = 256

## number of go routines to use for collection and discovery of objects and metrics
# collect_concurrency = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see collect_concurrency being used, do we still need it?


## number of go routines to use for collection and discovery of objects and metrics
# collect_concurrency = 1
# discover_concurrency = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resource that is scarce is connections, so I propose we rename this max_discover_connections.

BTW, did you see they added an option for this to the transport in Go 1.11 (MaxConnsPerHost)? Should be pretty helpful, but we still target Go 1.9 so we can't use it yet.


For a detailed list of commonly available metrics, please refer to [METRICS.MD](METRICS.MD)

## Tags
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will probably merge the Tags section with Measurements & Fields when I merge the PR, unless you do it first :)

Check EXAMPLE_README.md for the latest style.

@danielnelson danielnelson mentioned this pull request Sep 7, 2018
3 tasks
@danielnelson danielnelson added this to the 1.8.0 milestone Sep 7, 2018
@danielnelson
Copy link
Contributor

@prydin We are hoping to release 1.8.0-rc1 on Wednesday afternoon, can you ping me once any final tweaks are in?

@randallt
Copy link

@prydin Please confirm that you aren't outputting any empty tag values. On the old version I'm using, I just found that a lot of my points were being blocked in wavefront due to an empty cluster tag, e.g. cluster="".

@prydin
Copy link
Contributor Author

prydin commented Sep 11, 2018

I will commit my final changes today. A question for @danielnelson : My fork has a ton of commits. Do you squash them as you pull them in or you want me to do anything?

@prydin
Copy link
Contributor Author

prydin commented Sep 11, 2018

@randallt Thanks for the note. I went through the code and fixed an issue related to what you mentioned. It was possible for the plugin to send empty cluster tags under certain circumstances. Not anymore.

@prydin
Copy link
Contributor Author

prydin commented Sep 11, 2018

@danielnelson please hold off on this for a few hours. Need to check something.

@prydin prydin closed this Sep 11, 2018
@prydin
Copy link
Contributor Author

prydin commented Sep 11, 2018

Sorry about that. We are back in business.

@prydin prydin reopened this Sep 11, 2018
@danielnelson danielnelson merged commit 5f3c331 into influxdata:master Sep 11, 2018
@danielnelson
Copy link
Contributor

Merged! 🍻

rgitzel pushed a commit to rgitzel/telegraf that referenced this pull request Oct 17, 2018
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
dupondje pushed a commit to dupondje/telegraf that referenced this pull request Apr 22, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create Telegraf Collector for vSphere Object Metric Collection
7 participants