vSphere Input does not collect datastore metrics #4789
Comments
I have the exact same issue here. Environment: vCenter 6.7 and vSphere 6.7 |
Did you let it run for a while? It misses the first collection because background object discovery hasn't finished yet. You can force the collector to wait for the first round of discovery by setting this flag in the config:
Let me know if this solved your problem. |
Hi, I've now applied your trick, and my telegraf.log is now a little more explicit about what's happening. Here it is: 2018-10-04T06:01:47Z D! [input.vsphere]: Start of sample period deemed to be 2018-10-04 05:56:47.096487102 +0000 UTC m=-279.309774333 |
I have extended the interval between collections, and now I have another set of errors: 2018-10-04T06:48:16Z D! [input.vsphere]: Start of sample period deemed to be 2018-10-04 06:43:16.837579961 +0000 UTC m=-178.066128366 |
@rsasportes I don't see any errors in that log. Just debug messages. Am I missing something? |
Oh! Now I see the problem! The query isn't returning any data. Which version of Telegraf are you on? 1.8 has a bug that can cause queries to return 0 objects if the time on the node where Telegraf runs is ahead of vCenter. This is fixed in 1.8.1, which was just released. |
Hi, First, thanks for your help ;-) I've upgraded Telegraf to 1.8.1_1, and now some datastores start to appear in the dashboard. The log says: Latest: 2018-10-05 07:45:17.898618 +0000 UTC, elapsed: 304.955344, resource: datastore. I've extended the data collection interval to 5 minutes, just to check whether there is a timeout. No luck. Do you think it would be possible to launch data collection manually, with verbose logging, and trace what's wrong? Again, thanks! |
Do you see any metrics at all? Do you see a complete set of metrics for some datastores or do you only see sporadic metrics for random datastores? Also, for anything that's missing, can you go to the vCenter UI and make sure you can see those metrics under Monitoring->Performance? |
Hi, Thanks. |
@Muellerflo the includes and excludes act on metric names, not object names. The ability to filter objects will be added soon. See #4790 As for sporadic metrics, have you checked the log if you get any timeouts/collections that take longer than the interval? |
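To illustrate the point about metric names, here is a minimal sketch that reuses the option names and vCenter URL from the config in the original report. The datastore name `my-datastore-01` is hypothetical and is only there to show what does not work.

```toml
[[inputs.vsphere]]
  vcenters = [ "https://vsphere.address.here/sdk" ]

  ## Entries are compared against metric names, so this selects two metrics
  ## on every discovered datastore.
  datastore_metric_include = [
    "datastore.numberReadAveraged.average",
    "disk.capacity.latest",
  ]

  ## An object name is not a metric name, so per the comment above an entry
  ## like this (hypothetical datastore name) would not select that datastore:
  # datastore_metric_include = [ "my-datastore-01" ]
```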
We have the same issue. It appears to happen with larger vCenters: one vCenter with only 5 datastores worked perfectly, but on another with dozens of datastores the data collection failed. Any ETA on a fix? This appears to be affecting many others as well. |
@ion-storm anything in the logs? |
@ion-storm There are several reasons why datastore metrics could be missing. What is your collection interval? Have you tried to declare the plugin separately for the datastores with a longer collection interval? |
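A rough sketch of what declaring the plugin separately could look like, reusing the option names from the config in the original report. The per-plugin `interval` is the standard Telegraf input option; the value of 300s is only an assumption (it matches the 300-second datastore sampling period shown in the reporter's log), and `datastore_metric_exclude` follows the same naming pattern as the other options but is not quoted anywhere in this thread. Credentials are left out for brevity.

```toml
## Instance 1: hosts only, on the agent's default (short) interval.
[[inputs.vsphere]]
  vcenters = [ "https://vsphere.address.here/sdk" ]
  vm_metric_exclude = [ "*" ]
  datastore_metric_exclude = [ "*" ]
  cluster_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]

## Instance 2: datastores only, collected less often.
[[inputs.vsphere]]
  interval = "300s"  # assumed value, tune for your environment
  vcenters = [ "https://vsphere.address.here/sdk" ]
  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
  cluster_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]
```

Splitting it this way keeps the slow datastore queries from holding up host collection on the short interval.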
Hello, We also seem to be experiencing issues with receiving certain data from certain datastores. The data that Telegraf stores in the measurement "vsphere_datastore_datastore" does not appear for certain datastores. The data in the measurement "vsphere_datastore_disk", however, does. Regarding our setup: we have 25 datastores in total, of which 19 are of type "VMFS" and the other 6 are of type "NFS 3". The NFS ones are the ones that don't show up in the "vsphere_datastore_datastore" measurement (in InfluxDB). When running Telegraf I turned on debug logging, and it does discover 25 datastores:
I messed around a bit more in both Telegraf and govmomi and while printing the data that is processed in govmomi's I haven't dug any deeper yet, but hopefully this will help someone along their way. :) If anyone wants me to try things out, I'm in the CEST timezone. |
@ybinnenwegin2ip What's the statistics level on your vCenter? I believe you have to be at least at level 3 for those metrics to be collected. You could also try to check the metrics using the govc tool. Something like this:
If that doesn't return any metrics, you're simply not collecting them on your vCenter and you'd have to increase the statistics level for the 5 minute buckets. |
Thanks for your quick response! I gave it a shot and this is all I get back:
I'll look into the statistics level, thanks for the pointer! EDIT: I noticed your edit, I think you added the Either way, I ran it again, didn't change:
I've also just increased the statistics level from "1" to "2", let's see what happens. :) |
I think you need at least 3. Just verified in my lab. If I drop it lower than 3, the metric disappears. |
Thanks! It's set to 2 now and while I do see some statistics appearing (read_average & write_average) the others are indeed still missing. Perhaps I misunderstood the VMware documentation :)
I guess a networked datastore doesn't count as a disk but rather a 'device' then? Either way, I'll set it to 3 soon and I'll report back in here. Thanks a lot for pointing me in the right direction! |
Happy to help. You're not the only one who's confused about the documentation around this. :) |
Hi, I have exactly the same issue on my vCenter: no datastore metrics are collected. I use Telegraf 1.8.2. Activating the debug option gives no hint as to why no metrics are collected. I see this line, which may be a hint: 2018-10-25T09:51:34Z D! [input.vsphere]: Start of sample period deemed to be 2018-10-25 09:46:34.829154 +0000 UTC How could I debug this better? This is the output of the govc command:
./govc_linux_amd64 metric.sample -n 10 itasz7_01 datastore.numberReadAveraged.average
itasz7_01 - datastore.numberReadAveraged.average num |
@jvigna It looks like Govc isn't returning any metrics either. Have you tried increasing the statistics level for 5 minute samples in vCenter? |
It's strange. I have now added 2 more vCenters (smaller ones) and they work without problems. I don't think it is a setting on the vCenter side. Could it be that there are too many datastores? |
BTW: for now I'm mainly interested in disk info such as: ./govc_linux_amd64 metric.sample itasz7_01 disk.capacity.latest And that seems to work. |
What does your config file look like? Also, please check the stats levels on both vcenters to see if there's a difference. |
Hi, the stats levels are the same on all 3 vCenter servers, and my config is this: [[inputs.vsphere]] |
@jvigna You're collecting all metrics for datastores. That can take a long time. What's your collection interval? I think what's happening is that the collection takes longer than the collection interval. You have two options:
|
I will try this, but just for the record: shouldn't I get some sort of warning if such a timeout is really hit? BTW: my collection interval is already 30s, because with 10s I got that warning. For the same reason, I already run a separate Telegraf instance just for collecting the vSphere metrics. |
Yes, there should be errors in the logfile if this is the issue. Can you try to set datastore_metric_include to just the metric you're interested in, e.g.
Could you please try that and tell me what the result is and paste a logfile if it doesn't work? |
Ok I think that could be a good idea, what do I need for capacity? disk.capacity.latest and disk.used.latest? How may I get a list of the metrics? |
Those two are good candidates. To list all metrics available, use the following govc command:
|
As soon as I'm able to modify the configuration I'll let you know if it works when sending only a few metrics. |
Keep in mind that it may take up to 30 minutes to see any data on storage capacity, since these are only generated at a 30 minute interval by vCenter. |
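For reference, a minimal sketch of an instance narrowed down to the two capacity metrics mentioned above. It assumes the option names and vCenter URL from the config in the original report; credentials and tuning options are omitted, and the excludes simply switch off the other resource types for this instance.

```toml
[[inputs.vsphere]]
  vcenters = [ "https://vsphere.address.here/sdk" ]

  ## Capacity metrics only.
  datastore_metric_include = [
    "disk.capacity.latest",
    "disk.used.latest",
  ]

  ## Skip the other resource types in this instance.
  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
  cluster_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]
```

With a config like this, an empty result right after startup is not necessarily a failure, given the 30 minute generation interval for capacity data.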
Hi @prydin I've been using the plugin for a few days now and have also had this issue. I've tried to create 2 instances like this. [[inputs.vsphere]] I finally got some data showing, but only if I choose 7 days or more. Everything below that shows no content. |
@Compboy100 Which version of the plugin? |
Hi thank you for the quick reply.
Even though I have the interval at 301. |
I think I've tracked this down to minor clock skew between vCenter and the ESXi hosts. Working on a fix. |
I've been working on this over the last few days and addressed multiple issues:
|
Anyone who wants to be a beta tester for the fix? It's available here: https://github.com/prydin/telegraf/releases/tag/prydin-4789 |
Thank you, I will try it. Can I just extract the vsphere plugin folder to my Linux distro, or do I have to compile the whole thing? |
It's binaries. Nothing to compile. |
@prydin Feel free to make a PR too, this will build all the packages on CircleCI and I can add links to the artifacts. Just note in the PR that it's still preliminary. |
@danielnelson Will do. On the go today, but I'll get a PR filed as soon as I get back to home base. |
No luck getting data below 24h yet. will report more tomorrow when back in office. 2018-11-02T13:42:07Z D! [input.vsphere] Discovering resources for datastore |
Ah! That one is easy to fix. I'm assuming you're running an older version of vCenter? Go ahead and set Like this: If that doesn't work, try decreasing it to 20. |
Yes 6.0. Had already put them to 64. |
Can confirm that data is being recorded for below 24h now. |
@prydin Don't forget the PR. There is an RC1 live. Does it include the changes? |
I wanted this to run in my lab for a while first, but I just opened a PR. @glinton and @danielnelson is there still a chance to get this into 1.9? |
@Compboy100 no, RC1 doesn't have the changes. I wanted to make sure it ran OK in my lab first. |
It's a possibility, if not for 1.9.0 then it should be possible to get this in for 1.9.1. Let's focus on getting it added to master first, I'll review today. |
Closed in #4968, @Compboy100 I'm creating a new release candidate this afternoon. |
Relevant telegraf.conf:
[agent]
interval = "10s"
round_interval = true
metric_buffer_limit = 1000
flush_buffer_when_full = true
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
debug = true
quiet = false
logfile = "/Program Files/Telegraf/telegraf.log"
hostname = ""
[[outputs.influxdb]]
urls = ["udp://10.120.1.44:8089"]
[[inputs.vsphere]]
vcenters = [ "https://vsphere.address.here/sdk" ]
username = "vsphereUSer"
password = "SuperSecretVspherePasswordOfGreatness"
datastore_metric_include = [ "*" ]
vm_metric_exclude = [ "*" ]
host_metric_include = [
"cpu.coreUtilization.average",
"cpu.costop.summation",
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.swapwait.summation",
"cpu.usage.average",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.utilization.average",
"cpu.wait.summation",
"mem.active.average",
"mem.latency.average",
"mem.state.latest",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.totalCapacity.average",
"mem.usage.average",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.errorsRx.summation",
"net.errorsTx.summation",
"net.usage.average",
"power.power.average",
"sys.uptime.latest",
]
host_metric_exclude = [] ## Nothing excluded by default
host_instances = true ## true by default
cluster_metric_exclude = [""] ## Nothing excluded by default
cluster_instances = true ## true by default
datacenter_metric_exclude = [ "" ] ## Datacenters are not collected by default.
collect_concurrency = 4
discover_concurrency = 2
timeout = "20s"
insecure_skip_verify = true
System info:
Telegraf 1.8.0
Windows Server 2012 r2
vSphere Appliance 6.5 u1d
Steps to reproduce:
Expected behavior:
Datastore metrics are written to influxdb
Actual behavior:
No datastore metrics are written to influxdb
Additional info:
Logs:
2018-10-02T20:40:04Z D! Attempting connection to output: influxdb
2018-10-02T20:40:04Z D! Successfully connected to output: influxdb
2018-10-02T20:40:04Z I! Starting Telegraf 1.8.0
2018-10-02T20:40:04Z I! Loaded inputs: inputs.vsphere
2018-10-02T20:40:04Z I! Loaded aggregators:
2018-10-02T20:40:04Z I! Loaded processors:
2018-10-02T20:40:04Z I! Loaded outputs: influxdb
2018-10-02T20:40:04Z I! Tags enabled: host=telegraf
2018-10-02T20:40:04Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"telegraf", Flush Interval:10s
2018-10-02T20:40:10Z D! [input.vsphere]: Starting plugin
2018-10-02T20:40:10Z D! [input.vsphere]: Creating client: vsphere.address.here
2018-10-02T20:40:10Z D! [input.vsphere]: Start of sample period deemed to be 2018-10-02 13:35:10.1675805 -0700 PDT m=-292.934584699
2018-10-02T20:40:10Z D! [input.vsphere]: Collecting metrics for 0 objects of type datastore for vsphere.address.here
2018-10-02T20:40:10Z D! [input.vsphere]: Discover new objects for vsphere.address.here
2018-10-02T20:40:10Z D! [input.vsphere] Discovering resources for datacenter
2018-10-02T20:40:10Z D! [input.vsphere]: No parent found for Folder:group-d1 (ascending from Folder:group-d1)
2018-10-02T20:40:10Z D! [input.vsphere] Discovering resources for cluster
2018-10-02T20:40:10Z D! [input.vsphere] Discovering resources for host
2018-10-02T20:40:11Z D! [input.vsphere] Discovering resources for vm
2018-10-02T20:40:11Z D! [input.vsphere] Discovering resources for datastore
2018-10-02T20:40:20Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-10-02T20:40:20Z D! [input.vsphere]: Latest: 2018-10-02 13:40:10.1675805 -0700 PDT m=+7.065415301, elapsed: 14.846599, resource: datastore
2018-10-02T20:40:20Z D! [input.vsphere]: Sampling period for datastore of 300 has not elapsed for vsphere.address.here