XML document element count exceeds configured maximum 500000 #65
Comments
Looks like this is a limitation from vCenter. I would think a solution is to limit the number of managed objects that properties are collected from. |
Could you elaborate? I've tried minimizing the number of metrics collected to the bare minimum we want to measure; unfortunately it's not enough to make the XML reply sufficiently small. Since we have a steady growth of VM's, it would just be a matter of time until we hit the limit again though. Would it be possible to divide the queries into batches? |
right @pypb, that is what i had in mind: limiting the amount of managed objects (vm/host) by batching them (i.e. per 4000). |
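A minimal sketch of the batching idea, assuming the managed object references are available as a plain slice. The function name `splitBatches` and the `vm-N` placeholders are illustrative and not taken from the vsphere-graphite code.

```go
package main

import "fmt"

// splitBatches cuts refs into consecutive slices of at most size elements.
// Hypothetical helper: the name and signature are not from vsphere-graphite.
func splitBatches(refs []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	batches := make([][]string, 0, (len(refs)+size-1)/size)
	for start := 0; start < len(refs); start += size {
		end := start + size
		if end > len(refs) {
			end = len(refs)
		}
		batches = append(batches, refs[start:end])
	}
	return batches
}

func main() {
	// Placeholder object references standing in for VMs and hosts.
	refs := make([]string, 5550)
	for i := range refs {
		refs[i] = fmt.Sprintf("vm-%d", i)
	}
	for i, b := range splitBatches(refs, 4000) {
		fmt.Printf("batch %d: %d objects\n", i, len(b))
	}
}
```

A fixed batch size is the simplest choice; it does not account for how many metrics each object returns.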
@pypb what does your vCenter topology look like? I've managed to work around this by appending only the desired vSphere datacenters and ignoring the rest: vsphere-graphite/vsphere/vsphere.go Lines 198 to 200 in d880905
Similarly, I'm only appending points where the cluster strings match my specified desired clusters. vsphere-graphite/vsphere/vsphere.go Line 485 in d880905
This also helped improve sampling performance and reduce long term storage capacity. @cblomart, perhaps a 'filters' section could be written into the vsphere-graphite.json for datacenters and clusters? |
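The filtering workaround could look roughly like the sketch below. It assumes datacenter or cluster names are plain strings and that an empty filter means "keep everything"; `keepWanted` is a hypothetical helper, not the code at the referenced lines of vsphere.go.

```go
package main

import "fmt"

// keepWanted returns the names that also appear in wanted.
// An empty wanted list keeps every name (assumed behaviour for this sketch).
func keepWanted(names, wanted []string) []string {
	if len(wanted) == 0 {
		return names
	}
	allowed := make(map[string]bool, len(wanted))
	for _, w := range wanted {
		allowed[w] = true
	}
	kept := make([]string, 0, len(names))
	for _, n := range names {
		if allowed[n] {
			kept = append(kept, n)
		}
	}
	return kept
}

func main() {
	// Example names are made up for illustration.
	datacenters := []string{"dc-east", "dc-west", "dc-lab"}
	fmt.Println(keepWanted(datacenters, []string{"dc-east", "dc-west"}))
}
```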
I wanted to think about how to overcome the limit. |
I worked on a branch to overcome this limit and tested it successfully using a docker image. @MnrGreg and certainly @pypb, could you confirm that it works? One detail is that the number of returned metrics has to be estimated, and it changes depending on whether or not instances are requested. To cope with that I introduced an estimation ratio of 1.5 times the number of requested metrics. We will need to validate whether this is sufficient. |
@cblomart Unfortunately, it's still the same error:
|
Can you provide more extended logs... |
Certainly, here's the full session log:
|
2019/01/11 07:48:01 5550 queries to vcenter vcenter
My interpretation of the issue is that vCenter returns more metrics than requested because you may request points for each instance (vcpu, nic, disks, ...), and I introduced a ratio to take that into account. Based on that ratio, requests are split across the minimum number of threads needed to respect the 500000 limit. So here there are around 5550 objects in vCenter (VMs & hosts) and 171435 different metrics requested. Looks like the ratio is at least 500000/171435 ~ 3. |
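The arithmetic above can be sketched as follows. This is an illustration of the estimate as described in the thread, not the code from the element-limit branch; `threadsNeeded` and `maxElements` are assumed names.

```go
package main

import (
	"fmt"
	"math"
)

// maxElements is the limit quoted in the vCenter error message.
const maxElements = 500000

// threadsNeeded estimates how many separate requests are required so that
// each reply stays under maxElements. ratio is the assumed number of
// returned elements per requested metric.
func threadsNeeded(requested int, ratio float64) int {
	estimated := float64(requested) * ratio
	n := int(math.Ceil(estimated / maxElements))
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	// 171435 requested metrics, as in the log discussed above.
	fmt.Println(threadsNeeded(171435, 1.5)) // 1: estimate stays under the limit
	fmt.Println(threadsNeeded(171435, 3))   // 2: estimate exceeds the limit
}
```

With a ratio of 1.5 the estimate stays below 500000, so a single request is sent and the limit is still hit; a ratio of about 3 splits the work in two.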
A new docker image of the element-limit branch is available at "cblomart/vsphere-graphite:9c1e612" |
Now it's able to fetch metrics, but the output to Graphite contains only metrics for 2003 VM's and no hosts. The vCenter contains 5400 VM's and at this moment 4127 are powered on, plus 123 hosts.
|
The first thing I want to see is why it appears to split into 2775 threads when two should be sufficient. |
Ok, the thread count is most probably due to a wrong variable being used... I corrected it... it still doesn't explain why you don't have everything coming back |
No, just the one. |
If only one thread comes to an end... this might explain why you don't have everything... the WaitGroup should ensure every thread has finished before continuing... |
Ok... I thought I had it right with the sync groups but apparently something was missing... |
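For reference, a minimal sketch of the `sync.WaitGroup` pattern being described, assuming one goroutine per batch. `collectAll` is a hypothetical name and the per-batch work is a stand-in for the real vCenter query.

```go
package main

import (
	"fmt"
	"sync"
)

// collectAll starts one goroutine per batch and returns only after every
// goroutine has called Done. Each goroutine writes to its own slot in
// counts, so no extra locking is needed.
func collectAll(batches [][]string) []int {
	counts := make([]int, len(batches))
	var wg sync.WaitGroup
	for i, batch := range batches {
		wg.Add(1) // register before starting the goroutine
		go func(i int, batch []string) {
			defer wg.Done()
			counts[i] = len(batch) // stand-in for querying vCenter
		}(i, batch)
	}
	wg.Wait() // blocks until all goroutines have finished
	return counts
}

func main() {
	batches := [][]string{{"vm-1", "vm-2"}, {"host-1"}}
	fmt.Println(collectAll(batches))
}
```

Calling `wg.Add(1)` before the `go` statement matters: if it is called inside the goroutine, `wg.Wait()` can return before the work has been registered.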
had to review a few things and make a new build: |
Sorry for the delay. I've tried build 142f905 and it seems to give about the same result, metrics for 2030 VM's returned and no hosts.
|
Thanks... at least it confirms that it splits requests among two threads... I will look at it further |
Just for our good understanding, the "VM vm-* has no host" message simply states that the link between a VM and its host cannot be found. |
Looks like between the two threads metrics of 4060 objects were collected. Does this match your expectation? If I understand you correctly, some VMs are missing from the collection and no metrics for hosts are collected. |
I guess that despite all the changes made to the code in the element-limit branch you are still in this situation? |
Sorry for the delay. If I remember correctly, using build 142f905 the output to Graphite did not contain everything. But I should verify, I'll get back to you. |
OK, I have finally had time to double check my results. I've re-done tests with build 142f905. Our vCenter now has 6095 VM's, most of them powered on. In the Graphite output I get metrics for 2579 VM's and 0 hosts. vsphere-graphite logs "Thread 1 returned 2005 metrics" and "Thread 2 returned 2005 metrics". |
Thanks again for taking time on this, @pypb. There has been work done on other issues (#78, #77, #79) related to Graphite where not all metrics were returned. I also added more logging to show the number of objects requested and returned in each thread (VMs and hosts). This should be present in 063c6f2 |
OK, I've run a test with build 063c6f2. It looks like both threads have the same working set, they work on the exact same amount of objects which is also equal to the total number of VM's I see in the Graphite output (2710 VM's, 0 hosts). In fact, now that I look in the output, all metrics are doubled for each VM. Here's the log:
|
I have added some logging and reviewed "pointer" usage in the definition of batches. |
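The thread does not show the actual fix. As an illustration only, here is one common way that several workers end up with the same working set in Go, together with a safe alternative. The `batch` type and `batchPointers` are hypothetical.

```go
package main

import "fmt"

// batch is a hypothetical container for the objects one worker handles.
type batch struct {
	refs []string
}

// batchPointers returns one pointer per batch by taking the address of the
// slice element. Writing `for _, b := range batches { ... &b ... }` instead
// would, on Go versions before 1.22, yield pointers that all refer to the
// single loop variable, so every worker would see the last batch.
func batchPointers(batches []batch) []*batch {
	ptrs := make([]*batch, 0, len(batches))
	for i := range batches {
		ptrs = append(ptrs, &batches[i])
	}
	return ptrs
}

func main() {
	batches := []batch{
		{refs: []string{"vm-1", "vm-2"}},
		{refs: []string{"host-1"}},
	}
	for i, p := range batchPointers(batches) {
		fmt.Printf("worker %d gets %d objects\n", i, len(p.refs))
	}
}
```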
Now we're talking! I'm getting metrics for 130 hosts and 5344 VM's, which matches the number of connected hosts and powered on VM's.
|
Thanks... good to be back on track... pointers and go func... |
merged #81. |
When collecting metrics from a fairly large vCenter (+5000 VM's) vsphere-graphite fails with the error:
Error: ServerFaultCode: XML document element count exceeds configured maximum 500000
Complete log: