I'm using snmpcollector to poll about 5k devices. The issue I'm having is that when I reload the config, I get holes in the metric dashboards in Grafana. That is because I have a lot of devices (I know), so I'm now playing with timeout and retries, but without encouraging results:
| Hosts | Measurements | Timeout | Retries | Time |
| --- | --- | --- | --- | --- |
| 5027 | 5 | 5 | 2 | 0:05:54 |
| 5027 | 5 | 5 | 2 | 0:03:34 |
| 5027 | 5 | 5 | 2 | 0:03:51 |
| 5027 | 5 | 5 | 2 | 0:05:31 |
| 5027 | 5 | 5 | 2 | 0:06:09 |
| 5027 | 5 | 5 | 2 | 0:05:50 |
| 5027 | 5 | 3 | 1 | 0:07:19 |
| 5027 | 5 | 3 | 1 | 0:07:12 |
| 5027 | 5 | 3 | 1 | 0:06:00 |
| 5027 | 5 | 3 | 1 | 0:05:43 |
| 5027 | 5 | 3 | 1 | 0:07:29 |
Introduction:
So... I found that playing with timeout or retries isn't an option (at least for me). That said, I need to find a solution, and that is my motivation for opening this issue. I'm not looking for magical code from you (the devs of this project); that won't happen.
I'm looking for ideas and pointers about dead ends to avoid. Also, I'm new to Go, so I may assume things that can't be done, or the other way around, miss things that can be done.
Objective: reduce the SNMP Collector reload time as much as possible.
Scenario:
In my test scenario, I have a snmpcollector instance and an SNMP daemon, both running in k8s pods (snmp-collector and snmp-src).
Currently, I load all configuration into a MySQL database, and snmpcollector reads its config from there.
One thing to keep in mind is that all SNMP devices point to the same URL (the same k8s service), which may be a flaw. I suppose this could be improved, but it isn't a blocking issue.
Ideas:
I have three ways in mind to approach this:
1- Modify the code to add a channel (or something like the Observer pattern) where each device asks whether it may gather before starting. I think the channel could be buffered, so I could call len(buffChannel) and only start gathering when it is 0.
(Maybe I'll extend this idea when I research a bit more.)
2- (Ugly solution): Don't wait on the WaitGroups in the End() method of agent.go:
```go
func End() (time.Duration, error) {
	start := time.Now()
	log.Infof("END: begin device Gather processes stop... at %s", start.String())
	// stop all device processes
	DeviceProcessStop() // 5000 devices = [36, 86] seconds.
	log.Info("END: begin selfmon Gather processes stop...")
	// stop the selfmon process
	selfmonProc.StopGather()
	log.Info("END: waiting for all Gather goroutines stop...")
	// wait until Done
	gatherWg.Wait()
	log.Info("END: releasing Device Resources")
	ReleaseDevices()
	log.Info("END: releasing Selfmonitoring Resources")
	selfmonProc.End()
	log.Info("END: begin sender processes stop...")
	//log.Info("DEBUG Gather WAIT %+v", GatherWg)
	//log.Info("DEBUG SENDER WAIT %+v", senderWg)
	// stop all Output Emitter
	StopInfluxOut(influxdb)
	log.Info("END: waiting for all Sender goroutines stop..")
	senderWg.Wait() // 3.45 minutes <------------ Comment
	log.Info("END: releasing Sender Resources")
	ReleaseInfluxOut(influxdb)
	log.Infof("END: Finished from %s to %s [Duration : %s]", start.String(), time.Now().String(), time.Since(start).String())
	return time.Since(start), nil
}
```
The times noted in the code are approximate; sometimes it takes more, sometimes less.
The solution here could be to comment out Wait(), reset the senderWg variable, and call the GC. I don't know what the downsides of doing that would be (that's why I'm here asking you folks 😄).
3- (Another ugly solution, but not as ugly as the previous one): Modify the Dockerfile to add Supervisord, and when I need to reload, kill the process and bring it up again. This solution is inspired by wurmrobert's comment:

> I am using your collector for about 550 devices.
> Everything works well. But is it normal that the reload config process sometimes is really slow? Most of the time it takes more than one minute. I run your collector in a docker container. When I restart the container it is much faster and the gather process also picks up the latest config. [...]
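For idea 3, the Supervisord side could be as small as one program block; the binary path, flags, and timings below are placeholders I made up, not the real image layout:

```ini
[program:snmpcollector]
; restart the collector whenever it exits, so "reload" becomes "kill + respawn"
command=/opt/snmpcollector/bin/snmpcollector
autorestart=true
stopsignal=TERM
; give in-flight gather/sender goroutines a few seconds before SIGKILL
stopwaitsecs=10
```

A reload would then be `supervisorctl restart snmpcollector`, trading the slow graceful End() for a hard restart, with the same risk of dropped in-flight points.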
4- Federate multiple instances of SNMP Collector?
To know:
What are the issues if I don't wait for the gather round to end?
I can imagine an inconsistency in InfluxDB like this:
| device | t1 | t2 | t3 (reload config) | t4 |
| --- | --- | --- | --- | --- |
| device1 | 201 | 199 | - | - |
| device2 | 156 | 157 | 159 | - |
| device3 | 170 | 178 | - | - |
Where the t(s) are points in time: when reload config is called, some devices can finish gathering (filling the Influx point) while others cannot, leaving empty points.
I will expand this issue as I get more information from researching and learning.
Thanks for reading this far,
and thanks in advance for anything you can contribute.
Greetings
PS: English is not my first language; sorry if I got anything wrong.
@JuSacco sorry for the big delay in the response. You have done a really good review of this question; we will review your ideas and answer you ASAP.
About language: I've seen you are from Argentina, so if you need Spanish support, email me at my personal address :-)
Context:
I'm using 0.8.0 in production and I'm working on 0.9.0 (dev stage)
Docker
Modify devices, measurements, and metrics, then reload config
Expected: a config reload as fast as possible
Actual: holes of a few minutes in Grafana dashboards
I came here from here