[Enhancement] Improve SNMP Collector to work with heavy loads (+5000 devices) #467

Open
JuSacco opened this issue Jan 26, 2021 · 2 comments
Labels: performance (Related to performance bugs, info, improvements)
Milestone: 1.0

Comments

@JuSacco

JuSacco commented Jan 26, 2021

Context:

  • What SnmpCollector version are you using?
    I'm using 0.8.0 in production and I'm working with 0.9.0 (dev stage).
  • What OS are you running snmpcollector on?
    Docker
  • What did you do?
    Modified devices, measurements and metrics, then reloaded the config.
  • What was the expected result?
    A config reload that completes as fast as possible.
  • What happened instead?
    Holes of a few minutes in the Grafana dashboards.

I came here from here

I'm using snmpcollector to poll about 5k devices. The issue I'm hitting is that when I reload the config, I get holes in the Grafana metric dashboards. That is because I have a lot of devices (I know), so I've been playing with the timeout and retries, but without encouraging results:

Hosts   Measurements   Timeout   Retries   Time (h:mm:ss)
5027    5              5         2         0:05:54
5027    5              5         2         0:03:34
5027    5              5         2         0:03:51
5027    5              5         2         0:05:31
5027    5              5         2         0:06:09
5027    5              5         2         0:05:50
5027    5              3         1         0:07:19
5027    5              3         1         0:07:12
5027    5              3         1         0:06:00
5027    5              3         1         0:05:43
5027    5              3         1         0:07:29

Introduction:
So... I found that playing with the timeout or retries isn't an option (at least for me). OK, that said, I need to find a solution, and that is my motivation for opening this issue. I'm not looking for magical code from you (the devs of this project); that is not going to happen.
I'm looking for ideas and pointers about which directions not to go in. Also, I'm new to Go, so it's possible I assume things that can't be done, or the other way around, that things can be done that I don't know about.

Objective: Reduce the time it takes to reload SNMP Collector as much as possible.

Scenario:
In my test scenario, I have a snmpcollector instance and an SNMP daemon, both running as k8s pods (snmp-collector and snmp-src).
Currently, I load all the configuration into a MySQL DB and snmpcollector takes its config from there.
One thing to keep in mind is that all the SNMP devices point to the same URL (the same k8s service), which may be a flaw; I suppose this is improvable, but it isn't a blocking issue.

Ideas:
I have a few ways in mind to approach this:
1- Modify the code to add a channel (or something like the Observer pattern) where each device asks whether it can run the gather process before starting to gather. The channel could be buffered, so I could call len(buffChannel) and only start gathering when it is 0. A minimal sketch of the idea is shown right below.
(Maybe I'll extend this idea when I research a bit more)
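
The following is not snmpcollector code, just a minimal self-contained sketch of what such a gate could look like, with made-up names (gatherGate, gatherDevice) and an arbitrary limit of 100 concurrent gather rounds:

package main

import (
	"fmt"
	"sync"
	"time"
)

// gatherGate is a hypothetical buffered channel acting as a counting semaphore:
// a device must take a slot before starting a gather round and gives it back
// when the round ends. During a config reload the agent could simply stop
// granting slots, so only the rounds already in flight have to be waited for.
var gatherGate = make(chan struct{}, 100) // assumption: at most 100 concurrent gathers

func gatherDevice(id int, wg *sync.WaitGroup) {
	defer wg.Done()

	gatherGate <- struct{}{}        // acquire a slot (blocks while the gate is full)
	defer func() { <-gatherGate }() // release the slot when this round ends

	// ...the real SNMP polling for this device would happen here...
	time.Sleep(10 * time.Millisecond)
	fmt.Printf("device %d gathered\n", id)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 500; i++ {
		wg.Add(1)
		go gatherDevice(i, &wg)
	}
	wg.Wait()
}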

2- (Ugly solution): Don't wait for the WaitGroups in the End() method of agent.go:

func End() (time.Duration, error) {
	start := time.Now()
	log.Infof("END: begin device Gather processes stop... at %s", start.String())
	// stop all device processes
	DeviceProcessStop() // 5000 devices = [36, 86] seconds. 
	log.Info("END: begin selfmon Gather processes stop...")
	// stop the selfmon process
	selfmonProc.StopGather()
	log.Info("END: waiting for all Gather goroutines stop...")
	// wait until Done
	gatherWg.Wait()
	log.Info("END: releasing Device Resources")
	ReleaseDevices()
	log.Info("END: releasing Selfmonitoring Resources")
	selfmonProc.End()
	log.Info("END: begin sender processes stop...")
	//log.Info("DEBUG Gather WAIT %+v", GatherWg)
	//log.Info("DEBUG SENDER WAIT %+v", senderWg)
	// stop all Output Emitter
	StopInfluxOut(influxdb)
	log.Info("END: waiting for all Sender goroutines stop..")
	senderWg.Wait() // 3.45 minutes <------------ Comment
	log.Info("END: releasing Sender Resources")
	ReleaseInfluxOut(influxdb)
	log.Infof("END: Finished from %s to %s [Duration : %s]", start.String(), time.Now().String(), time.Since(start).String())
	return time.Since(start), nil
}

The times noted in the code are approximate; sometimes it takes longer, sometimes less.
The solution here could be to comment out Wait(), clear the senderWg var and call the GC. I don't know what the cons of doing that are (that's why I'm here asking you 😄). A middle-ground sketch (waiting with a timeout instead of not waiting at all) is shown right below.
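
As a sketch only (this is not the project's code; waitTimeout is a hypothetical helper that would sit next to End() in agent.go and only needs the sync and time packages already imported there), the unbounded wait could be replaced with a bounded one:

// waitTimeout waits on wg, but gives up after d so a stuck Sender goroutine
// cannot block the whole reload forever. Points still in flight when the
// timeout fires may be lost.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true // every sender finished cleanly
	case <-time.After(d):
		return false // gave up waiting
	}
}

In End(), the plain senderWg.Wait() would then become something like: if !waitTimeout(&senderWg, 30*time.Second) { log.Infof("END: sender goroutines still running after timeout, continuing reload") }, accepting that whatever those goroutines were still sending may never reach InfluxDB.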

3- (Another ugly solution, but not as ugly as the previous one): Modify the Dockerfile to add Supervisord, and when I need to reload, kill the process and bring it up again; a rough config sketch is shown after the quote below. This solution is inspired by wurmrobert's comment:

I am using your collector for about 550 devices.
Everything works well. But is it normal that the reload config process is sometimes really slow? Most of the time it takes more than one minute. I run your collector in a docker container. When I restart the container it is much faster and the gather process also picks up the latest config. [...]
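
For illustration only, a supervisord program entry could look roughly like the snippet below; the program name, binary path and options are assumptions, not the real image layout:

[supervisord]
nodaemon=true

[program:snmpcollector]
; placeholder command, adjust to the real binary path and flags of the image
command=/opt/snmpcollector/bin/snmpcollector
autostart=true
autorestart=true
stopsignal=TERM

With something like this in place, a "reload" becomes supervisorctl restart snmpcollector instead of the in-process config reload.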

4- Federate multiple instances of SNMP Collector?

To know:

  • What are the issues if I don't wait for the gather round to end?

I can imagine an inconsistency in InfluxDB like:

device    t1    t2    t3 (reload config)   t4
device1   201   199   -                    -
device2   156   157   159                  -
device3   170   178   -                    -

Where the t(n) are points over time; when reload config is called, some devices can finish their gather (filling the Influx point) and others cannot, leaving empty points.

I will expand on this issue as I gather more information from researching and learning.

Thanks for reading this far.

And thanks even more in advance if you can contribute anything.

Greetings

PS: English is not my first language, sorry if I missed anything.

@toni-moreno
Owner

@JuSacco sorry for the big delay in the response. You have done a really good review regarding this question; we will review your ideas and answer you ASAP.

About language... I've seen you are from Argentina; if you need Spanish support, email me at my personal address :-)

@JuSacco
Author

JuSacco commented Jan 28, 2021

@JuSacco sorry for the big delay in the response. You have done a really good review regarding this question; we will review your ideas and answer you ASAP.

About language... I've seen you are from Argentina; if you need Spanish support, email me at my personal address :-)

Thank you very much! I'll send you an email!

@toni-moreno toni-moreno added this to the 1.0 milestone Mar 7, 2021
@toni-moreno toni-moreno added the performance Related to performance bugs, info, improvements label Mar 7, 2021