[Enhancement] Improve SNMP Collector to work with heavy loads (+5000 devices) #467

Open
JuSacco opened this issue Jan 26, 2021 · 2 comments
Labels: performance (Related to performance bugs, info, improvements)
Milestone: 1.0

Comments

@JuSacco

JuSacco commented Jan 26, 2021

Context:

  • What SnmpCollector version are you using?
    I'm using 0.8.0 in production and I'm working with 0.9.0 (dev stage).
  • What OS are you running snmpcollector on?
    Docker
  • What did you do?
    Modified devices, measurements and metrics, then reloaded the config.
  • What was the expected result?
    A config reload that completes as fast as possible.
  • What happened instead?
    Holes of a few minutes in the Grafana dashboards.

I came here from here

I'm using snmpcollector to poll about 5k devices. The issue I'm hitting is that when I reload the config, I get holes in the Grafana metric dashboards. That is because I have a lot of devices (I know), so I've been playing with the timeout and retries, but without encouraging results:

Hosts   Measurements   Timeout   Retries   Time (h:mm:ss)
5027    5              5         2         0:05:54
5027    5              5         2         0:03:34
5027    5              5         2         0:03:51
5027    5              5         2         0:05:31
5027    5              5         2         0:06:09
5027    5              5         2         0:05:50
5027    5              3         1         0:07:19
5027    5              3         1         0:07:12
5027    5              3         1         0:06:00
5027    5              3         1         0:05:43
5027    5              3         1         0:07:29

Introduction:
So... I found that playing with the timeout or retries isn't an option (at least for me). OK, that said, I need to find a solution, and that is my motivation for opening this issue. I'm not looking for magical code from you (the devs of this project); that is not going to happen.
I'm looking for ideas and pointers about which directions not to go in. Also, I'm new to Go, so it's possible I assume things that can't be done, or the other way around, that things can be done that I don't know about.

Objective: Reduce the time it takes to reload SNMP Collector as much as possible.

Scenario:
In my test scenario, I have a snmpcollector instance and an SNMP daemon, both running as k8s pods (snmp-collector and snmp-src).
Currently, I load all the configuration into a MySQL DB and snmpcollector takes its config from there.
One thing to keep in mind is that all the SNMP devices point to the same URL (the same k8s service), which may be a flaw; I suppose this is improvable, but it isn't a blocking issue.

Ideas:
I have a few ways in mind to approach this:
1- Modify the code to add a channel (or something like the Observer pattern) where each device asks whether it can run the gather process before starting to gather. The channel could be buffered, so I could call len(buffChannel) and only start gathering when it is 0. A minimal sketch of the idea is shown right below.
(Maybe I'll extend this idea when I research a bit more)
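
The following is not snmpcollector code, just a minimal self-contained sketch of what such a gate could look like, with made-up names (gatherGate, gatherDevice) and an arbitrary limit of 100 concurrent gather rounds:

package main

import (
	"fmt"
	"sync"
	"time"
)

// gatherGate is a hypothetical buffered channel acting as a counting semaphore:
// a device must take a slot before starting a gather round and gives it back
// when the round ends. During a config reload the agent could simply stop
// granting slots, so only the rounds already in flight have to be waited for.
var gatherGate = make(chan struct{}, 100) // assumption: at most 100 concurrent gathers

func gatherDevice(id int, wg *sync.WaitGroup) {
	defer wg.Done()

	gatherGate <- struct{}{}        // acquire a slot (blocks while the gate is full)
	defer func() { <-gatherGate }() // release the slot when this round ends

	// ...the real SNMP polling for this device would happen here...
	time.Sleep(10 * time.Millisecond)
	fmt.Printf("device %d gathered\n", id)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 500; i++ {
		wg.Add(1)
		go gatherDevice(i, &wg)
	}
	wg.Wait()
}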

2- (Ugly solution): Don't wait for the WaitGroups in the End() method of agent.go:

func End() (time.Duration, error) {
	start := time.Now()
	log.Infof("END: begin device Gather processes stop... at %s", start.String())
	// stop all device processes
	DeviceProcessStop() // 5000 devices = [36, 86] seconds. 
	log.Info("END: begin selfmon Gather processes stop...")
	// stop the selfmon process
	selfmonProc.StopGather()
	log.Info("END: waiting for all Gather goroutines stop...")
	// wait until Done
	gatherWg.Wait()
	log.Info("END: releasing Device Resources")
	ReleaseDevices()
	log.Info("END: releasing Selfmonitoring Resources")
	selfmonProc.End()
	log.Info("END: begin sender processes stop...")
	//log.Info("DEBUG Gather WAIT %+v", GatherWg)
	//log.Info("DEBUG SENDER WAIT %+v", senderWg)
	// stop all Output Emitter
	StopInfluxOut(influxdb)
	log.Info("END: waiting for all Sender goroutines stop..")
	senderWg.Wait() // 3.45 minutes <------------ Comment
	log.Info("END: releasing Sender Resources")
	ReleaseInfluxOut(influxdb)
	log.Infof("END: Finished from %s to %s [Duration : %s]", start.String(), time.Now().String(), time.Since(start).String())
	return time.Since(start), nil
}

The times noted in the code are approximate; sometimes it takes longer, sometimes less.
The solution here could be to comment out Wait(), clear the senderWg var and call the GC. I don't know what the cons of doing that are (that's why I'm here asking you 😄). A middle-ground sketch (waiting with a timeout instead of not waiting at all) is shown right below.
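
As a sketch only (this is not the project's code; waitTimeout is a hypothetical helper that would sit next to End() in agent.go and only needs the sync and time packages already imported there), the unbounded wait could be replaced with a bounded one:

// waitTimeout waits on wg, but gives up after d so a stuck Sender goroutine
// cannot block the whole reload forever. Points still in flight when the
// timeout fires may be lost.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true // every sender finished cleanly
	case <-time.After(d):
		return false // gave up waiting
	}
}

In End(), the plain senderWg.Wait() would then become something like: if !waitTimeout(&senderWg, 30*time.Second) { log.Infof("END: sender goroutines still running after timeout, continuing reload") }, accepting that whatever those goroutines were still sending may never reach InfluxDB.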

3- (Another ugly solution, but not as ugly as the previous one): Modify the Dockerfile to add Supervisord, and when I need to reload, kill the process and bring it up again; a rough config sketch is shown after the quote below. This solution is inspired by wurmrobert's comment:

I am using your collector for about 550 devices.
Everything works well. But is it normal that the reload config process is sometimes really slow? Most of the time it takes more than one minute. I run your collector in a docker container. When I restart the container it is much faster and the gather process also picks up the latest config. [...]
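
For illustration only, a supervisord program entry could look roughly like the snippet below; the program name, binary path and options are assumptions, not the real image layout:

[supervisord]
nodaemon=true

[program:snmpcollector]
; placeholder command, adjust to the real binary path and flags of the image
command=/opt/snmpcollector/bin/snmpcollector
autostart=true
autorestart=true
stopsignal=TERM

With something like this in place, a "reload" becomes supervisorctl restart snmpcollector instead of the in-process config reload.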

4- Federate multiple instances of SNMP Collector?

To know:

  • What are the issues if I don't wait for the gather round to end?

I can imagine an inconsistency in InfluxDB like:

device    t1    t2    t3 (reload config)   t4
device1   201   199   -                    -
device2   156   157   159                  -
device3   170   178   -                    -

Where the t(n) are points over time; when reload config is called, some devices can finish their gather (filling the Influx point) and others cannot, leaving empty points.

I will expand on this issue as I gather more information from researching and learning.

Thanks for reading this far.

And thanks even more in advance if you can contribute anything.

Greetings

PS: English is not my first language, sorry if I missed anything.

@toni-moreno
Owner

@JuSacco sorry for the big delay in the response. You have done a really good review regarding this question; we will review your ideas and answer you ASAP.

About language... I've seen you are from Argentina; if you need Spanish support, email me at my personal address :-)

@JuSacco
Author

JuSacco commented Jan 28, 2021

@JuSacco sorry for the big delay in the response. You have done a really good review regarding this question; we will review your ideas and answer you ASAP.

About language... I've seen you are from Argentina; if you need Spanish support, email me at my personal address :-)

Thank you very much! I'll send you an email!

@toni-moreno toni-moreno added this to the 1.0 milestone Mar 7, 2021
@toni-moreno toni-moreno added the performance Related to performance bugs, info, improvements label Mar 7, 2021