
Use workers instead of spawning goroutines for each incoming DNS request #565

Closed
wants to merge 1 commit

Conversation

@UladzimirTrehubenka (Contributor) commented Nov 14, 2017

There are two major issues:
unbounded resource usage - a performance test shows that under high load CoreDNS silently crashes;
a performance drop, because too much time is spent managing goroutines instead of serving DNS requests.
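
At a high level, the proposal replaces spawning one goroutine per incoming message with a fixed pool of workers pulling requests from a queue. A minimal, self-contained sketch of that pattern (illustrative only; request, queue and maxWorkers here are stand-ins, not the PR's actual code):

    package main

    import (
        "fmt"
        "sync"
    )

    // request stands in for an incoming DNS message plus its connection info.
    type request struct{ id int }

    func main() {
        const maxWorkers = 100      // fixed pool instead of one goroutine per request
        queue := make(chan request) // the PR additionally allows buffering this queue

        var wg sync.WaitGroup
        for w := 0; w < maxWorkers; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for req := range queue { // each worker serves many requests
                    fmt.Println("serving request", req.id)
                }
            }()
        }

        for i := 0; i < 10; i++ { // stand-in for the socket read loop
            queue <- request{id: i}
        }
        close(queue)
        wg.Wait()
    }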

@miekg (Owner) commented Nov 15, 2017

Why?

server.go Outdated
// Maximum number of incoming DNS messages in queue.
maxQueueSize = 1000000
// Maximum number of workers.
maxWorkers = 100
Owner

Why 100?

Contributor Author

Observed that 100 is the optimal value; both <100 and >100 drop performance.

Contributor Author

But we probably need more tests.

server.go Outdated
// Maximum number of TCP queries before we close the socket.
maxTCPQueries = 128
// Maximum number of incoming DNS messages in queue.
maxQueueSize = 1000000
Owner

Why 1000000?

Contributor Author

CoreDNS cannot handle 1M requests, so a queue of this size never constrains anything; it just keeps the queue from becoming a bottleneck.

@codecov-io commented Nov 15, 2017

Codecov Report

Merging #565 into master will increase coverage by 0.08%.
The diff coverage is 79.48%.


@@            Coverage Diff             @@
##           master     #565      +/-   ##
==========================================
+ Coverage   57.85%   57.93%   +0.08%     
==========================================
  Files          37       37              
  Lines        9984    10008      +24     
==========================================
+ Hits         5776     5798      +22     
  Misses       3158     3158              
- Partials     1050     1052       +2
Impacted Files   Coverage Δ
server.go        59.57% <79.48%> (+1.05%) ⬆️
msg.go           77.85% <0%> (+0.65%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3bbde60...9e70f9e. Read the comment docs.

@miekg (Owner) commented Nov 15, 2017

Couple of things:

  • If you're proposing large changes, it's best to open a bug first
  • Neither the comment nor the PR has a decent description of what you are doing and why
  • If this is about performance, it should add a benchmark test or show some improvement (a sketch of such a benchmark follows below)
  • A buffered channel with a random number of elements in it does not work (the other place this happens, https://github.com/miekg/dns/blob/master/scan.go#L171, should also be removed)
  • A single benchmark used to fix the numbers at N does not say anything. How about ARM, s390, i386?
  • What happens when you hit the 100 goroutines?
  • What happens when the buffer fills up?
  • How much faster is this anyway?
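
A benchmark along those lines might look roughly like the sketch below. It only uses the package's existing public API (HandleFunc, Server, Client.Exchange); the handler, address and the name BenchmarkServeUDP are made-up placeholders rather than code from this PR:

    package dns_test

    import (
        "testing"
        "time"

        "github.com/miekg/dns"
    )

    func BenchmarkServeUDP(b *testing.B) {
        // Minimal handler: reply with an empty NOERROR answer.
        dns.HandleFunc("example.org.", func(w dns.ResponseWriter, r *dns.Msg) {
            m := new(dns.Msg)
            m.SetReply(r)
            w.WriteMsg(m)
        })

        srv := &dns.Server{Addr: "127.0.0.1:8053", Net: "udp"}
        go srv.ListenAndServe()
        defer srv.Shutdown()
        time.Sleep(100 * time.Millisecond) // crude wait for the listener to come up

        c := new(dns.Client)
        q := new(dns.Msg)
        q.SetQuestion("example.org.", dns.TypeA)

        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            if _, _, err := c.Exchange(q, "127.0.0.1:8053"); err != nil {
                b.Fatal(err)
            }
        }
    }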

@miekg (Owner) commented Dec 7, 2017

What do others think of this?

The speed is nice. Is the constant of '100' a problem?

@johnbelamaric commented:
I asked @UladzimirTrehubenka to put some more details (like those in the email) onto this PR or into an issue; I think others will need to see that to weigh in.

I think to know if the constant is a problem probably requires more empirical tests on other platforms. But if this feature is disabled with workers == 0, then the cost is low (well, if you consider maintaining two different paths low cost).

Silent crashes are bad...I like that it fixes that (it does, right?).

@miekg (Owner) commented Dec 7, 2017

Silent crashes are bad...I like that it fixes that (it does, right?).

Good point; somehow I tunnel-visioned on the constant.

@UladzimirTrehubenka (Contributor Author) commented Dec 8, 2017

Actually this PR breaks nothing. By default Workers and QueueSize are set to zero; these params are set during server object initialization (e.g. on the CoreDNS side; a configuration sketch follows below the numbers). On an AWS cluster, with a handler that returns a random A record for any request, I got the following numbers in queries per second (dnsperf against the test binary over the network):

workers 0, size 0 - 92K;
workers 100, size 0 - 95K;
workers 100, size 1000000 - 100K.

BTW the project's performance (with 100 workers) is 38K for size=0 vs 47K for size=1000000.
For an erratic handler on my local machine I got (for local testing the queue size doesn't matter):

workers 0, size 0 - 208K;
workers 50, size 0 - 209K;
workers 100, size 0 - 250K;
workers 200, size 0 - 246K;
workers 1000, size 0 - 234K.
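
For illustration, setting those knobs on the server object might look like the sketch below. The Workers and QueueSize fields are the ones added by this PR (they do not exist without this patch); the address and values are examples only:

    package main

    import (
        "log"

        "github.com/miekg/dns"
    )

    func main() {
        srv := &dns.Server{
            Addr:      "127.0.0.1:8053",
            Net:       "udp",
            Workers:   100,     // 0 keeps the old goroutine-per-request behavior
            QueueSize: 1000000, // 0 means an unbuffered request queue
        }
        // With no Handler set, the PR defaults srv.Handler to DefaultServeMux.
        if err := srv.ListenAndServe(); err != nil {
            log.Fatal(err)
        }
    }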

server.go Outdated
srv.lock.Lock()
defer srv.lock.Unlock()
if srv.started {
if srv.isRunning() {
Owner

Is this needed in this PR? (Not opposing the change - but it seems to do the same thing as the earlier code, or is there a bug being fixed?)

[it does clear out the awful locking we had in the server]

Can you make this a separate PR?


if srv.Handler == nil {
srv.Handler = DefaultServeMux
}
Owner

Why is this one needed?

scrolls down

Ah, you're pulling it out of the serveX functions; sensible change. Can you make that a new PR as well?

Contributor Author

serveUDP() and serveTCP() have:

	handler := srv.Handler
	if handler == nil {
		handler = DefaultServeMux
	}
        ...
        go srv.serve(s.RemoteAddr(), handler, ...)

Why do we need to set handler on each serveUDP() or serveTCP() call and then pass handler into serve(), when we can set srv.Handler just once and have serve() use srv.Handler directly?
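
The simplification being argued for is roughly the following toy sketch (stand-in types, not the PR's actual diff): default the handler once at startup so the serve path can rely on srv.Handler.

    package main

    import "fmt"

    // Handler and DefaultServeMux stand in for the real dns.Handler and dns.DefaultServeMux.
    type Handler interface{ ServeName() string }

    type mux struct{}

    func (mux) ServeName() string { return "DefaultServeMux" }

    var DefaultServeMux Handler = mux{}

    type Server struct{ Handler Handler }

    // ListenAndServe defaults the handler exactly once, up front ...
    func (srv *Server) ListenAndServe() {
        if srv.Handler == nil {
            srv.Handler = DefaultServeMux
        }
        srv.serve()
    }

    // ... so serve() no longer needs a handler parameter.
    func (srv *Server) serve() {
        fmt.Println("serving with", srv.Handler.ServeName())
    }

    func main() {
        (&Server{}).ListenAndServe()
    }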

server.go Outdated
func (srv *Server) serveTCP(l net.Listener) error {
srv.start()
Owner

Why is start() needed again here?

server.go Outdated
if in.s != nil {
a = in.s.RemoteAddr()
} else if in.t != nil {
a = in.t.RemoteAddr()
Owner

if in.s != nil {
    a = in.s.RemoteAddr()
}
if in.t != nil {
    a = in.t.RemoteAddr()
}

All the reader stuff should apply for both cases and can be outdented

server.go Outdated
for q := 0; q < maxTCPQueries; q++ { // TODO(miek): make this number configurable?
req := new(Msg)
err := req.Unpack(in.m)
if err != nil { // Send a FormatError back
Owner

I'm not convinced the if-elseif-else is better than the goto we had.

server.go Outdated
// Number of workers; if set to zero, spawn goroutines instead
Workers int
// Size of the DNS request queue
QueueSize int
Owner

The #workers is one thing, but the QueueSize is quite another: if this doesn't perform better with QueueSize == 0 it's not worth adding. A buffered channel leads to weird "sometimes it is slow - or blocking" errors in prod when the thing finally fills up.
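
To illustrate the failure mode being described, a toy sketch (not library code): sends are cheap while the buffer has room, and once it fills each producer must block or drop, which only surfaces as intermittent slowness under real load.

    package main

    import "fmt"

    func main() {
        queue := make(chan int, 3) // tiny buffer to make the effect visible

        for i := 0; i < 5; i++ {
            select {
            case queue <- i:
                fmt.Println("queued", i) // fast path while the buffer has room
            default:
                // Once the buffer is full the producer must block, drop, or error;
                // with no consumer keeping up, this only shows up under real load.
                fmt.Println("queue full, dropping", i)
            }
        }
    }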

@miekg (Owner) commented Feb 5, 2018 via email


@UladzimirTrehubenka (Contributor Author) commented Feb 5, 2018

It is not enough to just remove this: srv.serve() needs to be changed to use srv.Handler instead of a passed-in handler, and srv.Handler needs to be set to DefaultServeMux (if the handler is empty) in srv.ListenAndServe() and srv.ActivateAndServe(), as done in the PR.

@UladzimirTrehubenka (Contributor Author) commented:

BTW the PR passes all unit tests and doesn't change the default behavior.
