
Use workers instead of spawning goroutines for each incoming DNS request #565

Closed
wants to merge 1 commit

Conversation

@UladzimirTrehubenka (Contributor) commented Nov 14, 2017

There are two major issues:
unbounded resource usage - a performance test shows that under high load CoreDNS silently crashes;
a performance drop, because too much time is spent managing goroutines instead of serving DNS requests.
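
At a high level, the proposal replaces spawning one goroutine per incoming message with a fixed pool of workers pulling requests from a queue. A minimal, self-contained sketch of that pattern (illustrative only; request, queue and maxWorkers here are stand-ins, not the PR's actual code):

    package main

    import (
        "fmt"
        "sync"
    )

    // request stands in for an incoming DNS message plus its connection info.
    type request struct{ id int }

    func main() {
        const maxWorkers = 100      // fixed pool instead of one goroutine per request
        queue := make(chan request) // the PR additionally allows buffering this queue

        var wg sync.WaitGroup
        for w := 0; w < maxWorkers; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for req := range queue { // each worker serves many requests
                    fmt.Println("serving request", req.id)
                }
            }()
        }

        for i := 0; i < 10; i++ { // stand-in for the socket read loop
            queue <- request{id: i}
        }
        close(queue)
        wg.Wait()
    }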

@miekg (Owner) commented Nov 15, 2017

Why?

server.go Outdated
// Maximum number of incoming DNS messages in queue.
maxQueueSize = 1000000
// Maximum number of workers.
maxWorkers = 100
Owner

Why 100?

Contributor Author

Observed that 100 is the optimal value; both <100 and >100 drop performance.

Contributor Author

But we probably need more tests.

server.go Outdated
// Maximum number of TCP queries before we close the socket.
maxTCPQueries = 128
// Maximum number of incoming DNS messages in queue.
maxQueueSize = 1000000
Owner

Why 1000000?

Contributor Author

CoreDNS cannot handle 1M requests, so a queue of this size never constrains anything; it just keeps the queue from becoming a bottleneck.

@codecov-io commented Nov 15, 2017

Codecov Report

Merging #565 into master will increase coverage by 0.08%.
The diff coverage is 79.48%.


@@            Coverage Diff             @@
##           master     #565      +/-   ##
==========================================
+ Coverage   57.85%   57.93%   +0.08%     
==========================================
  Files          37       37              
  Lines        9984    10008      +24     
==========================================
+ Hits         5776     5798      +22     
  Misses       3158     3158              
- Partials     1050     1052       +2
Impacted Files   Coverage Δ
server.go        59.57% <79.48%> (+1.05%) ⬆️
msg.go           77.85% <0%> (+0.65%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3bbde60...9e70f9e. Read the comment docs.

@miekg (Owner) commented Nov 15, 2017

Couple of things:

  • If you're proposing large changes, it's best to open a bug first
  • Neither the comment nor the PR has a decent description of what you are doing and why
  • If this is about performance, it should add a benchmark test or show some improvement (a sketch of such a benchmark follows below)
  • A buffered channel with a random number of elements in it does not work (the other place this happens, https://github.com/miekg/dns/blob/master/scan.go#L171, should also be removed)
  • A single benchmark used to fix the numbers at N does not say anything. How about ARM, s390, i386?
  • What happens when you hit the 100 goroutines?
  • What happens when the buffer fills up?
  • How much faster is this anyway?
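
A benchmark along those lines might look roughly like the sketch below. It only uses the package's existing public API (HandleFunc, Server, Client.Exchange); the handler, address and the name BenchmarkServeUDP are made-up placeholders rather than code from this PR:

    package dns_test

    import (
        "testing"
        "time"

        "github.com/miekg/dns"
    )

    func BenchmarkServeUDP(b *testing.B) {
        // Minimal handler: reply with an empty NOERROR answer.
        dns.HandleFunc("example.org.", func(w dns.ResponseWriter, r *dns.Msg) {
            m := new(dns.Msg)
            m.SetReply(r)
            w.WriteMsg(m)
        })

        srv := &dns.Server{Addr: "127.0.0.1:8053", Net: "udp"}
        go srv.ListenAndServe()
        defer srv.Shutdown()
        time.Sleep(100 * time.Millisecond) // crude wait for the listener to come up

        c := new(dns.Client)
        q := new(dns.Msg)
        q.SetQuestion("example.org.", dns.TypeA)

        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            if _, _, err := c.Exchange(q, "127.0.0.1:8053"); err != nil {
                b.Fatal(err)
            }
        }
    }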

@miekg (Owner) commented Dec 7, 2017

What do others think of this?

The speed is nice. Is the constant of '100' a problem?

@johnbelamaric commented:
I asked @UladzimirTrehubenka to put some more details (like those in the email) onto this PR or into an issue; I think others will need to see that to weigh in.

I think to know if the constant is a problem probably requires more empirical tests on other platforms. But if this feature is disabled with workers == 0, then the cost is low (well, if you consider maintaining two different paths low cost).

Silent crashes are bad...I like that it fixes that (it does, right?).

@miekg (Owner) commented Dec 7, 2017

Silent crashes are bad...I like that it fixes that (it does, right?).

Good point; somehow I tunnel-visioned on the constant.

@UladzimirTrehubenka (Contributor Author) commented Dec 8, 2017

Actually this PR breaks nothing. By default Workers and QueueSize are set to zero; these params are set during server object initialization (e.g. on the CoreDNS side; a configuration sketch follows below the numbers). On an AWS cluster, with a handler that returns a random A record for any request, I got the following numbers in queries per second (dnsperf against the test binary over the network):

workers 0, size 0 - 92K;
workers 100, size 0 - 95K;
workers 100, size 1000000 - 100K.

BTW the project's performance (with 100 workers) is 38K for size=0 vs 47K for size=1000000.
For an erratic handler on my local machine I got (for local testing the queue size doesn't matter):

workers 0, size 0 - 208K;
workers 50, size 0 - 209K;
workers 100, size 0 - 250K;
workers 200, size 0 - 246K;
workers 1000, size 0 - 234K.
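
For illustration, setting those knobs on the server object might look like the sketch below. The Workers and QueueSize fields are the ones added by this PR (they do not exist without this patch); the address and values are examples only:

    package main

    import (
        "log"

        "github.com/miekg/dns"
    )

    func main() {
        srv := &dns.Server{
            Addr:      "127.0.0.1:8053",
            Net:       "udp",
            Workers:   100,     // 0 keeps the old goroutine-per-request behavior
            QueueSize: 1000000, // 0 means an unbuffered request queue
        }
        // With no Handler set, the PR defaults srv.Handler to DefaultServeMux.
        if err := srv.ListenAndServe(); err != nil {
            log.Fatal(err)
        }
    }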

server.go Outdated
srv.lock.Lock()
defer srv.lock.Unlock()
if srv.started {
if srv.isRunning() {
Owner

Is this needed in this PR? (Not opposing the change - but it seems to do the same thing as the earlier code, or is there a bug being fixed?)

[it does clear out the awful locking we had in the server]

Can you make this a separate PR?


if srv.Handler == nil {
srv.Handler = DefaultServeMux
}
Owner

Why is this one needed?

scrolls down

Ah, you're pulling it out of the serveX functions; sensible change. Can you make that a new PR as well?

Contributor Author

serveUDP() and serveTCP() have:

	handler := srv.Handler
	if handler == nil {
		handler = DefaultServeMux
	}
        ...
        go srv.serve(s.RemoteAddr(), handler, ...)

Why do we need to set handler on each serveUDP() or serveTCP() call and then pass handler into serve(), when we can set srv.Handler just once and have serve() use srv.Handler directly?
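
The simplification being argued for is roughly the following toy sketch (stand-in types, not the PR's actual diff): default the handler once at startup so the serve path can rely on srv.Handler.

    package main

    import "fmt"

    // Handler and DefaultServeMux stand in for the real dns.Handler and dns.DefaultServeMux.
    type Handler interface{ ServeName() string }

    type mux struct{}

    func (mux) ServeName() string { return "DefaultServeMux" }

    var DefaultServeMux Handler = mux{}

    type Server struct{ Handler Handler }

    // ListenAndServe defaults the handler exactly once, up front ...
    func (srv *Server) ListenAndServe() {
        if srv.Handler == nil {
            srv.Handler = DefaultServeMux
        }
        srv.serve()
    }

    // ... so serve() no longer needs a handler parameter.
    func (srv *Server) serve() {
        fmt.Println("serving with", srv.Handler.ServeName())
    }

    func main() {
        (&Server{}).ListenAndServe()
    }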

server.go Outdated
func (srv *Server) serveTCP(l net.Listener) error {
srv.start()
Owner

Why is start() needed again here?

server.go Outdated
if in.s != nil {
a = in.s.RemoteAddr()
} else if in.t != nil {
a = in.t.RemoteAddr()
Owner

if in.s != nil {
    a = in.s.RemoteAddr()
}
if in.t != nil {
    a = in.t.RemoteAddr()
}

All the reader stuff should apply for both cases and can be outdented

server.go Outdated
for q := 0; q < maxTCPQueries; q++ { // TODO(miek): make this number configurable?
req := new(Msg)
err := req.Unpack(in.m)
if err != nil { // Send a FormatError back
Owner

I'm not convinced the if-elseif-else is better than the goto we had.

server.go Outdated
// Number of workers; if set to zero, spawn goroutines instead
Workers int
// Size of the DNS request queue
QueueSize int
Owner

The #workers is one thing, but the QueueSize is quite another: if this doesn't perform better with QueueSize == 0 it's not worth adding. A buffered channel leads to weird "sometimes it is slow - or blocking" errors in prod when the thing finally fills up.
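
To illustrate the failure mode being described, a toy sketch (not library code): sends are cheap while the buffer has room, and once it fills each producer must block or drop, which only surfaces as intermittent slowness under real load.

    package main

    import "fmt"

    func main() {
        queue := make(chan int, 3) // tiny buffer to make the effect visible

        for i := 0; i < 5; i++ {
            select {
            case queue <- i:
                fmt.Println("queued", i) // fast path while the buffer has room
            default:
                // Once the buffer is full the producer must block, drop, or error;
                // with no consumer keeping up, this only shows up under real load.
                fmt.Println("queue full, dropping", i)
            }
        }
    }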

@miekg (Owner) commented Feb 5, 2018 via email


@UladzimirTrehubenka (Contributor Author) commented Feb 5, 2018

It is not enough to just remove this: srv.serve() needs to be changed to use srv.Handler instead of a passed-in handler, and srv.Handler needs to be set to DefaultServeMux (if the handler is empty) in srv.ListenAndServe() and srv.ActivateAndServe(), as done in the PR.

@UladzimirTrehubenka (Contributor Author) commented:

BTW the PR passes all unit tests and doesn't change the default behavior.
