Make it easy to set TCP_NODELAY on server sockets #629

Closed · staticfloat opened this issue Nov 27, 2020 · 7 comments
Labels: enhancement, server (About our HTTP server)

Comments

@staticfloat (Contributor) commented Nov 27, 2020

In my quest to improve PkgServer reliability, I noticed that my current favorite load-testing tool, k6, shows a consistent 40ms delay whenever dealing with HTTP.jl.

The HTTP.jl server code:

# http_mwe.jl
using HTTP, Sockets

function serve(host, port)
    @info("Listening on $(host):$(port)")
    HTTP.listen(host, port) do http
        # Uncommenting this improves performance in all cases
        # Sockets.nagle(http.stream.c.io, false)
        HTTP.setstatus(http, 200)
        HTTP.startwrite(http)
        write(http, "Hello, Julia")
    end
end

serve("0.0.0.0", 8000)

The k6 configuration:

// http_perf.js
// If you don't want to install `k6`, you can run this via:
// docker run -v $(pwd):/config:ro --net=host -ti loadimpact/k6 run /config/http_perf.js
import http from 'k6/http';
import { check } from 'k6';

// Simulate 1 user for 10 seconds
export let options = {
    vus: 1,
    duration: '10s',
    // Uncommenting this improves performance, but at a slight cost
    //noConnectionReuse: true,
};

// Fetch the index page and check the response status and payload
export default function () {
    let base = 'http://localhost:8000';

    // Hit the index page (static file)
    let res = http.get(base);
    check(res, {
        '200 OK': (r) => r.status === 200,
    });
    check(res, {
        'payload': (r) => r.body.startsWith("Hello"),
    });
}

The results of running k6 against the server show a consistent 40ms delay in receiving the first byte of data from the server:

$ docker run -v $(pwd):/config:ro --net=host -ti loadimpact/k6 run /config/http_perf.js

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: /config/http_perf.js
     output: -

  scenarios: (100.00%) 1 scenario, 1 max VUs, 40s max duration (incl. graceful stop):
           * default: 1 looping VUs for 10s (gracefulStop: 30s)


running (10.1s), 0/1 VUs, 228 complete and 0 interrupted iterations
default ✓ [======================================] 1 VUs  10s

    ✓ 200 OK
    ✓ payload

    checks.....................: 100.00% ✓ 456 ✗ 0  
    data_received..............: 16 kB   1.6 kB/s
    data_sent..................: 18 kB   1.8 kB/s
    http_req_blocked...........: avg=71.01µs  min=2.52µs  med=5.66µs  max=14.87ms  p(90)=5.97µs   p(95)=7.16µs  
    http_req_connecting........: avg=673ns    min=0s      med=0s      max=153.58µs p(90)=0s       p(95)=0s      
    http_req_duration..........: avg=43.65ms  min=696.5µs med=43.75ms max=47.83ms  p(90)=43.8ms   p(95)=43.85ms 
    http_req_receiving.........: avg=43.32ms  min=240.9µs med=43.42ms max=47.51ms  p(90)=43.48ms  p(95)=43.54ms 
    http_req_sending...........: avg=28.21µs  min=12.51µs med=26.69µs max=127.06µs p(90)=33.24µs  p(95)=34.49µs 
    http_req_tls_handshaking...: avg=0s       min=0s      med=0s      max=0s       p(90)=0s       p(95)=0s      
    http_req_waiting...........: avg=300.94µs min=227.5µs med=302µs   max=382.9µs  p(90)=357.07µs p(95)=360.64µs
    http_reqs..................: 228     22.64887/s
    iteration_duration.........: avg=43.96ms  min=16.05ms med=43.98ms max=48.07ms  p(90)=44.04ms  p(95)=44.09ms 
    iterations.................: 228     22.64887/s
    vus........................: 1       min=1 max=1
    vus_max....................: 1       min=1 max=1

The roundness of 40ms tickled my brain, and I remembered an article I had read about TCP_NODELAY; the telltale 40ms is the classic interaction between Nagle's algorithm and TCP delayed ACKs on Linux. Grepping through Julia's source code, I discovered that Sockets.nagle() is the way to toggle it. Finding the underlying socket object took a bit more grepping through HTTP.jl, but I eventually found it at http.stream.c.io. That path seems a little "internalsy", which is why I opened this issue, but the results are dramatic. Uncommenting the Sockets.nagle() call in the Julia code above, we get:

$ docker run -v $(pwd):/config:ro --net=host -ti loadimpact/k6 run /config/http_perf.js

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: /config/http_perf.js
     output: -

  scenarios: (100.00%) 1 scenario, 1 max VUs, 40s max duration (incl. graceful stop):
           * default: 1 looping VUs for 10s (gracefulStop: 30s)


running (10.1s), 0/1 VUs, 21189 complete and 0 interrupted iterations
default ✓ [======================================] 1 VUs  10s

    ✓ 200 OK
    ✓ payload

    checks.....................: 100.00% ✓ 42378 ✗ 0  
    data_received..............: 1.5 MB  145 kB/s
    data_sent..................: 1.7 MB  168 kB/s
    http_req_blocked...........: avg=4.33µs   min=1.64µs   med=3.22µs   max=14.94ms  p(90)=4.77µs   p(95)=5.48µs  
    http_req_connecting........: avg=7ns      min=0s       med=0s       max=161.53µs p(90)=0s       p(95)=0s      
    http_req_duration..........: avg=284.03µs min=158.48µs med=262.16µs max=112.07ms p(90)=305.08µs p(95)=331.77µs
    http_req_receiving.........: avg=120.11µs min=59.72µs  med=117.9µs  max=1.3ms    p(90)=137.69µs p(95)=160.2µs 
    http_req_sending...........: avg=20.11µs  min=10.33µs  med=18.22µs  max=1.05ms   p(90)=24.99µs  p(95)=28.42µs 
    http_req_tls_handshaking...: avg=0s       min=0s       med=0s       max=0s       p(90)=0s       p(95)=0s      
    http_req_waiting...........: avg=143.8µs  min=77.66µs  med=125.84µs max=111.93ms p(90)=146.47µs p(95)=156.57µs
    http_reqs..................: 21189   2106.505073/s
    iteration_duration.........: avg=463.58µs min=313.01µs med=432.82µs max=112.33ms p(90)=505.48µs p(95)=535.4µs 
    iterations.................: 21189   2106.505073/s
    vus........................: 1       min=1   max=1
    vus_max....................: 1       min=1   max=1

So: a nice, almost 100x speedup (2106.5 requests served per second, as opposed to 22.6). Now, of course, HTTP.jl is able to multiplex connections, so if we increase the number of VUs in k6 to parallelize requests, we can push the throughput up, but not the latency down. Using 100 VUs and the default configuration, we get an average latency of 45.3ms at 2193.2 requests served per second; adding TCP_NODELAY back drops the latency to 8.6ms and gets us a whopping 11467.9 requests served per second.

There is a point (1000 VUs) at which we completely overwhelm the CPU, and the added work of toggling Sockets.nagle() hurts us more than it helps; there, the TCP_NODELAY version gets lower throughput and higher latency than the default version. But at that point the CPUs on my test machine are screaming for relief, so I'm okay with it.

I tried out some other load-testing tools, and some of them did not show this issue. It turns out the difference comes down to k6 keeping the socket open for connection reuse. Turning reuse off (forcing a new socket for each request) also got rid of the latency, but at a significant cost in total throughput.

The end result of this issue is to ask: is there a nice, HTTP.jl-approved way for me to alter the socket objects created by the accept() call inside HTTP.listen()? I don't want to rely on the http.stream.c.io path, as it seems internals-y and could change in a future release. Perhaps some kind of callback could be offered (similar to tcpisvalid, which I could abuse for this purpose to mutate the socket) to allow me to modify socket options? See the sketch below for what that abuse might look like.
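
For reference, here's a minimal sketch of the tcpisvalid abuse mentioned above. It assumes the callback receives the raw accepted socket and that returning true accepts the connection, so treat it as a stopgap rather than a supported API:

using HTTP, Sockets

# Hypothetical workaround: (ab)use the `tcpisvalid` connection-validation
# callback to disable Nagle on every accepted socket before accepting it.
HTTP.listen("0.0.0.0", 8000; tcpisvalid = tcp -> begin
            Sockets.nagle(tcp, false)
            true
        end) do http
    HTTP.setstatus(http, 200)
    HTTP.startwrite(http)
    write(http, "Hello, Julia")
end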

@staticfloat (Contributor, Author)

As an aside, with these socket options set, we totally blow Flask out of the water on this very specific micro-benchmark. :P

@staticfloat (Contributor, Author)

Thinking about this a little more: we don't just want an easy way to turn nagle() off on a socket once; we actually want to be able to toggle it dynamically, and to turn it back on again. For example, for responses we know we're going to generate all in one go, we want Nagle disabled, but if we're generating the response piecemeal, we don't want to send a bunch of tiny packets in rapid succession. (Although my experiments so far have not found reliable cases where this happens.)

An alternative way of doing this would be to put HTTP.jl into a mode where it collects all the write(http, ...) calls, then uses writev() against the socket with Nagle disabled; this should give much more precise control over the minimum packet size.
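
To make that concrete, here's a rough illustration of the batching idea (batched_write is a hypothetical helper, not an HTTP.jl API), approximating writev() by accumulating the response in an IOBuffer and emitting it in a single write with Nagle disabled:

using Sockets

# Hypothetical helper: accumulate response fragments in userspace, then
# push them to the socket in one write so Nagle has nothing to coalesce.
function batched_write(sock::TCPSocket, parts)
    buf = IOBuffer()
    for p in parts
        write(buf, p)            # buffer in userspace, not in the kernel
    end
    Sockets.nagle(sock, false)   # send immediately, no small-packet delay
    write(sock, take!(buf))      # one write, one (or few) TCP segments
    Sockets.nagle(sock, true)    # restore the default for connection reuse
end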

@staticfloat (Contributor, Author)

Okay, I did a few more experiments (and learned more about Transfer-Encoding: chunked), and I have come up with what looks like a pretty happy middle ground: as long as the lifetime of the request is governed by the HTTP.listen() do-block, we pretty much always want Nagle enabled within the do-block, but disabled momentarily as soon as the do-block is finished.

In this case, I think this basically boils down to: startwrite() should generally call Sockets.nagle(http.stream.c.io, true), and closewrite() should generally call Sockets.nagle(http.stream.c.io, false).

This gives us a natural segmentation of TCP packets, and essentially "flushes" the TCP buffer once the user callback is done constructing its HTTP response. You can simulate this right now with the following:

using HTTP, Sockets

function serve(host, port)
    @info("Listening on $(host):$(port)")
    HTTP.listen(host, port) do http
        Sockets.nagle(http.stream.c.io, true)
        HTTP.setstatus(http, 200)
        # I'm using this just to disable chunked encoding, which inflates payload size
        # dramatically when writing a single character at a time.  ;)
        HTTP.setheader(http, "Content-Length" => "12")
        HTTP.startwrite(http)
        for c in "Hello, Julia"
            write(http, string(c))
        end
        Sockets.nagle(http.stream.c.io, false)
    end
end

serve("0.0.0.0", 8000)

Testing with k6, we get a nice 2050.3 requests per second with a single VU, which is not bad considering we're doing a fair bit more work (write()'ing 12 times instead of once, etc.). Latency remains sub-millisecond, which is fantastic, and checking packet sizes with tcpdump/tcpflow, we see a nice bundling of packets:

$ sudo tcpdump -i lo -l -w - | tcpflow -g -c -r -
127.000.000.001.36830-127.000.000.001.08000: GET / HTTP/1.1
Host: localhost:8000
User-Agent: k6/0.29.0 (https://k6.io/)


127.000.000.001.08000-127.000.000.001.36830: HTTP/1.1 200 OK
Content-Length: 12


127.000.000.001.08000-127.000.000.001.36830: Hello, Julia

If we were to simply disable Nagle for the entire callback, as I had originally proposed, we would get something like this:

127.000.000.001.36848-127.000.000.001.08000: GET / HTTP/1.1
Host: localhost:8000
User-Agent: k6/0.29.0 (https://k6.io/)


127.000.000.001.08000-127.000.000.001.36848: HTTP/1.1 200 OK
Content-Length: 12


127.000.000.001.08000-127.000.000.001.36848: H
127.000.000.001.08000-127.000.000.001.36848: e
127.000.000.001.08000-127.000.000.001.36848: l
127.000.000.001.08000-127.000.000.001.36848: l
127.000.000.001.08000-127.000.000.001.36848: o
127.000.000.001.08000-127.000.000.001.36848: ,
127.000.000.001.08000-127.000.000.001.36848:  
127.000.000.001.08000-127.000.000.001.36848: J
127.000.000.001.08000-127.000.000.001.36848: u
127.000.000.001.08000-127.000.000.001.36848: l
127.000.000.001.08000-127.000.000.001.36848: i
127.000.000.001.08000-127.000.000.001.36848: a

And while our latency remains sub-millisecond, our throughput drops to only 1706.5 requests per second.

@quinnj (Member) commented Nov 28, 2020

This is some fantastic investigative work. I'm happy to support any changes we need in HTTP.jl to improve things, and happy to jump on a Zoom call or anything if we want to do a jam session and talk about the best way to thread changes through.

@staticfloat (Contributor, Author) commented Dec 1, 2020

What do you think about just putting:

Sockets.nagle(http.stream.c.io, false)
Sockets.nagle(http.stream.c.io, true)

within closewrite()? This should flush the TCP buffer and play well with keep-alive, as it restores the socket to its default configuration for the next request.
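
For concreteness, here's a sketch of where those calls might land, assuming closewrite(http::Stream) is the method that finalizes a response (the existing finalization logic is elided):

# Hypothetical placement inside HTTP.jl's closewrite(); disabling Nagle
# flushes whatever is still buffered in the kernel, and re-enabling it
# restores the default for the next request on a keep-alive connection.
function closewrite(http::Stream)
    # ... existing response-finalization logic ...
    Sockets.nagle(http.stream.c.io, false)
    Sockets.nagle(http.stream.c.io, true)
    return
end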

@quinnj (Member) commented Dec 2, 2020

Yeah, that sounds fine to me; make a PR?

@fonsp added the enhancement and server (About our HTTP server) labels on Mar 16, 2022
@quinnj (Member) commented May 27, 2022

@staticfloat, sorry for the slow follow-up. I'd really hate to lose all the great investigative work you originally did here, so I've put up a PR with my understanding of what would be needed to get the benefits you proposed. It seems like there are a few different places we could put this code; what do you think of my approach?
