io_uring buffer performance #692
-
This isn't strictly io_uring related, so forgive me if I'm asking in the wrong place. But it is sort of related, and I feel like the expertise and experience of people here may come in handy. I have a sample single-threaded echo server application that:
I've noticed that the application really likes order, by which I mean the placement of the buffers being used for recvs and sends affects throughput. I don't believe I'm misusing io_uring, but both IORING_OP_POLL_ADD and epoll_wait are unaffected by the position of the buffers in memory and consistently perform around the same number (237K for uring and 236K for epoll). As far as I know, the recv/send memory copies are done in task work in userspace context, so I can't see why recv/send in kernel space would be worse. My closest guess is that in user space the send happens right after the recv, but with uring the recv is processed some time after the send.

I've considered using PROVIDE_BUFFERS to eliminate the randomness of the buffer addresses, but I would prefer not to incur extra overhead if possible. I've also tried setting thread affinity every time, and it actually makes performance worse, though more consistent.

So far my hypothesis seems to hold: performing all the sends in a separate loop from the recvs brings epoll performance down noticeably to 211K and increases the variability in performance. I'm wondering if anyone has any insight into this.
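For concreteness, here is a rough sketch (not the actual benchmark code; the buffer handling and sizes are assumptions) of the two epoll loop shapes being compared:

```c
/* Illustrative sketch only -- not the poster's actual benchmark. It shows
 * the two epoll loop shapes compared above: echoing inline after each recv
 * versus batching all recvs and then all sends. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>

#define MAX_EVENTS 64
#define BUF_SIZE   4096

/* Variant A: send immediately after each recv; the buffer is still hot
 * in cache when the send copies it back out (~236K in the tests above). */
static void echo_inline(int epfd)
{
    struct epoll_event events[MAX_EVENTS];
    static char buf[BUF_SIZE];
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

    for (int i = 0; i < n; i++) {
        ssize_t len = recv(events[i].data.fd, buf, BUF_SIZE, 0);
        if (len > 0)
            send(events[i].data.fd, buf, len, 0);
    }
}

/* Variant B: all recvs first, then all sends; by the time a buffer is
 * sent it has likely left cache (~211K and noisier in the tests above). */
static void echo_batched(int epfd)
{
    struct epoll_event events[MAX_EVENTS];
    static char bufs[MAX_EVENTS][BUF_SIZE];
    ssize_t lens[MAX_EVENTS];
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

    for (int i = 0; i < n; i++)
        lens[i] = recv(events[i].data.fd, bufs[i], BUF_SIZE, 0);
    for (int i = 0; i < n; i++)
        if (lens[i] > 0)
            send(events[i].data.fd, bufs[i], lens[i], 0);
}
```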
-
If you are able to use the newer ring-provided buffers rather than the older way of providing buffers, you can use provided buffers without really incurring any extra overhead.
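For reference, a minimal sketch of what the ring-provided setup looks like with liburing's buf-ring helpers (liburing 2.4+); BUF_COUNT, BUF_SIZE, and GROUP_ID are made-up values, not anything from this thread, and error handling is mostly elided:

```c
#include <liburing.h>
#include <errno.h>
#include <stdlib.h>

#define BUF_COUNT 256   /* must be a power of two */
#define BUF_SIZE  4096
#define GROUP_ID  1

static struct io_uring_buf_ring *br;
static char *bufs;

static int setup_buf_ring(struct io_uring *ring)
{
    int err;

    /* One contiguous slab backing all the buffers. */
    bufs = malloc((size_t)BUF_COUNT * BUF_SIZE);
    if (!bufs)
        return -ENOMEM;

    /* Allocates, maps, and registers the buffer ring in one call. */
    br = io_uring_setup_buf_ring(ring, BUF_COUNT, GROUP_ID, 0, &err);
    if (!br)
        return err;

    for (int i = 0; i < BUF_COUNT; i++)
        io_uring_buf_ring_add(br, bufs + i * BUF_SIZE, BUF_SIZE, i,
                              io_uring_buf_ring_mask(BUF_COUNT), i);
    io_uring_buf_ring_advance(br, BUF_COUNT);
    return 0;
}

static void queue_recv(struct io_uring *ring, int fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    /* No buffer is given here: the kernel picks one from GROUP_ID
     * when data actually arrives. */
    io_uring_prep_recv(sqe, fd, NULL, BUF_SIZE, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = GROUP_ID;
}
```

On completion, the chosen buffer id comes back in the CQE (`cqe->flags >> IORING_CQE_BUFFER_SHIFT`, valid when `IORING_CQE_F_BUFFER` is set), and returning the buffer to the pool is just another `io_uring_buf_ring_add()` plus `io_uring_buf_ring_advance()`, so unlike IORING_OP_PROVIDE_BUFFERS there is no extra SQE per recycled buffer.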
-
OK, I tried using different buffers per connection for epoll, and that seemed to bring the performance down to a point where it gets consistently outperformed by io_uring. I think my issue was that at a significantly higher number of connections (6144), the overhead of having that many sockets is the main bottleneck rather than the recv/send buffer addresses.