
native_queued_spin_lock_slowpath contention even after #215 fix #218

Closed
romange opened this issue Oct 2, 2020 · 35 comments

romange (Contributor) commented Oct 2, 2020

A bit different scenario:

When using IOSQE_IO_LINK for chaining SQEs, there is a contention bottleneck in
native_queued_spin_lock_slowpath (see below).

I ran it on the patched kernel from the tif-task_work branch.

To reproduce, I used my own echo server: https://github.com/romange/async/blob/master/examples/echo_server.cc
Binary (compiled on Ubuntu 20.04): https://drive.google.com/file/d/1hgimzG2Y9Olf1cVVUGyssF2pgx1QiBJ3/view?usp=sharing

To run on the server side: ./echo_server --logtostderr
To run on the client side: ./echo_server --connect -n 100000 -c 5

67.01% [kernel] [k] native_queued_spin_lock_slowpath
3.35% [kernel] [k] _raw_spin_lock
1.88% [kernel] [k] io_prep_async_work
1.01% [kernel] [k] io_dismantle_req
0.79% [ena] [k] ena_io_poll
0.61% [kernel] [k] fput_many
0.47% [kernel] [k] __io_free_req_finish
0.43% [kernel] [k] _copy_from_user
0.42% [ena] [k] ena_start_xmit
0.37% [kernel] [k] _raw_spin_lock_irq
0.36% [kernel] [k] irq_entries_start
0.30% [kernel] [k] copy_user_enhanced_fast_string
0.30% [kernel] [k] skb_release_data

PerfTop: 123479 irqs/sec kernel:91.4% exact: 0.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 36 CPUs)
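
For reference, the kind of SQE linking involved looks roughly like this. It is a minimal liburing sketch, not the actual echo_server.cc code: queue_linked_echo is a made-up helper name, and the ring, connected socket and buffer are assumed to be set up elsewhere.

#include <liburing.h>

/* Queue a recv linked to a send, so the send runs only after the recv completes. */
static int queue_linked_echo(struct io_uring *ring, int sockfd,
                             char *buf, unsigned len)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -1;
    io_uring_prep_recv(sqe, sockfd, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK;     /* chain the next SQE to this one */

    sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -1;
    io_uring_prep_send(sqe, sockfd, buf, len, 0);

    return io_uring_submit(ring);    /* submits the whole link at once */
}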

isilence (Collaborator) commented Oct 2, 2020

Is that specific to #215? My 8-CPU laptop doesn't show anything
like that, though that's without tif-task_work.

Could you find the exact call chain leading to that spinlock,
e.g. with perf top --call-graph dwarf?

romange (Contributor, Author) commented Oct 2, 2020

(attached screenshot from 2020-10-02 15-14-07: perf call-graph output)

romange (Contributor, Author) commented Oct 2, 2020

I noticed that it happens sporadically, sometimes on the client side and sometimes on the server side.

isilence (Collaborator) commented Oct 2, 2020

That helps, thanks! The culprit is spin_lock(&current->fs->lock).

There was a bug fix that, I guess, made this worse. Could
you try the diff below? It reverts that fix, so it's buggy,
but you shouldn't notice.
Edit: this is just to confirm that the problem is there.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index eed0d068904c..fa108a7b4fd4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5470,8 +5470,6 @@ static int io_req_defer_prep(struct io_kiocb *req,
 	if (unlikely(ret))
 		return ret;
 
-	io_prep_async_work(req);
-
 	switch (req->opcode) {
 	case IORING_OP_NOP:
 		break;

isilence (Collaborator) commented Oct 2, 2020

@axboe, btw, any chance you remember why {SEND,RCV}MSG need ->fs?
Is that for net address resolution?

axboe (Owner) commented Oct 2, 2020

Only for SCM_RIGHTS iirc, which is currently disabled explicitly.

axboe (Owner) commented Oct 2, 2020

You could probably also just try unshare(CLONE_FS) in the thread and see if that makes it better.
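
For illustration, a minimal sketch of that suggestion (worker here is a hypothetical function; the real event loop lives in echo_server.cc): calling unshare(CLONE_FS) at the start of each worker thread gives the thread its own fs_struct, so submissions no longer contend on one fs->lock shared by all threads.

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    /* Detach this thread's filesystem context from the one shared at clone time. */
    if (unshare(CLONE_FS) != 0)
        perror("unshare(CLONE_FS)");

    /* ... set up the io_uring instance and run the event loop as before ... */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, NULL);
    return 0;
}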

romange (Contributor, Author) commented Oct 2, 2020

Yes, both fixes seem to mitigate the issue.

isilence (Collaborator) commented Oct 2, 2020

Only for SCM_RIGHTS iirc, which is currently disabled explicitly.

Not sure I see it disabled anywhere on the io_uring side, but it
sounds like we can remove needs_fs from SENDMSG/RECVMSG.

isilence (Collaborator) commented Oct 2, 2020

@romange, thanks for testing! I'll see how to do something similar
to the diff; I need it in any case.

axboe (Owner) commented Oct 2, 2020

Not sure I see it disabled anywhere on the io_uring side, but it sounds like we can remove needs_fs from SENDMSG/RECVMSG.

It's done in __sys_sendmsg_sock():

/* disallow ancillary data requests from this path */
if (msg->msg_control || msg->msg_controllen)
        return -EINVAL;

So yes, it can be removed in io_uring, just needs a comment on why.
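
For context, SCM_RIGHTS is exactly the msg_control/cmsg usage that the check above rejects. A standard sendmsg(2) fd-passing sketch (plain socket code, not io_uring; send_fd is an illustrative helper) looks like this:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass one file descriptor over a connected AF_UNIX socket via SCM_RIGHTS. */
static int send_fd(int sock, int fd_to_pass)
{
    char data = 'x';                        /* need at least one byte of payload */
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,               /* non-empty msg_control is what */
        .msg_controllen = sizeof(u.buf),    /* __sys_sendmsg_sock() disallows */
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}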

axboe (Owner) commented Oct 2, 2020

It's done in __sys_sendmsg_sock()

And ditto __sys_recvmsg_sock(), of course, for the receive side.

isilence (Collaborator) commented Oct 2, 2020

@axboe, got it, thanks! It seems this would be
the fastest way to fix the contention. I'll cook it.

isilence (Collaborator) commented Oct 4, 2020

Hmm, I can't find it used anywhere near SCM_RIGHTS or net.
I even tried with a slightly modified kernel -- it passes an fd via io_uring fine
with ->fs explicitly set to NULL.

It seems needs_fs there is just for safety now.

romange (Contributor, Author) commented Nov 1, 2020

@axboe Thanks for pushing this forward. I've seen that you submitted part of the fixes into the 5.10 branch and some into 5.11.

Could you please tell me whether 5.10 will reduce the contention, or whether 5.11 is necessary?

isilence (Collaborator) commented Nov 1, 2020

axboe (Owner) commented Nov 1, 2020

There are really two things here. Pavel is covering the second one, and that will be fine in 5.10. The TWA_SIGNAL-based contention fix will land in 5.11, but I plan on pushing it to -stable as well once it lands in 5.11-rc.

romange (Contributor, Author) commented Nov 1, 2020 via email

romange (Contributor, Author) commented Nov 2, 2020

Unfortunately, it seems that 5.10-rc2 still has the contention problem with native_queued_spin_lock_slowpath. I'll wait for 5.11 then.

(attached screenshot from 2020-11-02 09-02-44: profiler output showing the contention)

isilence (Collaborator) commented Nov 2, 2020

@romange, the previous path is gone, but this may still happen if requests go through io-wq or if another spinlock is contended.

Could you again capture the full stack trace, e.g. via perf top --call-graph dwarf? That would be very helpful!

isilence (Collaborator) commented Nov 2, 2020

P.S. IIRC, there is nothing in for-5.11 yet that would help with this particular contention.

romange (Contributor, Author) commented Nov 2, 2020

I haven't managed to find anything meaningful. The dwarf option throws lots of errors like
addr2line: DWARF error: could not find variable specification at offset 2ac2f
and anyway, with stack-trace sampling, native_queued_spin_lock_slowpath is not shown at the top.

I will keep checking. I want to add that I did build a kernel from one of Jens' branches, I think tif-task_work, and it did help, so I assumed this would be submitted into 5.11.

romange (Contributor, Author) commented Nov 2, 2020

@isilence Pavel, I recorded the profile data with perf record -F99 --kernel-callchains -C0 -g. Unfortunately, native_queued_spin_lock_slowpath is dispersed across many stack traces there. If you have any tips on how to improve the sampling method, please tell me. For now I am sharing what I got here:
https://drive.google.com/file/d/1Qlb19pgP5biJZy8PnyrfM7S4T4ueQgiw/view?usp=sharing

You can see all the call chains with perf report --stdio.

isilence (Collaborator) commented Nov 2, 2020

tif-task_work has nothing to do with the spinlock that we located last time, so I guess it's just another one.
You may try the patch below to be sure. I haven't sent it out yet, but I'm going to.

commit 2e98fb75d85c2c3b0ccb690ff138395d329a0698
Author: Pavel Begunkov <[email protected]>
Date:   Fri Oct 2 22:36:54 2020 +0300

    io_uring: no need in fs for {recv,send}msg
    
    SENDMSG and RECVMSG don't actually need ->fs, it's used only for
    SCM_RIGHTS, which is disallowed by __sys_{send,recv}msg_sock().
    Remove ->needs_fs for them because taking fs->lock may become pretty
    contended if a lot of requests are going through io-wq (or just being
    async prepared).
    
    Signed-off-by: Pavel Begunkov <[email protected]>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 80be2184f4a5..57709a268eff 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -849,8 +849,7 @@ static const struct io_op_def io_op_defs[] = {
 		.pollout		= 1,
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_async_msghdr),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FS,
+		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_RECVMSG] = {
 		.needs_file		= 1,
@@ -859,8 +858,7 @@ static const struct io_op_def io_op_defs[] = {
 		.buffer_select		= 1,
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_async_msghdr),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FS,
+		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_TIMEOUT] = {
 		.needs_async_data	= 1,

isilence (Collaborator) commented Nov 2, 2020

I haven't managed to find anything meaningful. The dwarf option throws lots of errors like
addr2line: DWARF error: could not find variable specification at offset 2ac2f
and anyway, with stack-trace sampling, native_queued_spin_lock_slowpath is not shown at the top.

I don't know at a glance what the issue is, but I'd update libdwarf and build a new perf
(see tools/perf in the Linux kernel source tree).

isilence (Collaborator) commented Nov 2, 2020

@isilence Pavel, I recorded the profile data with perf record -F99 --kernel-callchains -C0 -g. Unfortunately, native_queued_spin_lock_slowpath is dispersed across many stack traces there. If you have any tips on how to improve the sampling method, please tell me. For now I am sharing what I got here:
https://drive.google.com/file/d/1Qlb19pgP5biJZy8PnyrfM7S4T4ueQgiw/view?usp=sharing

You can see all the call chains with perf report --stdio.

In general, perf.data files are not very portable: even if my perf didn't say that the format is incompatible, the kernel map is usually needed as well.

romange (Contributor, Author) commented Nov 2, 2020

OK, would it help if I attached the perf report --stdio output here, or would you prefer some other report?

axboe (Owner) commented Nov 2, 2020

OK, would it help if I attached the perf report --stdio output here, or would you prefer some other report?

That would certainly be better - remember to use -g --no-children for perf report, which makes it easier to read. Pavel seems convinced that this is something new, but did you run tif-task_work.arch and see any difference? The task_work-with-TWA_SIGNAL issue for threaded applications would show exactly this kind of contention; that's basically where we started with #215. But the perf report would be crucial in seeing what's going on here, and whether it's the signal lock or something else.

isilence (Collaborator) commented Nov 2, 2020

By "new" I mean that it's different from what we've seen before in this issue (i.e. spinlocking on fs). It's probably tif-task_work as you pointed that it was struggling in native_queued_spin_lock_slowpath as well, but sure better to double check.

romange (Contributor, Author) commented Nov 2, 2020

I am currently struggling to build a bootable Linux from source; I am checking out the for-next branch.
Meanwhile, here is the report with -g --no-children. It does not look very helpful.

# Event count (approx.): 1508210043900
#
# Overhead  Command          Shared Object               Symbol                                                                                                                                                   >
# ........  ...............  ..........................  .........................................................................................................................................................>
#
     0.92%  Proactor6        [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.86%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.90%  Proactor19       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.85%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.86%  Proactor10       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.79%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.84%  Proactor14       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.78%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.84%  Proactor13       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.78%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.84%  Proactor8        [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.76%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.83%  Proactor15       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.78%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.82%  Proactor31       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.77%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64


romange (Contributor, Author) commented Nov 2, 2020

"for-next" branch https://git.kernel.dk/cgit/linux-block/log/?h=for-next does not have native_queued_spin_lock_slowpath contention but it has other spinlock functions stand out.

(attached screenshot: profiler output showing the remaining spinlock functions)

axboe (Owner) commented Nov 2, 2020

OK, I think that settles it: the newly reported case is indeed the #215 case. The contention you're seeing now comes off the socket, and that's a generic thing that would apply when using epoll etc. as well. There might be room for improvement there, but that's outside the scope of io_uring.

romange (Contributor, Author) commented Nov 2, 2020

Oh, I just noticed - I've been commenting on the wrong issue. I meant to comment yesterday on #215 and not on this one. Sorry for the confusion I caused.

To conclude: #218 is fixed in 5.10, and #215 is waiting for 5.11. Once the 5.11 release window opens, Jens will try to cherry-pick the fix for #215 into 5.10.

axboe (Owner) commented Nov 2, 2020

To conclude: #218 is fixed in 5.10, and #215 is waiting for 5.11. Once the 5.11 release window opens, Jens will try to cherry-pick the fix for #215 into 5.10.

Exactly!

axboe (Owner) commented Nov 2, 2020

Closing this one out.

axboe closed this as completed Nov 2, 2020