
native_queued_spin_lock_slowpath contention even after #215 fix #218

Closed
romange opened this issue Oct 2, 2020 · 35 comments

romange (Contributor) commented Oct 2, 2020

A bit different scenario:

When using IOSQE_IO_LINK for chaining SQEs, there is a contention bottleneck in
native_queued_spin_lock_slowpath (see below).

I ran it on the patched kernel from the tif-task_work branch.

To reproduce, I used my own echo server: https://github.com/romange/async/blob/master/examples/echo_server.cc
Binary (compiled on Ubuntu 20.04): https://drive.google.com/file/d/1hgimzG2Y9Olf1cVVUGyssF2pgx1QiBJ3/view?usp=sharing

To run on the server side: ./echo_server --logtostderr
To run on the client side: ./echo_server --connect -n 100000 -c 5

67.01% [kernel] [k] native_queued_spin_lock_slowpath
3.35% [kernel] [k] _raw_spin_lock
1.88% [kernel] [k] io_prep_async_work
1.01% [kernel] [k] io_dismantle_req
0.79% [ena] [k] ena_io_poll
0.61% [kernel] [k] fput_many
0.47% [kernel] [k] __io_free_req_finish
0.43% [kernel] [k] _copy_from_user
0.42% [ena] [k] ena_start_xmit
0.37% [kernel] [k] _raw_spin_lock_irq
0.36% [kernel] [k] irq_entries_start
0.30% [kernel] [k] copy_user_enhanced_fast_string
0.30% [kernel] [k] skb_release_data

PerfTop: 123479 irqs/sec kernel:91.4% exact: 0.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 36 CPUs)
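
For reference, the kind of SQE linking involved looks roughly like this. It is a minimal liburing sketch, not the actual echo_server.cc code: queue_linked_echo is a made-up helper name, and the ring, connected socket and buffer are assumed to be set up elsewhere.

#include <liburing.h>

/* Queue a recv linked to a send, so the send runs only after the recv completes. */
static int queue_linked_echo(struct io_uring *ring, int sockfd,
                             char *buf, unsigned len)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -1;
    io_uring_prep_recv(sqe, sockfd, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK;     /* chain the next SQE to this one */

    sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -1;
    io_uring_prep_send(sqe, sockfd, buf, len, 0);

    return io_uring_submit(ring);    /* submits the whole link at once */
}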

isilence (Collaborator) commented Oct 2, 2020

Is that specific to #215? My 8-CPU laptop doesn't show anything
like that, though that's without tif-task_work.

Could you find the exact call chain leading to that spinlock,
e.g. with perf top --call-graph dwarf?

romange (Contributor, Author) commented Oct 2, 2020

(attached screenshot from 2020-10-02 15-14-07: perf call-graph output)

romange (Contributor, Author) commented Oct 2, 2020

I noticed that it happens sporadically, sometimes on the client side and sometimes on the server side.

isilence (Collaborator) commented Oct 2, 2020

That helps, thanks! The culprit is spin_lock(&current->fs->lock).

There was a bug fix that, I guess, made this worse. Could
you try the diff below? It reverts that fix, so it's buggy,
but you shouldn't notice.
Edit: this is just to confirm that the problem is there.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index eed0d068904c..fa108a7b4fd4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5470,8 +5470,6 @@ static int io_req_defer_prep(struct io_kiocb *req,
 	if (unlikely(ret))
 		return ret;
 
-	io_prep_async_work(req);
-
 	switch (req->opcode) {
 	case IORING_OP_NOP:
 		break;

isilence (Collaborator) commented Oct 2, 2020

@axboe, btw, any chance you remember why {SEND,RCV}MSG need ->fs?
Is that for net address resolution?

axboe (Owner) commented Oct 2, 2020

Only for SCM_RIGHTS iirc, which is currently disabled explicitly.

axboe (Owner) commented Oct 2, 2020

You could probably also just try unshare(CLONE_FS) in the thread and see if that makes it better.
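
For illustration, a minimal sketch of that suggestion (worker here is a hypothetical function; the real event loop lives in echo_server.cc): calling unshare(CLONE_FS) at the start of each worker thread gives the thread its own fs_struct, so submissions no longer contend on one fs->lock shared by all threads.

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    /* Detach this thread's filesystem context from the one shared at clone time. */
    if (unshare(CLONE_FS) != 0)
        perror("unshare(CLONE_FS)");

    /* ... set up the io_uring instance and run the event loop as before ... */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, NULL);
    return 0;
}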

romange (Contributor, Author) commented Oct 2, 2020

Yes, both fixes seem to mitigate the issue.

isilence (Collaborator) commented Oct 2, 2020

Only for SCM_RIGHTS iirc, which is currently disabled explicitly.

Not sure I see it disabled anywhere on the io_uring side, but it
sounds like we can remove needs_fs from SENDMSG/RECVMSG.

isilence (Collaborator) commented Oct 2, 2020

@romange, thanks for testing! I'll see how to do something similar
to the diff; I need it in any case.

axboe (Owner) commented Oct 2, 2020

Not sure I see it disabled anywhere on the io_uring side, but it sounds like we can remove needs_fs from SENDMSG/RECVMSG.

It's done in __sys_sendmsg_sock():

/* disallow ancillary data requests from this path */
if (msg->msg_control || msg->msg_controllen)
        return -EINVAL;

So yes, it can be removed in io_uring, just needs a comment on why.
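
For context, SCM_RIGHTS is exactly the msg_control/cmsg usage that the check above rejects. A standard sendmsg(2) fd-passing sketch (plain socket code, not io_uring; send_fd is an illustrative helper) looks like this:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass one file descriptor over a connected AF_UNIX socket via SCM_RIGHTS. */
static int send_fd(int sock, int fd_to_pass)
{
    char data = 'x';                        /* need at least one byte of payload */
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,               /* non-empty msg_control is what */
        .msg_controllen = sizeof(u.buf),    /* __sys_sendmsg_sock() disallows */
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}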

axboe (Owner) commented Oct 2, 2020

It's done in __sys_sendmsg_sock()

And ditto __sys_recvmsg_sock(), of course, for the receive side.

isilence (Collaborator) commented Oct 2, 2020

@axboe, got it, thanks! It seems this would be
the fastest way to fix the contention. I'll cook it.

isilence (Collaborator) commented Oct 4, 2020

Hmm, I can't find it used anywhere near SCM_RIGHTS or net.
I even tried with a slightly modified kernel -- it passes an fd via io_uring fine
with ->fs explicitly set to NULL.

It seems needs_fs there is just for safety now.

romange (Contributor, Author) commented Nov 1, 2020

@axboe Thanks for pushing this forward. I've seen that you submitted part of the fixes into the 5.10 branch and some into 5.11.

Could you please tell me whether 5.10 will reduce the contention, or whether 5.11 is necessary?

isilence (Collaborator) commented Nov 1, 2020

axboe (Owner) commented Nov 1, 2020

There are really two things here. Pavel is covering the second one, and that will be fine in 5.10. The TWA_SIGNAL-based contention fix will land in 5.11, but I plan on pushing it to -stable as well once it lands in 5.11-rc.

romange (Contributor, Author) commented Nov 1, 2020 via email

romange (Contributor, Author) commented Nov 2, 2020

Unfortunately, it seems that 5.10-rc2 still has the contention problem with native_queued_spin_lock_slowpath. I'll wait for 5.11 then.

(attached screenshot from 2020-11-02 09-02-44: profiler output showing the contention)

isilence (Collaborator) commented Nov 2, 2020

@romange, the previous path is gone, but this may still happen if requests go through io-wq or if another spinlock is contended.

Could you again capture the full stack trace, e.g. via perf top --call-graph dwarf? That would be very helpful!

isilence (Collaborator) commented Nov 2, 2020

P.S. IIRC, there is nothing in for-5.11 yet that would help with this particular contention.

romange (Contributor, Author) commented Nov 2, 2020

I haven't managed to find anything meaningful. The dwarf option throws lots of errors like
addr2line: DWARF error: could not find variable specification at offset 2ac2f
and anyway, with stack-trace sampling, native_queued_spin_lock_slowpath is not shown at the top.

I will keep checking. I want to add that I did build a kernel from one of Jens' branches, I think tif-task_work, and it did help, so I assumed this would be submitted into 5.11.

romange (Contributor, Author) commented Nov 2, 2020

@isilence Pavel, I recorded the profile data with perf record -F99 --kernel-callchains -C0 -g. Unfortunately, native_queued_spin_lock_slowpath is dispersed across many stack traces there. If you have any tips on how to improve the sampling method, please tell me. For now I am sharing what I got here:
https://drive.google.com/file/d/1Qlb19pgP5biJZy8PnyrfM7S4T4ueQgiw/view?usp=sharing

You can see all the call chains with perf report --stdio.

isilence (Collaborator) commented Nov 2, 2020

tif-task_work has nothing to do with the spinlock that we located last time, so I guess it's just another one.
You may try the patch below to be sure. I haven't sent it out yet, but I'm going to.

commit 2e98fb75d85c2c3b0ccb690ff138395d329a0698
Author: Pavel Begunkov <[email protected]>
Date:   Fri Oct 2 22:36:54 2020 +0300

    io_uring: no need in fs for {recv,send}msg
    
    SENDMSG and RECVMSG don't actually need ->fs, it's used only for
    SCM_RIGHTS, which is disallowed by __sys_{send,recv}msg_sock().
    Remove ->needs_fs for them because taking fs->lock may become pretty
    contended if a lot of requests are going through io-wq (or just being
    async prepared).
    
    Signed-off-by: Pavel Begunkov <[email protected]>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 80be2184f4a5..57709a268eff 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -849,8 +849,7 @@ static const struct io_op_def io_op_defs[] = {
 		.pollout		= 1,
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_async_msghdr),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FS,
+		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_RECVMSG] = {
 		.needs_file		= 1,
@@ -859,8 +858,7 @@ static const struct io_op_def io_op_defs[] = {
 		.buffer_select		= 1,
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_async_msghdr),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FS,
+		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_TIMEOUT] = {
 		.needs_async_data	= 1,

isilence (Collaborator) commented Nov 2, 2020

I haven't managed to find anything meaningful. The dwarf option throws lots of errors like
addr2line: DWARF error: could not find variable specification at offset 2ac2f
and anyway, with stack-trace sampling, native_queued_spin_lock_slowpath is not shown at the top.

I don't know at a glance what the issue is, but I'd update libdwarf and build a new perf
(see tools/perf in the Linux kernel source tree).

isilence (Collaborator) commented Nov 2, 2020

@isilence Pavel, I recorded the profile data with perf record -F99 --kernel-callchains -C0 -g. Unfortunately, native_queued_spin_lock_slowpath is dispersed across many stack traces there. If you have any tips on how to improve the sampling method, please tell me. For now I am sharing what I got here:
https://drive.google.com/file/d/1Qlb19pgP5biJZy8PnyrfM7S4T4ueQgiw/view?usp=sharing

You can see all the call chains with perf report --stdio.

In general, perf.data files are not very portable: even if my perf didn't say that the format is incompatible, the kernel map is usually needed as well.

romange (Contributor, Author) commented Nov 2, 2020

OK, would it help if I attached the perf report --stdio output here, or would you prefer some other report?

axboe (Owner) commented Nov 2, 2020

OK, would it help if I attached the perf report --stdio output here, or would you prefer some other report?

That would certainly be better - remember to use -g --no-children for perf report, which makes it easier to read. Pavel seems convinced that this is something new, but did you run tif-task_work.arch and see any difference? The task_work-with-TWA_SIGNAL issue for threaded applications would show exactly this kind of contention; that's basically where we started with #215. But the perf report would be crucial in seeing what's going on here, and whether it's the signal lock or something else.

isilence (Collaborator) commented Nov 2, 2020

By "new" I mean that it's different from what we've seen before in this issue (i.e. spinlocking on fs). It's probably tif-task_work as you pointed that it was struggling in native_queued_spin_lock_slowpath as well, but sure better to double check.

romange (Contributor, Author) commented Nov 2, 2020

I am currently struggling to build a bootable Linux from source; I am checking out the for-next branch.
Meanwhile, here is the report with -g --no-children. It does not look very helpful.

# Event count (approx.): 1508210043900
#
# Overhead  Command          Shared Object               Symbol                                                                                                                                                   >
# ........  ...............  ..........................  .........................................................................................................................................................>
#
     0.92%  Proactor6        [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.86%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.90%  Proactor19       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.85%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.86%  Proactor10       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.79%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.84%  Proactor14       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.78%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.84%  Proactor13       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.78%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.84%  Proactor8        [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.76%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.83%  Proactor15       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.78%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64

     0.82%  Proactor31       [kernel.kallsyms]           [k] native_queued_spin_lock_slowpath
            |          
             --0.77%--entry_SYSCALL_64_after_hwframe
                       do_syscall_64


romange (Contributor, Author) commented Nov 2, 2020

"for-next" branch https://git.kernel.dk/cgit/linux-block/log/?h=for-next does not have native_queued_spin_lock_slowpath contention but it has other spinlock functions stand out.

(attached screenshot: profiler output showing the remaining spinlock functions)

axboe (Owner) commented Nov 2, 2020

OK, I think that settles it: the newly reported case is indeed the #215 case. The contention you're seeing now comes off the socket, and that's a generic thing that would apply when using epoll etc. as well. There might be room for improvement there, but that's outside the scope of io_uring.

romange (Contributor, Author) commented Nov 2, 2020

Oh, I just noticed - I've been commenting on the wrong issue. I meant to comment yesterday on #215 and not on this one. Sorry for the confusion I caused.

To conclude: #218 is fixed in 5.10, and #215 is waiting for 5.11. Once the 5.11 release window opens, Jens will try to cherry-pick the fix for #215 into 5.10.

axboe (Owner) commented Nov 2, 2020

To conclude: #218 is fixed in 5.10, and #215 is waiting for 5.11. Once the 5.11 release window opens, Jens will try to cherry-pick the fix for #215 into 5.10.

Exactly!

axboe (Owner) commented Nov 2, 2020

Closing this one out.

axboe closed this as completed Nov 2, 2020