write IO hangs when uiomove triggers page fault in zfs_write context. #7512

Closed
YanyunGao opened this issue May 7, 2018 · 9 comments
Labels: Status: Stale (no recent activity for issue)

Comments

YanyunGao (Contributor) commented May 7, 2018

System information

Type                    Version/Name
Distribution Name       CentOS
Distribution Version    7.1
Linux Kernel            3.10.0-514.26.2.el7.x86_64
Architecture
ZFS Version             0.6.5.6
SPL Version             0.6.5.6

Describe the problem you're observing

Write I/O hangs on a pool, with many backtraces in the system log. One write request is stuck in a page_fault context, which prevents the quiesce thread from ever completing. As a result, the following txg can never be synced.

Describe how to reproduce the problem

We hit the problem twice in one week but do not yet know how to reproduce it.
The workload mixes mmap with regular read/write operations. We noticed that the problem seems easier to reproduce when little free memory is left.

Include any warning/errors/backtraces from the system logs

May 7 14:39:28 xt4 kernel: ffff88074283b918 0000000000000086 ffff880814623ec0 ffff88074283bfd8
May 7 14:39:28 xt4 kernel: ffff88074283bfd8 ffff88074283bfd8 ffff880814623ec0 ffff880814623ec0
May 7 14:39:28 xt4 kernel: ffff88046d70de38 ffffffffffffffff ffff88046d70de40 ffff88046d70ddc0
May 7 14:39:28 xt4 kernel: Call Trace:
May 7 14:39:28 xt4 kernel: [] schedule+0x29/0x70
May 7 14:39:28 xt4 kernel: [] rwsem_down_read_failed+0xf5/0x170
May 7 14:39:28 xt4 kernel: [] call_rwsem_down_read_failed+0x18/0x30
May 7 14:39:28 xt4 kernel: [] ? spl_kmem_zalloc+0xc0/0x170 [spl]
May 7 14:39:28 xt4 kernel: [] down_read+0x20/0x30
May 7 14:39:28 xt4 kernel: [] __do_page_fault+0x380/0x450
May 7 14:39:28 xt4 kernel: [] do_page_fault+0x35/0x90
May 7 14:39:28 xt4 kernel: [] page_fault+0x28/0x30
May 7 14:39:28 xt4 kernel: [] ? spl_kmem_zalloc+0xc0/0x170 [spl]
May 7 14:39:28 xt4 kernel: [] ? copy_user_enhanced_fast_string+0x9/0x20
May 7 14:39:28 xt4 kernel: [] ? uiomove+0x155/0x2b0 [zcommon]
May 7 14:39:28 xt4 kernel: [] dmu_write_uio_dnode+0x90/0x150 [zfs]
May 7 14:39:28 xt4 kernel: [] dmu_write_uio_dbuf+0x4f/0x70 [zfs]
May 7 14:39:28 xt4 kernel: [] zfs_write+0xb31/0xc40 [zfs]
May 7 14:39:28 xt4 kernel: [] ? nvlist_lookup_common.part.71+0xa2/0xb0 [znvpair]
May 7 14:39:28 xt4 kernel: [] ? nvlist_lookup_byte_array+0x26/0x30 [znvpair]
May 7 14:39:28 xt4 kernel: [] ? strfree+0xe/0x10 [spl]
May 7 14:39:28 xt4 kernel: [] ? vfs_getxattr+0x88/0xb0
May 7 14:39:28 xt4 kernel: [] zpl_write_common_iovec.constprop.8+0x95/0x100 [zfs]
May 7 14:39:28 xt4 kernel: [] zpl_write+0x7e/0xb0 [zfs]
May 7 14:39:28 xt4 kernel: [] vfs_write+0xbd/0x1e0
May 7 14:39:28 xt4 kernel: [] SyS_pwrite64+0x92/0xc0
May 7 14:39:28 xt4 kernel: [] system_call_fastpath+0x16/0x1b

May 7 14:39:28 xt4 kernel: glusterfsd D 0000000000000000 0 28771 29196 0x00000080
May 7 14:39:28 xt4 kernel: ffff880561ffb9b8 0000000000000086 ffff8808652b5e20 ffff880561ffbfd8
May 7 14:39:28 xt4 kernel: ffff880561ffbfd8 ffff880561ffbfd8 ffff8808652b5e20 ffff88046f406b68
May 7 14:39:28 xt4 kernel: ffff88046f406a20 ffff88046f406b70 ffff88046f406a48 0000000000000000
May 7 14:39:28 xt4 kernel: Call Trace:
May 7 14:39:28 xt4 kernel: [] schedule+0x29/0x70
May 7 14:39:28 xt4 kernel: [] cv_wait_common+0x125/0x150 [spl]
May 7 14:39:28 xt4 kernel: [] ? wake_up_atomic_t+0x30/0x30
May 7 14:39:28 xt4 kernel: [] __cv_wait+0x15/0x20 [spl]
May 7 14:39:28 xt4 kernel: [] txg_wait_open+0xc3/0x110 [zfs]
May 7 14:39:28 xt4 kernel: [] dmu_tx_wait+0x3a8/0x3c0 [zfs]
May 7 14:39:28 xt4 kernel: [] dmu_tx_assign+0x9a/0x510 [zfs]
May 7 14:39:28 xt4 kernel: [] zfs_dirty_inode+0xf7/0x320 [zfs]
May 7 14:39:28 xt4 kernel: [] ? dequeue_entity+0x11c/0x5d0
May 7 14:39:28 xt4 kernel: [] ? __d_lookup+0x120/0x160
May 7 14:39:28 xt4 kernel: [] ? dequeue_task_fair+0x41e/0x660
May 7 14:39:28 xt4 kernel: [] ? sched_clock_cpu+0x85/0xc0
May 7 14:39:28 xt4 kernel: [] ? __switch_to+0x15c/0x4c0
May 7 14:39:28 xt4 kernel: [] zpl_dirty_inode+0x2c/0x40 [zfs]
May 7 14:39:28 xt4 kernel: [] __mark_inode_dirty+0xca/0x290
May 7 14:39:28 xt4 kernel: [] update_time+0x81/0xd0
May 7 14:39:28 xt4 kernel: [] ? __sb_start_write+0x58/0x110
May 7 14:39:28 xt4 kernel: [] file_update_time+0xa0/0xf0
May 7 14:39:28 xt4 kernel: [] filemap_page_mkwrite+0x40/0xb0
May 7 14:39:28 xt4 kernel: [] do_page_mkwrite+0x54/0xa0
May 7 14:39:28 xt4 kernel: [] do_shared_fault.isra.46+0x77/0x1e0
May 7 14:39:28 xt4 kernel: [] handle_mm_fault+0x61e/0x1000
May 7 14:39:28 xt4 kernel: [] ? do_futex+0x122/0x5b0
May 7 14:39:28 xt4 kernel: [] __do_page_fault+0x154/0x450
May 7 14:39:28 xt4 kernel: [] do_page_fault+0x35/0x90
May 7 14:39:28 xt4 kernel: [] page_fault+0x28/0x30

gmelikov (Member) commented May 7, 2018

@YanyunGao please try the latest stable version, 0.7.8, first.

javenwu (Contributor) commented May 7, 2018

@gmelikov Thanks for the quick reply. Could you point us to which potential fixes are relevant to this issue?
We did a basic analysis; it seems dmu_write_uio_dnode() can still trigger the page fault in the newest code. As long as the page fault is triggered in the dmu_write_uio_dnode()->uiomove() context, I assume the issue is still there. It is really not easy to reproduce, and we have no idea how to trigger it consistently. Two weeks ago we hit the problem twice, but we have not triggered it again in subsequent testing.

YanyunGao (Contributor, Author) commented May 7, 2018

I noticed that there is a comment in dmu.c, in dmu_write_uio_dnode():

     /*
      * XXX uiomove could block forever (eg.nfs-backed
      * pages).  There needs to be a uiolockdown() function
      * to lock the pages in memory, so that uiomove won't
      * block.
      */
     err = uiomove((char *)db->db_data + bufoff, tocpy,
         UIO_WRITE, uio);

It seems we hit exactly that case: uiomove blocks forever. The issue report lists two backtraces. One is the page_fault context, the other is the write syscall context. The page_fault context holds the semaphore but is stuck in txg_wait_open (waiting for the quiescing txg to complete; unfortunately, that txg never completes because of the other backtrace). The write syscall context has increased tc_count on the quiescing txg but fails to down_read the semaphore, which in turn blocks the quiesce thread. It looks like a deadlock.
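
To make the ordering concrete, here is a heavily simplified sketch of the write path as we read it in the 0.6.5 code (paraphrased, not the exact source; error handling, range locking, and SA holds are omitted):

     /*
      * Heavily simplified sketch of the zfs_write() write loop
      * (paraphrased from our reading of 0.6.5, not the exact source).
      */
     tx = dmu_tx_create(zfsvfs->z_os);
     dmu_tx_hold_write(tx, zp->z_id, woff, nbytes);

     /*
      * Assigning the tx takes a hold on the open txg "n" (tc_count is
      * bumped); the txg cannot quiesce until dmu_tx_commit() drops it.
      */
     error = dmu_tx_assign(tx, TXG_WAIT);

     /*
      * dmu_write_uio_dbuf() -> dmu_write_uio_dnode() -> uiomove()
      * copies from the user buffer.  If the prefaulted page has been
      * evicted, the copy faults here and the fault handler must take
      * mm_sem -- while txg "n" is still held by this thread.
      */
     error = dmu_write_uio_dbuf(sa_get_db(zp->z_sa_hdl), woff,
         nbytes, uio, tx);

     dmu_tx_commit(tx);     /* only now is the hold on txg "n" dropped */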

gmelikov (Member) commented May 7, 2018

@javenwu

We noticed that the problem seems easier to reproduce when little free memory is left.

The 0.7 branch has a large number of changes, and one of them is ABD (7657def). If a problem is easier to reproduce when little free memory is left, then this patch is a must-have.

And secondly, I think it is good practice to try the latest stable version first.

You can look at #7339 too, it looks similar.

YanyunGao (Contributor, Author) commented:
@gmelikov I noticed a comment in #7339; what was fixed there is that calling uiomove() in mappedread() under the page lock can result in a deadlock if the user-space page needs to be faulted in.

YanyunGao (Contributor, Author) commented:
@trisk @behlendorf

javenwu (Contributor) commented May 8, 2018

Thanks for pointing out #7339; it is relevant, but it won't fix our problem. The interesting thing is that we may be hitting the same symptoms @trisk mentioned in the comments of #7339 for the case where the prefault code is removed. In our case, prefault is definitely called, so I think the page is being evicted after the prefault. I saw there is a proposal to create a separate issue to track this kind of problem.

javenwu (Contributor) commented Aug 17, 2018

After diving into the details of the problem, we now understand the issue more clearly. This is a deadlock involving the mm_sem and two thread contexts of the same process. The application has two threads: one is doing a write syscall, the other is manipulating a memory address that is mapped to a file.
Let's call the write syscall thread context #1, and the page-fault context triggered by accessing the file-mapped memory address context #2.
The time sequence is as follows:

  1. Context #1 enters zfs_write() and gets txg "n" assigned.
  2. Context #2, with mm_sem held, fails to get a txg assigned in zfs_dirty_inode and waits for the previous txg "n" to complete.
  3. In context #1, a prefaulted page is faulted again inside uiomove. The page-fault handler is suspended on mm_sem (the process's mm_sem is already held by context #2), so txg "n" cannot complete.
  4. Context #2 will never get txg "n+1" assigned because txg "n" will not complete.

So context #1 and context #2 are trapped in a deadlock.

If my understanding is correct, the prefaulted page being faulted again in step 3 is the culprit behind the deadlock.
I noticed that there is a function __probe_kernel_read() in the kernel:
---%<---
long __probe_kernel_read(void *dst, const void *src, size_t size)
{
	long ret;
	mm_segment_t old_fs = get_fs();

	set_fs(KERNEL_DS);
	pagefault_disable();
	ret = __copy_from_user_inatomic(dst,
			(__force const void __user *)src, size);
	pagefault_enable();
	set_fs(old_fs);

	return ret ? -EFAULT : 0;
}
---%<---
I was thinking we could modify the uiomove implementation a little bit: do the copy_*_user call inside a pagefault_disable() section and return EFAULT if the page is not resident. When EFAULT is returned in the zfs_write() context, we can unassign txg "n", invoke the prefault again, and then reassign and retry. In theory, every page used by uiomove would then already have been faulted in by the prefault. A rough sketch is below.
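
The sketch below only illustrates the idea; names such as dmu_write_uio_dbuf_nofault() and dmu_tx_unassign() are hypothetical, and whether the txg should be unassigned or simply committed before the retry is an open question.
---%<---
for (;;) {
	error = dmu_tx_assign(tx, TXG_WAIT);        /* take a hold on txg "n" */
	if (error != 0)
		break;

	/*
	 * Hypothetical "no-fault" copy: wraps the copy_from_user call in
	 * pagefault_disable()/pagefault_enable(), like __probe_kernel_read(),
	 * and returns EFAULT instead of sleeping on mm_sem when the user
	 * page is not resident.
	 */
	error = dmu_write_uio_dbuf_nofault(db, woff, nbytes, uio, tx);
	if (error != EFAULT) {
		dmu_tx_commit(tx);                  /* success, or a real error */
		break;
	}

	/*
	 * Drop the hold on txg "n" before touching the user pages, so the
	 * txg can quiesce even if faulting the pages back in has to block.
	 */
	dmu_tx_unassign(tx);                        /* hypothetical helper */
	uio_prefaultpages(nbytes, uio);             /* fault the pages back in */
	/* loop: reassign the txg and retry the copy */
}
---%<---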

What do you guys think?

wgqimut pushed a commit to wgqimut/zfs that referenced this issue Sep 21, 2018
The bug time sequence:
1. context #1: zfs_write assigns a txg "n".
2. In the same process, context #2: an mmap page fault (which means `mm_sem`
   is held) occurs; `zfs_dirty_inode` fails to open a txg and waits for the
   previous txg "n" to complete.
3. context #1 calls `uiomove` to write; however, a page fault occurs in `uiomove`,
   which means it needs `mm_sem`, but `mm_sem` is held by context #2, so it is
   stuck and can't complete, and then txg "n" will not complete.

So context #1 and context #2 trap into the "dead lock".

Signed-off-by: Grady Wong <[email protected]>
Closes #7512
stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the "Status: Stale" (no recent activity for issue) label Aug 25, 2020
stale bot closed this as completed Nov 23, 2020