Fix distribution detection so that Ubuntu does not identify as Debian #1

Merged
ryao merged 1 commit into master on Feb 28, 2012

Conversation

@ryao ryao (Owner) commented Feb 28, 2012

This is meant to bring things into sync with the build system changes made in the corresponding zfsonlinux/spl pull request:

openzfs/spl#86

@ryao ryao merged this pull request into master Feb 28, 2012
ryao added a commit that referenced this pull request Feb 27, 2014
ZoL commit 1421c89 unintentionally changed the disk format in a forward-
compatible, but not backward-compatible, way. This was accomplished by
adding an entry to zbookmark_t, which is included in a couple of
on-disk structures. That led to the creation of pools with incorrect
dsl_scan_phys_t objects that could only be imported by versions of ZoL
containing that commit.  Such pools cannot be imported by other versions
of ZFS or past versions of ZoL.

The additional field has been removed by the previous commit.  However,
affected pools must be imported and scrubbed using a version of ZoL with
this commit applied.  This will return the pools to a state in which they
may be imported by other implementations.

The 'zpool import' or 'zpool status' command can be used to determine if
a pool is impacted.  A message similar to one of the following means your
pool must be scrubbed to restore compatibility.

$ zpool import
   pool: zol-0.6.2-173
     id: 1165955789558693437
  state: ONLINE
 status: Errata #1 detected.
 action: The pool can be imported using its name or numeric identifier,
         however there is a compatibility issue which should be corrected
         by running 'zpool scrub'
    see: http://zfsonlinux.org/msg/ZFS-8000-ER
 config:
 ...

$ zpool status
  pool: zol-0.6.2-173
 state: ONLINE
  scan: pool compatibility issue detected.
   see: openzfs#2094
action: To correct the issue run 'zpool scrub'.
config:
...

If there was an async destroy in progress, 'zpool import' will prevent
the pool from being imported.  Further advice on how to proceed will be
provided by the error message as follows.

$ zpool import
   pool: zol-0.6.2-173
     id: 1165955789558693437
  state: ONLINE
 status: Errata #2 detected.
 action: The pool can not be imported with this version of ZFS due to an
         active asynchronous destroy.  Revert to an earlier version and
         allow the destroy to complete before updating.
         see: http://zfsonlinux.org/msg/ZFS-8000-ER
 config:
 ...

Pools affected by the damaged dsl_scan_phys_t can be detected prior to
an upgrade by running the following command as root:

zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w

Note that `poolname` must be replaced with the name of the pool you wish
to check. A value of 25 indicates the dsl_scan_phys_t has been damaged.
A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0
indicates that there has never been a scrub run on the pool.

The regression caused by the change to zbookmark_t never made it into a
tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL
stable repositories.  Only those using the HEAD version directly from
GitHub after the 0.6.2 tag but before the 0.6.3 tag are affected.

This patch does have one limitation that should be mentioned.  It will not
detect errata #2 on a pool unless errata #1 is also present.  It is
expected this will not be a significant problem because pools impacted by
errata #2 have a high probability of also being impacted by errata #1.

End users can ensure they do not hit this unlikely case by waiting for all
asynchronous destroy operations to complete before updating ZoL.  The
presence of any background destroys on any imported pools can be checked
by running `zpool get freeing` as root.  This will display a non-zero
value for any pool with an active asynchronous destroy.

Lastly, it is expected that no user data has been lost as a result of
this erratum.

Original-patch-by: Tim Chase <[email protected]>
Reworked-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#2094
ryao added a commit that referenced this pull request Mar 25, 2014
openzfs#180 occurred because of a race between inode eviction and
zfs_zget(). openzfs/zfs@36df284
tried to address it by making an upcall to the VFS to learn whether an
inode is being evicted and spinning until it leaves eviction. This is a hack
around the fact that we cannot ensure "SA" does immediate eviction by
hooking into generic_drop_inode(), which is GPL exported and cannot be
wrapped. Unfortunately, the act of calling ilookup to avoid this race
during writeback creates a deadlock:

[  602.268492] INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
[  602.268496]       Tainted: P           O 3.13.6 #1
[  602.268498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  602.268500] kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
[  602.268511] Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
[  602.268522]  ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
[  602.268526]  ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
[  602.268530]  ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
[  602.268534] Call Trace:
[  602.268541]  [<ffffffff813dd1d4>] schedule+0x24/0x70
[  602.268546]  [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
[  602.268552]  [<ffffffff810821c0>] ? autoremove_wake_function+0x40/0x40
[  602.268555]  [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
[  602.268559]  [<ffffffff811608c5>] ilookup+0x65/0xd0
[  602.268590]  [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
[  602.268594]  [<ffffffff813e013b>] ? __mutex_lock_slowpath+0x21b/0x360
[  602.268613]  [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
[  602.268631]  [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
[  602.268634]  [<ffffffff813dfe79>] ? mutex_unlock+0x9/0x10
[  602.268651]  [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
[  602.268669]  [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
[  602.268674]  [<ffffffff811071ec>] do_writepages+0x1c/0x40
[  602.268677]  [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
[  602.268680]  [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
[  602.268684]  [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
[  602.268689]  [<ffffffff81062cc1>] ? set_worker_desc+0x71/0x80
[  602.268692]  [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
[  602.268696]  [<ffffffff813e12b4>] ? _raw_spin_unlock_irq+0x14/0x40
[  602.268700]  [<ffffffff8106fd19>] ? finish_task_switch+0x59/0x130
[  602.268703]  [<ffffffff8106072c>] process_one_work+0x16c/0x490
[  602.268706]  [<ffffffff810613f3>] worker_thread+0x113/0x390
[  602.268710]  [<ffffffff810612e0>] ? manage_workers.isra.27+0x2a0/0x2a0
[  602.268713]  [<ffffffff81066edf>] kthread+0xdf/0x100
[  602.268717]  [<ffffffff8107877e>] ? arch_vtime_task_switch+0x8e/0xa0
[  602.268720]  [<ffffffff81066e00>] ? kthread_create_on_node+0x190/0x190
[  602.268723]  [<ffffffff813e71fc>] ret_from_fork+0x7c/0xb0
[  602.268730]  [<ffffffff81066e00>] ? kthread_create_on_node+0x190/0x190

The return value from igrab() gives us the information that ilookup()
provided without the risk of a deadlock. Ideally, we should ask upstream
to export generic_drop_inode() so that we can wrap it to properly handle
this situation, but until then, let's hook into the return value of
igrab() so that we do not deadlock here.

In addition, this ensures that successful exit from this function has a
hold on the inode, which the code expects.
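
For illustration only, a minimal sketch of the igrab()-based check described
above. This is not the patch itself; the surrounding zfs_zget() names (zp,
zsb, db, obj_num, ZTOI(), ZFS_OBJ_HOLD_EXIT()) are assumed from the ZoL
sources:

    if (zp->z_sa_hdl != NULL) {
            /*
             * igrab() returns NULL while the inode is being evicted
             * (I_FREEING/I_WILL_FREE set), which is the same information
             * ilookup() gave us, but without blocking in
             * __wait_on_freeing_inode() from writeback context.
             */
            if (igrab(ZTOI(zp)) == NULL) {
                    mutex_exit(&zp->z_lock);
                    sa_buf_rele(db, NULL);
                    ZFS_OBJ_HOLD_EXIT(zsb, obj_num);
                    /* The inode is mid-eviction; back off and retry. */
                    goto again;
            }
            /* Success: the caller now owns a reference on the inode. */
            *zpp = zp;
            err = 0;
    }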

Signed-off-by: Richard Yao <[email protected]>

ryao added a commit that referenced this pull request Mar 25, 2014
openzfs#180 occurred because of a race between inode eviction and
zfs_zget(). openzfs/zfs@36df284
tried to address it by making an upcall to the VFS to learn whether an
inode is being evicted and spinning until it leaves eviction. This is a hack
around the fact that we cannot ensure "SA" does immediate eviction by
hooking into generic_drop_inode(), which is GPL exported and cannot be
wrapped. Unfortunately, the act of calling ilookup to avoid this race
during writeback creates a deadlock:

[  602.268492] INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
[  602.268496]       Tainted: P           O 3.13.6 #1
[  602.268498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  602.268500] kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
[  602.268511] Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
[  602.268522]  ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
[  602.268526]  ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
[  602.268530]  ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
[  602.268534] Call Trace:
[  602.268541]  [<ffffffff813dd1d4>] schedule+0x24/0x70
[  602.268546]  [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
[  602.268552]  [<ffffffff810821c0>] ? autoremove_wake_function+0x40/0x40
[  602.268555]  [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
[  602.268559]  [<ffffffff811608c5>] ilookup+0x65/0xd0
[  602.268590]  [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
[  602.268594]  [<ffffffff813e013b>] ? __mutex_lock_slowpath+0x21b/0x360
[  602.268613]  [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
[  602.268631]  [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
[  602.268634]  [<ffffffff813dfe79>] ? mutex_unlock+0x9/0x10
[  602.268651]  [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
[  602.268669]  [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
[  602.268674]  [<ffffffff811071ec>] do_writepages+0x1c/0x40
[  602.268677]  [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
[  602.268680]  [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
[  602.268684]  [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
[  602.268689]  [<ffffffff81062cc1>] ? set_worker_desc+0x71/0x80
[  602.268692]  [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
[  602.268696]  [<ffffffff813e12b4>] ? _raw_spin_unlock_irq+0x14/0x40
[  602.268700]  [<ffffffff8106fd19>] ? finish_task_switch+0x59/0x130
[  602.268703]  [<ffffffff8106072c>] process_one_work+0x16c/0x490
[  602.268706]  [<ffffffff810613f3>] worker_thread+0x113/0x390
[  602.268710]  [<ffffffff810612e0>] ? manage_workers.isra.27+0x2a0/0x2a0
[  602.268713]  [<ffffffff81066edf>] kthread+0xdf/0x100
[  602.268717]  [<ffffffff8107877e>] ? arch_vtime_task_switch+0x8e/0xa0
[  602.268720]  [<ffffffff81066e00>] ? kthread_create_on_node+0x190/0x190
[  602.268723]  [<ffffffff813e71fc>] ret_from_fork+0x7c/0xb0
[  602.268730]  [<ffffffff81066e00>] ? kthread_create_on_node+0x190/0x190

The return value from igrab() gives us the information that ilookup()
provided without the risk of a deadlock. Ideally, we should ask upstream
to export generic_drop_inode() so that we can wrap it to properly handle
this situation, but until then, let's hook into the return value of
igrab() so that we do not deadlock here.

In addition, zfs_zget() should exit with a hold on the inode, but that
was never present in the Linux port. This can lead to undefined
behavior, such as inodes that are evicted while they still have users.
The function is modified to ensure that successful exit from this
function has a hold on the inode, which the code expects.

Signed-off-by: Richard Yao <[email protected]>

ryao added a commit that referenced this pull request Mar 25, 2014
openzfs#180 occurred because of a race between inode eviction and
zfs_zget(). openzfs/zfs@36df284
tried to address it by making an upcall to the VFS to learn whether an
inode is being evicted and spinning until it leaves eviction. This is a hack
around the fact that we cannot ensure "SA" does immediate eviction by
hooking into generic_drop_inode(), which is GPL exported and cannot be
wrapped. Unfortunately, the act of calling ilookup to avoid this race
during writeback creates a deadlock:

[  602.268492] INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
[  602.268496]       Tainted: P           O 3.13.6 #1
[  602.268498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  602.268500] kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
[  602.268511] Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
[  602.268522]  ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
[  602.268526]  ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
[  602.268530]  ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
[  602.268534] Call Trace:
[  602.268541]  [<ffffffff813dd1d4>] schedule+0x24/0x70
[  602.268546]  [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
[  602.268552]  [<ffffffff810821c0>] ? autoremove_wake_function+0x40/0x40
[  602.268555]  [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
[  602.268559]  [<ffffffff811608c5>] ilookup+0x65/0xd0
[  602.268590]  [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
[  602.268594]  [<ffffffff813e013b>] ? __mutex_lock_slowpath+0x21b/0x360
[  602.268613]  [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
[  602.268631]  [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
[  602.268634]  [<ffffffff813dfe79>] ? mutex_unlock+0x9/0x10
[  602.268651]  [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
[  602.268669]  [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
[  602.268674]  [<ffffffff811071ec>] do_writepages+0x1c/0x40
[  602.268677]  [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
[  602.268680]  [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
[  602.268684]  [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
[  602.268689]  [<ffffffff81062cc1>] ? set_worker_desc+0x71/0x80
[  602.268692]  [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
[  602.268696]  [<ffffffff813e12b4>] ? _raw_spin_unlock_irq+0x14/0x40
[  602.268700]  [<ffffffff8106fd19>] ? finish_task_switch+0x59/0x130
[  602.268703]  [<ffffffff8106072c>] process_one_work+0x16c/0x490
[  602.268706]  [<ffffffff810613f3>] worker_thread+0x113/0x390
[  602.268710]  [<ffffffff810612e0>] ? manage_workers.isra.27+0x2a0/0x2a0
[  602.268713]  [<ffffffff81066edf>] kthread+0xdf/0x100
[  602.268717]  [<ffffffff8107877e>] ? arch_vtime_task_switch+0x8e/0xa0
[  602.268720]  [<ffffffff81066e00>] ? kthread_create_on_node+0x190/0x190
[  602.268723]  [<ffffffff813e71fc>] ret_from_fork+0x7c/0xb0
[  602.268730]  [<ffffffff81066e00>] ? kthread_create_on_node+0x190/0x190

The return value from igrab() gives us the information that ilookup()
provided without the risk of a deadlock. Ideally, we should ask upstream
to export generic_drop_inode() so that we can wrap it to properly handle
this situation, but until then, let's hook into the return value of
igrab() so that we do not deadlock here.

In addition, zfs_zget() should exit with a hold on the inode, but that
was never done in the Linux port when the inode had already been
constructed. This can lead to undefined behavior, such as inodes that
are evicted while they still have users. The function is modified to ensure
that successful exit from this function has a hold on the inode, which
the code expects.

Signed-off-by: Richard Yao <[email protected]>

ryao added a commit that referenced this pull request Apr 9, 2014
openzfs#180 occurred because of a race between inode eviction and
zfs_zget(). openzfs/zfs@36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted.  If it was being
evicted, the operation was retried after dropping and reacquiring the
relevant resources.  Unfortunately, this introduced another deadlock.

  INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
        Tainted: P           O 3.13.6 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
  Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
   ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
   ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
   ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
  Call Trace:
   [<ffffffff813dd1d4>] schedule+0x24/0x70
   [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
   [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
   [<ffffffff811608c5>] ilookup+0x65/0xd0
   [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
   [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
   [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
   [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
   [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
   [<ffffffff811071ec>] do_writepages+0x1c/0x40
   [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
   [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
   [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
   [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
   [<ffffffff8106072c>] process_one_work+0x16c/0x490
   [<ffffffff810613f3>] worker_thread+0x113/0x390
   [<ffffffff81066edf>] kthread+0xdf/0x100

This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks.  Instead of relying on a call to ilookup(),
which can block in __wait_on_freeing_inode(), the return value from igrab()
is used.  This gives us the information that ilookup() provided without
the risk of a deadlock.

Alternatively, this race could be closed by registering an sops->drop_inode()
callback.  The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.
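
As a rough illustration of that alternative only, such a callback might look
like the sketch below. The zpl_/ITOZ() naming is assumed from the ZoL sources,
zfs_sa_is_held() is a purely hypothetical helper that would report an
outstanding SA hold, and no such callback is part of this change:

    /* Hypothetical sketch only; zfs_sa_is_held() does not exist in ZoL. */
    static int
    zpl_drop_inode(struct inode *ip)
    {
            znode_t *zp = ITOZ(ip);

            /* Keep the inode cached while an SA hold is outstanding. */
            if (zfs_sa_is_held(zp))
                    return (0);

            /* Otherwise behave like generic_drop_inode(). */
            return (!ip->i_nlink || inode_unhashed(ip));
    }

It would be wired up through the filesystem's super_operations.drop_inode
field.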

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#180
ryao added a commit that referenced this pull request Apr 10, 2014
openzfs#180 occurred because of a race between inode eviction and
zfs_zget(). openzfs/zfs@36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted.  If it was being
evicted, the operation was retried after dropping and reacquiring the
relevant resources.  Unfortunately, this introduced another deadlock.

  INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
        Tainted: P           O 3.13.6 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
  Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
   ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
   ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
   ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
  Call Trace:
   [<ffffffff813dd1d4>] schedule+0x24/0x70
   [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
   [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
   [<ffffffff811608c5>] ilookup+0x65/0xd0
   [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
   [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
   [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
   [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
   [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
   [<ffffffff811071ec>] do_writepages+0x1c/0x40
   [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
   [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
   [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
   [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
   [<ffffffff8106072c>] process_one_work+0x16c/0x490
   [<ffffffff810613f3>] worker_thread+0x113/0x390
   [<ffffffff81066edf>] kthread+0xdf/0x100

This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks.  Instead of relying on a call to ilookup(),
which can block in __wait_on_freeing_inode(), the return value from igrab()
is used.  This gives us the information that ilookup() provided without
the risk of a deadlock.

Alternatively, this race could be closed by registering an sops->drop_inode()
callback.  The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#180
ryao added a commit that referenced this pull request Apr 22, 2015
…hash mutex

The following deadlock occurred on the buildbot:

[ 3774.649030] VERIFY3(((*(volatile typeof((((&((zsb))->z_hold_mtx[(((z_id)) & (256 - 1))])))->m_owner) *)&((((&((zsb))->z_hold_mtx[(((z_id)) & (256 - 1))])))->m_owner))) != get_current()) failed (ffff880036362dc0 != ffff880036362dc0)
[ 3774.649407] PANIC at zfs_znode.c:1108:zfs_zinactive()
[ 3774.649415] Showing stack for process 32119
[ 3774.649425] CPU: 3 PID: 32119 Comm: filebench Tainted: PF          O 3.11.10-100.fc18.x86_64 #1
[ 3774.649428] Hardware name: Red Hat RHEV Hypervisor, BIOS 0.5.1 01/01/2007
[ 3774.649428]  ffffffffa03a3af8 ffff880047cf2bb8 ffffffff81666676 0000000000000007
[ 3774.649430]  ffffffffa03a3b73 ffff880047cf2bc8 ffffffffa01c73e4 ffff880047cf2d68
[ 3774.649435]  ffffffffa01c761d 0000000000000003 ffff88004b1accc0 0000000000000030
[ 3774.649447] Call Trace:
[ 3774.649457]  [<ffffffff81666676>] dump_stack+0x46/0x58
[ 3774.649465]  [<ffffffffa01c73e4>] spl_dumpstack+0x44/0x50 [spl]
[ 3774.649468]  [<ffffffffa01c761d>] spl_panic+0xbd/0x100 [spl]
[ 3774.649476]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649493]  [<ffffffffa03369d5>] zfs_zinactive+0x1f5/0x240 [zfs]
[ 3774.649538]  [<ffffffffa032fb9c>] zfs_inactive+0x7c/0x430 [zfs]
[ 3774.649546]  [<ffffffffa03506fe>] zpl_evict_inode+0x4e/0xa0 [zfs]
[ 3774.649546]  [<ffffffff811c8e12>] evict+0xa2/0x1a0
[ 3774.649546]  [<ffffffff811c8f4e>] dispose_list+0x3e/0x60
[ 3774.649546]  [<ffffffff811c9cd1>] prune_icache_sb+0x161/0x300
[ 3774.649546]  [<ffffffff811b2e35>] prune_super+0xe5/0x1b0
[ 3774.649546]  [<ffffffff81153771>] shrink_slab+0x151/0x2e0
[ 3774.649546]  [<ffffffff811a9809>] ? vmpressure+0x29/0x90
[ 3774.649546]  [<ffffffff811a97e5>] ? vmpressure+0x5/0x90
[ 3774.649546]  [<ffffffff81156979>] do_try_to_free_pages+0x3e9/0x5a0
[ 3774.649548]  [<ffffffff811527ff>] ? throttle_direct_reclaim.isra.45+0x8f/0x280
[ 3774.649552]  [<ffffffff81156e38>] try_to_free_pages+0xf8/0x180
[ 3774.649556]  [<ffffffff8114ae3a>] __alloc_pages_nodemask+0x6aa/0xae0
[ 3774.649562]  [<ffffffff81189fb8>] alloc_pages_current+0xb8/0x190
[ 3774.649565]  [<ffffffff81193e30>] new_slab+0x2d0/0x3a0
[ 3774.649577]  [<ffffffff81664d2d>] __slab_alloc+0x393/0x560
[ 3774.649579]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649583]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649583]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649585]  [<ffffffff81195230>] kmem_cache_alloc+0x1a0/0x200
[ 3774.649589]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649594]  [<ffffffffa01c1b30>] spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649596]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649599]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649624]  [<ffffffffa03505c0>] ? zpl_inode_destroy+0x60/0x60 [zfs]
[ 3774.649687]  [<ffffffffa033266f>] zfs_inode_alloc+0x1f/0x40 [zfs]
[ 3774.649687]  [<ffffffffa03505da>] zpl_inode_alloc+0x1a/0x70 [zfs]
[ 3774.649687]  [<ffffffff811c7e16>] alloc_inode+0x26/0xa0
[ 3774.649687]  [<ffffffff811c9e83>] new_inode_pseudo+0x13/0x60
[ 3774.649687]  [<ffffffff811c9eed>] new_inode+0x1d/0x40
[ 3774.649710]  [<ffffffffa0332ac7>] zfs_znode_alloc+0x47/0x730 [zfs]
[ 3774.649770]  [<ffffffffa02c8f4e>] ? sa_build_index+0xbe/0x1b0 [zfs]
[ 3774.649770]  [<ffffffffa02c9775>] ? sa_build_layouts+0x6b5/0xc80 [zfs]
[ 3774.649770]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649794]  [<ffffffffa0333b5e>] zfs_mknode+0x93e/0xe90 [zfs]
[ 3774.649813]  [<ffffffffa032be5b>] zfs_create+0x5db/0x780 [zfs]
[ 3774.649840]  [<ffffffffa0350ba5>] zpl_xattr_set_dir.isra.9+0x245/0x2a0 [zfs]
[ 3774.649843]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649895]  [<ffffffffa0351140>] zpl_xattr_set+0xe0/0x3f0 [zfs]
[ 3774.649895]  [<ffffffffa03516a4>] __zpl_xattr_security_init+0x64/0xb0 [zfs]
[ 3774.649968]  [<ffffffffa0351640>] ? zpl_xattr_trusted_set+0xb0/0xb0 [zfs]
[ 3774.649972]  [<ffffffff812a737c>] security_inode_init_security+0xbc/0xf0
[ 3774.649977]  [<ffffffffa0352028>] zpl_xattr_security_init+0x18/0x20 [zfs]
[ 3774.650017]  [<ffffffffa0350134>] zpl_create+0x154/0x240 [zfs]
[ 3774.650018]  [<ffffffff811bde85>] vfs_create+0xb5/0x120
[ 3774.650018]  [<ffffffff811be874>] do_last+0x984/0xe40
[ 3774.650020]  [<ffffffff811baf55>] ? link_path_walk+0x255/0x880
[ 3774.650023]  [<ffffffff811bedf2>] path_openat+0xc2/0x680
[ 3774.650026]  [<ffffffff811bf653>] do_filp_open+0x43/0xa0
[ 3774.650030]  [<ffffffff811bf615>] ? do_filp_open+0x5/0xa0
[ 3774.650034]  [<ffffffff811ae7fc>] do_sys_open+0x13c/0x230
[ 3774.650037]  [<ffffffff811ae912>] SyS_open+0x22/0x30
[ 3774.650040]  [<ffffffff81675819>] system_call_fastpath+0x16/0x1b

`zfs_mknode()` grabbed an object hash mutex via `ZFS_OBJ_HOLD_ENTER()`, tried to
allocate a znode with `zfs_znode_alloc()` and entered direct reclaim, which tried
to do `ZFS_OBJ_HOLD_ENTER()`. We can fix this by making `ZFS_OBJ_HOLD_ENTER()` and
`ZFS_OBJ_HOLD_EXIT()` call `spl_fstrans_mark()` and `spl_fstrans_unmark()`
respectively, allocating an array for each superblock to hold the cookies. Each
cookie is protected by the corresponding `z_hold_mtx`.
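
A hedged sketch of the macro change being described; the real definitions live
in the ZoL headers and may differ, and the cookie array name z_hold_mtx_fstrans
is an assumption for illustration:

    #define ZFS_OBJ_HASH(obj_num)  ((obj_num) & (ZFS_OBJ_MTX_SZ - 1))

    /*
     * Take the per-object hash mutex and mark the thread so that any
     * allocation done while it is held cannot re-enter the filesystem
     * through direct reclaim.
     */
    #define ZFS_OBJ_HOLD_ENTER(zsb, obj_num)                            \
    do {                                                                \
            mutex_enter(&(zsb)->z_hold_mtx[ZFS_OBJ_HASH(obj_num)]);     \
            (zsb)->z_hold_mtx_fstrans[ZFS_OBJ_HASH(obj_num)] =          \
                spl_fstrans_mark();                                     \
    } while (0)

    /*
     * Unmark before dropping the mutex; the cookie slot is protected
     * by the same z_hold_mtx entry.
     */
    #define ZFS_OBJ_HOLD_EXIT(zsb, obj_num)                             \
    do {                                                                \
            spl_fstrans_unmark(                                         \
                (zsb)->z_hold_mtx_fstrans[ZFS_OBJ_HASH(obj_num)]);      \
            mutex_exit(&(zsb)->z_hold_mtx[ZFS_OBJ_HASH(obj_num)]);      \
    } while (0)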

Closes openzfs#3331

Signed-off-by: Richard Yao <[email protected]>
ryao added a commit that referenced this pull request Apr 22, 2015
…hash mutex

The following deadlock occurred on the buildbot:

[ 3774.649030] VERIFY3(((*(volatile typeof((((&((zsb))->z_hold_mtx[(((z_id)) & (256 - 1))])))->m_owner) *)&((((&((zsb))->z_hold_mtx[(((z_id)) & (256 - 1))])))->m_owner))) != get_current()) failed (ffff880036362dc0 != ffff880036362dc0)
[ 3774.649407] PANIC at zfs_znode.c:1108:zfs_zinactive()
[ 3774.649415] Showing stack for process 32119
[ 3774.649425] CPU: 3 PID: 32119 Comm: filebench Tainted: PF          O 3.11.10-100.fc18.x86_64 #1
[ 3774.649428] Hardware name: Red Hat RHEV Hypervisor, BIOS 0.5.1 01/01/2007
[ 3774.649428]  ffffffffa03a3af8 ffff880047cf2bb8 ffffffff81666676 0000000000000007
[ 3774.649430]  ffffffffa03a3b73 ffff880047cf2bc8 ffffffffa01c73e4 ffff880047cf2d68
[ 3774.649435]  ffffffffa01c761d 0000000000000003 ffff88004b1accc0 0000000000000030
[ 3774.649447] Call Trace:
[ 3774.649457]  [<ffffffff81666676>] dump_stack+0x46/0x58
[ 3774.649465]  [<ffffffffa01c73e4>] spl_dumpstack+0x44/0x50 [spl]
[ 3774.649468]  [<ffffffffa01c761d>] spl_panic+0xbd/0x100 [spl]
[ 3774.649476]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649493]  [<ffffffffa03369d5>] zfs_zinactive+0x1f5/0x240 [zfs]
[ 3774.649538]  [<ffffffffa032fb9c>] zfs_inactive+0x7c/0x430 [zfs]
[ 3774.649546]  [<ffffffffa03506fe>] zpl_evict_inode+0x4e/0xa0 [zfs]
[ 3774.649546]  [<ffffffff811c8e12>] evict+0xa2/0x1a0
[ 3774.649546]  [<ffffffff811c8f4e>] dispose_list+0x3e/0x60
[ 3774.649546]  [<ffffffff811c9cd1>] prune_icache_sb+0x161/0x300
[ 3774.649546]  [<ffffffff811b2e35>] prune_super+0xe5/0x1b0
[ 3774.649546]  [<ffffffff81153771>] shrink_slab+0x151/0x2e0
[ 3774.649546]  [<ffffffff811a9809>] ? vmpressure+0x29/0x90
[ 3774.649546]  [<ffffffff811a97e5>] ? vmpressure+0x5/0x90
[ 3774.649546]  [<ffffffff81156979>] do_try_to_free_pages+0x3e9/0x5a0
[ 3774.649548]  [<ffffffff811527ff>] ? throttle_direct_reclaim.isra.45+0x8f/0x280
[ 3774.649552]  [<ffffffff81156e38>] try_to_free_pages+0xf8/0x180
[ 3774.649556]  [<ffffffff8114ae3a>] __alloc_pages_nodemask+0x6aa/0xae0
[ 3774.649562]  [<ffffffff81189fb8>] alloc_pages_current+0xb8/0x190
[ 3774.649565]  [<ffffffff81193e30>] new_slab+0x2d0/0x3a0
[ 3774.649577]  [<ffffffff81664d2d>] __slab_alloc+0x393/0x560
[ 3774.649579]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649583]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649583]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649585]  [<ffffffff81195230>] kmem_cache_alloc+0x1a0/0x200
[ 3774.649589]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649594]  [<ffffffffa01c1b30>] spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649596]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649599]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649624]  [<ffffffffa03505c0>] ? zpl_inode_destroy+0x60/0x60 [zfs]
[ 3774.649687]  [<ffffffffa033266f>] zfs_inode_alloc+0x1f/0x40 [zfs]
[ 3774.649687]  [<ffffffffa03505da>] zpl_inode_alloc+0x1a/0x70 [zfs]
[ 3774.649687]  [<ffffffff811c7e16>] alloc_inode+0x26/0xa0
[ 3774.649687]  [<ffffffff811c9e83>] new_inode_pseudo+0x13/0x60
[ 3774.649687]  [<ffffffff811c9eed>] new_inode+0x1d/0x40
[ 3774.649710]  [<ffffffffa0332ac7>] zfs_znode_alloc+0x47/0x730 [zfs]
[ 3774.649770]  [<ffffffffa02c8f4e>] ? sa_build_index+0xbe/0x1b0 [zfs]
[ 3774.649770]  [<ffffffffa02c9775>] ? sa_build_layouts+0x6b5/0xc80 [zfs]
[ 3774.649770]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649794]  [<ffffffffa0333b5e>] zfs_mknode+0x93e/0xe90 [zfs]
[ 3774.649813]  [<ffffffffa032be5b>] zfs_create+0x5db/0x780 [zfs]
[ 3774.649840]  [<ffffffffa0350ba5>] zpl_xattr_set_dir.isra.9+0x245/0x2a0 [zfs]
[ 3774.649843]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649895]  [<ffffffffa0351140>] zpl_xattr_set+0xe0/0x3f0 [zfs]
[ 3774.649895]  [<ffffffffa03516a4>] __zpl_xattr_security_init+0x64/0xb0 [zfs]
[ 3774.649968]  [<ffffffffa0351640>] ? zpl_xattr_trusted_set+0xb0/0xb0 [zfs]
[ 3774.649972]  [<ffffffff812a737c>] security_inode_init_security+0xbc/0xf0
[ 3774.649977]  [<ffffffffa0352028>] zpl_xattr_security_init+0x18/0x20 [zfs]
[ 3774.650017]  [<ffffffffa0350134>] zpl_create+0x154/0x240 [zfs]
[ 3774.650018]  [<ffffffff811bde85>] vfs_create+0xb5/0x120
[ 3774.650018]  [<ffffffff811be874>] do_last+0x984/0xe40
[ 3774.650020]  [<ffffffff811baf55>] ? link_path_walk+0x255/0x880
[ 3774.650023]  [<ffffffff811bedf2>] path_openat+0xc2/0x680
[ 3774.650026]  [<ffffffff811bf653>] do_filp_open+0x43/0xa0
[ 3774.650030]  [<ffffffff811bf615>] ? do_filp_open+0x5/0xa0
[ 3774.650034]  [<ffffffff811ae7fc>] do_sys_open+0x13c/0x230
[ 3774.650037]  [<ffffffff811ae912>] SyS_open+0x22/0x30
[ 3774.650040]  [<ffffffff81675819>] system_call_fastpath+0x16/0x1b

`zfs_mknode()` grabbed an object hash mutex via `ZFS_OBJ_HOLD_ENTER()`,
tried to allocate a znode with `zfs_znode_alloc()` and entered direct
reclaim, which tried to do `ZFS_OBJ_HOLD_ENTER()`. We fix this by
making `ZFS_OBJ_HOLD_ENTER()` and `ZFS_OBJ_HOLD_EXIT()` call
`spl_fstrans_mark()` and `spl_fstrans_unmark()` respectively, allocating
an array for each superblock to hold the cookies.  Each cookie is
protected by the corresponding `->z_hold_mtx`.

Closes openzfs#3331

Signed-off-by: Richard Yao <[email protected]>
ryao added a commit that referenced this pull request Apr 22, 2015
…hash mutex

The following deadlock occurred on the buildbot:

[ 3774.649030] VERIFY3(((*(volatile typeof((((&((zsb))->z_hold_mtx[(((z_id)) & (256 - 1))])))->m_owner) *)&((((&((zsb))->z_hold_mtx[(((z_id)) & (256 - 1))])))->m_owner))) != get_current()) failed (ffff880036362dc0 != ffff880036362dc0)
[ 3774.649407] PANIC at zfs_znode.c:1108:zfs_zinactive()
[ 3774.649415] Showing stack for process 32119
[ 3774.649425] CPU: 3 PID: 32119 Comm: filebench Tainted: PF          O 3.11.10-100.fc18.x86_64 #1
[ 3774.649428] Hardware name: Red Hat RHEV Hypervisor, BIOS 0.5.1 01/01/2007
[ 3774.649428]  ffffffffa03a3af8 ffff880047cf2bb8 ffffffff81666676 0000000000000007
[ 3774.649430]  ffffffffa03a3b73 ffff880047cf2bc8 ffffffffa01c73e4 ffff880047cf2d68
[ 3774.649435]  ffffffffa01c761d 0000000000000003 ffff88004b1accc0 0000000000000030
[ 3774.649447] Call Trace:
[ 3774.649457]  [<ffffffff81666676>] dump_stack+0x46/0x58
[ 3774.649465]  [<ffffffffa01c73e4>] spl_dumpstack+0x44/0x50 [spl]
[ 3774.649468]  [<ffffffffa01c761d>] spl_panic+0xbd/0x100 [spl]
[ 3774.649476]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649493]  [<ffffffffa03369d5>] zfs_zinactive+0x1f5/0x240 [zfs]
[ 3774.649538]  [<ffffffffa032fb9c>] zfs_inactive+0x7c/0x430 [zfs]
[ 3774.649546]  [<ffffffffa03506fe>] zpl_evict_inode+0x4e/0xa0 [zfs]
[ 3774.649546]  [<ffffffff811c8e12>] evict+0xa2/0x1a0
[ 3774.649546]  [<ffffffff811c8f4e>] dispose_list+0x3e/0x60
[ 3774.649546]  [<ffffffff811c9cd1>] prune_icache_sb+0x161/0x300
[ 3774.649546]  [<ffffffff811b2e35>] prune_super+0xe5/0x1b0
[ 3774.649546]  [<ffffffff81153771>] shrink_slab+0x151/0x2e0
[ 3774.649546]  [<ffffffff811a9809>] ? vmpressure+0x29/0x90
[ 3774.649546]  [<ffffffff811a97e5>] ? vmpressure+0x5/0x90
[ 3774.649546]  [<ffffffff81156979>] do_try_to_free_pages+0x3e9/0x5a0
[ 3774.649548]  [<ffffffff811527ff>] ? throttle_direct_reclaim.isra.45+0x8f/0x280
[ 3774.649552]  [<ffffffff81156e38>] try_to_free_pages+0xf8/0x180
[ 3774.649556]  [<ffffffff8114ae3a>] __alloc_pages_nodemask+0x6aa/0xae0
[ 3774.649562]  [<ffffffff81189fb8>] alloc_pages_current+0xb8/0x190
[ 3774.649565]  [<ffffffff81193e30>] new_slab+0x2d0/0x3a0
[ 3774.649577]  [<ffffffff81664d2d>] __slab_alloc+0x393/0x560
[ 3774.649579]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649583]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649583]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649585]  [<ffffffff81195230>] kmem_cache_alloc+0x1a0/0x200
[ 3774.649589]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649594]  [<ffffffffa01c1b30>] spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649596]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649599]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649624]  [<ffffffffa03505c0>] ? zpl_inode_destroy+0x60/0x60 [zfs]
[ 3774.649687]  [<ffffffffa033266f>] zfs_inode_alloc+0x1f/0x40 [zfs]
[ 3774.649687]  [<ffffffffa03505da>] zpl_inode_alloc+0x1a/0x70 [zfs]
[ 3774.649687]  [<ffffffff811c7e16>] alloc_inode+0x26/0xa0
[ 3774.649687]  [<ffffffff811c9e83>] new_inode_pseudo+0x13/0x60
[ 3774.649687]  [<ffffffff811c9eed>] new_inode+0x1d/0x40
[ 3774.649710]  [<ffffffffa0332ac7>] zfs_znode_alloc+0x47/0x730 [zfs]
[ 3774.649770]  [<ffffffffa02c8f4e>] ? sa_build_index+0xbe/0x1b0 [zfs]
[ 3774.649770]  [<ffffffffa02c9775>] ? sa_build_layouts+0x6b5/0xc80 [zfs]
[ 3774.649770]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649794]  [<ffffffffa0333b5e>] zfs_mknode+0x93e/0xe90 [zfs]
[ 3774.649813]  [<ffffffffa032be5b>] zfs_create+0x5db/0x780 [zfs]
[ 3774.649840]  [<ffffffffa0350ba5>] zpl_xattr_set_dir.isra.9+0x245/0x2a0 [zfs]
[ 3774.649843]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649895]  [<ffffffffa0351140>] zpl_xattr_set+0xe0/0x3f0 [zfs]
[ 3774.649895]  [<ffffffffa03516a4>] __zpl_xattr_security_init+0x64/0xb0 [zfs]
[ 3774.649968]  [<ffffffffa0351640>] ? zpl_xattr_trusted_set+0xb0/0xb0 [zfs]
[ 3774.649972]  [<ffffffff812a737c>] security_inode_init_security+0xbc/0xf0
[ 3774.649977]  [<ffffffffa0352028>] zpl_xattr_security_init+0x18/0x20 [zfs]
[ 3774.650017]  [<ffffffffa0350134>] zpl_create+0x154/0x240 [zfs]
[ 3774.650018]  [<ffffffff811bde85>] vfs_create+0xb5/0x120
[ 3774.650018]  [<ffffffff811be874>] do_last+0x984/0xe40
[ 3774.650020]  [<ffffffff811baf55>] ? link_path_walk+0x255/0x880
[ 3774.650023]  [<ffffffff811bedf2>] path_openat+0xc2/0x680
[ 3774.650026]  [<ffffffff811bf653>] do_filp_open+0x43/0xa0
[ 3774.650030]  [<ffffffff811bf615>] ? do_filp_open+0x5/0xa0
[ 3774.650034]  [<ffffffff811ae7fc>] do_sys_open+0x13c/0x230
[ 3774.650037]  [<ffffffff811ae912>] SyS_open+0x22/0x30
[ 3774.650040]  [<ffffffff81675819>] system_call_fastpath+0x16/0x1b

`zfs_mknode()` grabbed an object hash mutex via `ZFS_OBJ_HOLD_ENTER()`,
tried to allocate a znode with `zfs_znode_alloc()` and entered direct
reclaim, which tried to do `ZFS_OBJ_HOLD_ENTER()`. This is an edge case
that the kmem-rework missed. Consequently, it is a regression from
79c76d5.

We fix this by making `ZFS_OBJ_HOLD_ENTER()` and `ZFS_OBJ_HOLD_EXIT()`
call `spl_fstrans_mark()` and `spl_fstrans_unmark()` respectively,
allocating an array for each superblock to hold the cookies.  Each
cookie is protected by the corresponding `->z_hold_mtx`.

Closes openzfs#3331

Signed-off-by: Richard Yao <[email protected]>
ryao added a commit that referenced this pull request Apr 22, 2015
…hash mutex

The following deadlock occurred on the buildbot:

[ 3774.649030] VERIFY3(((*(volatile typeof((((&((zsb))->z_hold_mtx[(((z_id)) & (256 - 1))])))->m_owner) *)&((((&((zsb))->z_hold_mtx[(((z_id)) & (256 - 1))])))->m_owner))) != get_current()) failed (ffff880036362dc0 != ffff880036362dc0)
[ 3774.649407] PANIC at zfs_znode.c:1108:zfs_zinactive()
[ 3774.649415] Showing stack for process 32119
[ 3774.649425] CPU: 3 PID: 32119 Comm: filebench Tainted: PF          O 3.11.10-100.fc18.x86_64 #1
[ 3774.649428] Hardware name: Red Hat RHEV Hypervisor, BIOS 0.5.1 01/01/2007
[ 3774.649428]  ffffffffa03a3af8 ffff880047cf2bb8 ffffffff81666676 0000000000000007
[ 3774.649430]  ffffffffa03a3b73 ffff880047cf2bc8 ffffffffa01c73e4 ffff880047cf2d68
[ 3774.649435]  ffffffffa01c761d 0000000000000003 ffff88004b1accc0 0000000000000030
[ 3774.649447] Call Trace:
[ 3774.649457]  [<ffffffff81666676>] dump_stack+0x46/0x58
[ 3774.649465]  [<ffffffffa01c73e4>] spl_dumpstack+0x44/0x50 [spl]
[ 3774.649468]  [<ffffffffa01c761d>] spl_panic+0xbd/0x100 [spl]
[ 3774.649476]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649493]  [<ffffffffa03369d5>] zfs_zinactive+0x1f5/0x240 [zfs]
[ 3774.649538]  [<ffffffffa032fb9c>] zfs_inactive+0x7c/0x430 [zfs]
[ 3774.649546]  [<ffffffffa03506fe>] zpl_evict_inode+0x4e/0xa0 [zfs]
[ 3774.649546]  [<ffffffff811c8e12>] evict+0xa2/0x1a0
[ 3774.649546]  [<ffffffff811c8f4e>] dispose_list+0x3e/0x60
[ 3774.649546]  [<ffffffff811c9cd1>] prune_icache_sb+0x161/0x300
[ 3774.649546]  [<ffffffff811b2e35>] prune_super+0xe5/0x1b0
[ 3774.649546]  [<ffffffff81153771>] shrink_slab+0x151/0x2e0
[ 3774.649546]  [<ffffffff811a9809>] ? vmpressure+0x29/0x90
[ 3774.649546]  [<ffffffff811a97e5>] ? vmpressure+0x5/0x90
[ 3774.649546]  [<ffffffff81156979>] do_try_to_free_pages+0x3e9/0x5a0
[ 3774.649548]  [<ffffffff811527ff>] ? throttle_direct_reclaim.isra.45+0x8f/0x280
[ 3774.649552]  [<ffffffff81156e38>] try_to_free_pages+0xf8/0x180
[ 3774.649556]  [<ffffffff8114ae3a>] __alloc_pages_nodemask+0x6aa/0xae0
[ 3774.649562]  [<ffffffff81189fb8>] alloc_pages_current+0xb8/0x190
[ 3774.649565]  [<ffffffff81193e30>] new_slab+0x2d0/0x3a0
[ 3774.649577]  [<ffffffff81664d2d>] __slab_alloc+0x393/0x560
[ 3774.649579]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649583]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649583]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649585]  [<ffffffff81195230>] kmem_cache_alloc+0x1a0/0x200
[ 3774.649589]  [<ffffffffa01c1b30>] ? spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649594]  [<ffffffffa01c1b30>] spl_kmem_cache_alloc+0xb0/0xee0 [spl]
[ 3774.649596]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649599]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649624]  [<ffffffffa03505c0>] ? zpl_inode_destroy+0x60/0x60 [zfs]
[ 3774.649687]  [<ffffffffa033266f>] zfs_inode_alloc+0x1f/0x40 [zfs]
[ 3774.649687]  [<ffffffffa03505da>] zpl_inode_alloc+0x1a/0x70 [zfs]
[ 3774.649687]  [<ffffffff811c7e16>] alloc_inode+0x26/0xa0
[ 3774.649687]  [<ffffffff811c9e83>] new_inode_pseudo+0x13/0x60
[ 3774.649687]  [<ffffffff811c9eed>] new_inode+0x1d/0x40
[ 3774.649710]  [<ffffffffa0332ac7>] zfs_znode_alloc+0x47/0x730 [zfs]
[ 3774.649770]  [<ffffffffa02c8f4e>] ? sa_build_index+0xbe/0x1b0 [zfs]
[ 3774.649770]  [<ffffffffa02c9775>] ? sa_build_layouts+0x6b5/0xc80 [zfs]
[ 3774.649770]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649794]  [<ffffffffa0333b5e>] zfs_mknode+0x93e/0xe90 [zfs]
[ 3774.649813]  [<ffffffffa032be5b>] zfs_create+0x5db/0x780 [zfs]
[ 3774.649840]  [<ffffffffa0350ba5>] zpl_xattr_set_dir.isra.9+0x245/0x2a0 [zfs]
[ 3774.649843]  [<ffffffff81675440>] ? ftrace_call+0x5/0x2f
[ 3774.649895]  [<ffffffffa0351140>] zpl_xattr_set+0xe0/0x3f0 [zfs]
[ 3774.649895]  [<ffffffffa03516a4>] __zpl_xattr_security_init+0x64/0xb0 [zfs]
[ 3774.649968]  [<ffffffffa0351640>] ? zpl_xattr_trusted_set+0xb0/0xb0 [zfs]
[ 3774.649972]  [<ffffffff812a737c>] security_inode_init_security+0xbc/0xf0
[ 3774.649977]  [<ffffffffa0352028>] zpl_xattr_security_init+0x18/0x20 [zfs]
[ 3774.650017]  [<ffffffffa0350134>] zpl_create+0x154/0x240 [zfs]
[ 3774.650018]  [<ffffffff811bde85>] vfs_create+0xb5/0x120
[ 3774.650018]  [<ffffffff811be874>] do_last+0x984/0xe40
[ 3774.650020]  [<ffffffff811baf55>] ? link_path_walk+0x255/0x880
[ 3774.650023]  [<ffffffff811bedf2>] path_openat+0xc2/0x680
[ 3774.650026]  [<ffffffff811bf653>] do_filp_open+0x43/0xa0
[ 3774.650030]  [<ffffffff811bf615>] ? do_filp_open+0x5/0xa0
[ 3774.650034]  [<ffffffff811ae7fc>] do_sys_open+0x13c/0x230
[ 3774.650037]  [<ffffffff811ae912>] SyS_open+0x22/0x30
[ 3774.650040]  [<ffffffff81675819>] system_call_fastpath+0x16/0x1b

`zfs_mknode()` grabbed an object hash mutex via `ZFS_OBJ_HOLD_ENTER()`,
tried to allocate a znode with `zfs_znode_alloc()` and entered direct
reclaim, which tried to do `ZFS_OBJ_HOLD_ENTER()`. This is an edge case
that the kmem-rework missed. Consequently, it is a regression from
79c76d5.

We fix this by making `ZFS_OBJ_HOLD_ENTER()` and `ZFS_OBJ_HOLD_EXIT()`
call `spl_fstrans_mark()` and `spl_fstrans_unmark()` respectively,
allocating an array for each superblock to hold the cookies.  Each
cookie is protected by the corresponding `->z_hold_mtx`.

Closes openzfs#3331

Signed-off-by: Richard Yao <[email protected]>
ryao pushed a commit that referenced this pull request May 7, 2015
The parameters to the functions are uint64_t, but the offsets passed to
memcpy / bcopy are calculated using 32-bit ints. This patch changes them
to also be uint64_t so there isn't an overflow. PaX's Size Overflow
plugin caught this when formatting a zvol.
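
For illustration only (a standalone sketch, not code from dmu.c), this shows
how a byte offset computed in a 32-bit `int` silently truncates once the value
passes 4 GiB, which is exactly the class of bug the PaX plugin flagged:

```c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t offset = 0x1ffffb000ULL;	/* offset from the PaX report above */

	int truncated = (int)offset;		/* 32-bit arithmetic wraps */
	uint64_t exact = offset;		/* 64-bit arithmetic does not */

	printf("32-bit: 0x%x\n", (unsigned int)truncated);
	printf("64-bit: 0x%llx\n", (unsigned long long)exact);
	return (0);
}
```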

Gentoo bug: #546490

PAX: offset: 1ffffb000 db->db_offset: 1ffffa000 db->db_size: 2000 size: 5000
PAX: size overflow detected in function dmu_read /var/tmp/portage/sys-fs/zfs-kmod-0.6.3-r1/work/zfs-zfs-0.6.3/module/zfs/../../module/zfs/dmu.c:781 cicus.366_146 max, count: 15
CPU: 1 PID: 2236 Comm: zvol/10 Tainted: P           O   3.17.7-hardened-r1 #1
Call Trace:
 [<ffffffffa0382ee8>] ? dsl_dataset_get_holds+0x9d58/0x343ce [zfs]
 [<ffffffff81a59c88>] dump_stack+0x4e/0x7a
 [<ffffffffa0393c2a>] ? dsl_dataset_get_holds+0x1aa9a/0x343ce [zfs]
 [<ffffffff81206696>] report_size_overflow+0x36/0x40
 [<ffffffffa02dba2b>] dmu_read+0x52b/0x920 [zfs]
 [<ffffffffa0373ad1>] zrl_is_locked+0x7d1/0x1ce0 [zfs]
 [<ffffffffa0364cd2>] zil_clean+0x9d2/0xc00 [zfs]
 [<ffffffffa0364f21>] zil_commit+0x21/0x30 [zfs]
 [<ffffffffa0373fe1>] zrl_is_locked+0xce1/0x1ce0 [zfs]
 [<ffffffff81a5e2c7>] ? __schedule+0x547/0xbc0
 [<ffffffffa01582e6>] taskq_cancel_id+0x2a6/0x5b0 [spl]
 [<ffffffff81103eb0>] ? wake_up_state+0x20/0x20
 [<ffffffffa0158150>] ? taskq_cancel_id+0x110/0x5b0 [spl]
 [<ffffffff810f7ff4>] kthread+0xc4/0xe0
 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170
 [<ffffffff81a62fa4>] ret_from_fork+0x74/0xa0
 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170

Signed-off-by: Jason Zaman <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#3333
ryao added a commit that referenced this pull request Jul 23, 2015
DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs to be aligned to the backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and block
device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS because there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encounter another hurdle.
Specifically, avoiding the userland-to-kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior while for
checksums, this would imply we write incorrect checksums to disk.  It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations.  However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Other platforms
implement varying subsets of the XFS semantics. FreeBSD does not
implement #1 and might not implement others (not checked). Mac OS X does
not implement O_DIRECT, but it does implement F_NOCACHE, which is
similar to #2 in that it prevents new data from being cached. AIX
relaxes #3 by only committing the file data to disk. Metadata updates
required should the operations make the file larger are asynchronous
unless O_DSYNC is specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.
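
A hedged sketch of such a shim, assuming an iov_iter-era kernel and the
existing `zpl_iter_read()`/`zpl_iter_write()` handlers; the exact
`->direct_IO` prototype varies by kernel version, so this is illustrative
rather than the committed change:

```c
/*
 * Treat O_DIRECT as a hint: route "direct" IO through the normal
 * buffered iter paths instead of failing the request.
 */
static ssize_t
zpl_direct_IO(struct kiocb *kiocb, struct iov_iter *iter, loff_t pos)
{
	if (iov_iter_rw(iter) == WRITE)
		return (zpl_iter_write(kiocb, iter));

	return (zpl_iter_read(kiocb, iter));
}
```

Wired into the address_space operations as the `.direct_IO` callback, this
satisfies open(2) callers that pass O_DIRECT without changing how data
actually moves through the ARC.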

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need to be
finished and we would need to utilize the page tables to make operations
on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

Signed-off-by: Richard Yao <[email protected]>
ryao added a commit that referenced this pull request Dec 3, 2015
(gdb) bt
#0  0x00007f31952a35d7 in raise () from /lib64/libc.so.6
#1  0x00007f31952a4cc8 in abort () from /lib64/libc.so.6
#2  0x00007f319529c546 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f319529c5f2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f319529c66b in __assert () from /lib64/libc.so.6
#5  0x00007f319659e024 in send_iterate_prop (zhp=zhp@entry=0x19cf500, nv=0x19da110) at libzfs_sendrecv.c:724
#6  0x00007f319659e35d in send_iterate_fs (zhp=zhp@entry=0x19cf500, arg=arg@entry=0x7ffc515d6380) at libzfs_sendrecv.c:762
#7  0x00007f319659f3ca in gather_nvlist (hdl=<optimized out>, fsname=fsname@entry=0x19cf250 "tpool/test", fromsnap=fromsnap@entry=0x7ffc515d7531 "snap1", tosnap=tosnap@entry=0x7ffc515d7542 "snap2", recursive=B_FALSE,
    nvlp=nvlp@entry=0x7ffc515d6470, avlp=avlp@entry=0x7ffc515d6478) at libzfs_sendrecv.c:809
#8  0x00007f31965a408f in zfs_send (zhp=zhp@entry=0x19cf240, fromsnap=fromsnap@entry=0x7ffc515d7531 "snap1", tosnap=tosnap@entry=0x7ffc515d7542 "snap2", flags=flags@entry=0x7ffc515d6d30, outfd=outfd@entry=1,
    filter_func=filter_func@entry=0x0, cb_arg=cb_arg@entry=0x0, debugnvp=debugnvp@entry=0x0) at libzfs_sendrecv.c:1461
#9  0x000000000040a981 in zfs_do_send (argc=<optimized out>, argv=0x7ffc515d6ff0) at zfs_main.c:3841
#10 0x0000000000404d10 in main (argc=6, argv=0x7ffc515d6fc8) at zfs_main.c:6724
(gdb) fr 5
#5  0x00007f319659e024 in send_iterate_prop (zhp=zhp@entry=0x19cf500, nv=0x19da110) at libzfs_sendrecv.c:724
724                             verify(nvlist_lookup_uint64(propnv,
(gdb) list
719                             verify(nvlist_lookup_string(propnv,
720                                 ZPROP_VALUE, &value) == 0);
721                             VERIFY(0 == nvlist_add_string(nv, propname, value));
722                     } else {
723                             uint64_t value;
724                             verify(nvlist_lookup_uint64(propnv,
725                                 ZPROP_VALUE, &value) == 0);
726                             VERIFY(0 == nvlist_add_uint64(nv, propname, value));
727                     }
728             }
(gdb) p prop
$1 = ZFS_PROP_RELATIME
ryao pushed a commit that referenced this pull request Sep 12, 2016
DMU_MAX_ACCESS should be cast to a uint64_t; otherwise the
multiplication of DMU_MAX_ACCESS by spa_asize_inflation will be done in
32-bit arithmetic and may overflow. Currently DMU_MAX_ACCESS is
64 * 1024 * 1024, so a spa_asize_inflation of 64 or more will lead to
an overflow.
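
A standalone sketch of the pattern (not the dmu_tx.c code itself):

```c
#include <stdint.h>
#include <stdio.h>

#define	DMU_MAX_ACCESS	(64 * 1024 * 1024)	/* an int constant, as noted above */

int
main(void)
{
	int spa_asize_inflation = 64;

	/*
	 * Without a cast, DMU_MAX_ACCESS * spa_asize_inflation is evaluated
	 * in 32-bit arithmetic and wraps (64 MiB * 64 == 2^32) before being
	 * widened to uint64_t.  Casting one operand first keeps the whole
	 * multiplication in 64-bit arithmetic.
	 */
	uint64_t limit = (uint64_t)DMU_MAX_ACCESS * spa_asize_inflation;

	printf("limit = %llu bytes\n", (unsigned long long)limit);
	return (0);
}
```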

Found by static analysis with CoverityScan 0.8.5

CID 150942 (#1 of 1): Unintentional integer overflow
  (OVERFLOW_BEFORE_WIDEN)
overflow_before_widen: Potentially overflowing expression
  67108864 * spa_asize_inflation with type int (32 bits, signed)
  is evaluated using 32-bit arithmetic, and then used in a context
  that expects an expression of type uint64_t (64 bits, unsigned).

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#4889
ryao pushed a commit that referenced this pull request Sep 12, 2016
Leaks reported by AddressSanitizer (GCC 6.1.0):

Direct leak of 4097 byte(s) in 1 object(s) allocated from:
    #1 0x414f73 in process_options cmd/ztest/ztest.c:721

Direct leak of 5440 byte(s) in 17 object(s) allocated from:
    #1 0x41bfd5 in umem_alloc ../../lib/libspl/include/umem.h:88
    #2 0x41bfd5 in ztest_zap_parallel cmd/ztest/ztest.c:4659
    #3 0x4163a8 in ztest_execute cmd/ztest/ztest.c:5907

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#4896
ryao pushed a commit that referenced this pull request Jun 5, 2018
It is just plain unsafe to peek inside the in-kernel
mutex structure and make assumptions about what the kernel
does with internal fields like the owner.

The kernel is all too happy to stop doing the expected things,
like tracking the lock owner, once you load a tainted module
like spl/zfs that is not GPL.

As such you will get instant assertion failures like this:

  VERIFY3(((*(volatile typeof((&((&zo->zo_lock)->m_mutex))->owner) *)&
      ((&((&zo->zo_lock)->m_mutex))->owner))) == 
     ((void *)0)) failed (ffff88030be28500 == (null))
  PANIC at zfs_onexit.c:104:zfs_onexit_destroy()
  Showing stack for process 3626
  CPU: 0 PID: 3626 Comm: mkfs.lustre Tainted: P OE ------------ 3.10.0-debug #1
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  Call Trace:
  dump_stack+0x19/0x1b
  spl_dumpstack+0x44/0x50 [spl]
  spl_panic+0xbf/0xf0 [spl]
  zfs_onexit_destroy+0x17c/0x280 [zfs]
  zfsdev_release+0x48/0xd0 [zfs]
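
A hedged sketch of the alternative the message implies, i.e. a wrapper that
tracks the owner itself instead of reading the kernel's struct mutex (field
and macro names are illustrative, not the actual SPL code):

```c
typedef struct {
	struct mutex		m_mutex;	/* the real kernel mutex */
	struct task_struct	*m_owner;	/* owner tracked by the wrapper */
} kmutex_t;

#define	mutex_enter(mp)				\
do {						\
	mutex_lock(&(mp)->m_mutex);		\
	(mp)->m_owner = current;		\
} while (0)

#define	mutex_exit(mp)				\
do {						\
	(mp)->m_owner = NULL;			\
	mutex_unlock(&(mp)->m_mutex);		\
} while (0)

#define	mutex_owner(mp)	((mp)->m_owner)
#define	MUTEX_HELD(mp)	(mutex_owner(mp) == current)
```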

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Chunwei Chen <[email protected]>
Signed-off-by: Oleg Drokin <[email protected]>
Closes openzfs#632 
Closes openzfs#633
ryao pushed a commit that referenced this pull request Jun 5, 2018
It is just plain unsafe to peek inside the in-kernel
mutex structure and make assumptions about what the kernel
does with internal fields like the owner.

The kernel is all too happy to stop doing the expected things,
like tracking the lock owner, once you load a tainted module
like spl/zfs that is not GPL.

As such you will get instant assertion failures like this:

  VERIFY3(((*(volatile typeof((&((&zo->zo_lock)->m_mutex))->owner) *)&
      ((&((&zo->zo_lock)->m_mutex))->owner))) ==
     ((void *)0)) failed (ffff88030be28500 == (null))
  PANIC at zfs_onexit.c:104:zfs_onexit_destroy()
  Showing stack for process 3626
  CPU: 0 PID: 3626 Comm: mkfs.lustre Tainted: P OE ------------ 3.10.0-debug #1
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  Call Trace:
  dump_stack+0x19/0x1b
  spl_dumpstack+0x44/0x50 [spl]
  spl_panic+0xbf/0xf0 [spl]
  zfs_onexit_destroy+0x17c/0x280 [zfs]
  zfsdev_release+0x48/0xd0 [zfs]

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Oleg Drokin <[email protected]>
Closes openzfs#639
Closes openzfs#632
ryao pushed a commit that referenced this pull request Mar 16, 2019
The bug's time sequence:
1. Thread #1, `zfs_write`, assigns a txg "n".
2. In the same process, thread #2 takes an mmap page fault (which means
   `mm_sem` is held), and `zfs_dirty_inode` fails to open a txg and
   waits for the previous txg "n" to complete.
3. Thread #1 calls `uiomove` to write, but a page fault occurs inside
   `uiomove`, which means it needs `mm_sem`; `mm_sem` is held by
   thread #2, so thread #1 is stuck and can't complete, and txg "n"
   will never complete.

So thread #1 and thread #2 are deadlocked.

Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Grady Wong <[email protected]>
Closes openzfs#7939
ryao pushed a commit that referenced this pull request Mar 16, 2019
Trying to mount a dataset from a readonly pool could inadvertently start
the user accounting upgrade task, leading to the following failure:

VERIFY3(tx->tx_threads == 2) failed (0 == 2)
PANIC at txg.c:680:txg_wait_synced()
Showing stack for process 2541
CPU: 2 PID: 2541 Comm: z_upgrade Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
 [<0>] ? dump_stack+0x5d/0x78
 [<0>] ? spl_panic+0xc9/0x110 [spl]
 [<0>] ? dnode_next_offset+0x1d4/0x2c0 [zfs]
 [<0>] ? dmu_object_next+0x77/0x130 [zfs]
 [<0>] ? dnode_rele_and_unlock+0x4d/0x120 [zfs]
 [<0>] ? txg_wait_synced+0x91/0x220 [zfs]
 [<0>] ? dmu_objset_id_quota_upgrade_cb+0x10f/0x140 [zfs]
 [<0>] ? dmu_objset_upgrade_task_cb+0xe3/0x170 [zfs]
 [<0>] ? taskq_thread+0x2cc/0x5d0 [spl]
 [<0>] ? wake_up_state+0x10/0x10
 [<0>] ? taskq_thread_should_stop.part.3+0x70/0x70 [spl]
 [<0>] ? kthread+0xbd/0xe0
 [<0>] ? kthread_create_on_node+0x180/0x180
 [<0>] ? ret_from_fork+0x58/0x90
 [<0>] ? kthread_create_on_node+0x180/0x180

This patch updates both functions responsible for checking if we can
perform user accounting to verify the pool is not readonly.
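
A minimal sketch of the kind of guard being added (the function name is
illustrative; the actual patch touches the objset user-accounting upgrade
paths):

```c
static boolean_t
dmu_objset_upgrade_allowed(objset_t *os)
{
	/* Never start the upgrade task on a pool we cannot write to. */
	if (!spa_writeable(dmu_objset_spa(os)))
		return (B_FALSE);

	return (B_TRUE);
}
```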

Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes openzfs#8424
ryao pushed a commit that referenced this pull request Mar 16, 2019
While ZFS allows renaming of in-use ZVOLs at the DSL level without issues,
the ZVOL layer does not correctly update the renamed dataset if the
device node is open (zv->zv_open_count > 0): trying to access the stale
dataset name, for instance during a zfs receive, will cause the
following failure:

VERIFY3(zv->zv_objset->os_dsl_dataset->ds_owner == zv) failed ((null) == ffff8800dbb6fc00)
PANIC at zvol.c:1255:zvol_resume()
Showing stack for process 1390
CPU: 0 PID: 1390 Comm: zfs Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000000 ffffffff8151ea00 ffffffffa0758a80 ffff88028aefba30
 ffffffffa0417219 ffff880037179220 ffffffff00000030 ffff88028aefba40
 ffff88028aefb9e0 2833594649524556 6f5f767a3e2d767a 6f3e2d7465736a62
Call Trace:
 [<0>] ? dump_stack+0x5d/0x78
 [<0>] ? spl_panic+0xc9/0x110 [spl]
 [<0>] ? mutex_lock+0xe/0x2a
 [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs]
 [<0>] ? rrw_exit+0xc8/0x2e0 [zfs]
 [<0>] ? mutex_lock+0xe/0x2a
 [<0>] ? dmu_objset_from_ds+0x9a/0x250 [zfs]
 [<0>] ? dmu_objset_hold_flags+0x71/0xc0 [zfs]
 [<0>] ? zvol_resume+0x178/0x280 [zfs]
 [<0>] ? zfs_ioc_recv_impl+0x88b/0xf80 [zfs]
 [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs]
 [<0>] ? zfs_ioc_recv+0x1c2/0x2a0 [zfs]
 [<0>] ? dmu_buf_get_user+0x13/0x20 [zfs]
 [<0>] ? __alloc_pages_nodemask+0x166/0xb50
 [<0>] ? zfsdev_ioctl+0x896/0x9c0 [zfs]
 [<0>] ? handle_mm_fault+0x464/0x1140
 [<0>] ? do_vfs_ioctl+0x2cf/0x4b0
 [<0>] ? __do_page_fault+0x177/0x410
 [<0>] ? SyS_ioctl+0x81/0xa0
 [<0>] ? async_page_fault+0x28/0x30
 [<0>] ? system_call_fast_compare_end+0x10/0x15

Reviewed by: Tom Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: loli10K <[email protected]>
Closes openzfs#6263 
Closes openzfs#8371
ryao pushed a commit that referenced this pull request Mar 16, 2019
Booting debug kernel found an inconsistent lock dependency between
dataset's ds_lock and its directory's dd_lock.

[ 32.215336] ======================================================
[ 32.221859] WARNING: possible circular locking dependency detected
[ 32.221861] 4.14.90+ #8 Tainted: G           O
[ 32.221862] ------------------------------------------------------
[ 32.221863] dynamic_kernel_/4667 is trying to acquire lock:
[ 32.221864]  (&ds->ds_lock){+.+.}, at: [<ffffffffc10a4bde>] dsl_dataset_check_quota+0x9e/0x8a0 [zfs]
[ 32.221941] but task is already holding lock:
[ 32.221941]  (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs]
[ 32.221983] which lock already depends on the new lock.
[ 32.221983] the existing dependency chain (in reverse order) is:
[ 32.221984] -> #1 (&dd->dd_lock){+.+.}:
[ 32.221992] 	__mutex_lock+0xef/0x14c0
[ 32.222049] 	dsl_dir_namelen+0xd4/0x2d0 [zfs]
[ 32.222093] 	dsl_dataset_namelen+0x2f1/0x430 [zfs]
[ 32.222142] 	verify_dataset_name_len+0xd/0x40 [zfs]
[ 32.222184] 	dmu_objset_find_dp_impl+0x5f5/0xef0 [zfs]
[ 32.222226] 	dmu_objset_find_dp_cb+0x40/0x60 [zfs]
[ 32.222235] 	taskq_thread+0x969/0x1460 [spl]
[ 32.222238] 	kthread+0x2fb/0x400
[ 32.222241] 	ret_from_fork+0x3a/0x50

[ 32.222241] -> #0 (&ds->ds_lock){+.+.}:
[ 32.222246] 	lock_acquire+0x14f/0x390
[ 32.222248] 	__mutex_lock+0xef/0x14c0
[ 32.222291] 	dsl_dataset_check_quota+0x9e/0x8a0 [zfs]
[ 32.222355] 	dsl_dir_tempreserve_space+0x5d2/0x1290 [zfs]
[ 32.222392] 	dmu_tx_assign+0xa61/0xdb0 [zfs]
[ 32.222436] 	zfs_create+0x4e6/0x11d0 [zfs]
[ 32.222481] 	zpl_create+0x194/0x340 [zfs]
[ 32.222484] 	lookup_open+0xa86/0x16f0
[ 32.222486] 	path_openat+0xe56/0x2490
[ 32.222488] 	do_filp_open+0x17f/0x260
[ 32.222490] 	do_sys_open+0x195/0x310
[ 32.222491] 	SyS_open+0xbf/0xf0
[ 32.222494] 	do_syscall_64+0x191/0x4f0
[ 32.222496] 	entry_SYSCALL_64_after_hwframe+0x42/0xb7

[ 32.222497] other info that might help us debug this:

[ 32.222497] Possible unsafe locking scenario:
[ 32.222498] CPU0 			CPU1
[ 32.222498] ---- 			----
[ 32.222499] lock(&dd->dd_lock);
[ 32.222500] 				lock(&ds->ds_lock);
[ 32.222502] 				lock(&dd->dd_lock);
[ 32.222503] lock(&ds->ds_lock);
[ 32.222504] *** DEADLOCK ***
[ 32.222505] 3 locks held by dynamic_kernel_/4667:
[ 32.222506] #0: (sb_writers#9){.+.+}, at: [<ffffffffaf68933c>] mnt_want_write+0x3c/0xa0
[ 32.222511] #1: (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffaf652cde>] path_openat+0xe2e/0x2490
[ 32.222515] #2: (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs]

The issue is caused by dsl_dataset_namelen() holding ds_lock, followed by
acquiring dd_lock on ds->ds_dir in dsl_dir_namelen().

However, ds->ds_dir should not be protected by ds_lock, so releasing it
before the call to dsl_dir_namelen() prevents the lockdep issue.
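
A simplified sketch of the reordering (the real dsl_dataset_namelen() also
refreshes the snapshot name first):

```c
int
dsl_dataset_namelen(dsl_dataset_t *ds)
{
	int len;

	/* Hold ds_lock only while reading the snapshot name. */
	mutex_enter(&ds->ds_lock);
	len = strlen(ds->ds_snapname);
	mutex_exit(&ds->ds_lock);

	/* dd_lock is now acquired without ds_lock held. */
	len += dsl_dir_namelen(ds->ds_dir) + (len > 0 ? 1 : 0);

	return (len);
}
```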

Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Chris Dunlop <[email protected]>
Signed-off-by:  Michael Zhivich <[email protected]>
Closes openzfs#8413
ryao pushed a commit that referenced this pull request Aug 1, 2019
lockdep reports a possible recursive lock in dbuf_destroy.

It is true that dbuf_destroy is acquiring the dn_dbufs_mtx
on one dnode while holding it on another dnode.  However,
it is impossible for these to be the same dnode because,
among other things, dbuf_destroy checks MUTEX_HELD before
acquiring the mutex.

This fix defines a class NESTED_SINGLE == 1 and changes
that lock to call mutex_enter_nested with a subclass of
NESTED_SINGLE.

In order to make the userspace code compile,
include/sys/zfs_context.h now defines mutex_enter_nested and
NESTED_SINGLE.
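
A hedged sketch of the two pieces described above (close to, but not
necessarily identical with, the committed change):

```c
/* include/sys/zfs_context.h: userspace shims so libzpool still builds */
#define	NESTED_SINGLE				1
#define	mutex_enter_nested(mp, subclass)	mutex_enter(mp)

/*
 * dbuf_destroy(): the parent dnode's list lock is taken with the
 * NESTED_SINGLE subclass so lockdep treats it as a distinct instance
 * of the same lock class rather than a recursive acquisition.
 */
mutex_enter_nested(&dn->dn_dbufs_mtx, NESTED_SINGLE);
```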

This is the lockdep report:

[  122.950921] ============================================
[  122.950921] WARNING: possible recursive locking detected
[  122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G           O
[  122.950921] --------------------------------------------
[  122.950921] dbu_evict/1457 is trying to acquire lock:
[  122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]
               but task is already holding lock:
[  122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               other info that might help us debug this:
[  122.950921]  Possible unsafe locking scenario:

[  122.950921]        CPU0
[  122.950921]        ----
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]
                *** DEADLOCK ***

[  122.950921]  May be due to missing lock nesting notation

[  122.950921] 1 lock held by dbu_evict/1457:
[  122.950921]  #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               stack backtrace:
[  122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G           O      4.19.29-4.19.0-debug-d69edad5368c1166 #1
[  122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011  03/13/2009
[  122.950921] Call Trace:
[  122.950921]  dump_stack+0x91/0xeb
[  122.950921]  __lock_acquire+0x2ca7/0x4f10
[  122.950921]  lock_acquire+0x153/0x330
[  122.950921]  dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]  dbuf_evict_one+0x1cc/0x3d0 [zfs]
[  122.950921]  dbuf_rele_and_unlock+0xb84/0xd60 [zfs]
[  122.950921]  dnode_evict_dbufs+0x3a6/0x740 [zfs]
[  122.950921]  dmu_objset_evict+0x7a/0x500 [zfs]
[  122.950921]  dsl_dataset_evict_async+0x70/0x480 [zfs]
[  122.950921]  taskq_thread+0x979/0x1480 [spl]
[  122.950921]  kthread+0x2e7/0x3e0
[  122.950921]  ret_from_fork+0x27/0x50

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Jeff Dike <[email protected]>
Closes openzfs#8984
ryao pushed a commit that referenced this pull request Jun 29, 2020
After spa_vdev_remove_aux() is called, the config nvlist is no longer
valid, as it's been replaced by the new one (with the specified device
removed).  Therefore any pointers into the nvlist are no longer valid.
So we can't save the result of
`fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the
call to spa_vdev_remove_aux().

Instead, use spa_strdup() to save a copy of the string before calling
spa_vdev_remove_aux.
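
A sketch of the pattern, simplified from the spare-removal path in
spa_vdev_remove() (surrounding variables omitted):

```c
	/* Copy the path out before the config nvlist is replaced. */
	char *vd_path = spa_strdup(fnvlist_lookup_string(nv,
	    ZPOOL_CONFIG_PATH));

	spa_vdev_remove_aux(spa->spa_spares.sav_config,
	    ZPOOL_CONFIG_SPARES, spares, nspares, nv);

	/* vd_path is still valid here, unlike a pointer into the old nvlist. */
	spa_history_log_internal(spa, "vdev remove", NULL,
	    "%s vdev (%s) %s", spa_name(spa), "spare", vd_path);

	spa_strfree(vd_path);
```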

Found by AddressSanitizer:

ERROR: AddressSanitizer: heap-use-after-free on address ...
READ of size 34 at 0x608000a1fcd0 thread T686
    #0 0x7fe88b0c166d  (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d)
    #1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447
    #2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259
    #3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    #4 0x55ffbc769fba in ztest_execute ztest.c:6714
    #5 0x55ffbc779a90 in ztest_thread ztest.c:6761
    #6 0x7fe889cbc6da in start_thread
    #7 0x7fe8899e588e in __clone

0x608000a1fcd0 is located 48 bytes inside of 88-byte region
freed by thread T686 here:
    #0 0x7fe88b14e7b8 in __interceptor_free
    #1 0x7fe88ae541c5 in nvlist_free nvpair.c:874
    #2 0x7fe88ae543ba in nvpair_free nvpair.c:844
    #3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978
    #4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185
    #5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221
    #6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    #7 0x55ffbc769fba in ztest_execute ztest.c:6714
    #8 0x55ffbc779a90 in ztest_thread ztest.c:6761
    #9 0x7fe889cbc6da in start_thread

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#9706
ryao pushed a commit that referenced this pull request Sep 28, 2022
`zpool_do_import()` passes `argv[0]`, (optionally) `argv[1]`, and
`pool_specified` to `import_pools()`.  If `pool_specified==FALSE`, the
`argv[]` arguments are not used.  However, these values may be off the
end of the `argv[]` array, so loading them could dereference unmapped
memory.  This error is reported by the asan build:

```
=================================================================
==6003==ERROR: AddressSanitizer: heap-buffer-overflow
READ of size 8 at 0x6030000004a8 thread T0
    #0 0x562a078b50eb in zpool_do_import zpool_main.c:3796
    #1 0x562a078858c5 in main zpool_main.c:10709
    #2 0x7f5115231bf6 in __libc_start_main
    #3 0x562a07885eb9 in _start

0x6030000004a8 is located 0 bytes to the right of 24-byte region
allocated by thread T0 here:
    #0 0x7f5116ac6b40 in __interceptor_malloc
    #1 0x562a07885770 in main zpool_main.c:10699
    #2 0x7f5115231bf6 in __libc_start_main
```

This commit passes NULL for these arguments if they are off the end
of the `argv[]` array.
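
A minimal sketch of the guard (variable names are illustrative; the values
are then handed to import_pools() in place of the raw argv entries):

```c
	/* Only dereference argv[] entries that actually exist. */
	const char *searchname = (argc > 0) ? argv[0] : NULL;
	const char *searchguid = (argc > 1) ? argv[1] : NULL;
```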

Reviewed-by: George Wilson <[email protected]>
Reviewed-by: John Kennedy <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#12339
ryao pushed a commit that referenced this pull request Oct 17, 2022
Before this patch, in zfs_domount, if zfs_root or d_make_root fails, we
leave zfsvfs != NULL. This will lead to execution of the error handling
`if` statement at the `out` label, and hence to a call to
dmu_objset_disown and zfsvfs_free.

However, zfs_umount, which we call upon failure of zfs_root and
d_make_root, already does dmu_objset_disown and zfsvfs_free.
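
A hedged sketch of the kind of guard the patch adds (paraphrased, not the
literal diff):

```c
	error = zfs_root(zfsvfs, &root_inode);
	if (error) {
		(void) zfs_umount(sb);
		zfsvfs = NULL;	/* zfs_umount() already disowned and freed it */
		goto out;
	}
```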

I suppose this patch rather adds to the brittleness of this part of the
code base, but I don't want to invest more time in this right now.
To add a regression test, we'd need some kind of fault injection
facility for zfs_root or d_make_root, which doesn't exist right now.
And even then, I think that regression test would be too closely tied
to the implementation.

To repro the double-disown / double-free, do the following:
1. patch zfs_root to always return an error
2. mount a ZFS filesystem

Here's the stack trace you would see then:

  VERIFY3(ds->ds_owner == tag) failed (0000000000000000 == ffff9142361e8000)
  PANIC at dsl_dataset.c:1003:dsl_dataset_disown()
  Showing stack for process 28332
  CPU: 2 PID: 28332 Comm: zpool Tainted: G           O      5.10.103-1.nutanix.el7.x86_64 #1
  Call Trace:
   dump_stack+0x74/0x92
   spl_dumpstack+0x29/0x2b [spl]
   spl_panic+0xd4/0xfc [spl]
   dsl_dataset_disown+0xe9/0x150 [zfs]
   dmu_objset_disown+0xd6/0x150 [zfs]
   zfs_domount+0x17b/0x4b0 [zfs]
   zpl_mount+0x174/0x220 [zfs]
   legacy_get_tree+0x2b/0x50
   vfs_get_tree+0x2a/0xc0
   path_mount+0x2fa/0xa70
   do_mount+0x7c/0xa0
   __x64_sys_mount+0x8b/0xe0
   do_syscall_64+0x38/0x50
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
Signed-off-by: Christian Schwarz <[email protected]>
Closes openzfs#14025
ryao pushed a commit that referenced this pull request Feb 24, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    #7 0xffffffff80669841 at vgone+0x31
    #8 0xffffffff8065806d at vfs_hash_insert+0x26d
    #9 0xffffffff81a39069 at sfs_vgetx+0x149
    #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #11 0xffffffff8065a28c at lookup+0x45c
    #12 0xffffffff806594b9 at namei+0x259
    #13 0xffffffff80676a33 at kern_statat+0xf3
    #14 0xffffffff8067712f at sys_fstatat+0x2f
    #15 0xffffffff808ae50c at amd64_syscall+0x10c
    #16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.
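
A hedged sketch of the vop_vector entries involved (abridged; the real
zfsctl_ops_snapshot defines more operations, and the open/close handler names
here are assumptions):

```c
static struct vop_vector zfsctl_ops_snapshot = {
	.vop_default =	&default_vnodeops,
	.vop_open =	zfsctl_common_open,	/* added for consistency */
	.vop_close =	zfsctl_common_close,	/* called by vgonel() for active vnodes */
	.vop_inactive =	zfsctl_snapshot_inactive,
	.vop_reclaim =	zfsctl_snapshot_reclaim,
};
```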

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    #8 0xffffffff81a39069 at sfs_vgetx+0x149
    #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #10 0xffffffff80661c2c at lookup+0x45c
    #11 0xffffffff80660e59 at namei+0x259
    #12 0xffffffff8067e3d3 at kern_statat+0xf3
    #13 0xffffffff8067eacf at sys_fstatat+0x2f
    #14 0xffffffff808b5ecc at amd64_syscall+0x10c
    #15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <[email protected]>
Reviewed-by: Mateusz Guzik <[email protected]>
Reviewed-by: Alek Pinchuk <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Rob Wing <[email protected]>
Co-authored-by: Rob Wing <[email protected]>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501