Adding slog to another pool's ZVOL will deadlock #1131

Closed
lundman opened this issue Dec 5, 2012 · 25 comments

Labels: Type: Feature

@lundman commented Dec 5, 2012

I am creating this ticket because this is something you can do on Solaris 11 but cannot in ZoL. We can then argue about whether it is worth fixing, useful, and all that.

In my case, I have a root pool 'rpool' which is on an SSD. I have a much larger data pool on HDDs, called 'zpool'.

Instead of using legacy partitions to carve out space for the SLOG on the SSD, I created a ZVOL for it:

NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
rpool  29.8G  5.65G  24.1G    18%  1.00x  ONLINE  -
zpool  4.53T  3.20T  1.33T    70%  1.00x  ONLINE  -

    NAME                        STATE     READ WRITE CKSUM
    zpool                       ONLINE       0     0     0
      raidz1-0                  ONLINE       0     0     0
        c10t5d0                 ONLINE       0     0     0
        c10t4d0                 ONLINE       0     0     0
        c10t3d0                 ONLINE       0     0     0
        c10t2d0                 ONLINE       0     0     0
        c10t1d0                 ONLINE       0     0     0
    logs
      /dev/zvol/dsk/rpool/slog  ONLINE       0     0     0

zfs get all rpool/slog
rpool/slog  volsize                         2G  

This is on Solaris 11.

Doing a test setup on ZoL rc12, the following happens:

zpool create -f mypool ~/src/pool-image.bin
zpool create -f data      ~/src/pool-image2.bin 
zfs create -V 100M mypool/slog
zpool add data log /dev/zd0

Which results in:

Dec  5 12:19:49 zfsdev kernel: [  360.288199] INFO: task txg_sync:1932 blocked for more than 120 seconds.
Dec  5 12:19:49 zfsdev kernel: [  360.289585] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  5 12:19:49 zfsdev kernel: [  360.289966] txg_sync        D ffff8800603139c0     0  1932      2 0x00000000
Dec  5 12:19:49 zfsdev kernel: [  360.289970]  ffff880059cfbc10 0000000000000046 ffff880059d00000 ffff880059cfbfd8
Dec  5 12:19:49 zfsdev kernel: [  360.289973]  ffff880059cfbfd8 ffff880059cfbfd8 ffff88005a650000 ffff880059d00000
Dec  5 12:19:49 zfsdev kernel: [  360.289975]  ffff880059cfbc20 ffff880051f23ae8 ffff880051f23ab8 ffff880051f23af0
Dec  5 12:19:49 zfsdev kernel: [  360.289978] Call Trace:
Dec  5 12:19:49 zfsdev kernel: [  360.289986]  [<ffffffff81682189>] schedule+0x29/0x70
Dec  5 12:19:49 zfsdev kernel: [  360.289998]  [<ffffffffa012dd1c>] cv_wait_common+0x9c/0x190 [spl]
Dec  5 12:19:49 zfsdev kernel: [  360.290002]  [<ffffffff810768a0>] ? finish_wait+0x80/0x85
Dec  5 12:19:49 zfsdev kernel: [  360.290007]  [<ffffffffa012de43>] __cv_wait+0x13/0x20 [spl]
Dec  5 12:19:49 zfsdev kernel: [  360.290038]  [<ffffffffa021ae2b>] spa_config_enter+0xeb/0x100 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.290057]  [<ffffffffa020fdb4>] spa_sync+0x94/0x9e0 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.290060]  [<ffffffff810a1bb0>] ? ktime_get_ts+0xb0/0xf0
Dec  5 12:19:49 zfsdev kernel: [  360.290080]  [<ffffffffa021ecb3>] txg_sync_thread+0x323/0x590 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.290100]  [<ffffffffa021e990>] ? txg_init+0x250/0x250 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.290105]  [<ffffffffa0126c48>] thread_generic_wrapper+0x78/0x90 [spl]
Dec  5 12:19:49 zfsdev kernel: [  360.290109]  [<ffffffffa0126bd0>] ? __thread_create+0x340/0x340 [spl]
Dec  5 12:19:49 zfsdev kernel: [  360.290111]  [<ffffffff81075f03>] kthread+0x93/0xa0
Dec  5 12:19:49 zfsdev kernel: [  360.290114]  [<ffffffff8168c624>] kernel_thread_helper+0x4/0x10
Dec  5 12:19:49 zfsdev kernel: [  360.290116]  [<ffffffff81075e70>] ? kthread_freezable_should_stop+0x70/0x70
Dec  5 12:19:49 zfsdev kernel: [  360.290118]  [<ffffffff8168c620>] ? gs_change+0x13/0x13



Dec  5 12:19:49 zfsdev kernel: [  360.290120] INFO: task zpool:1962 blocked for more than 120 seconds.
Dec  5 12:19:49 zfsdev kernel: [  360.290346] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  5 12:19:49 zfsdev kernel: [  360.290831] zpool           D ffff8800602139c0     0  1962   1321 0x00000000
Dec  5 12:19:49 zfsdev kernel: [  360.290834]  ffff880051ebdb98 0000000000000082 ffff880059d04500 ffff880051ebdfd8
Dec  5 12:19:49 zfsdev kernel: [  360.290836]  ffff880051ebdfd8 ffff880051ebdfd8 ffff880059d1c500 ffff880059d04500
Dec  5 12:19:49 zfsdev kernel: [  360.290838]  ffff880051ebdba8 ffff88005d3cb100 0000000000000001 ffff88005d3cb1b0
Dec  5 12:19:49 zfsdev kernel: [  360.290840] Call Trace:
Dec  5 12:19:49 zfsdev kernel: [  360.290843]  [<ffffffff81682189>] schedule+0x29/0x70
Dec  5 12:19:49 zfsdev kernel: [  360.290849]  [<ffffffffa012720d>] __taskq_wait_id+0x7d/0x150 [spl]
Dec  5 12:19:49 zfsdev kernel: [  360.290852]  [<ffffffff810768a0>] ? finish_wait+0x80/0x85 
Dec  5 12:19:49 zfsdev kernel: [  360.290856]  [<ffffffffa0127333>] __taskq_wait+0x53/0xf0 [spl]
Dec  5 12:19:49 zfsdev kernel: [  360.290861]  [<ffffffffa0127425>] __taskq_destroy+0x55/0x190 [spl]
Dec  5 12:19:49 zfsdev kernel: [  360.290882]  [<ffffffffa0220932>] vdev_open_children+0x82/0xc0 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.290902]  [<ffffffffa022c961>] vdev_root_open+0x51/0xf0 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.290922]  [<ffffffffa0223755>] vdev_open+0xf5/0x480 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.290941]  [<ffffffffa0223b07>] vdev_create+0x27/0x90 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.290960]  [<ffffffffa0211a4a>] spa_vdev_add+0xba/0x2d0 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.290980]  [<ffffffffa0242d1d>] zfs_ioc_vdev_add+0xed/0x130 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.291000]  [<ffffffffa024860d>] zfsdev_ioctl+0xdd/0x1d0 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.291003]  [<ffffffff81193ad9>] do_vfs_ioctl+0x99/0x590
Dec  5 12:19:49 zfsdev kernel: [  360.291005]  [<ffffffff8119eb06>] ? alloc_fd+0xc6/0x110
Dec  5 12:19:49 zfsdev kernel: [  360.291008]  [<ffffffff8116decb>] ? kmem_cache_free+0x7b/0x100
Dec  5 12:19:49 zfsdev kernel: [  360.291010]  [<ffffffff8118cac3>] ? putname+0x33/0x50
Dec  5 12:19:49 zfsdev kernel: [  360.291012]  [<ffffffff81194069>] sys_ioctl+0x99/0xa0
Dec  5 12:19:49 zfsdev kernel: [  360.291015]  [<ffffffff8168b329>] system_call_fastpath+0x16/0x1b
Dec  5 12:19:49 zfsdev kernel: [  360.291016] INFO: task blkid:1967 blocked for more than 120 seconds.
Dec  5 12:19:49 zfsdev kernel: [  360.291295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  5 12:19:49 zfsdev kernel: [  360.291888] blkid           D ffff8800603139c0     0  1967    395 0x00000004
Dec  5 12:19:49 zfsdev kernel: [  360.291891]  ffff880036aa57e8 0000000000000082 ffff880059d01700 ffff880036aa5fd8
Dec  5 12:19:49 zfsdev kernel: [  360.291893]  ffff880036aa5fd8 ffff880036aa5fd8 ffff88005d262e00 ffff880059d01700
Dec  5 12:19:49 zfsdev kernel: [  360.291895]  0000000000000d7f ffffffffa02a3140 ffff880059d01700 ffffffffa02a3144
Dec  5 12:19:49 zfsdev kernel: [  360.291897] Call Trace:
Dec  5 12:19:49 zfsdev kernel: [  360.291900]  [<ffffffff81682189>] schedule+0x29/0x70
Dec  5 12:19:49 zfsdev kernel: [  360.291903]  [<ffffffff8168244e>] schedule_preempt_disabled+0xe/0x10
Dec  5 12:19:49 zfsdev kernel: [  360.291905]  [<ffffffff81680f67>] __mutex_lock_slowpath+0xd7/0x150
Dec  5 12:19:49 zfsdev kernel: [  360.291907]  [<ffffffff81680a3a>] mutex_lock+0x2a/0x50
Dec  5 12:19:49 zfsdev kernel: [  360.291928]  [<ffffffffa0214131>] spa_open_common+0x61/0x3d0 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.291945]  [<ffffffffa01f6807>] ? getcomponent.part.5+0x177/0x190 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.291965]  [<ffffffffa02144b3>] spa_open+0x13/0x20 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.291981]  [<ffffffffa01f8753>] dsl_dir_open_spa+0x533/0x610 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.291985]  [<ffffffff8103fa39>] ? default_spin_lock_flags+0x9/0x10
Dec  5 12:19:49 zfsdev kernel: [  360.291986]  [<ffffffff8103fa39>] ? default_spin_lock_flags+0x9/0x10
Dec  5 12:19:49 zfsdev kernel: [  360.292002]  [<ffffffffa01ed310>] dsl_dataset_hold+0x40/0x2d0 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.292007]  [<ffffffffa0128499>] ? __taskq_create+0x2f9/0x4c0 [spl]
Dec  5 12:19:49 zfsdev kernel: [  360.292025]  [<ffffffffa01fd311>] dsl_prop_get+0x41/0x1a0 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.292044]  [<ffffffffa026ee20>] ? zvol_get_done+0x70/0x70 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.292062]  [<ffffffffa01fd48e>] dsl_prop_get_integer+0x1e/0x20 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.292081]  [<ffffffffa026ed16>] zvol_open+0x1c6/0x260 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.292084]  [<ffffffff811b9f40>] __blkdev_get+0xd0/0x4b0
Dec  5 12:19:49 zfsdev kernel: [  360.292086]  [<ffffffff811ba4cd>] blkdev_get+0x1ad/0x300
Dec  5 12:19:49 zfsdev kernel: [  360.292088]  [<ffffffff816830ae>] ? _raw_spin_lock+0xe/0x20
Dec  5 12:19:49 zfsdev kernel: [  360.292090]  [<ffffffff811ba67f>] blkdev_open+0x5f/0x90
Dec  5 12:19:49 zfsdev kernel: [  360.292093]  [<ffffffff811802df>] __dentry_open+0x21f/0x330
Dec  5 12:19:49 zfsdev kernel: [  360.292095]  [<ffffffff811ba620>] ? blkdev_get+0x300/0x300
Dec  5 12:19:49 zfsdev kernel: [  360.292097]  [<ffffffff8118042a>] vfs_open+0x3a/0x40
Dec  5 12:19:49 zfsdev kernel: [  360.292099]  [<ffffffff811813d8>] nameidata_to_filp+0x58/0xb0
Dec  5 12:19:49 zfsdev kernel: [  360.292100]  [<ffffffff8118fd0f>] do_last+0x49f/0xa10
Dec  5 12:19:49 zfsdev kernel: [  360.292104]  [<ffffffff812ec8bc>] ? apparmor_file_alloc_security+0x2c/0x60
Dec  5 12:19:49 zfsdev kernel: [  360.292106]  [<ffffffff81191569>] path_openat+0xd9/0x430
Dec  5 12:19:49 zfsdev kernel: [  360.292108]  [<ffffffff811919e1>] do_filp_open+0x41/0xa0
Dec  5 12:19:49 zfsdev kernel: [  360.292110]  [<ffffffff8119eb06>] ? alloc_fd+0xc6/0x110
Dec  5 12:19:49 zfsdev kernel: [  360.292111]  [<ffffffff81181525>] do_sys_open+0xf5/0x230
Dec  5 12:19:49 zfsdev kernel: [  360.292113]  [<ffffffff81181681>] sys_open+0x21/0x30
Dec  5 12:19:49 zfsdev kernel: [  360.292115]  [<ffffffff8168b329>] system_call_fastpath+0x16/0x1b
Dec  5 12:19:49 zfsdev kernel: [  360.292116] INFO: task vdev_open/0:1968 blocked for more than 120 seconds.
Dec  5 12:19:49 zfsdev kernel: [  360.292116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  5 12:19:49 zfsdev kernel: [  360.292116] vdev_open/0     D ffff8800602139c0     0  1968      2 0x00000000
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  ffff880036a3bb30 0000000000000046 ffff880059d1c500 ffff880036a3bfd8
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  ffff880036a3bfd8 ffff880036a3bfd8 ffffffff81c13460 ffff880059d1c500
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  0000000000000000 ffff88005c563758 ffff880059d1c500 ffff88005c56375c
Dec  5 12:19:49 zfsdev kernel: [  360.292116] Call Trace:
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff81682189>] schedule+0x29/0x70
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff8168244e>] schedule_preempt_disabled+0xe/0x10
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff81680f67>] __mutex_lock_slowpath+0xd7/0x150
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff81680a3a>] mutex_lock+0x2a/0x50
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff811b9edb>] __blkdev_get+0x6b/0x4b0
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff811ba530>] blkdev_get+0x210/0x300
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff811a0444>] ? mntput+0x24/0x40
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff8118c662>] ? path_put+0x22/0x30
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff811ba75d>] blkdev_get_by_path+0x3d/0x90
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffffa02258c8>] vdev_disk_open+0xe8/0x3f0 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff8101257b>] ? __switch_to+0x12b/0x420
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffffa0223755>] vdev_open+0xf5/0x480 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff81681a4f>] ? __schedule+0x3cf/0x7c0
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffffa02243f6>] vdev_open_child+0x26/0x40 [zfs]
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffffa012779c>] taskq_thread+0x23c/0x5a0 [spl]
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff81083daa>] ? finish_task_switch+0x4a/0xf0
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff81087cb0>] ? try_to_wake_up+0x2a0/0x2a0
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffffa0127560>] ? __taskq_destroy+0x190/0x190 [spl]
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff81075f03>] kthread+0x93/0xa0
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff8168c624>] kernel_thread_helper+0x4/0x10
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff81075e70>] ? kthread_freezable_should_stop+0x70/0x70
Dec  5 12:19:49 zfsdev kernel: [  360.292116]  [<ffffffff8168c620>] ? gs_change+0x13/0x13

And the same commands on Solaris 11:

zpool create -f mypool ~lundman/src/pool-image.bin
zpool create -f data ~lundman/src/pool-image2.bin
zfs create -V 100M mypool/slog
zpool add data log /dev/zvol/dsk/mypool/slog 
#copy some files
zpool list
NAME      SIZE  ALLOC   FREE  CAP  DEDUP   HEALTH  ALTROOT
data     1016M  2.32M  1014M   0%  1.00x   ONLINE  -
zpool status data
  pool: data
 state: ONLINE
  scan: none requested
 config:

    NAME                                 STATE     READ WRITE CKSUM
    data                                 ONLINE       0     0     0
      /home/lundman/src/pool-image2.bin  ONLINE       0     0     0
    logs
      /dev/zvol/dsk/mypool/slog          ONLINE       0     0     0

@lundman commented Dec 5, 2012

This is probably related to #612. (Sorry, I should have looked harder.)

@lundman commented Dec 5, 2012

Actually, as a slightly amusing side note, I performed the following:

Solaris 11

# zpool create -f -o version=28 -O version=5 mypool  ~lundman/src/pool-image.bin
# zpool create -f -o version=28 -O version=5 data  ~lundman/src/pool-image2.bin
# zfs create -V 100M mypool/slog

Zero it, just for binary comparison
# dd if=/dev/zero of=/dev/zvol/dsk/mypool/slog bs=65536

# zpool add data log /dev/zvol/dsk/mypool/slog
device in use checking failed: I/O error
: invalid vdev configuration

# cd /data
# gtar -cf - ~lundman/src/spl/ | gtar -xvf -

# zpool export data
# zpool export mypool

(That error when adding the slog always happens, but the device appears to be attached anyway.)

At this point, I dd the slog device to file slog, for binary comparison.

Power off Solaris, and boot Linux.

Linux x64

# zpool import -d ~/src/ 
   pool: data
     id: 6976256868752547775
  state: DEGRADED
 status: The pool was last accessed by another system.
 action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.
   see: http://zfsonlinux.org/msg/ZFS-8000-EY
 config:

        data                                DEGRADED
          /media/sf_zfssrc/pool-image2.bin  ONLINE
        logs
          slog                              UNAVAIL

   pool: mypool
          /media/sf_zfssrc/pool-image.bin  ONLINE


# zpool import -d ~/src/ mypool
# zpool import -f -d ~/src/ data

# zpool status
  pool: data
 state: DEGRADED
        NAME                                STATE     READ WRITE CKSUM
        data                                DEGRADED     0     0     0
          /media/sf_zfssrc/pool-image2.bin  ONLINE       0     0     0
        logs
          10102695017746293181              UNAVAIL      0     0     0  was /media/sf_zfssrc/slog


# zpool replace data 10102695017746293181 /dev/zd0
/dev/zd0 is part of exported pool 'data'

# zpool replace -f data 10102695017746293181 /dev/zd0
        NAME                                STATE     READ WRITE CKSUM
        data                                ONLINE       0     0     0
          /media/sf_zfssrc/pool-image2.bin  ONLINE       0     0     0
        logs
          zd0                               ONLINE       0     0     0

and.. it is running! \o/

Let's tar something more to it (to hopefully make slog write more records)

# tar -cf - ~/src/zfs/ | tar -xvf -

and dd /dev/zd0 out, to 'slog2' this time. There are some diffs; I am showing the last few lines (before all zeros) in both cases, and it is encouraging that Linux has moved further along.

slog : Solaris

00069490  e9 92 f9 21 23 05 00 00  62 ea f7 1e df f2 00 00  |...!#...b.......|
000694a0  52 0c 99 48 3c f8 1e 00  22 00 00 00 00 00 00 00  |R..H<...".......|
000694b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000697d0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
000697e0  0d 57 0a dc e1 81 1a 30  34 aa 15 4b a4 6a 4a 95  |.W.....04..K.jJ.|
000697f0  a2 68 a4 12 64 3a 65 64  1a 34 80 44 b7 6c 11 cb  |.h..d:ed.4.D.l..|
00069800  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0007ffd0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
0007ffe0  e5 25 32 9f c9 6f fb 31  52 9a ba cc 40 1e 30 59  |.%[email protected]|
0007fff0  52 1b 3a 30 dc 8f 23 02  32 c5 3e 33 4b 1b ce a5  |R.:0..#.2.>3K...|
00080000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

slog2 : Linux

00078480  2c 00 00 00 00 00 00 00  2a b2 4e e0 11 00 00 00  |,.......*.N.....|
00078490  d7 3c 02 d7 56 06 00 00  e5 2e 5d d8 1c 2a 01 00  |.<..V.....]..*..|
000784a0  7a 9b a6 17 24 cc 25 00  1e 00 00 00 00 00 00 00  |z...$.%.........|
000784b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000787d0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
000787e0  51 78 8b a4 b3 49 fc f4  24 24 5e 5b e6 e9 c5 7d  |Qx...I..$$^[...}|
000787f0  11 ed 71 57 90 fd cf f9  a7 21 ce d1 fd 87 a3 00  |..qW.....!......|
00078800  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0007ffd0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
0007ffe0  e5 89 4a cc be 71 63 5b  2d c8 28 1b d5 d8 da 26  |..J..qc[-.(....&|
0007fff0  a0 5b d4 ce 87 1e 81 91  e7 17 54 5f ee 3e 20 21  |.[........T_.> !|
00080000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

This would suggest the deadlocking problem is actually in the attaching of the device, and (possibly) not a problem with run-time usage.

It would also be nice to really confirm it does use the ZVOL for slog.
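
One encouraging hint, for what it is worth: the repeated 8-byte pattern 11 7a 0c b1 7a da 10 02 in both dumps matches the little-endian encoding of ZEC_MAGIC, the zio block-tail magic that ZFS stamps (together with an embedded checksum) at the end of each ZIL block, which would mean real log records are landing on the zvol. The constant is from zio.h; the interpretation is an inference, not something verified in this thread. A quick sanity check of the byte order:

#include <stdio.h>
#include <stdint.h>

#define ZEC_MAGIC 0x210da7ab10c7a11ULL	/* zio block tail magic, from zio.h */

int
main(void)
{
	uint64_t m = ZEC_MAGIC;
	unsigned char *p = (unsigned char *)&m;
	int i;

	/* on a little-endian machine this prints:
	 * 11 7a 0c b1 7a da 10 02 -- the pattern seen in both dumps */
	for (i = 0; i < 8; i++)
		printf("%02x ", p[i]);
	printf("\n");

	return (0);
}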

@ryao commented Dec 5, 2012

@lundman This likely is the same as issue #612, although it does not hurt to have a second test case to consider.

@behlendorf commented

> This would suggest the deadlocking problem is actually in the attaching of the device, and (possibly) not a problem with run-time usage.

I agree. Based on the stack you posted, the deadlock appears to be down the zfs attach path. So if this deadlock were resolved you could likely use ZFS like this. I'm not sure you would want to... but you could.

@atonkyra commented Dec 5, 2012

It might be worth noting that at least on OpenIndiana (illumos), the ZFS code is (or at least was) prone to deadlocks if pool A's resources are used in pool B. Not sure if that applies here.

(http://mail.opensolaris.org/pipermail/zfs-discuss/2010-July/042806.html)

@lundman commented Dec 5, 2012

Excellent, a discussion about this! Personally I think a pool inside a pool is a little crazy, but having the slog in a zvol (in a different pool, presumably the root pool) is very similar to 'swap in a zvol'. (I actually thought Sun stops you from putting a pool inside a pool; I will test this in the VM.)

Most of the discussion on Solaris and illumos appears to be that people are willing to say it could lead to deadlocks, as nobody has really tried it. And yet there do not appear to be any examples where deadlocks have actually happened. The closest is the chap who lost his rpool (which means you cannot boot anyway) and had no way to import his data pool due to the missing slog device.

But that issue has been addressed with the 'import -m' option. There is no reason you could not rebuild your rpool and keep the data on your data pool. So nobody wants to go out on a limb to say this setup does work; that is OK.

Half the reason why having a slice for the slog is 'not a big deal' stems from the fact that Solaris has to boot from a slice on a VTOC label. But this is no longer true for Linux: you can boot from EFI labels, and FreeBSD already lets you boot from raidz. We should celebrate the death of legacy partitions. Nobody wants to have to carve out a slice for swap. Nobody wants to do the same for the slog.

Anyway, it is a project to keep me busy :)

@atonkyra commented Dec 5, 2012

I actually tried this model on an OI test box. When I did tests like hot-removing the ZIL slog, and yanking drives out in general, I managed to get my pool into a state where I could no longer remove the ZIL slog device using the zfs tools. (The device was permanently stuck in the pool and could only be offlined, not replaced or removed.)

So if you have a test box running solaris you might want to try some "yanking experiments" :)

In this configuration I had an rpool of two SSDs in a mirror, which also housed a zvol for the 'data' pool's ZIL. The actual data pool was simply two SATA drives in a mirror.

@lundman commented Dec 5, 2012

I do indeed have test machines. Just to confirm: that you made it hang, or needed a reboot, does not concern me. But did you get to a state where you could not import your data pool using the regular zfs tools and a reboot-and-import? I just racked a 30-SSD storage server that is not in use until Q1, so I will attack that.

Although I would use Sol 11 or Sol 10, since OI is lagging. Hmm, I guess I could try them all; it's just cold in the data centre :)

@atonkyra commented Dec 5, 2012

I think it actually may have imported fine, but managing devices in the pool was a pain. I tried it out about 7 months ago, so it's not exactly fresh in my mind. If I recall correctly, during my testing I did not have problems importing the pool, though sometimes I had to import it with the slog device destroyed from the rpool (and therefore discard the transactions that were in the ZIL before the reboot).

But I clearly remember having problems with pool management after some yanking experiments to test the redundancies :)

@lundman commented Dec 6, 2012

But that sounds like ZFS doing exactly its job: never losing data, and never stopping you from getting at it (importing the pool), no matter how nastily you yank drives. That you might sometimes have to 'let go of the slog' device is also true with the slog on a raw SSD, i.e. when you simply lose the SSD (interestingly, this happens the most at work on storage devices).

Having to do some "juggling" before you can get at your data (i.e. rpool/slog has to be available before the data import), which is fairly typical, as well as the possible performance hit from using a slog on a ZVOL, is totally worth it compared to that moment when you realise you should have made that partition a different size. At work we dedicate two whole SSDs to just the slog, but at home that is not always possible.

Anyway, it does not sound like anyone is getting out the pitchforks and tar just yet, so I will dig deeper and see if I can get lucky with the deadlock problem.

@atonkyra commented Dec 6, 2012

I'm very interested in any results you might get; please report back on your findings :)

@lundman commented Dec 7, 2012

OK, I have three spare X4540s that are to be returned next month, so I will do some yank-tests.

Meanwhile, looking at the problem of Linux using a log in a zvol: chasing the code down, it ends up in vdev_open_children(), which is supposed to do the opens in a single thread when any of the children are ZVOLs, so that all locks are held by the same thread.

This code contains:

    /*
     * in order to handle pools on top of zvols, do the opens
     * in a single thread so that the same thread holds the
     * spa_namespace_lock
     */
    if (vdev_uses_zvols(vd)) {
        for (c = 0; c < children; c++)
            vd->vdev_child[c]->vdev_open_error =
                vdev_open(vd->vdev_child[c]);
        return;
    }

    /* ... other stuff, ending with ... */
    taskq_destroy(tq);
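
The elided "other stuff" is essentially the taskq dispatch path (quoted from memory, so treat it as a sketch rather than the exact source); it shows how taskq_destroy() ends up on the stack:

    tq = taskq_create("vdev_open", children, minclsyspri,
        children, children, TASKQ_PREPOPULATE);

    for (c = 0; c < children; c++)
        VERIFY(taskq_dispatch(tq, vdev_open_child,
            vd->vdev_child[c], TQ_SLEEP) != 0);

    /* taskq_destroy() waits for all dispatched opens to finish,
     * and it is this wait that never returns in the hang above */
    taskq_destroy(tq);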

What is interesting here is that we deadlock in taskq_destroy(), which should never even be called: for a zvol child, the function should have returned before the taskq path is reached.

Naturally, I go and check out vdev_uses_zvols():

/*
 * Stacking zpools on top of zvols is unsupported until we implement a method
 * for determining if an arbitrary block device is a zvol without using the
 * path.  Solaris would check the 'zvol' path component but this does not
 * exist in the Linux port, so we really should do something like stat the
 * file and check the major number.  This is complicated by the fact that
 * we need to do this portably in user or kernel space.
 */
boolean_t
vdev_uses_zvols(vdev_t *vd)
{
#if 0
    int c;

    if (vd->vdev_path && strncmp(vd->vdev_path, ZVOL_DIR,
        strlen(ZVOL_DIR)) == 0)
        return (B_TRUE);
    for (c = 0; c < vd->vdev_children; c++)
        if (vdev_uses_zvols(vd->vdev_child[c]))
            return (B_TRUE);
#endif
    return (B_FALSE);
}

Ah, that could have something to do with it :)

I throw in the line:

if (vd->vdev_path && strncmp(vd->vdev_path, "/dev/zd0", 8) == 0)
    return (B_TRUE);

and we test:

# zpool add data log /dev/zd0
Dec  7 15:05:17 zfsdev kernel: [ 1911.898863]  uses_zvols '/dev/zd0'

# zpool status
  pool: data
 state: ONLINE
 scan: none requested
config:

        NAME                                 STATE     READ WRITE CKSUM
        data                                 ONLINE       0     0     0
          /home/lundman/src/pool-image2.bin  ONLINE       0     0     0
        logs
          zd0                                ONLINE       0     0     0

Just that easy. Naturally, that is a bit of a hack; as the comment suggests, I should do something like stat() the device and check the major number. If I can figure that out, I will throw a patch your way.

Now, moving on to pools inside a zvol (which even I think is a little weird, and which I thought Solaris prevented), I tested this on Solaris:

# zpool create -o version=30 -O version=5 -f mypool ~lundman/src/pool-image.bin
# zfs create -V 500M mypool/ext
# zpool create -o version=30 -O version=5 -f insidepool /dev/zvol/dsk/mypool/ext 

# zpool status
  pool: insidepool
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        insidepool                  ONLINE       0     0     0
          /dev/zvol/dsk/mypool/ext  ONLINE       0     0     0

Apparently that is also legal! [1]

Interestingly, I cannot import these pools on Linux; it deadlocks. Creating a 'pool inside a zvol' with my above patch also deadlocks. So that part needs extra work.

[1] Implying that Sun does in fact support logs/pools in a ZVOL, regardless of how icky that seems. :)

@lundman commented Dec 7, 2012

Please find patch https://github.com/lundman/zfs-master/commit/76e3e875a3af494378465f4134501db492de1c23 for your perusal.

I am not entirely sure how to stat() something from inside the kernel (not easy to google for), but lookup_bdev() does appear to work. As for the 'pool in zvol' problem, it will sometimes work, which is amusing.
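
The idea boils down to something like the following sketch. The zvol_is_zvol() name matches what I am aiming for; the details (and the assumption that zvol_major, the zvol driver's module parameter, is visible here) are from memory, so treat it as an outline rather than the exact patch:

#include <linux/fs.h>
#include <linux/kdev_t.h>

extern unsigned int zvol_major;     /* major number registered by zvol.c */

/*
 * Given the full path to a block device, decide whether it is a zvol
 * by comparing its major number against the zvol driver's major.
 */
int
zvol_is_zvol(const char *device)
{
    struct block_device *bdev;
    unsigned int major;

    bdev = lookup_bdev(device);     /* takes a reference on success */
    if (IS_ERR(bdev))
        return (0);

    major = MAJOR(bdev->bd_dev);
    bdput(bdev);                    /* drop the reference */

    return (major == zvol_major);
}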

# zpool create -f mypool ~/src/pool-image.bin 
# zpool create -f data ~/src/pool-image2.bin 
# zfs create -V 500M  mypool/ext
# zpool add data log /dev/zd0
# zpool status
  pool: data
 state: ONLINE
config:

        NAME                                 STATE     READ WRITE CKSUM
        data                                 ONLINE       0     0     0
          /home/lundman/src/pool-image2.bin  ONLINE       0     0     0
        logs
          zd0                                ONLINE       0     0     0

I believe this closes #1131.

@lundman commented Dec 7, 2012

Oh, on the 'pool in zvol' thing: if I just try to create it, it will deadlock. But interestingly, if the zvol was previously used as a slog device, it works. Could it be something about the existing label?

# zpool create -f mypool ~/src/pool-image.bin
# zpool create -f data ~/src/pool-image2.bin
# zfs create -V 500M  mypool/ext
# zpool add data log /dev/zd0
# zpool status
        NAME                                 STATE     READ WRITE CKSUM
        data                                 ONLINE       0     0     0
          /home/lundman/src/pool-image2.bin  ONLINE       0     0     0
        logs
          zd0                                ONLINE       0     0     0

# zpool export data
# zpool create -f data /dev/zd0
# zpool status

  pool: data
        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          zd0       ONLINE       0     0     0

  pool: mypool
        NAME                                STATE     READ WRITE CKSUM
        mypool                              ONLINE       0     0     0
          /home/lundman/src/pool-image.bin  ONLINE       0     0     0

# zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
data     492M   872K   491M     0%  1.00x  ONLINE  -
mypool  1016M  2.13M  1014M     0%  1.00x  ONLINE  -

@behlendorf commented

@lundman Nice find! I now vaguely remember disabling this way back when. I've marked up your patch; if you could rework and repush, that would be great.

As for the 'pool in zvol' case, let's open a new issue and track that use case there.

@ryao commented Dec 8, 2012

@behlendorf That is issue #612.

@lundman commented Dec 10, 2012

@behlendorf I made a comment; waiting to hear how you want to solve the vdev_bdev_mode() declaration issue.

Amusingly, using the vdev_bdev_open() route means you can no longer make a pool in a zvol; it fails cleanly.

# zfs create -V 500M  mypool/ext
# zpool create -f data /dev/zd0
cannot open '/dev/zd0': Device or resource busy

Which is better than deadlocking. I might still try to fix this issue, as I have little else to do.

@behlendorf commented

@lundman That's interesting; obviously it is because we open the device exclusively. Still, it would be nice to walk the full call path and see why. But for now, as you say, a clean failure is better than a deadlock any day.
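
For anyone following along, "open the device exclusively" refers to the FMODE_EXCL open done down the vdev_disk_open() path. A minimal sketch of the mechanism, assuming the 3.x-era block-layer API (open_vdev_exclusive is a hypothetical name for illustration, not a function in the tree):

#include <linux/fs.h>

/*
 * With FMODE_EXCL and a holder, a second exclusive open of the same
 * device returns -EBUSY instead of blocking, which is why creating a
 * pool on an in-use zvol now fails cleanly rather than deadlocking.
 */
static struct block_device *
open_vdev_exclusive(const char *path, void *holder)
{
    return (blkdev_get_by_path(path, FMODE_READ | FMODE_WRITE |
        FMODE_EXCL, holder));
}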

@ryao Ahh, of course. Thanks.

@lundman commented Dec 11, 2012

Here is a revised patch for the initial problem:

https://github.com/lundman/zfs-master/commit/a627730cedfcd57c738fe120ca6c55d51a7d156d

Note that I am not entirely happy with it myself, and I will need to dig deeper. In particular, this situation happens:

# zpool create -f mypool ~/src/pool-image.bin
# zfs create -V 500M  mypool/ext
# zpool create -f data ~/src/pool-image2.bin 
# zpool add -f data log /dev/zd0
# zpool status
        data                                 ONLINE       0     0     0
          /home/lundman/src/pool-image2.bin  ONLINE       0     0     0
        logs
          zd0                                ONLINE       0     0     0
# cp * /data/
# reboot

# zpool import -d ~/src/
   pool: mypool
        mypool                             ONLINE
          /media/sf_zfssrc/pool-image.bin  ONLINE

   pool: data
        data                                UNAVAIL  missing device
          /media/sf_zfssrc/pool-image2.bin  ONLINE

        Additional devices are known to be part of this pool, though their
        exact configuration cannot be determined.

# zpool import -d ~/src/ mypool
Dec 11 12:08:10 zfsdev kernel: [   61.707118]  zd0: unknown partition table

# zpool import -d ~/src/ 
   pool: data
        data                                UNAVAIL  missing device
          /media/sf_zfssrc/pool-image2.bin  ONLINE

        Additional devices are known to be part of this pool, though their
        exact configuration cannot be determined.

# zpool import -d ~/src/ data
The devices below are missing, use '-m' to import the pool anyway:
            zd0 [log]

# zpool import -md ~/src/ data

# zpool status
  pool: data

        NAME                                STATE     READ WRITE CKSUM
        data                                ONLINE       0     0     0
          /media/sf_zfssrc/pool-image2.bin  ONLINE       0     0     0
        logs
          zd0                               ONLINE       0     0     0

So,

  1. Why does it say "zd0" is missing, even though it does exist? Did it try to open it but fail (the device isn't locked yet)?
  2. It forces me to use -m, and yet no device is actually missing: it imported the pool with /dev/zd0 even though it had just said it could not. There is some step in the import that needs to be looked at; possibly it locked against itself while looking for vdevs?

I would also like to test using the /dev/mypool/ext notation, but I don't get those on my system. I am guessing that is part of the udev rules?

@behlendorf commented

My first guess is that ZFS has no notion of pools depending on other pools. So when importing everything, you're going to need to be careful to import the pool with the zvol first, so it's available when you import the data pool. Right now the pools just get imported in the order in which they appear in the cache file.

> I would also like to test using the /dev/mypool/ext notation, but I don't get those on my system. I am guessing that is part of the udev rules?

Right, you need the /lib/udev/zvol_id helper and the /lib/udev/rules.d/60-zvol.rules rules for those names to be created automatically.

@lundman commented Dec 17, 2012

@behlendorf I added another comment to the patch.

@lundman commented Dec 17, 2012

Here is a smaller, cleaner patch: https://github.com/lundman/zfs-master/commit/364915536ece65d9fa22abd4a08a2153557d4504 Naturally, it would not need the preprocessor conditional; I left it there until I know which way you prefer. :)

@behlendorf commented

@lundman Much better; let's go with lookup_bdev().

@lundman commented Dec 18, 2012

https://github.com/lundman/zfs-master/commit/6f033f721685fc44a011b873116888cea286acc6

Cleaned up patch.

I would still like to return to the 'pool in zvol' problem at some point, but I will use issue #612 for that.

@behlendorf commented

Minor tweaking, but merged.

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented.  This prevented
a zpool from using a zvol for its slog device.

This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c.  Given the full path to a
device it will lookup the device and verify its major number
against the registered zvol major number for the system.  If
they match we know the device is a zvol.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#1131