zfs and zpool commands hang when znapzend runs backups, impacting the cluster #7331

Closed
arturpzol opened this issue Mar 22, 2018 · 10 comments
Labels
Type: Performance (Performance improvement or performance problem)

Comments

@arturpzol

System information

Type Version/Name
Distribution Name Debian
Distribution Version Jessie
Linux Kernel 4.4
Architecture 64
ZFS Version 0.7.7 and master
SPL Version 0.7.7 and master

Describe the problem you're observing

When znapzend runs several backups at the same time, the zfs and zpool commands sometimes hang for a few seconds or minutes, depending on the system load. This behaviour causes a failover in a cluster environment.

For example:

In the ZFS OCF RA, the Pacemaker resource monitor checks the health of the pool every 5 seconds:

<action name="monitor" depth="0" timeout="30s" interval="5s" />

zpool_monitor function:

zpool_monitor () {
    # If the pool is not imported, then we can't monitor its health
    if ! zpool_is_imported; then
        return $OCF_NOT_RUNNING
    fi

    # Check the pool status
    HEALTH=$(zpool list -H -o health "$OCF_RESKEY_pool")
    case "$HEALTH" in
        ONLINE|DEGRADED) return $OCF_SUCCESS;;
        FAULTED)         return $OCF_NOT_RUNNING;;
        *)               return $OCF_ERR_GENERIC;;
    esac
}

When zpool list -H -o health hangs for 30 seconds, a failover is performed.

It looks like txg_sync needs to finish its work before the commands are unblocked:


Mar 21 16:07:12 [kern.info] [ 1976.537474] txg_sync        D 0000000000000000     0 42138      2 0x00000000
Mar 21 16:07:12 [kern.warning] [ 1976.537475]  ffff880080713200 ffff88007ef64000 ffff88007ef63c10 ffff88023686dc40
Mar 21 16:07:12 [kern.warning] [ 1976.537477]  ffff88007ef63bc0 ffff88023686dc40 ffff880086d96eb0 ffffffff8170d18c
Mar 21 16:07:12 [kern.warning] [ 1976.537478]  000000010003038a ffffffff8170f3a8 ffffffffa0703b6c 0000000000000282
Mar 21 16:07:12 [kern.warning] [ 1976.537479] Call Trace:
Mar 21 16:07:12 [kern.warning] [ 1976.537481]  [<ffffffff8170d18c>] ? schedule+0x2c/0x70
Mar 21 16:07:12 [kern.warning] [ 1976.537488]  [<ffffffff8170f3a8>] ? schedule_timeout+0x108/0x1a0
Mar 21 16:07:12 [kern.warning] [ 1976.537518]  [<ffffffffa0703b6c>] ? zio_taskq_dispatch+0x8c/0xa0 [zfs]
Mar 21 16:07:12 [kern.warning] [ 1976.537520]  [<ffffffff810a6bf0>] ? cascade+0x60/0x60
Mar 21 16:07:12 [kern.warning] [ 1976.537522]  [<ffffffff8170c9af>] ? io_schedule_timeout+0xaf/0x130
Mar 21 16:07:12 [kern.warning] [ 1976.537528]  [<ffffffffa0523906>] ? __cv_timedwait_common+0xf6/0x130 [spl]
Mar 21 16:07:12 [kern.warning] [ 1976.537529]  [<ffffffff81093230>] ? wait_woken+0x80/0x80
Mar 21 16:07:12 [kern.warning] [ 1976.537559]  [<ffffffffa07085db>] ? zio_wait+0x11b/0x240 [zfs]
Mar 21 16:07:12 [kern.warning] [ 1976.537583]  [<ffffffffa067e34b>] ? dsl_pool_sync+0x39b/0x3f0 [zfs]
Mar 21 16:07:12 [kern.warning] [ 1976.537609]  [<ffffffffa069e259>] ? spa_sync+0x3d9/0xcb0 [zfs]
Mar 21 16:07:12 [kern.warning] [ 1976.537637]  [<ffffffffa06b03a1>] ? txg_sync_thread+0x301/0x4c0 [zfs]
Mar 21 16:07:12 [kern.warning] [ 1976.537664]  [<ffffffffa06b00a0>] ? txg_delay+0x150/0x150 [zfs]
Mar 21 16:07:12 [kern.warning] [ 1976.537667]  [<ffffffffa051f120>] ? thread_generic_wrapper+0x70/0x80 [spl]
Mar 21 16:07:12 [kern.warning] [ 1976.537670]  [<ffffffffa051f0b0>] ? __thread_exit+0x10/0x10 [spl]
Mar 21 16:07:12 [kern.warning] [ 1976.537671]  [<ffffffff81077b7a>] ? kthread+0xca/0xe0
Mar 21 16:07:12 [kern.warning] [ 1976.537672]  [<ffffffff81077ab0>] ? kthread_create_on_node+0x170/0x170
Mar 21 16:07:12 [kern.warning] [ 1976.537673]  [<ffffffff8171034f>] ? ret_from_fork+0x3f/0x70
Mar 21 16:07:12 [kern.warning] [ 1976.537675]  [<ffffffff81077ab0>] ? kthread_create_on_node+0x170/0x170

strace shows that the commands are waiting on the open of /dev/zfs:


brk(0)                                  = 0x11d8000
brk(0x11f9000)                          = 0x11f9000
access("/sys/module/zfs", F_OK)         = 0
open("/dev/zfs", O_RDWR)

Is it possible to run zpool list or zpool status in an asynchronous mode, or is there any other way to get the status of the pool? Or is the only solution to increase the timeout?
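
One way to at least bound the check from the shell side is to run the probe in the background and stop waiting for it after a deadline. The sketch below is only an illustration, assuming it is placed inside zpool_monitor in place of the direct HEALTH=$(zpool list ...) assignment; a process stuck in uninterruptible sleep cannot be killed, but the monitor itself stops blocking and returns within its deadline:

probe=$(mktemp)
# Write the result to a temporary file and rename it when finished, so the
# parent can detect completion without calling wait on a possibly stuck child.
( zpool list -H -o health "$OCF_RESKEY_pool" > "$probe" 2>/dev/null
  mv "$probe" "$probe.done" ) &

deadline=10
while [ "$deadline" -gt 0 ] && [ ! -e "$probe.done" ]; do
    sleep 1
    deadline=$((deadline - 1))
done

if [ ! -e "$probe.done" ]; then
    # Probe is still hung; give up on this monitor run without blocking further.
    return $OCF_ERR_GENERIC
fi

HEALTH=$(cat "$probe.done")
rm -f "$probe.done"

Whether a hung probe should be reported as an error (as above) or optimistically treated as "still healthy" is a policy decision: returning success instead would trade spurious failovers for slower detection of real failures.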

@loli10K
Contributor

loli10K commented Mar 22, 2018

Depending on how "znapzend backups" work under the hood, this could be a duplicate of #3104.

@behlendorf
Contributor

@arturpzol let's start by seeing if we can determine exactly where it's waiting. I would have expected to see zpool list waiting in ioctl(2), not open(2), which shouldn't take long under most circumstances.
When it's waiting, can you cat /proc/<pid>/stack to dump the kernel stack and post it?

@arturpzol
Author

@behlendorf below is the stack output:

ps axf|grep "zpool list"
117241 pts/0    D+     0:00  |       \_ zpool list -H -o health Pool-0

cat /proc/117241/stack 
[<ffffffffa06eca65>] zfsdev_open+0x15/0x120 [zfs]
[<ffffffff81404afd>] misc_open+0x9d/0x140
[<ffffffff8114482c>] chrdev_open+0x9c/0x170
[<ffffffff81144790>] chrdev_open+0x0/0x170
[<ffffffff8113ec28>] do_dentry_open+0x1b8/0x2d0
[<ffffffff8114d2e6>] path_openat+0x596/0x1170
[<ffffffff81149b00>] terminate_walk+0xa0/0xb0
[<ffffffff8114f2b9>] do_filp_open+0x79/0xd0
[<ffffffff8115a417>] __alloc_fd+0x37/0x160
[<ffffffff8113fd9e>] do_sys_open+0x11e/0x200
[<ffffffff8171032e>] entry_SYSCALL_64_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

After a few seconds/minutes (depending on the load) zpool list returns the state of the pool, but the process then remains hung for another few seconds/minutes:

zpool list -H -o health Pool-0
ONLINE

cat /proc/117241/stack 
[<ffffffffa06e67c0>] zfsdev_release+0x10/0x70 [zfs]
[<ffffffff8114229b>] __fput+0x8b/0x1d0
[<ffffffff810763d0>] task_work_run+0x70/0x90
[<ffffffff8100308f>] exit_to_usermode_loop+0x8f/0xa0
[<ffffffff81003565>] syscall_return_slowpath+0x45/0x50
[<ffffffff8171048c>] int_ret_from_sys_call+0x25/0x8f
[<ffffffffffffffff>] 0xffffffffffffffff

When the system load is low, zpool list of course returns its output immediately.

I have tried to use ZFS Channel Programs, but the same hang can be observed there too:

ps axf|grep "zfs program"
8326 pts/5    D+     0:00 zfs program -n

cat /proc/8326/stack 
[<ffffffffa06eca65>] zfsdev_open+0x15/0x120 [zfs]
[<ffffffff81404a7d>] misc_open+0x1d/0x140
[<ffffffff8114482c>] chrdev_open+0x9c/0x170
[<ffffffff81144790>] chrdev_open+0x0/0x170
[<ffffffff8113ec28>] do_dentry_open+0x1b8/0x2d0
[<ffffffff8114d2e6>] path_openat+0x596/0x1170
[<ffffffff81149b00>] terminate_walk+0xa0/0xb0
[<ffffffff8114f2b9>] do_filp_open+0x79/0xd0
[<ffffffff8115a417>] __alloc_fd+0x37/0x160
[<ffffffff8113fd9e>] do_sys_open+0x11e/0x200
[<ffffffff8171032e>] entry_SYSCALL_64_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

behlendorf added the Type: Performance label Mar 27, 2018
@tonyhutter
Contributor

@arturpzol do you happen to have a reliable reproducer? I'd like to look into this.

@tonyhutter
Contributor

I tried to reproduce this on my own by doing a vdev attach while running strace -r on zpool list -H -o health:

0.000104 openat(AT_FDCWD, "/dev/zfs", O_RDWR) = 5
...
0.000347 ioctl(3, _IOC(0, 0x5a, 0x5, 0), 0x7ffdf404c730) = 0
3.415809 ioctl(3, _IOC(0, 0x5a, 0x27, 0), 0x7ffdf404b670) = 0  <--- ZFS_IOC_POOL_GET_PROPS

The openat() went through without delay. The ioctl() took 3.4 seconds, which is what we'd expect.

Can you rerun strace -r and post the full output? What is your pool configuration? Also, do you happen to know what kind of operations znapzend does during a backup (snapshots? zfs send? etc).

We ran into a similar problem back in February, where we had a dying disk that was holding up ZFS. That caused our zpool status heartbeat to hang, which in turn caused Pacemaker to think the node was down and power it off. It would be nice if there was a timeout or an async option for zpool so that this doesn't happen. We first need a reliable way to reproduce the issue before we can make a fix though.

@behlendorf
Contributor

For the Pacemaker case it wouldn't be unreasonable to add a /proc/spl/kstat/zfs/<pool>/health kstat which provides only this information. We'd of course still want to get to the bottom of any unexpected delays described in this issue.

@marcinkuk

marcinkuk commented May 8, 2018

Hello,

I have a similar issue in my environment.

Environment:

  • node1 - rpool1/zvol1
  • node2 - rpool2/zvol2

Scenario:
# zfs send rpool1/zvol1@snap1 | ssh node2 zfs recv rpool2/zvol2 -F
poweroff node2 during send + recv

Behaviour:

  • on node1:
    # zpool list -H -o health (and other zfs get commands) hangs for about 30 seconds

I think it may be related.

@tonyhutter
Contributor

tonyhutter commented May 14, 2018

@marcinkuk I tried to reproduce your zfs send hang between two VMs with:

zpool create mypool1 `pwd`/file1
zfs create -V 2000M mypool1/zvol1
zfs set compression=off mypool1/zvol1
dd if=/dev/urandom of=/dev/zd0 bs=1M count=800
zfs snapshot mypool1/zvol1@snap1
zfs send mypool1/zvol1@snap1 | ssh hutter@fedora27 zfs recv mypool2/zvol2 -F

However, I didn't see the sending VM hang when I paused the receiving VM. This was with 0.7.9. Can you give me some more details on your dataset, and which ZoL version you're running?

@tonyhutter
Contributor

I set up a test where I continuously wrote to a mirrored pool, while simultaneously offlining/onlining disks in the pool. While doing that, I would run zpool status in another window, and sometimes see small hangs. I tracked it down to this call path:

zpool status -> ZFS_IOC_POOL_STATS ioctl() -> spa_get_stats() -> spa_open_common() -> spa_config_generate()

In spa_config_generate(), I noticed that spa_config_enter() would sometimes take up to 4 seconds:

nvlist_t *
spa_config_generate(spa_t *spa, vdev_t *vd, uint64_t txg, int getstats)
{
...
                spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER); 
...

I narrowed it down further, and saw that it was the SCL_CONFIG lock (and not SCL_STATE) that was taking the 4 seconds. I'm not sure how much we can improve that lock, short of major surgery. With that in mind, I think implementing @behlendorf's lockless /proc/spl/kstat/zfs/<pool>/health idea is the way to go for pacemaker.
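
For reference, here is a rough sketch of that kind of stress test (the pool name tank, the mirror member sdb, and the mountpoint /tank are placeholders, not the exact setup used):

#!/bin/bash
# Keep the mirrored pool busy with writes, flushing at the end of each pass.
( while true; do
      dd if=/dev/urandom of=/tank/stressfile bs=1M count=512 conv=fsync 2>/dev/null
  done ) &
writer=$!

# Repeatedly offline/online one side of the mirror while timing status probes.
for i in $(seq 1 20); do
    zpool offline tank sdb
    sleep 2
    zpool online tank sdb
    # Multi-second wall times here correspond to the small hangs observed.
    time zpool status tank > /dev/null
done

kill "$writer"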

tonyhutter added a commit to tonyhutter/zfs that referenced this issue May 25, 2018
1. Add a proc entry to display the pool's health:

$ cat /proc/spl/kstat/zfs/tank/health
ONLINE

This is done without using the spa config locks, so it will
never hang.

2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.

Signed-off-by: Tony Hutter <[email protected]>
Closes openzfs#7331
Requires-spl: refs/pull/705/head
tonyhutter added a commit to tonyhutter/zfs that referenced this issue Sep 5, 2018
1. Add a proc entry to display the pool's state:

$ cat /proc/spl/kstat/zfs/tank/state
ONLINE

This is done without using the spa config locks, so it will
never hang.

2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.

Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes openzfs#7331
Closes openzfs#7563
@gmelikov
Member

JFYI, a PR using the /proc entry was just merged into the resource-agents master branch: ClusterLabs/resource-agents@2bdeee4
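
For anyone landing here later, a simplified sketch of what consuming the new kstat from the OCF monitor can look like (an illustration with a fallback for older releases, not the exact resource-agents change):

zpool_monitor () {
    # If the pool is not imported, then we can't monitor its health
    if ! zpool_is_imported; then
        return $OCF_NOT_RUNNING
    fi

    # Prefer the lockless kstat added for this issue; reading it never
    # blocks on the spa config locks the way the zpool command can.
    state_file="/proc/spl/kstat/zfs/${OCF_RESKEY_pool}/state"
    if [ -r "$state_file" ]; then
        HEALTH=$(cat "$state_file")
    else
        # Older ZFS without the kstat: fall back to the original command.
        HEALTH=$(zpool list -H -o health "$OCF_RESKEY_pool")
    fi

    case "$HEALTH" in
        ONLINE|DEGRADED) return $OCF_SUCCESS;;
        FAULTED)         return $OCF_NOT_RUNNING;;
        *)               return $OCF_ERR_GENERIC;;
    esac
}

A SUSPENDED pool falls through to the default branch and is reported as a generic error.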
