simultaneous zfs snapshot access hangs #143
Comments
Don't know if this is related, but:
root@mailhost:/mnt/zfs/fulldisk/.zfs/snapshot# ls -lat
dmesg:
[ 23.558479] eth0: no IPv6 routers present
/tmp/spl-log.1343567577.338 uploaded here:
root@mailhost:/var/tmp# dmesg | egrep 'spl|zfs'
root@mailhost:/var/tmp# ps axfww | egrep -i 'cron|sync|zfs|zpool'
A hint on what to do to help debug this would be appreciated.
It appears that you might have a couple of issues going on here. But if you're able to reproduce the issue and can still log in to the node (or have console access), it would be very helpful to dump all of the stacks. You can do this by issuing a sysrq-t, either on the console or via /proc/sysrq-trigger.
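A minimal sketch of the non-console route, assuming root access and that the magic sysrq key is enabled:

echo 1 > /proc/sys/kernel/sysrq      # allow all sysrq functions if currently restricted
echo t > /proc/sysrq-trigger         # dump every task's stack into the kernel log
dmesg > /tmp/sysrq-t-stacks.txt      # capture the stacks so they can be attached here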
Thank you very much for the hint.
Alright, certainly let me know if you hit the issue again. As for stress tools, I typically run the xfstests test suite, which is what the upstream kernel folks use for ext4, xfs, etc. There are many more, but we don't really have a ZFS-specific test suite for Linux just yet.
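For reference, a rough sketch of how xfstests is usually fetched and run; it targets ext4/xfs out of the box, so driving it against a ZFS dataset takes extra setup, and the device paths below are only placeholders:

git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
cd xfstests-dev && make
cat > local.config <<'EOF'           # tell the suite which devices/mountpoints to exercise
export TEST_DEV=/dev/sdb1
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/sdc1
export SCRATCH_MNT=/mnt/scratch
EOF
sudo ./check -g quick                # run the quick group of tests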
As I'm tracking the daily builds on the Ubuntu PPA, I'm now running 0.6.0.71-0ubuntu1~precise1. The following got caught by the logcheck cron run this morning; not sure if it is related. A sysrq-t dumped 7 hours later is at http://paste.ubuntu.com/1158784/. zpool/zfs seem to work OK, no processes in D state.
Aug 21 06:58:45 mailhost kernel: [1158944.802862] INFO: task zfs:51806 blocked for more than 120 seconds.
[...other daemons...]
Aug 21 06:59:06 mailhost zfs-auto-snap: @zfs-auto-snap_daily-2012-08-21-0656, 1 created, 1 destroyed, 0 warnings.
My guess is this was just related to a lack of free memory on the machine delaying the txg_sync thread. These issues are being worked on and I'm hopeful things will be better for -rc11, but we shall see.
I'm still getting occasional errors like these (actually only two of them, caught by logcheck):
Oct 9 07:03:07 mailhost kernel: [451101.047619] INFO: task zfs:5546 blocked for more than 120 seconds.
Oct 11 06:55:07 mailhost kernel: [623383.021135] INFO: task zfs:32526 blocked for more than 120 seconds.
At 6:25 am every day, run-parts, called from crontab, kicks off the cron.daily scripts.
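On a stock Debian/Ubuntu install that timing comes from /etc/crontab; roughly:

# /etc/crontab (distribution defaults)
17 *  * * *  root  cd / && run-parts --report /etc/cron.hourly
25 6  * * *  root  test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )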
@mailinglists35 Does anything actually fail? The stacks you posted are just advisory, and I'm seriously considering suppressing them to avoid causing unwarranted concern.
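For context, those warnings come from the kernel's hung-task watchdog rather than from ZFS itself; the 120-second threshold is a sysctl, e.g.:

cat /proc/sys/kernel/hung_task_timeout_secs      # default is 120
# setting it to 0 disables the check (and the warnings) entirely:
# echo 0 > /proc/sys/kernel/hung_task_timeout_secs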
No, I have noticed nothing else failing. I'm certainly asleep at the time cron.daily runs anyway :) and so are the users of that box.
I have narrowed this down: every time a zfs command is run while another zfs is creating a snapshot, it results in a permanent hang. I hope this is reproducible for you too, as I can reproduce it on the latest Ubuntu 12.04 LTS and Debian wheezy kernels (3.2.0 amd64) with the latest ubuntu-zfs packages. To reproduce:
while true; do zfs list; zpool list; done
Dec 3 19:29:03 homerouter kernel: [ 427.749244] SPL: Loaded module v0.6.0.88-rc12
(two minutes later)
Dec 3 19:33:56 homerouter kernel: [ 720.612081] INFO: task txg_sync:15139 blocked for more than 120 seconds.
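A compact version of that reproduction, with the pool and snapshot names as placeholders:

# terminal 1: hammer the zfs/zpool commands in a loop
while true; do zfs list; zpool list; done

# terminal 2: create a snapshot while the loop runs (the same kind of
# command zfs-auto-snapshot issues)
zfs snapshot -r tank@manual-test

# as reported above, both terminals then wedge in the D state and
# txg_sync shows up in the "blocked for more than 120 seconds" messages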
While (unsuccessfully) trying to reproduce #97 on this Debian machine, I ran the diff -qr command on a snapdir. I'm not saying there is a connection between the two issues; that is just how I ran into it. So it appears I cannot even read data from a snapshot directory while another snapshot gets created, or cannot create a snapshot while reading data from a previous snapshot; either way, I cannot perform both actions at the same time. (By the way, zfs-auto-snapshot also deletes older snapshots, but it did not reach the deletion step (it is blocked at creation), and I was reading from a snapshot that was not scheduled to be deleted anyway.)
root@homerouter:~# ps ax | grep D
Dec 3 21:02:50 homerouter kernel: [ 3120.612085] INFO: task txg_sync:13926 blocked for more than 120 seconds.
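The snapdir variant of the same reproduction, again with placeholder dataset and snapshot names:

# terminal 1: read data back through the .zfs snapshot directory
diff -qr /tank/.zfs/snapshot/older /tank/.zfs/snapshot/newer

# terminal 2: take a new snapshot while the diff is still running
zfs snapshot tank@another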
Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock since that function returns with the lock held on success. All other callers drop the lock correctly but it seems fzap_cursor_move_to_key() does not. This may block writers or cause VERIFY failures when the lock is freed.
Fixes openzfs#1215
Fixes openzfs/spl#143
Fixes openzfs/spl#97
Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock since that function returns with the lock held on success. All other callers drop the lock correctly but it seems fzap_cursor_move_to_key() does not. This may block writers or cause VERIFY failures when the lock is freed.
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #1215
Closes openzfs/spl#143
Closes openzfs/spl#97
openzfs/zfs@1305d33 fzap_cursor_move_to_key() should drop l_rwlock
Is anyone else still having this issue? I'm using the Ubuntu PPA for the RCs, and I recently updated to RC14 and still seem to be experiencing this exact same issue. Cron is running zfs-auto-snapshot, and if two of them run at the same time because the disk is otherwise busy, the whole filesystem hangs (the filesystem was busy because I was running a scrub to make sure no damage was present from the issue happening on the previous RC).
Feb 6 23:21:12 santos kernel: [44521.368218] INFO: task txg_sync:522 blocked for more than 120 seconds.
@iamacarpet This looks like a different issue to me. Your system is blocked waiting on an outstanding I/O to the disk for some unknown reason. Check ...
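A few standard ways to look for an outstanding disk I/O, offered only as an illustration (not necessarily what the previous comment was about to suggest):

iostat -x 1 5                                  # per-device utilization and service times
cat /sys/block/sda/inflight                    # reads/writes currently queued on the device (adjust the device name)
cat /proc/$(pgrep txg_sync | head -1)/stack    # where the blocked kernel thread is sitting (needs root)
dmesg | tail -n 50                             # any controller or driver errors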
Thanks, I've opened a new ticket about it here with a little more info: openzfs/zfs#1283
01:18:44 up 9:06, 3 users, load average: 8.83, 7.56, 6.23
This is an Ubuntu 12.04 LTS amd64 test system doing nothing (yet).
root@mailhost:~# dpkg -l|grep zfs
ii  libzfs1            0.6.0.67-0ubuntu1~precise1  Native ZFS filesystem library for Linux
ii  mountall           2.36-zfs1                   filesystem mounting tool
ii  ubuntu-zfs         6~precise                   Native ZFS filesystem metapackage for Ubuntu
ii  zfs-auto-snapshot  1.0.8-0ubuntu1~precise1     ZFS Automatic Snapshot Service
ii  zfs-dkms           0.6.0.67-0ubuntu1~precise1  Native ZFS filesystem kernel modules for Linux
ii  zfsutils           0.6.0.67-0ubuntu1~precise1  Native ZFS management utilities for Linux
[ 8.397519] SPL: Loaded module v0.6.0.67-rc9, using hostid 0xa8c04711
root@mailhost:~# cat /etc/apt/sources.list.d/zfs-native-daily-precise.list
deb http://ppa.launchpad.net/zfs-native/daily/ubuntu precise main
deb-src http://ppa.launchpad.net/zfs-native/daily/ubuntu precise main
Linux mailhost 3.2.0-27-generic #43-Ubuntu SMP Fri Jul 6 14:25:57 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Somehow cron managed to overlap two jobs that create snapshots.
Luckily the zpool is on a dedicated pair of disks, and I have console access to reset the box (reboot fails at "killing all remaining processes", then hangs at "unmounting local filesystems" - or I don't have that much patience to see how long it waits; I'll just leave it overnight to see if it eventually reboots, sleeping now).
1224 ? Ss 0:00 cron
35459 ? S 0:00 _ CRON
35460 ? Ss 0:00 | _ /bin/sh -c zfs-auto-snapshot -q -g --label=frequent --keep=4 //
35461 ? S 0:00 | _ /bin/sh /sbin/zfs-auto-snapshot -q -g --label=frequent --keep=4 //
35484 ? D 0:00 | _ zfs snapshot -o com.sun:auto-snapshot-desc - -r fulldisk@zfs-auto-snap_frequent-2012-07-28-0015
36090 ? S 0:00 _ CRON
36091 ? Ss 0:00 | _ /bin/sh -c cd / && run-parts --report /etc/cron.hourly
36092 ? S 0:00 | _ run-parts --report /etc/cron.hourly
36093 ? S 0:00 | _ /bin/sh /sbin/zfs-auto-snapshot --quiet --syslog --label=hourly --keep=24 //
36095 ? D 0:00 | _ zpool status
36799 ? S 0:00 _ CRON
36800 ? Ss 0:00 | _ /bin/sh -c zfs-auto-snapshot -q -g --label=frequent --keep=4 //
36801 ? S 0:00 | _ /bin/sh /sbin/zfs-auto-snapshot -q -g --label=frequent --keep=4 //
36803 ? D 0:00 | _ zpool status
36849 ? S 0:00 _ CRON
36850 ? Ss 0:00 | _ /bin/sh -c zfs-auto-snapshot -q -g --label=frequent --keep=4 //
36851 ? S 0:00 | _ /bin/sh /sbin/zfs-auto-snapshot -q -g --label=frequent --keep=4 //
36853 ? D 0:00 | _ zpool status
38077 ? S 0:00 _ CRON
38078 ? Ss 0:00 _ /bin/sh -c zfs-auto-snapshot -q -g --label=frequent --keep=4 //
38079 ? S 0:00 _ /bin/sh /sbin/zfs-auto-snapshot -q -g --label=frequent --keep=4 //
38081 ? D 0:00 _ zpool status