I have a server running Ubuntu 14.04, kernel 3.13.0-86-generic, ZFS version 0.6.5.4.
We have about 70 active users, and each one has their home directory in its own filesystem.
I keep about 60 snapshots per user (13 at 5-minute intervals, 13 at 20-minute intervals, and so on), covering roughly a month.
Users mount their home directories via NFS and, when needed, access a snapshot to recover deleted files. This worked like a charm for a few months, but yesterday a user deleted some files and recovered them from one of the available 5-minute snapshots. The files were there and the user was happy.
But one hour and five minutes later (13 x 5 minutes), the snapshot from which the user had recovered the files was due to be destroyed, and then something went wrong. Here is what /var/log/syslog said:
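To make the rotation scheme concrete, here is a minimal sketch of the tiered retention logic described above. The tier values and the function name are illustrative assumptions, not the actual pruning script running on the server:

```python
from datetime import timedelta

# Hypothetical tiers mirroring "13 x 5 minutes, 13 x 20 minutes, and so on":
# each tier keeps up to `count` snapshots spaced at least `interval` apart.
TIERS = [(timedelta(minutes=5), 13), (timedelta(minutes=20), 13),
         (timedelta(hours=4), 13), (timedelta(days=2), 13)]

def snapshots_to_keep(times, tiers=TIERS):
    """Given snapshot creation times sorted newest first, return the
    set of times to keep; everything else would be destroyed."""
    keep = set()
    for interval, count in tiers:
        kept_in_tier = 0
        last = None
        for t in times:  # newest first
            if kept_in_tier >= count:
                break
            # Keep the newest snapshot, then any one at least
            # `interval` older than the last one kept in this tier.
            if last is None or last - t >= interval:
                keep.add(t)
                kept_in_tier += 1
                last = t
    return keep
```

The rotation that triggered the problem is then just: destroy every snapshot not returned by this function, which is exactly what happened 65 minutes after the user browsed the snapshot.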
May 30 13:50:13 clara kernel: [10909369.395243] BUG: Dentry ffff8806e0b12000{i=8181a,n=time_table.pdf} still in use (1) [unmount of zfs zfs]
May 30 13:50:13 clara kernel: [10909369.401421] ------------[ cut here ]------------
May 30 13:50:13 clara kernel: [10909369.401435] WARNING: CPU: 7 PID: 154793 at /build/linux-kyAd43/linux-3.13.0/fs/dcache.c:1329 umount_check+0x7c/0x90()
May 30 13:50:13 clara kernel: [10909369.401438] Modules linked in: btrfs ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c 8021q garp stp mrp llc x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev sb_edac edac_core mei_me mei lpc_ich ioatdma wmi shpchp bonding mac_hid lp parport nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache zfs(POX) zunicode(POX) zcommon(POX) znvpair(POX) spl(OX) zavl(POX) raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid0 multipath ses enclosure raid1 linear hid_generic igb usbhid isci ixgbe mpt2sas i2c_algo_bit hid dca libsas raid_class ahci ptp libahci scsi_transport_sas mdio megaraid_sas pps_core
May 30 13:50:13 clara kernel: [10909369.401550] CPU: 7 PID: 154793 Comm: umount Tainted: P OX 3.13.0-76-generic #120-Ubuntu
May 30 13:50:13 clara kernel: [10909369.401553] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
May 30 13:50:13 clara kernel: [10909369.401557] 0000000000000009 ffff881566203d70 ffffffff81724b70 0000000000000000
May 30 13:50:13 clara kernel: [10909369.401574] ffff881566203da8 ffffffff810678bd ffff880c750b0780 ffff880c750b0820
May 30 13:50:13 clara kernel: [10909369.401583] ffff8806e0b12000 ffff8806e0b12058 ffff8806e0b12090 ffff881566203db8
May 30 13:50:13 clara kernel: [10909369.401592] Call Trace:
May 30 13:50:13 clara kernel: [10909369.401607] [<ffffffff81724b70>] dump_stack+0x45/0x56
May 30 13:50:13 clara kernel: [10909369.401616] [<ffffffff810678bd>] warn_slowpath_common+0x7d/0xa0
May 30 13:50:13 clara kernel: [10909369.401621] [<ffffffff8106799a>] warn_slowpath_null+0x1a/0x20
May 30 13:50:13 clara kernel: [10909369.401637] [<ffffffff811d472c>] umount_check+0x7c/0x90
May 30 13:50:13 clara kernel: [10909369.401643] [<ffffffff811d6052>] d_walk+0xe2/0x2e0
May 30 13:50:13 clara kernel: [10909369.401653] [<ffffffff811d46b0>] ? d_lru_del+0xa0/0xa0
May 30 13:50:13 clara kernel: [10909369.401664] [<ffffffff811d63d6>] do_one_tree+0x26/0x40
May 30 13:50:13 clara kernel: [10909369.401670] [<ffffffff811d6d7f>] shrink_dcache_for_umount+0x2f/0x90
May 30 13:50:13 clara kernel: [10909369.401681] [<ffffffff811c0701>] generic_shutdown_super+0x21/0xf0
May 30 13:50:13 clara kernel: [10909369.401687] [<ffffffff811c0992>] kill_anon_super+0x12/0x20
May 30 13:50:13 clara kernel: [10909369.401765] [<ffffffffa03cccca>] zpl_kill_sb+0x1a/0x20 [zfs]
May 30 13:50:13 clara kernel: [10909369.401771] [<ffffffff811c0ced>] deactivate_locked_super+0x3d/0x60
May 30 13:50:13 clara kernel: [10909369.401775] [<ffffffff811c12a6>] deactivate_super+0x46/0x60
May 30 13:50:13 clara kernel: [10909369.401781] [<ffffffff811de436>] mntput_no_expire+0xd6/0x170
May 30 13:50:13 clara kernel: [10909369.401786] [<ffffffff811df77e>] SyS_umount+0x8e/0x120
May 30 13:50:13 clara kernel: [10909369.401792] [<ffffffff8173575d>] system_call_fastpath+0x1a/0x1f
May 30 13:50:13 clara kernel: [10909369.401796] ---[ end trace 07130758d1fa4c6b ]---
The log is full of messages like that from then on, but the server kept running for another hour before it effectively crashed: no NFS clients were served, there was no disk activity at all, and the load average climbed to 150. A soft reboot got stuck as well, so we had no choice but to hard-reset the machine.
After rebooting, everything seems fine, but we'd like to know what happened and how to avoid something like this in the future.
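One mitigation we are considering, assuming the hang is triggered by destroying a snapshot whose .zfs/snapshot automount is still in use, would be to check /proc/mounts before each destroy and postpone it while the snapshot is still mounted. This is only a sketch; the dataset and snapshot names are hypothetical:

```python
# Sketch: skip (or delay) `zfs destroy` for snapshots that still appear
# as mounted zfs filesystems in /proc/mounts, rather than letting the
# destroy force an unmount while dentries may still be referenced.
def is_snapshot_mounted(snapshot, mounts_text):
    """Return True if `snapshot` (e.g. 'tank/home/alice@auto-5min-3')
    appears as a mounted zfs filesystem in the given /proc/mounts text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: source, mountpoint, fstype, options, ...
        if len(fields) >= 3 and fields[0] == snapshot and fields[2] == "zfs":
            return True
    return False

# Usage on the server (hypothetical names):
#   with open("/proc/mounts") as f:
#       if not is_snapshot_mounted("tank/home/alice@auto-5min-3", f.read()):
#           subprocess.check_call(["zfs", "destroy",
#                                  "tank/home/alice@auto-5min-3"])
```

We don't know whether this would have prevented the crash; it would at least avoid destroying a snapshot a user is actively browsing.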
Can someone give us a clue?
Thanks a lot!