I have a server running Ubuntu 14.04, kernel 3.13.0-86-generic, ZFS version 0.6.5.4.
We have about 70 active users, and each one has their home directory in its own filesystem.
I keep about 60 snapshots per user (13 at 5-minute intervals, 13 at 20-minute intervals, and so on), covering roughly a month.
Users mount their home directories via NFS and, when needed, access a snapshot to recover deleted files. This worked like a charm for a few months, but yesterday a user deleted some files and recovered them from one of the available 5-minute snapshots. The files were there and the user was happy.
But one hour and five minutes later (13 x 5 minutes), the snapshot from which the user had recovered the files was due to be destroyed, and then something went wrong. Here is what /var/log/syslog said:
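To make the rotation scheme concrete, here is a minimal sketch of the tiered retention logic described above. The tier values and the function name are illustrative assumptions, not the actual pruning script running on the server:

```python
from datetime import timedelta

# Hypothetical tiers mirroring "13 x 5 minutes, 13 x 20 minutes, and so on":
# each tier keeps up to `count` snapshots spaced at least `interval` apart.
TIERS = [(timedelta(minutes=5), 13), (timedelta(minutes=20), 13),
         (timedelta(hours=4), 13), (timedelta(days=2), 13)]

def snapshots_to_keep(times, tiers=TIERS):
    """Given snapshot creation times sorted newest first, return the
    set of times to keep; everything else would be destroyed."""
    keep = set()
    for interval, count in tiers:
        kept_in_tier = 0
        last = None
        for t in times:  # newest first
            if kept_in_tier >= count:
                break
            # Keep the newest snapshot, then any one at least
            # `interval` older than the last one kept in this tier.
            if last is None or last - t >= interval:
                keep.add(t)
                kept_in_tier += 1
                last = t
    return keep
```

The rotation that triggered the problem is then just: destroy every snapshot not returned by this function, which is exactly what happened 65 minutes after the user browsed the snapshot.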
May 30 13:50:13 clara kernel: [10909369.395243] BUG: Dentry ffff8806e0b12000{i=8181a,n=time_table.pdf} still in use (1) [unmount of zfs zfs]
May 30 13:50:13 clara kernel: [10909369.401421] ------------[ cut here ]------------
May 30 13:50:13 clara kernel: [10909369.401435] WARNING: CPU: 7 PID: 154793 at /build/linux-kyAd43/linux-3.13.0/fs/dcache.c:1329 umount_check+0x7c/0x90()
May 30 13:50:13 clara kernel: [10909369.401438] Modules linked in: btrfs ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c 8021q garp stp mrp llc x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev sb_edac edac_core mei_me mei lpc_ich ioatdma wmi shpchp bonding mac_hid lp parport nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache zfs(POX) zunicode(POX) zcommon(POX) znvpair(POX) spl(OX) zavl(POX) raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid0 multipath ses enclosure raid1 linear hid_generic igb usbhid isci ixgbe mpt2sas i2c_algo_bit hid dca libsas raid_class ahci ptp libahci scsi_transport_sas mdio megaraid_sas pps_core
May 30 13:50:13 clara kernel: [10909369.401550] CPU: 7 PID: 154793 Comm: umount Tainted: P OX 3.13.0-76-generic #120-Ubuntu
May 30 13:50:13 clara kernel: [10909369.401553] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
May 30 13:50:13 clara kernel: [10909369.401557] 0000000000000009 ffff881566203d70 ffffffff81724b70 0000000000000000
May 30 13:50:13 clara kernel: [10909369.401574] ffff881566203da8 ffffffff810678bd ffff880c750b0780 ffff880c750b0820
May 30 13:50:13 clara kernel: [10909369.401583] ffff8806e0b12000 ffff8806e0b12058 ffff8806e0b12090 ffff881566203db8
May 30 13:50:13 clara kernel: [10909369.401592] Call Trace:
May 30 13:50:13 clara kernel: [10909369.401607] [<ffffffff81724b70>] dump_stack+0x45/0x56
May 30 13:50:13 clara kernel: [10909369.401616] [<ffffffff810678bd>] warn_slowpath_common+0x7d/0xa0
May 30 13:50:13 clara kernel: [10909369.401621] [<ffffffff8106799a>] warn_slowpath_null+0x1a/0x20
May 30 13:50:13 clara kernel: [10909369.401637] [<ffffffff811d472c>] umount_check+0x7c/0x90
May 30 13:50:13 clara kernel: [10909369.401643] [<ffffffff811d6052>] d_walk+0xe2/0x2e0
May 30 13:50:13 clara kernel: [10909369.401653] [<ffffffff811d46b0>] ? d_lru_del+0xa0/0xa0
May 30 13:50:13 clara kernel: [10909369.401664] [<ffffffff811d63d6>] do_one_tree+0x26/0x40
May 30 13:50:13 clara kernel: [10909369.401670] [<ffffffff811d6d7f>] shrink_dcache_for_umount+0x2f/0x90
May 30 13:50:13 clara kernel: [10909369.401681] [<ffffffff811c0701>] generic_shutdown_super+0x21/0xf0
May 30 13:50:13 clara kernel: [10909369.401687] [<ffffffff811c0992>] kill_anon_super+0x12/0x20
May 30 13:50:13 clara kernel: [10909369.401765] [<ffffffffa03cccca>] zpl_kill_sb+0x1a/0x20 [zfs]
May 30 13:50:13 clara kernel: [10909369.401771] [<ffffffff811c0ced>] deactivate_locked_super+0x3d/0x60
May 30 13:50:13 clara kernel: [10909369.401775] [<ffffffff811c12a6>] deactivate_super+0x46/0x60
May 30 13:50:13 clara kernel: [10909369.401781] [<ffffffff811de436>] mntput_no_expire+0xd6/0x170
May 30 13:50:13 clara kernel: [10909369.401786] [<ffffffff811df77e>] SyS_umount+0x8e/0x120
May 30 13:50:13 clara kernel: [10909369.401792] [<ffffffff8173575d>] system_call_fastpath+0x1a/0x1f
May 30 13:50:13 clara kernel: [10909369.401796] ---[ end trace 07130758d1fa4c6b ]---
The log is full of messages like that from then on, but the server kept running for another hour before it effectively crashed: no NFS clients were served, there was no disk activity at all, and the load average climbed to 150. A soft reboot got stuck as well, so we had no choice but to hard-reset the machine.
After rebooting, everything seems fine, but we'd like to know what happened and how to avoid something like this in the future.
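One mitigation we are considering, assuming the hang is triggered by destroying a snapshot whose .zfs/snapshot automount is still in use, would be to check /proc/mounts before each destroy and postpone it while the snapshot is still mounted. This is only a sketch; the dataset and snapshot names are hypothetical:

```python
# Sketch: skip (or delay) `zfs destroy` for snapshots that still appear
# as mounted zfs filesystems in /proc/mounts, rather than letting the
# destroy force an unmount while dentries may still be referenced.
def is_snapshot_mounted(snapshot, mounts_text):
    """Return True if `snapshot` (e.g. 'tank/home/alice@auto-5min-3')
    appears as a mounted zfs filesystem in the given /proc/mounts text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: source, mountpoint, fstype, options, ...
        if len(fields) >= 3 and fields[0] == snapshot and fields[2] == "zfs":
            return True
    return False

# Usage on the server (hypothetical names):
#   with open("/proc/mounts") as f:
#       if not is_snapshot_mounted("tank/home/alice@auto-5min-3", f.read()):
#           subprocess.check_call(["zfs", "destroy",
#                                  "tank/home/alice@auto-5min-3"])
```

We don't know whether this would have prevented the crash; it would at least avoid destroying a snapshot a user is actively browsing.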
Can someone give us a clue?
Thanks a lot!