Hang of whole server 2 hours after accessing a snapshot via NFS #4716

Closed
rubentolosa opened this issue May 31, 2016 · 0 comments


I have a server running Ubuntu 14.04, kernel 3.13.0-86-generic, and ZFS version 0.6.5.4.

We have about 70 active users, and each one has their home directory in a separate filesystem.
I keep about 60 snapshots for each user (13 x 5 minutes, 13 x 20 minutes, and so on), covering one month.
Users mount their home directory via NFS and, when needed, they access a snapshot to recover deleted files. This has worked like a charm for a few months, but yesterday a user deleted some files and tried to recover them from one of the available 5-minute snapshots. The files were there and the user was happy.
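
To make the timing of the 5-minute tier concrete, here is a minimal Python sketch of the rotation arithmetic. It is only an illustration, not the script we actually run: the function name `expiry_time` and the 12:45 creation time are assumptions, chosen so that the result lines up with the 13:50 unmount in the log below.

```python
from datetime import datetime, timedelta


def expiry_time(created: datetime, keep: int, interval_minutes: int) -> datetime:
    """Return when a snapshot rotates out of a tier.

    With `keep` snapshots taken every `interval_minutes`, the oldest one
    is destroyed when snapshot number keep + 1 is created, that is
    keep * interval_minutes after its own creation.
    """
    return created + timedelta(minutes=keep * interval_minutes)


# Hypothetical creation time for the snapshot the user restored from.
created = datetime(2016, 5, 30, 12, 45)

# 5-minute tier from the post: 13 snapshots kept, one every 5 minutes.
expires = expiry_time(created, keep=13, interval_minutes=5)
print(expires - created)  # lifetime of a snapshot in this tier
print(expires)            # moment the snapshot is due to be destroyed
```

Under these assumptions a snapshot in the 5-minute tier lives for 65 minutes, so one taken at 12:45 is due for removal at 13:50.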

But one hour and five minutes later (13 x 5 minutes), the snapshot from which the user had recovered the files was due to be deleted, and then something went wrong. Here is what /var/log/syslog reported:

May 30 13:50:13 clara kernel: [10909369.395243] BUG: Dentry ffff8806e0b12000{i=8181a,n=time_table.pdf}  still in use (1) [unmount of zfs zfs]
May 30 13:50:13 clara kernel: [10909369.401421] ------------[ cut here ]------------
May 30 13:50:13 clara kernel: [10909369.401435] WARNING: CPU: 7 PID: 154793 at /build/linux-kyAd43/linux-3.13.0/fs/dcache.c:1329 umount_check+0x7c/0x90()
May 30 13:50:13 clara kernel: [10909369.401438] Modules linked in: btrfs ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c 8021q garp stp mrp llc x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev sb_edac edac_core mei_me mei lpc_ich ioatdma wmi shpchp bonding mac_hid lp parport nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache zfs(POX) zunicode(POX) zcommon(POX) znvpair(POX) spl(OX) zavl(POX) raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid0 multipath ses enclosure raid1 linear hid_generic igb usbhid isci ixgbe mpt2sas i2c_algo_bit hid dca libsas raid_class ahci ptp libahci scsi_transport_sas mdio megaraid_sas pps_core
May 30 13:50:13 clara kernel: [10909369.401550] CPU: 7 PID: 154793 Comm: umount Tainted: P           OX 3.13.0-76-generic #120-Ubuntu
May 30 13:50:13 clara kernel: [10909369.401553] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
May 30 13:50:13 clara kernel: [10909369.401557]  0000000000000009 ffff881566203d70 ffffffff81724b70 0000000000000000
May 30 13:50:13 clara kernel: [10909369.401574]  ffff881566203da8 ffffffff810678bd ffff880c750b0780 ffff880c750b0820
May 30 13:50:13 clara kernel: [10909369.401583]  ffff8806e0b12000 ffff8806e0b12058 ffff8806e0b12090 ffff881566203db8
May 30 13:50:13 clara kernel: [10909369.401592] Call Trace:
May 30 13:50:13 clara kernel: [10909369.401607]  [<ffffffff81724b70>] dump_stack+0x45/0x56
May 30 13:50:13 clara kernel: [10909369.401616]  [<ffffffff810678bd>] warn_slowpath_common+0x7d/0xa0
May 30 13:50:13 clara kernel: [10909369.401621]  [<ffffffff8106799a>] warn_slowpath_null+0x1a/0x20
May 30 13:50:13 clara kernel: [10909369.401637]  [<ffffffff811d472c>] umount_check+0x7c/0x90
May 30 13:50:13 clara kernel: [10909369.401643]  [<ffffffff811d6052>] d_walk+0xe2/0x2e0
May 30 13:50:13 clara kernel: [10909369.401653]  [<ffffffff811d46b0>] ? d_lru_del+0xa0/0xa0
May 30 13:50:13 clara kernel: [10909369.401664]  [<ffffffff811d63d6>] do_one_tree+0x26/0x40
May 30 13:50:13 clara kernel: [10909369.401670]  [<ffffffff811d6d7f>] shrink_dcache_for_umount+0x2f/0x90
May 30 13:50:13 clara kernel: [10909369.401681]  [<ffffffff811c0701>] generic_shutdown_super+0x21/0xf0
May 30 13:50:13 clara kernel: [10909369.401687]  [<ffffffff811c0992>] kill_anon_super+0x12/0x20
May 30 13:50:13 clara kernel: [10909369.401765]  [<ffffffffa03cccca>] zpl_kill_sb+0x1a/0x20 [zfs]
May 30 13:50:13 clara kernel: [10909369.401771]  [<ffffffff811c0ced>] deactivate_locked_super+0x3d/0x60
May 30 13:50:13 clara kernel: [10909369.401775]  [<ffffffff811c12a6>] deactivate_super+0x46/0x60
May 30 13:50:13 clara kernel: [10909369.401781]  [<ffffffff811de436>] mntput_no_expire+0xd6/0x170
May 30 13:50:13 clara kernel: [10909369.401786]  [<ffffffff811df77e>] SyS_umount+0x8e/0x120
May 30 13:50:13 clara kernel: [10909369.401792]  [<ffffffff8173575d>] system_call_fastpath+0x1a/0x1f
May 30 13:50:13 clara kernel: [10909369.401796] ---[ end trace 07130758d1fa4c6b ]---

The log is full of messages like that from that moment on. The server kept working for another hour and then hung: no NFS clients were served, there was no disk activity at all, and the load average rose to 150. We had no option but to hard reset the machine, because a soft reboot also got stuck.

After rebooting, everything seems fine, but we'd like to know what happened and how to avoid something like this in the future.

Can someone give us a clue?

Thanks a lot!
