Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.6.3-1.3 + hung nfsd/zfs_zget uninterruptible sleep #4639

Closed
crollorc opened this issue May 12, 2016 · 2 comments
Closed

0.6.3-1.3 + hung nfsd/zfs_zget uninterruptible sleep #4639

crollorc opened this issue May 12, 2016 · 2 comments

Comments

@crollorc
Copy link

Hi,

I am running 0.6.3-1.3 on Ubuntu 12.04.5 LTS with all access over NFS and am experiencing intermittent NFS lockups.

When this occurred, no NFS requests were being processed at all and all nfsd processed were in a uninterruptible sleep/D state -

    root@san3:~# ps -eopid,state,wchan:32,cmd | awk '$2 ~ /D/'|tee ~/ps
     34197 D zfs_zget                         [nfsd]
     34198 D zfs_zget                         [nfsd]
     34200 D zfs_zget                         [nfsd]
     34203 D zfs_zget                         [nfsd]
    ...

I ran -

    root@san3:~# echo t >/proc/sysrq-trigger
    root@san3:~# echo l >/proc/sysrq-trigger

at the time and so have stack dumps. Here is a sample nfsd process stack -

    May 12 15:35:26 san3 kernel: [78302.578188] NMI backtrace for cpu 1
    May 12 15:35:26 san3 kernel: [78302.578193] CPU: 1 PID: 34199 Comm: nfsd Tainted: P          IOX 3.13.0-86-generic #130~precise1-Ubuntu
    May 12 15:35:26 san3 kernel: [78302.578196] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.01.06.0002.110120121539 11/01/2012
    May 12 15:35:26 san3 kernel: [78302.578198] task: ffff881f4d664800 ti: ffff8810104d4000 task.ti: ffff8810104d4000
    May 12 15:35:26 san3 kernel: [78302.578200] RIP: 0010:[<ffffffff8176db31>]  [<ffffffff8176db31>] _raw_spin_lock_irqsave+0x41/0x60
    May 12 15:35:26 san3 kernel: [78302.578207] RSP: 0018:ffff8810104d58e8  EFLAGS: 00000002
    May 12 15:35:26 san3 kernel: [78302.578209] RAX: 0000000000007fd5 RBX: 00000000004d1f52 RCX: 0000000000002596
    May 12 15:35:26 san3 kernel: [78302.578211] RDX: 0000000000000206 RSI: 0000000000002596 RDI: ffff8800bc3ec790
    May 12 15:35:26 san3 kernel: [78302.578213] RBP: ffff8810104d58e8 R08: 0000000000002592 R09: ffffea000bd21e00
    May 12 15:35:26 san3 kernel: [78302.578215] R10: ffffffffa022a7ba R11: 0000000000000000 R12: ffff8800bc3ec788
    May 12 15:35:26 san3 kernel: [78302.578216] R13: ffff8823d57ef800 R14: 0000000000000001 R15: ffff8810104d5990
    May 12 15:35:26 san3 kernel: [78302.578219] FS:  0000000000000000(0000) GS:ffff8811f7620000(0000) knlGS:0000000000000000
    May 12 15:35:26 san3 kernel: [78302.578221] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    May 12 15:35:26 san3 kernel: [78302.578222] CR2: 0000000000f9d000 CR3: 0000001121c6b000 CR4: 00000000000407e0
    May 12 15:35:26 san3 kernel: [78302.578224] Stack:
    May 12 15:35:26 san3 kernel: [78302.578225]  ffff8810104d5968 ffffffffa03c7331 ffff8800bc3ec790 00000000be6f62c8
    May 12 15:35:26 san3 kernel: [78302.578230]  0000000100000000 ffff8821be6f62c8 ffff8810104d5958 ffffffffa03c76bc
    May 12 15:35:26 san3 kernel: [78302.578233]  ffff8800bc3ec790 ffff88200d42f530 ffffffffa0467d61 ffff88116e882000
    May 12 15:35:26 san3 kernel: [78302.578236] Call Trace:
    May 12 15:35:26 san3 kernel: [78302.578290]  [<ffffffffa03c7331>] dnode_hold_impl+0x3e1/0x5d0 [zfs]
    May 12 15:35:26 san3 kernel: [78302.578318]  [<ffffffffa03c76bc>] ? dnode_rele_and_unlock+0x6c/0xb0 [zfs]
    May 12 15:35:26 san3 kernel: [78302.578345]  [<ffffffffa03c7539>] dnode_hold+0x19/0x20 [zfs]
    May 12 15:35:26 san3 kernel: [78302.578370]  [<ffffffffa03b4b94>] dmu_bonus_hold+0x34/0x280 [zfs]
    May 12 15:35:26 san3 kernel: [78302.578405]  [<ffffffffa03eeb7e>] sa_buf_hold+0xe/0x10 [zfs]
    May 12 15:35:26 san3 kernel: [78302.578447]  [<ffffffffa044939e>] zfs_zget+0x1be/0x350 [zfs]
    May 12 15:35:26 san3 kernel: [78302.578489]  [<ffffffffa0439fc8>] zfs_vget+0x228/0x3a0 [zfs]
    May 12 15:35:26 san3 kernel: [78302.578503]  [<ffffffffa0738e80>] ? nfsd_acceptable.part.12+0xe0/0xe0 [nfsd]
    May 12 15:35:26 san3 kernel: [78302.578545]  [<ffffffffa0458f67>] zpl_fh_to_dentry+0x87/0xd0 [zfs]
    May 12 15:35:26 san3 kernel: [78302.578549]  [<ffffffff812c1d7e>] exportfs_decode_fh+0x6e/0x300
    May 12 15:35:26 san3 kernel: [78302.578577]  [<ffffffffa0661421>] ? cache_check+0x61/0x200 [sunrpc]
    May 12 15:35:26 san3 kernel: [78302.578586]  [<ffffffffa073f2ba>] ? exp_find+0x10a/0x1b0 [nfsd]
    May 12 15:35:26 san3 kernel: [78302.578046] Code: 89 e5 ff 15 b9 0f 90 00 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 8d 04 bd 00 00 00 00 65 48 8b 14 25 a0 29 01 00 <48> 8d 0c 12 48 c1 e2 06 48 89 e5 48 29 ca f7 e2 48 8d 7a 01 ff

They situation recovered when NFSD was restarted.

I noticed issue 3349 and related bugfix and wondered if adding cond_resched() to zfs_zget would help.

Do you guys have any ideas on what the underlying problem could be?

Unfortunately, I cannot upgrade currently as we have experienced a big performance drop when we tried this so I am looking to cherry-pick bug fixes and compile from source if it helps.

Thanks for your help.

@crollorc
Copy link
Author

@kpande

It doesn't make much sense to file this issue for 0.6.3 because it most likely has already been fixed in master, It would be more helpful to file an issue for the performance.

yes, that makes sense.

however i did that and unfortunately the issue was unresolved so we ended up downgrading to 0.6.3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants
@crollorc and others