Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel lockup in zpl_lookup (kernel 4.7, zfs v0.6.5.8-2) #5257

Closed
aktau opened this issue Oct 10, 2016 · 4 comments
Closed

Kernel lockup in zpl_lookup (kernel 4.7, zfs v0.6.5.8-2) #5257

aktau opened this issue Oct 10, 2016 · 4 comments

Comments

@aktau
Copy link

aktau commented Oct 10, 2016

Debian Stretch user using the default debian ZFS packages here.

[    0.000000] Linux version 4.7.0-1-amd64 ([email protected]) (gcc version 5.4.1 20160904 (Debian 5.4.1-2) ) #1 SMP Debian 4.7.5-1 (2016-09-26)
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.7.0-1-amd64 root=UUID=5cbf530f-aaee-4760-bc10-702315117fe6 ro quiet
...
[    3.870097] SPL: Loaded module v0.6.5.8-2
[    4.182596] ZFS: Loaded module v0.6.5.8-1, ZFS pool version 5000, ZFS filesystem version 5
[    4.452492] SPL: using hostid 0x007f0101

[  600.225704] INFO: task transmission-da:1518 blocked for more than 120 seconds.
[  600.225717]       Tainted: P           OE   4.7.0-1-amd64 #1
[  600.225720] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  600.225724] transmission-da D ffff88021f216d80     0  1518      1 0x00000004
[  600.225734]  ffff880210bb4280 ffffffff8f60d500 ffff880210b00800 ffffffff8ee10f12
[  600.225740]  ffff880210208000 ffff8800d2b55000 ffff8800d2b55058 ffff8800d2b550b0
[  600.225745]  ffffffff8f910ee8 ffff8800d28d86c0 ffffffff8f1d7251 ffff880210bb4280
[  600.225750] Call Trace:
[  600.225765]  [<ffffffff8ee10f12>] ? __d_alloc+0x122/0x1a0
[  600.225774]  [<ffffffff8f1d7251>] ? schedule+0x31/0x80
[  600.225779]  [<ffffffff8ee11873>] ? d_alloc_parallel+0x423/0x480
[  600.225785]  [<ffffffff8f1d8fee>] ? mutex_lock+0xe/0x30
[  600.225794]  [<ffffffff8eca2fb0>] ? wake_up_q+0x60/0x60
[  600.225799]  [<ffffffff8ee11b06>] ? d_add_ci+0x56/0xd0
[  600.225927]  [<ffffffffc09d509b>] ? zpl_lookup+0x15b/0x1e0 [zfs]
[  600.225937]  [<ffffffff8ee03603>] ? lookup_slow+0xa3/0x170
[  600.225944]  [<ffffffff8ee041f3>] ? walk_component+0x1f3/0x320
[  600.225950]  [<ffffffff8ee048c8>] ? link_path_walk+0x188/0x510
[  600.225955]  [<ffffffff8ee02229>] ? path_init+0x1d9/0x330
[  600.225961]  [<ffffffff8ee04d4c>] ? path_lookupat+0x7c/0x110
[  600.225967]  [<ffffffff8ef2ddfc>] ? copy_page_to_iter+0xfc/0x3c0
[  600.225974]  [<ffffffff8ee071d1>] ? filename_lookup+0xb1/0x180
[  600.225981]  [<ffffffff8ee06e0f>] ? getname_flags+0x6f/0x1e0
[  600.225987]  [<ffffffff8edfc859>] ? vfs_fstatat+0x59/0xb0
[  600.225993]  [<ffffffff8edf9cd4>] ? __fput+0x164/0x1e0
[  600.225998]  [<ffffffff8edfcdaa>] ? SYSC_newstat+0x2a/0x60
[  600.226005]  [<ffffffff8ec03300>] ? exit_to_usermode_loop+0xa0/0xc0
[  600.226010]  [<ffffffff8ec03bd5>] ? syscall_return_slowpath+0x45/0x50
[  600.226019]  [<ffffffff8f1db136>] ? system_call_fast_compare_end+0xc/0x96

Even something like ls in the root of the zpool seems to do it (although I can't be 100% sure, tranmission-daemon runs at startup and would also try to read from the disk).

I remember upgrading the linux kernel from 4.6 to 4.7 yesterday, so that's definitely something to keep in mind when looking for a culprit.

Here's the apt-get update I did:

Here's (what I think are) the relevant upgrade to linux and zfs/spl packages over the last month

$ cat /var/log/apt/history.log | grep -v 'Remove:' | grep -oP '(Start-Date.*)|((linux-image|zfs|spl|dkms).*?\)(?=,))'
Start-Date: 2016-09-06  22:06:43
zfs-zed:amd64 (0.6.5.7-1, 0.6.5.7-2)
dkms:amd64 (2.2.0.3-5, 2.2.1.0+git20160527-1)
zfsutils-linux:amd64 (0.6.5.7-1, 0.6.5.7-2)
zfs2linux:amd64 (0.6.5.7-1, 0.6.5.7-2)
zfs-dkms:amd64 (0.6.5.7-1, 0.6.5.7-2)
Start-Date: 2016-09-06  22:18:02
Start-Date: 2016-09-11  13:34:12
Start-Date: 2016-09-11  13:45:23
Start-Date: 2016-09-19  22:49:26
Start-Date: 2016-09-19  22:51:15
Start-Date: 2016-09-25  14:30:37
zfs-zed:amd64 (0.6.5.7-2, 0.6.5.8-1)
zfsutils-linux:amd64 (0.6.5.7-2, 0.6.5.8-1)
spl:amd64 (0.6.5.7-1, 0.6.5.8-1)
zfs2linux:amd64 (0.6.5.7-2, 0.6.5.8-1)
spl-dkms:amd64 (0.6.5.7-1, 0.6.5.8-1)
zfs-dkms:amd64 (0.6.5.7-2, 0.6.5.8-1)
Start-Date: 2016-10-09  19:27:48
linux-image-4.7.0-1-amd64:amd64 (4.7.5-1, automatic)
spl:amd64 (0.6.5.8-1, 0.6.5.8-2)
linux-image-amd64:amd64 (4.6+74, 4.7+75)
spl-dkms:amd64 (0.6.5.8-1, 0.6.5.8-2)
Start-Date: 2016-10-09  23:30:00

If you need anything else, please let me know.

@aktau aktau changed the title Kernel lockup in zpl_lookup Kernel lockup in zpl_lookup (kernel 4.7, zfs v0.6.5.8-2) Oct 10, 2016
@aktau
Copy link
Author

aktau commented Oct 11, 2016

Another stack trace which is similar, but not the same:

[  720.340667] INFO: task JobWorker:1624 blocked for more than 120 seconds.
[  720.340669]       Tainted: P           OE   4.7.0-1-amd64 #1
[  720.340670] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  720.340672] JobWorker       D ffff88021f216d80     0  1624   1596 0x00000000
[  720.340675]  ffff8800d0570200 ffff880215a38e80 ffff8800d0401800 ffffffff8e410f12
[  720.340678]  ffff8800ccd3c000 ffff8802100af300 ffff8802100af358 ffff8802100af3b0
[  720.340681]  ffffffff8ef113e8 ffff8800d2400180 ffffffff8e7d7251 ffff8800d0570200
[  720.340684] Call Trace:
[  720.340687]  [<ffffffff8e410f12>] ? __d_alloc+0x122/0x1a0
[  720.340690]  [<ffffffff8e7d7251>] ? schedule+0x31/0x80
[  720.340693]  [<ffffffff8e411873>] ? d_alloc_parallel+0x423/0x480
[  720.340696]  [<ffffffff8e2a2fb0>] ? wake_up_q+0x60/0x60
[  720.340700]  [<ffffffff8e4035e0>] ? lookup_slow+0x80/0x170
[  720.340703]  [<ffffffff8e4041f3>] ? walk_component+0x1f3/0x320
[  720.340707]  [<ffffffff8e4048c8>] ? link_path_walk+0x188/0x510
[  720.340710]  [<ffffffff8e402229>] ? path_init+0x1d9/0x330
[  720.340713]  [<ffffffff8e404d4c>] ? path_lookupat+0x7c/0x110
[  720.340716]  [<ffffffff8e4071d1>] ? filename_lookup+0xb1/0x180
[  720.340720]  [<ffffffff8e406e0f>] ? getname_flags+0x6f/0x1e0
[  720.340723]  [<ffffffff8e3fc859>] ? vfs_fstatat+0x59/0xb0
[  720.340725]  [<ffffffff8e3fcdaa>] ? SYSC_newstat+0x2a/0x60
[  720.340728]  [<ffffffff8e3f7c83>] ? vfs_write+0x163/0x1a0
[  720.340732]  [<ffffffff8e3f8fd6>] ? SyS_write+0x76/0xc0
[  720.340735]  [<ffffffff8e7db136>] ? system_call_fast_compare_end+0xc/0x96

It doesn't even feature zpl_lookup, so I assume it's some sort of fast path.

@tuxoko
Copy link
Contributor

tuxoko commented Oct 12, 2016

This is fixed in #5141

@behlendorf
Copy link
Contributor

Closing. The fix for this is queued up for 0.6.5.9.

@aktau
Copy link
Author

aktau commented Oct 14, 2016

Thanks!

I knew it would be something "special" about my setup, otherwise it would've been noticed before. I foolishly created that ZFS volume when I was still using OSX. Thinking I would use it for some apple stuff, I made it case-insensitive (and normalizing...).

A very bad choice. Since I don't have another HDD that's as large, I'm afraid I'll never be able to turn off those bad choices anymore (at least until I get a new, larger one, far in the future).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants