
3.17 kernel, BUG: using smp_processor_id() in preemptible - caller is zio_nowait #2769

Closed
kernelOfTruth opened this issue Oct 7, 2014 · 8 comments

@kernelOfTruth
Contributor

Hi,

I just finished booting up a 3.17.0-based kernel.

While the kernel modules were loading, the following was shown:

[  110.691372] BUG: using smp_processor_id() in preemptible [00000000] code: zpool/7429
[  110.691384] caller is zio_nowait+0x157/0x170 [zfs]
[  110.691386] CPU: 1 PID: 7429 Comm: zpool Tainted: P           O   3.17.0-v2+ #1
[  110.691387] Hardware name: ASUS All Series/P9D WS, BIOS 2004 06/05/2014
[  110.691388]  ffffffff82ecfa7b ffffffff82a3e876 0000000000000001 ffffffff8254334b
[  110.691390]  ffff8800dc66d7c0 ffff8807df9e4540 0000000000000000 ffffffffc0381377
[  110.691392]  0000000000000000 ffff8807dc4e3940 0000000000000000 ffff8807dfe67998
[  110.691393] Call Trace:
[  110.691397]  [<ffffffff82a3e876>] ? dump_stack+0x4a/0x75
[  110.691400]  [<ffffffff8254334b>] ? check_preemption_disabled+0xeb/0xf0
[  110.691404]  [<ffffffffc0381377>] ? zio_nowait+0x157/0x170 [zfs]
[  110.691408]  [<ffffffffc02e42b7>] ? arc_read+0x2b7/0x990 [zfs]
[  110.691413]  [<ffffffffc02ebd5a>] ? dbuf_prefetch+0x13a/0x250 [zfs]
[  110.691419]  [<ffffffffc0300925>] ? dmu_tx_fini+0x115/0x1a0 [zfs]
[  110.691425]  [<ffffffffc0301166>] ? dmu_zfetch+0x4d6/0x1990 [zfs]
[  110.691430]  [<ffffffffc02eb397>] ? dbuf_read+0x5a7/0xda0 [zfs]
[  110.691435]  [<ffffffffc0302a50>] ? dnode_hold_impl+0x1c0/0x550 [zfs]
[  110.691441]  [<ffffffffc02f36f6>] ? dmu_buf_hold_noread+0x26/0x150 [zfs]
[  110.691446]  [<ffffffffc02f3845>] ? dmu_buf_hold+0x25/0x80 [zfs]
[  110.691453]  [<ffffffffc034e7b5>] ? zap_lockdir+0x55/0x820 [zfs]
[  110.691457]  [<ffffffffc02e9d89>] ? dbuf_rele_and_unlock+0x289/0x470 [zfs]
[  110.691461]  [<ffffffffc02ea1a1>] ? dmu_buf_rele+0x11/0x620 [zfs]
[  110.691469]  [<ffffffffc034aef9>] ? fzap_lookup+0x119/0x150 [zfs]
[  110.691473]  [<ffffffff820f7819>] ? get_parent_ip+0x9/0x20
[  110.691475]  [<ffffffff820f7877>] ? preempt_count_add+0x47/0xa0
[  110.691478]  [<ffffffff82a45622>] ? _raw_spin_lock_irqsave+0x12/0x60
[  110.691485]  [<ffffffffc034f618>] ? zap_lookup_norm+0x38/0x1b0 [zfs]
[  110.691491]  [<ffffffffc034f7ba>] ? zap_lookup+0x2a/0x30 [zfs]
[  110.691499]  [<ffffffffc032d546>] ? spa_lookup_by_guid+0x13a6/0x2a30 [zfs]
[  110.691507]  [<ffffffffc032ef13>] ? spa_tryimport+0x83/0x430 [zfs]
[  110.691513]  [<ffffffffc035d4a9>] ? zfs_secpolicy_smb_acl+0x1a99/0x4ef0 [zfs]
[  110.691519]  [<ffffffffc035fa71>] ? zfs_secpolicy_smb_acl+0x4061/0x4ef0 [zfs]
[  110.691522]  [<ffffffff821f7ef0>] ? do_vfs_ioctl+0x2d0/0x4a0
[  110.691524]  [<ffffffff821e6daf>] ? __fput+0x13f/0x1c0
[  110.691527]  [<ffffffff82203a29>] ? mntput_no_expire+0x9/0x130
[  110.691529]  [<ffffffff820ec224>] ? task_work_run+0x94/0xd0
[  110.691531]  [<ffffffff821f80f6>] ? SyS_ioctl+0x36/0x80
[  110.691533]  [<ffffffff82a45f62>] ? int_signal+0x12/0x17
[  110.691535]  [<ffffffff82a45cd6>] ? system_call_fastpath+0x1a/0x1f
[  110.691547] BUG: using smp_processor_id() in preemptible [00000000] code: zpool/7429
[  110.691552] caller is zio_nowait+0x157/0x170 [zfs]
[  110.691553] CPU: 1 PID: 7429 Comm: zpool Tainted: P           O   3.17.0-v2+ #1
[  110.691554] Hardware name: ASUS All Series/P9D WS, BIOS 2004 06/05/2014
[  110.691554]  ffffffff82ecfa7b ffffffff82a3e876 0000000000000001 ffffffff8254334b
[  110.691556]  ffff8800dc66d7c0 ffff8807dc4231a8 0000000000000000 ffffffffc0381377
[  110.691557]  0000000000000000 ffff8807dc4e3940 0000000000000000 ffff8807dfe67948
[  110.691558] Call Trace:
[  110.691560]  [<ffffffff82a3e876>] ? dump_stack+0x4a/0x75
[  110.691562]  [<ffffffff8254334b>] ? check_preemption_disabled+0xeb/0xf0
[  110.691566]  [<ffffffffc0381377>] ? zio_nowait+0x157/0x170 [zfs]
[  110.691570]  [<ffffffffc02e42b7>] ? arc_read+0x2b7/0x990 [zfs]
[  110.691574]  [<ffffffffc02ebd5a>] ? dbuf_prefetch+0x13a/0x250 [zfs]
[  110.691580]  [<ffffffffc0300925>] ? dmu_tx_fini+0x115/0x1a0 [zfs]
[  110.691585]  [<ffffffffc0301166>] ? dmu_zfetch+0x4d6/0x1990 [zfs]
[  110.691590]  [<ffffffffc02eb397>] ? dbuf_read+0x5a7/0xda0 [zfs]
[  110.691595]  [<ffffffffc0302a50>] ? dnode_hold_impl+0x1c0/0x550 [zfs]
[  110.691600]  [<ffffffffc02f36f6>] ? dmu_buf_hold_noread+0x26/0x150 [zfs]
[  110.691605]  [<ffffffffc02f3845>] ? dmu_buf_hold+0x25/0x80 [zfs]
[  110.691612]  [<ffffffffc034e7b5>] ? zap_lockdir+0x55/0x820 [zfs]
[  110.691616]  [<ffffffffc02e9d89>] ? dbuf_rele_and_unlock+0x289/0x470 [zfs]
[  110.691620]  [<ffffffffc02ea1a1>] ? dmu_buf_rele+0x11/0x620 [zfs]
[  110.691627]  [<ffffffffc034aef9>] ? fzap_lookup+0x119/0x150 [zfs]
[  110.691629]  [<ffffffff820f7819>] ? get_parent_ip+0x9/0x20
[  110.691631]  [<ffffffff820f7877>] ? preempt_count_add+0x47/0xa0
[  110.691633]  [<ffffffff82a45622>] ? _raw_spin_lock_irqsave+0x12/0x60
[  110.691639]  [<ffffffffc034f618>] ? zap_lookup_norm+0x38/0x1b0 [zfs]
[  110.691645]  [<ffffffffc034f7ba>] ? zap_lookup+0x2a/0x30 [zfs]
[  110.691653]  [<ffffffffc032d546>] ? spa_lookup_by_guid+0x13a6/0x2a30 [zfs]
[  110.691661]  [<ffffffffc032ef13>] ? spa_tryimport+0x83/0x430 [zfs]
[  110.691668]  [<ffffffffc035d4a9>] ? zfs_secpolicy_smb_acl+0x1a99/0x4ef0 [zfs]
[  110.691673]  [<ffffffffc035fa71>] ? zfs_secpolicy_smb_acl+0x4061/0x4ef0 [zfs]
[  110.691675]  [<ffffffff821f7ef0>] ? do_vfs_ioctl+0x2d0/0x4a0
[  110.691676]  [<ffffffff821e6daf>] ? __fput+0x13f/0x1c0
[  110.691678]  [<ffffffff82203a29>] ? mntput_no_expire+0x9/0x130
[  110.691680]  [<ffffffff820ec224>] ? task_work_run+0x94/0xd0
[  110.691681]  [<ffffffff821f80f6>] ? SyS_ioctl+0x36/0x80
[  110.691683]  [<ffffffff82a45f62>] ? int_signal+0x12/0x17
[  110.691685]  [<ffffffff82a45cd6>] ? system_call_fastpath+0x1a/0x1f
[  110.802751] BUG: using smp_processor_id() in preemptible [00000000] code: zpool/7429
[  110.802757] caller is zio_nowait+0x157/0x170 [zfs]
[  110.802759] CPU: 1 PID: 7429 Comm: zpool Tainted: P           O   3.17.0-v2+ #1
[  110.802760] Hardware name: ASUS All Series/P9D WS, BIOS 2004 06/05/2014
[  110.802761]  ffffffff82ecfa7b ffffffff82a3e876 0000000000000001 ffffffff8254334b
[  110.802762]  ffff8800dc66d7c0 ffff8807de81f9d8 0000000000000000 ffffffffc0381377
[  110.802764]  0000000000000000 ffff8807dc4e39e0 0000000000000000 ffff8807dfe676c8
[  110.802765] Call Trace:
[  110.802767]  [<ffffffff82a3e876>] ? dump_stack+0x4a/0x75
[  110.802769]  [<ffffffff8254334b>] ? check_preemption_disabled+0xeb/0xf0
[  110.802774]  [<ffffffffc0381377>] ? zio_nowait+0x157/0x170 [zfs]
[  110.802777]  [<ffffffffc02e42b7>] ? arc_read+0x2b7/0x990 [zfs]
[  110.802782]  [<ffffffffc03812cb>] ? zio_nowait+0xab/0x170 [zfs]
[  110.802786]  [<ffffffffc02ebd5a>] ? dbuf_prefetch+0x13a/0x250 [zfs]
[  110.802792]  [<ffffffffc0300925>] ? dmu_tx_fini+0x115/0x1a0 [zfs]
[  110.802798]  [<ffffffffc0301166>] ? dmu_zfetch+0x4d6/0x1990 [zfs]
[  110.802802]  [<ffffffffc02eb397>] ? dbuf_read+0x5a7/0xda0 [zfs]
[  110.802807]  [<ffffffffc02f2465>] ? dmu_return_arcbuf+0x175/0x590 [zfs]
[  110.802812]  [<ffffffffc02f2ee2>] ? dmu_read+0x92/0x320 [zfs]
[  110.802820]  [<ffffffffc03206ea>] ? metaslab_init+0x1ba/0x280 [zfs]
[  110.802829]  [<ffffffffc033c9d8>] ? vdev_metaslab_init+0xf8/0x220 [zfs]
[  110.802836]  [<ffffffffc033fda0>] ? vdev_load+0xb0/0xc0 [zfs]
[  110.802842]  [<ffffffffc033fd1b>] ? vdev_load+0x2b/0xc0 [zfs]
[  110.802851]  [<ffffffffc032d63d>] ? spa_lookup_by_guid+0x149d/0x2a30 [zfs]
[  110.802858]  [<ffffffffc032ef13>] ? spa_tryimport+0x83/0x430 [zfs]
[  110.802865]  [<ffffffffc035d4a9>] ? zfs_secpolicy_smb_acl+0x1a99/0x4ef0 [zfs]
[  110.802870]  [<ffffffffc035fa71>] ? zfs_secpolicy_smb_acl+0x4061/0x4ef0 [zfs]
[  110.802872]  [<ffffffff821f7ef0>] ? do_vfs_ioctl+0x2d0/0x4a0
[  110.802874]  [<ffffffff821e6daf>] ? __fput+0x13f/0x1c0
[  110.802876]  [<ffffffff82203a29>] ? mntput_no_expire+0x9/0x130
[  110.802877]  [<ffffffff820ec224>] ? task_work_run+0x94/0xd0
[  110.802879]  [<ffffffff821f80f6>] ? SyS_ioctl+0x36/0x80
[  110.802881]  [<ffffffff82a45f62>] ? int_signal+0x12/0x17
[  110.802883]  [<ffffffff82a45cd6>] ? system_call_fastpath+0x1a/0x1f

full error message:

http://pastebin.com/G6YkLSeb

Is this a ZFS-related issue, or rather an SPL-related one?
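For context on what the check complains about: with CONFIG_DEBUG_PREEMPT enabled, check_preemption_disabled() (visible in the traces above) prints this BUG line whenever smp_processor_id() is called while preemption is still enabled, because the task could migrate to another CPU immediately after the id is read. A minimal userspace model of that check — preempt_count and raw_cpu_id() here are simplified stand-ins for the kernel's real per-task/per-CPU state:

```c
#include <stdio.h>

/* Userspace model of the kernel's CONFIG_DEBUG_PREEMPT check; in the real
 * kernel, preempt_count lives in per-task state and the id comes from
 * per-CPU data. */
static int preempt_count;

static void preempt_disable(void) { preempt_count++; }
static void preempt_enable(void)  { preempt_count--; }

static int raw_cpu_id(void) { return 1; }  /* stand-in "hardware" CPU id */

/* smp_processor_id() is only meaningful with preemption disabled; otherwise
 * the task could be migrated right after the read and the id would be stale,
 * so the debug build complains. */
static int smp_processor_id(const char *caller)
{
    if (preempt_count == 0)
        printf("BUG: using smp_processor_id() in preemptible code: "
               "caller is %s\n", caller);
    return raw_cpu_id();
}
```

Calling smp_processor_id("zio_nowait") without first calling preempt_disable() reproduces the shape of the message in the log; with preemption disabled, the same call is silent.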

@behlendorf
Contributor

@kernelOfTruth is this the very latest master code? If so, this is likely a regression due to e022864, which was just merged.

@behlendorf behlendorf added this to the 0.6.4 milestone Oct 7, 2014
@behlendorf
Contributor

That's exactly what happened. Let me quickly push a patch to fix it; if you could verify that fix, that would be great.

@behlendorf
Contributor

@kernelOfTruth #2770 addresses the issue. If you could verify the fix, that would be great. I'd like to get this merged faster than usual, since I'm not keen on leaving the master branch in its current state for long.
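For readers hitting the same warning elsewhere: the usual remedies for this class of bug (a sketch of the general pattern, not the actual #2770 diff) are either to pin the task with get_cpu()/put_cpu() around the per-CPU access, or to use raw_smp_processor_id() when an occasionally stale CPU id is harmless, e.g. for per-CPU statistics. A userspace model with simplified stand-ins:

```c
#define NR_CPUS 2

/* Simplified stand-ins for kernel state. */
static int current_cpu = 0;          /* "hardware" CPU the task runs on */
static int preempt_count = 0;
static long per_cpu_hits[NR_CPUS];

static void preempt_disable(void) { preempt_count++; }
static void preempt_enable(void)  { preempt_count--; }

/* get_cpu()/put_cpu(): disable preemption and return a CPU id that stays
 * valid until put_cpu() re-enables preemption. */
static int get_cpu(void) { preempt_disable(); return current_cpu; }
static void put_cpu(void) { preempt_enable(); }

/* raw_smp_processor_id(): skips the debug check; acceptable when the id is
 * only a hint, e.g. picking a statistics bucket, and a rare
 * off-by-one-CPU update doesn't matter. */
static int raw_smp_processor_id(void) { return current_cpu; }

/* Preemption-safe per-CPU counter bump: the task cannot migrate between
 * reading the CPU id and using it. */
static void count_hit_safe(void)
{
    int cpu = get_cpu();
    per_cpu_hits[cpu]++;
    put_cpu();
}
```

Which remedy is right depends on whether the code actually needs to stay on the same CPU (pin it) or merely wants a cheap bucket index (raw id is enough).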

@kernelOfTruth
Contributor Author

@behlendorf unfortunately I can't say whether it helps, beyond the following experience:

After booting into the kernel with the patched ZFS code, I get stalls reported by rcu_preempt.

I've done 4 reboots so far, with no luck:

1st boot) Booted into composited Xfce4 and opened a few apps (Tomboy Notes, Klipper, gnome-terminal). While attempting to modify a file (~/.xinitrc), a letter got stuck and kept repeating. Num Lock still worked for some time, but the mouse couldn't move and the desktop was frozen; eventually even the magic SysRq key didn't work. Only a reset helped (via the button on the PC case).

2nd boot) Rebooted, ran zfs mount -a, still in a VT; if I remember correctly, that was all. Then an error message appeared; unfortunately the first part of it (possibly even 2 traces) scrolled by too fast, and only the 3rd one I could capture in a picture:

INFO: rcu_preempt detected stalls on CPUs/tasks:
0: (2 GPs behind) idle= ... softirq= ... last_accelerate: ... , nonlazy_posted: ..., ....
1: (1 GPs behind) idle= ... softirq= ... last_accelerate: ... , nonlazy_posted: ..., ....
(detected by 3, t=9602 jiffies, g=1827, c=1826, q=29)
Task dump for CPU0:
kworker/0:2

...

Call Trace:
cache_reap
process_one_work
worker_thread
rescuer_thread
kthread+0xca/0xe0
kthread_create_on_node
ret_from_fork
kthread_create_on_node

Task dump for CPU 1:
txg_sync
...
Call Trace:
spa_sync
__wake_up
txg_delay
txg_delay
__thread_exit
__thread_exit
kthread
kthread_create_on_node
ret_from_fork
kthread_create_on_node

Hardlock after some time; only a reset via the button helped.

3rd boot) Ran zfs mount -a, then /etc/init.d/slim restart (firing up the display manager). When entering the usual desktop environment, LXQt, it would hardlock a few seconds after starting up apps (e.g. Klipper, Tomboy Notes, ...).

4th boot) Ran zfs mount -a, then /etc/init.d/slim restart (firing up the display manager). When entering the usual desktop environment, LXQt, it would hardlock a few seconds after starting up Chromium.

After this, the lockups would occur even when booting into the 3.17-based kernel with the ZFS code that had not been patched. This got me worried that ~/.xinitrc or the pool was broken.

The data seems fine so far, and I'm currently writing from a 3.16.4-based kernel with the latest code from yesterday (CEST). To be honest, I don't know whether the illumos 5176 code change was already included, and I'm a little scared, so I'll stay on this kernel for some time.

Sorry for being unable to provide more info (call trace, BUG name, etc.). I tried to get the error message, but none is shown; the last kernel messages can't be retrieved either, since it's a hardlock that happens rather fast.

@kernelOfTruth
Contributor Author

I'll try again, and wipe everything related to 3.17 kernels from root and /lib/modules.

Hopefully that improves things ...

@behlendorf
Contributor

@kernelOfTruth thanks for the update. The root cause of the original BUG you've posted has been addressed in master. But please let us know if you observe any other strange symptoms.

@kernelOfTruth
Contributor Author

@behlendorf

I compiled a new kernel with only the zram & btrfs patches added on top, and there have been no hardlocks so far.

I suspect there was an issue with the graphics driver (or even VirtualBox, since I haven't installed that module this time) that interacted badly with ZFS.

So far it has survived 2 incremental data transfers via rsync to a Btrfs and a ZFS volume (1.8 TB of data).

dmesg | grep -i zfs
[ 100.907717] ZFS: Loaded module v0.6.3-106_ga215ee1, ZFS pool version 5000, ZFS filesystem version 5

dmesg | grep -i spl
[ 100.893328] SPL: Loaded module v0.6.3-11_g81857a3
[ 100.919120] SPL: using hostid 0x00000000

No additional (problem-indicating) messages so far in dmesg.

Thanks 👍

@kernelOfTruth
Copy link
Contributor Author

FYI:

There seems to be a deeper issue with 3.17.y kernels and RCU stalls:

http://marc.info/?l=linux-kernel&m=141288512930863 [RCU stalls -> lockup linux-3.17]

So I'm not the only one, and it's most probably NOT related to ZFS or SPL.
