INFO: task zfs:11829 blocked for more than 120 seconds. #1301
Comments
This looks like contention on the clone->ds_rwlock rwlock. The warning is just advisory here and can be safely ignored, but it does suggest that the locking here is too coarse and should be improved.
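For anyone who only wants to quiet the advisory warning itself, the kernel message in the trace below names the relevant knob. A small sketch (values are illustrative; the threshold on this system is 120 seconds):

```
# The hung-task warning is advisory; the kernel message itself points at this knob.
echo 600 > /proc/sys/kernel/hung_task_timeout_secs   # raise the threshold to 10 minutes
echo 0   > /proc/sys/kernel/hung_task_timeout_secs   # disable the warning entirely
```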
Thank you for looking at this. I'm afraid this warning leads to an issue I have not yet pinned down which is not simply a warning. When these warnings appear on a system they seem to indicate too many concurrent zfs requests. Things like lots of zfs list commands running while snapshots are being taken and destroyed, plus the root drive updating the devices, cause so much activity that the result is a system which stops responding to zfs commands altogether.

Our root file systems live on top of an MD array and ext4. The kernel does not crash in such a way that I can't access the system. Indeed I can still ssh into a failed system, and in some cases the virtual machines running on top of the zvols are still operating. HOWEVER, no zfs command ever returns to a command prompt, and system load will hover at 70 or so; eventually the load will grow to the point of system lockup (crash). A reboot resolves this unless the number of snapshots is VERY large (500 - x000), in which case many times all the devices will not finish being processed on boot and we use a process I adapted from wonderful information posted on this forum. On systems with less than 1000 snapshots, as root run:
Sorry for whatever caused that goofy font?? :(
@byteharmony: Probably the '#' symbol in the transcript. On Github, the trick is to surround cut-and-pasted material that would usually go in <pre> or [code] tags with three back-ticks instead. (```) If you wrote that in the Github web editor, then you can click the Edit button to change it.
@dajhorn Thanks for the help, you're right about # symbols, I used them in the post to designate a command prompt. Now I have ``` listed with the same goofy print :(. Did I screw it up? BK
Not quite. You need to add newlines like this:
@dajhorn Devil is always in the details ;). Thanks for your help, looking forward to much prettier comments :). BK
If you don't use snapshot devices you could try this patch:
Bump. Still an issue with the latest 0.6.3. Managed to avoid it by offsetting the start times of cron jobs utilizing "zfs list". Until today, when I had a mental lapse and scheduled two jobs simultaneously. Two "zfs list -H -t snap -o name" processes running at the same time. Neither finishes, other zfs commands lock, and the zvols went offline. Log file snippet attached:

Jul 14 13:31:37 dtc-san2 kernel: drbd detroitzvol: meta connection shut down by peer.
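To illustrate the offset-the-cron-start-times workaround described above, a hypothetical crontab layout (job names, paths, and times are placeholders, not from the original report):

```
# Stagger jobs that invoke "zfs list" so no two start at the same minute.
0  * * * *  /usr/local/sbin/zfs-snapshot-rotate.sh
20 * * * *  /usr/local/sbin/zfs-send-replicate.sh
40 * * * *  /usr/local/sbin/zfs-report.sh
```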
Also possibly of interest, there was a "zfs send" operation running prior to the invocation of the two "zfs list" commands. And numerous snapshots on the server (~100), which seems to cause significant delay (around 10 seconds) for "zfs list -t all" to finish.
Nothing unusual in the history (zdb -h pool), truncated for brevity to just the last several lines: ...
@olw2005 OK, thanks for letting us know there's still an issue here.
Posting this mostly for the benefit of anyone coming across this via google: As this bug has been tagged as 0.7.0, it may be a while before this issue gets addressed. In the interim I've replaced all instances of "zfs list" commands in my scripts with "zfs_list", which runs a [crude / brute-force / hack] wrapper script as shown below:
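The original wrapper script did not survive in this copy of the thread; as a rough sketch of the serialize-the-calls idea (the lock-file path and the flock approach are assumptions, not the commenter's actual script), such a wrapper could look like:

```
#!/bin/sh
# Hypothetical zfs_list wrapper -- not the original script from this comment.
# Serializes concurrent "zfs list" invocations behind a lock file so only one
# runs at a time; any additional callers block until the lock is released.
exec flock /var/lock/zfs_list.lock zfs list "$@"
```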
In addition, where possible I have added the "-s name" option to "zfs list" commands. As stated in another bug report, a "zfs list -o name -s name -t all" is Much Faster (orders of magnitude?!) than "zfs list -o name -t all". To wit: (zfs list 73 snapshots with "-s name" option. Fraction of a second, all good.)
(and the same zfs list w/o the "-s name" option. Ouch.)
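The timing output pasted in the original comment was not preserved here; to illustrate the comparison being made, the two invocations can be timed side by side (no representative numbers implied):

```
# Sorted by name: reported above to finish in a fraction of a second.
time zfs list -o name -s name -t all

# Unsorted: reported above to be orders of magnitude slower with many snapshots.
time zfs list -o name -t all
```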
Closing. This is no longer believed to be an issue with the latest code. |
This is happening only on heavier-load servers with slower system drives (USB sticks running the base Linux system): CentOS 6.3, ext4 with RAID 1 for the system drives. It seems to happen more when the RAID resync is allowed to go faster.
sysctl.conf:
dev.raid.speed_limit_max = 5000
Helped, but still happening; moving to 2000 (which will limit resync speed to 2 MBps, pretty slow). This is USB2; USB3 may help. I think it'd be nice to increase the timeouts. This is usually only an issue on systems with LOTS of snapshots and lots of programs listing snapshots to send the right data back and forth.
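For reference, the same md resync cap can also be applied at runtime without waiting for a reboot; a small sketch using the 2000 KB/s figure mentioned above:

```
# Apply the resync cap immediately (value is in KB/s, so 2000 ~= 2 MB/s):
sysctl -w dev.raid.speed_limit_max=2000

# Persist it across reboots via /etc/sysctl.conf:
#   dev.raid.speed_limit_max = 2000
```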
This machine is on rc13, haven't started work on rc14 yet.
BK
```
INFO: task zfs:11829 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zfs D 0000000000000002 0 11829 11827 0x00000080
ffff88068f84b9e8 0000000000000082 ffff88068f84baa8 ffff880841e2f538
0000000000000000 ffff880841e2f500 ffff88086ce60aa0 0000000000000000
ffff880841e2fab8 ffff88068f84bfd8 000000000000fb88 ffff880841e2fab8
Call Trace:
[] rwsem_down_failed_common+0x95/0x1d0
[] rwsem_down_write_failed+0x23/0x30
[] call_rwsem_down_write_failed+0x13/0x20
[] ? down_write+0x32/0x40
[] ? autoremove_wake_function+0x0/0x40
[] dsl_dataset_clone_swap+0x1d9/0x460 [zfs]
[] dmu_recv_end+0xaa/0x220 [zfs]
[] ? dmu_objset_rele+0x11/0x20 [zfs]
[] ? get_zfs_sb+0x61/0xd0 [zfs]
[] zfs_ioc_recv+0x8af/0xf50 [zfs]
[] ? kmem_free_debug+0x4b/0x150 [spl]
[] ? dbuf_rele_and_unlock+0x159/0x200 [zfs]
[] ? kmem_free_debug+0x4b/0x150 [spl]
[] ? spa_name_compare+0xe/0x30 [zfs]
[] ? spa_lookup+0x62/0xc0 [zfs]
[] ? spa_open_common+0x23c/0x370 [zfs]
[] zfsdev_ioctl+0xfd/0x1d0 [zfs]
[] vfs_ioctl+0x22/0xa0
[] do_vfs_ioctl+0x84/0x580
[] ? security_file_permission+0x16/0x20
[] ? kvm_on_user_return+0x73/0x80 [kvm]
[] sys_ioctl+0x81/0xa0
[] system_call_fastpath+0x16/0x1b
```