
High CPU usage of zfs_iput_taskq and slow writes #1953

Closed
satmandu opened this issue Dec 11, 2013 · 10 comments

Comments

@satmandu
Contributor

I have a folder shared via netatalk to do Time Machine backups. I'm attempting to back up ~250 GB of data (replacing another ~250 GB of data). The backup slows to a crawl about 30 GB in, though. top reveals that zfs_iput_taskq is at 99% CPU. Writes to other zpools on other disks also slow to a near crawl at this point.

This is on an Ubuntu 13.10 system with a 3.12.3 kernel. (This is an issue I have seen before. I've usually been able to reboot once zfs_iput_taskq is done, and then the issue goes away for a while, but it comes back.)

I've also now set xattr=sa since I've read a bit about how this might be connected, but I assume this doesn't take effect immediately for existing files.
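
For reference, the property change was done with something like the following (a sketch; the dataset name here is just a placeholder for my actual backup dataset):

zfs get xattr backups/timemachine
zfs set xattr=sa backups/timemachine

As I understand it, this only affects xattrs written after the change; existing directory-based xattrs stay as they are until the files are rewritten.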

top:

top - 11:16:12 up 1 day, 21:20,  1 user,  load average: 1.12, 1.38, 3.39
Tasks: 424 total,   3 running, 419 sleeping,   0 stopped,   2 zombie
%Cpu(s):  0.2 us, 50.4 sy,  0.0 ni, 48.6 id,  0.8 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   6095000 total,  5734220 used,   360780 free,       44 buffers
KiB Swap:   975868 total,    31744 used,   944124 free,  1818300 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND           
 1665 root       0 -20     0    0    0 R  99.9  0.0 917:04.91 zfs_iput_taskq/   
 5761 satmandu   20   0 20768 1784 1116 R   0.7  0.0   0:00.07 top               

zpool iostat 1:

backups      529G  3.12T      3      0  20.5K      0
----------  -----  -----  -----  -----  -----  -----
backups      529G  3.12T      6      0  33.5K      0
----------  -----  -----  -----  -----  -----  -----
backups      529G  3.12T      3      0  18.5K      0
----------  -----  -----  -----  -----  -----  -----
backups      529G  3.12T      5      0  19.0K      0
----------  -----  -----  -----  -----  -----  -----

zdb output:

backups:
    version: 28
    name: 'backups'
    state: 0
    txg: 3804191
    pool_guid: 7826503835675459938
    hostname: 'backup'
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 7826503835675459938
        children[0]:
            type: 'disk'
            id: 0
            guid: 15722699178899966303
            path: '/dev/disk/by-id/ata-Hitachi_HDS722020ALA330_JK1130YAGSJLYT-part2'
            whole_disk: 0
            metaslab_array: 31
            metaslab_shift: 31
            ashift: 12
            asize: 2000050192384
            is_log: 0
            DTL: 41
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 9435143644745875
            path: '/dev/disk/by-id/ata-Hitachi_HDS722020ALA330_JK1130YAGSM8ZT-part2'
            whole_disk: 0
            metaslab_array: 444
            metaslab_shift: 31
            ashift: 9
            asize: 2000050192384
            is_log: 0
            DTL: 42
            create_txg: 20734
    features_for_read:

module info:

zfs

/sbin/modinfo zfs
filename:       /lib/modules/3.12.3-031203-generic/updates/dkms/zfs.ko
license:        CDDL
author:         Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description:    ZFS
srcversion:     52118DBD850807AF7765395
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       3.12.3-031203-generic SMP mod_unload modversions 
parm:           zvol_inhibit_dev:Do not create zvol device nodes (uint)
parm:           zvol_major:Major number for zvol device (uint)
parm:           zvol_threads:Number of threads for zvol device (uint)
parm:           zvol_max_discard_blocks:Max number of blocks to discard at once (ulong)
parm:           zio_injection_enabled:Enable fault injection (int)
parm:           zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm:           zio_delay_max:Max zio millisec delay before posting event (int)
parm:           zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (int)
parm:           zfs_sync_pass_deferred_free:defer frees starting in this pass (int)
parm:           zfs_sync_pass_dont_compress:don't compress starting in this pass (int)
parm:           zfs_sync_pass_rewrite:rewrite new bps starting in this pass (int)
parm:           zil_replay_disable:Disable intent logging replay (int)
parm:           zfs_nocacheflush:Disable cache flushes (int)
parm:           zil_slog_limit:Max commit bytes to separate log device (ulong)
parm:           zfs_read_chunk_size:Bytes to read per chunk (long)
parm:           zfs_immediate_write_sz:Largest data block to write to zil (long)
parm:           zfs_flags:Set additional debugging flags (int)
parm:           zfs_recover:Set to attempt to recover from fatal errors (int)
parm:           zfs_expire_snapshot:Seconds to expire .zfs/snapshot (int)
parm:           zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm:           zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm:           zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm:           zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm:           zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm:           zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm:           zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm:           zfs_vdev_mirror_switch_us:Switch mirrors every N usecs (int)
parm:           zfs_vdev_scheduler:I/O scheduler (charp)
parm:           zfs_vdev_cache_max:Inflate reads small than max (int)
parm:           zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm:           zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm:           zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm:           zfs_read_history:Historic statistics for the last N reads (int)
parm:           zfs_read_history_hits:Include cache hits in read history (int)
parm:           zfs_txg_history:Historic statistics for the last N txgs (int)
parm:           zfs_deadman_synctime:Expire in units of zfs_txg_synctime_ms (ulong)
parm:           zfs_deadman_enabled:Enable deadman timer (int)
parm:           spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm:           zfs_autoimport_disable:Disable pool import at module load (int)
parm:           metaslab_debug:keep space maps in core to verify frees (int)
parm:           zfs_zevent_len_max:Max event queue length (int)
parm:           zfs_zevent_cols:Max event column width (int)
parm:           zfs_zevent_console:Log events to the console (int)
parm:           zfs_top_maxinflight:Max I/Os per top-level (int)
parm:           zfs_resilver_delay:Number of ticks to delay resilver (int)
parm:           zfs_scrub_delay:Number of ticks to delay scrub (int)
parm:           zfs_scan_idle:Idle window in clock ticks (int)
parm:           zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm:           zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm:           zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm:           zfs_no_scrub_io:Set to disable scrub I/O (int)
parm:           zfs_no_scrub_prefetch:Set to disable scrub prefetching (int)
parm:           zfs_no_write_throttle:Disable write throttling (int)
parm:           zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm:           zfs_txg_synctime_ms:Target milliseconds between txg sync (int)
parm:           zfs_write_limit_min:Min txg write limit (ulong)
parm:           zfs_write_limit_max:Max txg write limit (ulong)
parm:           zfs_write_limit_inflated:Inflated txg write limit (ulong)
parm:           zfs_write_limit_override:Override txg write limit (ulong)
parm:           zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm:           zfetch_max_streams:Max number of streams per zfetch (uint)
parm:           zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm:           zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm:           zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm:           zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm:           zfs_mdcomp_disable:Disable meta data compression (int)
parm:           zfs_nopwrite_enabled:Enable NOP writes (int)
parm:           zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm:           zfs_dbuf_state_index:Calculate arc header index (int)
parm:           zfs_arc_min:Min arc size (ulong)
parm:           zfs_arc_max:Max arc size (ulong)
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm:           zfs_arc_meta_prune:Bytes of meta data to prune (int)
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm:           zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
parm:           zfs_disable_dup_eviction:disable duplicate buffer eviction (int)
parm:           zfs_arc_memory_throttle_disable:disable memory throttle (int)
parm:           zfs_arc_min_prefetch_lifespan:Min life of prefetch block (int)
parm:           l2arc_write_max:Max write bytes per interval (ulong)
parm:           l2arc_write_boost:Extra write bytes during device warmup (ulong)
parm:           l2arc_headroom:Number of max device writes to precache (ulong)
parm:           l2arc_headroom_boost:Compressed l2arc_headroom multiplier (ulong)
parm:           l2arc_feed_secs:Seconds between L2ARC writing (ulong)
parm:           l2arc_feed_min_ms:Min feed interval in milliseconds (ulong)
parm:           l2arc_noprefetch:Skip caching prefetched buffers (int)
parm:           l2arc_nocompress:Skip compressing L2ARC buffers (int)
parm:           l2arc_feed_again:Turbo L2ARC warmup (int)
parm:           l2arc_norw:No reads during writes (int)

spl:

/sbin/modinfo spl
filename:       /lib/modules/3.12.3-031203-generic/updates/dkms/spl.ko
license:        GPL
description:    Solaris Porting Layer
author:         Lawrence Livermore National Labs
srcversion:     E5FEF2127809879D7CF69EA
depends:        
vermagic:       3.12.3-031203-generic SMP mod_unload modversions 
parm:           spl_hostid:The system hostid. (ulong)
parm:           spl_hostid_path:The system hostid file (/etc/hostid) (charp)
parm:           spl_kmem_cache_expire:By age (0x1) or low memory (0x2) (uint)
parm:           spl_debug_subsys:Subsystem debugging level mask. (ulong)
parm:           spl_debug_mask:Debugging level mask. (ulong)
parm:           spl_debug_printk:Console printk level mask. (ulong)
parm:           spl_debug_mb:Total debug buffer size. (int)
parm:           spl_debug_panic_on_bug:Panic on BUG (uint)
@DeHackEd
Contributor

If a kernel thread is spinning on a CPU, you can usually get its stack trace with echo l > /proc/sysrq-trigger (that's a lowercase L). The stacks go to dmesg.
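
For example (a sketch; this needs root, and the sysrq key may need to be enabled first):

echo 1 > /proc/sys/kernel/sysrq
echo l > /proc/sysrq-trigger
dmesg | tail -n 200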

@satmandu
Contributor Author

That gives me this:

[171378.241657] SysRq : Show backtrace of all active CPUs
[171378.245393] sending NMI to all CPUs:
[171378.245393] NMI backtrace for cpu 1
[171378.245393] CPU: 1 PID: 10658 Comm: bash Tainted: PF          O 3.12.3-031203-generic #201312041453
[171378.245393] Hardware name: Apple Inc. MacBookPro3,1/Mac-F4238BC8, BIOS    MBP31.88Z.0070.B07.0803051658 03/05/08
[171378.245393] task: ffff8801049e0000 ti: ffff88017795a000 task.ti: ffff88017795a000
[171378.245393] RIP: 0010:[<ffffffff8138b171>]  [<ffffffff8138b171>] __const_udelay+0x1/0x30
[171378.245393] RSP: 0018:ffff88017795be38  EFLAGS: 00000046
[171378.245393] RAX: 0000000000000000 RBX: 0000000000002710 RCX: 0000000000000004
[171378.245393] RDX: 0000000000000007 RSI: 0000000000000100 RDI: 0000000000418958
[171378.245393] RBP: ffff88017795be58 R08: 000000000000000a R09: 0000000000000000
[171378.245393] R10: 000000000000071b R11: 000000000000071a R12: ffffffff81c9dec0
[171378.245393] R13: 0000000000000286 R14: 0000000000000004 R15: 0000000000000000
[171378.245393] FS:  00007f7789c5d740(0000) GS:ffff8801bfd00000(0000) knlGS:0000000000000000
[171378.245393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[171378.245393] CR2: 00007ffd7ecf6000 CR3: 000000015955a000 CR4: 00000000000007e0
[171378.245393] Stack:
[171378.245393]  ffff88017795be58 ffffffff810472d2 0000000000005782 000000000000006c
[171378.245393]  ffff88017795be68 ffffffff8147220e ffff88017795bea8 ffffffff81472611
[171378.245393]  ffff88017795bee8 0000000000000002 fffffffffffffffb 0000000001e90408
[171378.245393] Call Trace:
[171378.245393]  [<ffffffff810472d2>] ? arch_trigger_all_cpu_backtrace+0x72/0xa0
[171378.245393]  [<ffffffff8147220e>] sysrq_handle_showallcpus+0xe/0x10
[171378.245393]  [<ffffffff81472611>] __handle_sysrq+0x121/0x190
[171378.245393]  [<ffffffff814726b3>] write_sysrq_trigger+0x33/0x40
[171378.245393]  [<ffffffff81225db3>] proc_reg_write+0x43/0x70
[171378.245393]  [<ffffffff811c088e>] vfs_write+0xce/0x200
[171378.245393]  [<ffffffff811c0d92>] SyS_write+0x52/0xa0
[171378.245393]  [<ffffffff817598bf>] tracesys+0xe1/0xe6
[171378.245393] Code: ff c8 5d c3 66 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 ff 15 79 0c 90 00 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 <48> 8d 04 bd 00 00 00 00 65 48 8b 14 25 20 3d 01 00 48 8d 0c 12 
[171378.251935] NMI backtrace for cpu 0
[171378.251943] CPU: 0 PID: 1665 Comm: zfs_iput_taskq/ Tainted: PF          O 3.12.3-031203-generic #201312041453
[171378.251948] Hardware name: Apple Inc. MacBookPro3,1/Mac-F4238BC8, BIOS    MBP31.88Z.0070.B07.0803051658 03/05/08
[171378.251953] task: ffff8801b48faf60 ti: ffff8801b4940000 task.ti: ffff8801b4940000
[171378.251957] RIP: 0010:[<ffffffffa01e114c>]  [<ffffffffa01e114c>] avl_find+0x1c/0xa0 [zavl]
[171378.251983] RSP: 0018:ffff8801b49418e8  EFLAGS: 00000296
[171378.251987] RAX: ffff8801b48faf60 RBX: ffff88018fee1920 RCX: 0000000100000005
[171378.251991] RDX: ffff8801b4941928 RSI: ffff8801b4941930 RDI: ffff88018fee1920
[171378.251995] RBP: ffff8801b4941918 R08: 88783256c7d3e93f R09: 88783256c7d3e93f
[171378.251998] R10: ffffffffa019134b R11: ffff88018fee1780 R12: ffff88018fee1920
[171378.252002] R13: ffff88018fee1878 R14: ffff8801b4941930 R15: 0000000000000100
[171378.252007] FS:  0000000000000000(0000) GS:ffff8801bfc00000(0000) knlGS:0000000000000000
[171378.252007] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[171378.252007] CR2: 00007f5d7a403847 CR3: 0000000016ed4000 CR4: 00000000000007f0
[171378.252007] Stack:
[171378.252007]  ffffffffa023d06e ffff88018fee1920 0000000000004f3c ffff88018fee1878
[171378.252007]  0000000000000000 0000000000000100 ffff8801b4941978 ffffffffa025ce91
[171378.252007]  0000000000000000 ffff880036a7c000 ffffffffa024465b 44bde31e8872970a
[171378.252007] Call Trace:
[171378.252007]  [<ffffffffa023d06e>] ? arc_read+0x40e/0x8d0 [zfs]
[171378.252007]  [<ffffffffa025ce91>] dnode_block_freed+0x81/0x150 [zfs]
[171378.252007]  [<ffffffffa024465b>] ? dbuf_hold_impl+0x8b/0xa0 [zfs]
[171378.252007]  [<ffffffffa0244697>] dbuf_prefetch+0x27/0x260 [zfs]
[171378.252007]  [<ffffffffa025926f>] dmu_zfetch_dofetch.isra.5+0x10f/0x180 [zfs]
[171378.252007]  [<ffffffffa0259a9b>] dmu_zfetch+0x52b/0xf20 [zfs]
[171378.252007]  [<ffffffffa0243fb0>] dbuf_read+0x700/0x880 [zfs]
[171378.252007]  [<ffffffffa019134b>] ? kmem_free_debug+0x4b/0x150 [spl]
[171378.252007]  [<ffffffffa024465b>] ? dbuf_hold_impl+0x8b/0xa0 [zfs]
[171378.252007]  [<ffffffffa025b5cf>] dnode_hold_impl+0x18f/0x5c0 [zfs]
[171378.252007]  [<ffffffffa025ba19>] dnode_hold+0x19/0x20 [zfs]
[171378.252007]  [<ffffffffa024d861>] dmu_object_info+0x21/0x50 [zfs]
[171378.252007]  [<ffffffffa02b0fad>] zfs_unlinked_drain+0x7d/0x120 [zfs]
[171378.252007]  [<ffffffffa019134b>] ? kmem_free_debug+0x4b/0x150 [spl]
[171378.252007]  [<ffffffff811a6b24>] ? kfree+0x134/0x170
[171378.252007]  [<ffffffffa019134b>] ? kmem_free_debug+0x4b/0x150 [spl]
[171378.252007]  [<ffffffff81096c23>] ? __wake_up+0x53/0x70
[171378.252007]  [<ffffffffa0197897>] taskq_thread+0x237/0x4b0 [spl]
[171378.252007]  [<ffffffff8109e190>] ? try_to_wake_up+0x210/0x210
[171378.252007]  [<ffffffffa0197660>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[171378.252007]  [<ffffffff8108c810>] kthread+0xc0/0xd0
[171378.252007]  [<ffffffff8108c750>] ? flush_kthread_worker+0xb0/0xb0
[171378.252007]  [<ffffffff817595fc>] ret_from_fork+0x7c/0xb0
[171378.252007]  [<ffffffff8108c750>] ? flush_kthread_worker+0xb0/0xb0
[171378.252007] Code: c0 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 53 48 83 ec 08 <4c> 8b 3f 48 89 55 d0 4c 8b 6f 10 4d 85 ff 75 20 eb 45 66 90 83 
[171378.252007] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.053 msecs

@dweeezil
Contributor

Do lots of the files have xattrs? If so, are you using xattr=sa? If you're not sure, you could try something like find <dir> -print0 | xargs -0 -n 1 attr -l on your directory tree.
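
A rough variant that just counts xattr entries rather than listing them all (the path is a placeholder):

find /path/to/dir -print0 | xargs -0 -n 1 attr -l 2>/dev/null | wc -l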

@satmandu
Contributor Author

Running that shows many, many xattrs.

Is there a way to nondestructively move xattrs over to system attributes?

@dweeezil
Contributor

Unfortunately, the only way at the moment is to copy the files. With current master code, even removing all the xattrs won't delete the hidden xattr directory.

Just out of curiosity, what is the nature of those xattrs? The reason I ask is that there is an unrelated issue (I forget the number right now) involving a filesystem used by netatalk, and I'd love to get a handle on the types of xattr operations it performs. A quick glance at netatalk's source code made it appear that it simply passes through xattr operations from the client.
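
Roughly, the copy-based workaround for a single file would look like this (a sketch only; the paths are placeholders, and the copy must preserve xattrs, which GNU cp -a / --preserve=all does on filesystems that support it):

cp -a /Volumes/backups/somefile /Volumes/backups/somefile.new
mv /Volumes/backups/somefile.new /Volumes/backups/somefile

The new file's xattrs should then be stored as system attributes, assuming xattr=sa is set on the dataset.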

@satmandu
Contributor Author

This is a filesystem used by netatalk...

find /Volumes/backups/ -print0 | xargs -0 -n 1 attr -l
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/161e
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/1594
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/178a
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/a70
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/1400
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/1477
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/a07
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/b8f
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/166b
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/c1b
Attribute "org.netatalk.Metadata" has a 402 byte value for /Volumes/backups/SharedItems/Backups/MacMachine.sparsebundle/bands/16f8

Let me know if you want any further debug info.

@dweeezil
Contributor

When I skimmed through the netatalk source code, I clearly didn't look closely enough. Apparently, as of version 3.1, it can store Mac Metadata in the org.netatalk.Metadata xattr.

In your original report, you mentioned backing up 250 GB of data "replacing another 250 GB of data". Does this imply the backup process was deleting files on the destination as the backup was continuing? If so, there are definitely issues when it comes to deleting files with dir-style xattrs, especially when lots of files are being deleted.

I've been working on various xattr-related issues and would like to know whether, after changing to xattr=sa, your system was updating or deleting xattrs regularly, or whether it mainly left them alone once they were created.
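
One rough way to spot-check whether a given file's xattrs are changing over time (a sketch; the path is a placeholder):

getfattr -d /Volumes/backups/somefile

Running that before and after a backup pass and comparing the output would show whether the org.netatalk.Metadata value is being rewritten.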

@satmandu
Contributor Author

I'm not quite sure what was happening. Maybe there was a metadata mismatch and Time Machine decided to do the backup all over again, at which point I think it was attempting to overwrite all of the data.

Let me know if there's any debugging you think I can enable to help you track down what you're looking for.

@ryao
Contributor

ryao commented Jun 20, 2014

I believe that #2408 will fix this.

@behlendorf
Contributor

I agree, this was almost certainly fixed by #2408. Closing.

@behlendorf modified the milestone: 0.6.5 on Nov 8, 2014