
arc_prune eats the whole CPU core [0.6.5.4] #4345

Closed
validname opened this issue Feb 18, 2016 · 34 comments
Labels: Type: Performance (performance improvement or performance problem)

Comments

@validname

Hello!

Right now we have a server on which [arc_prune] constantly eats a CPU core and a lot of processes are stuck in the 'D' state. This has lasted for 18 hours without change. We've tried to increase both zfs_arc_max and zfs_arc_meta_limit, but nothing changes.
ZFS: 0.6.5.4/1ffc4c1
SPL: 0.6.5.4/6e5e068
Linux kernel: 4.3.3

We ftraced this process for 1 second:

  1. function trace:
    https://gist.githubusercontent.com/validname/1a7f44ec106a933e0ca0/raw/61cb22b928bdb1cf4977b193d27dc66ea71e9f8b/01.arc_prune_function_trace
  2. function graph trace:
    https://gist.githubusercontent.com/validname/44bcc5e55e6e294d9e09/raw/bc36b1d394034c4e12fcfa97b82e7dabdc7bc696/02.arc_prune_function_graph_trace

What can we investigate further to find the reason for this behaviour?
And what can we do to prevent it?

@kernelOfTruth
Contributor

@validname Hi,

can you please post:

  - the output of /proc/spl/kstat/zfs/arcstats
  - some hardware specs
  - the values of zfs_arc_max and zfs_arc_meta_limit, and whatever else you've set?

@validname
Author

@kernelOfTruth OK.
This server has 2x Intel Xeon E5-2680 v2 (20 real cores) / 128 GiB RAM / an internal LSI 2108-based RAID controller / 2x Intel S3500-series SSDs (each in single-drive R0 mode) / a ZFS mirror. It runs Arch Linux with a custom-built 4.3.3 kernel (we manually added AppArmor support).

# cat /proc/spl/kstat/zfs/arcstats
6 1 0x01 91 4368 6221819066 1785599012599807
name                            type data
hits                            4    7758722774
misses                          4    3493296
demand_data_hits                4    4966044152
demand_data_misses              4    226217
demand_metadata_hits            4    2788354227
demand_metadata_misses          4    1900078
prefetch_data_hits              4    16723
prefetch_data_misses            4    761333
prefetch_metadata_hits          4    4307672
prefetch_metadata_misses        4    605668
mru_hits                        4    89910614
mru_ghost_hits                  4    2117197
mfu_hits                        4    7664489200
mfu_ghost_hits                  4    157388
deleted                         4    8873585
mutex_miss                      4    35
evict_skip                      4    82813
evict_not_enough                4    188
evict_l2_cached                 4    0
evict_l2_eligible               4    312979130880
evict_l2_ineligible             4    37087518208
evict_l2_skip                   4    0
hash_elements                   4    1231520
hash_elements_max               4    1350818
hash_collisions                 4    7455802
hash_chains                     4    43022
hash_chain_max                  4    4
p                               4    6901987328
c                               4    10737418240
c_min                           4    33554432
c_max                           4    10737418240
size                            4    9544153912
hdr_size                        4    440300512
data_size                       4    4070914048
metadata_size                   4    1722458112
other_size                      4    3310481240
anon_size                       4    1231360
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    4592134656
mru_evictable_data              4    3037456384
mru_evictable_metadata          4    119716352
mru_ghost_size                  4    5629000192
mru_ghost_evictable_data        4    4572791296
mru_ghost_evictable_metadata    4    1056208896
mfu_size                        4    1200006144
mfu_evictable_data              4    1032448000
mfu_evictable_metadata          4    111417856
mfu_ghost_size                  4    2263311872
mfu_ghost_evictable_data        4    2126214144
mfu_ghost_evictable_metadata    4    137097728
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    186
arc_meta_used                   4    5473239864
arc_meta_limit                  4    8589934592
arc_meta_max                    4    6942137120
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    2112065536

Actually, we've only set zfs_arc_max (initially to 8 GiB) and zfs_arc_meta_limit (to 6 GiB); a sketch of one common way to set these follows the parameter listing below. The other parameters have their default values:

#find /sys/module/zfs/parameters/ -type f | sort | xargs -l1 -I@ bash -c 'echo -en "@\t"; cat @'
/sys/module/zfs/parameters/l2arc_feed_again     1
/sys/module/zfs/parameters/l2arc_feed_min_ms    200
/sys/module/zfs/parameters/l2arc_feed_secs      1
/sys/module/zfs/parameters/l2arc_headroom       2
/sys/module/zfs/parameters/l2arc_headroom_boost 200
/sys/module/zfs/parameters/l2arc_nocompress     0
/sys/module/zfs/parameters/l2arc_noprefetch     1
/sys/module/zfs/parameters/l2arc_norw   0
/sys/module/zfs/parameters/l2arc_write_boost    8388608
/sys/module/zfs/parameters/l2arc_write_max      8388608
/sys/module/zfs/parameters/metaslab_aliquot     524288
/sys/module/zfs/parameters/metaslab_bias_enabled        1
/sys/module/zfs/parameters/metaslab_debug_load  0
/sys/module/zfs/parameters/metaslab_debug_unload        0
/sys/module/zfs/parameters/metaslab_fragmentation_factor_enabled        1
/sys/module/zfs/parameters/metaslab_lba_weighting_enabled       1
/sys/module/zfs/parameters/metaslab_preload_enabled     1
/sys/module/zfs/parameters/metaslabs_per_vdev   200
/sys/module/zfs/parameters/spa_asize_inflation  24
/sys/module/zfs/parameters/spa_config_path      /etc/zfs/zpool.cache
/sys/module/zfs/parameters/spa_load_verify_data 1
/sys/module/zfs/parameters/spa_load_verify_maxinflight  10000
/sys/module/zfs/parameters/spa_load_verify_metadata     1
/sys/module/zfs/parameters/spa_slop_shift       5
/sys/module/zfs/parameters/zfetch_array_rd_sz   1048576
/sys/module/zfs/parameters/zfetch_block_cap     256
/sys/module/zfs/parameters/zfetch_max_streams   8
/sys/module/zfs/parameters/zfetch_min_sec_reap  2
/sys/module/zfs/parameters/zfs_admin_snapshot   0
/sys/module/zfs/parameters/zfs_arc_average_blocksize    8192
/sys/module/zfs/parameters/zfs_arc_grow_retry   0
/sys/module/zfs/parameters/zfs_arc_lotsfree_percent     10
/sys/module/zfs/parameters/zfs_arc_max  10737418240
/sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts 4096
/sys/module/zfs/parameters/zfs_arc_meta_limit   8589934592
/sys/module/zfs/parameters/zfs_arc_meta_min     0
/sys/module/zfs/parameters/zfs_arc_meta_prune   10000
/sys/module/zfs/parameters/zfs_arc_meta_strategy        1
/sys/module/zfs/parameters/zfs_arc_min  1
/sys/module/zfs/parameters/zfs_arc_min_prefetch_lifespan        0
/sys/module/zfs/parameters/zfs_arc_num_sublists_per_state       40
/sys/module/zfs/parameters/zfs_arc_p_aggressive_disable 1
/sys/module/zfs/parameters/zfs_arc_p_dampener_disable   1
/sys/module/zfs/parameters/zfs_arc_p_min_shift  0
/sys/module/zfs/parameters/zfs_arc_shrink_shift 0
/sys/module/zfs/parameters/zfs_arc_sys_free     0
/sys/module/zfs/parameters/zfs_autoimport_disable       1
/sys/module/zfs/parameters/zfs_dbgmsg_enable    0
/sys/module/zfs/parameters/zfs_dbgmsg_maxsize   4194304
/sys/module/zfs/parameters/zfs_dbuf_state_index 0
/sys/module/zfs/parameters/zfs_deadman_enabled  1
/sys/module/zfs/parameters/zfs_deadman_synctime_ms      1000000
/sys/module/zfs/parameters/zfs_dedup_prefetch   0
/sys/module/zfs/parameters/zfs_delay_min_dirty_percent  60
/sys/module/zfs/parameters/zfs_delay_scale      500000
/sys/module/zfs/parameters/zfs_dirty_data_max   13517239500
/sys/module/zfs/parameters/zfs_dirty_data_max_max       33793098752
/sys/module/zfs/parameters/zfs_dirty_data_max_max_percent       25
/sys/module/zfs/parameters/zfs_dirty_data_max_percent   10
/sys/module/zfs/parameters/zfs_dirty_data_sync  67108864
/sys/module/zfs/parameters/zfs_disable_dup_eviction     0
/sys/module/zfs/parameters/zfs_expire_snapshot  300
/sys/module/zfs/parameters/zfs_flags    0
/sys/module/zfs/parameters/zfs_free_leak_on_eio 0
/sys/module/zfs/parameters/zfs_free_max_blocks  100000
/sys/module/zfs/parameters/zfs_free_min_time_ms 1000
/sys/module/zfs/parameters/zfs_immediate_write_sz       32768
/sys/module/zfs/parameters/zfs_max_recordsize   1048576
/sys/module/zfs/parameters/zfs_mdcomp_disable   0
/sys/module/zfs/parameters/zfs_metaslab_fragmentation_threshold 70
/sys/module/zfs/parameters/zfs_mg_fragmentation_threshold       85
/sys/module/zfs/parameters/zfs_mg_noalloc_threshold     0
/sys/module/zfs/parameters/zfs_nocacheflush     0
/sys/module/zfs/parameters/zfs_nopwrite_enabled 1
/sys/module/zfs/parameters/zfs_no_scrub_io      0
/sys/module/zfs/parameters/zfs_no_scrub_prefetch        0
/sys/module/zfs/parameters/zfs_pd_bytes_max     52428800
/sys/module/zfs/parameters/zfs_prefetch_disable 0
/sys/module/zfs/parameters/zfs_read_chunk_size  1048576
/sys/module/zfs/parameters/zfs_read_history     0
/sys/module/zfs/parameters/zfs_read_history_hits        0
/sys/module/zfs/parameters/zfs_recover  0
/sys/module/zfs/parameters/zfs_resilver_delay   2
/sys/module/zfs/parameters/zfs_resilver_min_time_ms     3000
/sys/module/zfs/parameters/zfs_scan_idle        50
/sys/module/zfs/parameters/zfs_scan_min_time_ms 1000
/sys/module/zfs/parameters/zfs_scrub_delay      4
/sys/module/zfs/parameters/zfs_send_corrupt_data        0
/sys/module/zfs/parameters/zfs_sync_pass_deferred_free  2
/sys/module/zfs/parameters/zfs_sync_pass_dont_compress  5
/sys/module/zfs/parameters/zfs_sync_pass_rewrite        2
/sys/module/zfs/parameters/zfs_top_maxinflight  32
/sys/module/zfs/parameters/zfs_txg_history      0
/sys/module/zfs/parameters/zfs_txg_timeout      5
/sys/module/zfs/parameters/zfs_vdev_aggregation_limit   131072
/sys/module/zfs/parameters/zfs_vdev_async_read_max_active       3
/sys/module/zfs/parameters/zfs_vdev_async_read_min_active       1
/sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent        60
/sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent        30
/sys/module/zfs/parameters/zfs_vdev_async_write_max_active      10
/sys/module/zfs/parameters/zfs_vdev_async_write_min_active      1
/sys/module/zfs/parameters/zfs_vdev_cache_bshift        16
/sys/module/zfs/parameters/zfs_vdev_cache_max   16384
/sys/module/zfs/parameters/zfs_vdev_cache_size  0
/sys/module/zfs/parameters/zfs_vdev_max_active  1000
/sys/module/zfs/parameters/zfs_vdev_mirror_switch_us    10000
/sys/module/zfs/parameters/zfs_vdev_read_gap_limit      32768
/sys/module/zfs/parameters/zfs_vdev_scheduler   noop
/sys/module/zfs/parameters/zfs_vdev_scrub_max_active    2
/sys/module/zfs/parameters/zfs_vdev_scrub_min_active    1
/sys/module/zfs/parameters/zfs_vdev_sync_read_max_active        10
/sys/module/zfs/parameters/zfs_vdev_sync_read_min_active        10
/sys/module/zfs/parameters/zfs_vdev_sync_write_max_active       10
/sys/module/zfs/parameters/zfs_vdev_sync_write_min_active       10
/sys/module/zfs/parameters/zfs_vdev_write_gap_limit     4096
/sys/module/zfs/parameters/zfs_zevent_cols      80
/sys/module/zfs/parameters/zfs_zevent_console   0
/sys/module/zfs/parameters/zfs_zevent_len_max   640
/sys/module/zfs/parameters/zil_replay_disable   0
/sys/module/zfs/parameters/zil_slog_limit       1048576
/sys/module/zfs/parameters/zio_delay_max        30000
/sys/module/zfs/parameters/zio_requeue_io_start_cut_in_line     1
/sys/module/zfs/parameters/zio_taskq_batch_pct  75
/sys/module/zfs/parameters/zvol_inhibit_dev     0
/sys/module/zfs/parameters/zvol_major   230
/sys/module/zfs/parameters/zvol_max_discard_blocks      16384
/sys/module/zfs/parameters/zvol_prefetch_bytes  131072
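For completeness, a minimal sketch of one common way to set the two values mentioned above persistently (the 8 GiB and 6 GiB figures are the initial values from this comment; /etc/modprobe.d/zfs.conf is the usual location, though distributions vary, and both parameters can also be changed at runtime through the files listed above):

# /etc/modprobe.d/zfs.conf (read when the zfs module is loaded)
options zfs zfs_arc_max=8589934592 zfs_arc_meta_limit=6442450944

# or adjust the live values
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
echo 6442450944 > /sys/module/zfs/parameters/zfs_arc_meta_limit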

@kernelOfTruth
Contributor

@validname how big is your L2ARC?

What additional filesystems do you use?

What use case are we talking about?

Any reason why you chose such a small ARC size?

Is there any output in dmesg (the kernel log)?

@validname
Author

@kernelOfTruth

  1. The L2ARC has zero size, if I've determined it correctly:
# cat /proc/spl/kstat/zfs/arcstats | grep l2 | grep size
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0

2+3.
This server hosts several LXC containers; each container has its own ZFS filesystem (with lz4 compression enabled). Each container runs some software serving network requests (php-fpm, for example). There is no file storage with billions of files, if that's what you were supposing. As far as ZFS is concerned, the containers fall into two groups: the first has memory caches and log files which grow monotonically; the second gets new files added episodically (we release new program code up to 10 times per day). Strangely, right now we have about 40 processes stuck in the 'D' state, and all of them belong to the single container with a memory cache and a big systemd journal (about 35 GiB uncompressed, as I recall). We tried to 'stop' this container, so its filesystem is now listed as unmounted, but we can't mount it again, I think because of the stuck processes. The other filesystems, including the root fs, work fine right now.
4. Actually, I'm new to ZFS and can't judge which ARC size is big or small for our case. Why do you consider 10-12 GiB a small size?
5. Sorry, I forgot about this.

INFO: task crond:29130 blocked for more than 120 seconds.
      Tainted: P           O    4.3.3-3-apparmor #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
crond           D ffff88203f0d5680     0 29130  28850 0x00000080
 ffff881f4e12fc08 0000000000000082 ffff882038fc3800 ffff880f7d641c00
 ffff881f4e130000 ffff881f5829faa4 ffff880f7d641c00 00000000ffffffff
 ffff881f5829faa8 ffff881f4e12fc20 ffffffff81573b6a ffff881f5829faa0
Call Trace:
 [<ffffffff81573b6a>] schedule+0x3a/0x90
 [<ffffffff81573f55>] schedule_preempt_disabled+0x15/0x20
 [<ffffffff8157539e>] __mutex_lock_slowpath+0xce/0x140
 [<ffffffff81575427>] mutex_lock+0x17/0x30
 [<ffffffff811ea315>] walk_component+0x1b5/0x2a0
 [<ffffffff811e8450>] ? path_init+0x210/0x3c0
 [<ffffffff811eacbd>] path_lookupat+0x5d/0x110
 [<ffffffff811ece61>] filename_lookup+0xb1/0x180
 [<ffffffff81098e7e>] ? finish_task_switch+0x5e/0x1b0
 [<ffffffff811970d1>] ? handle_mm_fault+0xc41/0x1850
 [<ffffffff811ed006>] user_path_at_empty+0x36/0x40
 [<ffffffff811e2776>] vfs_fstatat+0x66/0xc0
 [<ffffffff811e2ce3>] SyS_newstat+0x33/0x60
 [<ffffffff81003782>] ? syscall_trace_enter_phase1+0x162/0x170
 [<ffffffff8157762e>] entry_SYSCALL_64_fastpath+0x12/0x71
INFO: task zabbix_agentd:29188 blocked for more than 120 seconds.
      Tainted: P           O    4.3.3-3-apparmor #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zabbix_agentd   D ffff88103fa55680     0 29188  29184 0x00000082
 ffff880f2325bc68 0000000000000086 ffff882038fc0000 ffff880f265daa00
 ffff880f2325c000 ffff881f5829faa4 ffff880f265daa00 00000000ffffffff
 ffff881f5829faa8 ffff880f2325bc80 ffffffff81573b6a ffff881f5829faa0
Call Trace:
 [<ffffffff81573b6a>] schedule+0x3a/0x90
 [<ffffffff81573f55>] schedule_preempt_disabled+0x15/0x20
 [<ffffffff8157539e>] __mutex_lock_slowpath+0xce/0x140
 [<ffffffff81575427>] mutex_lock+0x17/0x30
 [<ffffffff811eb52a>] path_openat+0x45a/0x1260
 [<ffffffff811eda21>] do_filp_open+0x91/0x100
 [<ffffffff811fa608>] ? __alloc_fd+0x88/0x110
 [<ffffffff811dcde6>] do_sys_open+0x146/0x230
 [<ffffffff811dceee>] SyS_open+0x1e/0x20
 [<ffffffff8157762e>] entry_SYSCALL_64_fastpath+0x12/0x71
INFO: task ngs.cms:3510860 blocked for more than 120 seconds.
      Tainted: P           O    4.3.3-3-apparmor #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ngs.cms         D ffff88103f855680     0 3510860  28850 0x00000080
 ffff881913793c68 0000000000000082 ffff8810393f8e00 ffff881940ae8e00
 ffff881913794000 ffff881f5829faa4 ffff881940ae8e00 00000000ffffffff
 ffff881f5829faa8 ffff881913793c80 ffffffff81573b6a ffff881f5829faa0
Call Trace:
 [<ffffffff81573b6a>] schedule+0x3a/0x90
 [<ffffffff81573f55>] schedule_preempt_disabled+0x15/0x20
 [<ffffffff8157539e>] __mutex_lock_slowpath+0xce/0x140
 [<ffffffff81575427>] mutex_lock+0x17/0x30
 [<ffffffff811eb52a>] path_openat+0x45a/0x1260
 [<ffffffff810ab284>] ? should_numa_migrate_memory+0x54/0x130
 [<ffffffff811eda21>] do_filp_open+0x91/0x100
 [<ffffffff811fa608>] ? __alloc_fd+0x88/0x110
 [<ffffffff811dcde6>] do_sys_open+0x146/0x230
 [<ffffffff811dceee>] SyS_open+0x1e/0x20
 [<ffffffff8157762e>] entry_SYSCALL_64_fastpath+0x12/0x71
INFO: task ngs.cms:2162654 blocked for more than 120 seconds.
      Tainted: P           O    4.3.3-3-apparmor #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ngs.cms         D ffff88103f815680     0 2162654  28850 0x00000080
 ffff88080ff67c68 0000000000000082 ffffffff81812500 ffff8803e444b800
 ffff88080ff68000 ffff881f5829faa4 ffff8803e444b800 00000000ffffffff
 ffff881f5829faa8 ffff88080ff67c80 ffffffff81573b6a ffff881f5829faa0
Call Trace:
 [<ffffffff81573b6a>] schedule+0x3a/0x90
 [<ffffffff81573f55>] schedule_preempt_disabled+0x15/0x20
 [<ffffffff8157539e>] __mutex_lock_slowpath+0xce/0x140
 [<ffffffff81575427>] mutex_lock+0x17/0x30
 [<ffffffff811eb52a>] path_openat+0x45a/0x1260
 [<ffffffff8109cc73>] ? wake_up_process+0x23/0x40
 [<ffffffff810ab284>] ? should_numa_migrate_memory+0x54/0x130
 [<ffffffff811eda21>] do_filp_open+0x91/0x100
 [<ffffffff811fa608>] ? __alloc_fd+0x88/0x110
 [<ffffffff811dcde6>] do_sys_open+0x146/0x230
 [<ffffffff811dceee>] SyS_open+0x1e/0x20
 [<ffffffff8157762e>] entry_SYSCALL_64_fastpath+0x12/0x71
INFO: task sh:2995162 blocked for more than 120 seconds.
      Tainted: P           O    4.3.3-3-apparmor #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sh              D ffff88203f415680     0 2995162  29189 0x00000084
 ffff8808f8b37c68 0000000000000086 ffff882038ff0000 ffff880e86fc4600
 ffff8808f8b38000 ffff881f51c1a504 ffff880e86fc4600 00000000ffffffff
 ffff881f51c1a508 ffff8808f8b37c80 ffffffff81573b6a ffff881f51c1a500
Call Trace:
 [<ffffffff81573b6a>] schedule+0x3a/0x90
 [<ffffffff81573f55>] schedule_preempt_disabled+0x15/0x20
 [<ffffffff8157539e>] __mutex_lock_slowpath+0xce/0x140
 [<ffffffff81575427>] mutex_lock+0x17/0x30
 [<ffffffff811eb52a>] path_openat+0x45a/0x1260
 [<ffffffff811cf036>] ? mem_cgroup_begin_page_stat+0x16/0x90
 [<ffffffff811eda21>] do_filp_open+0x91/0x100
 [<ffffffff811fa608>] ? __alloc_fd+0x88/0x110
 [<ffffffff811dcde6>] do_sys_open+0x146/0x230
 [<ffffffff811dceee>] SyS_open+0x1e/0x20
 [<ffffffff8157762e>] entry_SYSCALL_64_fastpath+0x12/0x71
INFO: task sh:3002599 blocked for more than 120 seconds.
      Tainted: P           O    4.3.3-3-apparmor #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sh              D ffff88203f395680     0 3002599  29187 0x00000084
 ffff880e2a95bc68 0000000000000086 ffff882038fe6200 ffff8802312a9c00
 ffff880e2a95c000 ffff881f51c1a504 ffff8802312a9c00 00000000ffffffff
 ffff881f51c1a508 ffff880e2a95bc80 ffffffff81573b6a ffff881f51c1a500
Call Trace:
 [<ffffffff81573b6a>] schedule+0x3a/0x90
 [<ffffffff81573f55>] schedule_preempt_disabled+0x15/0x20
 [<ffffffff8157539e>] __mutex_lock_slowpath+0xce/0x140
 [<ffffffff81575427>] mutex_lock+0x17/0x30
 [<ffffffff811eb52a>] path_openat+0x45a/0x1260
 [<ffffffff811cf036>] ? mem_cgroup_begin_page_stat+0x16/0x90
 [<ffffffff811eda21>] do_filp_open+0x91/0x100
 [<ffffffff811fa608>] ? __alloc_fd+0x88/0x110
 [<ffffffff811dcde6>] do_sys_open+0x146/0x230
 [<ffffffff811dceee>] SyS_open+0x1e/0x20
 [<ffffffff8157762e>] entry_SYSCALL_64_fastpath+0x12/0x71
INFO: task sh:3006718 blocked for more than 120 seconds.
      Tainted: P           O    4.3.3-3-apparmor #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sh              D 0000000000000000     0 3006718  29191 0x00000084
 ffff880b0fc1bc68 0000000000000082 ffff8810393d0000 ffff880caa395400
 ffff880b0fc1c000 ffff881f51c1a504 ffff880caa395400 00000000ffffffff
 ffff881f51c1a508 ffff880b0fc1bc80 ffffffff81573b6a ffff881f51c1a500
Call Trace:
 [<ffffffff81573b6a>] schedule+0x3a/0x90
 [<ffffffff81573f55>] schedule_preempt_disabled+0x15/0x20
 [<ffffffff8157539e>] __mutex_lock_slowpath+0xce/0x140
 [<ffffffff81575427>] mutex_lock+0x17/0x30
 [<ffffffff811eb52a>] path_openat+0x45a/0x1260
 [<ffffffff811cf036>] ? mem_cgroup_begin_page_stat+0x16/0x90
 [<ffffffff811eda21>] do_filp_open+0x91/0x100
 [<ffffffff811fa608>] ? __alloc_fd+0x88/0x110
 [<ffffffff811dcde6>] do_sys_open+0x146/0x230
 [<ffffffff811dceee>] SyS_open+0x1e/0x20
 [<ffffffff8157762e>] entry_SYSCALL_64_fastpath+0x12/0x71
INFO: task sshd:3026648 blocked for more than 120 seconds.
      Tainted: P           O    4.3.3-3-apparmor #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sshd            D ffff88203f215680     0 3026648  29177 0x00000080
 ffff881e20ddbc68 0000000000000086 ffff882038fd0000 ffff881ffbda9c00
 ffff881e20ddc000 ffff881f51c1a504 ffff881ffbda9c00 00000000ffffffff
 ffff881f51c1a508 ffff881e20ddbc80 ffffffff81573b6a ffff881f51c1a500
Call Trace:
 [<ffffffff81573b6a>] schedule+0x3a/0x90
 [<ffffffff81573f55>] schedule_preempt_disabled+0x15/0x20
 [<ffffffff8157539e>] __mutex_lock_slowpath+0xce/0x140
 [<ffffffff81575427>] mutex_lock+0x17/0x30
 [<ffffffff811eb52a>] path_openat+0x45a/0x1260
 [<ffffffff811cf036>] ? mem_cgroup_begin_page_stat+0x16/0x90
 [<ffffffff811eda21>] do_filp_open+0x91/0x100
 [<ffffffff811fa608>] ? __alloc_fd+0x88/0x110
 [<ffffffff811dcde6>] do_sys_open+0x146/0x230
 [<ffffffff811dceee>] SyS_open+0x1e/0x20
 [<ffffffff8157762e>] entry_SYSCALL_64_fastpath+0x12/0x71

All of the processes mentioned above are now either gone (killed by the LXC utilities) or stuck in the 'D' state.

@behlendorf
Contributor

@validname thanks for the function graph trace; it sheds considerable light on the problem. You're definitely hitting the goto again case in zfs_vget() repeatedly, right here. That suggests the inode is already undergoing eviction, but that process must be blocked on something. If you could post all of the stack traces, we may be able to tell for sure.

I suspect you may have run afoul of a deadlock in the eviction path which was recently fixed in master, 3b9fd93. The fix has already been cherry-picked into the zfs-0.6.5-release branch and will appear in the next zfs tag.

If this was just a one-off, I'd suggest waiting for the new tag in a week or two. If you're able to trigger this more easily, you may want to cherry-pick the fix and verify that it resolves the issue.
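For anyone wanting to test before the next tag, a minimal sketch of cherry-picking the fix into a local build (commit 3b9fd93 and the zfs-0.6.5-release branch are the ones referenced above; the autotools steps are the stock ZFS on Linux source build and may differ from your distribution's packaging):

# in a checkout of the ZFS source tree
git fetch origin
git checkout zfs-0.6.5-release        # the release branch mentioned above
git cherry-pick 3b9fd93               # the eviction-path deadlock fix
sh autogen.sh && ./configure && make
sudo make install && sudo depmod -a   # then reload the spl/zfs modules or reboot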

@validname
Author

@behlendorf Thanks for the answer!
I'm not sure which stack traces you mean, so I'll post several.

# arc_prune kernel stack distinct traces for 10 seconds
for i in `seq 1 100`; do stack=`cat /proc/979/stack`; if [ "$stack" != "[<ffffffffffffffff>] 0xffffffffffffffff" ]; then echo -e "---------\n$stack"; fi; sleep 0.1; done > /tmp/arc_prune_stacks.lst

https://gist.githubusercontent.com/validname/e53eaea240132d23b5f6/raw/bb77fef3b5e25b3d74a47fa0cbfe1d7b4c19fbfe/arc_prune_stacks.lst

# arc_prune ftrace distinct traces for 10 seconds (the relative paths below live under /sys/kernel/debug/tracing)
echo "979" > set_ftrace_pid
echo "nop" > current_tracer
echo "1" > /proc/sys/kernel/stack_tracer_enabled
prev=""
for i in `seq 1 100`; do stack=`cat stack_trace`; if [ "$stack" != "$prev" ]; then echo -e "-------\n $stack"; fi; prev="$stack"; sleep 0.1; done > /tmp/arc_prune_stacks_ftrace.lst
echo "0" > /proc/sys/kernel/stack_tracer_enabled

https://gist.githubusercontent.com/validname/b96b47b0911f94619c4f/raw/8f146d4b7be990a1b12be33e7946e7ee7e6a71d6/arc_prune_stacks_ftrace.lst

# instant stack traces for all processes currently in uninterruptible ('D') sleep
ps -A -o state,pid| awk '$1~"^D" {print $2}'| xargs -l1 -I@ bash -c 'echo "-------------"; cat /proc/@/stack' > /tmp/uninterruptable_sleep_stacks.lst

https://gist.githubusercontent.com/validname/ed7b664e463999e31dae/raw/3f48730763e2d362f38ab8228b24a5a1f984a4e2/uninterruptable_sleep_stacks.lst

Unfortunately, we don't know exactly what triggers this problem. But we have several servers with nearly the same configuration and load, so I'll just set up ZFS with the fix you mentioned on half of them.

@gordan-bobic
Contributor

I also seem to be hitting this bug. This is about as much as I could get out of the system, as the whole machine becomes completely unresponsive; I couldn't even open a new terminal window or start programs (root is on ZFS):
https://goo.gl/photos/oEfSkarrC4VR6G11A
So by 30 hours of uptime, arc_prune has already clocked up nearly 7 hours of CPU time.

echo 3 > /proc/sys/vm/drop_caches returned nearly immediately with no memory actually being freed.

There was a scrub running when the problem began to manifest, so this may contribute to the problem without being a root cause.

I should also probably point out that this happens fairly regularly on this machine, which is NOT in 24/7 use (gets switched off when not used most of the time), but when it does get used it is usually under a very heavy load.

Spec:
2x X5690 (each with 6 cores / 12 threads)
96GB of ECC RAM

Two pools:
2x 1TB SSD mirrored
4x 4TB HDD RAIDZ2

I am reasonably certain this wasn't happening last year, so it may be interesting to find the version in which this issue first appeared.

@dweeezil
Contributor

@validname I think @behlendorf meant zfs_zget(). Also, the code in question has since moved down to line 1133 (such are the risks of line-number links to the master branch).

@logan2211

logan2211 commented Jul 8, 2016

I may be hitting an issue related to this. arc_prune is eating a core, and many (maybe all?) processes accessing this filesystem are hung. zpool iostat shows essentially no I/O on the pool.

[24725.103645] glusterfsd      D ffff88078a33bb98     0 31528      1 0x00000000
[24725.103651]  ffff88078a33bb98 ffff88078a33bb68 ffff880c24260000 ffff880c225a8000
[24725.103653]  ffff88078a33c000 ffff880bf7c22504 ffff880c225a8000 00000000ffffffff
[24725.103656]  ffff880bf7c22508 ffff88078a33bbb0 ffffffff818235b5 ffff880bf7c22500
[24725.103659] Call Trace:
[24725.103662]  [<ffffffff818235b5>] schedule+0x35/0x80
[24725.103665]  [<ffffffff8182385e>] schedule_preempt_disabled+0xe/0x10
[24725.103667]  [<ffffffff81825499>] __mutex_lock_slowpath+0xb9/0x130
[24725.103670]  [<ffffffff8182552f>] mutex_lock+0x1f/0x30
[24725.103673]  [<ffffffff81219ec7>] walk_component+0x217/0x300
[24725.103676]  [<ffffffff8121abc1>] link_path_walk+0x191/0x5b0
[24725.103679]  [<ffffffff81219adb>] ? path_init+0x1eb/0x3c0
[24725.103681]  [<ffffffff8121b0dc>] path_lookupat+0x7c/0x110
[24725.103684]  [<ffffffff8121cd01>] filename_lookup+0xb1/0x180
[24725.103686]  [<ffffffff81233209>] ? getxattr+0x1e9/0x250
[24725.103689]  [<ffffffff811eba07>] ? kmem_cache_alloc+0x187/0x1f0
[24725.103692]  [<ffffffff8121c906>] ? getname_flags+0x56/0x1f0
[24725.103695]  [<ffffffff8121cea6>] user_path_at_empty+0x36/0x40
[24725.103697]  [<ffffffff81211fe6>] vfs_fstatat+0x66/0xc0
[24725.103700]  [<ffffffff812125a1>] SYSC_newlstat+0x31/0x60
[24725.103702]  [<ffffffff8122dae4>] ? mntput+0x24/0x40
[24725.103704]  [<ffffffff81216efe>] ? path_put+0x1e/0x30
[24725.103707]  [<ffffffff812332e1>] ? path_getxattr+0x71/0xb0
[24725.103709]  [<ffffffff81039ca3>] ? fpu__restore+0xf3/0x150
[24725.103712]  [<ffffffff812126de>] SyS_newlstat+0xe/0x10
[24725.103714]  [<ffffffff818276b2>] entry_SYSCALL_64_fastpath+0x16/0x71
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10717 root      20   0 1639364 112540   6856 S 744.4  0.2   1467:47 glusterfsd
 1768 root      20   0       0      0      0 R  98.0  0.0 235:09.06 arc_prune
27926 root      20   0   70640  58068   2476 R  95.0  0.1 231:44.92 rsync
12426 root      20   0   15532   2796   2460 R  94.7  0.0 217:48.84 rsync
28735 root      20   0   58932  45464   2464 R  93.4  0.1 224:53.47 rsync
 9205 root      20   0   15788   3076   2352 R  92.7  0.0 224:23.04 rsync
28635 root      20   0   91376  76592   2432 R  92.1  0.2 233:41.91 rsync
 8272 root      20   0   19788   7936   2456 R  91.7  0.0 218:08.47 rsync
27728 root      20   0   81632  69080   2416 R  90.7  0.1 226:41.08 rsync
 7118 root      20   0  139572 118916   2464 R  88.7  0.2 221:01.25 rsync
# dpkg -l | grep zfs
ii  libzfs2linux                       0.6.5.6-0ubuntu10               amd64        Native OpenZFS filesystem library for Linux
ii  zfs-doc                            0.6.5.6-0ubuntu10               all          Native OpenZFS filesystem documentation and examples.
ii  zfs-zed                            0.6.5.6-0ubuntu10               amd64        OpenZFS Event Daemon (zed)
ii  zfsutils-linux                     0.6.5.6-0ubuntu10               amd64        Native OpenZFS management utilities for Linux
Linux host 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
# find /sys/module/zfs/parameters/ -type f | sort | xargs -l1 -I@ bash -c 'echo -en "@\t"; cat @'
/sys/module/zfs/parameters/l2arc_feed_again 1
/sys/module/zfs/parameters/l2arc_feed_min_ms    200
/sys/module/zfs/parameters/l2arc_feed_secs  1
/sys/module/zfs/parameters/l2arc_headroom   2
/sys/module/zfs/parameters/l2arc_headroom_boost 200
/sys/module/zfs/parameters/l2arc_nocompress 0
/sys/module/zfs/parameters/l2arc_noprefetch 1
/sys/module/zfs/parameters/l2arc_norw   0
/sys/module/zfs/parameters/l2arc_write_boost    8388608
/sys/module/zfs/parameters/l2arc_write_max  8388608
/sys/module/zfs/parameters/metaslab_aliquot 524288
/sys/module/zfs/parameters/metaslab_bias_enabled    1
/sys/module/zfs/parameters/metaslab_debug_load  0
/sys/module/zfs/parameters/metaslab_debug_unload    0
/sys/module/zfs/parameters/metaslab_fragmentation_factor_enabled    1
/sys/module/zfs/parameters/metaslab_lba_weighting_enabled   1
/sys/module/zfs/parameters/metaslab_preload_enabled 1
/sys/module/zfs/parameters/metaslabs_per_vdev   200
/sys/module/zfs/parameters/spa_asize_inflation  24
/sys/module/zfs/parameters/spa_config_path  /etc/zfs/zpool.cache
/sys/module/zfs/parameters/spa_load_verify_data 1
/sys/module/zfs/parameters/spa_load_verify_maxinflight  10000
/sys/module/zfs/parameters/spa_load_verify_metadata 1
/sys/module/zfs/parameters/spa_slop_shift   5
/sys/module/zfs/parameters/zfetch_array_rd_sz   1048576
/sys/module/zfs/parameters/zfetch_block_cap 256
/sys/module/zfs/parameters/zfetch_max_streams   8
/sys/module/zfs/parameters/zfetch_min_sec_reap  2
/sys/module/zfs/parameters/zfs_admin_snapshot   0
/sys/module/zfs/parameters/zfs_arc_average_blocksize    8192
/sys/module/zfs/parameters/zfs_arc_grow_retry   0
/sys/module/zfs/parameters/zfs_arc_lotsfree_percent 10
/sys/module/zfs/parameters/zfs_arc_max  0
/sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts 4096
/sys/module/zfs/parameters/zfs_arc_meta_limit   0
/sys/module/zfs/parameters/zfs_arc_meta_min 0
/sys/module/zfs/parameters/zfs_arc_meta_prune   10000
/sys/module/zfs/parameters/zfs_arc_meta_strategy    1
/sys/module/zfs/parameters/zfs_arc_min  0
/sys/module/zfs/parameters/zfs_arc_min_prefetch_lifespan    0
/sys/module/zfs/parameters/zfs_arc_num_sublists_per_state   24
/sys/module/zfs/parameters/zfs_arc_p_aggressive_disable 1
/sys/module/zfs/parameters/zfs_arc_p_dampener_disable   1
/sys/module/zfs/parameters/zfs_arc_p_min_shift  0
/sys/module/zfs/parameters/zfs_arc_shrink_shift 0
/sys/module/zfs/parameters/zfs_arc_sys_free 0
/sys/module/zfs/parameters/zfs_autoimport_disable   1
/sys/module/zfs/parameters/zfs_dbgmsg_enable    0
/sys/module/zfs/parameters/zfs_dbgmsg_maxsize   4194304
/sys/module/zfs/parameters/zfs_dbuf_state_index 0
/sys/module/zfs/parameters/zfs_deadman_enabled  1
/sys/module/zfs/parameters/zfs_deadman_synctime_ms  1000000
/sys/module/zfs/parameters/zfs_dedup_prefetch   0
/sys/module/zfs/parameters/zfs_delay_min_dirty_percent  60
/sys/module/zfs/parameters/zfs_delay_scale  500000
/sys/module/zfs/parameters/zfs_dirty_data_max   5063816396
/sys/module/zfs/parameters/zfs_dirty_data_max_max   12659540992
/sys/module/zfs/parameters/zfs_dirty_data_max_max_percent   25
/sys/module/zfs/parameters/zfs_dirty_data_max_percent   10
/sys/module/zfs/parameters/zfs_dirty_data_sync  67108864
/sys/module/zfs/parameters/zfs_disable_dup_eviction 0
/sys/module/zfs/parameters/zfs_expire_snapshot  300
/sys/module/zfs/parameters/zfs_flags    0
/sys/module/zfs/parameters/zfs_free_leak_on_eio 0
/sys/module/zfs/parameters/zfs_free_max_blocks  100000
/sys/module/zfs/parameters/zfs_free_min_time_ms 1000
/sys/module/zfs/parameters/zfs_immediate_write_sz   32768
/sys/module/zfs/parameters/zfs_max_recordsize   1048576
/sys/module/zfs/parameters/zfs_mdcomp_disable   0
/sys/module/zfs/parameters/zfs_metaslab_fragmentation_threshold 70
/sys/module/zfs/parameters/zfs_mg_fragmentation_threshold   85
/sys/module/zfs/parameters/zfs_mg_noalloc_threshold 0
/sys/module/zfs/parameters/zfs_nocacheflush 0
/sys/module/zfs/parameters/zfs_nopwrite_enabled 1
/sys/module/zfs/parameters/zfs_no_scrub_io  0
/sys/module/zfs/parameters/zfs_no_scrub_prefetch    0
/sys/module/zfs/parameters/zfs_object_mutex_size    64
/sys/module/zfs/parameters/zfs_pd_bytes_max 52428800
/sys/module/zfs/parameters/zfs_prefetch_disable 0
/sys/module/zfs/parameters/zfs_read_chunk_size  1048576
/sys/module/zfs/parameters/zfs_read_history 0
/sys/module/zfs/parameters/zfs_read_history_hits    0
/sys/module/zfs/parameters/zfs_recover  0
/sys/module/zfs/parameters/zfs_resilver_delay   2
/sys/module/zfs/parameters/zfs_resilver_min_time_ms 3000
/sys/module/zfs/parameters/zfs_scan_idle    50
/sys/module/zfs/parameters/zfs_scan_min_time_ms 1000
/sys/module/zfs/parameters/zfs_scrub_delay  4
/sys/module/zfs/parameters/zfs_send_corrupt_data    0
/sys/module/zfs/parameters/zfs_sync_pass_deferred_free  2
/sys/module/zfs/parameters/zfs_sync_pass_dont_compress  5
/sys/module/zfs/parameters/zfs_sync_pass_rewrite    2
/sys/module/zfs/parameters/zfs_top_maxinflight  32
/sys/module/zfs/parameters/zfs_txg_history  0
/sys/module/zfs/parameters/zfs_txg_timeout  5
/sys/module/zfs/parameters/zfs_vdev_aggregation_limit   131072
/sys/module/zfs/parameters/zfs_vdev_async_read_max_active   3
/sys/module/zfs/parameters/zfs_vdev_async_read_min_active   1
/sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent    60
/sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent    30
/sys/module/zfs/parameters/zfs_vdev_async_write_max_active  10
/sys/module/zfs/parameters/zfs_vdev_async_write_min_active  1
/sys/module/zfs/parameters/zfs_vdev_cache_bshift    16
/sys/module/zfs/parameters/zfs_vdev_cache_max   16384
/sys/module/zfs/parameters/zfs_vdev_cache_size  0
/sys/module/zfs/parameters/zfs_vdev_max_active  1000
/sys/module/zfs/parameters/zfs_vdev_mirror_switch_us    10000
/sys/module/zfs/parameters/zfs_vdev_read_gap_limit  32768
/sys/module/zfs/parameters/zfs_vdev_scheduler   noop
/sys/module/zfs/parameters/zfs_vdev_scrub_max_active    2
/sys/module/zfs/parameters/zfs_vdev_scrub_min_active    1
/sys/module/zfs/parameters/zfs_vdev_sync_read_max_active    10
/sys/module/zfs/parameters/zfs_vdev_sync_read_min_active    10
/sys/module/zfs/parameters/zfs_vdev_sync_write_max_active   10
/sys/module/zfs/parameters/zfs_vdev_sync_write_min_active   10
/sys/module/zfs/parameters/zfs_vdev_write_gap_limit 4096
/sys/module/zfs/parameters/zfs_zevent_cols  80
/sys/module/zfs/parameters/zfs_zevent_console   0
/sys/module/zfs/parameters/zfs_zevent_len_max   384
/sys/module/zfs/parameters/zil_replay_disable   0
/sys/module/zfs/parameters/zil_slog_limit   1048576
/sys/module/zfs/parameters/zio_delay_max    30000
/sys/module/zfs/parameters/zio_requeue_io_start_cut_in_line 1
/sys/module/zfs/parameters/zio_taskq_batch_pct  75
/sys/module/zfs/parameters/zvol_inhibit_dev 0
/sys/module/zfs/parameters/zvol_major   230
/sys/module/zfs/parameters/zvol_max_discard_blocks  16384
/sys/module/zfs/parameters/zvol_prefetch_bytes  131072
# cat /proc/spl/kstat/zfs/arcstats
6 1 0x01 91 4368 40679789768 36914794635643
name                            type data
hits                            4    41650151
misses                          4    34525314
demand_data_hits                4    765910
demand_data_misses              4    13387934
demand_metadata_hits            4    40177671
demand_metadata_misses          4    20019690
prefetch_data_hits              4    28035
prefetch_data_misses            4    403038
prefetch_metadata_hits          4    678535
prefetch_metadata_misses        4    714652
mru_hits                        4    23290516
mru_ghost_hits                  4    507972
mfu_hits                        4    17653066
mfu_ghost_hits                  4    410386
deleted                         4    3751055
mutex_miss                      4    218
evict_skip                      4    48612229
evict_not_enough                4    1739168
evict_l2_cached                 4    131469312
evict_l2_eligible               4    140930869760
evict_l2_ineligible             4    10385740288
evict_l2_skip                   4    300
hash_elements                   4    417356
hash_elements_max               4    1527304
hash_collisions                 4    401053
hash_chains                     4    9968
hash_chain_max                  4    4
p                               4    23736509824
c                               4    25319081984
c_min                           4    33554432
c_max                           4    25319081984
size                            4    19524622384
hdr_size                        4    174089312
data_size                       4    512
metadata_size                   4    6726998528
other_size                      4    12622829952
anon_size                       4    16896
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    3108688384
mru_evictable_data              4    0
mru_evictable_metadata          4    0
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    3618293760
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    17570
l2_misses                       4    1496198
l2_feeds                        4    14129
l2_rw_clash                     4    0
l2_read_bytes                   4    11333120
l2_write_bytes                  4    22817792
l2_writes_sent                  4    571
l2_writes_done                  4    571
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    5
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    106093568
l2_asize                        4    22616064
l2_hdr_size                     4    704080
l2_compress_successes           4    6643
l2_compress_zeros               4    0
l2_compress_failures            4    48
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    104
memory_indirect_count           4    280
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    1396529
arc_meta_used                   4    19524621872
arc_meta_limit                  4    18989311488
arc_meta_max                    4    19524839400
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    791220224

@logan2211

Looking at the work going on in #4850, is there a workaround for production systems right now, e.g. by setting some ARC parameters? I have a few systems that keep falling into this.

@logan2211

@dweeezil: I'm testing out your patch at https://github.com/dweeezil/zfs/tree/arc-dnode-limit-0.6.5 and seeing similar breakage still.

[53163.849380] INFO: task glusterfsd:22397 blocked for more than 120 seconds.
[53163.920949]       Tainted: P          IOE   4.4.0-31-generic #50-Ubuntu
[53163.992100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[53164.132648] glusterfsd      D ffff88080301fb98     0 22397      1 0x00000000
[53164.132653]  ffff88080301fb98 ffff88080301fb68 ffff8806242b9900 ffff88091b3f7080
[53164.132656]  ffff880803020000 ffff8809243b4414 ffff88091b3f7080 00000000ffffffff
[53164.132659]  ffff8809243b4418 ffff88080301fbb0 ffffffff81829a25 ffff8809243b4410
[53164.132661] Call Trace:
[53164.132672]  [<ffffffff81829a25>] schedule+0x35/0x80
[53164.132679]  [<ffffffff81829cce>] schedule_preempt_disabled+0xe/0x10
[53164.132683]  [<ffffffff8182b909>] __mutex_lock_slowpath+0xb9/0x130
[53164.132685]  [<ffffffff8182b99f>] mutex_lock+0x1f/0x30
[53164.132691]  [<ffffffff81219f67>] walk_component+0x217/0x300
[53164.132693]  [<ffffffff8121ac61>] link_path_walk+0x191/0x5b0
[53164.132694]  [<ffffffff81219b7b>] ? path_init+0x1eb/0x3c0
[53164.132696]  [<ffffffff8121b17c>] path_lookupat+0x7c/0x110
[53164.132698]  [<ffffffff8121cda1>] filename_lookup+0xb1/0x180
[53164.132702]  [<ffffffff812332a9>] ? getxattr+0x1e9/0x250
[53164.132706]  [<ffffffff811eba97>] ? kmem_cache_alloc+0x187/0x1f0
[53164.132708]  [<ffffffff8121c9a6>] ? getname_flags+0x56/0x1f0
[53164.132711]  [<ffffffff8121cf46>] user_path_at_empty+0x36/0x40
[53164.132713]  [<ffffffff81212086>] vfs_fstatat+0x66/0xc0
[53164.132714]  [<ffffffff81212641>] SYSC_newlstat+0x31/0x60
[53164.132716]  [<ffffffff8122db84>] ? mntput+0x24/0x40
[53164.132718]  [<ffffffff81216f9e>] ? path_put+0x1e/0x30
[53164.132719]  [<ffffffff81233381>] ? path_getxattr+0x71/0xb0
[53164.132723]  [<ffffffff8121277e>] SyS_newlstat+0xe/0x10
[53164.132728]  [<ffffffff8182db32>] entry_SYSCALL_64_fastpath+0x16/0x71
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7586 root      20   0 1370968  85452   6684 S 476.9  0.2  39441:59 glusterfsd
14796 root      20   0       0      0      0 R  92.3  0.0   7972:35 arc_prune
# git -C zfs branch
* arc-dnode-limit-0.6.5
# modinfo zfs/module/zfs/zfs.ko | grep version
version:        0.6.5.7-1
srcversion:     D032EDD984A9484F926D9B2
vermagic:       4.4.0-31-generic SMP mod_unload modversions
# cat /sys/module/zfs/*version*
D032EDD984A9484F926D9B2
0.6.5.7-1
Linux host 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
# find /sys/module/zfs/parameters/ -type f | sort | xargs -l1 -I@ bash -c 'echo -en "@\t"; cat @'
/sys/module/zfs/parameters/l2arc_feed_again 1
/sys/module/zfs/parameters/l2arc_feed_min_ms    200
/sys/module/zfs/parameters/l2arc_feed_secs  1
/sys/module/zfs/parameters/l2arc_headroom   2
/sys/module/zfs/parameters/l2arc_headroom_boost 200
/sys/module/zfs/parameters/l2arc_nocompress 0
/sys/module/zfs/parameters/l2arc_noprefetch 1
/sys/module/zfs/parameters/l2arc_norw   0
/sys/module/zfs/parameters/l2arc_write_boost    8388608
/sys/module/zfs/parameters/l2arc_write_max  8388608
/sys/module/zfs/parameters/metaslab_aliquot 524288
/sys/module/zfs/parameters/metaslab_bias_enabled    1
/sys/module/zfs/parameters/metaslab_debug_load  0
/sys/module/zfs/parameters/metaslab_debug_unload    0
/sys/module/zfs/parameters/metaslab_fragmentation_factor_enabled    1
/sys/module/zfs/parameters/metaslab_lba_weighting_enabled   1
/sys/module/zfs/parameters/metaslab_preload_enabled 1
/sys/module/zfs/parameters/metaslabs_per_vdev   200
/sys/module/zfs/parameters/spa_asize_inflation  24
/sys/module/zfs/parameters/spa_config_path  /etc/zfs/zpool.cache
/sys/module/zfs/parameters/spa_load_verify_data 1
/sys/module/zfs/parameters/spa_load_verify_maxinflight  10000
/sys/module/zfs/parameters/spa_load_verify_metadata 1
/sys/module/zfs/parameters/spa_slop_shift   5
/sys/module/zfs/parameters/zfetch_array_rd_sz   1048576
/sys/module/zfs/parameters/zfetch_block_cap 256
/sys/module/zfs/parameters/zfetch_max_streams   8
/sys/module/zfs/parameters/zfetch_min_sec_reap  2
/sys/module/zfs/parameters/zfs_admin_snapshot   0
/sys/module/zfs/parameters/zfs_arc_average_blocksize    8192
/sys/module/zfs/parameters/zfs_arc_dnode_limit  0
/sys/module/zfs/parameters/zfs_arc_dnode_reduce_percent 10
/sys/module/zfs/parameters/zfs_arc_grow_retry   0
/sys/module/zfs/parameters/zfs_arc_lotsfree_percent 10
/sys/module/zfs/parameters/zfs_arc_max  0
/sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts 4096
/sys/module/zfs/parameters/zfs_arc_meta_limit   0
/sys/module/zfs/parameters/zfs_arc_meta_min 0
/sys/module/zfs/parameters/zfs_arc_meta_prune   10000
/sys/module/zfs/parameters/zfs_arc_meta_strategy    1
/sys/module/zfs/parameters/zfs_arc_min  0
/sys/module/zfs/parameters/zfs_arc_min_prefetch_lifespan    0
/sys/module/zfs/parameters/zfs_arc_num_sublists_per_state   24
/sys/module/zfs/parameters/zfs_arc_p_aggressive_disable 1
/sys/module/zfs/parameters/zfs_arc_p_dampener_disable   1
/sys/module/zfs/parameters/zfs_arc_p_min_shift  0
/sys/module/zfs/parameters/zfs_arc_shrink_shift 0
/sys/module/zfs/parameters/zfs_arc_sys_free 0
/sys/module/zfs/parameters/zfs_autoimport_disable   1
/sys/module/zfs/parameters/zfs_dbgmsg_enable    0
/sys/module/zfs/parameters/zfs_dbgmsg_maxsize   4194304
/sys/module/zfs/parameters/zfs_dbuf_state_index 0
/sys/module/zfs/parameters/zfs_deadman_enabled  1
/sys/module/zfs/parameters/zfs_deadman_synctime_ms  1000000
/sys/module/zfs/parameters/zfs_dedup_prefetch   0
/sys/module/zfs/parameters/zfs_delay_min_dirty_percent  60
/sys/module/zfs/parameters/zfs_delay_scale  500000
/sys/module/zfs/parameters/zfs_dirty_data_max   3795455590
/sys/module/zfs/parameters/zfs_dirty_data_max_max   9488638976
/sys/module/zfs/parameters/zfs_dirty_data_max_max_percent   25
/sys/module/zfs/parameters/zfs_dirty_data_max_percent   10
/sys/module/zfs/parameters/zfs_dirty_data_sync  67108864
/sys/module/zfs/parameters/zfs_disable_dup_eviction 0
/sys/module/zfs/parameters/zfs_expire_snapshot  300
/sys/module/zfs/parameters/zfs_flags    0
/sys/module/zfs/parameters/zfs_free_leak_on_eio 0
/sys/module/zfs/parameters/zfs_free_max_blocks  100000
/sys/module/zfs/parameters/zfs_free_min_time_ms 1000
/sys/module/zfs/parameters/zfs_immediate_write_sz   32768
/sys/module/zfs/parameters/zfs_max_recordsize   1048576
/sys/module/zfs/parameters/zfs_mdcomp_disable   0
/sys/module/zfs/parameters/zfs_metaslab_fragmentation_threshold 70
/sys/module/zfs/parameters/zfs_mg_fragmentation_threshold   85
/sys/module/zfs/parameters/zfs_mg_noalloc_threshold 0
/sys/module/zfs/parameters/zfs_nocacheflush 0
/sys/module/zfs/parameters/zfs_nopwrite_enabled 1
/sys/module/zfs/parameters/zfs_no_scrub_io  0
/sys/module/zfs/parameters/zfs_no_scrub_prefetch    0
/sys/module/zfs/parameters/zfs_object_mutex_size    64
/sys/module/zfs/parameters/zfs_pd_bytes_max 52428800
/sys/module/zfs/parameters/zfs_prefetch_disable 0
/sys/module/zfs/parameters/zfs_read_chunk_size  1048576
/sys/module/zfs/parameters/zfs_read_history 0
/sys/module/zfs/parameters/zfs_read_history_hits    0
/sys/module/zfs/parameters/zfs_recover  0
/sys/module/zfs/parameters/zfs_resilver_delay   2
/sys/module/zfs/parameters/zfs_resilver_min_time_ms 3000
/sys/module/zfs/parameters/zfs_scan_idle    50
/sys/module/zfs/parameters/zfs_scan_min_time_ms 1000
/sys/module/zfs/parameters/zfs_scrub_delay  4
/sys/module/zfs/parameters/zfs_send_corrupt_data    0
/sys/module/zfs/parameters/zfs_sync_pass_deferred_free  2
/sys/module/zfs/parameters/zfs_sync_pass_dont_compress  5
/sys/module/zfs/parameters/zfs_sync_pass_rewrite    2
/sys/module/zfs/parameters/zfs_top_maxinflight  32
/sys/module/zfs/parameters/zfs_txg_history  0
/sys/module/zfs/parameters/zfs_txg_timeout  5
/sys/module/zfs/parameters/zfs_vdev_aggregation_limit   131072
/sys/module/zfs/parameters/zfs_vdev_async_read_max_active   3
/sys/module/zfs/parameters/zfs_vdev_async_read_min_active   1
/sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent    60
/sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent    30
/sys/module/zfs/parameters/zfs_vdev_async_write_max_active  10
/sys/module/zfs/parameters/zfs_vdev_async_write_min_active  1
/sys/module/zfs/parameters/zfs_vdev_cache_bshift    16
/sys/module/zfs/parameters/zfs_vdev_cache_max   16384
/sys/module/zfs/parameters/zfs_vdev_cache_size  0
/sys/module/zfs/parameters/zfs_vdev_max_active  1000
/sys/module/zfs/parameters/zfs_vdev_mirror_switch_us    10000
/sys/module/zfs/parameters/zfs_vdev_read_gap_limit  32768
/sys/module/zfs/parameters/zfs_vdev_scheduler   noop
/sys/module/zfs/parameters/zfs_vdev_scrub_max_active    2
/sys/module/zfs/parameters/zfs_vdev_scrub_min_active    1
/sys/module/zfs/parameters/zfs_vdev_sync_read_max_active    10
/sys/module/zfs/parameters/zfs_vdev_sync_read_min_active    10
/sys/module/zfs/parameters/zfs_vdev_sync_write_max_active   10
/sys/module/zfs/parameters/zfs_vdev_sync_write_min_active   10
/sys/module/zfs/parameters/zfs_vdev_write_gap_limit 4096
/sys/module/zfs/parameters/zfs_zevent_cols  80
/sys/module/zfs/parameters/zfs_zevent_console   0
/sys/module/zfs/parameters/zfs_zevent_len_max   384
/sys/module/zfs/parameters/zil_replay_disable   0
/sys/module/zfs/parameters/zil_slog_limit   1048576
/sys/module/zfs/parameters/zio_delay_max    30000
/sys/module/zfs/parameters/zio_requeue_io_start_cut_in_line 1
/sys/module/zfs/parameters/zio_taskq_batch_pct  75
/sys/module/zfs/parameters/zvol_inhibit_dev 0
/sys/module/zfs/parameters/zvol_major   230
/sys/module/zfs/parameters/zvol_max_discard_blocks  16384
/sys/module/zfs/parameters/zvol_prefetch_bytes  131072

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jul 26, 2016
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).

In order to help track metadata usage more precisely, the other_size
metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes.  Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
it tries to unpin 10% of the dnodes.

The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):

    - Create a large number of small files until arc_meta_used exceeds
      arc_meta_limit (3GiB with default tuning) and arc_prune
      starts increasing.

    - Create a 3GiB file with dd.  Observe arc_meta_used.  It will still
      be around 3GiB.

    - Repeatedly read the 3GiB file and observe arc_meta_limit as before.
      It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made.  The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#4345
Issue openzfs#4512
Issue openzfs#4773
Closes openzfs#4858
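To experiment with the new tunable on a build that contains this change, a minimal sketch (the 1 GiB figure is purely illustrative; the parameter takes a byte count and, like the other ARC tunables shown in the listings above, is exposed under /sys/module/zfs/parameters):

# cap the ARC space consumed by dnodes at ~1 GiB (value in bytes, illustrative)
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_dnode_limit
# watch the new arcstats introduced by this change
grep -E 'dnode_size|arc_dnode_limit|arc_prune' /proc/spl/kstat/zfs/arcstats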
@logan2211

Is there any additional debugging I could provide to help? I am still hitting this constantly on https://github.com/dweeezil/zfs/tree/arc-dnode-limit-0.6.5

@dweeezil
Contributor

dweeezil commented Aug 6, 2016

@logan2211 A good start would be the arcstats when running with the dnode-limiting patch. I referenced this issue in the patch because your original arcstats showed the characteristics of a problem it would fix. The extra arcstats it provides might help determine what's happening. (Note: I'm "off the grid" until August 15th, so will likely not have a chance to pursue this further until then.)

@logan2211

logan2211 commented Aug 6, 2016

Sorry I forgot to include that! Thanks for the reply.

This is from a system running 0.6.5.7-1 from your branch, and it is spinning arc_prune right now.

# cat /proc/spl/kstat/zfs/arcstats
6 1 0x01 94 4512 12124429915051 434014865433845
name                            type data
hits                            4    10119064
misses                          4    35137432
demand_data_hits                4    500531
demand_data_misses              4    7746092
demand_metadata_hits            4    9263246
demand_metadata_misses          4    27010730
prefetch_data_hits              4    15779
prefetch_data_misses            4    142833
prefetch_metadata_hits          4    339508
prefetch_metadata_misses        4    237777
mru_hits                        4    1531914
mru_ghost_hits                  4    93963
mfu_hits                        4    8231863
mfu_ghost_hits                  4    90029
deleted                         4    748904
mutex_miss                      4    48
evict_skip                      4    88822
evict_not_enough                4    832
evict_l2_cached                 4    85633411072
evict_l2_eligible               4    57720585728
evict_l2_ineligible             4    3524957184
evict_l2_skip                   4    0
hash_elements                   4    868164
hash_elements_max               4    881643
hash_collisions                 4    202027
hash_chains                     4    41860
hash_chain_max                  4    4
p                               4    774172750
c                               4    5894210764
c_min                           4    33554432
c_max                           4    25319063552
size                            4    5776723096
hdr_size                        4    132901928
data_size                       4    1422624768
metadata_size                   4    2126125568
dbuf_size                       4    442700928
dnode_size                      4    1164875616
bonus_size                      4    435344320
anon_size                       4    16384
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    420177920
mru_evictable_data              4    0
mru_evictable_metadata          4    3477504
mru_ghost_size                  4    5473600000
mru_ghost_evictable_data        4    3752202752
mru_ghost_evictable_metadata    4    1721397248
mfu_size                        4    3128556032
mfu_evictable_data              4    1422624768
mfu_evictable_metadata          4    1574905856
mfu_ghost_size                  4    420490752
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    420490752
l2_hits                         4    142135
l2_misses                       4    34995246
l2_feeds                        4    429184
l2_rw_clash                     4    1
l2_read_bytes                   4    490681856
l2_write_bytes                  4    83366187520
l2_writes_sent                  4    59866
l2_writes_done                  4    59866
l2_writes_error                 4    0
l2_writes_lock_retry            4    1
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    688
l2_cdata_free_on_write          4    73
l2_abort_lowmem                 4    2
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    86455729664
l2_asize                        4    75745787904
l2_hdr_size                     4    52149968
l2_compress_successes           4    533455
l2_compress_zeros               4    0
l2_compress_failures            4    365842
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    830
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    9118
arc_meta_used                   4    4354098328
arc_meta_limit                  4    18989297664
arc_dnode_limit                 4    1898929766
arc_meta_max                    4    8487656368
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    791220224

@dweeezil
Contributor

dweeezil commented Aug 6, 2016

@logan2211 Those arcstats do show a dnode metadata jam-up. You mentioned gluster, which I think might be a big xattr user. If so, are you using xattr=sa? You might also need dfbc863 or possibly 09fb30e. You might also try increasing zfs_arc_dnode_reduce_percent from its default of 10 (percent), possibly all the way up to 100. There's some reason the system can't shed the dnodes, and that's why it's spinning.
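
Roughly, that would look like the following (a sketch only; the dataset name tank/gluster is a placeholder, and xattr=sa only affects files written after it is set, so existing data would need to be rewritten):

# prune dnodes more aggressively once the dnode limit is exceeded
echo 100 > /sys/module/zfs/parameters/zfs_arc_dnode_reduce_percent

# store xattrs as system attributes in the dnode instead of hidden xattr directories
# (placeholder dataset name)
zfs set xattr=sa tank/gluster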

@logan2211

Thanks for the info. I was not using xattr=sa, but I have enabled it now and deleted/resynced a lot of data into the pools. I also increased zfs_arc_dnode_reduce_percent to 100. I was still seeing the problem after doing these things, and one of the patches you mentioned did not merge cleanly into your arc-dnode-limit-0.6.5 branch, so a few days ago I built a module from zfs/master (e35c5a8), and since then I have not had any problems. I am considering removing the zfs_arc_dnode_reduce_percent override on one of the nodes to see if things continue to run smoothly. It seems like whatever was causing it may be fixed in master, though!

@AndCycle

I still get some serious [arc_prune] CPU usage if I set zfs_arc_max to only 4GB, even with dweeezil@650f6cf applied on my Gentoo box running zfs 0.6.5.8;
arc_prune just struggles to make any progress.

If I raise the bar to 8GB zfs_arc_max with a 6GB arc_meta_limit it doesn't happen.

My system has a workload mixed between heavy metadata usage and some streaming reads.

Here is my system status 50 minutes after boot:

Linux bacztwo 4.4.20-gentoo #1 SMP Tue Sep 13 17:24:51 CST 2016 x86_64 Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz GenuineIntel GNU/Linux
 02:26:34 up 51 min,  2 users,  load average: 22.46, 20.85, 17.32
Tasks: 2537 total,  11 running, 2526 sleeping,   0 stopped,   0 zombie
%Cpu(s): 20.3 us, 30.8 sy, 46.3 ni,  1.5 id,  0.9 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 32624400 total,  6918072 free, 18926252 used,  6780076 buff/cache
KiB Swap: 22848004 total, 22848004 free,        0 used.  5977792 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
17991 root      20   0 12.553g 7.416g  15908 S  77.7 23.8  34:31.22 /usr/local/crashplan/jre/bin/java -Dfile.encoding=UTF-8 -Dapp=CrashPlanService -DappBaseName=CrashPlan -Xms20m -Xmx8G -Dsun.net.inetaddr.ttl=300 -Dnetworkaddress.cache+
20147 miya      20   0 8501932 2.194g  20036 S  68.8  7.1  37:12.51 /opt/icedtea-bin-3.1.0/bin/java -server -Xmx4096M -jar forge-1.10.2-12.18.1.2076-universal.jar
19343 qemu      20   0 6959992 4.083g   9888 S  63.7 13.1  26:55.05 /usr/bin/qemu-system-x86_64 -name guest=Win10_JP_P2P,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-Win10_JP_P2P/maste+
17765 boinc     39  19  278616 235188  75412 R  62.4  0.7  17:49.94 /var/lib/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu 235095 /var/lib/boinc/slots/5
17775 boinc     39  19  278320 234936  75408 R  58.3  0.7  17:41.62 /var/lib/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu 233045 /var/lib/boinc/slots/6
 6069 munin     30  10   82552  41872   3648 R  52.5  0.1   0:01.65 /usr/bin/perl /usr/libexec/munin/munin-html
17787 boinc     39  19  278468 235076  75408 R  41.1  0.7  18:12.84 /var/lib/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu 235135 /var/lib/boinc/slots/4
17755 boinc     39  19  539316 318236  27436 R  35.4  1.0  27:47.10 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.73_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:native 00001.pdb -silen+
 2976 root      20   0       0      0      0 S  25.5  0.0   1:38.45 [arc_prune]
17769 boinc     39  19  278448 235044  75412 R  23.2  0.7  17:46.45 /var/lib/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu 232090 /var/lib/boinc/slots/1
 2978 root      20   0       0      0      0 S  22.9  0.0   1:37.48 [arc_prune]
17779 boinc     39  19  278524 235108  75400 R  22.9  0.7  18:05.64 /var/lib/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu 234185 /var/lib/boinc/slots/2
17759 boinc     39  19  278224 234840  75416 R  20.7  0.7  18:06.29 /var/lib/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu 228710 /var/lib/boinc/slots/0
 2977 root      20   0       0      0      0 S  19.7  0.0   1:36.54 [arc_prune]
 2979 root      20   0       0      0      0 S  19.4  0.0   2:04.06 [arc_reclaim]
 2975 root      20   0       0      0      0 S  18.8  0.0   1:37.01 [arc_prune]
 2973 root      20   0       0      0      0 S  18.5  0.0   1:40.66 [arc_prune]
 2974 root      20   0       0      0      0 S  17.8  0.0   1:40.07 [arc_prune]
 2971 root      20   0       0      0      0 S  15.6  0.0   1:38.63 [arc_prune]
17783 boinc     39  19  278596 235192  75408 R  15.6  0.7  18:02.78 /var/lib/boinc/projects/climateprediction.net/wah2rm3m2t_um_8.12_i686-pc-linux-gnu 223555 /var/lib/boinc/slots/7
 2972 root      20   0       0      0      0 S  14.0  0.0   1:38.98 [arc_prune]
22460 nobody    20   0  235696  12920   9484 R  10.5  0.0   0:47.90 /usr/sbin/smbd -D
19360 root      20   0       0      0      0 S   3.2  0.0   0:54.71 [vhost-19343]
19396 root      20   0    9492    132      0 S   3.2  0.0   0:21.93 /usr/sbin/rngd --pid-file /var/run/rngd.pid --background --random-step 64 --fill-watermark 2048
 3332 root      20   0   31428   5304   2124 R   2.5  0.0   0:02.29 top
 2970 root      20   0       0      0      0 S   1.0  0.0   0:17.08 [dbu_evict]
 2980 root      20   0       0      0      0 S   0.6  0.0   0:08.74 [arc_user_evicts]

NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zassist   222G  23.0G   199G         -    38%    10%  1.00x  ONLINE  -
  ata-INTEL_SSDSC2BW240A4_CVDA349202642403GN   222G  23.0G   199G         -    38%    10%
zbackup-20160822  36.2T  30.9T  5.40T         -      -    85%  1.00x  ONLINE  -
  raidz1  36.2T  30.9T  5.40T         -      -    85%
    ata-ST4000VN000-1H4168_S301B0HM      -      -      -         -      -      -
    ata-ST4000VN000-1H4168_S301B0QR      -      -      -         -      -      -
    ata-ST4000VN000-1H4168_Z300LSPL      -      -      -         -      -      -
    ata-ST4000VN000-1H4168_Z300M6X2      -      -      -         -      -      -
    ata-ST4000VN000-1H4168_Z300M7QH      -      -      -         -      -      -
    ata-ST4000VN000-1H4168_Z300PJ59      -      -      -         -      -      -
    ata-ST4000VN000-1H4168_Z300Q3BX      -      -      -         -      -      -
    ata-ST4000VN000-1H4168_Z300Q6CC      -      -      -         -      -      -
    ata-ST4000VN000-1H4168_Z30335X2      -      -      -         -      -      -
    ata-ST4000VN000-1H4168_Z3046QH9      -      -      -         -      -      -
zmess  27.2T  17.1T  10.2T         -    26%    62%  1.00x  ONLINE  -
  raidz2  27.2T  17.1T  10.2T         -    26%    62%
    ata-TOSHIBA_MD04ACA500_Z481K69HFS9A      -      -      -         -      -      -
    ata-TOSHIBA_MD04ACA500_Z484K4E1FS9A      -      -      -         -      -      -
    ata-TOSHIBA_MD04ACA500_6679K8KEFS9A      -      -      -         -      -      -
    ata-TOSHIBA_MD04ACA500_Z485K1GEFS9A      -      -      -         -      -      -
    ata-TOSHIBA_MD04ACA500_Z485K1GFFS9A      -      -      -         -      -      -
    ata-TOSHIBA_MD04ACA500_Z485K1GHFS9A      -      -      -         -      -      -
zroot   444G   366G  78.0G         -    54%    82%  1.00x  ONLINE  -
  sda   444G   366G  78.0G         -    54%    82%
zsingle  3.62T  1.27T  2.35T         -    30%    35%  1.00x  ONLINE  -
  ata-TOSHIBA_MD03ACA400V_34BMK004F  3.62T  1.27T  2.35T         -    30%    35%
ztank  32.5T  27.2T  5.27T         -    33%    83%  1.00x  ONLINE  -
  raidz2  32.5T  27.2T  5.27T         -    33%    83%
    ata-TOSHIBA_MD04ACA600_5574K0MRFFQC      -      -      -         -      -      -
    ata-TOSHIBA_MD04ACA600_5576K0T2FFQC      -      -      -         -      -      -
    ata-TOSHIBA_MD04ACA600_5592K14HFFQC      -      -      -         -      -      -
    ata-TOSHIBA_MD04ACA600_55JBK02TFFQC      -      -      -         -      -      -
    ata-TOSHIBA_MD04ACA600_55JBK02UFFQC      -      -      -         -      -      -
    ata-TOSHIBA_MD04ACA600_55Q4K0VJFFQC      -      -      -         -      -      -
cache      -      -      -         -      -      -
  ata-INTEL_SSDSC2KW120H6_CVLT616200V0120GGN-part2  30.0G  2.45G  27.5G         -     0%     8%
6 1 0x01 94 4512 13260305672 3020916418030
name                            type data
hits                            4    396309232
misses                          4    71125695
demand_data_hits                4    329366951
demand_data_misses              4    65084977
demand_metadata_hits            4    63405776
demand_metadata_misses          4    3860549
prefetch_data_hits              4    1961990
prefetch_data_misses            4    1630318
prefetch_metadata_hits          4    1574515
prefetch_metadata_misses        4    549851
mru_hits                        4    87833843
mru_ghost_hits                  4    196186
mfu_hits                        4    304988813
mfu_ghost_hits                  4    246261
deleted                         4    3293400
mutex_miss                      4    101206
evict_skip                      4    731125702
evict_not_enough                4    2399463
evict_l2_cached                 4    17088484864
evict_l2_eligible               4    71545396736
evict_l2_ineligible             4    35658309632
evict_l2_skip                   4    122
hash_elements                   4    355956
hash_elements_max               4    605368
hash_collisions                 4    378490
hash_chains                     4    13994
hash_chain_max                  4    5
p                               4    3998082048
c                               4    4294967296
c_min                           4    33554432
c_max                           4    4294967296
size                            4    3621499752
hdr_size                        4    31219344
data_size                       4    134238208
metadata_size                   4    1245721600
dbuf_size                       4    441753696
dnode_size                      4    1272769456
bonus_size                      4    466719360
anon_size                       4    39997952
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    784983552
mru_evictable_data              4    61673984
mru_evictable_metadata          4    11631616
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    554978304
mfu_evictable_data              4    44591104
mfu_evictable_metadata          4    1659904
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    530901
l2_misses                       4    70586693
l2_feeds                        4    3025
l2_rw_clash                     4    1
l2_read_bytes                   4    2146740224
l2_write_bytes                  4    2978040832
l2_writes_sent                  4    2853
l2_writes_done                  4    2853
l2_writes_error                 4    0
l2_writes_lock_retry            4    2
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    104
l2_cdata_free_on_write          4    3
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    6512591872
l2_asize                        4    2629901824
l2_hdr_size                     4    29078088
l2_compress_successes           4    245240
l2_compress_zeros               4    0
l2_compress_failures            4    5880
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    2421296
arc_meta_used                   4    3487261544
arc_meta_limit                  4    3221225472
arc_dnode_limit                 4    322122547
arc_meta_max                    4    3560479088
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    521990144

@validname
Author

@AndCycle Actually, that seems to be normal behaviour for a system with limited resources (such as memory): arc_prune just prunes obsolete cache entries. What happens if you increase the limits (both zfs_arc_max and arc_meta_limit) while arc_prune is consuming CPU? Do you have hard disk activity when this happens?
The initial case, as I understand it, was about arc_prune being unable to do its normal job under certain circumstances. In that case it just blocks any disk activity and does nothing by itself.
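
(If it helps, this is roughly what I mean by increasing the limits; both parameters are writable at runtime, and the byte values below are only examples that should be sized to your machine.)

# example values only: 16 GiB ARC cap, 12 GiB metadata cap
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_meta_limit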

@AndCycle

@validname If I increase zfs_arc_max and arc_meta_limit, it doesn't consume that much CPU, just a few percent. My hard disks are still active, but the system isn't completely fine with this:

I can observe some hiccups in munin as empty data points, which means it fails to write to the rrd in time; this happens a few times a day.

I think this is just one step away from pushing it over the limit.

Yes, my case doesn't really fit your situation; the only common part here is the high arc_prune CPU usage. I just followed dweeezil's patch and ended up here.

@behlendorf behlendorf removed the Bug label Sep 30, 2016
@behlendorf
Contributor

It seems like whatever was causing it may be fixed in master though

Can someone verify that this is the case. Has anyone been able to reproduce this issue in master? How about 0.6.5.8?

@AndCycle

@validname Nope, not really normal, because it's a busy loop without actually doing anything;
the metadata just keeps growing over the limit.

After playing around on my system I found out it's probably caused by inotify or something related.

I have the CrashPlan backup service on my machine, which constantly scans through the system
and watches every file on it in order to collect realtime info about what to back up.

Once I stop CrashPlan, dnode_size drops immediately.

My guess is that inotify pins the metadata, which then can't be released at all in my scenario;
not many people run a full-system file watch service in the background.

@gordan-bobic
Contributor

Interesting, I suspect SELinux' restorecond may also be doing a similar thing, albeit not as extensively as CrashPlan or lsyncd.

@dweeezil
Contributor

@AndCycle It seems an "inotify watch everything" workload would call for increasing zfs_arc_dnode_limit_percent, possibly to 100. Have you increased max_user_watches to accommodate such a huge number of inotify watches?
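
Something like this, for example (a sketch; zfs_arc_dnode_limit_percent only exists with the dnode-limit patch applied, and the watch count is just an example value):

# let dnodes use up to 100% of the metadata limit
echo 100 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent

# allow more inotify watches per user (example value)
sysctl -w fs.inotify.max_user_watches=1048576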

@gordan-bobic
Contributor

If the dnodes are being wedged in an unevictable state in the ARC and can't be freed with inotify watches in place, won't upping the zfs_arc_dnode_limit_percent to 100 only postpone the inevitable? Especially in cases where the ARC size is very restricted.

@AndCycle

@dweeezil I tried increasing zfs_arc_dnode_limit_percent to 100, with no effect.

Here are my current system values:
fs.epoll.max_user_watches = 6672732
fs.inotify.max_user_watches = 1048576

I increased max_user_watches as CrashPlan requested, due to the storage size.
Well, I can turn off the realtime watch flag in CrashPlan to keep my system going for now.

@dweeezil
Contributor

@AndCycle Unfortunately, it's definitely possible to cause Bad Things to happen with ZoL if you can pin enough metadata, and it seems the idea of inotifying everything can do just that. You might be able to mitigate the issue a bit by setting zfs_arc_meta_strategy=0 or zfs_arc_meta_adjust_restarts=1.

I'm not very familiar with these filesystem notification schemes within the kernel. Is this program trying to watch every single file? Or just every single directory? Or both? I can't imagine this ever working very well on a very large filesystem with, say, millions or billions of files/directories. It doesn't seem very scalable.

One thing that comes to mind is that we're using max_ncpus to scale the prune taskq, and on some systems (hypervisors, EC2™ with certain AMIs maybe) I've seen it set to absurdly huge values, which doesn't help things much. Maybe this is a bit heavy-handed and should be tunable, or maybe it ought to run at a slightly lower priority. In any case, if it's not making any progress, the result will likely be unbounded ARC growth.
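
For anyone who wants to check, something like this shows how the prune taskq was scaled and applies the mitigations above (a sketch; both module parameters should be runtime-writable on 0.6.5):

# number of arc_prune kernel threads vs. CPUs visible to the kernel
pgrep -c arc_prune
nproc

# metadata-only eviction strategy and fewer adjust restarts, as suggested above
echo 0 > /sys/module/zfs/parameters/zfs_arc_meta_strategy
echo 1 > /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts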

@AndCycle

AndCycle commented Oct 14, 2016

@dweeezil Yeah, it's not very scalable.

Inotify monitoring of directories is not recursive: to monitor
subdirectories under a directory, additional watches must be created.
This can take a significant amount time for large directory trees.

If you want to recreate the scenario, get inotify-tools

https://github.com/rvoicilas/inotify-tools/wiki

then make it watch somewhere with a massive directory structure:

inotifywatch -r /home/

This will easily get tons of dnode usage pinned inside the ARC.
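
Something like this should make the effect visible (just a sketch; the path and directory count are arbitrary, and the tree should live on a ZFS dataset):

# build a large directory tree on a ZFS dataset (path and size are placeholders)
mkdir -p /tank/inotify-test
for i in $(seq 1 1000); do
    mkdir -p "/tank/inotify-test/dir$i"
    touch "/tank/inotify-test/dir$i/file"
done

# recursively watch the tree, which (per the guess above) pins its dnodes in the ARC
inotifywatch -r /tank/inotify-test &

# dnode_size should climb and stay high while the watches are held
grep dnode_size /proc/spl/kstat/zfs/arcstats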

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Nov 8, 2016
@validname
Author

Sorry for the very long delay in answering. And sorry for the bad news.
I cannot find a way to simulate or provoke this issue on our server, so I just waited for it to happen again.
I built ZFS from the same commit plus a patch made from commit 650f6cf, and I set new parameters:

/sys/module/zfs/parameters/zfs_arc_dnode_limit	3221225472
/sys/module/zfs/parameters/zfs_arc_dnode_reduce_percent	10

The test server worked for 52 days (vs. 30 days on average before) under its usual workload, and then it happened again. Again, one CPU core was eaten by a single arc_prune kernel thread.
And sorry (for the third time): the kernel rebooted this server automatically when I tried to get ftrace tracings.

@cserem

cserem commented Nov 26, 2016

Hi!

I seem to be having the same issue. It is on 0.6.5.7-1~trusty, but since this issue is still open I am posting here.
Kernel: 4.4.0-45-generic #66~14.04.1-Ubuntu

After about two weeks of system uptime and ~30 minutes into the nightly rsync that copies data from the ZFS pool, arc_prune goes to 100% on one CPU core and stays there until reboot.

Right now arc_prune has been running like this for about 12 hours.
See attached arcstat, function_graph, and traces like the ones requested by @behlendorf , provided by @validname : https://gist.github.com/cserem/aea75b9606c095dcf0f1740aba506c75

Unlike #4726, I only have one arc_prune process running wild and no runaway arc_reclaim.
Unlike #4239, rsync does not hang.

Am I suffering from the same issue?

@validname
Author

validname commented Dec 6, 2016

I've got some info from another stalled server with a 4.3.3 kernel and zfs 0.6.5.4 (1ffc4c1) plus the patch from 650f6cf. As usual, increasing the ARC size didn't help; only a reboot did.
https://gist.github.com/validname/63bfd4bfcb046d4d2ffd6d4899b1ff8b
(edited part:)
Interestingly, there is nothing suspicious about dnodes now:

$ cat 10.arcstats | egrep '^(arc_meta_used|hdr_size|metadata_size|dbuf_size|dnode_size|bonus_size)'| sort | awk '{print $3"\t"$1}'  | sort -rn
7328655848      arc_meta_used
2995504608      dnode_size
2375707648      metadata_size
814165168       dbuf_size
746072960       bonus_size
397205464       hdr_size

dnode_size is far from its limit of 4 GB, but arc_meta_used is at its limit.
I'd rather say that arc_prune was working hard at evicting some other kind of metadata (meta-metadata?).

@seletskiy
Contributor

Any progress on this issue? We have servers with zfs 0.6.5.4 and the aforementioned patches that get stuck every two weeks (the pattern is a high write rate into systemd-journald). Is there any additional debugging info I can provide to help identify the bug?

@dweeezil
Contributor

@seletskiy Could you please open a new issue with specifics from your system when the problem is occurring (see https://github.com/zfsonlinux/zfs/wiki/FAQ#reporting-a-problem) and also the arcstats.

@validname In the gist you posted on Dec 5 2016, arc_meta_used is actually 187MB below arc_meta_limit. Also, I'm not sure what you mean by "nothing about dnodes"?

Something else I'd like to point out to anyone seeing arc_prune use a lot of CPU: a task is dispatched to it for each mounted filesystem, so if you have many mounted filesystems, especially with lots of active inodes in each one, arc_prune's CPU usage will definitely be higher.

Finally, anyone experiencing this issue should try setting zfs_arc_meta_strategy=0 to see if the situation improves.
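
A quick way to see how many prune tasks get dispatched per pass is simply to count the mounted ZFS filesystems (either command should do):

# count mounted ZFS filesystems
zfs mount | wc -l
mount -t zfs | wc -l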

@ghost

ghost commented Jan 12, 2017

... the pattern is high write rate into systemd-journald ...

... After about two weeks of system uptime and ~30 minutes into the nightly rsync ...

@cserem @seletskiy @validname
Could you provide a reproducible I/O profile, in the form of user-space
commands/utilities/daemons with their configs (gists are fine), that triggers this
ZFS behavior quickly (i.e. within hours/minutes rather than days/weeks), or that at
least reproduces it reliably as is?

Thanks in advance!

@behlendorf
Contributor

Closing. This issue should be resolved in the 0.7 series.
