ARC_PRUNE Storm - Server becomes unusable - 1 SSD drive, 60 zfs filesystems mounted #6630

Closed
partoneoftwo opened this issue Sep 11, 2017 · 4 comments


Type                    Version/Name
Distribution Name       Debian
Distribution Version    9.1
Linux Kernel            4.10.17-1-pve (as shipped with Proxmox v5.0)
Architecture            x64
ZFS Version             0.6.5.9-1 (as shipped with Proxmox v5.0)
SPL Version             0.6.5.9-1 (as shipped with Proxmox v5.0)

Describe the problem you're observing

We're experiencing extremely high CPU load (400-450 continuously, for hours) and soft lockups multiple times a day involving arc_prune and spl_system_task.

As a result of this bug, the machine becomes practically inoperable.
SSH logins are still possible, but the system cannot stat most entries under /proc/.
So top works, but htop does not; ps ax works, but ps aux does not (presumably because htop and ps aux read per-process files under /proc that hang).
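
A quick way to confirm the /proc hang (a sketch; the PID is just an example taken from the arc_prune list further down):

# If stat on a per-process /proc entry hangs, timeout kills it after 5 seconds
# and the fallback message is printed instead.
timeout 5 stat /proc/462/status || echo "stat on /proc/462/status hung or failed"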

Describe how to reproduce the problem

The box has 92 GB ECC RAM, 2 x Xeon E5620 (16 cores) and, possibly crucially, 60 mounted ZFS filesystems (base filesystems + snapshots).
The ZFS pool lives on a single SSD drive. The machine has over 50% free memory and CPU utilization is normally 2%-10%.

Number of LXC containers using ZFS, running at the same time: 59
Number of KVM machines using ZFS, booted up: 1

The LXC containers using ZFS filesystems are not causing the issue.
The arc_prune storm, as I've learnt this phenomenon is called, occurs when booting up the KVM virtual machine and performing an OS install, or when copying 10 GB+ files onto the VM's filesystem.

The server then becomes bogged down by arc_prune kernel threads that never seem to finish. Load averages go from 90, 85, 15 to 350, 345, 320 to 450, 450, 450.
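
One way to watch the storm while it is happening (a sketch; arc_prune in arcstats is the cumulative prune-callback counter, and the dnode_size field only appears on 0.7+):

# A rapidly climbing arc_prune count during the incident confirms the pruning activity.
watch -n 1 "grep -E '^(arc_prune|arc_meta_used|arc_meta_limit|dnode_size)' /proc/spl/kstat/zfs/arcstats"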

Output of lshw -class disk

  *-disk
       description: ATA Disk
       product: Samsung SSD 850
       physical id: 0.0.0
       bus info: scsi@3:0.0.0
       logical name: /dev/sdf
       version: 2B6Q
       serial: S2BBNWAG112461J
       size: 953GiB (1024GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=ffd6d002-c0cb-eb41-bf7d-09896421be99 logicalsectorsize=512 sectorsize=512

Output of iostat

root@thebank:~# iostat
Linux 4.10.17-1-pve (thebank)   09/11/2017      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.56    0.02    4.15    4.58    0.00   88.69

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdf             132.53      3700.79      1116.19 1021636972  308135390

Output of iostat -x

root@thebank:~# iostat -x
Linux 4.10.17-1-pve (thebank)   09/11/2017      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.56    0.02    4.15    4.58    0.00   88.69

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdf               3.31     0.79   96.67   35.89  3701.82  1116.50    72.70     0.62    4.65    1.98   11.85   2.40  31.75

Output of ps ax

root@thebank:~# ps ax | grep arc_prune

  462 ?        S      0:00 [arc_prune]
  463 ?        S      0:00 [arc_prune]
  464 ?        S      0:00 [arc_prune]
  465 ?        S      0:00 [arc_prune]
  466 ?        S      0:00 [arc_prune]
  467 ?        S      0:00 [arc_prune]
  468 ?        S      0:00 [arc_prune]
  469 ?        S      0:00 [arc_prune]
  470 ?        S      0:00 [arc_prune]
  471 ?        S      0:00 [arc_prune]
  472 ?        S      0:00 [arc_prune]
  473 ?        S      0:00 [arc_prune]
  474 ?        S      0:00 [arc_prune]
  475 ?        S      0:00 [arc_prune]
  476 ?        S      0:00 [arc_prune]
  477 ?        S      0:00 [arc_prune]
  478 ?        S      0:00 [arc_prune]
  479 ?        S      0:00 [arc_prune]
  480 ?        S      0:00 [arc_prune]
  481 ?        S      0:00 [arc_prune]
  482 ?        S      0:00 [arc_prune]
  483 ?        S      0:00 [arc_prune]
  484 ?        S      0:00 [arc_prune]
  485 ?        S      0:00 [arc_prune]
  486 ?        S      0:00 [arc_prune]
  487 ?        S      0:00 [arc_prune]
  488 ?        S      0:00 [arc_prune]
  489 ?        S      0:00 [arc_prune]
  490 ?        S      0:00 [arc_prune]
  491 ?        S      0:00 [arc_prune]
  492 ?        S      0:00 [arc_prune]
  493 ?        S      0:00 [arc_prune]

Related spl_system_task and arc_prune kernel threads (top-style listing):

  427 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  428 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  429 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  430 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  431 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  432 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  433 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  434 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  435 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  436 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  437 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  438 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  439 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  440 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  441 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  442 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  443 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  444 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  445 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  446 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  447 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  448 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_system_task
  449 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 spl_dynamic_tas
  462 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune
  463 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune
  464 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune
  465 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune
  466 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune
  467 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune
  468 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune
  469 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune
  470 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune
  471 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune
  472 root      20   0       0      0      0 S   0.0  0.0   0:00.00 arc_prune

Output of arcstat

root@thebank:~# arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
20:48:28    17     8     47     8   47     0    0     0    0   4.6G  361M

Output of arc_summary
root@thebank:~# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Mon Sep 11 20:48:51 2017
ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                37.56m
        Mutex Misses:                           1.06m
        Evict Skips:                            1.06m

ARC Size:                               12.25%  4.58    GiB
        Target Size: (Adaptive)         0.97%   369.22  MiB
        Min Size (Hard Limit):          0.08%   32.00   MiB
        Max Size (High Water):          1195:1  37.36   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       7.39%   346.14  MiB
        Frequently Used Cache Size:     92.61%  4.24    GiB

ARC Hash Breakdown:
        Elements Max:                           2.65m
        Elements Current:               6.14%   162.56k
        Collisions:                             2.47m
        Chain Max:                              4
        Chains:                                 680

ARC Total accesses:                                     6.39b
        Cache Hit Ratio:                98.15%  6.27b
        Cache Miss Ratio:               1.85%   118.43m
        Actual Hit Ratio:               97.22%  6.21b

        Data Demand Efficiency:         99.63%  6.16b
        Data Prefetch Efficiency:       76.86%  64.59m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.74%   46.74m
          Most Recently Used:           1.61%   101.23m
          Most Frequently Used:         97.44%  6.11b
          Most Recently Used Ghost:     0.05%   3.04m
          Most Frequently Used Ghost:   0.16%   9.74m

        CACHE HITS BY DATA TYPE:
          Demand Data:                  97.86%  6.14b
          Prefetch Data:                0.79%   49.64m
          Demand Metadata:              1.17%   73.61m
          Prefetch Metadata:            0.17%   10.78m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  19.03%  22.54m
          Prefetch Data:                12.62%  14.95m
          Demand Metadata:              64.63%  76.54m
          Prefetch Metadata:            3.72%   4.40m


File-Level Prefetch: (HEALTHY)
DMU Efficiency:                                 638.82m
        Hit Ratio:                      84.37%  539.00m
        Miss Ratio:                     15.63%  99.82m

        Colinear:                               99.82m
          Hit Ratio:                    0.07%   71.09k
          Miss Ratio:                   99.93%  99.75m

        Stride:                                 452.04m
          Hit Ratio:                    98.49%  445.20m
          Miss Ratio:                   1.51%   6.84m

DMU Misc:
        Reclaim:                                99.75m
          Successes:                    3.07%   3.06m
          Failures:                     96.93%  96.69m

        Streams:                                93.83m
          +Resets:                      0.03%   27.03k
          -Resets:                      99.97%  93.80m
          Bogus:                                0


ZFS Tunable:
        l2arc_headroom                                    2
        zfs_free_leak_on_eio                              0
        zfs_free_max_blocks                               100000
        zfs_read_chunk_size                               1048576
        metaslab_preload_enabled                          1
        zfs_dedup_prefetch                                0
        zfs_txg_history                                   0
        zfs_scrub_delay                                   4
        zfs_vdev_async_read_max_active                    3
        zfs_read_history                                  0
        zfs_arc_sys_free                                  0
        l2arc_write_max                                   8388608
        zfs_dbuf_state_index                              0
        metaslab_debug_unload                             0
        zvol_inhibit_dev                                  0
        zfetch_max_streams                                8
        zfs_recover                                       0
        metaslab_fragmentation_factor_enabled             1
        zfs_sync_pass_rewrite                             2
        zfs_object_mutex_size                             64
        zfs_arc_min_prefetch_lifespan                     0
        zfs_arc_meta_prune                                10000
        zfs_read_history_hits                             0
        l2arc_norw                                        0
        zfs_dirty_data_max_percent                        10
        zfs_arc_meta_min                                  0
        metaslabs_per_vdev                                200
        zfs_arc_meta_adjust_restarts                      4096
        zil_slog_limit                                    1048576
        spa_load_verify_maxinflight                       10000
        spa_load_verify_metadata                          1
        zfs_send_corrupt_data                             0
        zfs_delay_min_dirty_percent                       60
        zfs_vdev_sync_read_max_active                     10
        zfs_dbgmsg_enable                                 0
        zio_requeue_io_start_cut_in_line                  1
        l2arc_headroom_boost                              200
        zfs_zevent_cols                                   80
        spa_config_path                                   /etc/zfs/zpool.cache
        zfs_vdev_cache_size                               0
        zfs_vdev_sync_write_min_active                    10
        zfs_vdev_scrub_max_active                         2
        zfs_disable_dup_eviction                          0
        ignore_hole_birth                                 1
        zvol_major                                        230
        zil_replay_disable                                0
        zfs_dirty_data_max_max_percent                    25
        zfs_expire_snapshot                               300
        zfs_sync_pass_deferred_free                       2
        spa_asize_inflation                               24
        zfs_vdev_mirror_switch_us                         10000
        l2arc_feed_secs                                   1
        zfs_autoimport_disable                            1
        zfs_arc_p_aggressive_disable                      1
        zfs_zevent_len_max                                512
        l2arc_noprefetch                                  1
        zfs_arc_meta_limit                                0
        zfs_flags                                         0
        zfs_dirty_data_max_max                            20057063424
        zfs_arc_average_blocksize                         8192
        zfs_vdev_cache_bshift                             16
        zfs_vdev_async_read_min_active                    1
        zfs_arc_num_sublists_per_state                    16
        zfs_arc_grow_retry                                0
        l2arc_feed_again                                  1
        zfs_arc_lotsfree_percent                          10
        zfs_zevent_console                                0
        zvol_prefetch_bytes                               131072
        zfs_free_min_time_ms                              1000
        zio_taskq_batch_pct                               75
        zfetch_block_cap                                  256
        spa_load_verify_data                              1
        zfs_dirty_data_max                                8022825369
        zfs_vdev_async_write_max_active                   10
        zfs_dbgmsg_maxsize                                4194304
        zfs_nocacheflush                                  0
        zfetch_array_rd_sz                                1048576
        zfs_arc_meta_strategy                             1
        zfs_dirty_data_sync                               67108864
        zvol_max_discard_blocks                           16384
        zfs_vdev_async_write_active_max_dirty_percent     60
        zfs_arc_p_dampener_disable                        1
        zfs_txg_timeout                                   5
        metaslab_aliquot                                  524288
        zfs_mdcomp_disable                                0
        zfs_vdev_sync_read_min_active                     10
        metaslab_debug_load                               0
        zfs_vdev_aggregation_limit                        131072
        l2arc_nocompress                                  0
        metaslab_lba_weighting_enabled                    1
        zfs_vdev_scheduler                                noop
        zfs_vdev_scrub_min_active                         1
        zfs_no_scrub_io                                   0
        zfs_vdev_cache_max                                16384
        zfs_scan_idle                                     50
        zfs_arc_shrink_shift                              0
        spa_slop_shift                                    5
        zfs_deadman_synctime_ms                           1000000
        metaslab_bias_enabled                             1
        zfs_admin_snapshot                                0
        zfs_no_scrub_prefetch                             0
        zfs_metaslab_fragmentation_threshold              70
        zfs_max_recordsize                                1048576
        zfs_arc_min                                       0
        zfs_nopwrite_enabled                              1
        zfs_arc_p_min_shift                               0
        zfs_mg_fragmentation_threshold                    85
        l2arc_write_boost                                 8388608
        zfs_prefetch_disable                              0
        l2arc_feed_min_ms                                 200
        zio_delay_max                                     30000
        zfs_vdev_write_gap_limit                          4096
        zfs_pd_bytes_max                                  52428800
        zfs_scan_min_time_ms                              1000
        zfs_resilver_min_time_ms                          3000
        zfs_delay_scale                                   500000
        zfs_vdev_async_write_active_min_dirty_percent     30
        zfs_vdev_sync_write_max_active                    10
        zfs_mg_noalloc_threshold                          0
        zfs_deadman_enabled                               1
        zfs_resilver_delay                                2
        zfs_arc_max                                       0
        zfs_top_maxinflight                               32
        zfetch_min_sec_reap                               2
        zfs_immediate_write_sz                            32768
        zfs_vdev_async_write_min_active                   1
        zfs_sync_pass_dont_compress                       5
        zfs_vdev_read_gap_limit                           32768
        zfs_vdev_max_active                               1000

Suggestions with solutions I have found elsewhere

Why I am opening a ticket for this

I am affected by this issue in September 2017, and I am hoping to learn whether I can solve it with the currently installed versions of ZFS and SPL.

The rich discussion on a similar/identical ticket, #6223, provided a solution in which raising the ARC dnode limit seems to have resolved the problem.
It seems that the following solves it:

There's another trap for young and old there: per arc_tuning_update(), the only time zfs_arc_dnode_limit_percent is used is if you actually change zfs_arc_max. (And whilst we're here, if you change zfs_arc_meta_min then arc_dnode_limit is automatically set back to 10% of arc_meta_limit, regardless of zfs_arc_dnode_limit or zfs_arc_dnode_limit_percent.)
With my zfs_arc_dnode_limit_percent set to 75 the dnode_size/limit looks much healthier:
# grep dnode /proc/spl/kstat/zfs/arcstats
dnode_size                      4    38120264000
arc_dnode_limit                 4    67645734912
And I've yet to see another arc_prune storm. Yippee!

From <https://github.com/zfsonlinux/zfs/issues/6223>
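
For reference, on a system running the 0.7 series those parameters would presumably be set as ZFS module options, either persistently via modprobe.d or at runtime via sysfs (a sketch; the values are illustrative only, not recommendations):

# /etc/modprobe.d/zfs.conf  (zfs_arc_dnode_limit_percent only exists in ZFS 0.7+)
# Per the comment quoted above, the percent limit is only applied when zfs_arc_max
# is also changed, so both are set here; 34359738368 bytes = 32 GiB, an example value.
options zfs zfs_arc_max=34359738368 zfs_arc_dnode_limit_percent=75

# Or at runtime, without a reboot:
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max
echo 75 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent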

I don't have ZFS v0.7 or newer installed, so I can't tune the dnode limit.
Is there anything else I can do?

@kernelOfTruth (Contributor) commented Sep 12, 2017

Referencing: #4345 (arc_prune eats the whole CPU core [0.6.5.4]) and #4850 (Limit the amount of dnode metadata in the ARC, PR).

@chrisrd (Contributor) commented Sep 13, 2017

@thomasfrivold It's arc_prune's job to reduce the ARC memory usage in response to memory pressure. Your arc_summary shows:

ARC Size:                               12.25%  4.58    GiB
        Target Size: (Adaptive)         0.97%   369.22  MiB
        Min Size (Hard Limit):          0.08%   32.00   MiB
        Max Size (High Water):          1195:1  37.36   GiB

I.e. your ARC has been as high as 37G, its current size is 4.5G, and it's trying to get that down to 369M.

Check your non-ZFS memory usage: have a look at what might be causing the memory pressure that's pushing down the ARC size.
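
A few commands that could help locate that pressure (a sketch, not part of the original comment; kernel slab growth outside the ARC is a common culprit):

# Overall memory and kernel slab usage
free -g
grep -E 'MemFree|Cached|Slab|SReclaimable|SUnreclaim' /proc/meminfo

# Current ARC size versus its adaptive target and metadata usage
grep -E '^(size|c |c_max|arc_meta_used|arc_meta_limit)' /proc/spl/kstat/zfs/arcstats

# Largest slab caches, one-shot, sorted by cache size
slabtop -o -s c | head -n 20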

@chrisrd (Contributor) commented Sep 14, 2017

@thomasfrivold Given you're on Linux 4.10 and have a reasonable amount of memory, you may well be suffering from the problem described in this comment in #6635. The patch fixing that issue hasn't yet been committed, but if you're comfortable with git and compiling your own ZFS modules you might like to give it a try. (I'm running it in production, so it's safe enough...)
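
For anyone wanting to try that route, the usual out-of-tree build for the 0.6/0.7-era zfsonlinux repositories looks roughly like this (a sketch; the package list is an assumption, the patch itself is not shown, and on these versions SPL must be built before ZFS):

# Build prerequisites (names are an assumption; on Proxmox the matching kernel
# headers come from pve-headers rather than linux-headers)
apt-get install build-essential autoconf automake libtool gawk \
    uuid-dev libblkid-dev zlib1g-dev pve-headers-"$(uname -r)"

# SPL is a separate repository on 0.6.x/0.7.x and must be built first
git clone https://github.com/zfsonlinux/spl.git
cd spl && sh autogen.sh && ./configure && make -j"$(nproc)" && make install && cd ..

git clone https://github.com/zfsonlinux/zfs.git
cd zfs
# apply the patch referenced in #6635 here (e.g. with git cherry-pick or git am)
sh autogen.sh && ./configure --with-spl=../spl
make -j"$(nproc)" && make install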

@behlendorf (Contributor) commented:

Closing. This issue should be resolved in the 0.7 series.
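
(Not part of the original comment: a quick sketch of how to check which module versions are actually loaded, e.g. after upgrading to the 0.7 series.)

cat /sys/module/zfs/version
cat /sys/module/spl/version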
