ARC grows well past the zfs_arc_max #676
Comments
How did you manage to create pool version 26 with zfs-fuse?
Perhaps it is a typo and he meant pool version 23.
I was going to suggest that you try the patch in issue #618, but you beat me to it. I'm a bit surprised that it doesn't resolve the issue; clearly there's more to be done.
I haven't had the chance to apply #618 yet. I'll try it tomorrow, time permitting.
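For anyone following along, here is a minimal sketch of applying an out-of-tree patch and rebuilding the modules. The patch filename and source paths are placeholders rather than the actual #618 change, and the exact module list can differ by version.

```sh
# Hypothetical paths and patch name -- substitute your own checkouts and the real #618 patch.
cd ~/src/zfs
patch -p1 < ~/patches/issue-618.patch

# Rebuild spl first, then zfs against it.
cd ~/src/spl && ./autogen.sh && ./configure && make && sudo make install
cd ~/src/zfs && ./autogen.sh && ./configure --with-spl=/usr/src/spl && make && sudo make install

# Reload the modules so the rebuilt code is actually running (export pools first).
sudo rmmod zfs zcommon zunicode zavl znvpair spl
sudo modprobe zfs
```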
No love. With #618 applied the same scenario still deadlocks after ignoring the arc size limit. These outputs are all taken after ZFS jammed up.

    # free
                 total       used       free     shared    buffers     cached
    Mem:       8182236    7540728     641508          0      28528      43364
    -/+ buffers/cache:    7468836     713400
    Swap:      4194300      14492    4179808

    # history | grep modprobe
    modprobe zfs zfs_vdev_max_pending=20 zfs_arc_max=2147483648

    # cat /proc/spl/kstat/zfs/arcstats
    10 1 0x01 77 3696 141145250104682 144360035578750
    name type data
    hits 4 5947062
    misses 4 939766
    demand_data_hits 4 215008
    demand_data_misses 4 13180
    demand_metadata_hits 4 4365140
    demand_metadata_misses 4 469195
    prefetch_data_hits 4 20650
    prefetch_data_misses 4 53521
    prefetch_metadata_hits 4 1346264
    prefetch_metadata_misses 4 403870
    mru_hits 4 895846
    mru_ghost_hits 4 277838
    mfu_hits 4 3684335
    mfu_ghost_hits 4 189079
    deleted 4 248071
    recycle_miss 4 488895
    mutex_miss 4 8046
    evict_skip 4 362916066
    evict_l2_cached 4 0
    evict_l2_eligible 4 14598742528
    evict_l2_ineligible 4 5304508416
    hash_elements 4 253442
    hash_elements_max 4 253519
    hash_collisions 4 240599
    hash_chains 4 66173
    hash_chain_max 4 7
    p 4 134225408
    c 4 2147483648
    c_min 4 293601280
    c_max 4 2147483648
    size 4 3320870416
    hdr_size 4 85214688
    data_size 4 2363930112
    other_size 4 871725616
    anon_size 4 27773952
    anon_evict_data 4 0
    anon_evict_metadata 4 0
    mru_size 4 161863680
    mru_evict_data 4 0
    mru_evict_metadata 4 3620864
    mru_ghost_size 4 1977015808
    mru_ghost_evict_data 4 908544000
    mru_ghost_evict_metadata 4 1068471808
    mfu_size 4 2174292480
    mfu_evict_data 4 0
    mfu_evict_metadata 4 1738240
    mfu_ghost_size 4 166193152
    mfu_ghost_evict_data 4 0
    mfu_ghost_evict_metadata 4 166193152
    l2_hits 4 0
    l2_misses 4 0
    l2_feeds 4 0
    l2_rw_clash 4 0
    l2_read_bytes 4 0
    l2_write_bytes 4 0
    l2_writes_sent 4 0
    l2_writes_done 4 0
    l2_writes_error 4 0
    l2_writes_hdr_miss 4 0
    l2_evict_lock_retry 4 0
    l2_evict_reading 4 0
    l2_free_on_write 4 0
    l2_abort_lowmem 4 0
    l2_cksum_bad 4 0
    l2_io_error 4 0
    l2_size 4 0
    l2_hdr_size 4 0
    memory_throttle_count 4 0
    memory_direct_count 4 0
    memory_indirect_count 4 0
    arc_no_grow 4 0
    arc_tempreserve 4 475136
    arc_loaned_bytes 4 262144
    arc_prune 4 2313
    arc_meta_used 4 3309051920
    arc_meta_limit 4 536870912
    arc_meta_max 4 3309157328

Stack traces, some are additional kernel threads.

    z_wr_int/15 D ffff880113a49408 0 18763 2 0x00000000
     ffff880113a49160 0000000000000046 ffffffff811ddf2f ffff880100000000
     ffff88022e8e67d0 000000000000ffc0 ffff8801a8d05fd8 0000000000004000
     ffff8801a8d05fd8 000000000000ffc0 ffff880113a49160 ffff8801a8d04010
    Call Trace:
     [] ? scsi_request_fn+0x314/0x3ef
     [] ? __make_request+0x22a/0x243
     [] ? generic_make_request+0x201/0x262
     [] ? __mutex_lock_slowpath+0xe2/0x128
     [] ? mutex_lock+0x12/0x25
     [] ? vdev_cache_write+0x57/0x11d [zfs]
     [] ? zio_vdev_io_done+0x6f/0x141 [zfs]
     [] ? zio_execute+0xad/0xd1 [zfs]
     [] ? taskq_thread+0x2c2/0x508 [spl]
     [] ? try_to_wake_up+0x1d9/0x1eb
     [] ? try_to_wake_up+0x1eb/0x1eb
     [] ? spl_taskq_init+0x159/0x159 [spl]
     [] ? spl_taskq_init+0x159/0x159 [spl]
     [] ? kthread+0x7a/0x82
     [] ? kernel_thread_helper+0x4/0x10
     [] ? kthread_worker_fn+0x139/0x139
     [] ? gs_change+0xb/0xb

    txg_sync D ffff88018865b898 0 18781 2 0x00000000
     ffff88018865b5f0 0000000000000046 ffff88022fc10028 ffff88011160c248
     ffff88011160c200 000000000000ffc0 ffff88015dd89fd8 0000000000004000
     ffff88015dd89fd8 000000000000ffc0 ffff88018865b5f0 ffff88015dd88010
    Call Trace:
     [] ? check_preempt_curr+0x25/0x62
     [] ? ttwu_do_wakeup+0x11/0x83
     [] ? try_to_wake_up+0x1d9/0x1eb
     [] ? __wake_up_common+0x41/0x78
     [] ? cv_wait_common+0xb8/0x141 [spl]
     [] ? wake_up_bit+0x23/0x23
     [] ? zio_wait+0xe8/0x11c [zfs]
     [] ? dsl_pool_sync+0xbf/0x428 [zfs]
     [] ? spa_sync+0x47d/0x829 [zfs]
     [] ? txg_sync_thread+0x29a/0x3f6 [zfs]
     [] ? set_user_nice+0x115/0x139
     [] ? txg_thread_exit+0x2b/0x2b [zfs]
     [] ? __thread_create+0x2df/0x2df [spl]
     [] ? thread_generic_wrapper+0x6a/0x75 [spl]
     [] ? kthread+0x7a/0x82
     [] ? kernel_thread_helper+0x4/0x10
     [] ? kthread_worker_fn+0x139/0x139
     [] ? gs_change+0xb/0xb

    rsync D ffff88020b19f008 0 19506 19505 0x00000000
     ffff88020b19ed60 0000000000000082 0000000000000001 0000000000000007
     ffff880151257930 000000000000ffc0 ffff88015f65dfd8 0000000000004000
     ffff88015f65dfd8 000000000000ffc0 ffff88020b19ed60 ffff88015f65c010
    Call Trace:
     [] ? extract_buf+0x76/0xc2
     [] ? invalidate_interrupt1+0xe/0x20
     [] ? __mutex_lock_slowpath+0xe2/0x128
     [] ? mutex_lock+0x12/0x25
     [] ? vdev_cache_read+0x92/0x2de [zfs]
     [] ? zio_vdev_io_start+0x1c1/0x228 [zfs]
     [] ? zio_nowait+0xd0/0xf4 [zfs]
     [] ? vdev_mirror_io_start+0x2fa/0x313 [zfs]
     [] ? vdev_config_generate+0xac7/0xac7 [zfs]
     [] ? zio_nowait+0xd0/0xf4 [zfs]
     [] ? arc_read_nolock+0x662/0x673 [zfs]
     [] ? arc_read+0xc2/0x146 [zfs]
     [] ? dnode_block_freed+0xfe/0x119 [zfs]
     [] ? dbuf_fill_done+0x61/0x61 [zfs]
     [] ? dbuf_read+0x3ce/0x5c0 [zfs]
     [] ? dnode_hold_impl+0x1a8/0x43c [zfs]
     [] ? remove_reference+0x93/0x9f [zfs]
     [] ? dmu_bonus_hold+0x22/0x26e [zfs]
     [] ? zfs_zget+0x5c/0x19f [zfs]
     [] ? zfs_dirent_lock+0x447/0x48f [zfs]
     [] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
     [] ? zfs_dirlook+0x20a/0x276 [zfs]
     [] ? zfs_lookup+0x26e/0x2b6 [zfs]
     [] ? zpl_lookup+0x47/0x80 [zfs]
     [] ? d_alloc_and_lookup+0x43/0x60
     [] ? do_lookup+0x1c9/0x2bb
     [] ? path_lookupat+0xe2/0x5af
     [] ? do_path_lookup+0x1d/0x5f
     [] ? user_path_at_empty+0x49/0x84
     [] ? tsd_exit+0x83/0x18d [spl]
     [] ? cp_new_stat+0xdf/0xf1
     [] ? vfs_fstatat+0x43/0x70
     [] ? sys_newlstat+0x11/0x2d
     [] ? system_call_fastpath+0x16/0x1b

    rsync D ffff8801ba0c6cb8 0 19511 19510 0x00000000
     ffff8801ba0c6a10 0000000000000086 ffffffff8115539c 0000000000000001
     ffff8801512572a0 000000000000ffc0 ffff8801b1051fd8 0000000000004000
     ffff8801b1051fd8 000000000000ffc0 ffff8801ba0c6a10 ffff8801b1050010
    Call Trace:
     [] ? cpumask_any_but+0x28/0x37
     [] ? __schedule+0x727/0x7b0
     [] ? rcu_implicit_dynticks_qs+0x3f/0x60
     [] ? force_quiescent_state+0x1c3/0x230
     [] ? __mutex_lock_slowpath+0xe2/0x128
     [] ? arc_buf_remove_ref+0xe6/0xf4 [zfs]
     [] ? mutex_lock+0x12/0x25
     [] ? zfs_zinactive+0x5a/0xd4 [zfs]
     [] ? zfs_inactive+0x106/0x19e [zfs]
     [] ? evict+0x78/0x117
     [] ? dispose_list+0x2c/0x36
     [] ? shrink_icache_memory+0x278/0x2a8
     [] ? shrink_slab+0xe3/0x153
     [] ? do_try_to_free_pages+0x253/0x3f0
     [] ? get_page_from_freelist+0x47a/0x4ad
     [] ? try_to_free_pages+0x79/0x7e
     [] ? __alloc_pages_nodemask+0x48b/0x6c8
     [] ? __get_free_pages+0x12/0x52
     [] ? spl_kmem_cache_alloc+0x236/0x975 [spl]
     [] ? dbuf_create+0x38/0x32e [zfs]
     [] ? dnode_hold_impl+0x3f0/0x43c [zfs]
     [] ? dbuf_create_bonus+0x16/0x1f [zfs]
     [] ? dmu_bonus_hold+0x10b/0x26e [zfs]
     [] ? zfs_zget+0x5c/0x19f [zfs]
     [] ? zfs_dirent_lock+0x447/0x48f [zfs]
     [] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
     [] ? zfs_dirlook+0x20a/0x276 [zfs]
     [] ? zfs_lookup+0x26e/0x2b6 [zfs]
     [] ? zpl_lookup+0x47/0x80 [zfs]
     [] ? d_alloc_and_lookup+0x43/0x60
     [] ? do_lookup+0x1c9/0x2bb
     [] ? path_lookupat+0xe2/0x5af
     [] ? strncpy_from_user+0x9/0x4e
     [] ? do_path_lookup+0x1d/0x5f
     [] ? user_path_at_empty+0x49/0x84
     [] ? tsd_exit+0x83/0x18d [spl]
     [] ? cp_new_stat+0xdf/0xf1
     [] ? vfs_fstatat+0x43/0x70
     [] ? sys_newlstat+0x11/0x2d
     [] ? system_call_fastpath+0x16/0x1b

    rsync D ffff88022365e728 0 19522 19521 0x00000000
     ffff88022365e480 0000000000000082 ffffffff8115539c ffff8801b219b438
     ffff8802219bcea0 000000000000ffc0 ffff8801b219bfd8 0000000000004000
     ffff8801b219bfd8 000000000000ffc0 ffff88022365e480 ffff8801b219a010
    Call Trace:
     [] ? cpumask_any_but+0x28/0x37
     [] ? __pagevec_release+0x19/0x22
     [] ? move_active_pages_to_lru+0x130/0x154
     [] ? select_task_rq_fair+0x35e/0x791
     [] ? common_interrupt+0xe/0x13
     [] ? sched_clock_local+0x13/0x76
     [] ? __mutex_lock_slowpath+0xe2/0x128
     [] ? check_preempt_curr+0x25/0x62
     [] ? mutex_lock+0x12/0x25
     [] ? zfs_zinactive+0x5a/0xd4 [zfs]
     [] ? zfs_inactive+0x106/0x19e [zfs]
     [] ? evict+0x78/0x117
     [] ? dispose_list+0x2c/0x36
     [] ? shrink_icache_memory+0x278/0x2a8
     [] ? shrink_slab+0xe3/0x153
     [] ? do_try_to_free_pages+0x253/0x3f0
     [] ? get_page_from_freelist+0x47a/0x4ad
     [] ? try_to_free_pages+0x79/0x7e
     [] ? __alloc_pages_nodemask+0x48b/0x6c8
     [] ? __get_free_pages+0x12/0x52
     [] ? spl_kmem_cache_alloc+0x236/0x975 [spl]
     [] ? dbuf_create+0x38/0x32e [zfs]
     [] ? dnode_hold_impl+0x3f0/0x43c [zfs]
     [] ? dbuf_create_bonus+0x16/0x1f [zfs]
     [] ? dmu_bonus_hold+0x10b/0x26e [zfs]
     [] ? zfs_zget+0x5c/0x19f [zfs]
     [] ? zfs_dirent_lock+0x447/0x48f [zfs]
     [] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
     [] ? zfs_dirlook+0x20a/0x276 [zfs]
     [] ? zfs_lookup+0x26e/0x2b6 [zfs]
     [] ? zpl_lookup+0x47/0x80 [zfs]
     [] ? d_alloc_and_lookup+0x43/0x60
     [] ? do_lookup+0x1c9/0x2bb
     [] ? path_lookupat+0xe2/0x5af
     [] ? do_path_lookup+0x1d/0x5f
     [] ? user_path_at_empty+0x49/0x84
     [] ? tsd_exit+0x83/0x18d [spl]
     [] ? cp_new_stat+0xdf/0xf1
     [] ? vfs_fstatat+0x43/0x70
     [] ? sys_newlstat+0x11/0x2d
     [] ? system_call_fastpath+0x16/0x1b

    rsync D ffff8802216c93c8 0 19527 19526 0x00000000
     ffff8802216c9120 0000000000000082 000000000000000a ffff880000000000
     ffff88022e8a6750 000000000000ffc0 ffff88021c0f1fd8 0000000000004000
     ffff88021c0f1fd8 000000000000ffc0 ffff8802216c9120 ffff88021c0f0010
    Call Trace:
     [] ? zap_get_leaf_byblk+0x1b5/0x249 [zfs]
     [] ? zap_leaf_array_match+0x166/0x197 [zfs]
     [] ? remove_reference+0x93/0x9f [zfs]
     [] ? arc_buf_remove_ref+0xe6/0xf4 [zfs]
     [] ? dbuf_rele_and_unlock+0x12b/0x19a [zfs]
     [] ? __mutex_lock_slowpath+0xe2/0x128
     [] ? mutex_lock+0x12/0x25
     [] ? zfs_zget+0x46/0x19f [zfs]
     [] ? zfs_dirent_lock+0x447/0x48f [zfs]
     [] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
     [] ? zfs_dirlook+0x20a/0x276 [zfs]
     [] ? zfs_lookup+0x26e/0x2b6 [zfs]
     [] ? zpl_lookup+0x47/0x80 [zfs]
     [] ? d_alloc_and_lookup+0x43/0x60
     [] ? do_lookup+0x1c9/0x2bb
     [] ? path_lookupat+0xe2/0x5af
     [] ? do_path_lookup+0x1d/0x5f
     [] ? user_path_at_empty+0x49/0x84
     [] ? tsd_exit+0x83/0x18d [spl]
     [] ? cp_new_stat+0xdf/0xf1
     [] ? vfs_fstatat+0x43/0x70
     [] ? sys_newlstat+0x11/0x2d
     [] ? system_call_fastpath+0x16/0x1b

zfs-fuse has pool v26 available in the Git repo. [Edit: full arcstats included]
We're working on some spl kmem improvements which may help with this. Expect patches in the next week or so. Two cases have been identified which can result in memory reclaim from the slab being less than optimal.
I'm also experiencing this exact issue when attempting an rsync. Has there been any progress? Is there anything I should try? I'm running 0.6.0-rc8 on Debian wheezy/sid using the deb instructions on the website.
Several patches which should help with this have been merged into master post-rc8. Please try the latest master source. Additionally, you can try increasing /proc/sys/vm/min_free_kbytes to, say, 256M or so.
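For reference, a minimal sketch of that min_free_kbytes tuning. The value is in kilobytes, so 256M is 262144; the figure is simply the suggestion above, not a verified recommendation.

```sh
# Check the current reserve, then raise it to roughly 256 MB.
cat /proc/sys/vm/min_free_kbytes
echo 262144 | sudo tee /proc/sys/vm/min_free_kbytes

# Equivalent via sysctl; add "vm.min_free_kbytes = 262144" to /etc/sysctl.conf to persist it.
sudo sysctl -w vm.min_free_kbytes=262144
```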
I was actually just looking at #154 and it seemed to be related, and was just looking for confirmation that I should indeed try the latest master. So, thanks, will do, and will report back.
@jspiros Is there anything useful being output to dmesg? You might want to run
@ryao Indeed, there were many hung tasks, even with the default of 120 seconds. The first three were kswapd0, arc_reclaim, and rsync, though many others hung after that. I'm currently running the latest master, and I set min_free_kbytes to 256MB as recommended. If you tell me what information you want, should the same problem occur again tonight (I have some rsync backups that run nightly), I'll try to get it here. I was able to run some commands as root, including dmesg, so just tell me what would be helpful and I'll be sure to get as much information as I can.
The hung task information in the dmesg output would be useful.
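A rough sketch of how that information can be captured when the hang recurs, using standard kernel facilities (nothing ZFS-specific; sysrq must be enabled for the last step):

```sh
# Pull the hung-task reports out of the kernel log.
dmesg | grep -B 2 -A 30 'blocked for more than'

# The 120-second interval referred to above is the hung-task detector timeout.
cat /proc/sys/kernel/hung_task_timeout_secs

# Dump the stacks of all blocked (D state) tasks into dmesg on demand.
echo 1 | sudo tee /proc/sys/kernel/sysrq
echo w | sudo tee /proc/sysrq-trigger
dmesg | tail -n 200
```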
Fresh git pulls from this morning on both spl and zfs. Versions are 2371321 and ee191e8. Arc size 2^31 bytes. It's the same setup from my first post.

    # cat /proc/spl/kstat/zfs/arcstats
    10 1 0x01 77 3696 216989312967 3547642914356
    name type data
    hits 4 4820390
    misses 4 1632473
    demand_data_hits 4 2
    demand_data_misses 4 58777
    demand_metadata_hits 4 4437436
    demand_metadata_misses 4 875202
    prefetch_data_hits 4 0
    prefetch_data_misses 4 0
    prefetch_metadata_hits 4 382952
    prefetch_metadata_misses 4 698494
    mru_hits 4 676077
    mru_ghost_hits 4 483254
    mfu_hits 4 3761394
    mfu_ghost_hits 4 483721
    deleted 4 418327
    recycle_miss 4 643047
    mutex_miss 4 2093
    evict_skip 4 96315269
    evict_l2_cached 4 0
    evict_l2_eligible 4 15802661888
    evict_l2_ineligible 4 7292848128
    hash_elements 4 267779
    hash_elements_max 4 299201
    hash_collisions 4 328936
    hash_chains 4 71327
    hash_chain_max 4 8
    p 4 135680512
    c 4 2147483648
    c_min 4 293601280
    c_max 4 2147483648
    size 4 2306523480
    hdr_size 4 90425088
    data_size 4 1618198528
    other_size 4 597899864
    anon_size 4 3948544
    anon_evict_data 4 0
    anon_evict_metadata 4 0
    mru_size 4 199776256
    mru_evict_data 4 0
    mru_evict_metadata 4 0
    mru_ghost_size 4 1946481152
    mru_ghost_evict_data 4 16462336
    mru_ghost_evict_metadata 4 1930018816
    mfu_size 4 1414473728
    mfu_evict_data 4 0
    mfu_evict_metadata 4 0
    mfu_ghost_size 4 200998912
    mfu_ghost_evict_data 4 0
    mfu_ghost_evict_metadata 4 200998912
    l2_hits 4 0
    l2_misses 4 0
    l2_feeds 4 0
    l2_rw_clash 4 0
    l2_read_bytes 4 0
    l2_write_bytes 4 0
    l2_writes_sent 4 0
    l2_writes_done 4 0
    l2_writes_error 4 0
    l2_writes_hdr_miss 4 0
    l2_evict_lock_retry 4 0
    l2_evict_reading 4 0
    l2_free_on_write 4 0
    l2_abort_lowmem 4 0
    l2_cksum_bad 4 0
    l2_io_error 4 0
    l2_size 4 0
    l2_hdr_size 4 0
    memory_throttle_count 4 0
    memory_direct_count 4 0
    memory_indirect_count 4 0
    arc_no_grow 4 0
    arc_tempreserve 4 0
    arc_loaned_bytes 4 0
    arc_prune 4 1992
    arc_meta_used 4 2306523480
    arc_meta_limit 4 536870912
    arc_meta_max 4 2657959776

I caught it before the OS threatened to lock up and SIGSTOP'd all my rsync processes. While it was paused the ARC 'size' was slowly dropping, and I resumed once it got below 1.6G. Right now the size is 2,718,915,552 and still climbing unless I stop it again.
@behlendorf Your suggestions helped me make it through the nightly rsync, and everything seems to be working fine. @ryao Due to this, I'm afraid I do not currently have any dmesg output for you (yet). If something goes wrong again, I'll provide it, of course.
@jspiros That's good news. The hope is that going forward some of Richard's VM work will remove the need for this sort of tuning, but we're not quite there yet.
I (partially?) retract my previous post. I left the rsync running so it would presumably ruin the system, but it didn't die. That was some hours ago and the ARC is now restraining itself at about 570-580 megabytes. I'm turning up the heat a bit to see if I can coerce a useful stack dump out of it or find out if it properly behaves now. Edit: after turning up the pressure the ARC did eventually grow to the breaking point. It seems to fare better but still cracks under intense pressure.
Okay, well, this morning was doing alright until just now. It probably doesn't help that I decided to scrub my pool, which will take about three days (18TB in [usable] capacity), overlapping with the nightly rsync. Here's what dmesg is looking like right now.
I will try to avoid rebooting the system, in case anyone has any suggestions that might resolve this problem online. Update: Never mind; I realized that with kswapd0 in deadlock it's highly unlikely that there's anything I could do to fix this problem without a reboot, so I rebooted. On an unrelated note, I'm happy to see that the scrub is continuing from where it left off.
While using ryao's "gentoo" branch of his own fork here on GitHub I killed the same machine again. Here are more data dumps. Edit: must have been too much for GitHub. Here's a pastebin of what I wanted: http://pastebin.com/303uhMen
After talking with ryao in IRC I've been asked to add a few points to this issue.

The vdev is a single device, /dev/sda, which is actually 13 drives in a hardware RAID-6 array. It can churn out pretty high IOPS for random reads, which is the significant portion of the system's job: lots of rsync (as a sync target) instances all scanning what is tens of millions of files. It's also compressed with a ratio of about 1.7x, so it grows pretty fast. (No dedup.)

The 'size' field from the kstats for the ARC, during heavy IO, just grows past the c and c_max fields, but it also seems to have some kind of background job that is draining the ARC at all times to around 512 megabytes. When completely idle, the

As a hack I can keep the system stable by SIGSTOP'ing all the rsync processes if the

In my most recent reports I've been setting sysctl vm.min_free_kbytes to 524288.
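A rough sketch of that SIGSTOP hack as I might script it, shown below. The 1.6 GB threshold and the assumption that pausing rsync alone relieves the pressure come from the reports in this thread, not from verified behavior.

```sh
#!/bin/sh
# Pause rsync whenever the ARC 'size' kstat climbs past a threshold, resume once it drops.
LIMIT=$((1600 * 1024 * 1024))   # ~1.6 GB, below the 2 GB zfs_arc_max used here

while sleep 10; do
    size=$(awk '$1 == "size" { print $3 }' /proc/spl/kstat/zfs/arcstats)
    if [ "$size" -gt "$LIMIT" ]; then
        pkill -STOP -x rsync    # freeze all rsync processes
    else
        pkill -CONT -x rsync    # let them continue once the ARC has drained
    fi
done
```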
After switching to vm.min_free_kbytes = 512M (from 256M) I've been able to survive the nightly rsyncs; however, now I'm running into exactly the same problem with CrashPlan (a Java application for remote backups) causing it instead of rsync. Exactly the same problem in that kswapd0 deadlocks, followed by arc_reclaim and then everything else. I needed to get the system back online and neglected to copy the dmesg output, but I confirmed that it is essentially identical to the previous output from the rsync-caused problem. I am hesitant to turn CrashPlan back on until there's something new to try.
Still trying to collect useful information. I set the arc_max to 256 megabytes just for experimentation. It still deadlocked. Stack traces and arcstats from kstat in pastebin: Two sets of stack traces are provided, taken about 2.5 hours apart. The first is from long after the system had hung; I don't know when that actually happened.

    vm.min_free_kbytes = 67584 (bootup default)

    # free -m
                 total       used       free     shared    buffers     cached
    Mem:          7990       7835        155          0         52        111
    -/+ buffers/cache:       7671        319
    Swap:         4095          1       4094

    # history | grep zfs.ko
    insmod zfs.ko zfs_arc_max=268435456 zfs_arc_min=134217728 zfs_arc_shrink_shift=4 zfs_vdev_max_pending=20 zfs_vdev_min_pending=8
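For what it's worth, the same module options can be made persistent instead of being passed to insmod/modprobe by hand; a sketch using the values above (whether a given option exists depends on the spl/zfs version in use):

```sh
# /etc/modprobe.d/zfs.conf -- applied the next time the zfs module is loaded.
cat <<'EOF' | sudo tee /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=268435456 zfs_arc_min=134217728 zfs_arc_shrink_shift=4 zfs_vdev_max_pending=20 zfs_vdev_min_pending=8
EOF

# Confirm what the running module actually picked up.
grep -H . /sys/module/zfs/parameters/zfs_arc_max /sys/module/zfs/parameters/zfs_arc_min
```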
I turned on debugging and turned off a bunch of compiler optimizations and locked it up again the same way. These stack traces look a lot nicer to decode.
After speaking with @DeHackEd on IRC about the cause of this problem, I tried limiting the queue depth (to 16 from 32) on the SATA drives that make up my pool using hdparm. I experienced success with this, in that I survived the night with both CrashPlan and the nightly rsyncs running. This isn't a long-term fix, as it does slow everything down, but if someone else is in a situation like mine, it might help until this issue is solved.
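A sketch of that queue-depth workaround for anyone who wants to try it; /dev/sdX is a placeholder, and whether hdparm or the sysfs knob is the right tool depends on the controller:

```sh
# Check the current NCQ depth for one member drive, then cap it at 16.
cat /sys/block/sdX/device/queue_depth
echo 16 | sudo tee /sys/block/sdX/device/queue_depth

# hdparm can do the same on drives that honour it.
sudo hdparm -Q 16 /dev/sdX
```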
The crashing/deadlocking side of the issue is apparently fixed by https://bugs.gentoo.org/show_bug.cgi?id=416685 (which I applied and cleaned up for kernel 3.0.x). Now it can run for over an hour with nothing controlling its execution speed. I'll update if anything goes wrong on me. While running in this mode the ARC appears to behave itself better with regard to following the
Note: I am currently using ryao's 'gentoo' tree.
The memory management and swap improvements in issue #883 have now been merged into master and should remove the need for the kernel vmalloc() patch. If you could verify these resolve your issue, I think we'll be able to close this issue.
I can try installing it, probably on Tuesday, but it may be a few days before the result is yay/nay.
No great rush, I'm not going anywhere. But it would be nice to know. :)
I am experiencing similar problems.

    load average: 49.08, 48.50, 43.52

    # cat /proc/spl/kstat/zfs/arcstats | grep -e arc_meta_used -e "^size " -e c_max -e limit
    # free
    # top
      PID USER  PR  NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
    24718 root   0 -20    0   0   0 R  100  0.0 21:24.07 arc_adapt 0x7a [zfs]

I tried to reproduce this bug in a virtual environment, but everything worked correctly. I think the reason is the large amount of metadata. I have several backup servers with pool sizes from 1-4TB. On each of them there are ~1000 datasets with snapshots, and each of them has this bug. To reproduce it, it is enough to run a parallel find across all datasets. I have to periodically run echo 3 > /proc/sys/vm/drop_caches to free memory, but it does not always help.
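A sketch of that reproducer and the workaround as I understand the description; the parallelism level is arbitrary and the dataset selection is just "everything mounted":

```sh
# Walk every mounted dataset in parallel to generate heavy metadata traffic.
zfs list -H -o mountpoint -t filesystem | grep '^/' | \
    xargs -n 1 -P 32 -I {} find {} -type f > /dev/null

# Periodically drop the page cache and slab caches to relieve memory pressure.
echo 3 | sudo tee /proc/sys/vm/drop_caches
```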
@behlendorf I had a discussion with @DeHackEd about this in IRC. It seems that he can only reproduce this issue when L2ARC is present. He also reports that my code from a few months ago that depended upon the kernel VM patch is more stable. A difference that comes to mind is the following change to the SPL:
I have asked him to test it.
Looking at the original arcstats that @DeHackEd posted, it looks like arc_meta_limit is not honored when L2ARC is present. I do not have time to look through the code right now, but I think that this has to do with the L2ARC-only buffers used to keep track of the L2ARC map. Here are some references:
http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/
http://www.zfsbuild.com/2011/11/18/dedupe-be-careful/
It feels to me like memory management of the L2ARC map is not being done correctly under sustained load conditions.
@ryao @DeHackEd Ahh! That makes perfect sense. Now I'm kicking myself for not exposing the l2c_only arc_state_t in arcstats when I originally did it for the other arc states. The above patch corrects that issue. Can you please give it a try and verify that the new

Now exactly what to do about it is a different question. At the moment there is no mechanism for memory pressure on the system to cause these headers to be evicted. Doing so would effectively abandon the data buffers in the L2ARC. This is safe to do but not exactly optimal for performance. There is probably a case to be made that this is preferable as a last resort when the system needs memory.
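In the meantime, the existing arcstats fields are enough to watch how far the ARC and its metadata run past their limits; a minimal monitoring sketch that only uses fields already present in the dumps above (the exact names of the new l2c_only counters are not assumed here):

```sh
# Print the headline ARC numbers every 10 seconds.
while sleep 10; do
    awk '$1 ~ /^(size|c|c_max|arc_meta_used|arc_meta_limit|l2_size|l2_hdr_size)$/ { printf "%-18s %15d\n", $1, $3 }' /proc/spl/kstat/zfs/arcstats
    echo ---
done
```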
In my case the L2 cache is not used.
@arp096 You have a different but equally interesting case. In your case the limit was exceeded because very little of your metadata can be safely evicted; it appears that most of it is of type "other". This likely means it's holding dnodes and bonus buffers for lots of active files. I would expect that when you back off the workload it quickly brings the usage under control?
In most of the arcstats being shown here, L2ARC does not appear to be used. It is only the most recent one from @DeHackEd that shows use. This issue could be describing two separate issues that happen to have the same symptoms.
@behlendorf Yes, that's right. And the bug is not present if the number of parallel rsync processes is less than ten.
That does describe my situation somewhat as well. When the number of rsync processes is relatively low (best guess of the maximum safe range is 10-15) the system is stable and the ARC size is honoured, L2ARC or not. When it's much higher (around 20 maybe?) I do get the memory usage problem. What I can't be sure about is how L2ARC factors into things. After @ryao asked me to try stressing it with the L2ARC removed I wasn't able to. Just now I've tried putting medium pressure on (10 rsyncs + send/recv) with the L2ARC and it starts tipping over sometimes (4300 MB of ARC used out of 4096 allowed, just floating above the line but clearly more so than should be tolerated). Remove the L2ARC yet raise the pressure (15 rsyncs + send/recv) and it behaves much better. I always assumed the cache was responsible for changing the load average from 15 processes in the D state most of the time (waiting on the disk IO) to 15 processes in the D state a lot less of the time (not waiting so much), but now I'm completely confused.
It is strange that the metadata cache continues to grow even when caching is turned off:
zfs set primarycache=none backups
And after some minutes
It doesn't help that "metadata" is rather vaguely defined in this context.
I had the same issue when I ran several rsyncs in a 1TB pool under a KVM-qemu system (VM). I could see on the host that my virtual machine was eating all the CPUs assigned to it. Besides, I'm using dedup in this pool; I've been reading about it and I understand it uses up to 25% of this memory. I calculated the amount of memory necessary to keep the dedup table within the ARC. My question is: does the DDT stay in this memory?
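On sizing the DDT, here is a rough way to check what it actually needs; 'tank' is a placeholder pool name, and the ~320 bytes per entry figure is the commonly quoted approximation rather than an exact number:

```sh
# Dedup table histogram and totals (entries, on-disk and in-core sizes).
sudo zdb -DD tank

# Back-of-the-envelope estimate: unique blocks * ~320 bytes of core memory per DDT entry.
sudo zdb -D tank | grep -i entries
```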
Here's my most recent run of the backup cycle, which wasn't even at full speed ("only" 15 copies of rsync) and it crashed in record time. This time I was graphing a whole bunch of the arcstats from the spl data. System information:
ZFS: 4c837f0 (Could be more up-to-date at this point)

      pool: poolname1
     state: ONLINE
    status: The pool is formatted using an older on-disk format. The pool can
            still be used, but some features are unavailable.
    action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
            pool will no longer be accessible on older software versions.
      scan: scrub repaired 0 in 59h56m with 0 errors on Fri Sep 28 23:48:31 2012
    config:

            NAME                                            STATE     READ WRITE CKSUM
            poolname1                                       ONLINE       0     0     0
              scsi-36001e4f022847f0016c30400b49f5666        ONLINE       0     0     0
            cache
              scsi-360000000000000001786fd14bcbe933c-part1  ONLINE       0     0     0

    errors: No known data errors

Graphs: Data polling interval is 10 seconds at the RRD level.
@DeHackEd When you say it crashed, do you mean the system panicked or became unstable in some way, or that we just exceeded the meta limit? I believe I see why that's happening now regarding the meta limit, but I'd like to better understand the instability you're referring to.
@DeHackEd Can you please try 6c5207088f732168569d1a0b29f5f949b91bb503, which I believe should address your issue. Basically, it was possible under a very heavy metadata workload to move all of the metadata from the MRU/MFU onto the MRU/MFU ghost lists. It would then never be freed from there if there was data on the ghost list which could be reclaimed instead. I'd be interested to hear if this change improves your performance as well.
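A quick way to see whether that situation applies, using only fields from the arcstats dumps already posted in this thread: compare how much of the ghost lists is metadata versus data.

```sh
# Ghost-list composition: if *_ghost_evict_metadata dwarfs *_ghost_evict_data
# while arc_meta_used sits above arc_meta_limit, the scenario described above is likely in play.
grep -E '^(mru_ghost|mfu_ghost)_(size|evict_data|evict_metadata)|^arc_meta_(used|limit)' \
    /proc/spl/kstat/zfs/arcstats
```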
While it does seem to help, it hasn't completely eliminated the issue. At best it just delays the inevitable. arp096's method of using drop_caches does seem to keep it stable, though. Not a great workaround, but still...
I tried the patch, but nothing has changed:
cat /proc/spl/kstat/zfs/arcstats
Yet more output from arcstats, /proc/meminfo and some IO stats, immediately before and after doing a drop_caches after I started hitting the redline of memory usage.
A workaround for issue #1101 seems to fix this issue as well under my typical workload.
Yes, this workaround works for me. But I set ZFS_OBJ_MTX_SZ to 10240. It works fine with ~500 concurrent processes. Thank you!
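For anyone trying the same thing: this is a compile-time constant, so it means patching the source and rebuilding. A sketch, assuming the define lives in include/sys/zfs_znode.h (the grep will confirm the actual location in your tree), with 10240 being simply the value reported to work above:

```sh
cd ~/src/zfs
# Locate and bump the hash-lock array size, then rebuild and reload the module.
grep -rn 'define ZFS_OBJ_MTX_SZ' include/
sed -i 's/ZFS_OBJ_MTX_SZ[[:space:]]\+[0-9]\+/ZFS_OBJ_MTX_SZ 10240/' include/sys/zfs_znode.h
make && sudo make install
```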
I'm closing this long long long issue because a fix was merged for the deadlock; see issue #1101. For the remaining memory management concerns I'll open a new, more succinct issue to track them.
@behlendorf: Do you have a reference to that new memory management issue?
See #1132, it's largely a placeholder.
While running a lot of rsync instances (~45 at once... it makes sense in context) I found my ARC expanding beyond the set limit of 2 gigabytes until ZFS deadlocked - almost all rsync processes hung in the D state and some kernel threads as well. A reboot was necessary. It only took about 5-10 minutes of this kind of pressure to break.
Not too long after, all ZFS-involved processes froze up.
Stack dumps for some hung processes:
Machine specs:
Single-socket quad-core Xeon
8 Gigs of RAM
SPL version: b29012b
ZFS version: 409dc1a
Kernel version: 3.0.28 vanilla custom build
ZPool version 26 (originally built/run by zfs-fuse)
I've also tried using the module/zfs/arc.c from #669 for testing and reducing the ARC size. RAM usage still exceeds the limits set.
Nevertheless it's been running for a few hours now reliably.
(Edit: I also raised vm.min_free_kbytes from its default up to 262144 as part of a shotgun attempt to make this more stable.)