
Data throughput causing apparent (directory) metadata eviction with metadata_size << arc_meta_min #10508

Open · malventano opened this issue Jun 28, 2020 · 27 comments
Labels: Component: Memory Management, Type: Performance

malventano commented Jun 28, 2020

System information

  • Distribution Name: Proxmox VE (some configs below were repeated on Ubuntu 20.04 with the same result)
  • Distribution Version: 6.2 and 6.4-1
  • Linux Kernel (tested across): 5.4.34-1-pve, 5.4.41-1-pve, 5.4.44-1-pve, and most recently 5.4.106-1-pve
  • Architecture: 2x Xeon Gold 6154, 192GB, boot from ext4 SATA SSD or 4x SSD DC P4510 (ZFS as root)
  • ZFS Version: 0.8.4-pve1 (also experienced with 0.8.3 and most recently 2.0.4)
  • SPL Version: 0.8.4-pve1 (same)
  • zfs parameters: spa_slop_shift=7 (also tried default), zfs_arc_meta_min set to 16GB or 32GB
  • zpool parameters: atime=off, xattr=sa, recordsize=1M

After a clean boot, system memory usage climbs to a steady 118-120GB of 188GB as the ARC populates (as reported by htop). No other memory-heavy operations are taking place on the system; it is basically idle save for this testing.

Describe the problem you're observing

In testing usage scenarios sensitive to metadata eviction (repeated indexing/walking of a large directory structure, HDDs remaining spun down while refreshing directory contents, etc.), I've found that beyond a certain file transfer throughput to (and possibly from) a zpool, the directory metadata appears to be purged on a hair trigger. If the throughput remains relatively low (low hundreds of MB/s), the transfer can continue for days with all directory contents remaining cached, but at higher throughputs (600+ MB/s), it takes just a few dozen GB of transfer for directory traversal to revert to its uncached / cold-boot speed. For a raidz zpool with thousands of directories, this means an operation that took <10 seconds to traverse all directories now takes tens of minutes to complete (as long as it took when the cache was cold after boot).

I've tested this with zpool configurations of varying numbers of raidz/z2/z3 vdevs. In all cases, the vdevs were wide enough to support 2GB/s bursts (observable via iostat 0.1), and the 'high' throughputs that trigger the cache misses are still sufficiently low that they don't appear to be running up against the write throttle (observed bursts of 2GB/s for a fraction of the zfs_txg_timeout, with 1+ seconds of 0 writes between timeouts).

For my setup:

  • arc_meta_limit = 70GB (default)

After a cold boot and a find run against the zpool (~1.2 million files / ~100k dirs):

  • metadata_size = ~5GB
  • arc_meta_used = ~10GB
  • arc_meta_min = 32GB (set via zfs_arc_meta_min)

The metadata_size and arc_meta_used will fluctuate depending on the type of activity, but even a few seconds of sufficiently high data throughput can cause the cache to drop directory data.
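For reference, these counters can be watched directly from the arcstats kstat while reproducing; a minimal sketch (counter names as they appear on 0.8.x/2.x, using the standard /proc/spl/kstat path; arc_summary works too):

# Print the metadata-related ARC counters (in GiB) once per second.
while true; do
    awk '/^(metadata_size|arc_meta_used|arc_meta_min|arc_meta_limit|dnode_size|mru_evictable_metadata|mfu_evictable_metadata) / {
        printf "%-26s %8.2f GiB\n", $1, $3 / 2^30
    }' /proc/spl/kstat/zfs/arcstats
    echo "---"
    sleep 1
done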

The specific vdev/zpool configuration(s) in use do not appear to have any impact. I've observed the cache misses triggered with the following widely varying scenarios:

  • Copying to multiple single-HDD zpools via Samba over a 10Gbps LAN (at 1GB/s), multiple threads.
  • Copying from multiple single-drive zpools to a single raidz2 (single, dual, and triple vdev) via rsync and/or cp.
  • Copying from one raidz2 to another raidz2 with multiple rsync and/or cp threads.
  • Copying from one raidz2 to another with a single cp thread (appears thread-limited to ~600 MB/s).
  • However, if I repeat that last operation with a single rsync instead of a single cp (which runs closer to ~300 MB/s), this transfer can continue for literally days with zero impact on cached directories. Add a few more rsyncs or shift back to a single cp and it takes just a few seconds for the directories to no longer be cached.

Some other observations:

  • After a clean boot + find to cache directory metadata, mru/mfu_evictable_metadata remains at 0 until some time after data_size hits its limit (~83GB for this config), then evictable_metadata begins to climb as metadata_size and meta_used start to fall off, but if the transfer is stopped and the find operation is repeated, the numbers revert and the directories remain cached. This shifting to evictable occurs even though meta_used is well below meta_min (and meta_min had been set since boot). If the transfer continues or throughput increases to a high enough value, even looping the find operation (every 3-4 seconds if cache hits) will not be sufficient to keep dirs cached. If the transfer is paused, the (now cache missing) find operation will be observable in zpool iostat, and will take a very long time to complete. Once complete, a repeat again shows the dirs are cached, and then resuming the file transfer for a few more seconds will once again lead to cache misses on the next find.

Describe how to reproduce the problem

  • Monitor arcstats (e.g. /proc/spl/kstat/zfs/arcstats or arc_summary) throughout; a shell sketch of these steps follows this list.
  • Set zfs_arc_meta_min to a value higher than your setup's arc_meta_used climbs to after a find operation caches all directory structure data. Ideally have this set at boot.
  • Have a couple of zpools or one zpool and other data sources capable of producing sufficient throughput.
  • Have enough dirs/files present on a zpool such that a find operation can demonstrate a performance difference between cached and uncached performance.
  • Perform a few find operations on the volume to traverse the tree and cache metadata.
  • Begin a file copy. This can be any type of relatively high throughput operation. I've found 600 MB/s to be the approximate threshold necessary to trigger the issue.
  • After a few seconds of the copy operation, repeat the find operation and note performance.
  • If the copy throughput was low (e.g. copying from a single HDD to the zpool, or just a single thread via SMB, etc), note that the find operation continues to complete quickly, even after several hours / dozens of TB transferred.
  • If the copy throughput was high (>~600 MB/s), note that the find operation reverts to cold-boot performance.
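For anyone wanting to script the above, a rough shell sketch of the steps (pool names, paths, and the 32GiB floor are placeholders, not my exact setup):

# 1. Raise the metadata floor (ideally set in /etc/modprobe.d so it applies from boot).
echo $((32 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_meta_min

# 2. Warm and verify the metadata cache.
time find /tank -type f | wc -l    # cold: slow
time find /tank -type f | wc -l    # warm: should complete in seconds

# 3. Generate sustained data throughput (>~600 MB/s in my testing) from another source.
cp -a /othertank/bigdata /tank/copytest &

# 4. After a few seconds of copying, re-check traversal speed.
time find /tank -type f | wc -l    # reverts to cold-boot speed when the issue triggers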

Include any warning/errors/backtraces from the system logs

Nothing abnormal to report.

behlendorf added the Type: Performance label on Jun 29, 2020

chrisrd commented Jul 1, 2020

Have you tried increasing zfs_arc_dnode_limit_percent (or zfs_arc_dnode_limit) to avoid flushing the dnodes too aggressively?

(See also man zfs-module-parameters)
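For reference, a sketch of how those parameters can be set, either at runtime or persistently (the 40% figure is just an example):

# Runtime, takes effect immediately:
echo 40 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
# ...or an absolute byte value, which overrides the percentage:
echo $((32 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_dnode_limit

# Persistent across reboots, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_arc_dnode_limit_percent=40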

malventano commented Jul 1, 2020

Have you tried increasing zfs_arc_dnode_limit_percent (or zfs_arc_dnode_limit) to avoid flushing the dnodes too aggressively?

zfs_arc_dnode_limit automatically rises to match zfs_arc_meta_min, which is set to 32GB (and overrides zfs_arc_dnode_limit_percent).
dnode_size remains stable at 5.04GB during all operations above (as well as when directory queries become cache misses once the data throughput/copy operation resumes).

As an additional data point, in performing repeated find operations on the zpool in order to confirm dnode_size remained constant for this reply, and with the copy operation stopped, I noted that with just a single background task reading data from the zpool (in this case a media scanner) several back-to-back find operations in a row appeared to run at uncached speed. It wasn't until the 3rd or 4th repeat that the directory metadata appeared to 'stick' in the cache. Then once the directory metadata was cached, with the background sequential read task continuing, I can watch mfu_evictable_metadata slowly climb again. Repeating the find knocks it down again. This system has seen a lot of sequential throughput over the past two days, without me repeating the find until just now.

It's as if dnode/metadata is fighting with data mru/mfu somehow and is failing to follow the respective parameters. The only way I can get the directory metadata to remain cached is to repeat a find across the zpool at a rate sufficient to prevent eviction. The higher the data throughput, the more frequently I need to repeat the directory traversal to keep it in the arc. If I repeat the find a bunch of times in a row, that appears to 'buy me some time', but if data transfer throughput is increased, then I must increase the frequency of the find operation to compensate or else it will revert to uncached performance. With sufficiently high data throughput, even constantly traversing all directories may not be sufficient to keep them cached.

This should not be occurring with arc_meta_min set higher than the peak observed metadata_size / dnode_size. A possible workaround is to run find . against the zpool from cron every minute (see the sketch at the end of this comment), but that shouldn't be necessary given that the current parameters should be sufficient to keep this dnode metadata in the ARC.

Edit: in the time it took me to write those last two paragraphs, with the copy operation resumed (1GB/s from one zpool to another), the find operation once again returned to uncached performance. dnode_size remains at 5.04GB and metadata_size at 7.16GB.
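For completeness, the crude cron workaround mentioned above would look something like this (the /z path matches my pools; adjust to taste):

# /etc/cron.d/zfs-warm-metadata
# Re-walk the pool every minute so directory metadata keeps being touched
# before the ARC gets a chance to evict it.
* * * * * root find /z > /dev/null 2>&1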

devZer0 commented Jul 28, 2020

It's even worse - I have found that metadata is evicted even when there is no memory pressure or other data throughput at all, i.e. just a simple

rsync -av --dry-run /dir/subdirs/with/1mio+/files/altogether /tmp (which effectively lstat()s all files recursively)

on a freshly booted system will make the ARC go crazy.

On my VM with 8GB RAM, I see the ARC collapsing during the initial and subsequent runs - and although all metadata fits in RAM, a second rsync run never performs well enough nor is it completely served from RAM (which is what I would expect). We can see Slab and SUnreclaim grow in /proc/meminfo.

I discussed this on IRC and was recommended to drop the dnode cache via echo 2 > /proc/sys/vm/drop_caches, but this does not really work for me.

There is definitely something stupid going on here - ZFS caching apparently puts a spoke in its own wheel...

chrisrd commented Jul 29, 2020

Possibly related:
#10331 - fix dnode eviction typo in arc_evict_state()
#10563 - dbuf cache size is 1/32nd what was intended
#10600 - Revise ARC shrinker algorithm
#10610 - Limit dbuf cache sizes based only on ARC target size by default
#10618 - Restore ARC MFU/MRU pressure

malventano commented May 18, 2021

Circling back here, I can confirm this issue remains on 2.0.4-pve1. So long as there is relatively high data throughput to/from any attached zpool, no amount of *_meta_min and *_limit tweaking will keep metadata in the ARC, resulting in painfully slow directory operations across all datasets. Watching arcstats, any newly cached metadata quickly gets pushed off to evictable and is purged within minutes. With sufficient throughput, even a continuous loop of find . runs on a dataset is unable to keep directory metadata cached.

This ultimately results in directory operations to HDD arrays taking several orders of magnitude longer to complete than they would had they remained in the arc. The delta in my case is a few seconds vs. >10 minutes to complete a find or related (rsync) operation that needs to walk all paths of the dataset. It should not be necessary to add l2arc (which doesn't work anyway: #10957 ) or special vdevs as a workaround for a system where the total directory metadata consumes <10% of the arc yet continues to be prioritized lower than read-once data (bulk file copy) passing through it.

devZer0 commented May 18, 2021

#12028

nim-odoo commented:

Hello,

We are facing the same issue for a similar use case (rsync over 1M+ files). We can mitigate the behavior with primarycache|secondarycache=metadata, but a zfs send still fills the ARC with data! It doesn't make any sense to me: why would we find data in the ARC while we specifically set it to store metadata only? Moreover, setting secondarycache=metadata implies a waste of resources since our metadata doesn't need the 512 GB available in the L2ARC.
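(For reference, those caching properties are set per dataset; a minimal example with a hypothetical pool/dataset name:)

zfs set primarycache=metadata tank/backups
zfs set secondarycache=metadata tank/backups
zfs get primarycache,secondarycache tank/backups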

As said by other users, no matter how much we try to tune arc_meta_min it doesn't seem to have any effect.

Below is an example of what happens at 0:00 when the rsync process starts.

With primarycache|secondarycache=all, metadata (yellow) is evicted although zfs_arc_meta_min is set to 19993325568 (~19 GB). It happens around 0:30, when a large file (46 GB) is transferred by rsync.

Screenshot 2021-10-12 at 08-27-45 SaaS Node - Grafana

With primarycache|secondarycache=metadata, data (green) is filled in the ARC at 2:00 AM when zfs send starts:

Screenshot 2021-10-12 at 08-27-05 SaaS Node - Grafana

nim-odoo commented Oct 12, 2021

After checking the code, this seems expected, or at least arc_meta_min is never used as a 'reserved' size; it is simply used to decide whether metadata should be evicted before data.

zfs/module/zfs/arc.c

Lines 4569 to 4717 in 2a49ebb

arc_evict(void)
{
    uint64_t total_evicted = 0;
    uint64_t bytes;
    int64_t target;
    uint64_t asize = aggsum_value(&arc_sums.arcstat_size);
    uint64_t ameta = aggsum_value(&arc_sums.arcstat_meta_used);

    /*
     * If we're over arc_meta_limit, we want to correct that before
     * potentially evicting data buffers below.
     */
    total_evicted += arc_evict_meta(ameta);

    /*
     * Adjust MRU size
     *
     * If we're over the target cache size, we want to evict enough
     * from the list to get back to our target size. We don't want
     * to evict too much from the MRU, such that it drops below
     * arc_p. So, if we're over our target cache size more than
     * the MRU is over arc_p, we'll evict enough to get back to
     * arc_p here, and then evict more from the MFU below.
     */
    target = MIN((int64_t)(asize - arc_c),
        (int64_t)(zfs_refcount_count(&arc_anon->arcs_size) +
        zfs_refcount_count(&arc_mru->arcs_size) + ameta - arc_p));

    /*
     * If we're below arc_meta_min, always prefer to evict data.
     * Otherwise, try to satisfy the requested number of bytes to
     * evict from the type which contains older buffers; in an
     * effort to keep newer buffers in the cache regardless of their
     * type. If we cannot satisfy the number of bytes from this
     * type, spill over into the next type.
     */
    if (arc_evict_type(arc_mru) == ARC_BUFC_METADATA &&
        ameta > arc_meta_min) {
        bytes = arc_evict_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
        total_evicted += bytes;

        /*
         * If we couldn't evict our target number of bytes from
         * metadata, we try to get the rest from data.
         */
        target -= bytes;

        total_evicted +=
            arc_evict_impl(arc_mru, 0, target, ARC_BUFC_DATA);
    } else {
        bytes = arc_evict_impl(arc_mru, 0, target, ARC_BUFC_DATA);
        total_evicted += bytes;

        /*
         * If we couldn't evict our target number of bytes from
         * data, we try to get the rest from metadata.
         */
        target -= bytes;

        total_evicted +=
            arc_evict_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
    }

    /*
     * Re-sum ARC stats after the first round of evictions.
     */
    asize = aggsum_value(&arc_sums.arcstat_size);
    ameta = aggsum_value(&arc_sums.arcstat_meta_used);

    /*
     * Adjust MFU size
     *
     * Now that we've tried to evict enough from the MRU to get its
     * size back to arc_p, if we're still above the target cache
     * size, we evict the rest from the MFU.
     */
    target = asize - arc_c;

    if (arc_evict_type(arc_mfu) == ARC_BUFC_METADATA &&
        ameta > arc_meta_min) {
        bytes = arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
        total_evicted += bytes;

        /*
         * If we couldn't evict our target number of bytes from
         * metadata, we try to get the rest from data.
         */
        target -= bytes;

        total_evicted +=
            arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
    } else {
        bytes = arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
        total_evicted += bytes;

        /*
         * If we couldn't evict our target number of bytes from
         * data, we try to get the rest from metadata.
         */
        target -= bytes;

        total_evicted +=
            arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
    }

    /*
     * Adjust ghost lists
     *
     * In addition to the above, the ARC also defines target values
     * for the ghost lists. The sum of the mru list and mru ghost
     * list should never exceed the target size of the cache, and
     * the sum of the mru list, mfu list, mru ghost list, and mfu
     * ghost list should never exceed twice the target size of the
     * cache. The following logic enforces these limits on the ghost
     * caches, and evicts from them as needed.
     */
    target = zfs_refcount_count(&arc_mru->arcs_size) +
        zfs_refcount_count(&arc_mru_ghost->arcs_size) - arc_c;

    bytes = arc_evict_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA);
    total_evicted += bytes;

    target -= bytes;

    total_evicted +=
        arc_evict_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);

    /*
     * We assume the sum of the mru list and mfu list is less than
     * or equal to arc_c (we enforced this above), which means we
     * can use the simpler of the two equations below:
     *
     *     mru + mfu + mru ghost + mfu ghost <= 2 * arc_c
     *                 mru ghost + mfu ghost <= arc_c
     */
    target = zfs_refcount_count(&arc_mru_ghost->arcs_size) +
        zfs_refcount_count(&arc_mfu_ghost->arcs_size) - arc_c;

    bytes = arc_evict_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA);
    total_evicted += bytes;

    target -= bytes;

    total_evicted +=
        arc_evict_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);

    return (total_evicted);
}

As far as I understand, target is the number of bytes to be evicted. When calling arc_evict_impl on ARC_BUFC_METADATA, arc_meta_min is not subtracted from the target, so nothing prevents the cached metadata from falling below arc_meta_min. Note that this is not straightforward to change: part of the eviction is done in the MRU, and part of it in the MFU. Something that could work (not tested) is https://github.com/nim-odoo/zfs/commit/48b97ae0c4b7f83d091b70aeb74c416d7fef3a8b, which uses arc_meta_min as a reserved amount of the MFU.

malventano commented Oct 12, 2021

part of the eviction is done in the MRU, and part of it is in the MFU

Given that, it is odd that I've seen some attempted workarounds still fail in the face of background read-once throughput, where I had a find running across all arrays every few minutes, so metadata should most certainly have been MFU, but it was still evicted during a large file transfer. As for the performance penalty delta of this eviction, this is what it looks like on my setup:

# time find /z/* -type f|wc -l
2861207

real    11m21.428s
user    0m2.338s
sys     0m17.663s
# time find /z/* -type f|wc -l
2861207

real    0m12.341s
user    0m1.768s
sys     0m9.871s

...that's a 60x file indexing performance hit triggered only by transferring a single large file. Either metadata needs a significantly higher priority in the arc or arc_meta_min needs to behave as a reserved amount.

nim-odoo commented Oct 13, 2021

It's hard to say without knowing in which queue (MRU/MFU) the data and metadata are during eviction... But I get your point: if the metadata is in the MFU, one would expect to keep it and evict the data in the MRU first.

I'm also wondering how the Target memory size is computed (e.g. in the graphs I posted in #10508 (comment)). Looking at the first graph, I understand the following:

  • at 0:00, the rsync starts
  • between 0:00 and 0:30, the Target memory decreases because data is evicted from the MFU (makes sense since the data in cache is not the data needed by rsync). At the same time, the MRU increases (new data is added)
  • at 0:30, the MRU goes above the MFU and things get messy: the Target memory increases progressively, but the MRU still decreases because of metadata eviction.

So my question is: why does the Target memory decrease in the first place? If ZFS can take up to 80% of my RAM, why not use it? In my case, ZFS is the only greedy process running on the machine, so I don't mind if it uses all the RAM allocated to it. There is no need to evict data from RAM unless I am hitting the limit.

Edit: ah, there is zfs_arc_min. I'll give it a try.

nim-odoo commented:

Looking better now. Here is the base configuration:

# ARC can use up to 80% of the total memory
options zfs zfs_arc_max={{ (0.8 * ansible_memtotal_mb * 1024 ** 2) | int }}

# Metadata can use up to 100% of the ARC
options zfs zfs_arc_meta_limit_percent=100

# When metadata takes less than 75% of the ARC, evict data before metadata
options zfs zfs_arc_meta_min={{ (0.75 * 0.8 * ansible_memtotal_mb * 1024 ** 2) | int }}

ansible_memtotal_mb is the total available memory, in MB.

After a warm-up run with a find, the ARC remains choppy during rsync:
Screenshot 2021-10-14 at 06-50-22 SaaS Node - Grafana

If I add the following, it is much more stable:

# ARC will really use 80% of the total memory :-)
options zfs zfs_arc_min={{ (0.8 * ansible_memtotal_mb * 1024 ** 2) | int }}

Screenshot 2021-10-14 at 06-51-02 SaaS Node - Grafana

nim-odoo commented Oct 20, 2021

Last remark: if all your metadata does not fit in the ARC (and you have no L2ARC), the ARC becomes pretty much useless with rsync.

Let's say we have 4 directories to synchronize (A, B, C and D), and a daily backup is run. If we only have enough memory for 3 out of the 4 directories, here is what happens:

  • Day 1: cache metadata for A, B and C. When D is reached, evict the cache of A. So at the end of the day, the metadata of B, C and D is in the cache.
  • Day 2: start with A => since there is no metadata of A in the cache, evict the metadata of B. Then comes B... with no metadata in cache! Then C, then D...

In the end, the metadata in the cache is never used.

Therefore, on our side, we tried the following:

  • Day 1: A, B, C, D
  • Day 2: D, C, B, A
  • Day 3: A, B, C, D
  • And so on...

And it works! We get a significant rsync speed improvement. When starting a new rsync cycle, the metadata from the end of the previous cycle is still in the cache, which speeds up the process.
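A sketch of that rotation (directory names and the destination are hypothetical); it just flips the sync order every other day so the tail of one run is the head of the next:

#!/bin/sh
# Alternate the rsync order daily: A B C D on even days, D C B A on odd days.
if [ $(( $(date +%s) / 86400 % 2 )) -eq 0 ]; then
    DIRS="A B C D"
else
    DIRS="D C B A"
fi
for d in $DIRS; do
    rsync -a "/pool/$d/" "backup-host:/backups/$d/"
done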

In order to maximize the use of the ARC, we configure the following:

zfs_arc_max=80% of RAM available
zfs_arc_min=80% of RAM available
zfs_arc_meta_limit_percent=100
zfs_arc_meta_min=80% of RAM available
primarycache=all

The most important option is zfs_arc_min=zfs_arc_max: you don't want the choppy behavior detailed above, since it leads to metadata eviction. Whatever the value chosen, having both values equal seems the best option for an rsync-only server. As for zfs_arc_meta_min, I'm not convinced it is useful in practice, but for my peace of mind I leave it set so that data should always be evicted before metadata.
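Outside of Ansible, a minimal sketch that derives the same values from MemTotal and persists them (the zfs-arc.conf file name is just an example; primarycache=all is already the dataset default, so nothing needs setting for it):

#!/bin/sh
# Compute 80% of physical RAM in bytes and persist the ARC tunables.
mem_bytes=$(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) * 1024 ))
arc=$(( mem_bytes * 80 / 100 ))
cat > /etc/modprobe.d/zfs-arc.conf <<EOF
options zfs zfs_arc_max=$arc
options zfs zfs_arc_min=$arc
options zfs zfs_arc_meta_limit_percent=100
options zfs zfs_arc_meta_min=$arc
EOF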

Regarding primarycache=all: well, setting it to metadata led to very weird ARC purges:
Screenshot 2021-10-20 at 16-21-55 SaaS Node - Grafana

So in the end we kept primarycache=all. Even if data takes some space in the ARC, at least we do not lose the whole ARC suddenly.

To anyone working on fine-tuning their configuration, I'd strongly suggest gathering and plotting the metrics as we did. It really helps to understand the ARC behavior and compare the various settings.

PS: we did not analyze the source code to verify our hypothesis. I may be wrong in my understanding of the ARC management, but hopefully right enough to improve our use case.

devZer0 commented Dec 22, 2021

It's frustrating to see that this problem has existed for so long and has no priority for getting fixed.

Storing millions of files on ZFS and using rsync or other tools (which repeatedly walk the whole file tree) is not too exotic a use case.

nim-odoo commented:

It's frustrating to see that this problem has existed for so long and has no priority for getting fixed.

Storing millions of files on ZFS and using rsync or other tools (which repeatedly walk the whole file tree) is not too exotic a use case.

I agree, and moreover the rsync alternative, zfs send/receive, also faces severe issues (see #11679 for example). In the end we could work around it, but I'm sure we could reach a much better result without these problems. Just being able to set primarycache=metadata without the weird ARC purges would really help to use the ARC more efficiently for metadata.

Last but not least: rsync is hungry for RAM - really, really hungry when there are millions of files to sync. This makes finding the sweet spot between the ZFS and system RAM allocation quite tricky. We recently switched to servers with 128GB of RAM (instead of 32), and it makes things much easier. To give you an idea, the metadata takes ~40GB of RAM. It's not surprising that we were struggling with our previous configuration.

Other solutions imply other tools, such as restic instead of rsync, but the load is then transferred to your production nodes, which might not be great. restic also has a concept of snapshots which overlaps with ZFS snapshots.

devZer0 commented Dec 23, 2021

Last but not least: rsync is hungry for RAM. Like really really hungry when there are millions of files to sync.

Yes. BTW, did you notice this one? https://www.mail-archive.com/[email protected]/msg33362.html

stevecs commented Sep 16, 2022

Ran into this myself and was pulling my hair out until I saw this issue. I already tried setting zfs_arc_max, zfs_arc_min, zfs_arc_meta_min, and zfs_arc_meta_limit_percent (as well as zfs_arc_meta_limit itself, to match zfs_arc_min/max), to no real avail. Metadata ARC usage maxes out around 40GiB (with arc_max/min/limit set to 80GiB), but even once the metadata cache is warm in the ARC it quickly gets purged down to less than 15GiB and I have to re-warm it. It's stupid that I have to have a cron job running every hour or more frequently just to keep the data in the ARC.

Has there been any progress here at all on changing the pressure defaults to keep metadata as a priority and NOT evict it? Or on actually fixing zfs_arc_meta_min to work properly as a hard floor and absolutely never touch metadata if it's below that floor? This gets really annoying when you have a large rotating-rust array and 95% of the hits are metadata that can't stay cached.

Just thinking: another possible option would be another setting in zfs_arc_meta_strategy, where 0 = metadata-only evict, 1 = balanced evict, and 2 = data-only evict (or data heavily preferred evict). A quick look at the existing knob is sketched below.
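For what it's worth, the existing zfs_arc_meta_strategy parameter (as documented in zfs-module-parameters) only offers the first two modes, so a '2' would be new; it can be inspected and changed at runtime:

# 0 = evict only metadata buffers when reclaiming meta, 1 = balanced (default):
cat /sys/module/zfs/parameters/zfs_arc_meta_strategy
echo 0 > /sys/module/zfs/parameters/zfs_arc_meta_strategy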

adamdmoss commented:

Anyone who can easily repro this, would you be able to test a potential patch or two? IIRC I found and 'fixed' one or two suspicious ARC eviction behaviors a while back while chasing another issue, though I never investigated them further since they didn't turn out to be relevant to my issue at hand. Could be worth a try though.

devZer0 commented Sep 16, 2022

Yes, I could do that.

behlendorf added the Component: Memory Management label on Sep 16, 2022

malventano commented Sep 16, 2022

It's a fairly easy thing to repro. In my own dabbling, I've found that the metadata eviction seems to be coaxed along by the page cache pressure from copying lots of data, but it's not as if the ARC shrinks that small when this happens. I've had some success using nocache -n2 rsync for those operations, but it really should not be necessary. For my case, the system has 384GB of RAM, and I've even toyed with configs of 2TB (augmented by PMEM) and still ran into metadata being purged after a couple dozen TBs were copied.

Expanding on (and possibly contradicting) the above, I have done some digging on my own arrays lately and noted that while traversing all folders, arc_meta_used might only climb to 15GB, but the actual metadata total of the pool was 109GB (measured using this method). So while the directory tables in isolation are a fraction of the total metadata, if you start copying large files, that's more of that remaining 95GB of metadata that's being accessed, and perhaps it's displacing the directory data. Even if that's the case, I'd still consider that misbehaving. I can have a du running on my pools every 10 minutes for weeks, with arc_meta_min set to 32GB, but the first (and only) time that other metadata is accessed for copying a large chunk of files, the directory data gets punted. Perhaps ARC MFU logic is not applying to metadata? But why is the whole thing still not following arc_meta_min?

Metadata associated with walking the file tree should take priority over most of the other types of metadata (and even data) given the steep penalty associated with that data being evicted.

adamdmoss commented Sep 16, 2022

Perhaps try this to start with, if you'd be so kind. It is supposed to relieve what I believe is an accounting error which could lead to over-eviction of both data and metadata, but particularly metadata. I'm a bit dubious that it can manifest to such an extent as to cause the extreme eviction seen in this issue, but who knows. 😁
zfs-evict.patch.txt
I've just done a quick re-test here to ensure it at least applies ~cleanly to zfs master and doesn't explode when used. YMMV.

devZer0 commented Sep 20, 2022

It does not help for my test case.

Even worse - I see metadata getting evicted and the ARC collapsing with a metadata-only workload.

On a VM (4GB RAM) I created a dataset with 1 million empty dirs inside.

When I run repeated

while true; do rsync -av --dry-run /dirpool/dirs /tmp/ | wc -l; done

on that dataset, I see the ARC repeatedly grow and shrink before it ever reaches its max size or metadata fills to its hard limit.

I think you are on the right path @adamdmoss - my guess is that things are getting counted wrong, which leads to over-eviction.

With @nim-odoo's hint at #10508 (comment), on my system the ARC goes completely nuts, collapsing from 2.3GB down to 500MB at regular intervals when rsync is run in a loop (caching nothing but empty dirs). Why those need so much space in the ARC is another story... (#13925)

nim-odoo commented:

@devZer0 sorry to see that it doesn't help in your case. We ended up with this config, but we have also switched to servers with 128 GB of RAM:

# ARC can use up to 96045MB
options zfs zfs_arc_max=100710481920

# ARC will use at least 96045MB
options zfs zfs_arc_min=100710481920

# Metadata can use up to 100% of the ARC
options zfs zfs_arc_meta_limit_percent=100

# When metadata takes less than 80% of the ARC, evict data before metadata
options zfs zfs_arc_meta_min=80568385536

# Activate prefetching writes on the L2ARC
options zfs l2arc_noprefetch=0

ARC and L2ARC store both data and metadata. The bottom line of all our investigations is that ZFS requires a LOT of RAM to perform efficiently with the rsync use case. The trick of reversing the list of dirs is a good starting point, but in the end it is not enough.

stevecs commented Sep 21, 2022

@nim-odoo The problem (or a major portion of it) is that zfs_arc_meta_min is NOT honored, i.e. metadata is still evicted even when the metadata size is a fraction of that value.

I'm really wondering if there is just a long-standing 'bug' or if another mechanism is interfering (e.g. the arc_shrinker count/scan callbacks, along the lines of what George Wilson found at the 2020 ZFS dev summit).

But basically there seems to be a programmatic preference to heavily evict metadata, when we actually want the opposite: to heavily prefer evicting data.

adamdmoss commented Sep 24, 2022

I seem to recall the post-split illumos engineers saying* that eliminating the ARC's distinction between data and metadata was one of the best improvements they ever made. It seems like a blunt instrument just for the purpose of addressing this specific issue, but it seems worth investigating.

*[citation needed]

devZer0 commented Oct 14, 2022

Can somebody keep an eye on this and/or confirm/deny my observation at #12028 (comment)? Thanks!

malventano commented:

I believe some if not all of this may be improved with #14359
It retains the distinction between data and metadata, but cleans up a bunch of the limits (including arc_meta_min, which wasn't working anyway). Will have to test and see (and I encourage others in this thread to do the same if possible).

devZer0 commented Feb 10, 2024

I believe some if not all of this may be improved with #14359

No, unfortunately not. Metadata / dnode information still gets evicted too early, and you have no way to make sure it stays in the ARC.

#12028 (comment)
