Data throughput causing apparent (directory) metadata eviction with metadata_size << arc_meta_min #10508
Comments
Have you tried increasing […]? (See also […])
zfs_arc_dnode_limit automatically rises to match zfs_arc_meta_min, which is set to 32GB (and overrides zfs_arc_dnode_limit_percent).

As an additional data point: while performing repeated find operations on the zpool to confirm that dnode_size remained constant for this reply, and with the copy operation stopped, I noticed that with just a single background task reading data from the zpool (in this case a media scanner), several back-to-back find operations appeared to run at uncached speed. It wasn't until the 3rd or 4th repeat that the directory metadata appeared to 'stick' in the cache. Once the directory metadata was cached, with the background sequential read task continuing, I can watch mfu_evictable_metadata slowly climb again; repeating the find knocks it back down. This system has seen a lot of sequential throughput over the past two days, without me repeating the find until just now. It's as if dnode/metadata is fighting with data MRU/MFU somehow and failing to follow the respective parameters.

The only way I can get the directory metadata to remain cached is to repeat a find across the zpool at a rate sufficient to prevent eviction. The higher the data throughput, the more frequently I need to repeat the directory traversal to keep it in the ARC. If I repeat the find a bunch of times in a row, that appears to buy me some time, but if data transfer throughput is increased, I must increase the frequency of the find operation to compensate or else it reverts to uncached performance. With sufficiently high data throughput, even constantly traversing all directories may not be sufficient to keep them cached. This should not be occurring with arc_meta_min set higher than the peak observed metadata_size / dnode_size.

A possible workaround is to cron a `find .` across the zpool every minute, but that shouldn't be necessary given that the current parameters should be sufficient to keep this dnode metadata in the ARC.

Edit: in the time it took me to write those last two paragraphs, with the copy operation resumed (1GB/s from one zpool to another), the find operation once again returned to uncached performance. dnode_size remains at 5.04GB and metadata_size at 7.16GB.
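For reference, the counters discussed here can be read directly from the kstat interface; a minimal sketch, assuming Linux with OpenZFS 0.8.x/2.x (values are in bytes):

```sh
# Pull the ARC metadata counters referenced above out of the kstat interface.
grep -E '^(dnode_size|metadata_size|arc_meta_used|arc_meta_min|mru_evictable_metadata|mfu_evictable_metadata) ' \
    /proc/spl/kstat/zfs/arcstats

# Current dnode limit; 0 means it is derived from zfs_arc_dnode_limit_percent.
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
```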
It's even worse: I have found that metadata is evicted even when there is no memory pressure or other data throughput at all. A simple rsync -av --dry-run /dir/subdirs/with/1mio+/files/altogether /tmp (which effectively lstat()'s all files recursively) on a freshly booted system will make the ARC go crazy.

On my VM with 8GB RAM, I see the ARC collapsing during the initial and subsequent runs, and although all metadata fits in RAM, a second rsync run never performs adequately and is never served completely from RAM (which I would expect). We can see Slab and SUnreclaim grow in /proc/meminfo.

I discussed this on IRC and was recommended to drop the dnode cache via echo 2 >/proc/sys/vm/drop_caches, but this does not really work for me. There is definitely something stupid going on here; ZFS caching apparently puts a spoke in its own wheel...
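For anyone wanting to watch the same thing, a minimal sketch (interfaces are the standard Linux and OpenZFS ones):

```sh
# While the rsync dry-run loops, compare kernel slab usage with the ARC counters.
grep -E '^(Slab|SUnreclaim)' /proc/meminfo
grep -E '^(size|arc_meta_used|dnode_size) ' /proc/spl/kstat/zfs/arcstats

# The suggestion from IRC: free dentries/inodes (and with them the dnodes they
# pin). As noted above, it did not help in this case.
echo 2 > /proc/sys/vm/drop_caches
```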
Circling back here, I can confirm this issue remains on 2.0.4-pve1. So long as there is relatively high data throughput to/from any attached zpool, no amount of *_meta_min and *_limit tweaking will keep metadata in the ARC, resulting in painfully slow directory operations across all datasets. Watching arcstats, any newly cached metadata quickly gets pushed onto the evictable lists and is purged within minutes. With sufficient throughput, even a continuous loop of find .'s on a dataset is unable to keep directory metadata cached.

This ultimately results in directory operations on HDD arrays taking several orders of magnitude longer to complete than they would had the metadata remained in the ARC. The delta in my case is a few seconds vs. >10 minutes to complete a find or related (rsync) operation that needs to walk all paths of the dataset. It should not be necessary to add an L2ARC (which doesn't work anyway: #10957) or special vdevs as a workaround on a system where the total directory metadata consumes <10% of the ARC yet continues to be prioritized below read-once data (a bulk file copy) passing through it.
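For anyone who needs the stopgap anyway, this is the "keep re-walking the tree" workaround in cron form; the pool path and interval are illustrative, and as described above it still loses out under sustained high throughput:

```sh
# /etc/cron.d/zfs-warm-metadata  (illustrative; not a fix, just a crutch)
# Re-walk the directory tree every minute so its dnodes keep getting touched.
* * * * * root /usr/bin/find /tank -type d >/dev/null 2>&1
```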
Hello,

We are facing the same issue for a similar use case (rsync over 1M+ files). We can mitigate the behavior with […]. As said by other users, no matter how much we try to tune […], the metadata still gets evicted.

Below is an example of what happens at 0:00 when the rsync process starts. With […] / With […]
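For reference, the tunables being discussed are module parameters; a sketch of how they are typically set, with illustrative values (zfs_arc_meta_min and zfs_arc_meta_limit only exist in OpenZFS releases prior to the 2.2 ARC rework):

```sh
# Persistent settings (illustrative values, not a recommendation):
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_arc_max=68719476736
options zfs zfs_arc_meta_min=34359738368
options zfs zfs_arc_dnode_limit_percent=75
EOF

# Or applied at runtime:
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_meta_min
echo 75 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
```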
After checking the code it seems expected, or at least consistent with the current eviction logic (lines 4569 to 4717 at commit 2a49ebb).

As far as I understand, […]
Given that, it is odd that some attempted workarounds still fail in the face of background read-once throughput: I had a find running across all arrays every few minutes, so the metadata should most certainly have been MFU, yet it was still evicted during a large file transfer. As for the performance penalty of this eviction, this is what it looks like on my setup:
...that's a 60x file-indexing performance hit triggered merely by transferring a single large file. Either metadata needs a significantly higher priority in the ARC, or […]
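A generic sketch (paths illustrative) of how such a before/after comparison can be taken:

```sh
# 1. Warm pass: walk the tree while its metadata is cached.
time find /tank -type f | wc -l        # seconds when the metadata is in the ARC

# 2. Push a large amount of read-once data through the pool.
cp /tank/some/large/file /otherpool/   # any single big sequential transfer

# 3. Repeat the walk; once the metadata has been evicted it runs at disk speed.
time find /tank -type f | wc -l
```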
It's hard to say without knowing in which queue (MRU/MFU) the data and metadata are during eviction... But I get your point: if the metadata is in the MFU, one would expect to keep it and evict the MRU data first. I'm also wondering how the Target memory size is computed (e.g. in the graphs I posted in #10508 (comment)). Looking at the first graph, I understand the following:
So my question is: why does the Target memory size decrease in the first place? If ZFS can take up to 80% of my RAM, why not use it? In my case, ZFS is the only greedy process running on the machine, so I don't mind if it uses all the RAM allocated to it. There is no need to evict data from RAM unless I am hitting the limit. Edit: ah, there is […]
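For what it's worth, the Target in those graphs corresponds to the ARC target size kstat `c`; a sketch of how to watch it against its configured bounds, assuming OpenZFS 2.x on Linux:

```sh
# ARC target size (c) versus its bounds and current usage, in bytes.
grep -E '^(c|c_min|c_max|size|arc_meta_used) ' /proc/spl/kstat/zfs/arcstats

# One knob that limits how hard the kernel shrinker can pull the target down:
cat /sys/module/zfs/parameters/zfs_arc_shrinker_limit
```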
It's frustrating to see that this problem has existed for so long and has no priority for getting fixed. Storing millions of files on ZFS and using rsync or other tools that repeatedly walk the whole file tree is not too exotic a use case.
I agree, and moreover the rsync alternative, zfs send/receive, also faces severe issues (see #11679 for example). In the end we could work around it, but I'm sure we could reach a much better result without these problems. Just being able to set […]

Last but not least: rsync is hungry for RAM. Really, really hungry when there are millions of files to sync. This makes finding the sweet spot between the ZFS and system RAM allocations quite tricky. We recently switched to servers with 128GB of RAM (instead of 32), and it makes things much easier. To give you an idea, the metadata takes ~40GB of RAM, so it's not surprising that we were struggling with our previous configuration.

Other solutions imply other tools, such as restic instead of rsync. But the load is then transferred to your production nodes, which might not be great. restic also has a concept of snapshots, which overlaps with ZFS snapshots.
Yes. BTW, did you notice this one? https://www.mail-archive.com/[email protected]/msg33362.html
Ran into this myself and was pulling my hair out until I saw this issue. I already tried setting zfs_arc_max, zfs_arc_min, zfs_arc_meta_min, and zfs_arc_meta_limit_percent (as well as zfs_arc_meta_limit itself, to match zfs_arc_min/max), to no real avail. Metadata ARC usage maxes out hovering around 40GiB (with arc_max/min/limit set to 80GiB), but even then, after the metadata cache is warm in the ARC it quickly gets purged down to less than 15GiB and I have to re-warm it. It's stupid that I have to have a cron job run every hour (or more frequently) just to keep the data in the ARC.

Has there been any progress here at all on changing the pressure defaults to keep metadata as a priority and NOT evict it? Or on actually fixing zfs_arc_meta_min to work properly as a hard floor and absolutely never touch metadata while it's below that floor? This gets really annoying when you have a large rotating-rust array and 95% of the hits are metadata that can't stay cached.

Just thinking, another possible option would be to add another setting to zfs_arc_meta_strategy, where 0 = metadata-only evict, 1 = balanced evict, and 2 = data-only evict (or data-heavily-preferred evict).
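For context, zfs_arc_meta_strategy does exist as a tunable in OpenZFS releases before 2.2, but only with values 0 (metadata only) and 1 (balanced); the data-preferred mode 2 suggested above would be new. A quick check of the current settings:

```sh
# 0 = evict metadata only, 1 = balanced (default). A "prefer data" mode does
# not exist today; it is the proposal made in this comment.
cat /sys/module/zfs/parameters/zfs_arc_meta_strategy
cat /sys/module/zfs/parameters/zfs_arc_meta_min
```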
Anyone who can easily repro this, would you be able to test a potential patch or two? IIRC I found and 'fixed' one or two suspicious ARC eviction behaviors a while back when I was chasing another issue, though I never investigated them further since they didn't turn out to be relevant to the issue at hand. Could be worth a try though.
Yes, I could do that.
It's a fairly easy thing to repro. In my own dabbling, I've found that the metadata eviction seems to be provoked by the page-cache pressure from copying lots of data, but it's not as if the ARC shrinks all that much when this happens. I've had some success by using […].

Expanding on (and possibly contradicting) the above, I have done some digging on my own arrays lately and noted that while traversing all folders, arc_meta_used might only climb to 15GB, but the actual metadata total of the pool was 109GB (measured using this method). So while the directory tables in isolation are a fraction of the total metadata, if you start copying large files, more of that remaining ~95GB of metadata is being accessed, and perhaps it's displacing the directory data. Even if that's the case, I'd still consider that misbehaving. I can have a […]

Metadata associated with walking the file tree should take priority over most other types of metadata (and even data), given the steep penalty associated with that data being evicted.
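One way to estimate a pool's total on-disk metadata, not necessarily the method linked above, is the zdb block-statistics report (it walks every block, so it is slow on large pools):

```sh
# Per-type block statistics; the rows other than "ZFS plain file" are metadata.
# Pool name is illustrative.
zdb -bb tank
```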
Perhaps try this to start with, if you'd be so kind. This is supposed to relieve what I believe is an accounting error which could lead to over-eviction of both data and metadata, but particularly metadata. I'm a bit dubious that it can manifest to such an extent as to cause the extreme eviction seen in this issue, but who knows. 😁
It does not help for my test case. Even worse, I see metadata getting evicted and the ARC collapsing with a metadata-only workload. On a VM (4GB RAM) I created a dataset with 1 million empty dirs inside. When I run subsequent `while true; do rsync -av --dry-run /dirpool/dirs /tmp/ | wc -l; done` on that dataset, I see the ARC repeatedly grow and shrink before it ever reaches its max size or metadata fills to its hard limit. I think you are on the right path @adamdmoss; my guess is that things are getting counted wrong, which leads to over-eviction.

With @nim-odoo's hint at #10508 (comment), on my system the ARC goes completely nuts, collapsing from 2.3GB down to 500MB at regular intervals when rsync is run in a loop (caching nothing but empty dirs). Why those need so much space in the ARC is another story... (#13925)
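A sketch of the metadata-only reproducer described here; pool and dataset names are illustrative:

```sh
# Create ~1 million empty directories on a dedicated dataset.
zfs create dirpool/dirs
cd /dirpool/dirs
seq -f "d%07.0f" 1 1000000 | xargs mkdir

# Then run the dry-run rsync in a loop (as above) and watch the ARC size
# oscillate in /proc/spl/kstat/zfs/arcstats.
```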
@devZer0 sorry to see that it doesn't help in your case. We ended up with this config, but we have also switched to servers with 128 GB of RAM:
ARC and L2ARC store both data and metadata. The bottom line of all our investigations is that ZFS requires a LOT of RAM to perform efficiently with the rsync use case. The trick of reversing the list of dirs is a good starting point, but in the end it is not enough. |
@nim-odoo The problem (or a major portion of it) is that zfs_arc_meta_min is NOT honored, i.e. metadata is still evicted even when the metadata size is a fraction of that value. I'm really wondering whether this is just a long-standing bug or whether another mechanism is interfering (e.g. arc_shrinker_count/arc_shrinker_scan, along the lines of what George Wilson found at the 2020 ZFS dev summit). But basically there seems to be a programmatic preference for evicting metadata, when we actually want the opposite: a strong preference for evicting data.
I seem to recall the post-split illumos engineers saying* that eliminating the ARC's distinction between data and metadata was one of the best improvements they ever made. It seems like a blunt instrument just for the purposes of addressing this specific issue, but it seems worth investigating. *[citation needed]
Can somebody keep an eye on this and/or confirm/deny my observation at #12028 (comment)? Thanks!
I believe some if not all of this may be improved with #14359 |
No, unfortunately not. Metadata / dnode information still gets evicted too early, and you have no way of making sure it stays in the ARC.
System information
After a clean boot, system memory climbs to a steady 118-120GB of 188GB as the ARC populates (as reported by htop). No other memory-heavy operations are taking place on the system; it is basically idle save for this testing.
Describe the problem you're observing
In testing usage scenarios sensitive to metadata eviction (repeated indexing/walking of a large directory structure, HDDs remaining spun down while refreshing directory contents, etc.), I've found that beyond a certain file transfer throughput to (and possibly from) a zpool, the directory metadata appears to be purged on a hair trigger. If the throughput remains relatively low (low hundreds of MB/s), the transfer can continue for days with all directory contents remaining cached, but with higher throughputs (600+ MB/s), I've found that it takes just a few dozen GB of transfer for the directory traversal to revert to its uncached / cold-boot speed. For a raidz zpool with thousands of directories, this means an operation that took <10 seconds to traverse all directories will now take tens of minutes to complete (as long as it took when the cache was cold after boot).
I've tested this with zpool configurations of varying numbers of raidz/z2/z3 vdevs. In all cases, the vdevs were wide enough to support 2GB/s bursts (observable via iostat 0.1), and the 'high' throughputs that trigger the cache misses are still low enough that they don't appear to be running up against the write throttle (observed bursts of 2GB/s for a fraction of the zfs_txg_timeout, with 1+ seconds of zero writes between timeouts).
For my setup:
After a cold boot and a find run against the zpool (~1.2 million files / ~100k dirs):
The metadata_size and arc_meta_used will fluctuate depending on the type of activity, but even a few seconds of sufficiently high data throughput can cause the cache to drop directory data.
The specific vdev/zpool configuration(s) in use do not appear to have any impact. I've observed the cache misses triggered with the following widely varying scenarios:
Some other observations:
Describe how to reproduce the problem
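A minimal reproduction sketch based on the description above; pool name, paths, and file sizes are illustrative:

```sh
# 1. Warm the directory metadata (~1.2M files / ~100k dirs) and confirm it is cached.
time find /tank >/dev/null             # completes in seconds once warm

# 2. Sustain high sequential throughput (600+ MB/s) to or from the pool,
#    e.g. a large file copy between pools, for a few dozen GB.
rsync --progress /otherpool/bigfile /tank/

# 3. Re-time the directory walk.
time find /tank >/dev/null             # reverts to cold-boot speed (tens of minutes)
```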
Include any warning/errors/backtraces from the system logs
Nothing abnormal to report.