Prefetch enabled causes anon_size to increase to abnormal sizes, which causes arc_evict to peg a core, which renders ZFS slow as a snail #15214
Comments
So, my speculation at the time, and the reason we tried the fetching with a speed limit, was that if the files were getting prefetched into memory, they might be getting pinned until the requests triggering the prefetch finished, and that was why we couldn't reproduce it even under much higher nominal connection load if each connection didn't last long. IIRC turning down the minimum "prefetch stays in memory before being eligible" time didn't do anything either; it was only completely turning off prefetch that killed the load, night and day.

So I'm wildly speculating the behavior is something like: a prefetch triggers loading a huge thing into memory since it's very visibly sequential, the thing that triggered the prefetch "locks" it there until it's done, even if the nominal expiration is over, and then if this happens enough times in parallel, you blow your ARC limits. My other speculation was an interaction with mmap, since doing this with […]. Either way, that's all the state from memory I can decant atm.

edit: I lied. I also have a hysteresis patch to make arc_evict sleep for a bit instead of constantly running until it frees in cases like this, because I don't think it's literally ever helpful to constantly peg things on that, but I haven't tested it much.
I am surprised to see that anon state can be related to prefetch. Prefetched buffers should be in MRU or MFU state. They should not be special and should be evictable on common grounds with other buffers, maybe after a minimal time of a few seconds. Anonymous buffers are ones that have no physical address, for example buffers that were read, but then dirtied while not yet written to obtain a new physical address.

Since sendfile() is mentioned, I wonder how it is implemented on Linux and whether it can produce anonymous buffers somehow. It does not explain the relation to prefetch though.

I've cleaned up a lot of ARC state transitions in the ZFS 2.2 branch. It would be good to know whether the issue is reproducible there, since the ARC code has diverged substantially and I would not like to look for something that no longer exists.
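For anyone reproducing this, a quick way to watch the counters being discussed while the load is running (a sketch; the kstat path is the standard location on Linux OpenZFS):

```sh
# anon_size growing while mru/mfu stay roughly flat is the symptom described in this issue.
watch -n1 "grep -E '^(size|c_max|anon_size|mru_size|mfu_size) ' /proc/spl/kstat/zfs/arcstats"
```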
Side note - @rincebrain mentioned sendfile() - on ZFS 0.7-0.8 (that was in 2019, IIRC) sendfile=on in nginx was slower than off for me, and for nearly the same case (a sequential-video CDN) I've played with nginx's buffer sizes. Here are some of my old notes:

And yes, with recordsize=1m and many slow clients you may want to disable prefetch (if you have some spare IOPS and don't use HDDs).
Just to be clear, the use of […]. Originally our app is PHP-based; it does the required stuff for authorization of the user/client, and then slowly reads the file into memory, writes it to the user, etc. Each client connection is very long-lived (think 30+ minutes each). We have made some changes to the app to rule out any PHP-related performance issue, hence why I mentioned that […].
xargs is pretty easy to use to parallelize shell commands. -P32 is the flag that sets it to 32 parallel processes; this was run on my 16-core/32-thread desktop PC with a background load of ~4.
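As a sketch of the kind of invocation meant here (only -P32 comes from the comment above; the dd command, the size range, and the /tank/testfiles path are assumptions):

```sh
# Create 15000 test files of roughly 512MB-2GB each, 32 at a time.
# /dev/zero is only acceptable because the datasets in question use compression=off.
seq 1 15000 | xargs -P32 -I{} bash -c \
    'dd if=/dev/zero of=/tank/testfiles/file{}.bin bs=1M count=$((RANDOM % 1536 + 512)) status=none'
```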
For a moment I thought it might be similar to what we hit a few times before implementing #14516, but I don't think we saw anon_size growing there, only allocated zio buffers and zios directly, which are not counted towards the ARC.
System information
Describe the problem you're observing
We have several systems that are simply serving multiple video files via HTTP(s).
The applications downloading the files are doing it rather slowly (as they consume the content), so at peak hours we have anywhere from 10,000 to 15,000 users (apps) connected. At these hours we'd usually expect a few thousand different files open and being read from. Most users are usually accessing a different file, with some outliers.
Further context - these are all MP4/MKV files ranging from ~500MB to ~2GB each.
Our hardware is usually:
Our datasets usually have:
- compression=off
- atime=off
- recordsize=128K (the default) on older systems, while newer systems use recordsize=1M

arc_size_max is usually left at the default 50% of RAM (so 128GB, 137438953472), or 256GB on the newer systems.

These are systems with very few writes, close to 0 at peak hours; writes are usually done off-peak.
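As a rough sketch, the setup described above would look something like this (the pool/dataset name and the explicit 256GiB byte value are assumptions; on most of these systems zfs_arc_max is simply left at its default):

```sh
# Dataset properties as described above (tank/videos is a placeholder name)
zfs set compression=off atime=off recordsize=1M tank/videos

# Raise the ARC cap explicitly on the larger boxes, e.g. 256GiB:
echo 274877906944 > /sys/module/zfs/parameters/zfs_arc_max
```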
So what happens:

At peak hours, anon_size will quickly start increasing, at about 2-3GB every ~5 seconds, finally reaching sizes above zfs_arc_max. When this happens, arc_evict will start pegging a CPU core (100% usage), the load average will climb to around 8000+, and ZFS read performance will drop to under 1GB/s. The system is still responsive (we don't use ZFS for the root fs), but the apps (in this case, just nginx or Caddy) are pretty much dead in the water.

With the help of @rincebrain (PMT) on IRC, after countless nights of monitoring and looking through all sorts of variables, we have concluded that disabling prefetch (echo 1 > zfs_prefetch_disable) makes the issue instantly go away.
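For reference, the runtime toggle in question (the full module-parameter path, as on a typical Linux OpenZFS install):

```sh
# Disable ZFS prefetch at runtime
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

# Optionally persist it across module reloads
echo "options zfs zfs_prefetch_disable=1" >> /etc/modprobe.d/zfs.conf
```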
Other things that were tried:
- arc_size_max
- arc_size_max
- zfetch_array_rd_sz
- init_on_alloc=0 init_on_free=0
As stated above, this was reproducible on Ubuntu 18.04 and Ubuntu 22.04, with different "major" kernel versions, from 5.2 to 6.2 (latest HWE in 22.04).
All ZFS versions tried ranged from 2.1.4 (we used the late Jonathon Fernyhough's PPA, may he rest in peace) to 2.1.12.
We have observed this even on our newer systems with 512GB RAM (256GB ARC), although understandably much rarer.
Describe how to reproduce the problem
Step 1: Create around 10,000 - 15,000 files of varying sizes, for example:
(if you have any other ideas of a faster way to do this, please, by all means, step forward)
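The original example command didn't survive in this text; a minimal serial sketch under the same assumptions as the xargs variant in the comments (placeholder path, /dev/zero only because compression=off):

```sh
# Create ~15000 files between roughly 512MB and 2GB each.
mkdir -p /tank/testfiles
for i in $(seq 1 15000); do
    dd if=/dev/zero of=/tank/testfiles/file$i.bin \
       bs=1M count=$((RANDOM % 1536 + 512)) status=none
done
```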
Step 2: set up a web server that supports sendfile() to serve the newly created files. A simple example configuration file for Caddy is shown below; run it with ./caddy run --config Caddyfile.
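The original Caddyfile isn't preserved here; this is a minimal sketch of the kind of config meant, written as a heredoc so it can be pasted in one go (the port and root path are placeholders):

```sh
cat > Caddyfile <<'EOF'
:8080 {
	root * /tank/testfiles
	file_server
}
EOF

./caddy run --config Caddyfile
```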
Step 3: grab a couple more servers with a decently fast link to the server running ZFS (preferably in the same VLAN, or whatever). This can probably be done on the same server, now that I think of it, but in my tests I was trying to replicate a real-world scenario as much as I could.
I've also generated a files.txt list with all the filenames.

Step 4: simulate multiple slow downloads:
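The exact snippet isn't reproduced above; the sketch below shows the general shape described (the URL, rate limit, sleep interval, and loop count are placeholders):

```sh
# Simulate many slow, long-lived downloads from one client box.
# files.txt holds one filename per line, e.g.:
#   find /tank/testfiles -type f -printf '%f\n' > files.txt
for i in {1..2500}; do
    curl -s --limit-rate 500k -o /dev/null \
        "http://zfs-server:8080/$(shuf -n1 files.txt)" &
    sleep 0.1
done
wait
```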
The sleep and the --limit-rate turned out to be very important in reproducing the issue. Being on a 25Gbit/s link, without the rate limit our files were downloading way too fast. Without the sleep it was also much harder to reproduce, and we ran into some other issues on the systems where we were running the commands. We have found that running the above snippet from 2 other different servers (with {1..5000}), at the same time, works out best and reproduces the issue 100% of the time.

Step 5: Watch anon_size increase until ZFS becomes unusable.

For this purpose, before we were able to figure out the root cause of the issue, we wrote a simple script to collect some stats that we could look over to figure out what's happening:
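The script itself wasn't preserved in this text; below is a rough sketch of what such a collector can look like (the file names and 5-second interval are assumptions; the arcstats path is the standard Linux one):

```sh
#!/bin/bash
# Periodically snapshot ARC stats and system load into simple_stats/ for later review.
mkdir -p simple_stats
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    cp /proc/spl/kstat/zfs/arcstats "simple_stats/arcstats.$ts"
    cat /proc/loadavg > "simple_stats/loadavg.$ts"
    sleep 5
done
```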
We left this running while we did the test. The result is down below.
Include any warning/errors/backtraces from the system logs
Also attached are the contents of the simple_stats directory for the duration of the test: simple_stats.tar.gz

Please let me know if I have omitted anything relevant. This was an issue we had been trying to figure out for about 2 months, and when we finally figured out what was causing it and how to "fix" it, I took a longer vacation, so the data is about 2-3 weeks old at this point.
I'm happy to report that the systems have been running fine since then, and we have seen upwards of 40Gbit/s upload speeds without issues. Previously, with prefetch enabled, our systems were starting to die at around 26-27Gbit/s. Obviously, the speeds aren't directly relevant to the issue itself, but just to put things into perspective.