
Prefetch enabled causes anon_size to increase to abnormal sizes, which causes arc_evict to peg a core, which renders ZFS slow as a snail #15214

Open
Znuff opened this issue Aug 28, 2023 · 6 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@Znuff

Znuff commented Aug 28, 2023

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 18.04, 22.04
Kernel Version 5.2.0 -> 6.2.0
Architecture x86
OpenZFS Version 2.1.4 through 2.1.12-1

Describe the problem you're observing

We have several systems that are simply serving multiple video files via HTTP(s).

The applications downloading the files do so rather slowly (as they consume the content), so at peak hours we have anywhere from 10,000 to 15,000 users (apps) connected. At these hours we'd usually expect a few thousand different files to be open and being read from. Most users are accessing a different file, with some outliers.

Further context - these are all MP4/MKV files ranging from ~500MB to ~2GB each.

Our hardware is usually:

  • 2 x Xeon Gold 6xxx (80 or 88 threads)
  • 256GB RAM, newer systems are 512GB
  • multiple SATA or NVMe drives, older systems are RAIDz1, newer ones are mirrors (ie: "raid10")

Our datasets usually have:

  • compression=off
  • atime=off
  • recordsize=128K (default, on older systems), while newer systems are recordsize=1M

zfs_arc_max is usually left at the default of 50% of RAM (so 128GB, 137438953472 bytes), or 256GB on the newer systems.
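For reference, a minimal sketch of how these settings might be applied on such a system (the dataset name storage/test is hypothetical; the byte value is the 128GB figure above):

# Dataset properties (hypothetical dataset name)
zfs set compression=off storage/test
zfs set atime=off storage/test
zfs set recordsize=1M storage/test

# Cap the ARC at 128GB at runtime (value in bytes: 128 * 2^30)
echo 137438953472 > /sys/module/zfs/parameters/zfs_arc_max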

These are systems with very few writes, close to 0 at peak hours, writes are usually done off-peak.

So what happens:

At peak hours, anon_size will quickly start increasing, at about 2-3GB every ~5 seconds, until it finally reaches sizes above zfs_arc_max.

When this happens, arc_evict will start pegging a CPU core (100% usage), the load average will climb to around 8000+, and ZFS read performance will drop to under 1GB/s.

The system is still responsive (we don't use ZFS for the root fs), but the apps (in this case, just nginx or Caddy) are pretty much dead in the water.

With the help of @rincebrain (PMT) on IRC, after countless nights of all sorts of monitoring and looking through the variables, we concluded that disabling prefetch (echo 1 > zfs_prefetch_disable) makes the issue instantly go away.
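For completeness, the switch we flipped lives under the standard ZFS module-parameter path; a sketch of the runtime and persistent variants:

# Disable prefetch at runtime (takes effect immediately)
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

# Make it persistent across reboots
echo "options zfs zfs_prefetch_disable=1" >> /etc/modprobe.d/zfs.conf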

Other things that were tried:

  • reducing zfs_arc_max
    • no immediate effect when the core is pegged
    • a big no-no otherwise; it made the problem appear much sooner
  • increasing zfs_arc_max
    • slight relief (5-10 minutes) when the core is pegged, but not feasible, as the rest of the workload on the server requires the memory
    • obviously it made the problem appear a bit later
  • reducing the zfetch_array_rd_sz tunable (see the sketch after this list)
    • no immediate effect when the core was pegged
    • it seemed to make the issue appear somewhat later, but I can't vouch 100% for this claim
  • booting the kernel with init_on_alloc=0 init_on_free=0
    • no discernible effect
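For the record, all of the tunables we touched live under the ZFS module-parameter directory, and the prefetcher exports its own counters; a small sketch of how they can be inspected (paths are the standard Linux locations, not specific to our setup):

# Current values of the tunables mentioned above
grep -H . /sys/module/zfs/parameters/{zfs_arc_max,zfs_prefetch_disable,zfetch_array_rd_sz}

# Prefetcher hit/miss counters, to see how busy prefetch actually is
cat /proc/spl/kstat/zfs/zfetchstats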

As stated above, this was reproducible on Ubuntu 18.04 and Ubuntu 22.04, with different "major" kernel versions, from 5.2 to 6.2 (latest HWE in 22.04).

All ZFS versions tried were from 2.1.4 (we used the late Jonathon Fernyhough's PPA, may he rest in peace) to 2.1.12.

We have observed this even on our newer systems with 512GB RAM (256GB ARC), although, understandably, much more rarely.

Describe how to reproduce the problem

Step 1: Create around 10,000-15,000 files of varying sizes, for example:

for file in {00001..12000}; do
    # random size between 500MB and 2GB, in 1MB blocks
    count=$(shuf -i 500-2000 -n 1)
    # fill the file with incompressible data (AES-CTR keystream over /dev/zero)
    dd if=<(openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero) of=/storage/test/"$file".mkv bs=1M count="$count" status=progress iflag=fullblock
done

if you have any other ideas of a faster way to do this, please, by all means, step forward

Step 2: Set up a web server that supports sendfile() to serve the newly created files. A simple example configuration file for Caddy:

:80 {
        root * /storage/test/
        file_server
}

Run with ./caddy run --config Caddyfile.
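A quick sanity check that the server is serving the generated files (00001.mkv is the first file from the Step 1 loop):

# Expect a 200 OK and a Content-Length somewhere in the 500MB-2GB range
curl -sI http://localhost/00001.mkv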

Step 3: Grab a couple more servers with a decently fast link to the server running ZFS (preferably in the same VLAN, or whatever). This can probably be done on the same server, now that I think of it, but in my tests I was trying to replicate a real-world scenario as much as I could.

I've also generated a files.txt list with all the filenames.
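(For completeness, that list can be built with something as simple as the following; the download loop below runs each name through basename, so either bare names or full paths work.)

# One entry per line, matching the layout from Step 1
ls /storage/test/ > files.txt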

Step 4: Simulate multiple slow downloads:

for i in {1..10000}; do
    # pick a random filename from the list
    random=$(basename $(shuf -n 1 files.txt));
    # throttled background download, staggered by 5ms between launches
    curl "http://ip-or-hostname-of-zfs-server/$random" --limit-rate 512k -o /dev/null -s & sleep 0.005;
done

The sleep and the --limit-rate turned out to be very important in reproducing the issue. Being on a 25Gbit/s link, without the rate limit our files were downloading way too fast. Without the sleep it was also much harder to reproduce, and we ran into some other issues on the systems where we were running the commands.

We have found that running the above snippet from 2 different servers at the same time (with {1..5000} on each) works best and reproduces the issue 100% of the time.

Step 5: Watch anon_size increase until ZFS becomes unusable

Before we were able to figure out the root cause, we wrote a simple script to collect some stats that we could look over to see what was happening:

#!/bin/bash

# Snapshot ARC/ABD stats, the pool's TXG history, and DMU TX stats every 5 seconds
# ("storage" is our pool name)
while true; do
  ts=$(date +%s)
  cat /proc/spl/kstat/zfs/{arcstats,abdstats} > /root/simple_stats/$ts
  cat /proc/spl/kstat/zfs/storage/txgs > /root/simple_stats/$ts.txgs
  cat /proc/spl/kstat/zfs/dmu_tx > /root/simple_stats/$ts.dmu_tx
  sleep 5;
done

We left this running while we did the test. The result is down below.
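The anon_size series below was extracted from those snapshots afterwards; because each snapshot file is named after its timestamp, a plain grep from inside the stats directory yields the timestamp:value lines shown (a sketch of the kind of command used, not necessarily the exact one):

# grep prefixes each match with the (timestamp-named) file it came from;
# only the arcstats snapshots contain anon_size, so the .txgs/.dmu_tx files are skipped
cd /root/simple_stats && grep '^anon_size ' [0-9]*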

Include any warning/errors/backtraces from the system logs

# start test
1690848729:anon_size                       4    0
1690848734:anon_size                       4    0
1690848739:anon_size                       4    0
1690848744:anon_size                       4    4980736
1690848749:anon_size                       4    262144
1690848754:anon_size                       4    8650752
1690848759:anon_size                       4    262144
1690848764:anon_size                       4    5373952
1690848769:anon_size                       4    12582912
1690848774:anon_size                       4    2914783232
1690848779:anon_size                       4    2261430272
1690848784:anon_size                       4    6743191552
1690848789:anon_size                       4    23561175040
1690848794:anon_size                       4    33879909888
1690848799:anon_size                       4    35658895360
1690848804:anon_size                       4    37011476480
1690848810:anon_size                       4    38404288512
1690848815:anon_size                       4    39710883840
1690848820:anon_size                       4    41137524736
1690848825:anon_size                       4    42138038272
1690848830:anon_size                       4    43572531200
1690848835:anon_size                       4    44700925952
1690848840:anon_size                       4    46099718144
1690848845:anon_size                       4    47678537728
1690848850:anon_size                       4    48854700032
1690848855:anon_size                       4    50707677184
1690848860:anon_size                       4    52155572224
1690848865:anon_size                       4    53704949760
1690848870:anon_size                       4    55225733120
1690848875:anon_size                       4    57224851456
1690848880:anon_size                       4    59076370432
1690848885:anon_size                       4    60825862144
1690848890:anon_size                       4    63073931264
1690848895:anon_size                       4    65474318336
1690848900:anon_size                       4    67701624832
1690848905:anon_size                       4    69834297344
1690848910:anon_size                       4    72069754880
1690848915:anon_size                       4    74059677696
1690848920:anon_size                       4    76245630976
1690848925:anon_size                       4    78294396928
1690848930:anon_size                       4    80374292480
1690848935:anon_size                       4    82643341312
1690848940:anon_size                       4    84612644864
1690848945:anon_size                       4    86627631104
1690848950:anon_size                       4    88472702976
1690848955:anon_size                       4    90138722304
1690848960:anon_size                       4    92011212800
1690848965:anon_size                       4    93586767872
1690848970:anon_size                       4    95339476480
1690848975:anon_size                       4    97005137920
1690848980:anon_size                       4    98858409984
1690848985:anon_size                       4    100303286272
1690848990:anon_size                       4    101620416512
1690848995:anon_size                       4    103295643648
1690849000:anon_size                       4    105143603200
1690849005:anon_size                       4    106730119168
1690849010:anon_size                       4    108533432320
1690849015:anon_size                       4    110196682752
1690849020:anon_size                       4    112266703360
1690849025:anon_size                       4    113870814720
1690849030:anon_size                       4    115303026688
1690849035:anon_size                       4    116841009152
1690849040:anon_size                       4    118218358784
1690849045:anon_size                       4    119878270976
1690849050:anon_size                       4    121337614336
1690849055:anon_size                       4    122584694784
1690849060:anon_size                       4    124311633920
1690849065:anon_size                       4    125827112960
1690849070:anon_size                       4    127280828416
1690849075:anon_size                       4    129083691008
1690849080:anon_size                       4    130502201344
1690849085:anon_size                       4    131607691264
1690849090:anon_size                       4    133178183680
1690849095:anon_size                       4    134726721536
1690849100:anon_size                       4    136033177600
# has hit zfs_arc_max, pool becomes slow/unresponsive, load avg. climbs high
1690849105:anon_size                       4    137770708992
1690849110:anon_size                       4    139562803200
1690849115:anon_size                       4    140936005120
1690849120:anon_size                       4    142426574848
1690849125:anon_size                       4    144179482624
1690849130:anon_size                       4    145780846592
1690849135:anon_size                       4    147235528704
1690849140:anon_size                       4    148994531328
1690849145:anon_size                       4    150346932224
1690849150:anon_size                       4    151910645760
1690849155:anon_size                       4    153118806016
1690849160:anon_size                       4    154534887424
1690849165:anon_size                       4    155853598720
1690849170:anon_size                       4    157367533568
1690849175:anon_size                       4    158677409792
1690849180:anon_size                       4    159813010944
1690849185:anon_size                       4    160900177920
1690849191:anon_size                       4    161802354688
1690849196:anon_size                       4    162771849216
1690849201:anon_size                       4    164056711168
1690849206:anon_size                       4    164928671744
1690849211:anon_size                       4    166061805568
1690849216:anon_size                       4    166857121792
1690849221:anon_size                       4    167983366144
1690849226:anon_size                       4    168952799232
1690849231:anon_size                       4    169564459008
1690849236:anon_size                       4    170411442176
1690849241:anon_size                       4    171341307904
1690849246:anon_size                       4    172124778496
1690849251:anon_size                       4    172857307136
1690849256:anon_size                       4    173464506368
1690849261:anon_size                       4    174517215232
1690849266:anon_size                       4    175086850048
1690849271:anon_size                       4    175600046080
1690849276:anon_size                       4    176314594816
1690849281:anon_size                       4    177413185536
1690849286:anon_size                       4    178128381440
1690849291:anon_size                       4    178910011392
1690849296:anon_size                       4    180020367360
1690849301:anon_size                       4    180916506624
1690849306:anon_size                       4    181533945856
1690849311:anon_size                       4    182605934592
1690849316:anon_size                       4    183530790912
1690849321:anon_size                       4    184614068224
1690849326:anon_size                       4    185434472448
1690849331:anon_size                       4    186759741440
1690849336:anon_size                       4    187462516736
1690849341:anon_size                       4    188359688192
1690849346:anon_size                       4    189318422528
1690849351:anon_size                       4    190125891584
1690849356:anon_size                       4    190902509568
1690849361:anon_size                       4    191575146496
1690849366:anon_size                       4    192679675392
1690849371:anon_size                       4    193427783680
1690849376:anon_size                       4    193839865856
1690849381:anon_size                       4    194836582400
1690849386:anon_size                       4    195533766656
1690849391:anon_size                       4    196094119936
1690849396:anon_size                       4    196292068864
1690849401:anon_size                       4    196729696256
1690849406:anon_size                       4    197170937856
1690849411:anon_size                       4    197425065984
1690849416:anon_size                       4    197795713024
1690849421:anon_size                       4    198022037504
1690849426:anon_size                       4    197978865664
1690849431:anon_size                       4    198221242368
1690849436:anon_size                       4    198396350464
1690849441:anon_size                       4    198587383808
1690849446:anon_size                       4    198736879616
1690849451:anon_size                       4    198756241408
1690849456:anon_size                       4    198970556416
1690849461:anon_size                       4    199120691200
1690849466:anon_size                       4    199172441088
1690849471:anon_size                       4    199359029248
1690849476:anon_size                       4    199365402624
1690849481:anon_size                       4    199434219520
1690849486:anon_size                       4    199446880256
1690849491:anon_size                       4    199417573376
1690849496:anon_size                       4    199321243648
1690849501:anon_size                       4    199163088896
1690849506:anon_size                       4    199017205760
1690849511:anon_size                       4    198945280000
1690849516:anon_size                       4    199081399808
1690849521:anon_size                       4    199076683776
1690849526:anon_size                       4    199013564416
1690849531:anon_size                       4    199089274880
1690849536:anon_size                       4    198928637952
1690849541:anon_size                       4    198655340544
1690849546:anon_size                       4    198423941120
1690849551:anon_size                       4    197795741696
1690849556:anon_size                       4    197421801472
1690849561:anon_size                       4    196781711360
1690849566:anon_size                       4    196646260736
1690849571:anon_size                       4    196438134784
1690849576:anon_size                       4    196089447424
1690849581:anon_size                       4    195753013248
1690849586:anon_size                       4    195551764480
1690849592:anon_size                       4    195470082048
1690849597:anon_size                       4    195803373568
1690849602:anon_size                       4    195965222912
1690849607:anon_size                       4    196199047168
1690849612:anon_size                       4    196475678720
1690849617:anon_size                       4    196960100352
1690849622:anon_size                       4    197188771840
1690849627:anon_size                       4    197438861312
1690849632:anon_size                       4    197223096320
1690849637:anon_size                       4    196862431232
1690849642:anon_size                       4    196761243648
1690849647:anon_size                       4    196701368320
1690849652:anon_size                       4    196514336768
1690849657:anon_size                       4    196083372032
1690849662:anon_size                       4    195927662592
1690849667:anon_size                       4    195565723648
1690849672:anon_size                       4    195426848768
1690849677:anon_size                       4    195056410624
1690849682:anon_size                       4    194748383232
1690849687:anon_size                       4    194505285632
1690849692:anon_size                       4    193986691072
1690849697:anon_size                       4    193591644160
1690849702:anon_size                       4    193269075968
# we disabled prefetch by echo 1 > zfs_prefetch_disable
1690849707:anon_size                       4    192941641728
1690849712:anon_size                       4    192510234624
1690849717:anon_size                       4    190752641024
1690849722:anon_size                       4    187990310912
1690849727:anon_size                       4    178243203072
1690849732:anon_size                       4    158239703040
1690849737:anon_size                       4    135640748032
1690849742:anon_size                       4    109866831872
1690849747:anon_size                       4    83298562048
1690849752:anon_size                       4    55567937536
1690849757:anon_size                       4    34140905472
1690849762:anon_size                       4    10660339712
1690849767:anon_size                       4    0
1690849772:anon_size                       4    0
1690849777:anon_size                       4    0
1690849782:anon_size                       4    0
1690849787:anon_size                       4    131072
1690849792:anon_size                       4    0
1690849797:anon_size                       4    0

I've also attached the contents of the simple_stats directory for the duration of the test: simple_stats.tar.gz

Please let me know if I have omitted anything relevant. This is an issue we tried to figure out for about 2 months, and once we finally figured out what was causing it and how to "fix" it, I took a longer vacation, so the data is about 2-3 weeks old at this point.

I'm happy to report that the systems have been running fine since then, and we have seen upwards of 40Gbit/s upload speeds without issues. Previously, with prefetch enabled, our systems started to die at around ~26-27Gbit/s. Obviously, the speeds aren't exactly relevant to the issue at hand, but they put things into perspective.

@Znuff Znuff added the Type: Defect Incorrect behavior (e.g. crash, hang) label Aug 28, 2023
@rincebrain
Contributor

rincebrain commented Aug 28, 2023

So, my speculation at the time, and the reason we tried the fetching with a speed limit, was that if the files were getting prefetched into memory, they might be getting pinned until the requests triggering the prefetch finished, and that was why we couldn't reproduce it even under much higher nominal connection load if each connection didn't last long.

IIRC, turning down the minimum "prefetch stays in memory before being eligible for eviction" time didn't do anything either; it was only turning prefetch off completely that made a night-and-day difference in the load.

So I'm wildly speculating the behavior is something like this: a prefetch triggers loading a huge thing into memory since the access is very visibly sequential; the request that triggered the prefetch "locks" it there until it's done, even if the nominal expiration is over; and if this happens enough times in parallel, you blow past your ARC limits.

My other speculation was an interaction with mmap, since doing this with sendfile() basically means splice() on Linux, which I would expect to mean mmapping the file in, I think?
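(For anyone who wants to check this on their own setup, a hypothetical spot check, assuming Caddy is the server and strace is installed: tracing just those syscalls shows whether the data path is sendfile()/splice() or mmap.)

# Attach to the web server and log only the syscalls of interest
strace -f -e trace=sendfile,splice,mmap -p "$(pgrep -o caddy)" -o /tmp/caddy-syscalls.trace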

Either way, that's all the state from memory I can decant atm.

edit: I lied. I also have a hysteresis patch to make arc_evict sleep for a bit instead of constantly running until it frees in cases like this, because I don't think it's literally ever helpful to constantly peg things on that, but I haven't tested it much.

@amotin
Member

amotin commented Aug 28, 2023

I am surprised to see that the anon state can be related to prefetch. Prefetched buffers should be in the MRU or MFU state. They should not be special and should be evictable on common grounds with other buffers, maybe after a minimal time of a few seconds. Anonymous buffers are ones that have no physical address, for example, buffers that were read but then dirtied and not yet written out to obtain a new physical address. Since sendfile() is mentioned, I wonder how it is implemented on Linux and whether it can somehow produce anonymous buffers. That would not explain the relation to prefetch, though.
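(As a side note for anyone watching this live: a quick way to see which ARC state is actually holding the memory is to compare the per-state sizes in arcstats; a small sketch using the standard kstat file.)

# A large anon_size relative to mru_size/mfu_size points at buffers that are
# neither cached nor evictable yet
grep -E '^(anon_size|mru_size|mfu_size|size|c_max) ' /proc/spl/kstat/zfs/arcstats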

I've cleaned up a lot of ARC state transitions in the ZFS 2.2 branch. It would be good to know whether the issue is reproducible there, since the ARC code has diverged substantially and I would not like to look for something that no longer exists.

@gmelikov
Member

gmelikov commented Aug 28, 2023

Side note - @rincebrain mentioned sendfile() - on ZFS 0.7-0.8 (that was in 2019, IIRC), sendfile on in nginx was slower than off for me, and for nearly the same case (a sequential-video CDN) I played with nginx's buffer sizes. Here are some of my old notes:

thread_pool pool_c threads=32;

http {
    output_buffers 1 1m;
    sendfile off;
}

server {
    aio threads=pool_c;
}

And yes, with recordsize=1M and many slow clients you may want to disable prefetch (if you have some spare IOPS and don't use HDDs).

@Znuff
Author

Znuff commented Aug 28, 2023

Just to be clear, the use of sendfile() is completely irrelevant to the issue at hand.

Our app is originally PHP-based; it does the required authorization of the user/client, then slowly reads the file into memory and writes it out to the user, etc. Each client connection is very long-lived (think 30+ minutes each).

We made some changes to the app to rule out any PHP-related performance issue, hence why I mentioned sendfile(). But its use did not affect the problem at all.

@thulle

thulle commented Aug 31, 2023

if you have any other ideas of a faster way to do this, please, by all means, step forward

xargs is pretty easy to use to parallelize shell commands.

for file in {00001..04000}; do
    count=$(shuf -i 500-2000 -n 1)
    dd if=<(openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero) of="$file".mkv bs=1k count="$count" iflag=fullblock
done

real	3m2,163s
user	0m15,087s
sys	3m8,480s
$ du -ms .
4930	.

seq -w 1 4000|xargs -I% -P32 bash -c 'dd if=<(openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero) of=%.mkv bs=1k count="$(shuf -i 500-2000 -n 1)" iflag=fullblock'

real	0m17,553s
user	0m8,506s
sys	4m51,734s
$ du -ms .
4922	.

-P32 is the flag that sets it to 32 parallel processes; this was run on my 16-core/32-thread desktop PC with a background load of ~4.

@shodanshok shodanshok mentioned this issue Sep 4, 2023
@amotin
Member

amotin commented Sep 4, 2023

For a moment I thought it might be similar to what we hit a few times before implementing #14516, but I don't think we saw anon_size growing there, only allocated zio buffers and zios directly, which are not counted towards the ARC.
