[arm64] scrub uses all machine memory and locks up the machine #12150
Comments
How large is the pool (used/total)?
Hmmm... so I tried this with 2.0.4, and a similar thing happens:
Hey @rincebrain, I was able to just zpool import tank && zpool scrub -s tank and then I recreated the tank. All good. We're just running some tests on aarch64 so it's fine if the VM dies or locks up. So, the pool looks like:
It has an aggregate read bandwidth of 3.6GB/s. The machine has 96GB of RAM. I haven't done any tuning or modified any settings at all, just loaded the module. I reran scrub on the newly created pool and it also ran out of memory, so I'll see what I can tune.
[ deleted a post because the issue wasn't solved, it just looked like it due to a smaller allocated size on the new pool ]
I'm pretty sure you shouldn't actually close this - IMO "scrub triggers the OOM killer even immediately at boot unless you tweak zfs_arc_max" sounds like a bug even if it's got a mitigation, as it's supposed to default to 50% of total system RAM on Linux, which should be well clear of that threshold on a system with 96 GiB.
@rincebrain yeah, I mean that's what I personally think. Maybe it's just the speed of the newer devices or maybe something isn't optimal on aarch64 or 🤷‍♂️. I'll leave it open then.
I'm running this on Linux/aarch64 on my RPi4 with 8GiB of RAM and it's been up for a month, including a scrub (admittedly only one spinning disk, though, so if it's a race between pruning the ARC and filling it, I wouldn't be hitting it). Is your VM being run on an aarch64 machine in turn, or some x86_64 or other arch? (I'm wondering about the feasibility of triggering this without having some beefy aarch64 hardware available, though I suppose at least one cloud provider will sell you aarch64 VMs...)
@rincebrain it's running on an Ampere Altra host machine, which you can test out for free: https://www.servethehome.com/oracle-cloud-giving-away-ampere-arm-a1-instances-always-free/ The way to get the better quota (16 cores and 96GiB of RAM) is to sign up for the Arm Accelerator: https://go.oracle.com/armaccelerator which is made for OSS developers etc. Note that I'm running it with RHEL, but Oracle Linux is equivalent and you can install the same kernel on there. I just had it die with zfs_arc_max set to 24GiB, so let me paste that in a new reply.
Alright, so, it seems like this only happens if the allocated size on the zpool > available RAM, even when zfs_arc_max is set low and nothing else is running on the machine.
It looked like it was attempting to keep the memory usage under control at least but I think it's just too fast, or something. What I did was use fio to create a file big enough to trigger the issue:
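The fio invocation isn't shown above; a sketch of the kind of job that would produce a file larger than RAM (the mountpoint and size are assumptions, not the original values):

```sh
# hypothetical job: write a single large file so the pool's allocated size
# exceeds RAM (the machine has 96GB); mountpoint /tank is an assumption
fio --name=fill --directory=/tank --size=200g --bs=1m \
    --rw=write --ioengine=libaio --iodepth=16
```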
And the zpool/zfs parameters:
I tried setting zfs_no_scrub_prefetch to 1 but it just slowed the scrub down to 2.28GB/s with the same issue. The thing is... the 'used' output of 'free -m' matches the progress output of zpool status. So just before it dies:
And the last 3 calls to free on a while 1 / free -m / sleep 1 loop:
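For reference, both the prefetch knob and the memory-watch loop mentioned above are one-liners (the sysfs path is the standard OpenZFS module-parameter location):

```sh
# disable scrub prefetch; takes effect for the next scrub
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_no_scrub_prefetch

# print memory usage once a second while the scrub runs
while true; do free -m; sleep 1; done
```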
Could you check the contents of /proc/spl/kmem/slab?
@behlendorf yup. Okay, so... zpool import tank && zpool status 1
And I wrote a little script that does a diff -u on the kmem/slab output after a second:
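The script itself isn't shown; a minimal reconstruction might be:

```sh
#!/bin/sh
# snapshot the SPL slab statistics, wait a second, then diff the two
# snapshots to see which kmem caches are growing
cat /proc/spl/kmem/slab > /tmp/slab.1
sleep 1
cat /proc/spl/kmem/slab > /tmp/slab.2
diff -u /tmp/slab.1 /tmp/slab.2
```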
I'm running the same test on an AWS r6g.4xlarge instance with 12x 500GB gp2 EBS volumes just to make sure it's not some weird Ampere Altra thing. I'm pretty sure they're both Neoverse N1 based:
Then I'll try on x86_64.
Well, that seems about right. The one atypical thing that comes to mind is the system page size; what page size are these kernels using?
Before I move to x86_64, testing on the AWS Graviton 2 shows the same issue:
This instance type has 128GB of RAM instead of the 96 on OCI, but it runs out of memory the same way. All I did on this new instance was boot it up, install zfs, create the pool and fs, run the fio script, then run scrub, using the official RHEL 8.4 AMI: RHEL-8.4.0_HVM-20210504-arm64-2-Hourly2-GP2. Happy to provide you guys with an Oracle or AWS arm64 instance to play around with if you'd like. You can create a new ssh key pair and send me the pub key and I can set it up.
Alright, so on an r5.4xlarge instance with 16 of these:
And 128GB of RAM... the scrub completes successfully and zfs never uses more than 3.5GB of RAM. I even imported the pool from the arm64 instance just to test with the exact same on-disk data.
So... that's fun. :)
I'll test with rc6 on arm64 just in case of magic.
The other potential culprit is building from the ./configure script with no CFLAGS vs rebuilding the RPMs with the normal system optimisations. I'll play around with that as well.
Okay, so... this is with 2.1.0-rc6, on a new r6g.4xlarge instance with 16 Graviton 2 cores and 128GB of RAM. It died. I installed all the deps:
Ran configure with no flags:
Ran make, and in another terminal ran ps to check which flags were getting passed to gcc:
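One way to do that check (a sketch, not necessarily the original command):

```sh
# grab one in-flight compiler invocation during the build to inspect its flags
ps -eo args | grep -E '[c]c1|[g]cc' | head -n 1
```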
It's just using the same flags the kernel was built with by RH:
This time, instead of watching the slab info, I watched /proc/meminfo. I ran zpool scrub tank && zpool status 1:
And watching meminfo:
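The watch loop isn't shown; one way to track the relevant counters (the field selection is my assumption):

```sh
# refresh the memory and slab counters from /proc/meminfo every second
watch -n1 'grep -E "MemFree|MemAvailable|Slab|SReclaimable|SUnreclaim" /proc/meminfo'
```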
Is there a "debugging ZFS on weird architectures" document somewhere? :) |
I don't think AArch64 qualifies as that odd, personally. FYI, you can use make V=1 [other args] to convince make to tell you what it's doing. (This will, necessarily, be a lot of text.) I think OpenZFS specifies almost no per-arch flags, I believe it gets (nearly?) all of them from the compile flags in the kernel Makefile. So if you want to experiment with the flags the modules get built with, I think there's only so much you can do without nontrivial work. (If distros vary the flags used to build the kernels significantly, which I don't know, not having ever had occasion to look, you could try another distro and see if the behavior varies.)
Yeah, I mean, I just wanted to show I wasn't doing anything weird. I trust RH knows what they're doing since the rest of the system works. I feel like most people wouldn't run anything apart from the official RH kernel on production workloads so... I'm not sure what the next step is here. I haven't done kernel development since like 2005 so I need to reactivate that part of my brain. lol. I can try the Oracle 5.4.x kernel since that's pretty easy to test on RHEL. Might as well.
I didn't mean to suggest RH was doing something wrong, just that since I didn't see anything obviously special-casing arm64 handling, I was wondering about flag-induced behavior. @behlendorf above wondered about the system page size - I have never had to look at this before, so I just looked it up, and it seems the page size on AArch64 is a kernel build-time choice. As an experiment, I'll try booting up an AArch64 VM and see if I can easily repro this...
No worries. :) I just tested it on:
Which is the latest Oracle-for-RHEL kernel. It died the same way.
Oracle uses slightly different CFLAGS to build their kernels, but it doesn't seem to matter:
The pagesize on OCI A1:
The pagesize on the AWS Graviton 2:
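Both were presumably queried the same way:

```sh
# print the kernel page size in bytes; RHEL aarch64 kernels use 65536,
# while typical x86_64 (and Debian aarch64) kernels use 4096
getconf PAGESIZE
```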
The Neoverse N1 tech manual (https://documentation-service.arm.com/static/5f561d50235b3560a01e03b5?token=) says:
https://www.kernel.org/doc/html/latest/arm64/memory.html has a good rundown of the page sizes on AArch64/Linux. It seems Debian uses a 4K page size on AArch64 while RHEL uses a 64K page size.
Found it. Issue #11574 describes this same issue with a 64K page size kernel.
Thanks @behlendorf. Now that I know what to focus on, I can take a look at the code. For whatever reason, when it locks up on AWS the instance becomes completely unresponsive and is unsalvageable; the only option is to terminate the entire instance. On OCI it's at least rebootable. And with the RH 4.18.0-305 kernel it even reboots itself, which is nice.
We may want to fine-tune this a bit, but here's what I'm currently thinking would be a reasonable fix. Basically, if the page size was anything other than 4K we'd always fall back to using the SPL's kmem implementation, which requires page alignment and was causing the memory inflation. We were effectively wasting the majority of every page we allocated. If you can verify this resolves the issue I'll open a PR and we can go from there.

diff --git a/module/os/linux/spl/spl-kmem-cache.c b/module/os/linux/spl/spl-kmem-cache.c
index 6b3d559ff..4b7867b7e 100644
--- a/module/os/linux/spl/spl-kmem-cache.c
+++ b/module/os/linux/spl/spl-kmem-cache.c
@@ -100,12 +100,13 @@ MODULE_PARM_DESC(spl_kmem_cache_max_size, "Maximum size of slab in MB");
* For small objects the Linux slab allocator should be used to make the most
* efficient use of the memory. However, large objects are not supported by
* the Linux slab and therefore the SPL implementation is preferred. A cutoff
- * of 16K was determined to be optimal for architectures using 4K pages.
+ * of 16K was determined to be optimal for architectures using 4K pages. For
+ * larger page sizes set the cutoff at a single page.
*/
-#if PAGE_SIZE == 4096
+#if PAGE_SIZE <= 16384
unsigned int spl_kmem_cache_slab_limit = 16384;
#else
-unsigned int spl_kmem_cache_slab_limit = 0;
+unsigned int spl_kmem_cache_slab_limit = PAGE_SIZE;
#endif
module_param(spl_kmem_cache_slab_limit, uint, 0644);
MODULE_PARM_DESC(spl_kmem_cache_slab_limit,
    "Objects less than this size are allocated from the Linux slab");
For small objects the kernel's slab implementation is very fast and space efficient. However, as the allocation size increases to require multiple pages performance suffers. The SPL kmem cache allocator was designed to better handle these large allocation sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers prefer to use the kernel's slab allocator for small objects and the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using a non-4K page size, which caused all kmem caches to only use the SPL implementation. Functionally this is fine, but the SPL code which calculates the target number of objects per slab does not take into account that __vmalloc() always returns page-aligned memory. This can result in a massive amount of wasted space when allocating tiny objects on a platform using large pages (64K).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff to PAGE_SIZE on systems using larger pages. Since 16,384 bytes was experimentally determined to yield the best performance on 4K page systems, this is used as the cutoff. This means on 4K page systems there is no functional change.

This particular change does not attempt to update the logic used to calculate the optimal number of pages per slab. This remains an issue which should be addressed in a future change.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
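To put a rough number on that waste, here's a purely illustrative calculation (the 16K payload mirrors the slab-limit cutoff; treating each allocation as rounded up to whole pages is a simplification of the actual SPL slab sizing):

```sh
# compare page-rounded allocation size vs payload on 4K and 64K pages
payload=16384                     # hypothetical slab payload in bytes
for page in 4096 65536; do
    alloc=$(( (payload + page - 1) / page * page ))
    echo "page=${page}: ${alloc} bytes allocated for ${payload} bytes" \
         "($(( (alloc - payload) * 100 / alloc ))% wasted)"
done
```

On 4K pages the 16K payload fits exactly (0% wasted); on 64K pages three quarters of the allocation is lost to rounding.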
I've opened PR #12152 with the patch above and an explanation of the issue. I haven't actually tested it, however, so it'd be great if you could confirm it really does resolve the problem.
@behlendorf I'm testing it now. Okay, I did this by modifying the modprobe.d file for spl:
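The file contents aren't shown; presumably something along these lines (the value matches the limit under test):

```
# /etc/modprobe.d/spl.conf -- set the parameter at module load time
options spl spl_kmem_cache_slab_limit=16384
```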
Just echoing into that sysfs file didn't work at first; I had to rmmod all the zfs modules and modprobe again. With that change it doesn't run out of RAM. I'll try a couple of other values just to make sure that's the best one. But at the very least... it no longer crashes. Awesome. :)
That's right. You just need to make sure it's set before importing the pool.
@behlendorf Cool. So, I did some testing of various values of spl_kmem_cache_slab_limit.
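A sweep like that can be scripted; a sketch of how it might be driven (the candidate values and pool name are assumptions), remembering that the limit has to be set before the pool is imported:

```sh
# scrub the pool once per candidate slab limit, exporting in between
for limit in 16384 65536 131072 262144; do
    echo "$limit" > /sys/module/spl/parameters/spl_kmem_cache_slab_limit
    zpool import tank
    zpool scrub tank
    zpool wait -t scrub tank    # block until the scrub completes
    zpool export tank
done
```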
Every value finished in the same time / at the same speed:
I ran 'vmstat 1' alongside each scrub and stopped it as soon as the scrub was complete. I wrote a little thing to aggregate the values across the run time for each limit I tested. I've put the output here: https://gist.github.com/omarkilani/346fb6ac8406fc0a51d0c267c3a31fa3 On the whole I don't think it makes any difference which value is chosen. 16k seems to have a lower system time but it's within a margin of error so I wouldn't put any stock in it. I think the PR is good to go.
I ran some Postgres benchmarks at the various limit levels, with 64k on the 64k page size kernel providing the best performance:
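The exact benchmark commands aren't shown; a representative pgbench run (scale and client counts are placeholders, not the original values):

```sh
createdb bench                      # assumes a running PostgreSQL instance
pgbench -i -s 1000 bench            # initialize tables at scale factor 1000
pgbench -c 16 -j 16 -T 300 bench    # 16 clients, 16 threads, 5 minutes
```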
One final test: an fio run with the following config:
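The job file itself isn't shown; a representative mixed random workload of the sort described (all parameters are assumptions):

```sh
# hypothetical stress job: 70/30 random read/write against the pool
fio --name=stress --directory=/tank --size=32g --bs=8k \
    --rw=randrw --rwmixread=70 --ioengine=libaio --iodepth=32 \
    --numjobs=8 --time_based --runtime=300 --group_reporting
```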
At 16k/64k/128k/256k. Outputs here: https://gist.github.com/omarkilani/dc8f6d167493e9b94fae7402de841ec4 64k and 16k look alright on the 64k page size kernel. Thanks for all your help @rincebrain and @behlendorf. Glad there was a solution in the end. :)
Ran a pgbench stress test on zfs with:
For small objects the kernel's slab implementation is very fast and space efficient. However, as the allocation size increases to require multiple pages performance suffers. The SPL kmem cache allocator was designed to better handle these large allocation sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers prefer to use the kernel's slab allocator for small objects and the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using a non-4K page size, which caused all kmem caches to only use the SPL implementation. Functionally this is fine, but the SPL code which calculates the target number of objects per slab does not take into account that __vmalloc() always returns page-aligned memory. This can result in a massive amount of wasted space when allocating tiny objects on a platform using large pages (64K).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff to 16K for all architectures.

This particular change does not attempt to update the logic used to calculate the optimal number of pages per slab. This remains an issue which should be addressed in a future change.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#12152
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
System information
Describe the problem you're observing
I was doing a stress test on a zfs pool running Postgres. I left it running overnight and came back to a locked-up VM. Nothing on the console from the lockup that I could see, but I suspect zfs was behind it.
When I rebooted the VM, I ran scrub on the pool. The machine ran out of memory in about 5 seconds, the OOM killer kicked in, and eventually the machine rebooted.
If I import the pool again the scrub kicks off again automatically and the machine runs out of memory again.
Will try 2.0.4 soon.
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs