Linux: Set spl_kmem_cache_slab_limit when page size !4K #12152

Merged
merged 1 commit into from
Jun 3, 2021

Conversation

behlendorf
Contributor

Motivation and Context

It was observed on systems using a 64k page size that running
zpool scrub could exhaust memory and trigger the OOM killer.
This should never be possible.

Issues #11429 #11574 #12150

Description

For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take into account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).
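
To make the waste concrete, here is a small illustrative calculation
(the object size and objects-per-slab target are hypothetical values,
not the actual SPL numbers): once __vmalloc() rounds the slab up to a
whole page, a handful of tiny objects still consumes a full 64K.

    /* Illustrative only: hypothetical sizes, not the real SPL math. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long page_size = 64 * 1024; /* 64K pages */
        const unsigned long obj_size = 96;         /* tiny object */
        const unsigned long objs_per_slab = 8;     /* hypothetical target */
        unsigned long used = obj_size * objs_per_slab;
        /* __vmalloc() hands back whole, page-aligned pages. */
        unsigned long alloc = ((used + page_size - 1) / page_size) * page_size;

        printf("used %lu of %lu bytes (%.1f%% wasted)\n",
            used, alloc, 100.0 * (alloc - used) / alloc);
        return 0;
    }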

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to PAGE_SIZE on systems using larger pages. Since 16,384 bytes was
experimentally determined to yield the best performance on 4K page
systems, that value remains the cutoff there. This means on 4K page
systems there is no functional change.

This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.
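
As a rough sketch of the compile-time selection described above (not
the actual SPL source, which may differ; see the discussion below,
where the cutoff was ultimately changed to a flat 16K on all
architectures):

    /* Sketch only: approximate shape of the cutoff selection described
     * in this PR, not the actual SPL code. */
    #include <linux/mm.h>   /* PAGE_SIZE, PAGE_SHIFT */

    #if PAGE_SHIFT == 12    /* 4K pages */
    /* 16K was experimentally determined to perform best on 4K-page systems. */
    static unsigned int spl_kmem_cache_slab_limit = 16384;
    #else
    /* On larger pages, cap the limit at one page so small objects stay on
     * the kernel's slab rather than in page-aligned __vmalloc() slabs. */
    static unsigned int spl_kmem_cache_slab_limit = PAGE_SIZE;
    #endif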

How Has This Been Tested?

Locally compiled but it has not yet been tested on a system
using a large page size.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added Type: Architecture Indicates an issue is specific to a single processor architecture Status: Code Review Needed Ready for review and testing labels May 29, 2021
@AttilaFueloep
Contributor

AttilaFueloep commented May 29, 2021

Wouldn't it be better to make this a module load time (runtime) selection rather than a compile time one? AFAIK AArch64 Linux supports 4K and 16K pages as well.

@behlendorf
Contributor Author

@AttilaFueloep 4k, 16k, and 64k pages are all supported for aarch64, but a specific page size needs to be selected when compiling the kernel. So there shouldn't be an issue setting this at compile time.

    CONFIG_<ARCH>_4K_PAGES
    CONFIG_<ARCH>_16K_PAGES
    CONFIG_<ARCH>_64K_PAGES

@AttilaFueloep
Contributor

Yes, but what happens if you compile the module on a box with a 4K page size kernel and then load it on a box with a 64K kernel? Then there is a discrepancy between the compile-time page size and the page size of the running kernel. Or am I missing something here?

@AttilaFueloep
Contributor

I think we should at least detect page size mismatches and refuse to load the module if so.

@behlendorf
Contributor Author

I think we should at least detect page size mismatches and refuse to load the module if so.

I believe this should already be the case even without additional changes. Though I'm not having much luck finding any documentation to confirm this.

@behlendorf
Contributor Author

@omarkilani thanks for the testing. I opted for setting the cutoff at the page size because the kernel's slab implementation(s) should allocate a single page per slab. As long as the object size is less than the page size, I'd expect the kernel's slab to be both faster and more space efficient since it's highly optimized. Your testing seems to back that up. Where that story changes a bit is when allocations start requiring multiple pages.

@AttilaFueloep
Contributor

I believe this should already be the case even without additional changes. Though I'm not having much luck finding any documentation to confirm this.

Yeah, I couldn't find any documentation either, and even grepping the Linux and kmod sources didn't bring up anything revealing. Unfortunately I don't have a box supporting multiple page sizes to test this on. So I'll take this for granted; it seems to be quite a corner case anyway.

As a side note, not directly related to your change, the mixed usage of PAGESIZE and PAGE_SIZE I see in the sources puzzles me a bit. Skimming the sources, my understanding is that PAGESIZE is defined as PAGE_SIZE in kernel space and as sysconf(_SC_PAGESIZE) in user space (ztest). If that's true, wouldn't using PAGE_SIZE outside of #ifdef _KERNEL blocks defeat that approach, and shouldn't PAGESIZE always be preferred over PAGE_SIZE? Further, I don't see the same handling for PAGE_SHIFT in user space outside of abd_os.c. I'm not sure if all of this could lead to some obscure ztest failures in the page size mismatch case.

@behlendorf
Contributor Author

the mixed usage of PAGESIZE and PAGE_SIZE I see in the sources puzzles me a bit.

The mixed usage comes from the fact that PAGESIZE is what was used on illumos, so you're right that it's the portable interface which should be used in all the common code. We could add the same wrappers for PAGESHIFT, but it turns out there were never any existing consumers in the common code so they never got added.

In the platform specific code (module/os/<platform>/) there really isn't any issue with using the native interfaces in the right contexts. In general, that's what we've tried to do but we don't have a hard and fast rule about it. In this case I opted for PAGE_SIZE since this is platform specific code which is only used by the kernel.
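
For illustration, the wrapper arrangement described above might look
roughly like this (a sketch only; the real definitions live in the
SPL/libspl headers and may differ):

    /* Sketch: PAGESIZE as the portable name, mapped to the native
     * interface in each context. Not the actual header contents. */
    #ifdef _KERNEL
    #include <linux/mm.h>
    #define PAGESIZE    PAGE_SIZE
    #else
    #include <unistd.h>
    #define PAGESIZE    ((size_t)sysconf(_SC_PAGESIZE))
    #endif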

@AttilaFueloep
Contributor

I see, thanks for the thorough explanation. No reason to change anything then.

@ahrens
Member

ahrens commented Jun 1, 2021

@behlendorf

I opted for setting the cutoff at the page size because the kernel's slab implementation(s) should allocate a single page per slab. As long as the object size is less than the page size I'd expect the kernel's slab to be both faster and more space efficient since it's highly optimized.

FYI that isn't true for caches with entries >=256 bytes, at least on PAGESIZE=4K systems:

sdb> slabs -o name,entry_size,entries_per_slab
name                        entry_size entries_per_slab
--------------------------- ---------- ----------------
kmalloc-96                          96               42 (1 page per slab)
kmalloc-128                        128               32 (1 page per slab)
kmalloc-192                        192               21 (1 page per slab)
kmalloc-256                        256               32 (2 pages per slab)
kmalloc-512                        512               32 (4 pages per slab)
kmalloc-1k                        1024               32 (8 pages per slab)
kmalloc-2k                        2048               16 (8 pages per slab)
kmalloc-4k                        4096                8 (8 pages per slab)
kmalloc-8k                        8192                4 (8 pages per slab)

Given the possibility of fragmentation, this aspect of the kernel's slab implementation seems like a bad design decision to me.

@behlendorf
Contributor Author

@ahrens yes, upon re-reading what I wrote I clearly left out more than a few important qualifiers. Looking at this again, I wonder if it wouldn't be preferable to simply set the cutoff at 16k for all architectures. Based on @omarkilani's testing any performance impact seems negligible. Thoughts?

@omarkilani

This is probably a dumb question but how does one run the slabs -o name,entry_size,entries_per_slab command?

Googling sdb is about as helpful as you'd expect.

@ahrens
Member

ahrens commented Jun 2, 2021

@omarkilani

I spent most of the day rebuilding kernels and dependencies to get drgn working and then I realised... I could just cat /proc/slabinfo. lol.

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
kmalloc-128k          28     28 131072    4    8
kmalloc-64k          144    144  65536    8    8
kmalloc-32k          272    272  32768   16    8
kmalloc-16k         2944   2944  16384   32    8
kmalloc-8k           608    608   8192   32    4
kmalloc-4k          3072   3072   4096   32    2
kmalloc-2k          3072   3072   2048   32    1
kmalloc-1k         25490  25536   1024   64    1
kmalloc-512        19741  23424    512  128    1
kmalloc-256        20647  21248    256  256    1
kmalloc-128       613203 613376    128  512    1

I did some more tests with pgbench and zpool scrub while monitoring /proc/slabinfo at 16k and 64k.

In both cases nothing above kmalloc-4k gets used. And that only gets used during a zpool scrub. During pgbench nothing above kmalloc-1k gets used.

pgbench averages ~15,500tps at 64k and 15,000tps at 16k.

I ran pgbench 10 times under each limit setting and the better performance using 64k is small but reproducible.

IMHO, 16k is probably fine for a default setting if you're worried about fragmentation. I'm not sure it's the "optimal" setting for all workloads, but nothing bad's going to happen and people can just up the limit for Postgres or whatever.

Anything's better than 0 and certain OOM/crash.
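
For reference, since spl_kmem_cache_slab_limit is an SPL module parameter,
raising the limit for such a workload should be possible at module load
time, for example (illustrative value; verify the exact parameter exposure
on your system):

    options spl spl_kmem_cache_slab_limit=65536   # e.g. /etc/modprobe.d/spl.conf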


@tonynguien tonynguien left a comment

Change looks good. Thanks for fixing the issue.

For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take into account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to PAGE_SIZE on systems using larger pages. Since 16,384 bytes was
experimentally determined to yield the best performance on 4K page
systems, that value remains the cutoff there. This means on 4K page
systems there is no functional change.

This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
@behlendorf
Contributor Author

I've gone ahead and updated the PR to always use a 16K cutoff. This keeps the behavior the same across platforms and should help mitigate any issues we might otherwise end up seeing with fragmentation.

@tonynguien tonynguien added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jun 3, 2021
@tonynguien tonynguien merged commit 7837845 into openzfs:master Jun 3, 2021
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jun 3, 2021
For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take into account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to 16K for all architectures. 

This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#12152
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jun 4, 2021
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jun 8, 2021
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jun 9, 2021
tonyhutter pushed a commit that referenced this pull request Jun 23, 2021