Reduce need for contiguous memory for ioctls #14474

Merged: 1 commit merged into openzfs:master on Feb 14, 2023

Conversation

@prakashsurya (Member) commented Feb 8, 2023

We've had cases where we trigger an OOM despite having memory freely available on the system. For example, here, we had about 21GB free:

    kernel: Node 0 Normal: 2418758*4kB (UME) 1549533*8kB (UE) 0*16kB
    0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
    22071296kB

The problem being, all the memory is in 4K and 8K contiguous regions, but the allocation request was for a 16K contiguous region:

    kernel: SafeExecutors-4 invoked oom-killer:
    gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO),
    order=2, oom_score_adj=0

The offending allocation came from this call trace:

    kernel: Call Trace:
    kernel:  dump_stack+0x57/0x7a
    kernel:  dump_header+0x4f/0x1e1
    kernel:  oom_kill_process.cold.33+0xb/0x10
    kernel:  out_of_memory+0x1ad/0x490
    kernel:  __alloc_pages_slowpath+0xd55/0xe40
    kernel:  __alloc_pages_nodemask+0x2df/0x330
    kernel:  kmalloc_large_node+0x42/0x90
    kernel:  __kmalloc_node+0x25a/0x320
    kernel:  ? spl_kmem_free_impl+0x21/0x30 [spl]
    kernel:  spl_kmem_alloc_impl+0xa5/0x100 [spl]
    kernel:  spl_kmem_zalloc+0x19/0x20 [spl]
    kernel:  zfsdev_ioctl+0x2b/0xe0 [zfs]
    kernel:  do_vfs_ioctl+0xa9/0x640
    kernel:  ? __audit_syscall_entry+0xdd/0x130
    kernel:  ksys_ioctl+0x67/0x90
    kernel:  __x64_sys_ioctl+0x1a/0x20
    kernel:  do_syscall_64+0x5e/0x200
    kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kernel: RIP: 0033:0x7fdca3674317

The problem is that, for each ioctl ZFS handles, it has to allocate a zfs_cmd_t structure, which is 13744 bytes in size (on my system):

    sdb> sizeof zfs_cmd
    (size_t)13744

This size, coupled with the fact that we currently allocate it with kmem_zalloc, means we need a 16K contiguous region of memory to satisfy the request (kmalloc rounds the 13744-byte request up to the next power-of-two size, 16384 bytes, i.e. the order-2, four-page allocation seen in the OOM report above).

The solution taken by this change is to use "vmem" instead of "kmem" to do the allocation, such that we don't necessarily need a contiguous 16K memory region to satisfy the allocation.

Arguably, a better solution would be not to require such a large allocation to begin with (e.g. reduce the size of the zfs_cmd_t structure), but that'd be a much larger change than this "one liner". Thus, I've opted for this approach for now; we can always circle back and attempt to reduce the size of the structure in the future.
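For concreteness, here is a sketch of the kind of "one liner" described above, in diff form. It is illustrative only, assuming an allocation site in zfsdev_ioctl() like the one shown; it is not quoted from the actual patch:

    	/*
    	 * Illustrative sketch, not the actual diff.  On Linux, the SPL's
    	 * vmem_zalloc() may fall back to vmalloc(), so the ~16K buffer no
    	 * longer has to be physically contiguous.
    	 */
    -	zc = kmem_zalloc(sizeof (zfs_cmd_t), KM_SLEEP);
    +	zc = vmem_zalloc(sizeof (zfs_cmd_t), KM_SLEEP);
    	...
    -	kmem_free(zc, sizeof (zfs_cmd_t));
    +	vmem_free(zc, sizeof (zfs_cmd_t));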

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@solbjorn (Contributor)

Can't we use kvzalloc() (+ kvfree()) to let the kernel decide where to allocate from?
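For reference, a minimal sketch of what that suggestion would look like at a hypothetical call site (illustrative only; not code from this PR, and the real error handling is omitted):

    #include <linux/mm.h>	/* kvzalloc(), kvfree() */

    	zfs_cmd_t *zc;

    	/* Tries kmalloc() first; falls back to vmalloc() on failure. */
    	zc = kvzalloc(sizeof (zfs_cmd_t), GFP_KERNEL);
    	if (zc == NULL)
    		return (-ENOMEM);

    	/* ... handle the ioctl ... */

    	/* kvfree() works for both kmalloc()'d and vmalloc()'d memory. */
    	kvfree(zc);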

@prakashsurya (Member, Author)

Do we have any existing use of kvzalloc in the code base? I'm grep-ing for it, but nothing is coming up..

@prakashsurya (Member, Author) commented Feb 10, 2023

> If you are not sure whether the allocation size is too large for
> `kmalloc`, it is possible to use :c:func:`kvmalloc` and its
> derivatives. It will try to allocate memory with `kmalloc` and if the
> allocation fails it will be retried with `vmalloc`. There are
> restrictions on which GFP flags can be used with `kvmalloc`; please
> see :c:func:`kvmalloc_node` reference documentation. Note that
> `kvmalloc` may return memory that is not physically contiguous.

that's neat.. didn't realize the kernel had support for this.. with that said, I'm still not sure if it makes sense to even try to use kmalloc at all.. is there any good reason to use "kmem" when we can? vs. always using "vmem"?

@solbjorn (Contributor)

A vmalloc'ed area can't be DMA mapped, and you can't allocate from the vmalloc zone in atomic context. Speaking of "regular" code, it's pointless to allocate small stuff like 100-200 bytes from the vmalloc zone, as you will only waste more memory (vmalloc needs to store more metadata for each allocation). I usually switch from kmalloc() to kvmalloc() when the size I want to allocate is 4 KB or more, i.e. doesn't fit into one x86 page (4096 bytes).
When there is enough memory, the absolute limit for kmalloc() is 4 MB, but the situation you described may occur at any time, so using kvmalloc() here is the best solution to me.

@prakashsurya (Member, Author)

> When there is enough memory, the absolute limit for kmalloc() is 4 MB

so, is it safe to say, your opinion would be to always use kvmalloc unless you know the allocation will be more than 4 MB?

I don't necessarily disagree with the notion that we should use kvmalloc here, I'm just trying to understand when to use kvmalloc vs. vmalloc in general.. e.g. in this case, it'll always be a 16K allocation, so it seems like a potentially reasonable place to always use vmalloc..? or since it's still less than 4 MB, we should use kvmalloc, since it's always better to use kmalloc when it's possible to do so..?

practically, I think my only hesitation from using kvmalloc for this case, is the fact that I haven't seen it being used elsewhere in the codebase.. so I'm not sure if it's "allowed" (e.g. perhaps due to portability, or kernel version considerations).. maybe @behlendorf, or somebody more familiar could chime in.. personally, I'm OK either way..

@solbjorn (Contributor) commented Feb 10, 2023

My pref is (general-purpose data, no DMA needed):

n < 4 KB: kmalloc
4 KB <= n < 4 MB: kvmalloc
n >= 4 MB: vmalloc

Here we have 16 KB. Most of the time it can be allocated from the contiguous zone, so there will be no need to involve a VM area. At the same time, if that fails, we can fall back on vmalloc() instead of just getting nothing. So I'd go for kvmalloc() in this particular case.
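Expressed as code, that preference might look like the following helper. The function is hypothetical, written only to illustrate the thresholds above; it is not something proposed for the codebase:

    #include <linux/slab.h>		/* kmalloc() */
    #include <linux/mm.h>		/* kvmalloc() */
    #include <linux/vmalloc.h>	/* vmalloc() */

    /* Hypothetical helper encoding the size thresholds above. */
    static void *
    alloc_by_size(size_t n)
    {
    	if (n < 4096)			/* < 4 KB: slab only */
    		return (kmalloc(n, GFP_KERNEL));
    	if (n < (4UL << 20))		/* < 4 MB: try slab, fall back */
    		return (kvmalloc(n, GFP_KERNEL));
    	return (vmalloc(n));		/* >= 4 MB: straight to vmalloc */
    }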

@ryao (Contributor) commented Feb 12, 2023

> Can't we use kvzalloc() (+ kvfree()) to let the kernel decide where to allocate from?

In the Linux SPL, vmem_zalloc() will go to spl_kvmalloc(), which will try kmalloc_node() and fall back to spl_vmalloc() if it fails. There is no need to use kvzalloc(). The code is already doing what kvzalloc() does.

If I recall correctly, Linux invented kvzalloc() after the current vmem_zalloc() implementation was written. Not all of the kernels we support have kvzalloc().

> If you are not sure whether the allocation size is too large for
> `kmalloc`, it is possible to use :c:func:`kvmalloc` and its
> derivatives. It will try to allocate memory with `kmalloc` and if the
> allocation fails it will be retried with `vmalloc`. There are
> restrictions on which GFP flags can be used with `kvmalloc`; please
> see :c:func:`kvmalloc_node` reference documentation. Note that
> `kvmalloc` may return memory that is not physically contiguous.

> that's neat.. didn't realize the kernel had support for this.. with that said, I'm still not sure if it makes sense to even try to use kmalloc at all.. is there any good reason to use "kmem" when we can? vs. always using "vmem"?

There is no need to do that. The Linux SPL implements vmem_zalloc() in a way that is equivalent to kvzalloc(). It was necessary to do it this way to avoid deficiencies in Linux's kernel virtual memory.
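A simplified sketch of the fallback behavior described here, paraphrasing the Linux SPL rather than quoting it (the real spl_kvmalloc() also deals with flag and size constraints, and the two-argument __vmalloc() assumes a kernel >= 5.8):

    #include <linux/slab.h>		/* kmalloc_node() */
    #include <linux/vmalloc.h>	/* __vmalloc() */

    /*
     * Paraphrase of the Linux SPL behavior: attempt a physically
     * contiguous allocation without retrying hard or warning, and
     * fall back to vmalloc-backed memory if that fails.
     */
    static void *
    kvmalloc_sketch(size_t size, gfp_t flags)
    {
    	void *ptr = kmalloc_node(size,
    	    flags | __GFP_NOWARN | __GFP_NORETRY, NUMA_NO_NODE);

    	if (ptr != NULL)
    		return (ptr);

    	/* No contiguous region of that size; use virtual memory. */
    	return (__vmalloc(size, flags));
    }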

@prakashsurya (Member, Author)

> The Linux SPL implements vmem_zalloc() in a way that is equivalent to kvzalloc()

Thanks @ryao .. We should be good to go on this one, then. Just need approval.

@behlendorf (Contributor) left a comment

It's surprising a 16K kmalloc() would fail even with heavily fragmented memory. Since this is only a GFP_KERNEL allocation the allocator should try progressively more aggressive steps to get the memory it needs (i.e. dropping page cache, ARC reclaim, memory compaction, etc). But if that all does fail resorting to vmalloc() is totally reasonable, and since that's what the vmem_* calls do on Linux this LGTM. On FreeBSD vmem_alloc and kmem_alloc are identical so the changes to freebsd/zfs/kmod_core.c don't do anything, but I like the idea of keeping the code as similar as possible.

@behlendorf added the "Status: Accepted" label (Ready to integrate: reviewed, tested) on Feb 13, 2023
@behlendorf merged commit 13312e2 into openzfs:master on Feb 14, 2023
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 3, 2023
The commit carries the same message as the PR description above, with the following trailers:
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: Prakash Surya <[email protected]>
Closes openzfs#14474