Reduce need for contiguous memory for ioctls #14474

Merged: 1 commit merged into openzfs:master on Feb 14, 2023

Conversation

@prakashsurya (Member) commented Feb 8, 2023

We've had cases where we trigger an OOM despite having memory freely available on the system. For example, here, we had about 21GB free:

    kernel: Node 0 Normal: 2418758*4kB (UME) 1549533*8kB (UE) 0*16kB
    0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
    22071296kB

The problem being, all the memory is in 4K and 8K contiguous regions, but the allocation request was for a 16K contiguous region:

    kernel: SafeExecutors-4 invoked oom-killer:
    gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO),
    order=2, oom_score_adj=0

The offending allocation came from this call trace:

    kernel: Call Trace:
    kernel:  dump_stack+0x57/0x7a
    kernel:  dump_header+0x4f/0x1e1
    kernel:  oom_kill_process.cold.33+0xb/0x10
    kernel:  out_of_memory+0x1ad/0x490
    kernel:  __alloc_pages_slowpath+0xd55/0xe40
    kernel:  __alloc_pages_nodemask+0x2df/0x330
    kernel:  kmalloc_large_node+0x42/0x90
    kernel:  __kmalloc_node+0x25a/0x320
    kernel:  ? spl_kmem_free_impl+0x21/0x30 [spl]
    kernel:  spl_kmem_alloc_impl+0xa5/0x100 [spl]
    kernel:  spl_kmem_zalloc+0x19/0x20 [spl]
    kernel:  zfsdev_ioctl+0x2b/0xe0 [zfs]
    kernel:  do_vfs_ioctl+0xa9/0x640
    kernel:  ? __audit_syscall_entry+0xdd/0x130
    kernel:  ksys_ioctl+0x67/0x90
    kernel:  __x64_sys_ioctl+0x1a/0x20
    kernel:  do_syscall_64+0x5e/0x200
    kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kernel: RIP: 0033:0x7fdca3674317

The problem is that, for each ioctl ZFS handles, it has to allocate a zfs_cmd_t structure, which is 13744 bytes in size (on my system):

    sdb> sizeof zfs_cmd
    (size_t)13744

This size, coupled with the fact that we currently allocate it with kmem_zalloc, means we need a 16K contiguous region of memory to satisfy the request (kmalloc rounds the 13744-byte request up to the next power-of-two size, 16384 bytes, i.e. the order-2, four-page allocation seen in the OOM report above).

The solution taken by this change is to use "vmem" instead of "kmem" to do the allocation, such that we don't necessarily need a contiguous 16K memory region to satisfy the allocation.

Arguably, a better solution would be not to require such a large allocation to begin with (e.g. reduce the size of the zfs_cmd_t structure), but that'd be a much larger change than this "one liner". Thus, I've opted for this approach for now; we can always circle back and attempt to reduce the size of the structure in the future.
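For concreteness, here is a sketch of the kind of "one liner" described above, in diff form. It is illustrative only, assuming an allocation site in zfsdev_ioctl() like the one shown; it is not quoted from the actual patch:

    	/*
    	 * Illustrative sketch, not the actual diff.  On Linux, the SPL's
    	 * vmem_zalloc() may fall back to vmalloc(), so the ~16K buffer no
    	 * longer has to be physically contiguous.
    	 */
    -	zc = kmem_zalloc(sizeof (zfs_cmd_t), KM_SLEEP);
    +	zc = vmem_zalloc(sizeof (zfs_cmd_t), KM_SLEEP);
    	...
    -	kmem_free(zc, sizeof (zfs_cmd_t));
    +	vmem_free(zc, sizeof (zfs_cmd_t));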

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@solbjorn (Contributor)

Can't we use kvzalloc() (+ kvfree()) to let the kernel decide where to allocate from?
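For reference, a minimal sketch of what that suggestion would look like at a hypothetical call site (illustrative only; not code from this PR, and the real error handling is omitted):

    #include <linux/mm.h>	/* kvzalloc(), kvfree() */

    	zfs_cmd_t *zc;

    	/* Tries kmalloc() first; falls back to vmalloc() on failure. */
    	zc = kvzalloc(sizeof (zfs_cmd_t), GFP_KERNEL);
    	if (zc == NULL)
    		return (-ENOMEM);

    	/* ... handle the ioctl ... */

    	/* kvfree() works for both kmalloc()'d and vmalloc()'d memory. */
    	kvfree(zc);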

@prakashsurya (Member, Author)

Do we have any existing use of kvzalloc in the code base? I'm grep-ing for it, but nothing is coming up..

@prakashsurya (Member, Author) commented Feb 10, 2023

> If you are not sure whether the allocation size is too large for
> `kmalloc`, it is possible to use :c:func:`kvmalloc` and its
> derivatives. It will try to allocate memory with `kmalloc` and if the
> allocation fails it will be retried with `vmalloc`. There are
> restrictions on which GFP flags can be used with `kvmalloc`; please
> see :c:func:`kvmalloc_node` reference documentation. Note that
> `kvmalloc` may return memory that is not physically contiguous.

that's neat.. didn't realize the kernel had support for this.. with that said, I'm still not sure if it makes sense to even try to use kmalloc at all.. is there any good reason to use "kmem" when we can? vs. always using "vmem"?

@solbjorn (Contributor)

A vmalloc'ed area can't be DMA mapped, and you can't allocate from the vmalloc zone in atomic context. Speaking of "regular" code, it's pointless to allocate small stuff like 100-200 bytes from the vmalloc zone, as you will only waste more memory (vmalloc needs to store more metadata for each allocation). I usually switch from kmalloc() to kvmalloc() when the size I want to allocate is 4 KB or more, i.e. doesn't fit into one x86 page (4096 bytes).
When there is enough memory, the absolute limit for kmalloc() is 4 MB, but the situation you described may occur at any time, so using kvmalloc() here is the best solution to me.

@prakashsurya (Member, Author)

> When there is enough memory, the absolute limit for kmalloc() is 4 MB

so, is it safe to say, your opinion would be to always use kvmalloc unless you know the allocation will be more than 4 MB?

I don't necessarily disagree with the notion that we should use kvmalloc here, I'm just trying to understand when to use kvmalloc vs. vmalloc in general.. e.g. in this case, it'll always be a 16K allocation, so it seems like a potentially reasonable place to always use vmalloc..? or since it's still less than 4 MB, we should use kvmalloc, since it's always better to use kmalloc when it's possible to do so..?

practically, I think my only hesitation from using kvmalloc for this case, is the fact that I haven't seen it being used elsewhere in the codebase.. so I'm not sure if it's "allowed" (e.g. perhaps due to portability, or kernel version considerations).. maybe @behlendorf, or somebody more familiar could chime in.. personally, I'm OK either way..

@solbjorn (Contributor) commented Feb 10, 2023

My pref is (general-purpose data, no DMA needed):

n < 4 KB: kmalloc
4 KB <= n < 4 MB: kvmalloc
n >= 4 MB: vmalloc

Here we have 16 KB. Most of the time it can be allocated from the contiguous zone, so there will be no need to involve a VM area. At the same time, if that fails, we can fall back on vmalloc() instead of just getting nothing. So I'd go for kvmalloc() in this particular case.
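Expressed as code, that preference might look like the following helper. The function is hypothetical, written only to illustrate the thresholds above; it is not something proposed for the codebase:

    #include <linux/slab.h>		/* kmalloc() */
    #include <linux/mm.h>		/* kvmalloc() */
    #include <linux/vmalloc.h>	/* vmalloc() */

    /* Hypothetical helper encoding the size thresholds above. */
    static void *
    alloc_by_size(size_t n)
    {
    	if (n < 4096)			/* < 4 KB: slab only */
    		return (kmalloc(n, GFP_KERNEL));
    	if (n < (4UL << 20))		/* < 4 MB: try slab, fall back */
    		return (kvmalloc(n, GFP_KERNEL));
    	return (vmalloc(n));		/* >= 4 MB: straight to vmalloc */
    }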

@ryao (Contributor) commented Feb 12, 2023

> Can't we use kvzalloc() (+ kvfree()) to let the kernel decide where to allocate from?

In the Linux SPL, vmem_zalloc() will go to spl_kvmalloc(), which will try kmalloc_node() and fall back to spl_vmalloc() if it fails. There is no need to use kvzalloc(). The code is already doing what kvzalloc() does.

If I recall correctly, Linux invented kvzalloc() after the current vmem_zalloc() implementation was written. Not all of the kernels we support have kvzalloc().

> If you are not sure whether the allocation size is too large for
> `kmalloc`, it is possible to use :c:func:`kvmalloc` and its
> derivatives. It will try to allocate memory with `kmalloc` and if the
> allocation fails it will be retried with `vmalloc`. There are
> restrictions on which GFP flags can be used with `kvmalloc`; please
> see :c:func:`kvmalloc_node` reference documentation. Note that
> `kvmalloc` may return memory that is not physically contiguous.

> that's neat.. didn't realize the kernel had support for this.. with that said, I'm still not sure if it makes sense to even try to use kmalloc at all.. is there any good reason to use "kmem" when we can? vs. always using "vmem"?

There is no need to do that. The Linux SPL implements vmem_zalloc() in a way that is equivalent to kvzalloc(). It was necessary to do it this way to avoid deficiencies in Linux's kernel virtual memory.
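A simplified sketch of the fallback behavior described here, paraphrasing the Linux SPL rather than quoting it (the real spl_kvmalloc() also deals with flag and size constraints, and the two-argument __vmalloc() assumes a kernel >= 5.8):

    #include <linux/slab.h>		/* kmalloc_node() */
    #include <linux/vmalloc.h>	/* __vmalloc() */

    /*
     * Paraphrase of the Linux SPL behavior: attempt a physically
     * contiguous allocation without retrying hard or warning, and
     * fall back to vmalloc-backed memory if that fails.
     */
    static void *
    kvmalloc_sketch(size_t size, gfp_t flags)
    {
    	void *ptr = kmalloc_node(size,
    	    flags | __GFP_NOWARN | __GFP_NORETRY, NUMA_NO_NODE);

    	if (ptr != NULL)
    		return (ptr);

    	/* No contiguous region of that size; use virtual memory. */
    	return (__vmalloc(size, flags));
    }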

@prakashsurya (Member, Author)

> The Linux SPL implements vmem_zalloc() in a way that is equivalent to kvzalloc()

Thanks @ryao .. We should be good to go on this one, then. Just need approval.

@behlendorf (Contributor) left a comment

It's surprising a 16K kmalloc() would fail even with heavily fragmented memory. Since this is only a GFP_KERNEL allocation the allocator should try progressively more aggressive steps to get the memory it needs (i.e. dropping page cache, ARC reclaim, memory compaction, etc). But if that all does fail resorting to vmalloc() is totally reasonable, and since that's what the vmem_* calls do on Linux this LGTM. On FreeBSD vmem_alloc and kmem_alloc are identical so the changes to freebsd/zfs/kmod_core.c don't do anything, but I like the idea of keeping the code as similar as possible.

@behlendorf added the "Status: Accepted" label (Ready to integrate: reviewed, tested) on Feb 13, 2023
@behlendorf merged commit 13312e2 into openzfs:master on Feb 14, 2023
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 3, 2023
The commit carries the same message as the PR description above, with the following trailers:
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Richard Yao <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Signed-off-by: Prakash Surya <[email protected]>
Closes openzfs#14474