-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce need for contiguous memory for ioctls #14474
Conversation
3a9d076
to
1e32d57
Compare
We've had cases where we trigger an OOM despite having memory freely available on the system. For example, here, we had about 21GB free: kernel: Node 0 Normal: 2418758*4kB (UME) 1549533*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22071296kB The problem being, all the memory is in 4K and 8K contiguous regions, but the allocation request was for a 16K contiguous region: kernel: SafeExecutors-4 invoked oom-killer: gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO), order=2, oom_score_adj=0 The offending allocation came from this call trace: kernel: Call Trace: kernel: dump_stack+0x57/0x7a kernel: dump_header+0x4f/0x1e1 kernel: oom_kill_process.cold.33+0xb/0x10 kernel: out_of_memory+0x1ad/0x490 kernel: __alloc_pages_slowpath+0xd55/0xe40 kernel: __alloc_pages_nodemask+0x2df/0x330 kernel: kmalloc_large_node+0x42/0x90 kernel: __kmalloc_node+0x25a/0x320 kernel: ? spl_kmem_free_impl+0x21/0x30 [spl] kernel: spl_kmem_alloc_impl+0xa5/0x100 [spl] kernel: spl_kmem_zalloc+0x19/0x20 [spl] kernel: zfsdev_ioctl+0x2b/0xe0 [zfs] kernel: do_vfs_ioctl+0xa9/0x640 kernel: ? __audit_syscall_entry+0xdd/0x130 kernel: ksys_ioctl+0x67/0x90 kernel: __x64_sys_ioctl+0x1a/0x20 kernel: do_syscall_64+0x5e/0x200 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 kernel: RIP: 0033:0x7fdca3674317 The problem is, for each ioctl that ZFS makes, it has to allocate a zfs_cmd_t structure, which is 13744 bytes in size (on my system): sdb> sizeof zfs_cmd (size_t)13744 This size, coupled with the fact that we currently allocate it with kmem_zalloc, means we need a 16K contiguous region of memory to satisfy the request. The solution taken by this change, is to use "vmem" instead of "kmem" to do the allocation, such that we don't necessarily need a contiguous 16K memory region to satisfy the allocation. Arguably, a better solution would be not to require such a large allocation to begin with (e.g. reduce the size of the zfs_cmd_t structure), but that'd be a much larger change than this "one liner". Thus, I've opted for this approach for now; we can always circle back and attempt to reduce the size of the structure in the future. Signed-off-by: Prakash Surya <[email protected]>
1e32d57
to
c62eaae
Compare
Can't we use kvzalloc() (+ kvfree()) to let the kernel decide where to allocate from? |
Do we have any existing use of |
that's neat.. didn't realize the kernel had support for this.. with that said, I'm still not sure if it makes sense to even try to use |
vmalloced area can't be DMA mapped and you can't allocate from it inside atomic context. Speaking of "regular" code, it's pointless to allocate small stuff like 100-200 bytes from the vmalloc zone as you will only waste more memory (vmallocs need to store more metadata per each allocation). I usually switch from kmalloc() to kvmalloc() when the size I want to allocate is 4 Kb or more, i.e. doesn't fit into one x86 page (4096). |
so, is it safe to say, your opinion would be to always use I don't necessarily disagree with the notion that we should use practically, I think my only hesitation from using |
My pref is (general-purpose data, no DMA needed): n < 4 Kb: kmalloc Here we have 16 Kb. Most of times it can be allocated from the contiguous zone, so there'll be no need for involving VM area. ATST, if it fails, we can rely on vmalloc() then instead of just getting nothing. So I'd go for kvmalloc() in this particular case. |
In the Linux SPL, If I recall correctly, Linux invented
There is no need to do that. The Linux SPL implements |
Thanks @ryao .. We should be good to go on this one, then. Just need approval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's surprising a 16K kmalloc()
would fail even with heavily fragmented memory. Since this is only a GFP_KERNEL
allocation the allocator should try progressively more aggressive steps to get the memory it needs (i.e. dropping page cache, ARC reclaim, memory compaction, etc). But if that all does fail resorting to vmalloc()
is totally reasonable, and since that's what the vmem_*
calls do on Linux this LGTM. On FreeBSD vmem_alloc
and kmem_alloc
are identical so the changes to freebsd/zfs/kmod_core.c
don't do anything, but I like the idea of keeping the code as similar as possible.
We've had cases where we trigger an OOM despite having memory freely available on the system. For example, here, we had about 21GB free: kernel: Node 0 Normal: 2418758*4kB (UME) 1549533*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22071296kB The problem being, all the memory is in 4K and 8K contiguous regions, but the allocation request was for a 16K contiguous region: kernel: SafeExecutors-4 invoked oom-killer: gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO), order=2, oom_score_adj=0 The offending allocation came from this call trace: kernel: Call Trace: kernel: dump_stack+0x57/0x7a kernel: dump_header+0x4f/0x1e1 kernel: oom_kill_process.cold.33+0xb/0x10 kernel: out_of_memory+0x1ad/0x490 kernel: __alloc_pages_slowpath+0xd55/0xe40 kernel: __alloc_pages_nodemask+0x2df/0x330 kernel: kmalloc_large_node+0x42/0x90 kernel: __kmalloc_node+0x25a/0x320 kernel: ? spl_kmem_free_impl+0x21/0x30 [spl] kernel: spl_kmem_alloc_impl+0xa5/0x100 [spl] kernel: spl_kmem_zalloc+0x19/0x20 [spl] kernel: zfsdev_ioctl+0x2b/0xe0 [zfs] kernel: do_vfs_ioctl+0xa9/0x640 kernel: ? __audit_syscall_entry+0xdd/0x130 kernel: ksys_ioctl+0x67/0x90 kernel: __x64_sys_ioctl+0x1a/0x20 kernel: do_syscall_64+0x5e/0x200 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 kernel: RIP: 0033:0x7fdca3674317 The problem is, for each ioctl that ZFS makes, it has to allocate a zfs_cmd_t structure, which is 13744 bytes in size (on my system): sdb> sizeof zfs_cmd (size_t)13744 This size, coupled with the fact that we currently allocate it with kmem_zalloc, means we need a 16K contiguous region of memory to satisfy the request. The solution taken by this change, is to use "vmem" instead of "kmem" to do the allocation, such that we don't necessarily need a contiguous 16K memory region to satisfy the allocation. Arguably, a better solution would be not to require such a large allocation to begin with (e.g. reduce the size of the zfs_cmd_t structure), but that'd be a much larger change than this "one liner". Thus, I've opted for this approach for now; we can always circle back and attempt to reduce the size of the structure in the future. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Mark Maybee <[email protected]> Reviewed-by: Don Brady <[email protected]> Signed-off-by: Prakash Surya <[email protected]> Closes openzfs#14474
We've had cases where we trigger an OOM despite having memory freely available on the system. For example, here, we had about 21GB free:
The problem being, all the memory is in 4K and 8K contiguous regions, but the allocation request was for a 16K contiguous region:
The offending allocation came from this call trace:
The problem is, for each ioctl that ZFS makes, it has to allocate a zfs_cmd_t structure, which is 13744 bytes in size (on my system):
This size, coupled with the fact that we currently allocate it with kmem_zalloc, means we need a 16K contiguous region of memory to satisfy the request.
The solution taken by this change, is to use "vmem" instead of "kmem" to do the allocation, such that we don't necessarily need a contiguous 16K memory region to satisfy the allocation.
Arguably, a better solution would be not to require such a large allocation to begin with (e.g. reduce the size of the zfs_cmd_t structure), but that'd be a much larger change than this "one liner". Thus, I've opted for this approach for now; we can always circle back and attempt to reduce the size of the structure in the future.
Types of changes
Checklist:
Signed-off-by
.