-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Severe contention in get_vmalloc_info() #2705
Comments
FYI: https://lkml.org/lkml/2014/6/10/817 so it is in Linus' tree starting with 3.17 and should be soon (or already is) in 3.12 if the implementation in ZFS takes rather long - a way to improve IOPS would be to request inclusion of the change in the relevant & used stable kernels |
@kernelOfTruth Thanks for looking into the status of that patch. Hearing that it already is in 3.17 is good news. |
This seems to be a regression caused by a previous iteration of openzfs/spl#369 and justifies @behlendorf's request that I change the patches to move away from vmalloc() for small allocations in instances where KM_SLEEP is used. I just had this problem again. This time, I took advantage of vmallocinfo to get a dump of the allocations:
The majority of allocations are from the SPL, of which the overwhelming majority look like this:
It looks like we allocated 8192 bytes for precisely 8192 bytes. These 8192 byte allocation should be going to kmalloc(), but are not because of this:
This error on my part has been very educational. I am closing this because it is clearly not a bug in ZoL's code. However, it is useful information for me as I refresh openzfs/spl#369. |
This patch achieves the following goals: 1. It replaces the preprocessor kmem flag to gfp flag mapping with proper translation logic. This eliminates the potential for surprises that were previously possible where kmem flags were mapped to gfp flags. 2. It maps vmem_alloc() allocations to kmem_alloc() for allocations sized less than or equal to spl_kmem_alloc_max. This ensures that small allocations will not contend on a single global lock, large allocations can still be handled, and potentially limited virtual address space will not be squandered. This behavior is entirely different than under Illumos due to different memory management stratagies employed by the respective kernels. However, this functionally provides the sematics we require. 3. The --disable-debug-kmem, --enable-debug-kmem (default), and --enable-debug-kmem-tracking allocators have been unified in to a single spl_kmem_alloc_impl() allocation function. This was done to simplify the code and make it more maintainable. 4. Improve portability by exposing an implementation of the memory allocations functions that can be safely used in the same way they are used on Illumos. Specifically, callers may safely using KM_SLEEP in contexts which perform filesystem IO. This allows us to eliminate an entire class of Linux specific changes which were previously required to avoid deadlocking the system. This change will be largely transparent to existing callers by there are a few caveats: 1. Because the headers were refactored it extraneous includes removed callers may find they need to explicitly add additional #includes. In particular, kmem_cache.h must now be explicitly includes to access the SPL's kmem cache implementation. This behavior is different from Illumos but it was done to avoid always masking the Linux slab functions when kmem.h is included. 2. Callers, like Lustre, which made assumptions about the definitions of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated. Other callers such as ZFS which did not will not require changes. 3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO. It retains its original meaning of allowing allocations to access reserved memory. KM_PUSHPAGE callers can be converted back to KM_SLEEP. 4. The KM_NODEBUG flags has been retired and the default warning threshold increased to 32k. 5. The kmem_virt() functions has been removed. For callers which need to distinguish between a physical and virtual address use is_vmalloc_addr().
Certain workloads have horrible contention in
get_vmalloc_info()
on vmap_area_lock, which is taken for both kernel virtual memory allocations and/proc/meminfo
accounting. This was previously a reader-writer lock on older kernels such as RHEL6's kernel. It is a spinlock on modern kernels. In the case of allocations, we want to add an item to the list while in the case of accounting, we want to traverse the list. When ZoL is in use, this linked list contains hundreds of thousands of items, which causes us to serialize while we spend excessive time traversing a linked list and in spinlocks waiting for that linked list.On my workstation, I see more than half of CPU time is spent in the kernel, with at least half of the time spent in the kernel spent in
get_vmalloc_info()
. The contention on memory allocations limits our IOPS potential and could explain rare random IO lags in the 10s of milliseconds when doing IO operations. SLAB allocations reduce the need to modify the list somewhat, but I suspect we could do better by doing something about this contention. I sent a patch tot he LKML earlier this year to switch the list to RCU (which avoided the contention). However, I did not elaborate that it helped ZoL lest it be rejected for not having a clear in-kernel consumer. Someone else reimplemented it in a way that was more agreeable to merging, but so far, that has not been merged.This could explain why setting
spl_kmem_cache_slab_limit
on the SPL helps some workloads. This might be an argument for merging #2129, but I am hopeful that we can eliminate the need forvmalloc
entirely by reworking the zio buffers to use pages.The text was updated successfully, but these errors were encountered: