Severe contention in get_vmalloc_info() #2705

Closed
ryao opened this issue Sep 16, 2014 · 3 comments

Comments

@ryao
Contributor

ryao commented Sep 16, 2014

Certain workloads have horrible contention in get_vmalloc_info() on vmap_area_lock, which is taken for both kernel virtual memory allocations and /proc/meminfo accounting. This was a reader-writer lock on older kernels such as RHEL6's kernel; it is a spinlock on modern kernels. Allocations want to add an item to the list, while accounting wants to traverse it. When ZoL is in use, this linked list contains hundreds of thousands of items, so everything serializes: we spend excessive time walking the list and spinning on the lock that protects it.
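
To make the serialization concrete, here is a simplified sketch of the in-tree code path (an illustration modeled on mm/vmalloc.c of that era, not the exact upstream source): every /proc/meminfo read walks the entire vmap_area_list under the same vmap_area_lock that vmalloc()/vfree() must take.

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/vmalloc.h>

    /*
     * Simplified illustration only: the real get_vmalloc_info() also
     * tracks the largest free chunk, and vmap_area_lock/vmap_area_list
     * are internals of mm/vmalloc.c rather than exported symbols.
     */
    static unsigned long vmalloc_used_sketch(void)
    {
            struct vmap_area *va;
            unsigned long used = 0;

            spin_lock(&vmap_area_lock);     /* same lock vmalloc()/vfree() take */
            list_for_each_entry(va, &vmap_area_list, list)
                    used += va->va_end - va->va_start;      /* O(n) in list length */
            spin_unlock(&vmap_area_lock);

            return used;
    }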

On my workstation, I see more than half of CPU time spent in the kernel, with at least half of that kernel time spent in get_vmalloc_info(). The contention on memory allocations limits our IOPS potential and could explain rare random IO lags in the tens of milliseconds. SLAB allocations reduce the need to modify the list somewhat, but I suspect we could do better by doing something about this contention. I sent a patch to the LKML earlier this year to switch the list to RCU (which avoided the contention). However, I did not elaborate that it helped ZoL lest it be rejected for not having a clear in-kernel consumer. Someone else reimplemented it in a way that was more agreeable to merging, but so far, that has not been merged.

This could explain why setting spl_kmem_cache_slab_limit on the SPL helps some workloads. This might be an argument for merging #2129, but I am hopeful that we can eliminate the need for vmalloc entirely by reworking the zio buffers to use pages.

@kernelOfTruth
Contributor

FYI:

https://lkml.org/lkml/2014/6/10/817
http://www.spinics.net/lists/stable/msg59229.html
http://www.codestuff.ch/codestuff/kernel/code/id_select_git_log_files_mm_linux.html

So it is in Linus' tree starting with 3.17 and should soon be (or already is) in the 3.12 stable series.

If the implementation in ZFS takes rather long, a way to improve IOPS in the meantime would be to request inclusion of the change in the relevant stable kernels that are in use.

@ryao
Contributor Author

ryao commented Sep 17, 2014

@kernelOfTruth Thanks for looking into the status of that patch. Hearing that it already is in 3.17 is good news.

@ryao
Contributor Author

ryao commented Nov 2, 2014

This seems to be a regression caused by a previous iteration of openzfs/spl#369 and justifies @behlendorf's request that I change the patches to move away from vmalloc() for small allocations in instances where KM_SLEEP is used. I just had this problem again. This time, I took advantage of vmallocinfo to get a dump of the allocations:

desktop ~ # cat /proc/vmallocinfo > /tmp/vmallocinfo-dump
desktop ~ # grep spl /tmp/vmallocinfo-dump | wc
 240391 1443241 23563753
desktop ~ # grep -v spl /tmp/vmallocinfo-dump | wc
    618    3114   60759
desktop ~ # grep spl /tmp/vmallocinfo-dump | awk ' NF > 0{ counts[$2] = counts[$2] + 1; } END { for (word in counts) print word, counts[word]; }' | sort --numeric                                                                                                             
8192 237797
12288 869
135168 1
528384 173
1052672 306
2101248 350
4198400 894
67112960 1

The majority of allocations are from the SPL, of which the overwhelming majority look like this:

0xffffc90000209000-0xffffc9000020b000    8192 spl_kmem_alloc_node+0xad/0xc0 [spl] pages=1 vmalloc

It looks like we allocated 8192 bytes for precisely 8192 bytes. These 8192-byte allocations should be going to kmalloc(), but are not, because of this:

        /*
         * XXX: node is taken as a hint for large allocations because
         * __vmalloc_node() is not exported for our use.
         */
        if (likely(size <= PAGE_SIZE*2)) {
                if ((flags & (KM_PUSHPAGE | KM_NOSLEEP)))
                        ptr = kmalloc_node(size, lflags, node);
                else
                        ptr = spl_vmalloc(size, lflags, PAGE_KERNEL);
        } else {
                SDEBUG(SD_CONSOLE | SD_WARNING | SD_ERROR,
                        "kmem_alloc() called for %lu bytes (mode:0x%x)",
                        size, flags);
                spl_debug_dumpstack(NULL);
                ptr = spl_vmalloc(size, lflags, PAGE_KERNEL);
        }

This error on my part has been very educational. I am closing this because it is clearly not a bug in ZoL's code. However, it is useful information for me as I refresh openzfs/spl#369.
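
Roughly, the intended routing looks like this (a minimal sketch only, assuming the spl_kmem_alloc_max threshold introduced by the patch referenced below; it is not the code that was ultimately merged). Small allocations always go through the slab allocator so they never touch vmap_area_lock, and only large requests fall back to spl_vmalloc():

    #include <linux/slab.h>      /* kmalloc_node() */
    #include <linux/vmalloc.h>   /* PAGE_KERNEL */

    /*
     * Sketch: spl_kmem_alloc_max and spl_vmalloc() are assumed from the
     * SPL patch referenced below; lflags and node come from the caller
     * as in the snippet above.
     */
    static void *
    spl_kmem_alloc_sketch(size_t size, gfp_t lflags, int node)
    {
            if (size <= spl_kmem_alloc_max)
                    return (kmalloc_node(size, lflags, node));      /* slab path */

            return (spl_vmalloc(size, lflags, PAGE_KERNEL));        /* vmalloc path */
    }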

ryao referenced this issue in behlendorf/spl Dec 26, 2014
This patch achieves the following goals:

1. It replaces the preprocessor kmem flag to gfp flag mapping with
   proper translation logic. This eliminates the potential for
   surprises that were previously possible where kmem flags were
   mapped to gfp flags.

2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
   sized less than or equal to spl_kmem_alloc_max.  This ensures that
   small allocations will not contend on a single global lock, large
   allocations can still be handled, and potentially limited virtual
   address space will not be squandered.  This behavior is entirely
   different than under Illumos due to the different memory management
   strategies employed by the respective kernels.  However, this
   functionally provides the semantics we require.

3. The --disable-debug-kmem, --enable-debug-kmem (default), and
   --enable-debug-kmem-tracking allocators have been unified into
   a single spl_kmem_alloc_impl() allocation function.  This was
   done to simplify the code and make it more maintainable.

4. Improve portability by exposing an implementation of the memory
   allocation functions that can be safely used in the same way
   they are used on Illumos.  Specifically, callers may safely
   use KM_SLEEP in contexts which perform filesystem IO.  This
   allows us to eliminate an entire class of Linux specific changes
   which were previously required to avoid deadlocking the system.

This change will be largely transparent to existing callers, but there
are a few caveats:

1. Because the headers were refactored and extraneous includes removed,
   callers may find they need to explicitly add additional #includes.
   In particular, kmem_cache.h must now be explicitly included to
   access the SPL's kmem cache implementation.  This behavior is
   different from Illumos but it was done to avoid always masking
   the Linux slab functions when kmem.h is included.

2. Callers, like Lustre, which made assumptions about the definitions
   of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated.
   Other callers such as ZFS which did not will not require changes.

3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO.  It retains
   its original meaning of allowing allocations to access reserved
   memory.  KM_PUSHPAGE callers can be converted back to KM_SLEEP.

4. The KM_NODEBUG flag has been retired and the default warning
   threshold increased to 32k.

5. The kmem_virt() function has been removed.  For callers which
   need to distinguish between a physical and a virtual address,
   use is_vmalloc_addr().
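
As a small illustration of caveat 5 (a hypothetical helper written for this example, not part of the patch), is_vmalloc_addr() can take over the role kmem_virt() used to play when choosing the matching free routine:

    #include <linux/mm.h>        /* is_vmalloc_addr() */
    #include <linux/slab.h>      /* kfree() */
    #include <linux/vmalloc.h>   /* vfree() */

    /* Hypothetical example: free a buffer that may be slab- or vmalloc-backed. */
    static void
    example_free(void *ptr)
    {
            if (is_vmalloc_addr(ptr))
                    vfree(ptr);
            else
                    kfree(ptr);
    }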