Severe contention in get_vmalloc_info() #2705

Closed
ryao opened this issue Sep 16, 2014 · 3 comments

Comments

@ryao
Contributor

ryao commented Sep 16, 2014

Certain workloads have horrible contention in get_vmalloc_info() on vmap_area_lock, which is taken for both kernel virtual memory allocations and /proc/meminfo accounting. This was a reader-writer lock on older kernels such as RHEL6's kernel; it is a spinlock on modern kernels. Allocations want to add an item to the list, while accounting wants to traverse it. When ZoL is in use, this linked list contains hundreds of thousands of items, so everything serializes: we spend excessive time walking the list and spinning on the lock that protects it.
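
To make the serialization concrete, here is a simplified sketch of the in-tree code path (an illustration modeled on mm/vmalloc.c of that era, not the exact upstream source): every /proc/meminfo read walks the entire vmap_area_list under the same vmap_area_lock that vmalloc()/vfree() must take.

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/vmalloc.h>

    /*
     * Simplified illustration only: the real get_vmalloc_info() also
     * tracks the largest free chunk, and vmap_area_lock/vmap_area_list
     * are internals of mm/vmalloc.c rather than exported symbols.
     */
    static unsigned long vmalloc_used_sketch(void)
    {
            struct vmap_area *va;
            unsigned long used = 0;

            spin_lock(&vmap_area_lock);     /* same lock vmalloc()/vfree() take */
            list_for_each_entry(va, &vmap_area_list, list)
                    used += va->va_end - va->va_start;      /* O(n) in list length */
            spin_unlock(&vmap_area_lock);

            return used;
    }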

On my workstation, I see more than half of CPU time spent in the kernel, with at least half of that kernel time spent in get_vmalloc_info(). The contention on memory allocations limits our IOPS potential and could explain rare random IO lags in the tens of milliseconds. SLAB allocations reduce the need to modify the list somewhat, but I suspect we could do better by doing something about this contention. I sent a patch to the LKML earlier this year to switch the list to RCU (which avoided the contention). However, I did not elaborate that it helped ZoL lest it be rejected for not having a clear in-kernel consumer. Someone else reimplemented it in a way that was more agreeable to merging, but so far, that has not been merged.

This could explain why setting spl_kmem_cache_slab_limit on the SPL helps some workloads. This might be an argument for merging #2129, but I am hopeful that we can eliminate the need for vmalloc entirely by reworking the zio buffers to use pages.

@kernelOfTruth
Contributor

FYI:

https://lkml.org/lkml/2014/6/10/817
http://www.spinics.net/lists/stable/msg59229.html
http://www.codestuff.ch/codestuff/kernel/code/id_select_git_log_files_mm_linux.html

So it is in Linus' tree starting with 3.17 and should soon be (or already is) in the 3.12 stable series.

If the implementation in ZFS takes rather long, a way to improve IOPS in the meantime would be to request inclusion of the change in the relevant stable kernels that are in use.

@ryao
Contributor Author

ryao commented Sep 17, 2014

@kernelOfTruth Thanks for looking into the status of that patch. Hearing that it already is in 3.17 is good news.

@ryao
Contributor Author

ryao commented Nov 2, 2014

This seems to be a regression caused by a previous iteration of openzfs/spl#369 and justifies @behlendorf's request that I change the patches to move away from vmalloc() for small allocations in instances where KM_SLEEP is used. I just had this problem again. This time, I took advantage of vmallocinfo to get a dump of the allocations:

desktop ~ # cat /proc/vmallocinfo > /tmp/vmallocinfo-dump
desktop ~ # grep spl /tmp/vmallocinfo-dump | wc
 240391 1443241 23563753
desktop ~ # grep -v spl /tmp/vmallocinfo-dump | wc
    618    3114   60759
desktop ~ # grep spl /tmp/vmallocinfo-dump | awk ' NF > 0{ counts[$2] = counts[$2] + 1; } END { for (word in counts) print word, counts[word]; }' | sort --numeric                                                                                                             
8192 237797
12288 869
135168 1
528384 173
1052672 306
2101248 350
4198400 894
67112960 1

The majority of allocations are from the SPL, of which the overwhelming majority look like this:

0xffffc90000209000-0xffffc9000020b000    8192 spl_kmem_alloc_node+0xad/0xc0 [spl] pages=1 vmalloc

It looks like we allocated 8192 bytes for precisely 8192 bytes. These 8192-byte allocations should be going to kmalloc(), but are not, because of this:

        /*
         * XXX: node is taken as a hint for large allocations because
         * __vmalloc_node() is not exported for our use.
         */
        if (likely(size <= PAGE_SIZE*2)) {
                if ((flags & (KM_PUSHPAGE | KM_NOSLEEP)))
                        ptr = kmalloc_node(size, lflags, node);
                else
                        ptr = spl_vmalloc(size, lflags, PAGE_KERNEL);
        } else {
                SDEBUG(SD_CONSOLE | SD_WARNING | SD_ERROR,
                        "kmem_alloc() called for %lu bytes (mode:0x%x)",
                        size, flags);
                spl_debug_dumpstack(NULL);
                ptr = spl_vmalloc(size, lflags, PAGE_KERNEL);
        }

This error on my part has been very educational. I am closing this because it is clearly not a bug in ZoL's code. However, it is useful information for me as I refresh openzfs/spl#369.
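
Roughly, the intended routing looks like this (a minimal sketch only, assuming the spl_kmem_alloc_max threshold introduced by the patch referenced below; it is not the code that was ultimately merged). Small allocations always go through the slab allocator so they never touch vmap_area_lock, and only large requests fall back to spl_vmalloc():

    #include <linux/slab.h>      /* kmalloc_node() */
    #include <linux/vmalloc.h>   /* PAGE_KERNEL */

    /*
     * Sketch: spl_kmem_alloc_max and spl_vmalloc() are assumed from the
     * SPL patch referenced below; lflags and node come from the caller
     * as in the snippet above.
     */
    static void *
    spl_kmem_alloc_sketch(size_t size, gfp_t lflags, int node)
    {
            if (size <= spl_kmem_alloc_max)
                    return (kmalloc_node(size, lflags, node));      /* slab path */

            return (spl_vmalloc(size, lflags, PAGE_KERNEL));        /* vmalloc path */
    }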

ryao referenced this issue in behlendorf/spl Dec 26, 2014
This patch achieves the following goals:

1. It replaces the preprocessor kmem flag to gfp flag mapping with
   proper translation logic. This eliminates the potential for
   surprises that were previously possible where kmem flags were
   mapped to gfp flags.

2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
   sized less than or equal to spl_kmem_alloc_max.  This ensures that
   small allocations will not contend on a single global lock, large
   allocations can still be handled, and potentially limited virtual
   address space will not be squandered.  This behavior is entirely
   different than under Illumos due to the different memory management
   strategies employed by the respective kernels.  However, this
   functionally provides the semantics we require.

3. The --disable-debug-kmem, --enable-debug-kmem (default), and
   --enable-debug-kmem-tracking allocators have been unified into
   a single spl_kmem_alloc_impl() allocation function.  This was
   done to simplify the code and make it more maintainable.

4. Improve portability by exposing an implementation of the memory
   allocation functions that can be safely used in the same way
   they are used on Illumos.  Specifically, callers may safely
   use KM_SLEEP in contexts which perform filesystem IO.  This
   allows us to eliminate an entire class of Linux specific changes
   which were previously required to avoid deadlocking the system.

This change will be largely transparent to existing callers, but there
are a few caveats:

1. Because the headers were refactored and extraneous includes removed,
   callers may find they need to explicitly add additional #includes.
   In particular, kmem_cache.h must now be explicitly included to
   access the SPL's kmem cache implementation.  This behavior is
   different from Illumos but it was done to avoid always masking
   the Linux slab functions when kmem.h is included.

2. Callers, like Lustre, which made assumptions about the definitions
   of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated.
   Other callers such as ZFS which did not will not require changes.

3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO.  It retains
   its original meaning of allowing allocations to access reserved
   memory.  KM_PUSHPAGE callers can be converted back to KM_SLEEP.

4. The KM_NODEBUG flag has been retired and the default warning
   threshold increased to 32k.

5. The kmem_virt() function has been removed.  For callers which
   need to distinguish between a physical and a virtual address,
   use is_vmalloc_addr().
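
As a small illustration of caveat 5 (a hypothetical helper written for this example, not part of the patch), is_vmalloc_addr() can take over the role kmem_virt() used to play when choosing the matching free routine:

    #include <linux/mm.h>        /* is_vmalloc_addr() */
    #include <linux/slab.h>      /* kfree() */
    #include <linux/vmalloc.h>   /* vfree() */

    /* Hypothetical example: free a buffer that may be slab- or vmalloc-backed. */
    static void
    example_free(void *ptr)
    {
            if (is_vmalloc_addr(ptr))
                    vfree(ptr);
            else
                    kfree(ptr);
    }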