-
Notifications
You must be signed in to change notification settings - Fork 178
Conversation
@behlendorf I have revised this pull request to place "kmem_cache: Call constructor/destructor on each alloc/free" at the top of the stack in response to #371. The rest of this pull request and its accessory pull request to the zfs repository should be safe to merge after review. |
I'm not 100% sure this is the cause, but it's the most likely issue. # zpool export $activepool # rmmod zfs SysRq : Show Blocked State task PC stack pid father rmmod D ffff880025165e18 0 25402 24110 0x00000000 ffff880025165de8 0000000000000082 ffff880025165d78 ffff8801b69768d0 ffff8801af7ca840 0000000100200009 ffff880025164010 0000000000011780 ffff8801b6758810 0000000000011780 0000000000011780 ffff8801b6758810 Call Trace: [<ffffffff8e3d4c6d>] schedule+0x6e/0x70 [<ffffffffc00be16c>] spl_kmem_cache_destroy+0x14e/0x2f9 [spl] [<ffffffffc00bc9f1>] ? spl_kmem_free+0x2d/0x2f [spl] [<ffffffff8e062ab3>] ? bit_waitqueue+0xb0/0xb0 [<ffffffffc047ae48>] zil_fini+0x10/0x2e [zfs] [<ffffffffc043f81d>] spa_fini+0x1d/0x112 [zfs] [<ffffffffc0460eb6>] _fini+0xa3/0xf0 [zfs] [<ffffffffc0460f11>] spl__fini+0xe/0x2d [zfs] [<ffffffff8e08004a>] SyS_delete_module+0x122/0x19a [<ffffffff8e02a347>] ? do_page_fault+0x3d/0x54 [<ffffffff8e3d7b22>] system_call_fastpath+0x16/0x1b My tree: https://github.com/DeHackEd/spl/commits/dehacked |
@DeHackEd Yes, the stack you reported was caused by this patch stack. It's easily reproducible with the kmem splat tests. In particular I hit it with the |
@ryao I'm still working my way through all the patches so expect more comments but I wanted to quickly jot down some of my initial thoughts and concerns. In general I think this is a nice bit of cleanup. In particular, I like the KM_* conversion function. It's clear to me that's what should have been done from day one. Updating the code to depend on the existing PF_FSTRAN flags is a nice clean solution as well. It's a shame we can't share more of the infrastructure xfs built up. That said, my major concern thus far is that with the new interfaces it's possible for a The converse case of a |
@behlendorf Only Linux-specific aspects of the code base need As for @DeHackEd's issue, this patch stack includes a patch to convert the slab cache function to use kzalloc(), which is provided by the Linux kernel itself. |
We should avoid this, even if all the needed changes end up in Linux specific code. Doing this goes against one of the design goals of this patch. We should be minimizing the delta between OpenZFS implementations. Simply ensuring that |
@behlendorf In that case, I will try mapping |
One other general comment. Since eventually we'd like to move the spl code in to the zfs tree we should be careful to follow same style guidelines used for zfs. Currently the spl code is riddled with violations but we should be careful to avoid adding more when writing new code. |
@behlendorf I think there is some misunderstanding about the distinction between The idea that mapping the Solaris analog of
|
@behlendorf Here is a description of this pull request that I wrote in an email recently.
This predated openzfs/zfs@cd3939c, which SoftNAS told me further improved performance in that benchmark. |
I can confirm the above benchmarks and findings. In fact, we have observed remarkable ZoL performance improvements that now make ZoL much more usable on VMware ESXi, especially VM cloning and migrations. Great job! |
@ryao from reading through this pull request I got the impression that only pools using atime would benefit from this - is this true ? thanks |
@kernelOfTruth All pools benefit from it. However, pools using atime have the additional benefit of avoiding the deadlock that I mentioned in the following blog post: |
Seems SPL doesnt want to build for me - DKMS package builds, actual kernel module fails - http://dpaste.com/067615E. |
@ryao thank you =) @sempervictus confirmed - got the same error message |
@sempervictus @kernelOfTruth ryao/spl@634047f had a mistake. I had changed an ASSERT to a VERIFY as per @behlendorf's request, but neglected to remove the
I have pushed a revised commit. |
I have refreshed the patches to the zfs repository via openzfs/zfs#2796. Until today, this patch had anenabling patch for supporting |
@ryao I'll try and give the updated patch a careful review tomorrow. I would like to get this merged once we're sure there aren't any subtle issues lurking. However, I was able to hit another issue in the splat kmem:vmem_size test which caused a hang so there are still some issues lurking. |
I had not run splat to test the more recent iterations of these changes and unfortunately, that lead to bugs going uncaught. These are my present mistakes:
The first is quite obvious in that I am sending a request for a large amount of memory to kmalloc() that it cannot satisfy. The stack trace for it looks like this:
The second is less obvious, but it has a similar backtrace in |
@behlendorf and/or @ryao: with the patch above applied to the PR, where do we stand on suspicions about corrupted pools? On a related but separate note, is there a writeup i can reference for getting a comprehensive testing environment configured? We have datacenter environments with metal and virtual resources... i'm sure they wouldn't mind spinning a few more cycles for "the greater good." |
@sempervictus @ryao's fixes addressed the issue I caught about potentially not zeroing memory. This could have resulted in problems for all callers which depended on that behavior. I'm still working through a careful review of the rest of the patch. Any additional testing you can offer in a test environment would be welcome. |
@ryao can you rebase this against master. I've merged the pre-2.6.32 compatibility cleanup which touched/removed some of the same code. |
a46707e
to
adf1e3c
Compare
@behlendorf It is rebased and should be ready to merge. I had mentioned in IRC that I had identified a problem with the linux:shrinker SPLAT test on Linux 3.14.14-gentoo, but it turns out that this is a pre-existing problem. I have included the patch from #403 in the pull request. |
@ryao Thanks, I'll work through another round and feedback and testing today. |
@ryao This is shaping up very nicely! I know there's a lot of feedback here but this is definitely getting close to something we can merge. |
This has a few benefits. First, it fixes a regression that "Rework generic memory allocation interfaces" appears to have triggered in splat's slab_reap and slab_age tests. Second, it makes porting code from Illumos to ZFSOnLinux easier. Third, it has the side effect of making reclaim from slab caches that specify reclaim functions an order of magnitude faster. The splat slab_reap test usually took 30 to 40 seconds. With this change, it takes 3 to 4. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #369
The following patches have been merged. The remaining patches in the stack we still need to reach agreement on. ad9863e kmem_cache: Call constructor/destructor on each alloc/free |
wake_up_bit() is called on a word inside kmem_cache_t objects. This calls virt_to_page() on the address of the memory. That is incompatible with virtual memory, so we must switch to Linux's memory allocator. Signed-off-by: Richard Yao <[email protected]>
This reverts commit eb0f407.
This patch achieves the following goals: 1. It replaces the preprocessor kmem flag to gfp flag mapping with proper translation logic. This eliminates the potential for surprises that were previously possible where kmem flags were mapped to gfp flags. 2. It maps `kmem_{,z}alloc()` KM_SLEEP allocations that were previously mapped to `kmalloc()` to `vmalloc()` to reduce internal memory fragmentation. 3. It discards the distinction between vmem_* and kmem_* that was previously made by mapping them to vmalloc() and kmalloc() respectively. This achieve better compatibility because kmem_* allocates are done from slabs allocated from vmem_*. Both are therefore virtual memory allocators and it makes no sense to implement them differently than one another. The detailed reasons for each are as follows: 1. Solaris derivatives have different allocation flag semantics than Linux. This was originally handled by trying to map Solaris flags to Linux flags, but the semantics are different enough that this approach does not correctly handle all cases. For example, 0 is KM_SLEEP on Solaris derivatives while 0 is illegal on Linux. This means that things like assertions that make assumptions about the flag semantics are no longer portable because reasonable assertions such as `ASSERT0(flags)` on Solaris derivatives are illegal on Linux. In addition, a trivial mapping allows us to freely mix and match flags. This is bad for portability and it can lead to unexpected consequences when clashes between semantics means that one expects one system's semantics, and receives another. 3. The SPL originally mapped kmem_alloc() to kmalloc() and vmem_alloc() to vmalloc(). One would be inclined to think this is correct by applying the reasonable expectation that things with similar names on each platform are similar things. However, this is not the case here. On Solaris, vmem_* is a general purpose arena allocator that does kernel virtual memory allocations. The Solaris SLAB allocator `kmem_cache_alloc()` operates by allocating slabs from vmem and returns objects. Allocations from kmem_alloc() work by performing HOARD-style allocations on pre-existing power of 2 SLAB caches. When mapping uses of these allocators to Linux equivalents, we must consider 4 allocators on Linux and how they interact: 1. The buddy allocator 2. The slab allocator 3. The vmap allocator 4. The kernel virtual memory allocator The buddy allocator is used for allocating both pages and slabs for Linux's kmem_slab_alloc. These are then used to provide generic power of 2 caches to which kmalloc() is mapped. Allocations that are larger than the largest power of 2 are sent directly to the buddy allocator. This is analogous to kmem_cache_alloc() and kmem_alloc() on Solaris. The third allocator is the vmap allocator, which concerns itself with allocating address space. The four allocator is the kernel virtual memory allocator and is invoked via `vmalloc()`. This uses pages from the buddy allocator and address space from the vmap allocator to perform virtual memory allocations. 3. Switching the KM_SLEEP allocations to `vmalloc()` provides some protection from deadlocks caused by internal memory fragmentation. It would have been ideal to make all allocations virtual like they are on Illumos. However, virtual memory allocations allocations that require KM_PUSHPAGE or KM_NOSLEEP semantics will receive KM_SLEEP semantics on Linux whenever a page directory table entry must be allocated, which is unsafe. We are therefore forced to use physical memory for KM_PUSHPAGE and KM_NOSLEEP allocations. That is suboptimal from the perspective of reducing internal memory fragmentation, but we still partially benefit by mapping KM_SLEEP allocations to `vmalloc()`. A caveat that aries from replacing `kmalloc()` with `vmalloc()` is that code using Linux's wake_up_bit should use the native Linux allocators. This has no equivalent on Solaris. While it might seem fragile at first glance, that is not the case for three reasons: 1. Linux's locking structures use atomic instructions and churn in them is rare. When churn does occur, there is an incredible amount of pressure to maintain both the size of the structure and remain backward compatible precisely because changes to locking structures can cause unanticipated issues that are hard to debug. 2. The incompatibility arises because `wait_on_bit()` does a hashtable lookup that assumes that `wait_on_bit()` is never called on virtual memory. This hashtable lookup should be more expensive than conventional locking because it involves more memory accesses than an atomic instruction for taking a mutex, so it will never be used inside one of the locking primitives that we map to the Solaris ones. 3. The kernel will print a backtrace to dmesg whenever wait_on_bit() is used inside memory from vmalloc(), so any problems that arise would appear very quickly in buildbots. Consequently, it is reasonable to expect allocations intended for structures that use `wake_up_bit()` to be done using the Linux allocator. At present, the only allocations for such structures are done inside the SPL SLAB allocator and for super blocks. No other code uses it or is likely to use it. These changes appear to create the most semantically equivalent mapping possible on Linux. The result is the elimination of concerns regarding proper use of generic interfaces when writing portable code, which posed problems for the development of things like sgbuf. A couple of additional changes worth noting are: 1. The kmem_alloc_node interface has been removed(). It has no external consumers and does not exist on Solaris. 2. sys/vmem.h has been introduced as an alias of sys/kmem.h for Illumos compatibility. Signed-off-by: Richard Yao <[email protected]>
If a SLAB cache is full and two allocations occur from the same SLAB cache nearly simultaneously where one is KM_SLEEP and another is either KM_PUSHPAGE or KM_NOSLEEP, the one that occurs first will dictate the KM_FLAGS used for SLAB growth for both of them. This is a race condition that at best hurts performance and at worse, causes deadlocks. We address this by modifying `spl_cache_grow()` to only provide the emergency allocation semantics to KM_PUSHPAGE allocations, with KM_SLEEP allocations being coalesced and KM_NOSLEEP allocations failing immediately. Signed-off-by: Richard Yao <[email protected]>
The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit that is used to mark transactions to that the translation of the IRIX kmem flags into Linux gfp flags for allocations inside of transactions will dip into kernel memory reserves to avoid deadlocks during writeback. Linux 3.9 provided the additional PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS began using in 3.15. This patch implements hooks for marking transactions via PF_FSTRANS. When an allocation is performed with in the context of PF_FSTRANS, any KM_SLEEP allocation is transparently converted into a KM_PUSHPAGE allocation. It will also set PF_MEMALLOC_NOIO to prevent direct reclaim from entering `pageout()` on any KM_PUSHPAGE or KM_NOSLEEP allocation on Linux 3.9 or later. Signed-off-by: Richard Yao <[email protected]>
This reverts commit e302072.
The comment above the Linux 3.17 kernel's clear_bit() states: /** * clear_bit - Clears a bit in memory * @nr: Bit to clear * @addr: Address to start counting from * * clear_bit() is atomic and may not be reordered. However, it does * not contain a memory barrier, so if it is used for locking purposes, * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic() * in order to ensure changes are visible on other processors. */ This comment does not make sense in the context of x86 because x86 maps these operations to barrier(), which is a compiler barrier. However, it does make sense to me when I consider architectures that reorder around atomic instructions. In such situations, a processor is allowed to execute the wake_up_bit() before clear_bit() and we have a race. There are a few architectures that suffer from this issue: http://lxr.free-electrons.com/source/arch/arm/include/asm/barrier.h?v=3.16#L83 http://lxr.free-electrons.com/source/arch/arm64/include/asm/barrier.h?v=3.16#L102 http://lxr.free-electrons.com/source/arch/mips/include/asm/barrier.h?v=3.16#L199 http://lxr.free-electrons.com/source/arch/powerpc/include/asm/barrier.h?v=3.16#L88 http://lxr.free-electrons.com/source/arch/s390/include/asm/barrier.h?v=3.16#L32 http://lxr.free-electrons.com/source/arch/tile/include/asm/barrier.h?v=3.16#L83 https://en.wikipedia.org/wiki/Memory_ordering In such situations, the other processor would wake-up, see the bit is still taken and go to sleep, while the one responsible for waking it up will assume that it did its job and continue. https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h#L100 It is important to note that smp_mb__{before,after}_{atomic,clear}_{dec,inc,bit}(), were replaced by smp_mb__{before,after}_atomic() in recent kernels: torvalds/linux@febdbfe Some compatibility code was added to replace it in the time being, although it does not interact well with -Werror: http://www.spinics.net/lists/backports/msg02669.html http://lxr.free-electrons.com/source/include/linux/bitops.h?v=3.16#L48 In addition, the kernel's code paths are using clear_bit_unlock() in situations where clear_bit is used for unlocking. This adds smp_mb__before_atomic(), which I assume is for Alpha. This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to smp_mb__{before,after}_clear_bit() on older kernels and changes our code to leverage it in a manner consistent with the mainine kernel. Signed-off-by: Richard Yao <[email protected]>
The initial port of ZFS to Linux required a way to identify virtual memory to make IO to virtual memory backed slabs work, so kmem_virt() was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is logically equivalent to kmem_virt(). Support for kernels before 2.6.26 was later dropped and more recently, support for kernels before Linux 2.6.32 has been dropped. We retire kmem_virt() in favor of is_vmalloc_addr() to cleanup the code. Signed-off-by: Richard Yao <[email protected]>
I have pushed a refresh. There are two known issues at this time:
I suspect that the kmem chnages exposed some subtle issues that were hidden by the earlier kmem code. I am going to work some more on these issues more today, but tomorrow onward, I need to switch to working on the libzfs_core extensions for work. |
@ryao regarding the OOM killer we should avoid disabling it in such a common code path. We want it to be able to always kill runaway user processes and disabling it should not be necessary. |
This has a few benefits. First, it fixes a regression that "Rework generic memory allocation interfaces" appears to have triggered in splat's slab_reap and slab_age tests. Second, it makes porting code from Illumos to ZFSOnLinux easier. Third, it has the side effect of making reclaim from slab caches that specify reclaim functions an order of magnitude faster. The splat slab_reap test usually took 30 to 40 seconds. With this change, it takes 3 to 4. Signed-off-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#369
Closing. The kmem rework was merged some time ago. |
This refactoring is intended to simplify the code while making it easier to share code between ZoL and other Open ZFS platforms.