Use Linux SLAB allocator for SPL SLAB allocations #147
Conversation
This has the side effect of causing the kmem:slab_large splat test to fail. The kernel does not maintain a large enough reserve of memory to permit such large allocations to succeed under memory pressure. Requesting them under such conditions will either deadlock or severely harm performance. It probably would be best to stop supporting large slabs.
@ryao Would you like to comment on your approach of using the native linux slab in light of point 2 in the comments from
In particular you've already seen the … Also, there now seems to be a good amount of redundant work with the kmem slab on top of the linux slab, e.g. both look to be keeping track of per-cpu locality etc.
These are good points, but I believe that these changes are an improvement. The virtual memory address backed SLAB was nice in principle, but in practice it did not work very well. First, the code path for small allocations is not used in practice, so nothing benefited from it. Second, the previous approach relied on a global lock that serialized SLAB allocations. The elimination of this lock should compensate for the overhead of the redundant code. Third, the Linux kernel's virtual memory allocator is intentionally crippled to prevent developers from doing what we currently do. The effect is that unless end users patch their kernels, they will suffer from stability problems. People have found the kernel patch to do wonders for system stability, but it has a negligible chance of being accepted upstream. The loss of large slab allocations is in some sense an improvement. It is extremely difficult, if not impossible, for the current Linux kernel to allocate large slabs under extreme memory pressure without tweaks. Much critical code depends on SLAB allocations, and that code is not designed to handle allocation failures. I have not examined the kernel's code in enough detail to know exactly what should happen in these circumstances, but one of three scenarios is theoretically possible from what I know.
The kernel is not intended to provide such large allocations and they cause stability problems when paging to a ZFS swap zvol under load. Even when swap on ZFS is not in use, allocating large slabs as the code currently does under memory pressure poses performance issues. Lastly, the definition of a large allocation is relative. The Linux SLAB allocator has a maximum object size of 32MB. The SPL SLAB allocator should have no problems performing efficiently as long as SLAB objects are no greater than a few megabytes in size. I am not aware of any situations in which they approach 1MB, so I do not think that this limitation is a problem.
I'm also particularly concerned about point 2) in my comment above. This is in effect 90% of the reason I was forced to write the spl slab in the first place. I would have loved to have been able to avoid this work, but at the time this was the cleanest short/medium term solution. The problem is that fundamentally the Linux kernel is very bad at handling large contiguous memory allocations; it was never designed to do this well. They should be avoided.

My primary concern is that you're now going to be regularly attempting to allocate multi-megabyte chunks of contiguous physical address space. On a freshly booted system that's probably not going to be a huge issue, but as system memory fragments over time it becomes increasingly expensive to perform these allocations. It can even degrade to the point where these large allocations are impossible. Additionally, I'd expect changes like #145 to only further compound the issue by reducing the number of locations pages can be easily dropped to free up the required address space. These issues are subtle and will not likely crop up until memory gets heavily fragmented. Even then I doubt they will result in an outright failure, but you will see a performance impact. For a tiny pool that performance impact may be small, but for larger pools which are expected to sustain GB/s of throughput this will be a problem. So while these changes may make sense for a desktop, they will cause problems for enterprise systems. Additionally, @chrisrd is correct about these changes introducing a lot of new redundancy with the kernel slab in regard to per-cpu locality and cache aging.

Now rather than dissuade you from pursuing this further, I want to suggest that you take what you started several steps farther. Create a completely alternate spl slab implementation which is thinly layered on top of the kernel slab. This should be relatively straightforward, with the exception of implementing destructors, which may be difficult. Then add a new configure option to select which slab implementation should be used. This resolves a couple of issues:
@behlendorf What are your thoughts about reducing the maximum SLAB size ahead of this? Large slabs should pose problems under low memory conditions. Also, what do you think about commits ryao/spl@5517b73 and ryao/spl@353d751? They should be harmless.
@ryao For which implementation? The current assumption is that large slabs are vmalloc() based, so making them a few megabytes was reasonable considering the overhead of acquiring the global spin lock. If you're talking about backing them by the Linux slab then yes, they would need to get smaller. Although I'd prefer to simply abandon the existing implementation entirely if we shift to using the Linux slab. We just leave all those slab management details to the Linux slab implementation and live with a maximum object size of 128K. I was thinking we could even trivially handle the constructors/destructors by just calling them for each alloc/free respectively. That's not 100% optimal but it is semantically correct and simple. This would allow the spl slab layer over the linux slab to be very thin.
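A minimal sketch of the thin layering described above, using hypothetical spl_cache_* names rather than the real spl_kmem_cache_* API. Because the Linux slab no longer supports per-object destructors, the constructor simply runs after every allocation and the destructor before every free, which is the "semantically correct but not 100% optimal" trade-off mentioned:

```c
#include <linux/slab.h>

/* Hypothetical thin wrapper; names and fields are illustrative only. */
struct spl_cache {
	struct kmem_cache *linux_cache;
	void (*ctor)(void *obj, void *priv, int flags);
	void (*dtor)(void *obj, void *priv);
	void *priv;
};

static void *spl_cache_alloc(struct spl_cache *sc, gfp_t gfp)
{
	void *obj = kmem_cache_alloc(sc->linux_cache, gfp);

	/* Run the constructor on every allocation; correct, though not as
	 * cheap as constructing each object once per slab. */
	if (obj && sc->ctor)
		sc->ctor(obj, sc->priv, 0);

	return obj;
}

static void spl_cache_free(struct spl_cache *sc, void *obj)
{
	/* Destructor runs on every free, mirroring the allocation path. */
	if (sc->dtor)
		sc->dtor(obj, sc->priv);

	kmem_cache_free(sc->linux_cache, obj);
}
```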
Oh yeah, I was going to comment on them but it slipped my mind while responding to the rest of it. Good catch; I don't have any objection to those two improvements/fixes. Let me take a second careful look at them and I'll get them tested and merged.
This would be for the current implementation. The kernel only maintains so much reserve memory (controlled via vm.min_free_kbytes). If the SPL slab size exceeds it, the system is under memory pressure, and an allocation is made from a cache whose slabs are full, the allocation could make things go from bad to worse. The use of virtual memory won't prevent problems in this scenario.

With that said, there is an alternative that we can consider: we can write our own primitive virtual memory allocator with some bookkeeping. Whenever an allocation is made, we call alloc_vmap_area() to get virtual address space. It can then allocate pages by examining the kernel's free space and piecing together the smallest fragments by mapping them into our virtual address space, doing some bookkeeping much like how vm_area_structs work in processes. It would then return that to the caller. A free function would take the address, look through our records to identify which fragments composed it, unmap them and release the area.
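A rough sketch of that idea, with hypothetical names and using the exported vmap()/vunmap() interface rather than alloc_vmap_area() directly: order-0 pages are gathered from whatever fragments the buddy allocator has available and stitched into a single contiguous virtual range, with the page list kept as the bookkeeping record used at free time.

```c
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/slab.h>

/* Hypothetical bookkeeping record, loosely analogous to a vm_area_struct. */
struct frag_area {
	struct page **pages;
	unsigned int npages;
	void *addr;
};

/* Gather order-0 pages and stitch them into one virtual range with vmap(). */
static struct frag_area *frag_alloc(unsigned int npages, gfp_t gfp)
{
	struct frag_area *fa;
	unsigned int i;

	fa = kzalloc(sizeof(*fa), gfp);
	if (!fa)
		return NULL;

	fa->pages = kcalloc(npages, sizeof(*fa->pages), gfp);
	if (!fa->pages)
		goto fail;

	for (i = 0; i < npages; i++) {
		fa->pages[i] = alloc_page(gfp);
		if (!fa->pages[i])
			goto fail_pages;
	}
	fa->npages = npages;

	fa->addr = vmap(fa->pages, npages, VM_MAP, PAGE_KERNEL);
	if (!fa->addr)
		goto fail_pages;

	return fa;

fail_pages:
	while (i--)
		__free_page(fa->pages[i]);
fail:
	kfree(fa->pages);
	kfree(fa);
	return NULL;
}

static void frag_free(struct frag_area *fa)
{
	unsigned int i;

	vunmap(fa->addr);			/* release the virtual range */
	for (i = 0; i < fa->npages; i++)
		__free_page(fa->pages[i]);	/* return pages to the buddy allocator */
	kfree(fa->pages);
	kfree(fa);
}
```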
Concerning reducing the current max slab size, we could certainly try reducing it if you think it is aggravating the low memory issue. That initial value was only an educated guess I made at the time, trying to balance the vmalloc() spin lock contention against how much memory could be easily allocated. If testing shows reducing it is helpful and it doesn't come at the expense of performance, we can do it. I'm much more leery about writing our own VMA. Better to first try simply layering the spl's slab on the Linux slab and see how well that works in practice. At least that's development effort in the right direction IMHO. Plus, I've actually never really seriously tried it. It might work far better than I expect.
Ignore the 'ryao added some commits' message. Those are the same commits. I accidentally removed them from this branch when I meant to remove them from another branch. I haven't decided what to do about this branch yet. |
I'd be very interested to take this one step further with a very thin compatibility layer on top of the Linux slab. As I mentioned above, when I originally wrote the spl's virtual memory backed slab it was because I was concerned about performance. But at the time there was no POSIX layer in place, so it wasn't easy to do any real performance comparisons between the spl's slab and the Linux slab. I may have overestimated that impact, particularly since I was primarily concerned about sustaining several GB/s through IB or 10Gig attached servers. For a desktop workload this may be a price worth paying for now since it should improve stability.
@behlendorf These patches appear to be triggering the memory fragmentation issue on my desktop. My desktop applications are fairly memory intensive and, on top of that, I compile software in parallel quite often, so the kernel memory is stressed much more than it would be on an ordinary desktop.

With that said, I think I have thought of a solution that would work well on both Sequoia and desktop computers. The idea is to modify the SPL SLAB implementation to create a worker thread for each cache created and always maintain an extra slab. When a new slab is needed, the extra slab is consumed and the worker thread is notified to allocate a new one via GFP_KERNEL. This will enable SLAB allocations to be done using GFP_KERNEL in a manner that is safe by blocking asynchronously. Since each cache has its own worker thread, there will be no additional allocations needed to do this, so we will not risk making things worse under low memory conditions.

An issue with this is what happens when the worker thread does not successfully allocate a new slab in time, which isn't handled. In principle, each slab should be able to contain a sufficiently large number of objects that this will never happen, but it is something that should be handled somehow. This could be handled by cannibalizing other slabs, although that is a fairly elaborate thing to implement. Panicking would probably be acceptable in an initial implementation until a more elaborate mechanism for handling this state can be written. Lastly, this will make the SPL slabs even more memory hungry, although that is probably worth the system stability that this would provide.
On second thought, it would be better to use a single thread and just allocate additional space in the cache's structure to be used when a request is made to that thread. That will reduce spinlock contention and possibly have a few other benefits that I am still considering. Also, I neglected to mention in my previous comment that I believe the initial results were promising because the kernel's buddy allocator (from which the slabs are allocated) was able to satisfy the smaller allocations under memory stress immediately after booting, when the kernel memory was not too badly fragmented. After several days of fragmentation, this ceased to be the case. This explanation is consistent with the idea that smaller slab sizes are better under low memory conditions, and this should hold even when virtual memory is used. The idea in my previous comment should also eliminate virtual memory allocation overhead from slab allocations, which should eliminate contention concerns.
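A condensed sketch of the single-thread, spare-slab idea, with hypothetical names and the real SPL slab layout and locking omitted: the allocation path consumes a pre-allocated spare and queues the cache for refill, while a single dedicated thread (started elsewhere with kthread_run()) performs the sleeping allocation on its behalf.

```c
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/vmalloc.h>

/* Hypothetical cache with one pre-allocated spare slab; refill_node is
 * assumed to be set up with INIT_LIST_HEAD() when the cache is created. */
struct async_cache {
	spinlock_t lock;
	void *spare_slab;		/* next slab, allocated ahead of time */
	size_t slab_size;
	struct list_head refill_node;	/* queued for the refill thread */
};

static LIST_HEAD(refill_list);
static DEFINE_SPINLOCK(refill_lock);
static DECLARE_WAIT_QUEUE_HEAD(refill_waitq);

/* Allocation path: consume the spare and ask the thread for another. */
static void *cache_grow(struct async_cache *ac)
{
	void *slab;

	spin_lock(&ac->lock);
	slab = ac->spare_slab;
	ac->spare_slab = NULL;
	spin_unlock(&ac->lock);

	if (slab) {
		spin_lock(&refill_lock);
		if (list_empty(&ac->refill_node))
			list_add_tail(&ac->refill_node, &refill_list);
		spin_unlock(&refill_lock);
		wake_up(&refill_waitq);
	}

	return slab;	/* NULL means the spare was not refilled in time */
}

/* Single thread: blocks in sleeping allocations on behalf of every cache. */
static int refill_thread(void *unused)
{
	while (!kthread_should_stop()) {
		struct async_cache *ac = NULL;

		wait_event_interruptible(refill_waitq,
		    !list_empty(&refill_list) || kthread_should_stop());

		spin_lock(&refill_lock);
		if (!list_empty(&refill_list)) {
			ac = list_first_entry(&refill_list,
			    struct async_cache, refill_node);
			list_del_init(&ac->refill_node);
		}
		spin_unlock(&refill_lock);

		if (ac) {
			void *slab = vmalloc(ac->slab_size);	/* may sleep */

			spin_lock(&ac->lock);
			ac->spare_slab = slab;
			spin_unlock(&ac->lock);
		}
	}
	return 0;
}
```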
@ryao Concerning the fragmentation issues you observed, I'm curious: how did they manifest themselves on your system? Were they outright allocation failures or did the system just become painfully slow to use? Also, in your original patch IIRC you were allocating the SPL slab from a Linux slab, which would result in huge Linux slab objects (10's of MBs). Have you tried simply abandoning the SPL slab entirely? You could just kmalloc() the slab object outright, which would limit things to a maximum order-5 allocation size for 128KB objects. Once again this would be slow, but it might be small enough to be fairly stable.

Concerning the idea of using a separate thread to allocate these slabs: I had a similar thought but I discarded it after further reflection because it didn't completely solve the existing PF_MEMALLOC related issues. Even if the allocation occurs as GFP_KERNEL in a separate thread, there's no guarantee it will be able to complete. For example, it could still deadlock in the case where the txg_sync thread triggers the allocation and then asynchronously polls for the new slab. The new allocation might still block attempting to write back pages to zfs since it's GFP_KERNEL. And since txg_sync can't progress until it gets its memory, we're back to our original deadlock, although perhaps it would be more unlikely. I suppose we could kick off additional new slab allocations hoping that eventually one of them would succeed. Presumably, we'd eventually be able to discard a set of pages without triggering writeback to zfs and then be able to proceed. But that feels like a serious hack. In this case, I'd suggest we use the generic linux work queues rather than a dedicated thread. Anyway, I never tested the idea of doing the allocations in a separate thread; I just thought it through and abandoned it. But perhaps you can convince me I missed something and there's an easy way to prevent the above issue.
@behlendorf The system hung while waiting for allocations. Doing Magic SysRq + E enabled me to execute … I have thought about abandoning the SPL slabs entirely, but my desire for a solution that is universally applicable led me to consider the separate thread approach. If that does not work, then I suppose this would be the only way, although I would prefer to explore that possibility first.

I have observed the txg_sync thread problem in the current code when doing swap on ZFS under high memory stress. The approach of asynchronously allocating slabs in a separate thread is still susceptible to this, but it is much less likely. When low memory conditions occur, the (n+1)th slab would already be available by virtue of having been allocated before the low memory conditions, so things would keep going until the (n+2)th slab is needed. If each slab stores a sufficiently large number of objects to permit things to continue operating until the low memory condition has been alleviated, then there will be no problems. If not, things will deadlock, as they do now. That is a separate case from the situation where swap is not on ZFS, in which case these SPL slab allocations are not necessary for swap to occur and the kernel should be able to solve the low memory condition on its own. If that happens, then the only change we will see from this is a performance increase from the amortized O(1) slab allocations that result from allocating the next slab in advance.

I think a dedicated thread would be better than generic linux work queues for two reasons. The first is that the work queues would attempt to do allocations in parallel, which will cause unnecessary spin lock contention. The second is that the use of a single thread provides us with the opportunity to implement prioritization, which enables more important caches (such as any used in txg_sync) to have the next slab allocated before others. I am still thinking about what the right way to implement prioritization would be. I am currently thinking of applying Bayes' theorem based on the frequency of allocations from the SLAB that pass GFP_HIGHMEM, but I want to implement the asynchronous allocations before I consider this too heavily. I had neglected to mention this in my previous comment because I was thinking of implementing the worker thread first and then implementing prioritization at a later date should the worker thread approach alone prove to be enough. Let me know what you think.
@ryao Yes, I understand. I'd prefer a clean generic solution for this issue as well. I'm certainly not opposed to pursuing the dedicated thread approach. Hopefully, it will work well enough that we can rely on it for a proper 0.6.0 stable release. That would buy us the latitude and time to do things right in 0.7.0 and remove this virtual memory dependency from the zfs code. This is something I've wanted to do for quite some time since a lot of good will come from it. But I think we need a stable 0.6.0 release first before we potentially go and destabilize things. So anything we can do to the spl slab to achieve stability I'm interested in... including a lower performing but entirely stable implementation if needed.
Right, but is a single slab enough? I'm not totally convinced it will be. But once again, I don't have any real data on this, just a gut feeling which I'd love to be wrong about. So I'm all for trying it.
Not entirely (see http://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L239). The same issue can potentially occur for mmap(2) writes. The mapped region in the page cache can be dirtied and then forced out to disk via .writepage() when there is memory pressure. An unlikely, but possible, case. But yes, in general I agree. If you're using a non-zfs swap device then we should be able to make forward progress and get the memory we need. Although, I suspect we won't see much of a performance improvement. We're already typically hiding most of the allocation time by using slabs in the first place and keeping per-cpu magazines.
Both good points. My primary concern about using a single thread is that if it gets deadlocked for any reason down in vmalloc() we're out of luck. If we were to use work queues we could easily dispatch another vmalloc() work item, which could very well succeed because it will reclaim an entirely different set of pages. This would be more resilient and should help ensure forward progress. It should also cause no more contention on the virtual memory address space spinlock than we see today, and that appears to be at a tolerable level. This would make prioritizing the allocations harder, but I'm not sure that's really critical, particularly because it's not at all clear to me that at this level the spl is capable of making a good decision about which allocations to prioritize.
I think you should give this a try and we'll see how it holds up to real usage. If it works well, and the implementation is clean, then I don't have any objections to merging it.
I tried the delayed allocation approach, but it did not work as well as I had hoped. Having an additional slab available was insufficient to ensure system stability. The worker thread also triggered the hung task timer during idle time. It seems that writing a wrapper for the Linux SLAB allocator is the only alternative.
I am going to close this until another solution is ready.
This replaces kmem_zalloc() allocations of spl_kmem_cache objects that are approximately 32KB in size with Linux SLAB allocations. This eliminates the need to suppress debug warnings and makes the changes necessary to replace the __vmalloc() in kv_alloc() with SLAB allocations. We also take advantage of the SLAB allocator's SLAB_HWCACHE_ALIGN flag to reduce false sharing, which could increase performance. Signed-off-by: Richard Yao <[email protected]>
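For illustration, the kind of change this commit describes might look roughly like the following (hypothetical function names; the actual patch differs in detail): a dedicated Linux slab cache for the spl_kmem_cache_t objects, created with SLAB_HWCACHE_ALIGN and allocated from with kmem_cache_zalloc() in place of kmem_zalloc().

```c
#include <linux/slab.h>
#include <linux/errno.h>

/* spl_kmem_cache_t is assumed to come from the SPL headers. */
static struct kmem_cache *spl_kmem_cache_cache;	/* hypothetical name */

static int example_init(void)
{
	/* Cache-line aligned objects reduce false sharing between CPUs. */
	spl_kmem_cache_cache = kmem_cache_create("spl_kmem_cache",
	    sizeof(spl_kmem_cache_t), 0, SLAB_HWCACHE_ALIGN, NULL);

	return spl_kmem_cache_cache ? 0 : -ENOMEM;
}

static spl_kmem_cache_t *example_alloc(gfp_t gfp)
{
	/* Zeroed allocation replaces the old kmem_zalloc() call. */
	return kmem_cache_zalloc(spl_kmem_cache_cache, gfp);
}
```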
I took time to implement a wrapper for the Linux SLAB allocator. It still needs a configure script option to toggle it, but it is at the point where it can be tested and reviewed. This should guarantee that gfp flags are honored without a kernel patch. It should also improve things on 32-bit systems. Unfortunately, it is not perfect. Running 4 instances of … At present, I have SPL patched to use the FreeBSD solution of having KM_SLEEP and KM_PUSHPAGE be aliases of one another. Modifying that so that only KM_PUSHPAGE uses GFP_HIGHMEM would help here, although I do not expect that to solve this problem.
This should eliminate the need for a kernel patch and improve things on 32-bit systems. Signed-off-by: Richard Yao <[email protected]>
Sorry, I don't have the expertise to do a fine grained review, but I definitely like the approach of better integration with the rest of linux, and the reduction in complexity! I'll give this a try.

"the Linux SLAB allocator allocates only one object per slab for objects that span more than 8 pages"

If you're using the default SLUB allocator (…) What is producing the large objects, and how prevalent are they?
FYI I've had ryao/spl@4e2562a + cherry-pick ryao/spl@b0bc39c running on linux-3.4 (w/ CONFIG_SLUB) for nearly 5 hrs, loaded as an rsync target and doing a resilver. Some stats may be of interest:
@chrisrd I appreciate your enthusiasm, but unfortunately, I do not see any way to make this work. I am going to close this pull request and open a new one for ryao/spl@855c08a, which is salvageable.
No worries. I understand you're focused on swap over ZoL, but for what it's worth, I think the approach of SPL's cache being a thin layer on top of the native linux slab is the right one to address the more general memory issues that have been cropping up all too frequently. |
Memory issues and swap issues are related. Anyway, the native Linux slab cannot be used as-is due to memory fragmentation issues that occur when using physical memory directly. The real solution would be to fix the kernel virtual memory allocator to stop ignoring gfp flags, but I do not expect that to happen upstream.
Most (all?) of the major outstanding stability issues tie back to these memory management issues. Unfortunately, we can't simply naively layer on top of the Linux slab due to fragmentation and performance concerns. What needs to happen in the 0.7.0 time frame is to rework ZFS and remove its dependence on large allocations. This is behavior which is OK on Solaris, BSD, MacOS, etc., but is just not well supported (by design) in the Linux kernel. Until that happens, the best we can do is attempt to make the SPL's slab more robust. Inspired by @ryao's various attempts, I gave the issue some more thought and put together another patch which I think will work: #155. It passes all my regression tests without any additional performance impact, and it is designed to specifically address the write back deadlock case. If either of you have the time, I'd appreciate a code review and some testing on a real system.
Is there any chance to use ideas from SLUB? http://lwn.net/Articles/229984/ ...
Is the fragmentation issue really any worse than non-ZFS? As it happens, I have 2 boxes on linux-3.4, both on a similar uptime and running a similar workload, one with ext4 and one with ZoL on native linux slab (SLUB + ryao/spl@4e2562a + cherry-pick ryao/spl@b0bc39c). Comparing the two, ext4:
versus ZoL:
According to … The only other difference between the boxes is the ext4 box has 96 GB vs ZoL with 48 GB. Perhaps the buddyinfo difference is because the ext4 box has chewed up all the memory in buffers? Ext4:
vs ZoL:
(...which prompts the question: should ZoL be tying into the buffer subsystem rather than the slab subsystem?) I don't have any figures addressing your performance concerns, however (waves hands around...) it seems to me that the presumably highly-tuned native linux slab subsystem has a good chance of being as fast as the bespoke ZoL slab. All that said, I understand ZoL also has issues with low memory conditions and swap (the two being of course highly correlated), and to my educated layperson's mind, #155 (Emergency slab objects) is likely the right approach to address that side of things. Perhaps #155 in combination with the native linux slab?
It's not that ZFS fragments memory that much more than other filesystems; in fact, according to your data it does the opposite. It's that ZFS depends on the ability to allocate large contiguous chunks of memory for basic operation. Other filesystems such as ext4/xfs do not; they are able to operate just fine with heavily fragmented memory. ZFS can (and will) be updated to function in this fragmented memory environment, but it wasn't originally written to do so since Solaris provides a fairly rich virtual memory subsystem. Linux does not; it instead forces many of these low level memory management details down into the filesystem. This was a conscious decision made by the Linux kernel developers which is very good for performance, and bad for portability.
Yes, yes it should. The reason it doesn't is because it was originally written for the slab on Solaris. To shift it over to the page cache is going to require the changes mentioned above. In particular, the ZFS stack must be changed to pass a scatter-gather list of pages to functions rather than an address+length. It can then be updated to only briefly map these pages into a per-cpu address space for various transforms, and additionally to cleanly link these pages into the page cache.
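A hypothetical illustration of what such a scatter-gather descriptor might look like (the names here are invented, not the interface that was ultimately adopted): the buffer is described by a page list plus a length, and a transform maps one page at a time only for as long as it needs a linear view.

```c
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/kernel.h>
#include <linux/types.h>

/* Hypothetical scatter-gather buffer: pages + length, no linear mapping. */
struct sg_buf {
	struct page **pages;
	unsigned int npages;
	size_t len;
};

/* Example transform: sum the buffer one briefly-mapped page at a time. */
static u32 sg_buf_sum(const struct sg_buf *sg)
{
	u32 sum = 0;
	size_t left = sg->len;
	unsigned int i;

	for (i = 0; i < sg->npages && left > 0; i++) {
		size_t n = min_t(size_t, left, PAGE_SIZE);
		const u8 *p = kmap(sg->pages[i]);	/* short-lived mapping */
		size_t j;

		for (j = 0; j < n; j++)
			sum += p[j];

		kunmap(sg->pages[i]);
		left -= n;
	}
	return sum;
}
```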
So #155 should be viewed as a short/medium term solution for the official 0.6.0 release. These are low risk changes designed to stabilize the code. When we get to the point where we can use a zvol as a swap device, we'll know we've reached a point of good stability in regards to the memory management. @ryao's test case above is a very good stress case to make sure we got it right. Once things are stable we can go about making the needed long term changes in the 0.7.0 development branch since they may be destabilizing.
Thanks for taking the time to explain!
I may be way off base, but in reading up on the page cache I noticed in …
Have mempools been considered to address the low memory / swap issues?
They are only really suited for small allocations.
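For reference, a minimal example of the mempool API being asked about (object names are hypothetical): a mempool keeps a minimum number of objects reserved from a backing slab cache, which is why it works well for small, fixed-size allocations but not for the multi-megabyte slabs discussed in this thread.

```c
#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/errno.h>

static struct kmem_cache *io_hdr_cache;	/* hypothetical small object */
static mempool_t *io_hdr_pool;

static int example_init(void)
{
	io_hdr_cache = kmem_cache_create("io_hdr", 192, 0, 0, NULL);
	if (!io_hdr_cache)
		return -ENOMEM;

	/* Keep at least 16 objects reserved so allocation can always succeed. */
	io_hdr_pool = mempool_create_slab_pool(16, io_hdr_cache);
	if (!io_hdr_pool) {
		kmem_cache_destroy(io_hdr_cache);
		return -ENOMEM;
	}
	return 0;
}

static void *io_hdr_get(gfp_t gfp)
{
	/* Falls back to the reserved objects when the slab allocation fails. */
	return mempool_alloc(io_hdr_pool, gfp);
}

static void io_hdr_put(void *hdr)
{
	mempool_free(hdr, io_hdr_pool);
}
```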
For those interested in this work, could you please test the following spl+zfs pull requests? They allow me to safely use a zvol based swap device, but I'd like to see this code get more testing. It should significantly improve things under low memory conditions.
The SPL SLAB code relied on __vmalloc() for its allocations. The Linux kernel ignores flags passed to __vmalloc(), which required a kernel patch to avoid deadlock. We replace these allocations with allocations from the Linux SLAB allocator to avoid this.
The maximum allocation size of __vmalloc() far exceeded the maximum slab size of the Linux kernel. The kernel is not prepared to satisfy allocations of such size under low memory conditions, which can also cause deadlocks. We change the maximum SLAB size to the maximum supported by Linux to avoid that.
Allocations of spl_kmem_slab_t structures were particularly large, which required suppressing a warning. It could potentially cause failures under the right conditions when loading the module. We replace these allocations with allocations from the SLAB allocator, which should eliminate such problems. It also enables us to do allocations in a way that minimizes the potential for false sharing, which should improve the performance of the SPL SLAB code.
These changes make the SPL SLAB implementation a wrapper for the Linux SLAB allocator. The Linux SLAB objects are the SPL slabs from which SPL SLAB objects are created. This is the partial rewrite I promised in #145.