This repository was archived by the owner on Feb 26, 2020. It is now read-only.

Use Linux SLAB allocator for SPL SLAB allocations #147

Closed
wants to merge 2 commits into from

Conversation

ryao
Contributor

@ryao ryao commented Aug 1, 2012

The SPL SLAB code relied on __vmalloc() for its allocations. The Linux kernel ignores the flags passed to __vmalloc(), which required a kernel patch to avoid deadlock. We replace these allocations with allocations from the Linux SLAB allocator so that the patch is no longer needed.

The maximum allocation size of __vmalloc() far exceeded the maximum slab size of the Linux kernel. The kernel is not prepared to satisfy allocations of such size under low memory conditions, which can also cause deadlocks. We change the maximum SLAB size to the maximum supported by Linux to avoid that.

Allocations of spl_kmem_slab_t structures were particularly large, which required suppressing a warning. It could potentially cause failures under the right conditions when loading the module. We replace these allocations with allocations from the SLAB allocator, which should eliminate such problems. It also enables us to do allocations in a way that minimizes the potential for false sharing, which should improve the performance of the SPL SLAB code.

These changes make the SPL SLAB implementation a wrapper for the Linux SLAB allocator. The Linux SLAB objects are the SPL slabs from which SPL SLAB objects are created. This is the partial rewrite I promised in #145.
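
For illustration, here is a minimal sketch of the direction these commits take: backing object allocations with a kmem_cache created with SLAB_HWCACHE_ALIGN instead of carving objects out of __vmalloc()'d slabs. The structure and cache names below are hypothetical, not the actual SPL symbols.

#include <linux/errno.h>
#include <linux/slab.h>

/*
 * Illustrative sketch only: back an object cache directly with the
 * Linux SLAB allocator instead of __vmalloc()'d slabs.  my_obj_t and
 * my_cache are hypothetical names, not SPL symbols.
 */
typedef struct my_obj {
	int	mo_state;
	char	mo_payload[512];
} my_obj_t;

static struct kmem_cache *my_cache;

static int my_cache_init(void)
{
	/* SLAB_HWCACHE_ALIGN pads objects to cache-line boundaries,
	 * which reduces false sharing between CPUs. */
	my_cache = kmem_cache_create("my_obj_cache", sizeof (my_obj_t),
	    0, SLAB_HWCACHE_ALIGN, NULL);
	return (my_cache ? 0 : -ENOMEM);
}

static my_obj_t *my_obj_alloc(void)
{
	/* GFP flags passed here are honored, unlike the flags passed
	 * to __vmalloc() on unpatched kernels. */
	return (kmem_cache_alloc(my_cache, GFP_KERNEL));
}

static void my_obj_free(my_obj_t *obj)
{
	kmem_cache_free(my_cache, obj);
}

static void my_cache_fini(void)
{
	kmem_cache_destroy(my_cache);
}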

@ryao
Contributor Author

ryao commented Aug 1, 2012

This has the side effect of causing the kmem:slab_large splat test to fail. The kernel does not maintain a large enough reserve of memory to permit such large allocations to succeed under memory pressure. Requesting them under such conditions will either deadlock or severely harm performance. It probably would be best to stop supporting large slabs.

@chrisrd
Contributor

chrisrd commented Aug 1, 2012

@ryao Would you like to comment on your approach of using the native linux slab in light of point 2 in the comments from spl/spl-kmem.c:

/*
 * Slab allocation interfaces
 *
 * While the Linux slab implementation was inspired by the Solaris
 * implementation I cannot use it to emulate the Solaris APIs.  I
 * require two features which are not provided by the Linux slab.
 *
 * 1) Constructors AND destructors.  Recent versions of the Linux
 *    kernel have removed support for destructors.  This is a deal
 *    breaker for the SPL which contains particularly expensive
 *    initializers for mutexes, condition variables, etc.  We also
 *    require a minimal level of cleanup for these data types, unlike
 *    many Linux data types which do not need to be explicitly destroyed.
 *
 * 2) Virtual address space backed slab.  Callers of the Solaris slab
 *    expect it to work well for both small and very large allocations.
 *    Because of memory fragmentation the Linux slab, which is backed
 *    by kmalloc'ed memory, performs very badly when confronted with
 *    large numbers of large allocations.  Basing the slab on the
 *    virtual address space removes the need for contiguous pages
 *    and greatly improves performance for large allocations.
 *
 * For these reasons, the SPL has its own slab implementation with
 * the needed features.  It is not as highly optimized as either the
 * Solaris or Linux slabs, but it should get me most of what is
 * needed until it can be optimized or obsoleted by another approach.
 *

In particular you've already seen the kmem:slab_large splat test failing, which is a direct, known consequence of using the native linux slab.

Also, it now seems there may be a good amount of redundant work with the kmem slab sitting on top of the linux slab, e.g. both appear to keep track of per-cpu locality etc.

@ryao
Contributor Author

ryao commented Aug 1, 2012

These are good points, but I believe that these changes are an improvement.

The virtual memory address backed SLAB was nice in principle, but in practice, it did not work very well. First, the code path for small allocations is not used in practice, so nothing benefited from it. Second, the previous approach relied on a global lock that serialized SLAB allocations. The elimination of this lock should compensate for the overhead of the redundant code. Third, the Linux kernel's virtual memory allocator is intentionally crippled to prevent developers from doing what we currently do. The effect is that unless end users patch their kernels, they will suffer from stability problems. People have found the kernel patch to do wonders for system stability, but it has a negligible chance of being accepted upstream.

The loss of large slab allocations is in some sense an improvement. It is extremely difficult, if not impossible, for the current Linux kernel to allocate large slabs under extreme memory pressure without tweaks. Much critical code depends on SLAB allocations and that code is not designed to handle allocation failures. I have not examined the kernel's code in detail to know exactly what should happen in these circumstances, but one of three scenarios is theoretically possible from what I know.

  1. The kernel tries to free memory from caches and it barely succeeds, taking an inordinate amount of time in the process and causing the system to lag.
  2. The kernel tries to free memory from its caches, but that is impossible, so it returns a failure, causing a kernel panic because our code cannot handle a failure. I should note that there could be cases where we can handle a failure, but those aren't important here.
  3. The kernel tries to free memory from its caches, but that is impossible, so it loops infinitely, in a deadlock-like state.

The kernel is not intended to provide such large allocations and they cause stability problems when paging to a ZFS swap zvol under load. Even when swap on ZFS is not in use, allocating large slabs as the code currently does under memory pressure poses performance issues.

Lastly, the definition of a large allocation is relative. The Linux SLAB allocator has a maximum object size of 32MB. The SPL SLAB allocator should have no problems performing efficiently as long as SLAB objects are no greater than a few megabytes in size. I am not aware of any situations in which they approach 1MB, so I do not think that this limitation is a problem.

@behlendorf
Contributor

I'm also particularly concerned about point 2) in my comment above. This is in effect 90% of the reason I was forced to write the spl slab in the first place. I would have loved to have been able to avoid this work. But at the time this was the cleanest short/medium term solution. The problem is that fundamentally the Linux kernel is very bad at handling large contiguous memory allocations; it was never designed to do this well. They should be avoided.

My primary concern is that you're now going to be regularly attempting to allocate multi-megabyte chunks of contiguous physical address space. Now on a freshly booted system that's probably not going to be a huge issue, but as the system memory fragments over time it becomes increasingly expensive to perform these allocations. It can even degrade to the point where these large allocations are impossible.

Additionally, I'd expect changes like #145 to only further compound the issue by reducing the number of locations pages can be easily dropped to free up the required address space.

These issues are subtle and will not likely crop up until memory gets heavily fragmented. And even then I doubt they will result in an outright failure, but you will see a performance impact. For a tiny pool that performance impact may be small. But for larger pools which are expected to sustain GB/s of throughput this will be a problem.

So while these changes may make sense for a desktop, they will cause problems for enterprise systems. Additionally, @chrisrd is correct about these changes introducing a lot of new redundancy with the kernel slab in regard to per-cpu locality and cache aging.

Now, rather than dissuade you from pursuing this further, I want to suggest that you take what you started several steps farther: create a completely alternate spl slab implementation which is thinly layered on top of the kernel slab. This should be relatively straightforward, with the exception of implementing destructors, which may be difficult. Then add a new configure option to select which slab implementation should be used. This resolves a couple of issues:

  • Eliminates the use of vmalloc().
  • Eliminates redundancy between the spl and kernel slab implementations.
  • Resolves the outstanding slab related preempt issues.
  • Allows for easier wide scale testing.
  • Is work we'd need to do anyway after removing ZFS's dependency on large allocations.
  • Can be easily enabled in downstream distributions like Gentoo.

@ryao
Contributor Author

ryao commented Aug 1, 2012

@behlendorf What are your thoughts about reducing the maximum SLAB size ahead of this? Large slabs should pose problems under low memory conditions.

Also, what do you think about commits ryao/spl@5517b73 and ryao/spl@353d751? They should be harmless.

@behlendorf
Contributor

@ryao For which implementation? The current assumption is that large slabs are vmalloc() based, so making them a few megabytes was reasonable considering the overhead of acquiring the global spin lock. If you're talking about backing them with the Linux slab then yes, they would need to get smaller.

Although I'd prefer to simply abandon the existing implementation entirely if we shift to using the Linux slab: we would just leave all those slab management details to the Linux slab implementation and live with a maximum object size of 128K. I was thinking we could even trivially handle the constructors/destructors by just calling them for each alloc/free respectively. That's not 100% optimal but it is semantically correct and simple. This would allow the spl slab layer over the linux slab to be very thin.
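
A minimal sketch of that thin layer, assuming Solaris-style constructor/destructor signatures; spl_cache_t and its fields below are illustrative, not the real SPL structures.

#include <linux/slab.h>

/*
 * Illustrative sketch: emulate Solaris-style constructors/destructors
 * by invoking them on every allocation and free, since recent Linux
 * slabs no longer run destructors.  spl_cache_t is a hypothetical type.
 */
typedef struct spl_cache {
	struct kmem_cache *skc_linux_cache;
	int (*skc_ctor)(void *obj, void *priv, int flags);
	void (*skc_dtor)(void *obj, void *priv);
	void *skc_private;
} spl_cache_t;

static void *spl_cache_alloc(spl_cache_t *skc, gfp_t flags)
{
	void *obj = kmem_cache_alloc(skc->skc_linux_cache, flags);

	/* Run the constructor on every object handed out. */
	if (obj && skc->skc_ctor &&
	    skc->skc_ctor(obj, skc->skc_private, flags) != 0) {
		kmem_cache_free(skc->skc_linux_cache, obj);
		obj = NULL;
	}
	return (obj);
}

static void spl_cache_free(spl_cache_t *skc, void *obj)
{
	/* Run the destructor before the object goes back to Linux. */
	if (skc->skc_dtor)
		skc->skc_dtor(obj, skc->skc_private);
	kmem_cache_free(skc->skc_linux_cache, obj);
}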

@behlendorf
Contributor

Also, what do you think about commits ryao/spl@5517b73 and ryao/spl@353d751? They should be harmless.

Oh yeah. I was going to comment on them but it slipped my mind while responding to the rest of it. Good catch, I don't have any objection to those two improvements/fixes. Let me take a second careful look at them and I'll get them tested and merged.

@ryao
Contributor Author

ryao commented Aug 1, 2012

This would be for the current implementation. The kernel only maintains so much reserve memory (controlled via vm.min_free_kbytes). If the SPL slab size exceeds that reserve, the system is under memory pressure, and an allocation is made from a cache whose slabs are all full, the allocation could make things go from bad to worse. The use of virtual memory won't prevent problems in this scenario.

With that said, there is an alternative to this that we can consider. We can write our own primitive virtual memory allocator with some bookkeeping. Whenever an allocation is made, we call alloc_vmap_area() to get virtual address space. The allocator can then obtain pages by examining the kernel's free space and patching together the smallest fragments by mapping them into our virtual address space, doing bookkeeping much like vm_area_structs do in processes, and return the result to the caller. A free function would take that, look through our records to identify which fragments composed it, unmap them and release the area.
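
A minimal sketch of that concept, assuming the exported vmap()/vunmap() interfaces rather than alloc_vmap_area() directly: satisfy a large request from order-0 pages and stitch them together in virtual address space, keeping the page list around as the bookkeeping needed to free it later. frag_buf and the function names are hypothetical.

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Hypothetical container: the mapping plus the bookkeeping to undo it. */
struct frag_buf {
	void		*fb_addr;	/* contiguous virtual mapping */
	struct page	**fb_pages;	/* backing pages (bookkeeping) */
	unsigned int	fb_count;
};

static int frag_alloc(struct frag_buf *fb, size_t size, gfp_t gfp)
{
	unsigned int i;

	fb->fb_count = DIV_ROUND_UP(size, PAGE_SIZE);
	fb->fb_pages = kcalloc(fb->fb_count, sizeof (*fb->fb_pages), gfp);
	if (!fb->fb_pages)
		return (-ENOMEM);

	/* Order-0 pages: no physically contiguous memory required. */
	for (i = 0; i < fb->fb_count; i++) {
		fb->fb_pages[i] = alloc_page(gfp);
		if (!fb->fb_pages[i])
			goto fail;
	}

	/* Stitch the scattered pages into one virtual range. */
	fb->fb_addr = vmap(fb->fb_pages, fb->fb_count, VM_MAP, PAGE_KERNEL);
	if (fb->fb_addr)
		return (0);
fail:
	while (i--)
		__free_page(fb->fb_pages[i]);
	kfree(fb->fb_pages);
	return (-ENOMEM);
}

static void frag_free(struct frag_buf *fb)
{
	unsigned int i;

	vunmap(fb->fb_addr);
	for (i = 0; i < fb->fb_count; i++)
		__free_page(fb->fb_pages[i]);
	kfree(fb->fb_pages);
}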

@behlendorf
Contributor

Concerning reducing the current max slab size, we could certainly try reducing it if you think it is aggravating the low memory issue. That initial value was only an educated guess I made at the time, trying to balance the vmalloc() spin lock contention against how much memory could be easily allocated. If testing shows reducing it is helpful and it doesn't come at the expense of performance, we can do it.

I'm much more leery about writing our own VMA. Better to first try simply layering the spl's slab on the Linux slab and see how well that works in practice. At least that's development effort in the right direction IMHO. Plus, I've never really seriously tried it. It might work far better than I expect.

@ryao
Contributor Author

ryao commented Aug 2, 2012

Ignore the 'ryao added some commits' message. Those are the same commits. I accidentally removed them from this branch when I meant to remove them from another branch. I haven't decided what to do about this branch yet.

@behlendorf
Contributor

I'd be very interested to take this one step further with a very thin compatibility layer on top of the Linux slab.

As I mentioned above, when I originally wrote the spl's virtual memory backed slab it was because I was concerned about performance. But at the time there was no Posix layer in place, so it wasn't easy to do any real performance comparisons between the spl's slab and the Linux slab. I may have overestimated that impact, particularly since I was primarily concerned about sustaining several GB/s through IB or 10Gig attached servers. For a desktop workload this may be a price worth paying for now since it should improve stability.

@ryao
Contributor Author

ryao commented Aug 6, 2012

@behlendorf These patches appear to be triggering the memory fragmentation issue on my desktop. My desktop applications are fairly memory intensive and on top of that, I compile software in parallel quite often, so the kernel memory is stressed much more than it would be on an ordinary desktop.

With that said, I think I have thought of a solution that would work well on both Sequoia and desktop computers. The idea is to modify the SPL SLAB implementation to create a worker thread for each cache created and always maintain an extra slab. When a new slab is needed, the extra SLAB is consumed and the worker thread is notified to allocate a new one via GFP_KERNEL.

This will enable SLAB allocations to be done using GFP_KERNEL in a manner that is safe, by blocking asynchronously. Since each cache has its own worker thread, no additional allocations are needed to do this, so we will not risk making things worse under low memory conditions.

An issue with this is what happens when the worker thread does not successfully allocate a new SLAB in time, which isn't handled. In principle, each SLAB should be able to contain a sufficiently large number of objects that this will never happen, but it is something that should be handled somehow. This could be handled by cannibalizing other SLABs, although that is a fairly elaborate thing to implement. Panicking would probably be acceptable in an initial implementation until a more elaborate mechanism for handling this state can be written.

Lastly, this will make the SPL slabs even more memory hungry, although that is probably worth the system stability that this would provide.
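
To make the reserve-slab idea concrete, here is a minimal sketch under the assumption that a generic work item stands in for the dedicated per-cache thread; cache_t and slab_t are illustrative names, not the SPL's real types.

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

typedef struct slab {
	void *s_base;			/* one slab's worth of objects */
} slab_t;

typedef struct cache {
	spinlock_t c_lock;
	slab_t *c_reserve;		/* pre-allocated spare slab */
	size_t c_slab_size;
	struct work_struct c_refill;	/* INIT_WORK() at cache creation */
} cache_t;

static slab_t *slab_alloc(size_t size, gfp_t gfp)
{
	slab_t *slab = kmalloc(sizeof (*slab), gfp);

	if (slab) {
		slab->s_base = kmalloc(size, gfp);
		if (!slab->s_base) {
			kfree(slab);
			slab = NULL;
		}
	}
	return (slab);
}

/* Runs in process context, so blocking on GFP_KERNEL is safe here. */
static void cache_refill_work(struct work_struct *work)
{
	cache_t *cp = container_of(work, cache_t, c_refill);
	slab_t *slab = slab_alloc(cp->c_slab_size, GFP_KERNEL);

	if (!slab)
		return;			/* a real implementation would retry */

	spin_lock(&cp->c_lock);
	if (!cp->c_reserve) {
		cp->c_reserve = slab;
		slab = NULL;
	}
	spin_unlock(&cp->c_lock);

	if (slab) {			/* lost a race; drop the extra slab */
		kfree(slab->s_base);
		kfree(slab);
	}
}

/* Called when the cache needs a fresh slab and must not block. */
static slab_t *cache_take_reserve(cache_t *cp)
{
	slab_t *slab;

	spin_lock(&cp->c_lock);
	slab = cp->c_reserve;
	cp->c_reserve = NULL;
	spin_unlock(&cp->c_lock);

	schedule_work(&cp->c_refill);	/* refill asynchronously */
	return (slab);			/* NULL: reserve already exhausted */
}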

@ryao
Contributor Author

ryao commented Aug 6, 2012

On second thought, it would be better to use a single thread and just allocate additional space in the cache's structure to be used when the request is made to that thread. That will reduce spinlock contention and possibly have a few other benefits that I am still considering.

Also, I neglected to mention in my previous comment that I believe that the initial results were promising because the kernel's buddy allocator (from which the slabs are allocated) was able to satisfy the smaller allocations under situations of memory stress immediately after booting when the kernel memory was not too badly fragmented. After several days of fragmentation, this ceased to be the case.

This explanation is consistent with the idea that smaller slab sizes are better under low memory conditions and this should hold even when virtual memory is used. The idea in my previous comment should also eliminate virtual memory allocation overhead from slab allocations, which should eliminate contention concerns.

@behlendorf
Contributor

@ryao Concerning the fragmentation issues you observed. I'm curious, how did they manifest themselves on your system? Were they outright allocation failures or did the system just become painfully slow to use?

Also, in your original patch IIRC you were allocating the SPL slab from a Linux slab which would result in huge Linux slab objects (10's of MBs). Have you tried simply abandoning the SPL slab entirely? You could just kmalloc() the slab object outright which would limit things to a maximum order-5 allocation size for 128KB objects. Once again this would be slow, but it might be small enough to be fairly stable.

Concerning the idea of using a separate thread to allocate these slabs. I had a similar thought but I discarded it after further reflection because it didn't completely solve the existing PF_MEMALLOC related issues. Even if the allocation occurs as GFP_KERNEL in a separate thread there's no guarantee it will be able to complete. For example, it could still deadlock in the case where the txg_sync thread triggers the allocation and then asynchronously polls for the new slab. The new allocation might still block attempting to write back pages to zfs since it's GFP_KERNEL. And since the txg_sync thread can't progress until it gets its memory we're back to our original deadlock. Although perhaps it would be more unlikely.

I suppose we could kick off additional new slab allocations hoping that eventually one of them would succeed. Presumably, we'd eventually be able to discard a set of pages without triggering writeback to zfs and then be able to proceed. But that feels like a serious hack. In this case, I'd suggest we use the generic linux work queues rather than a dedicated thread.

Anyway, I never tested the idea of doing the allocations in a separate thread. I just thought it through and abandoned it. But perhaps you can convince me I missed something and there's an easy way to prevent the above issue.

@ryao
Contributor Author

ryao commented Aug 7, 2012

@behlendorf The system hung while waiting for allocations. Using Magic SysRq + E enabled me to execute init 3 to bring the system back to a usable state until the next hang. That was until this morning, when even that would not revive it.

I have thought about abandoning the SPL slabs entirely, but my desire for a solution that is universally applicable led me to consider the separate thread approach. If that does not work, then I suppose this would be the only way, although I would prefer to explore that possibility first.

I have observed the txg_sync thread problem in the current code when doing swap on ZFS under high memory stress. The approach of asynchronously allocating slabs in a separate thread is still susceptible to this, but it is much less likely. When low memory conditions occur, the n+1th slab would already be available by virtue of having been allocated before the low memory conditions, so things would keep going until the n+2th slab is needed. If each slab stores a sufficiently large number of objects to permit things to continue operating until the low memory condition has been alleviated, then there will be no problems. If not, things will deadlock, as they do now.

That is a separate case from the situation where swap is not on ZFS, in which case these SPL slab allocations are not necessary for swap to occur and the kernel should be able to solve the low memory condition on its own. If that happens, then the only change we will see from this is a performance increase from the amortized O(1) slab allocations that result from allocating the next slab in advance.

I think a dedicated thread would be better than generic linux work queues for two reasons. The first is that the work queues would attempt to do allocations in parallel, which will cause unnecessary spin lock contention. The second is that the use of a single thread provides us with the opportunity to implement prioritization, which enables more important caches to have the next slab allocated (such as any used in txg_sync) before others.

I am still thinking about what the right way to implement prioritization would be. I am currently thinking of applying Bayes' theorem based on the frequency of allocations from the SLAB that pass GFP_HIGHMEM, but I want to implement the asynchronous allocations before I consider this too heavily. I had neglected to mention this in my previous comment because I was thinking of implementing the worker thread first and then implementing this prioritization at a later date, should the worker thread approach alone prove insufficient.

Let me know what you think.

@behlendorf
Contributor

I have thought about abandoning the SPL slabs entirely, but my desire for a solution that is universally applicable led me to consider the separate thread approach. If that does not work, then I suppose this would be the only way, although I would prefer to explore that possibility first.

@ryao Yes, I understand. I'd prefer a clean generic solution for this issue as well. I'm certainly not opposed to pursuing the dedicated thread approach. Hopefully, it will work well enough that we can rely on it for a proper 0.6.0 stable release.

That would buy us the latitude and time to do things right in 0.7.0 and remove this virtual memory dependency from the zfs code. This is something I've wanted to do for quite some time since a lot of good will come from it. But I think we need a stable 0.6.0 release first before we potentially go and destabilize things. So anything we can do to the spl slab to achieve stability I'm interested in... including a lower performing but entirely stable implementation if needed.

If each slab stores a sufficiently large number of objects to permit things to continue operating until the low memory condition has been alleviated, then there will be no problems. If not, things will deadlock, as they do now.

Right, but is a single slab enough? I'm not totally convinced it will be. But once again, I don't have any real data on this, just a gut feeling which I'd love to be wrong about. So I'm all for trying it.

That is a separate case from the situation where swap is not on ZFS

Not entirely (see http://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L239 ). The same issue can potentially occur for mmap(2) writes. The mapped region in the page cache can be dirtied and then forced out to disk via .writepage() when there is memory pressure. An unlikely, but possible case.

But yes, in general I agree. If you're using a non-zfs swap device then we should be able to make forward progress and get the memory we need. Although I suspect we won't see much of a performance improvement; we're already typically hiding most of the allocation time by using slabs in the first place and keeping per-cpu magazines.

I think a dedicated thread would be better than generic linux work queues for two reasons. The first is that the work queues would attempt to do allocations in parallel, which will cause unnecessary spin lock contention. The second is that the use of a single thread provides us with the opportunity to implement prioritization, which enables more
important caches to have the next slab allocated (such as any used in txg_sync) before others.

Both good points. My primary concern about using a single thread is that if it gets deadlocked for any reason down in vmalloc() we're out of luck. If we were to use work queues we could easily dispatch another vmalloc() work item, which could very well succeed because it will reclaim an entirely different set of pages. This would be more resilient and should help ensure forward progress.

It should also cause no more contention on the virtual memory address space spinlock than we see today, and that appears to be at a tolerable level. This would make prioritizing the allocations harder, but I'm not sure that's really critical, particularly because it's not at all clear to me that at this level the spl is capable of making a good decision about which allocations to prioritize.

Let me know what you think.

I think you should give this a try and we'll see how it holds up to real usage. If it works well, and the implementation is clean, then I don't have any objections to merging it.

@ryao
Contributor Author

ryao commented Aug 15, 2012

I tried the delayed allocation approach, but it did not work as well as I had hoped. Having an additional slab available was insufficient to ensure system stability. The worker thread also triggered the hung task timer during idle time.

It seems that writing a wrapper for the Linux SLAB allocator is the only alternative.

@ryao
Contributor Author

ryao commented Aug 16, 2012

I am going to close this until another solution is ready.

@ryao ryao closed this Aug 16, 2012
This replaces kmem_zalloc() allocations of spl_kmem_cache objects that
are approximately 32KB in size with Linux SLAB allocations. This
eliminates the need to suppress debug warnings and makes the changes
necessary to replace the __vmalloc() in kv_alloc() with SLAB
allocations.

We also take advantage of the SLAB allocator's SLAB_HWCACHE_ALIGN flag
to reduce false sharing, which could increase performance.

Signed-off-by: Richard Yao <[email protected]>
@ryao
Contributor Author

ryao commented Aug 16, 2012

I took time to implement a wrapper for the Linux SLAB allocator. It still needs a configure script option to toggle it, but it is at the point where it can be tested and reviewed.

This should guarantee that gfp flags are honored without a kernel patch. It should also improve things on 32-bit systems. Unfortunately, it is not perfect. Running 4 instances of python -c "print 2**10**10" on my quad core processor led to a deadlock when doing swap on a zvol after about 200MB of data had been written. The magic system request key was unable to make the system recover, whereas it had no problems recovering from this with the virtual memory backed SLAB implementation. I suspect that memory fragmentation is the cause of this regression. In addition, the Linux SLAB allocator allocates only one object per slab for objects that span more than 8 pages, which eliminates the main benefit of using the SLAB allocator.

At present, I have SPL patched to use the FreeBSD solution of having KM_SLEEP and KM_PUSHPAGE be aliases of one another. Modifying that so that only KM_PUSHPAGE uses GFP_HIGHMEM would help here, although I do not expect that to solve this problem.
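
For context, a minimal sketch of one possible KM_* to gfp_t mapping; the flag values and the particular GFP choices below are illustrative assumptions, not the SPL's or FreeBSD's actual definitions.

#include <linux/gfp.h>

/* Hypothetical KM_* flag values for illustration only. */
#define KM_SLEEP	0x01	/* may block and perform I/O */
#define KM_PUSHPAGE	0x02	/* may block, but must not re-enter the fs */
#define KM_NOSLEEP	0x04	/* may not block at all */

static inline gfp_t kmflags_to_gfp(int kmflags)
{
	if (kmflags & KM_NOSLEEP)
		return (GFP_ATOMIC);
	if (kmflags & KM_PUSHPAGE)
		return (GFP_NOIO | __GFP_HIGH);	/* avoid writeback recursion */
	return (GFP_KERNEL);
}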

@ryao ryao reopened this Aug 16, 2012
This should eliminate the need for a kernel patch and improve things on
32-bit systems.

Signed-off-by: Richard Yao <[email protected]>
@chrisrd
Contributor

chrisrd commented Aug 16, 2012

Sorry, I don't have the expertise to do a fine grained review, but I definitely like the approach of better integration with the rest of linux, and the reduction in complexity! I'll give this a try.

"the Linux SLAB allocator allocates only one object per slab for objects that span more than 8 pages"

If you're using the default SLUB allocator (CONFIG_SLUB=y) rather than SLAB, it looks like the 8 pages is configurable using slub_max_order, per Documentation/vm/slub.txt.

What is producing the large objects, and how prevalent are they?

@chrisrd
Contributor

chrisrd commented Aug 16, 2012

FYI I've had ryao/spl@4e2562a + cherry-pick ryao/spl@b0bc39c running on linux-3.4 (w/ CONFIG_SLUB) for nearly 5 hrs, loaded as an rsync target and doing a resilver. Some stats may be of interest:

# egrep '# name|arc|dnode|zfs|zio' /proc/slabinfo
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
zfs_znode_cache   8018670 8018670    968   33    8 : tunables    0    0    0 : slabdata 242990 242990      0
arc_buf_hdr_t     2303314 2605112    376   43    4 : tunables    0    0    0 : slabdata  60584  60584      0
dnode_t           8028018 8028076   1136   28    8 : tunables    0    0    0 : slabdata 286717 286717      0
zio_buf_131072      2486   2486 131072    1   32 : tunables    0    0    0 : slabdata   2486   2486      0
zio_buf_126976        37     37 126976    1   32 : tunables    0    0    0 : slabdata     37     37      0
zio_buf_122880        37     37 122880    1   32 : tunables    0    0    0 : slabdata     37     37      0
zio_buf_118784        44     44 118784    1   32 : tunables    0    0    0 : slabdata     44     44      0
zio_buf_114688        40     40 114688    1   32 : tunables    0    0    0 : slabdata     40     40      0
zio_buf_110592        38     38 110592    1   32 : tunables    0    0    0 : slabdata     38     38      0
zio_buf_106496        34     34 106496    1   32 : tunables    0    0    0 : slabdata     34     34      0
zio_buf_102400        39     39 102400    1   32 : tunables    0    0    0 : slabdata     39     39      0
zio_buf_98304         39     39  98304    1   32 : tunables    0    0    0 : slabdata     39     39      0
zio_buf_94208         39     39  94208    1   32 : tunables    0    0    0 : slabdata     39     39      0
zio_buf_90112         32     32  90112    1   32 : tunables    0    0    0 : slabdata     32     32      0
zio_buf_86016         40     40  86016    1   32 : tunables    0    0    0 : slabdata     40     40      0
zio_buf_81920         39     39  81920    1   32 : tunables    0    0    0 : slabdata     39     39      0
zio_buf_77824         44     44  77824    1   32 : tunables    0    0    0 : slabdata     44     44      0
zio_buf_73728         37     37  73728    1   32 : tunables    0    0    0 : slabdata     37     37      0
zio_buf_69632         32     32  69632    1   32 : tunables    0    0    0 : slabdata     32     32      0
zio_buf_65536         32     32  65536    1   16 : tunables    0    0    0 : slabdata     32     32      0
zio_buf_61440         32     32  61440    1   16 : tunables    0    0    0 : slabdata     32     32      0
zio_buf_57344         40     40  57344    1   16 : tunables    0    0    0 : slabdata     40     40      0
zio_buf_53248         38     38  53248    1   16 : tunables    0    0    0 : slabdata     38     38      0
zio_buf_49152         34     34  49152    1   16 : tunables    0    0    0 : slabdata     34     34      0
zio_buf_45056         36     36  45056    1   16 : tunables    0    0    0 : slabdata     36     36      0
zio_buf_40960         28     28  40960    1   16 : tunables    0    0    0 : slabdata     28     28      0
zio_buf_36864         31     31  36864    1   16 : tunables    0    0    0 : slabdata     31     31      0
zio_buf_32768         34     34  32768    1    8 : tunables    0    0    0 : slabdata     34     34      0
zio_buf_28672         34     34  28672    1    8 : tunables    0    0    0 : slabdata     34     34      0
zio_buf_24576         35     35  24576    1    8 : tunables    0    0    0 : slabdata     35     35      0
zio_buf_20480         43     43  20480    1    8 : tunables    0    0    0 : slabdata     43     43      0
zio_buf_16384     327632 362700  16384    2    8 : tunables    0    0    0 : slabdata 181350 181350      0
zio_buf_14336         72     72  14336    2    8 : tunables    0    0    0 : slabdata     36     36      0
zio_buf_12288         80     80  12288    2    8 : tunables    0    0    0 : slabdata     40     40      0
zio_buf_10240        108    108  10240    3    8 : tunables    0    0    0 : slabdata     36     36      0
zio_buf_7168         136    136   7168    4    8 : tunables    0    0    0 : slabdata     34     34      0
zio_buf_6144         170    200   6144    5    8 : tunables    0    0    0 : slabdata     40     40      0
zio_buf_5120         228    228   5120    6    8 : tunables    0    0    0 : slabdata     38     38      0
zio_buf_3584         585    585   3584    9    8 : tunables    0    0    0 : slabdata     65     65      0
zio_buf_3072         660    660   3072   10    8 : tunables    0    0    0 : slabdata     66     66      0
zio_buf_2560         612    648   2560   12    8 : tunables    0    0    0 : slabdata     54     54      0
zio_buf_1536        1491   1575   1536   21    8 : tunables    0    0    0 : slabdata     75     75      0
zio_cache           4729   5066    952   34    8 : tunables    0    0    0 : slabdata    149    149      0

# cat /proc/spl/kstat/zfs/arcstats
4 1 0x01 77 3696 960656445594438 982313097492247
name                            type data
hits                            4    49394050
misses                          4    9990995
demand_data_hits                4    4227699
demand_data_misses              4    90960
demand_metadata_hits            4    30689773
demand_metadata_misses          4    1571106
prefetch_data_hits              4    2010479
prefetch_data_misses            4    2444305
prefetch_metadata_hits          4    12466099
prefetch_metadata_misses        4    5884624
mru_hits                        4    15386917
mru_ghost_hits                  4    407709
mfu_hits                        4    19566431
mfu_ghost_hits                  4    850222
deleted                         4    11946503
recycle_miss                    4    4567164
mutex_miss                      4    33076
evict_skip                      4    781364838
evict_l2_cached                 4    0
evict_l2_eligible               4    579960024064
evict_l2_ineligible             4    111211535360
hash_elements                   4    1880506
hash_elements_max               4    2889638
hash_collisions                 4    13255840
hash_chains                     4    560761
hash_chain_max                  4    14
p                               4    4253757440
c                               4    21474836480
c_min                           4    10737418240
c_max                           4    21474836480
size                            4    21474906792
hdr_size                        4    744804408
data_size                       4    5838839296
other_size                      4    14891263088
anon_size                       4    83968
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    4142153216
mru_evict_data                  4    332791808
mru_evict_metadata              4    113319936
mru_ghost_size                  4    17334272512
mru_ghost_evict_data            4    24379392
mru_ghost_evict_metadata        4    17309893120
mfu_size                        4    1696602112
mfu_evict_data                  4    38666240
mfu_evict_metadata              4    139264
mfu_ghost_size                  4    4140443648
mfu_ghost_evict_data            4    422969344
mfu_ghost_evict_metadata        4    3717474304
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    21103448744
arc_meta_limit                  4    21474836480
arc_meta_max                    4    21487001072

# perf top
Events: 115K cycles                                                                                                                              
 24.41%  rsync                          [.] md5_process
  9.87%  [aesni_intel]                  [k] _aesni_dec4
  7.21%  rsync                          [.] get_checksum1
  5.50%  [kernel]                       [k] native_write_cr0
  5.10%  [zcommon]                      [k] fletcher_4_native
  3.95%  [unknown]                      [.] 0x00007f957dd5d027
  3.61%  [aesni_intel]                  [k] _aesni_enc1
  2.62%  [kernel]                       [k] memmove
  2.00%  [aesni_intel]                  [k] aesni_cbc_dec
  1.83%  [kernel]                       [k] copy_user_generic_string
  1.76%  [kernel]                       [k] __ticket_spin_lock
  1.35%  [dm_crypt]                     [k] crypt_convert
  1.08%  [kernel]                       [k] kernel_fpu_begin
  0.97%  [kernel]                       [k] memset
  0.88%  [kernel]                       [k] find_busiest_group
  0.76%  [zfs]                          [k] vdev_raidz_generate_parity_pq
  0.67%  [kernel]                       [k] mutex_lock
  0.62%  [zfs]                          [k] lzjb_decompress
  0.61%  [kernel]                       [k] mutex_unlock
  0.56%  [kernel]                       [k] blkcipher_walk_next
  0.38%  [kernel]                       [k] scatterwalk_done
  0.35%  [aesni_intel]                  [k] cbc_decrypt
  0.34%  [kernel]                       [k] kmem_cache_free
  0.34%  [kernel]                       [k] load_balance
  0.33%  [zfs]                          [k] zio_done
  0.33%  [kernel]                       [k] mutex_spin_on_owner
  0.33%  [kernel]                       [k] __slab_free
  0.31%  [kernel]                       [k] _raw_spin_unlock_irqrestore
  0.30%  [kernel]                       [k] __schedule
  0.28%  [kernel]                       [k] clockevents_program_event
  0.28%  [kernel]                       [k] blkcipher_walk_done
  0.27%  [kernel]                       [k] __hrtimer_start_range_ns
  0.26%  [kernel]                       [k] idle_cpu
  0.26%  [aesni_intel]                  [k] aes_encrypt
  0.26%  [zfs]                          [k] zio_cons
  0.25%  [spl]                          [k] spl_kmem_cache_alloc
  0.25%  [kernel]                       [k] __remove_hrtimer
  0.25%  [kernel]                       [k] blkcipher_walk_first
  0.24%  [kernel]                       [k] sg_init_table
  0.24%  [kernel]                       [k] kmem_cache_alloc
  0.23%  [dm_crypt]                     [k] crypt_iv_essiv_gen
  0.22%  [kernel]                       [k] irq_fpu_usable
  0.20%  [kernel]                       [k] default_send_IPI_mask_sequence_phys
  0.19%  [zavl]                         [k] avl_find
  0.19%  [aesni_intel]                  [k] ablk_decrypt
  0.18%  [kernel]                       [k] clflush_cache_range
  0.18%  [zfs]                          [k] arc_evict
  0.17%  [kernel]                       [k] cmpxchg_double_slab.isra.24
  0.17%  [zfs]                          [k] zio_walk_parents
  0.17%  [kernel]                       [k] irq_entries_start
  0.16%  [spl]                          [k] spl_kmem_cache_free
  0.16%  [zfs]                          [k] vdev_queue_io_to_issue
  0.16%  [aesni_intel]                  [k] aesni_cbc_enc
  0.15%  [kernel]                       [k] tick_program_event
  0.15%  [kernel]                       [k] cpumask_next_and
  0.15%  [kernel]                       [k] find_next_bit
  0.15%  [spl]                          [k] taskq_thread
  0.14%  [kernel]                       [k] __slab_alloc
  0.14%  [kernel]                       [k] __kmalloc
  0.14%  [spl]                          [k] kmem_alloc_debug
  0.14%  [kernel]                       [k] mix_pool_bytes_extract
  0.13%  [dm_crypt]                     [k] iv_of_dmreq
  0.13%  [zfs]                          [k] zio_execute
  0.13%  [kernel]                       [k] blkcipher_walk_virt
  0.13%  rsync                          [.] md5_update
  0.12%  [kernel]                       [k] __ticket_spin_unlock
  0.12%  [zfs]                          [k] buf_hash
  0.12%  [kernel]                       [k] _raw_spin_lock_irqsave
  0.12%  [zavl]                         [k] avl_remove
  0.12%  [mpt2sas]                      [k] _base_interrupt
  0.12%  [kernel]                       [k] scatterwalk_map

# for ((i=0; i<30; i+=1)); do cat /proc/loadavg; sleep 10; done
4.55 5.32 4.99 3/748 5620
8.47 6.11 5.25 6/748 5622
7.48 5.97 5.21 9/748 5624
6.78 5.87 5.19 1/748 5626
11.65 6.91 5.53 48/748 5628
10.49 6.81 5.52 8/748 5630
10.99 7.04 5.60 4/756 5640
9.99 6.95 5.59 1/755 5643
8.84 6.81 5.56 15/755 5645
7.71 6.63 5.51 1/755 5647
6.75 6.46 5.47 6/755 5653
6.33 6.38 5.45 2/755 5655
6.19 6.35 5.45 2/755 5657
5.86 6.27 5.44 46/745 5659
5.50 6.18 5.42 1/745 5662
5.43 6.14 5.41 2/744 5664
6.40 6.34 5.48 2/744 5674
6.47 6.36 5.50 2/744 5677
5.93 6.25 5.47 2/747 5693
5.41 6.12 5.44 1/747 5695
4.81 5.97 5.40 1/745 5697
4.22 5.81 5.35 6/744 5699
4.25 5.76 5.34 2/744 5701
3.91 5.64 5.31 1/746 5705
4.33 5.67 5.32 1/747 5718
9.81 6.76 5.68 1/748 5748
9.28 6.75 5.69 1/745 5909
8.54 6.68 5.67 1/748 5914
7.54 6.52 5.63 1/745 5916
7.09 6.46 5.62 3/744 5918

# grep ^processor /proc/cpuinfo | wc -l
16

# cat /proc/meminfo
MemTotal:       49526720 kB
MemFree:         9310140 kB
Buffers:           69004 kB
Cached:           170124 kB
SwapCached:          360 kB
Active:           135324 kB
Inactive:         179884 kB
Active(anon):      39652 kB
Inactive(anon):    37812 kB
Active(file):      95672 kB
Inactive(file):   142072 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       5242876 kB
SwapFree:        5241808 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:         76008 kB
Mapped:             8248 kB
Shmem:              1376 kB
Slab:           39445024 kB
SReclaimable:    3078136 kB
SUnreclaim:     36366888 kB
KernelStack:        5784 kB
PageTables:         4804 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    30006236 kB
Committed_AS:     253680 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      523100 kB
VmallocChunk:   34333564064 kB
HardwareCorrupted:     0 kB
AnonHugePages:     12288 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       11776 kB
DirectMap2M:     2076672 kB
DirectMap1G:    48234496 kB

@ryao
Contributor Author

ryao commented Aug 16, 2012

@chrisrd I appreciate your enthusiasm, but unfortunately, I do not see any way to make this work. I am going to close this pull request and open a new one for ryao/spl@855c08a, which is salvageable.

@ryao ryao closed this Aug 16, 2012
@chrisrd
Contributor

chrisrd commented Aug 17, 2012

No worries.

I understand you're focused on swap over ZoL, but for what it's worth, I think the approach of SPL's cache being a thin layer on top of the native linux slab is the right one to address the more general memory issues that have been cropping up all too frequently.

@ryao
Contributor Author

ryao commented Aug 17, 2012

Memory issues and swap issues are related. Anyway, the native Linux slab cannot be used due to memory fragmentation issues that occur when using real memory.

The real solution would be to fix the kernel virtual memory allocator to stop ignoring gfp flags, but I do not expect that to happen upstream.

@behlendorf
Contributor

Most (all?) of the major outstanding stability issues tie back to these memory management issues. Unfortunately, we can't simply naively layer on top of the Linux slab due to fragmentation and performance concerns. What needs to happen in the 0.7.0 time frame is to rework ZFS and remove its dependence on large allocations. This is behavior which is OK on Solaris, BSD, MacOS, etc but is just not well supported (by design) in the Linux kernel.

Until that happens, the best we can do is attempt to make the SPL's slab more robust. Inspired by @ryao's various attempts, I gave the issue some more thought and put together another patch which I think will work: #155. It passes all my regression tests without any additional performance impact, and it is designed to specifically address the writeback deadlock case.

If either of you have the time I'd appreciate a code review and some testing on a real system.

@pyavdr

pyavdr commented Aug 19, 2012

Is there any chance to use ideas from SLUB? http://lwn.net/Articles/229984/ ....

@chrisrd
Contributor

chrisrd commented Aug 20, 2012

Is the fragmentation issue really any worse than non-ZFS?

As it happens, I have 2 boxes on linux-3.4, both on a similar uptime and running a similar workload, one with ext4 and one with ZoL on native linux slab (SLUB + ryao/spl@4e2562a + cherry-pick ryao/spl@b0bc39c). Comparing the two, ext4:

# uptime; cat /proc/buddyinfo 
 08:49:44 up 15 days, 13:00,  7 users,  load average: 1.37, 1.66, 1.94
Node 0, zone      DMA      1      0      1      0      2      1      1      0      1      1      3 
Node 0, zone    DMA32    642    226    667   1473   1166    391     70      1      1      0      0 
Node 0, zone   Normal  79354   2466    169     13      4      3      0      0      0      0      0 
Node 1, zone   Normal  67247   5131   1061   1674    413    205     16      0      0      1      0 

versus ZoL :

# uptime; cat /proc/buddyinfo 
 08:49:49 up 15 days, 17:04,  5 users,  load average: 3.60, 4.82, 5.24
Node 0, zone      DMA      1      0      1      0      2      1      1      0      1      1      3 
Node 0, zone    DMA32    810    924    941   1018    967    916    831    721    544    340    252 
Node 0, zone   Normal  73644  21825  10063   8061   6541   4202   1469    382    263      1      0 
Node 1, zone   Normal  59463  17451   6454   2932   2456   2056   1333    833   1143   1265      6 

According to linux/Documentation/filesystems/proc.txt the numbers are the available contiguous pages at "page order" (2^x) sizes from 0 to 10, left to right. So, if my understanding is correct, the ZoL box has far more memory available at higher page orders than the ext4 box, i.e. ZoL is causing significantly less fragmentation than ext4.

The only other difference between the boxes is the ext4 box has 96GB vs ZoL with 48 GB. Perhaps the buddyinfo difference is because the ext4 box has chewed up all the memory in buffers? Ext4:

# cat /proc/meminfo
MemTotal:       99071920 kB
MemFree:         1286584 kB
Buffers:        26228356 kB
Cached:         57775612 kB
SwapCached:            0 kB
Active:         29355556 kB
Inactive:       55034584 kB
Active(anon):     332552 kB
Inactive(anon):    54904 kB
Active(file):   29023004 kB
Inactive(file): 54979680 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      25165820 kB
SwapFree:       25165820 kB
Dirty:              2592 kB
Writeback:             0 kB
AnonPages:        388316 kB
Mapped:            12064 kB
Shmem:              1188 kB
Slab:           12776852 kB
SReclaimable:   12559388 kB
SUnreclaim:       217464 kB
KernelStack:        2832 kB
PageTables:         7504 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    74701780 kB
Committed_AS:    1241432 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      767740 kB
VmallocChunk:   34307741016 kB
HardwareCorrupted:     0 kB
AnonHugePages:    284672 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       11776 kB
DirectMap2M:     2076672 kB
DirectMap1G:    98566144 kB

vs ZoL:

# cat /proc/meminfo
MemTotal:       49526720 kB
MemFree:        11769640 kB
Buffers:          137184 kB
Cached:           280148 kB
SwapCached:          360 kB
Active:           311764 kB
Inactive:         175540 kB
Active(anon):      35692 kB
Inactive(anon):    35668 kB
Active(file):     276072 kB
Inactive(file):   139872 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       5242876 kB
SwapFree:        5241808 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         69836 kB
Mapped:             8264 kB
Shmem:              1384 kB
Slab:           36815900 kB
SReclaimable:     971468 kB
SUnreclaim:     35844432 kB
KernelStack:        5592 kB
PageTables:         4396 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    30006236 kB
Committed_AS:     229596 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      523100 kB
VmallocChunk:   34333564064 kB
HardwareCorrupted:     0 kB
AnonHugePages:      8192 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       11776 kB
DirectMap2M:     2076672 kB
DirectMap1G:    48234496 kB

(...which prompts the question, should ZoL be tying into the buffer subsystem rather than the slab subsystem?)

I don't have any figures addressing your performance concerns, however (waves hands around...) it seems to me that the presumably highly-tuned native linux slab subsystem has a good chance of being as fast as the bespoke ZoL slab.

All that said, I understand ZoL also has issues with low memory conditions and swap (the two being of course highly correlated), and to my educated lay person's mind, #155 (Emergency slab objects) is likely the right approach to address that side of things. Perhaps #155 in combination with the native linux slab?

@behlendorf
Contributor

Is the fragmentation issue really any worse than non-ZFS?

It's not that ZFS fragments memory that much more than other filesystems. In fact, according to your data it does the opposite. It's that ZFS depends on the ability to allocate large contiguous chunks of memory for basic operation. Other filesystems such as ext4/xfs do not, they are able to operate just fine with heavily fragmented memory.

ZFS can (and will) be updated to function in this fragmented memory environment, but it wasn't originally written to do so since Solaris provides a fairly rich virtual memory subsystem. Linux does not; it instead forces many of these low-level memory management details down into the filesystem. This was a conscious decision made by the Linux kernel developers which is very good for performance, and bad for portability.

(...which prompts the question, should ZoL be tying into the buffer subsystem rather than the slab subsystem?)

Yes, yes it should. The reason it doesn't is because it was originally written for the slab on Solaris. To shift it over to the page cache is going to require the changes mentioned above. In particular, the ZFS stack must be changed to pass a scatter-gather list of pages to functions rather than an address+length. It can then be updated to only briefly map these pages into a per-cpu address space for various transforms, and additionally to cleanly link these pages into the page cache.
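
As a minimal sketch of what operating on a scatter-gather list of pages might look like, assuming pages are mapped only briefly through the per-cpu atomic kmap; the byte-sum checksum below is a hypothetical stand-in for the real transforms.

#include <linux/highmem.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/types.h>

static u64 pages_checksum(struct page **pages, unsigned int count, size_t len)
{
	u64 sum = 0;
	unsigned int i;

	for (i = 0; i < count && len; i++) {
		size_t n = min_t(size_t, len, PAGE_SIZE);
		u8 *va = kmap_atomic(pages[i]);	/* brief per-cpu mapping */
		size_t j;

		for (j = 0; j < n; j++)
			sum += va[j];

		kunmap_atomic(va);
		len -= n;
	}
	return (sum);
}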

Perhaps #155 in combination with the native linux slab?

So #155 should be viewed as a short/medium term solution for the official 0.6.0 release. These are low risk changes designed to stabilize the code. When we get to the point where we can use a zvol as a swap device we'll know we've reached a point of good stability in regards to the memory management. @ryao's test case above is a very good stress case to make sure we got it right.

Once things are stable we can go about making the needed long term changes in the 0.7.0 development branch, since they may be destabilizing.

@chrisrd
Contributor

chrisrd commented Aug 22, 2012

Thanks for taking the time to explain!

@chrisrd
Contributor

chrisrd commented Aug 22, 2012

I may be way off base, but in reading up on the page cache I noticed in linux/Documentation/block/biodoc.txt:

...Ingo Molnar's mempool implementation, which enables
subsystems like bio to maintain their own reserve memory pools for guaranteed
deadlock-free allocations during extreme VM load

Have mempools been considered to address the low memory / swap issues?

@behlendorf
Contributor

They are only really suited for small allocations.
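
For reference, a minimal sketch of what a mempool reserve provides, with hypothetical names; it guarantees a fixed number of small, fixed-size objects, which is why it does not help with the multi-page slab allocations discussed above.

#include <linux/errno.h>
#include <linux/mempool.h>
#include <linux/slab.h>

static struct kmem_cache *obj_cache;
static mempool_t *obj_pool;

static int obj_pool_init(void)
{
	obj_cache = kmem_cache_create("obj_cache", 256, 0, 0, NULL);
	if (!obj_cache)
		return (-ENOMEM);

	/* Reserve 16 objects for use under extreme memory pressure. */
	obj_pool = mempool_create_slab_pool(16, obj_cache);
	if (!obj_pool) {
		kmem_cache_destroy(obj_cache);
		return (-ENOMEM);
	}
	return (0);
}

static void *obj_get(gfp_t gfp)
{
	/* Falls back to the reserved objects when normal allocation fails. */
	return (mempool_alloc(obj_pool, gfp));
}

static void obj_put(void *obj)
{
	mempool_free(obj, obj_pool);
}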

@behlendorf
Contributor

For those interested in this work, could you please test the following spl+zfs pull requests? They allow me to safely use a zvol based swap device, but I'd like to see this code get more testing. It should significantly improve things under low memory conditions.

#161
openzfs/zfs#883
