Use scatter-gather lists for ARC buffers #75
Labels: Component: Memory Management (kernel memory management)
behlendorf added a commit to behlendorf/zfs that referenced this issue on Dec 5, 2011:
In the upstream OpenSolaris ZFS code the maximum ARC usage is limited to 3/4 of memory or all but 1GB, whichever is larger. Because of how Linux's VM subsystem is organized these defaults have proven to be too large, which can lead to stability issues. To avoid making everyone manually tune the ARC, the defaults are being changed to 1/2 of memory or all but 4GB. The rationale for this is as follows:

* Desktop systems (less than 8GB of memory): Limiting the ARC to 1/2 of memory is desirable for desktop systems, which have highly dynamic memory requirements. For example, launching your web browser can suddenly result in a demand for several gigabytes of memory. This memory must be reclaimed from the ARC cache, which can take some time. The user experiences this reclaim time as a sluggish system with poor interactive performance. In this case it is preferable to leave the memory free and available for immediate use.

* Server systems (more than 8GB of memory): Using all but 4GB of memory for the ARC is preferable for server systems. These systems often run with minimal user interaction and have long-running daemons with relatively stable memory demands. They benefit most from having as much data cached in memory as possible.

These values should work well for most configurations. However, if you have a desktop system with more than 8GB of memory you may wish to further restrict the ARC. This can still be accomplished by setting the 'zfs_arc_max' module option.

Additionally, keep in mind these aren't currently hard limits. The ARC is based on a slab implementation which can suffer from memory fragmentation. Because this fragmentation is not visible to the ARC, it may believe it is within the specified limits while actually consuming slightly more memory. How much more memory gets consumed is determined by how badly fragmented the slabs are. In the long term this can be mitigated by slab defragmentation code, which was the OpenSolaris solution. Preferably, using the page cache to back the ARC under Linux would be even better; see issue openzfs#75 for the benefits of more tightly integrating with the page cache.

This change also fixes an issue where the default ARC max was being set incorrectly for machines with less than 2GB of memory. The constant in the arc_c_max comparison must be explicitly cast to a uint64_t type to prevent overflow and the wrong conditional branch being taken. This failure was typically observed in VMs, which are commonly created with less than 2GB of memory.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#75
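A minimal sketch (not the actual arc.c code) of how the new default described above could be computed, and why the explicit 64-bit cast matters on small-memory machines:

```c
#include <stdint.h>

/*
 * Hypothetical illustration: pick the larger of 1/2 of memory or all but
 * 4 GiB.  The explicit uint64_t casts matter; an expression such as
 * 4 * (1 << 30) evaluated in 32-bit int arithmetic overflows, which can
 * flip the comparison on machines with little memory.
 */
static uint64_t
default_arc_max(uint64_t allmem)
{
	uint64_t half = allmem / 2;
	uint64_t all_but_4g = 0;

	if (allmem > ((uint64_t)4 << 30))
		all_but_4g = allmem - ((uint64_t)4 << 30);

	return (half > all_but_4g ? half : all_but_4g);
}
```

As noted above, the computed default can still be overridden at module load time via the 'zfs_arc_max' module option.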
Rudd-O pushed a commit to Rudd-O/zfs that referenced this issue on Feb 1, 2012 (same commit message as above).
This was referenced Sep 26, 2013
behlendorf added the Bug - Major label and removed the Type: Feature (feature request or new feature) label on Oct 3, 2014.
Related to #2129 and #3441.
Merged as: 7657def Introduce ARC Buffer Data (ABD)
ahrens pushed a commit to ahrens/zfs that referenced this issue on Sep 17, 2019:
Signed-off-by: Paul Dagnelie <[email protected]>
sdimitro pushed a commit to sdimitro/zfs that referenced this issue on Feb 14, 2022:
Signed-off-by: Paul Dagnelie <[email protected]>
rkojedzinszky pushed a commit to rkojedzinszky/zfs that referenced this issue on Mar 7, 2023:
Avoid duplicated Actions in TrueNAS ZFS CI
Signed-off-by: Umer Saleem <[email protected]>
This is a big change, but we really need to consider updating the ZFS code to use scatter-gather lists for the ARC buffers instead of vmalloc'ed memory. Using a vmalloc'ed buffer is the way it's done on OpenSolaris, but it's less problematic there because the kernel has a more fully featured virtual memory management system. By design the Linux kernel's VM is kept primitive for performance reasons. The only reason things are working reasonably well today is that I've implemented a fairly decent virtual slab in the SPL. This is good, but it goes against the grain of what should be done and it does cause some problems, such as the following (illustrative sketches follow the list):
Deadlocks. Because of the way the zio pipeline is designed in ZFS we must be careful to avoid triggering the synchronous memory reclaim path. If one of the zio threads does enter reclaim then it may deadlock on itself by trying to flush dirty pages from, say, a zvol. This is avoided in most instances by clearing GFP_FS, but we can't clear this flag for vmalloc() calls. Unfortunately, we may be forced to vmalloc() a new slab in the zio pipeline for certain workloads such as compression, and thus we risk deadlocking. Moving to scatter-gather lists would allow us to eliminate this __vmalloc() and the potential deadlock (a sketch of the flag-clearing idea follows this list).
Avoid serializing on the single Linux VM lock. Because the Linux VM is designed to be lightly used, all changes to the virtual address space are serialized through a single lock. The SPL slab does go through some effort to minimize this impact by allocating slabs of objects, but clearly there are scaling concerns here.
VM overhead. In addition to the lock contention, there is overhead involved in locating suitable virtual addresses and setting up the mapping from virtual to physical pages. For a CPU-hungry filesystem, any overhead we can eliminate is worthwhile.
32-bit arch support. The biggest issue with supporting 32-bit arches is that they have a very small virtual address range, usually only hundreds of MB. By moving all ARC data buffers to scatter-gather lists we avoid having to use this limited address range. Instead, all data pages can simply reside in the standard address range, just like with all other Linux filesystems.
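As a small illustration of the GFP_FS point above, here is a hypothetical helper (the name zio_alloc_flags is illustrative, not from the ZFS code base) that strips __GFP_FS from per-page allocations. With individually allocated pages every allocation can be made reclaim-safe this way, whereas a large __vmalloc() of a new slab cannot:

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical helper: strip __GFP_FS so that memory reclaim triggered
 * by this allocation can never call back into the filesystem and
 * deadlock a zio thread on its own dirty data.
 */
static inline gfp_t
zio_alloc_flags(gfp_t flags)
{
	return (flags & ~__GFP_FS);
}

/* Example use when backing a buffer with individual pages. */
static struct page *
zio_alloc_data_page(void)
{
	return (alloc_page(zio_alloc_flags(GFP_KERNEL)));
}
```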
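And a hedged sketch of what a scatter-gather backed ARC buffer could look like, using the kernel's struct scatterlist instead of one large vmalloc'ed region. The sg_abuf_t type and function names are illustrative and not from the ZFS or SPL code base; each backing page is allocated individually, which honors GFP_NOFS, avoids the virtual address setup serialized by the VM lock, and keeps data out of the scarce 32-bit vmalloc range:

```c
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/* Illustrative buffer descriptor: one scatterlist entry per backing page. */
typedef struct sg_abuf {
	struct scatterlist *ab_sgl;
	unsigned int ab_nents;
	size_t ab_size;
} sg_abuf_t;

static sg_abuf_t *
sg_abuf_alloc(size_t size)
{
	unsigned int i, nents = DIV_ROUND_UP(size, PAGE_SIZE);
	sg_abuf_t *ab = kzalloc(sizeof (*ab), GFP_NOFS);

	if (ab == NULL)
		return (NULL);

	ab->ab_sgl = kcalloc(nents, sizeof (struct scatterlist), GFP_NOFS);
	if (ab->ab_sgl == NULL) {
		kfree(ab);
		return (NULL);
	}
	sg_init_table(ab->ab_sgl, nents);

	for (i = 0; i < nents; i++) {
		/* GFP_NOFS keeps reclaim from re-entering the filesystem. */
		struct page *pg = alloc_page(GFP_NOFS);

		if (pg == NULL)
			goto fail;
		sg_set_page(&ab->ab_sgl[i], pg, PAGE_SIZE, 0);
	}
	ab->ab_nents = nents;
	ab->ab_size = size;
	return (ab);

fail:
	while (i--)
		__free_page(sg_page(&ab->ab_sgl[i]));
	kfree(ab->ab_sgl);
	kfree(ab);
	return (NULL);
}
```

Readers and writers would then walk the page list (mapping individual pages with kmap() only as needed) rather than relying on one contiguous virtual address, which is broadly the direction the later ABD work (7657def, referenced above) took.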