ARC metadata not freed aggressively enough when < arc_meta_limit #3283
Comments
@dweeezil now I wish I'd made a better comment about the znode_cache. However, IIRC, passing KMC_KMEM was added because we must be certain that the znode/inode is backed by kmalloc'd memory. If the znode were vmalloc'd, things would break; I suspect this is exactly what happened in a much older version of the kmem_cache code, so this was made explicit. Today, post kmem-rework, that is far less likely (but maybe still possible). Based on your testing, it looks like these days it would really be best to locate these objects on the Linux slab. Presumably this is best because the Linux slabs are smaller than the SPL's slabs, so memory is reclaimed more quickly due to there not being any fragmentation issues. The right fix here is probably to change this to KMC_SLAB. That ensures it's on the Linux slab, which will be kmalloc-backed, which is critical.
@behlendorf I was aware of that. I think your theory about there being less fragmentation on the Linux slab is likely correct. I'm a bit torn as to how things should work in this type of case from the perspective of a typical user program. The question is how much of the ARC we want a single memory-hungry process to be able to take (possibly at the expense of overall filesystem performance). I realize my test case is totally contrived, but it's the kind of thing benchmarkers will run into, and this type of test works just dandily under ext4. Presumably this is a case where ABD won't help much, either, because a lot of the memory in use is sitting on the slabs. From my perspective, if a memory allocation for a user program is going to fail, it would be nice if it would fail right away rather than causing the system to spiral out of control trying over and over again to reclaim memory. I'll post further results when I get a chance to run the tests again.
This looks good. I've posted the change as pull request #3289. I've got some perf data to look at to try to characterize the bad behavior, but I've observed that this patch dramatically helps the case where a user program tries to allocate and dirty memory which would require a lot of metadata to be shed from the ARC. Specifically, in my test case on a 32GiB system, the ARC has about 13GiB of mostly metadata and …
@dweeezil nice job getting to the root cause here; I'll get it merged.
The Linux slab, in general, performs better than the SPL slab in cases where a lot of objects are allocated and fragmentation is likely present. This patch fixes pathologically bad behavior in cases where the ARC is filled with mostly metadata and a user program needs to allocate and dirty enough memory which would require an insignificant amount of the ARC to be reclaimed. If zfs_znode_cache is on the SPL slab, the system may spin for a very long time trying to reclaim sufficient memory. If it is on the Linux slab, the behavior has been observed to be much more predictable; the memory is reclaimed more efficiently.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3283
Certainly not an end-all, but it sure had a dramatic impact on my test case. I'm hoping it will similarly help more real-world cases (rsync server).
Closing; this issue has been resolved.
I've encountered some pathologically bad behavior with my stress test suite when a user program tries to dirty a lot of memory and the ARC is filled mostly with metadata. I'm running this on a 3.13.3 kernel. The test conditions are as follows:
Once the ARC is warmed up in this manner, a user program tries to allocate and dirty enough memory that some of the ARC would need to be freed in order to accommodate its requests. In my current test, the program attempts to allocate and dirty 15000 MiB of memory (via sbrk()) in 10 MiB hunks. Unfortunately, since the amount of metadata is under the metadata limit, nothing much is done to alleviate it. The SPL shrinker (for count) is repeatedly called and mostly returns 0 or a very low number. The effect is that the system spins for a very, very long time (maybe indefinitely in some cases) and only very slowly frees enough to accommodate the request. I've found the following patch to be helpful:
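A rough sketch of the change under discussion (per the comments above, the fix eventually merged via pull request #3289 passes KMC_SLAB instead of KMC_KMEM when zfs_znode_cache is created; the call shape below is an approximation of zfs_znode_init(), not a verbatim excerpt, and the original patch may have differed):

```c
/*
 * Sketch only: argument order follows the SPL kmem_cache_create()
 * interface (name, size, align, ctor, dtor, reclaim, priv, vmp, flags).
 */
zfs_znode_cache = kmem_cache_create("zfs_znode_cache",
    sizeof (znode_t), 0,
    zfs_znode_cache_constructor,    /* per-object initialization */
    zfs_znode_cache_destructor,     /* per-object teardown */
    NULL, NULL, NULL,               /* no reclaim callback, priv, or vmp */
    KMC_SLAB);                      /* was KMC_KMEM; KMC_SLAB places the cache
                                     * on the Linux slab, which is still
                                     * kmalloc-backed */
```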
There's still a bit of spinning as described above, but the user program allocations do end up succeeding in a reasonable amount of time. The znode_cache tends to be very large in these cases (object size of about 1100 bytes). @behlendorf Is there any reason this needs to be KMC_KMEM (changed in 3558fd7)?
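For reference, the allocation pattern described above (dirtying roughly 15000 MiB via sbrk() in 10 MiB hunks) corresponds to a loop along these lines; this is an illustrative sketch, not the actual stress-test program:

```c
/* Illustrative reproducer sketch: grow the heap in 10 MiB hunks and dirty
 * each hunk so the kernel must find real pages, forcing ARC reclaim. */
#define _DEFAULT_SOURCE			/* for the sbrk() declaration in glibc */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const size_t hunk = 10UL << 20;		/* 10 MiB per sbrk() call */
	const size_t total = 15000UL << 20;	/* ~15000 MiB overall */
	size_t allocated;

	for (allocated = 0; allocated < total; allocated += hunk) {
		void *p = sbrk(hunk);		/* returns old break on success */
		if (p == (void *)-1) {
			perror("sbrk");
			return (1);
		}
		memset(p, 0xab, hunk);		/* dirty the newly added pages */
	}
	printf("allocated and dirtied %zu MiB\n", allocated >> 20);
	return (0);
}
```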