ARC metadata not freed aggressively enough when < arc_meta_limit #3283

Closed
dweeezil opened this issue Apr 13, 2015 · 7 comments
Labels
Component: Memory Management kernel memory management

Comments

@dweeezil
Contributor

I've encountered some pathologically bad behavior with my stress test suite when a user program tries to dirty a lot of memory and the ARC is filled mostly with metadata. I'm running this on a 3.13.3 kernel. The test conditions are as follows:

  • System has 32GiB of RAM; ARC limit is auto-tuned to 16GiB and the meta limit is auto-tuned to about 12GiB.
  • Fill ARC with mostly metadata by populating a fresh filesystem with about 5 million small files. The stress test suite uses 10 concurrent processes for this job and populates them in a 3-level directory hierarchy.
  • After fill, ARC actually is at about 13GiB and arc_c is 15GiB.
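One way to picture the 3-level layout used for the fill is a function that maps a file index to a path. This is only a sketch: the fanout per level, the `d<N>`/`f<N>` naming, and the helper name `small_file_path` are assumptions for illustration, not the stress test suite's actual scheme.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Map a file index to a path three directory levels deep.
 * `fanout` (directories per level) and the d<N>/f<N> naming are
 * illustrative assumptions, not the test suite's real layout.
 * Returns the snprintf() result, i.e. the length the path needs.
 */
int small_file_path(unsigned long idx, unsigned long fanout,
    char *buf, size_t len)
{
	assert(fanout > 0);

	unsigned long l1 = (idx / (fanout * fanout)) % fanout;
	unsigned long l2 = (idx / fanout) % fanout;
	unsigned long l3 = idx % fanout;

	return snprintf(buf, len, "d%lu/d%lu/d%lu/f%lu", l1, l2, l3, idx);
}
```

Each of the 10 concurrent writers could take a disjoint range of indices and create a small file at every generated path.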

Once the ARC is warmed up in this manner, a user program tries to allocate and dirty enough memory that some of the ARC would need to be freed in order to accommodate its requests. In my current test, the program attempts to allocate and dirty 15000 MiB of memory (via sbrk()) in 10MiB hunks. Unfortunately, since the amount of metadata is under the metadata limit, very little is done to free any of it. The SPL shrinker (for count) is called repeatedly and mostly returns 0 or a very low number. The effect is that the system spins for a very long time (maybe indefinitely in some cases) and only very slowly frees enough to accommodate the request.
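The allocation pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual test program: the function name `dirty_memory` and the fill byte are assumptions, and the chunk count and size are parameters so it can be tried at a small scale first.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Grow the heap by `chunks` chunks of `chunk_bytes` each using sbrk(),
 * writing to every byte of each chunk so the pages are really dirtied.
 * Returns the number of chunks successfully allocated and dirtied.
 */
size_t dirty_memory(size_t chunks, size_t chunk_bytes)
{
	size_t done;

	assert(chunk_bytes > 0 && chunk_bytes <= (size_t)INTPTR_MAX);

	for (done = 0; done < chunks; done++) {
		void *p = sbrk((intptr_t)chunk_bytes);

		if (p == (void *)-1)
			break;			/* allocation failed; stop early */
		memset(p, 0xA5, chunk_bytes);	/* touch every page */
	}
	return done;
}
```

Calling `dirty_memory(1500, 10 * 1024 * 1024)` would correspond to the 15000 MiB request in 10MiB hunks.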

I've found the following patch to be helpful:

diff --git a/module/zfs/zfs_znode.c b/module/zfs/zfs_znode.c
index c931a72..992ff09 100644
--- a/module/zfs/zfs_znode.c
+++ b/module/zfs/zfs_znode.c
@@ -152,7 +152,7 @@ zfs_znode_init(void)
        ASSERT(znode_cache == NULL);
        znode_cache = kmem_cache_create("zfs_znode_cache",
            sizeof (znode_t), 0, zfs_znode_cache_constructor,
-           zfs_znode_cache_destructor, NULL, NULL, NULL, KMC_KMEM);
+           zfs_znode_cache_destructor, NULL, NULL, NULL, 0);
 }

 void

There's still a bit of spinning as described above, but the user program allocations do end up succeeding in a reasonable amount of time. The znode_cache tends to be very large in these cases (object size of about 1100 bytes). @behlendorf Is there any reason this needs to be KMC_KMEM (changed in 3558fd7)?

@behlendorf
Contributor

@dweeezil now I wish I'd made a better comment about the znode_cache. However, IIRC, passing KMC_KMEM was added because we must be certain that the znode/inode is backed by kmalloc'd memory. If the znode were vmalloc'd then the wait_on_bit operations performed on the embedded inode wouldn't function correctly. That would be very bad.

I suspect this is exactly what happened in a much older version of the kmem_cache code, so this was made explicit. Today, after the kmem rework, that is far less likely (but maybe still possible). Based on your testing, it looks like these days it would really be best to locate these objects on the Linux slab. Presumably this is best because the Linux slabs are smaller than the SPL's slabs, so memory is reclaimed more quickly because there are no fragmentation issues.

The right fix here is probably to change this to KMC_SLAB. That ensures it's on the Linux slab, which will be kmalloc-backed, which is critical.

@dweeezil
Contributor Author

@behlendorf I was aware of the wait_on_bit issue but couldn't remember the context in which it applied and didn't have a chance to go over the issue lists to remind myself where it was. I'll be running more tests tonight and will use KMC_SLAB. In any case, it seems to be a very Good Thing to get the znode_cache on to the Linux slab. Of course, this depends on spl_kmem_cache_slab_limit being set high enough, so if someone is forcing it to zero or to another value below about 1100, it won't help.
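The size dependence mentioned in this comment can be modeled with a small check. This is a simplified sketch of the condition as described here, not the SPL source: the enum, the function name `choose_backing`, and the exact comparison are assumptions, and the real selection logic may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two slab implementations discussed. */
enum backing { BACKING_SPL_SLAB, BACKING_LINUX_SLAB };

/*
 * Simplified model: an object lands on the Linux slab only when the
 * limit is non-zero and the object size fits under it.
 */
enum backing choose_backing(size_t obj_size, size_t slab_limit)
{
	assert(obj_size > 0);

	if (slab_limit != 0 && obj_size <= slab_limit)
		return BACKING_LINUX_SLAB;
	return BACKING_SPL_SLAB;
}
```

Under this model, a znode of about 1100 bytes stays off the Linux slab whenever the limit is forced to zero or to any value smaller than the object.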

I think your theory about there being less fragmentation on the Linux slab is likely correct.

I'm a bit torn as to how things should work in this type of a case from the perspective of a typical user program. The question is how much of the ARC do we want a single memory hungry process to be able to take (possibly at the expense of overall filesystem performance). I realize my test case is totally contrived but it's the kind of thing benchmarkers will run into and this type of test works just dandily under ext4. Presumably this is a case where ABD won't help much, either, because a lot of the memory in use is sitting on the slabs.

From my perspective, if a memory allocation for a user program is going to fail, it would be nice if it would fail right away rather than causing the system to spiral out of control trying over and over again to reclaim memory.

I'll post further results when I get a chance to run the tests again.

@dweeezil
Contributor Author

This looks good. I've posted the change as pull request #3289. I've got some perf data to look at to try to characterize the bad behavior but I've observed that this patch dramatically helps the case where a user program tries to allocate and dirty memory which would require a lot of metadata to be shed from the ARC.

Specifically, in my test case on a 32GiB system, the ARC has about 13GiB of mostly metadata and free reports about 7GiB available. Without this patch, if a user program tries to grab 14.6GiB (15000 MiB) of memory, the system spins for a very long time (possibly indefinitely). With this patch, the system performs normally.
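As a rough check on these figures, the shortfall that has to come out of the ARC can be computed directly. The helper names are hypothetical, and the sketch assumes that everything free reports as available is actually usable by the program, which is a simplification.

```c
#include <assert.h>

/* Convert MiB to GiB. */
double mib_to_gib(double mib)
{
	return mib / 1024.0;
}

/*
 * Memory that would have to be reclaimed to satisfy a request:
 * the request minus what is already free, never negative.
 */
double reclaim_needed_gib(double request_mib, double free_gib)
{
	assert(request_mib >= 0.0 && free_gib >= 0.0);

	double need = mib_to_gib(request_mib) - free_gib;

	return need > 0.0 ? need : 0.0;
}
```

With a 15000 MiB request and about 7GiB free, `reclaim_needed_gib(15000, 7)` comes to roughly 7.6GiB, which is more than half of the approximately 13GiB held by the ARC.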

@behlendorf
Contributor

@dweeezil nice job getting to the root cause here, I'll get it merged.

behlendorf pushed a commit that referenced this issue Apr 14, 2015
The Linux slab, in general, performs better than the SPL slab in cases
where a lot of objects are allocated and fragmentation is likely present.

This patch fixes pathologically bad behavior in cases where the ARC is
filled with mostly metadata and a user program needs to allocate and
dirty enough memory that a significant amount of the ARC would have to
be reclaimed.

If zfs_znode_cache is on the SPL slab, the system may spin for a very
long time trying to reclaim sufficient memory.  If it is on the Linux
slab, the behavior has been observed to be much more predictable; the
memory is reclaimed more efficiently.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3283
@dweeezil
Contributor Author

Certainly not an end-all but it sure had a dramatic impact on my test case. I'm hoping it will similarly help more real-world cases (rsync server).

behlendorf pushed a commit that referenced this issue Apr 17, 2015
@behlendorf
Contributor

Closing; this issue has been resolved.
