ARC metadata not freed aggressively enough when < arc_meta_limit #3283
Comments
@dweeezil now I wish I'd made a better comment about the znode_cache. However, IIRC, passing KMC_KMEM was added because we must be certain that the znode/inode is backed by kmalloc'd memory. If the znode were vmalloc'd, things would break; I suspect this is exactly what happened in a much older version of the kmem_cache code, so this was made explicit. Today, post kmem-rework, that is far less likely (but maybe still possible). Based on your testing, it looks like these days it would really be best to locate these objects on the Linux slab. Presumably this is best because the Linux slabs are smaller than the SPL's slabs, so memory is reclaimed more quickly due to there not being any fragmentation issues. The right fix here is probably to change this to KMC_SLAB. That ensures it's on the Linux slab, which will be kmalloc-backed, which is critical.
@behlendorf I was aware of that. I think your theory about there being less fragmentation on the Linux slab is likely correct. I'm a bit torn as to how things should work in this type of case from the perspective of a typical user program. The question is how much of the ARC we want a single memory-hungry process to be able to take (possibly at the expense of overall filesystem performance). I realize my test case is totally contrived, but it's the kind of thing benchmarkers will run into, and this type of test works just dandily under ext4. Presumably this is a case where ABD won't help much, either, because a lot of the memory in use is sitting on the slabs. From my perspective, if a memory allocation for a user program is going to fail, it would be nice if it would fail right away rather than causing the system to spiral out of control trying over and over again to reclaim memory. I'll post further results when I get a chance to run the tests again.
This looks good. I've posted the change as pull request #3289. I've got some perf data to look at to try to characterize the bad behavior, but I've observed that this patch dramatically helps the case where a user program tries to allocate and dirty memory which would require a lot of metadata to be shed from the ARC. Specifically, in my test case on a 32GiB system, the ARC has about 13GiB of mostly metadata and …
@dweeezil nice job getting to the root cause here; I'll get it merged.
The Linux slab, in general, performs better than the SPL slab in cases where a lot of objects are allocated and fragmentation is likely present. This patch fixes pathologically bad behavior in cases where the ARC is filled with mostly metadata and a user program needs to allocate and dirty enough memory which would require an insignificant amount of the ARC to be reclaimed. If zfs_znode_cache is on the SPL slab, the system may spin for a very long time trying to reclaim sufficient memory. If it is on the Linux slab, the behavior has been observed to be much more predictable; the memory is reclaimed more efficiently.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3283
Certainly not an end-all, but it sure had a dramatic impact on my test case. I'm hoping it will similarly help more real-world cases (rsync server).
Closing; this issue has been resolved.
I've encountered some pathologically bad behavior with my stress test suite when a user program tries to dirty a lot of memory and the ARC is filled mostly with metadata. I'm running this on a 3.13.3 kernel. The test conditions are as follows:
Once the ARC is warmed up in this manner, a user program tries to allocate and dirty enough memory that some of the ARC would need to be freed in order to accommodate its requests. In my current test, the program attempts to allocate and dirty 15000 MiB of memory (via sbrk()) in 10 MiB hunks. Unfortunately, since the amount of metadata is under the metadata limit, nothing much is done to alleviate it. The SPL shrinker (for count) is repeatedly called and mostly returns 0 or a very low number. The effect is that the system spins for a very, very long time (maybe indefinitely in some cases) and only very slowly frees enough to accommodate the request. I've found the following patch to be helpful:
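A rough sketch of the change under discussion (per the comments above, the fix eventually merged via pull request #3289 passes KMC_SLAB instead of KMC_KMEM when zfs_znode_cache is created; the call shape below is an approximation of zfs_znode_init(), not a verbatim excerpt, and the original patch may have differed):

```c
/*
 * Sketch only: argument order follows the SPL kmem_cache_create()
 * interface (name, size, align, ctor, dtor, reclaim, priv, vmp, flags).
 */
zfs_znode_cache = kmem_cache_create("zfs_znode_cache",
    sizeof (znode_t), 0,
    zfs_znode_cache_constructor,    /* per-object initialization */
    zfs_znode_cache_destructor,     /* per-object teardown */
    NULL, NULL, NULL,               /* no reclaim callback, priv, or vmp */
    KMC_SLAB);                      /* was KMC_KMEM; KMC_SLAB places the cache
                                     * on the Linux slab, which is still
                                     * kmalloc-backed */
```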
There's still a bit of spinning as described above, but the user program allocations do end up succeeding in a reasonable amount of time. The znode_cache tends to be very large in these cases (object size of about 1100 bytes). @behlendorf Is there any reason this needs to be KMC_KMEM (changed in 3558fd7)?
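For reference, the allocation pattern described above (dirtying roughly 15000 MiB via sbrk() in 10 MiB hunks) corresponds to a loop along these lines; this is an illustrative sketch, not the actual stress-test program:

```c
/* Illustrative reproducer sketch: grow the heap in 10 MiB hunks and dirty
 * each hunk so the kernel must find real pages, forcing ARC reclaim. */
#define _DEFAULT_SOURCE			/* for the sbrk() declaration in glibc */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const size_t hunk = 10UL << 20;		/* 10 MiB per sbrk() call */
	const size_t total = 15000UL << 20;	/* ~15000 MiB overall */
	size_t allocated;

	for (allocated = 0; allocated < total; allocated += hunk) {
		void *p = sbrk(hunk);		/* returns old break on success */
		if (p == (void *)-1) {
			perror("sbrk");
			return (1);
		}
		memset(p, 0xab, hunk);		/* dirty the newly added pages */
	}
	printf("allocated and dirtied %zu MiB\n", allocated >> 20);
	return (0);
}
```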