Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cap metaslab memory usage #9128

Merged
merged 2 commits into from
Aug 16, 2019
Merged

Cap metaslab memory usage #9128

merged 2 commits into from
Aug 16, 2019

Conversation

pcd1193182
Copy link
Contributor

Reviewed-by: Matt Ahrens [email protected]
Reviewed-by: George Wilson [email protected]
Reviewed-by: Sebastien Roy [email protected]
Reviewed-by: Serapheim Dimitropoulos [email protected]

Motivation and Context

On systems with large amounts of storage and high fragmentation, a huge amount of space can be used by storing metaslab range trees. Since metaslabs are only unloaded during a txg sync, and only if they have been inactive for 8 txgs, it is possible to get into a state where all of the system's memory is consumed by range trees and metaslabs, and txgs cannot sync. While ZFS knows how to evict ARC data when needed, it has no such mechanism for range tree data. This can result in boot hangs for some system configurations.

Description

First, we add the ability to unload metaslabs outside of syncing context. Second, we store a multilist of all loaded metaslabs, sorted by their selection txg, so we can quickly identify the oldest metaslabs. We use a multilist to reduce lock contention during heavy write workloads. Finally, we add logic that will unload a metaslab when we're loading a new metaslab, if we're using more than a certain fraction of the available memory on range trees.

How Has This Been Tested?

Passes zfs-test and zloop on Linux, as well as extensive testing and usage in prod on Illumos. Performance testing was performed on Illumos: these changes resulted in a .5% performance decrease in the worst case scenario (heavy sequential write workload).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

Signed-off-by: Paul Dagnelie <[email protected]>
@c0d3z3r0 c0d3z3r0 mentioned this pull request Aug 6, 2019
12 tasks
@pcd1193182 pcd1193182 added the Status: Code Review Needed Ready for review and testing label Aug 7, 2019
@codecov
Copy link

codecov bot commented Aug 12, 2019

Codecov Report

Merging #9128 into master will decrease coverage by 0.22%.
The diff coverage is 79.38%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9128      +/-   ##
==========================================
- Coverage   79.33%   79.11%   -0.23%     
==========================================
  Files         400      400              
  Lines      121692   121793     +101     
==========================================
- Hits        96542    96352     -190     
- Misses      25150    25441     +291
Flag Coverage Δ
#kernel 79.68% <77.77%> (-0.13%) ⬇️
#user 67% <96.03%> (-0.31%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c81f179...7031025. Read the comment docs.

Copy link
Contributor

@sdimitro sdimitro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the 2 nits , the code LTGM.

Signed-off-by: Paul Dagnelie <[email protected]>
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. My only concern is that this change directly accesses skc_obj_total and skc_obj_size which isn't going to be portable. Though I don't see any better way to do this, since I couldn't find any existing illumos interface to get the cache size. As a follow up we should consider adding kmem_cache_size() function for this purpose.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 15, 2019
@pcd1193182
Copy link
Contributor Author

@behlendorf on Illumos we used kmem_cache_stat(btree_leaf_cache, "buf_inuse"); to get skc_obj_total, and kmem_cache_stat(btree_leaf_cache, "buf_size"); to get skc_obj_size. We could definitely add the kmem_cache_stat functions to match Illumos's approach, or make some simpler functions that both platforms could implement.

@behlendorf
Copy link
Contributor

@pcd1193182 ahh, I see. Why don't we go ahead with it as-is and defer what to do about this until we determine what interfaces are available on FreeBSD.

@behlendorf behlendorf merged commit f09fda5 into openzfs:master Aug 16, 2019
allanjude pushed a commit to KlaraSystems/zfs that referenced this pull request Apr 28, 2020
On systems with large amounts of storage and high fragmentation, a huge
amount of space can be used by storing metaslab range trees. Since
metaslabs are only unloaded during a txg sync, and only if they have
been inactive for 8 txgs, it is possible to get into a state where all
of the system's memory is consumed by range trees and metaslabs, and
txgs cannot sync. While ZFS knows how to evict ARC data when needed,
it has no such mechanism for range tree data. This can result in boot
hangs for some system configurations.

First, we add the ability to unload metaslabs outside of syncing
context. Second, we store a multilist of all loaded metaslabs, sorted
by their selection txg, so we can quickly identify the oldest
metaslabs.  We use a multilist to reduce lock contention during heavy
write workloads. Finally, we add logic that will unload a metaslab
when we're loading a new metaslab, if we're using more than a certain
fraction of the available memory on range trees.

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Sebastien Roy <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#9128

Signed-off-by: Bryant G. Ly <[email protected]>

Conflicts:
	include/sys/metaslab.h
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants