-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cap metaslab memory usage #9128
Conversation
Signed-off-by: Paul Dagnelie <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #9128 +/- ##
==========================================
- Coverage 79.33% 79.11% -0.23%
==========================================
Files 400 400
Lines 121692 121793 +101
==========================================
- Hits 96542 96352 -190
- Misses 25150 25441 +291
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the 2 nits , the code LTGM.
Signed-off-by: Paul Dagnelie <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. My only concern is that this change directly accesses skc_obj_total
and skc_obj_size
which isn't going to be portable. Though I don't see any better way to do this, since I couldn't find any existing illumos interface to get the cache size. As a follow up we should consider adding kmem_cache_size()
function for this purpose.
@behlendorf on Illumos we used |
@pcd1193182 ahh, I see. Why don't we go ahead with it as-is and defer what to do about this until we determine what interfaces are available on FreeBSD. |
On systems with large amounts of storage and high fragmentation, a huge amount of space can be used by storing metaslab range trees. Since metaslabs are only unloaded during a txg sync, and only if they have been inactive for 8 txgs, it is possible to get into a state where all of the system's memory is consumed by range trees and metaslabs, and txgs cannot sync. While ZFS knows how to evict ARC data when needed, it has no such mechanism for range tree data. This can result in boot hangs for some system configurations. First, we add the ability to unload metaslabs outside of syncing context. Second, we store a multilist of all loaded metaslabs, sorted by their selection txg, so we can quickly identify the oldest metaslabs. We use a multilist to reduce lock contention during heavy write workloads. Finally, we add logic that will unload a metaslab when we're loading a new metaslab, if we're using more than a certain fraction of the available memory on range trees. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#9128 Signed-off-by: Bryant G. Ly <[email protected]> Conflicts: include/sys/metaslab.h
Reviewed-by: Matt Ahrens [email protected]
Reviewed-by: George Wilson [email protected]
Reviewed-by: Sebastien Roy [email protected]
Reviewed-by: Serapheim Dimitropoulos [email protected]
Motivation and Context
On systems with large amounts of storage and high fragmentation, a huge amount of space can be used by storing metaslab range trees. Since metaslabs are only unloaded during a txg sync, and only if they have been inactive for 8 txgs, it is possible to get into a state where all of the system's memory is consumed by range trees and metaslabs, and txgs cannot sync. While ZFS knows how to evict ARC data when needed, it has no such mechanism for range tree data. This can result in boot hangs for some system configurations.
Description
First, we add the ability to unload metaslabs outside of syncing context. Second, we store a multilist of all loaded metaslabs, sorted by their selection txg, so we can quickly identify the oldest metaslabs. We use a multilist to reduce lock contention during heavy write workloads. Finally, we add logic that will unload a metaslab when we're loading a new metaslab, if we're using more than a certain fraction of the available memory on range trees.
How Has This Been Tested?
Passes zfs-test and zloop on Linux, as well as extensive testing and usage in prod on Illumos. Performance testing was performed on Illumos: these changes resulted in a .5% performance decrease in the worst case scenario (heavy sequential write workload).
Types of changes
Checklist:
Signed-off-by
.