Backfilling metadnode degrades object create rates #4460
Conversation
Object creation rates may be degraded when dmu_object_alloc() tries to backfill the metadnode array by restarting its search at offset 0. The method of searching the dnode space for holes is inefficient and unreliable, leading to many failed attempts to obtain a dnode hold. These failed attempts are expensive and limit overall system throughput. This patch changes the default behavior to disable backfilling, and it adds a zfs_metadnode_backfill module parameter to allow the old behavior to be enabled.

The search offset restart happens at most once per call to dmu_object_alloc(), when the previously allocated object number is a multiple of 4096. If the hold on the requested object fails because the object is allocated, dmu_object_next() is called to find the next hole. That function should theoretically identify the next free object that the next loop iteration can successfully obtain a hold on. In practice, however, dmu_object_next() may falsely identify a recently allocated dnode as free because the in-memory copy of the dnode_phys_t is not up to date. The next hold attempt then fails, and this process repeats for up to 4096 loop iterations before the search skips ahead to a sparse region of the metadnode. A similar pathology occurs if dmu_object_next() returns ESRCH when it fails to find a hole in the current dnode block. In that case dmu_object_alloc() simply increments the object number and retries, again resulting in up to 4096 failed dnode hold attempts.

We can avoid these pathologies by not attempting to backfill the metadnode array. This may result in sparse dnode blocks, potentially costing disk space, memory overhead, and increased disk I/O. These penalties appear to be outweighed by the performance cost of the current approach. Future work could implement a more efficient means to search for holes and allow us to reenable backfilling by default.

=== Benchmark Results ===

We measured a 46% increase in average file creation rate by setting zfs_metadnode_backfill=0.

The createmany benchmark used is available at http://github.com/nedbass/createmany. It used 32 threads to create 16 million files over 16 iterations. The pool was freshly created for each of the two tests. The test system was a d2.xlarge Amazon AWS virtual machine with 3 2TB disks in a raidz pool.

zfs_metadnode_backfill Average creates/second
---------------------- ----------------------
                     0                  43879
                     1                  30040

$ zpool create tank raidz /dev/xvd{b,c,d}
$ echo 0 > /sys/module/zfs/parameters/zfs_metadnode_backfill
$ for ((i=0;i<16;i++)) ; do ./createmany -o -t 32 -D $(mktemp -d /tank/XXXXX) 1000000 ; done
total: 1000000 creates in 21.142829 seconds: 47297.359852 creates/second
total: 1000000 creates in 21.421943 seconds: 46681.108566 creates/second
total: 1000000 creates in 21.996960 seconds: 45460.826977 creates/second
total: 1000000 creates in 22.031947 seconds: 45388.637143 creates/second
total: 1000000 creates in 21.597262 seconds: 46302.165727 creates/second
total: 1000000 creates in 21.194397 seconds: 47182.281302 creates/second
total: 1000000 creates in 23.844561 seconds: 41938.285457 creates/second
total: 1000000 creates in 25.678497 seconds: 38943.089478 creates/second
total: 1000000 creates in 22.400553 seconds: 44641.757449 creates/second
total: 1000000 creates in 22.011262 seconds: 45431.290857 creates/second
total: 1000000 creates in 21.848749 seconds: 45769.211022 creates/second
total: 1000000 creates in 26.574808 seconds: 37629.622928 creates/second
total: 1000000 creates in 22.326124 seconds: 44790.580077 creates/second
total: 1000000 creates in 23.562593 seconds: 42440.152541 creates/second
total: 1000000 creates in 26.825597 seconds: 37277.828270 creates/second
total: 1000000 creates in 22.277026 seconds: 44889.297413 creates/second
$ zpool destroy tank
$ zpool create tank raidz /dev/xvd{b,c,d}
$ echo 1 > /sys/module/zfs/parameters/zfs_metadnode_backfill
$ for ((i=0;i<16;i++)) ; do ./createmany -o -t 32 -D $(mktemp -d /tank/XXXXX) 1000000 ; done
total: 1000000 creates in 31.947285 seconds: 31301.564265 creates/second
total: 1000000 creates in 31.511260 seconds: 31734.687822 creates/second
total: 1000000 creates in 31.984121 seconds: 31265.515618 creates/second
total: 1000000 creates in 31.960720 seconds: 31288.406458 creates/second
total: 1000000 creates in 32.651408 seconds: 30626.550663 creates/second
total: 1000000 creates in 32.579218 seconds: 30694.414826 creates/second
total: 1000000 creates in 36.163562 seconds: 27652.143474 creates/second
total: 1000000 creates in 33.621352 seconds: 29743.003829 creates/second
total: 1000000 creates in 33.097268 seconds: 30213.974061 creates/second
total: 1000000 creates in 34.419482 seconds: 29053.313476 creates/second
total: 1000000 creates in 34.014244 seconds: 29399.448204 creates/second
total: 1000000 creates in 32.972573 seconds: 30328.236705 creates/second
total: 1000000 creates in 34.757156 seconds: 28771.054526 creates/second
total: 1000000 creates in 32.194859 seconds: 31060.859951 creates/second
total: 1000000 creates in 32.464407 seconds: 30802.966165 creates/second
total: 1000000 creates in 37.443681 seconds: 26706.776650 creates/second

Signed-off-by: Ned Bass <[email protected]>
@ahrens this might be of interest to your metadata performance work. It highlights a problem I noticed while working on large dnodes #3542. Addressing that doesn't fix this performance issue, however. Disabling backfill is a band-aid; in the long term we need a better way to find holes, i.e. spacemaps.
Thanks @nedbass. I recently discovered this as well. I agree we need to take into account allocated dnodes that have not been written to disk yet. I'll be working on a design for that. I'll try to avoid changing the on-disk structure, though (i.e. not add space maps; range trees could be useful, however, for tracking what's allocated but not yet synced).
@nedbass nice find. This issue of not being able to cheaply determine whether a dnode has just been dirtied but not yet written has come up a few times recently. It clearly has a significant impact on create performance. I think space maps, range trees, or even bitmaps might all be reasonable approaches depending on exactly what use case is being optimized. None of these solutions necessarily requires us to change the on-disk format (which I agree would be a good thing).
Related to openzfs/openzfs#82
@@ -31,6 +31,8 @@
#include <sys/zap.h>
#include <sys/zfeature.h>

int zfs_metadnode_backfill = 0;
It probably makes sense to add a pool parameter for this instead of just a module option, so that it can be set persistently if users care more about dense allocations than create performance.
You mean to make it a zpool property? I think that would be a bad idea, since this is essentially a workaround for a performance bug.
Yes, if we can reliably detect holes using openzfs/openzfs#82, and add appropriate handling of ESRCH from dmu_object_next(), then this patch shouldn't be needed.
I don't think that openzfs/openzfs#82 is sufficient to address the performance problem that this patch is working around. The problem is that dnode_next_offset() (and dmu_object_next()) don't take into account objects allocated in memory but not yet synced to disk. Therefore, if we allocate more than an L1 (the comment about L2 is inaccurate) worth of dnodes in one txg, we will end up calling dnode_hold_impl() / dmu_object_next() on every allocated-in-memory object when we cross the L1 boundary.
OK, I misunderstood what that patch fixes. The symptoms are similar (dnode_next_offset() can detect fictional holes) but happen under different conditions.
If this patch were applied in a stack, used on a pool, and then removed for a subsequent iteration, would the holes missed due to this workaround be backfilled?
@sempervictus Yes, the holes would be backfilled. In fact, the backfilling behavior would be immediately restored if you dynamically set zfs_metadnode_backfill=1. This patch will have no effect on ZVOL performance, since a ZVOL is effectively one giant object and doesn't allocate new objects internally.
While this isn't the ideal long-term solution for this problem, it is a small, safe change which does significantly improve metadata performance today. @nedbass let me know if you're happy with this as a final version so it can be merged.
@behlendorf I'm fine with it in its current form. We'll probably want to revert this once the underlying issues are fixed.
AFAIK, the problem with using this patch in production is that it will result in monotonically increasing dnode numbers, and for a workload that is continuously creating and deleting files the metadnode would become very large and sparse until the filesystem is remounted. For long-running servers this might be unacceptable. Is there some way to track the number of freed dnodes (an in-memory per-pool counter) and reset the scanning once it hits some threshold (e.g. 4x the average number of dnodes allocated in the past few TXGs)? That makes it worthwhile to go back and re-scan, while ensuring the new dnodes are unlikely to hit recently allocated dnodes. It may have a noticeable performance cost to change from allocating all new dnodes in a block to filling in holes.
@adilger That's a good point. I wonder if metadata compression makes it less of an issue, though. I think zeroed-out blocks do not actually consume space on disk, and mostly-zero blocks should compress well. There is still a memory overhead problem with a very sparse metadnode if the working set is spread across the entire dnode space.
I suspect that a relatively simple heuristic as I outlined could be implemented, based on the number of freed dnodes, so that a create-mostly workload will never try backfilling, while a create/delete workload will do it occasionally when it is worthwhile to do so.
In addition to the count of freed dnodes, it might be worthwhile to track the minimum dnode number freed so that the allocator doesn't scan the whole metadnode each time.
Since the allocator will rescan the metadnode from the start on each boot, these values can be lazy or racy updates in memory.
Yes.
That's a good idea.
Yes, but is having a sparse dnode object actually a real problem? Sure, it's not the ideal long-term fix, but aside from possibly slightly worse memory utilization this seems like it wouldn't cause any issues. That said, I'm happy to hold off on doing anything here. If we get a little time, I agree it would be interesting to take a crack at what @adilger suggested, which is nice and simple.
I think there are a few potential drawbacks of never trying to backfill in a workload that does both creates and unlinks:
Is it correct to say that the major issue is dnode_next_offset_level() being incapable of detecting just-allocated dnodes?
@bzzz77 yes, that's correct. A secondary issue is that dmu_object_alloc() simply increments the object number and retries when dmu_object_next() returns ESRCH.
Then what if we had an in-memory structure tracking allocations in the current TXG and consulted it? TXG sync would then release that structure.
Even if there is an efficient structure for tracking in-progress allocations, I think there is still a benefit from not doing any scanning of metadnode blocks if there aren't any files being unlinked. For HPC at least, there may be a few hundred thousand file creates in one group, and then a similar number of deletes in a later group, so rescanning the metadnode for holes repeatedly during creation is wasteful unless there is some reason to expect that there are new holes (i.e. some reasonable number of dnodes have been deleted since the last time the metadnode was scanned).
Yes, obviously it makes sense to track deletes in some way as well.
Closing in favor of #4711 |