zfs create/destroy/mount bad scalability with respect to number of mounted datasets #845
Comments
I was thinking that maybe commit b740d60 is to blame, which fixes Issue #329 by disabling mnttab caching in the zfs command, but it does not look like the main performance problem. (Reverting this commit and re-enabling the cache doesn't make it much faster in my case: yes, the mtab file is read once and parsed, but the command still took essentially the same amount of time, mostly due to lots of ioctls.) I discovered that actually any zfs/zpool command starts to take lots of time. Even I tracked this to the function Other possibility to fix it, is to actually disable Other is to invent better transport than ioctls, because this can be a problem here. Differences with update_zfs_shares disabled in sa_init:
|
I believe the best option would be to wrap update_zfs_shares in sa_init() in a condition on some environment variable, like ZFS_AUTO_UPDATE_SHARES=yes. I can prepare a patch for this. |
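The environment-variable gate proposed above could look roughly like the following. This is a minimal sketch, not the libshare code: `sa_init_sketch()` and `update_zfs_shares_stub()` are hypothetical stand-ins for `sa_init()` and `update_zfs_shares()`, and the variable name `ZFS_AUTO_UPDATE_SHARES` is only the one suggested in the comment.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Returns 1 only when the (proposed, hypothetical) opt-in variable is
 * set to exactly "yes"; unset or any other value means "skip the update".
 */
int
zfs_auto_update_shares_enabled(void)
{
	const char *v = getenv("ZFS_AUTO_UPDATE_SHARES");

	return (v != NULL && strcmp(v, "yes") == 0);
}

/* Counts how often the expensive share update would have run. */
int share_updates = 0;

/* Hypothetical stand-in for the real update_zfs_shares() call. */
void
update_zfs_shares_stub(void)
{
	share_updates++;
}

/* Hypothetical stand-in for sa_init(): the update is now conditional. */
void
sa_init_sketch(void)
{
	if (zfs_auto_update_shares_enabled())
		update_zfs_shares_stub();
}
```

With this shape, commands that never touch shares pay nothing unless the administrator opts in, for example by running `ZFS_AUTO_UPDATE_SHARES=yes zfs share -a`.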
(I'm not a ZFS developer -- just a big fan and user.) Perhaps this ticket should not be closed. The problem is still in released 0.6.2, and trivial to replicate -- just make a pool on a sparse image file (say), then create 1000 filesystems (all empty), and do "time zfs -?" and it takes several seconds. There's another ticket at #821 and several discussions all over the web about how one can't use ZFS with a large number of filesystems, all because of this. If you comment out the line "update_zfs_shares(impl_handle, NULL);" in lib/libshare/libshare.c, then re-install zfs from source, "time zfs -?" is nearly instant again, as is everything else, including things like listing all snapshots, making snapshots, etc. They all become "1000 times" faster for me. For my project, which absolutely requires having 10000+ filesystems in a single pool, this speedup is absolutely critical. I'm not using NFS at all, so disabling the update_zfs_shares is fine for me. |
Even though this is closed (which it shouldn't be), it is related to #1484. |
wat the fuk, why would update_zfs_shares() be called for listings? this takes AGES to execute on spinning rust. Please fix! :-( |
libshare is initialized globally, for all commands. This means that even something like I've tried to find where and when it (libshare) is initialized/started, but it's even initialized before zpool:main(), so I have no idea how to fix this. |
@williamstein @FransUrbo At a minimum for 0.6.3 we should be able to update the code so |
@behlendorf That was my idea as well. I started with this, but when I noticed that libshare was initialized before main() and couldn't figure out why, I didn't know how to proceed. Have any hints for me? |
@FransUrbo Sure, that's caused by the It's worth taking a look at |
For anyone suffering from this issue could you please try the following patch. It does two things:
I wasn't able to consistently reproduce the slow behavior in my VM so I'd be interested to see how these two small changes help your systems.
diff --git a/cmd/zfs/zfs_main.c b/cmd/zfs/zfs_main.c
index 3f54985..9fac5b2 100644
--- a/cmd/zfs/zfs_main.c
+++ b/cmd/zfs/zfs_main.c
@@ -6467,7 +6467,7 @@ main(int argc, char **argv)
 	/*
 	 * Run the appropriate command.
 	 */
-	libzfs_mnttab_cache(g_zfs, B_FALSE);
+	libzfs_mnttab_cache(g_zfs, B_TRUE);
 	if (find_command_idx(cmdname, &i) == 0) {
 		current_command = &command_table[i];
 		ret = command_table[i].func(argc - 1, argv + 1);
diff --git a/lib/libshare/libshare.c b/lib/libshare/libshare.c
index 6625a1b..ea59dcd 100644
--- a/lib/libshare/libshare.c
+++ b/lib/libshare/libshare.c
@@ -105,14 +105,6 @@ libshare_init(void)
 {
 	libshare_nfs_init();
 	libshare_smb_init();
-
-	/*
-	 * This bit causes /etc/dfs/sharetab to be updated before libzfs gets a
-	 * chance to read that file; this is necessary because the sharetab file
-	 * might be out of sync with the NFS kernel exports (e.g. due to reboots
-	 * or users manually removing shares)
-	 */
-	sa_fini(sa_init(0));
 }
 
 static void
|
@behlendorf In my tests for #1484 I found that enabling the mtab cache DOES help. Not by a huge amount, but anyway. But is it a good idea? The script I used to test this looks like this:
This took several hours initially, but some of the improvements I've sent pull requests for and enabling the mtab cache (plus a much, much newer ZoL) have cut this substantially. I'll rerun the test, with and without libshare, and report some numbers as soon as I'm sure that my ZVOLs work as they're supposed to. But if you change the sub levels from But it's weird that you can't reproduce the problem. Did you test with libshare completely disabled/enabled, not just the libshare_init() fix above? On my live machine, it currently takes about an hour and a half to mount 613 filesystems. |
For what it's worth... When I had approximately 1000 filesystems in my pool, I had a look at In the end, I currently have 2100 filesystems in my pool and using the script and zfs-mount program below it takes less than 4 minutes to mount everything.
Source for zfs-mount:
|
Allowing the mtab file to be cached means there's a window where the version cached by the @FransUrbo I'll try your test script. I did something similar for my testing but I could never get it to take more than about 1 minute in my VM to mount and share 1000 filesystems. |
@behlendorf I'll do some testing with the mtab caching enabled. However I'm wondering, for the situations where the mtab is managed by the kernel, is there a way of detecting this and turning off the mtab manipulation altogether? It's also interesting that you haven't been able to produce more than a minute's delay in mounting 1000 filesystems, whereas I was seeing over an hour. I'll see what strace can tell us about where the time is being spent, and perhaps dive deeper into the kernel if any particular system calls stand out. |
To some degree it's already done. These days the mount helper will detect if If someone could write a little script and verify it reproduces the bad behavior in a VM that would be helpful to me. I'm not having much luck causing a problem as serious as that described here. What happens currently isn't optimal, but in my testing you could certainly live with it. |
@behlendorf With the mtab cache enabled the time to |
I've now been running without the 'sa_fini(sa_init(0))' part for about a week and everything seems to be working. I have not done any speed tests to see if it actually made any difference, but it doesn't seem like it does... @behlendorf have you been successful in reproducing the problem? PS. I've just recently (yesterday!) been able to boot my ZFS root installation with Debian GNU/Linux Wheezy (all 64bit) which DOES have mtab as a symlink, but I don't notice any difference in mount speed. Note though, that I was extremely paranoid when I created the dataset, so I used Oh, I just triple checked. I missed the mtab cache enable. I'll enable that as well and see what I find. |
Do note that the mtab cache part is incomplete. I found in #1484 that there were a lot of places where the code opens/reads/seeks/closes mtab directly, without going through libzfs... Finding and fixing those might speed things up even more. |
It sounds like re-enabling the cache has been helpful for @chrisrd so I'll apply that patch. I'm also going to make the @FransUrbo I can easily believe that not everything uses the mtab cache. We should probably address those cases one by one as we discover them, but in the meantime I don't think that needs to prevent us from enabling the cache. Unfortunately, I still haven't been able to reproduce the issue on any of my test systems. |
Re-enable the /etc/mtab cache to prevent the zfs command from having to repeatedly open and read from the /etc/mtab file. Instead an AVL tree of the mounted filesystems is created and used to vastly speed up lookups. This means that if non-zfs filesystems are mounted concurrently the 'zfs mount' will not immediately detect them. In practice that will rarely happen and even if it does the absolute worst case would be a failed mount. This was originally disabled out of an abundance of paranoia. NOTE: There may still be some parts of the code which do not consult the mtab cache. They should be updated to check the mtab cache as they are discovered to be a problem. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Turbo Fredriksson <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Issue #845
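To illustrate the idea behind the cache described in this commit message (read the mount table once, then answer lookups from memory), here is a minimal sketch. It is not the libzfs implementation: it swaps the AVL tree for a sorted array searched with `bsearch()`, the names `mnt_cache_load()` and `mnt_cache_find()` are hypothetical, and it ignores mtab escape sequences such as `\040` in paths.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define	MNT_NAME_MAX	256

typedef struct mnt_entry {
	char special[MNT_NAME_MAX];	/* first mtab field, e.g. "tank/fs1" */
	char mountpoint[MNT_NAME_MAX];	/* second mtab field */
} mnt_entry_t;

typedef struct mnt_cache {
	mnt_entry_t *entries;
	size_t count;
} mnt_cache_t;

static int
mnt_entry_cmp(const void *a, const void *b)
{
	return (strcmp(((const mnt_entry_t *)a)->special,
	    ((const mnt_entry_t *)b)->special));
}

/*
 * Read every line of an mtab-style stream exactly once, then sort the
 * entries so later lookups are O(log n). Returns 0 on success, -1 if
 * memory runs out.
 */
int
mnt_cache_load(mnt_cache_t *mc, FILE *fp)
{
	char line[1024];
	size_t cap = 0;

	mc->entries = NULL;
	mc->count = 0;
	while (fgets(line, sizeof (line), fp) != NULL) {
		mnt_entry_t e;

		/* Skip lines that lack the two leading fields. */
		if (sscanf(line, "%255s %255s", e.special, e.mountpoint) != 2)
			continue;
		if (mc->count == cap) {
			size_t ncap = (cap != 0) ? cap * 2 : 64;
			mnt_entry_t *n = realloc(mc->entries,
			    ncap * sizeof (mnt_entry_t));

			if (n == NULL) {
				free(mc->entries);
				mc->entries = NULL;
				mc->count = 0;
				return (-1);
			}
			mc->entries = n;
			cap = ncap;
		}
		mc->entries[mc->count++] = e;
	}
	if (mc->count > 1)
		qsort(mc->entries, mc->count, sizeof (mnt_entry_t),
		    mnt_entry_cmp);
	return (0);
}

/* Look up a mounted dataset by name without touching the file again. */
const mnt_entry_t *
mnt_cache_find(const mnt_cache_t *mc, const char *special)
{
	mnt_entry_t key;

	if (mc->count == 0)
		return (NULL);
	snprintf(key.special, sizeof (key.special), "%s", special);
	return (bsearch(&key, mc->entries, mc->count, sizeof (mnt_entry_t),
	    mnt_entry_cmp));
}
```

The trade-off is the one the commit message names: anything mounted after `mnt_cache_load()` runs is invisible to `mnt_cache_find()` until the cache is rebuilt.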
Removes the unconditional sharetab update when running any zfs command. This means the sharetab might become out of date if users are manually adding/removing shares with exportfs. But we shouldn't punish all callers to zfs in order to handle that unlikely case. In the unlikely event we observe issues because of this it can always be added back to just the share/unshare call paths where we need an up to date sharetab. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Turbo Fredriksson <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Issue #845
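The fallback mentioned in this commit message, adding the refresh back only on the share/unshare paths, could be sketched as a lazy, once-per-process refresh. Everything here is hypothetical: `refresh_sharetab_stub()` stands in for the removed `sa_fini(sa_init(0))` call, and `cmd_list()` / `cmd_share()` stand in for zfs subcommands.

```c
/* Counts how often the (expensive) sharetab refresh actually ran. */
int refresh_count = 0;

/* Set once the refresh has been done in this process. */
int sharetab_refreshed = 0;

/* Hypothetical stand-in for the removed sa_fini(sa_init(0)) call. */
void
refresh_sharetab_stub(void)
{
	refresh_count++;
}

/* Refresh the sharetab at most once, and only when a caller asks. */
void
ensure_sharetab_fresh(void)
{
	if (!sharetab_refreshed) {
		refresh_sharetab_stub();
		sharetab_refreshed = 1;
	}
}

/* A listing-style command never needs the sharetab, so it skips it. */
int
cmd_list(void)
{
	return (0);
}

/* A share-style command requests an up-to-date sharetab first. */
int
cmd_share(void)
{
	ensure_sharetab_fresh();
	return (0);
}
```

Under this arrangement only commands that read or modify shares pay for the refresh, which matches the stated intent of not punishing every caller of zfs.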
Bumping to 0.6.4. Things have improved here somewhat but there's clearly more to do. |
I've been seriously hit by this bug tonight after upgrading my zfsonlinux host from squeeze to wheezy (the bug prevented the system from booting, or at least I did not have enough patience). Now, the zpool import remains quite long (maybe a minute or 2), and the "zfs mount -a" also takes less than 2 minutes. Back to business (and having "zfs -h" respond immediately is greatly appreciated). David |
@douardda @chrisrd @baryluk @williamstein @Rudd-O in your opinion are we at a point with the 0.6.3 tag where the performance when managing 1000's of filesystems is generally acceptable? |
The large part of the problem I believe was fixed in 0bc7a7a and abbfdca (Issue #1498). |
On Mon, Oct 6, 2014 at 11:45 PM, Turbo Fredriksson <[email protected]
Is it linear in the number of filesystems, so if there were 6710, then it For what it's worth, I re-architect my site (cloud.sagemath.com) to use one
William Stein |
Yes, it should be roughly linear. OK, thanks for the feedback. I'm glad you found a way to make ZFS work in your environment. |
It doesn't look like it's linear:
I honestly don't know why that list takes less than two, and the one with only 670 filesystems took almost ten... Creating those 3k8 filesystems took "for ever" though! |
OK, then I'm closing out this issue. If there are still specific use cases which need to be improved let's open new bugs for them. |
This may be worthy of another bug, @mailinglists35. |
@behlendorf is it normal/expected on latest release (which already has mtab cache enabled) to still observe
however |
…ndex run (openzfs#845) The zettacache index cache is updated as part of merging the PendingChanges into the on-disk index. The merge task sends the updates to the checkpoint task, as part of a `MergeProgress` message. The index cache updates are then made from a spawned blocking (CPU-bound) task. The updates are completed (waited for) before the next checkpoint completes. During the merge, it's expected that lookups can see IndexEntry's from the old index, either from reading the old index itself, or from the index entry cache. These stale entries are "corrected" by either `PendingChanges::update()`'s call to `Remap::remap()`, or `MergeState::entry_disposition()`'s check of `PendingChanges::freeing()`. When the `MergeMessage::Complete` is received it calls `Locked::rotate_index()` which deletes the old on-disk index, and calls `PendingChanges::set_remap(None)` and `Locked::merge.take()`. This ends the stale entry "corrections" mentioned above, which are no longer necessary because we can no longer see stale entries from the old on-disk index. The problem occurs when the `MergeMessage::Complete` is received and processed before the spawned blocking task completes. In this case, we end the stale entry "corrections", but we can still see stale entries from the index cache. This PR addresses the problem by waiting for the index cache updates to complete before processing the `MergeMessage::Complete`. The problem was introduced by openzfs#808.
With about 1000 datasets on my zfs pool, it takes about 20 seconds to create a new one, or to mount one more.
zfs import or zfs mount -a can take about an hour.
I believe the bug is in mount.zfs, which checks /etc/mtab and /proc/mounts multiple times
and updates these files. This is extremely unnecessary, and makes
zfs mount/create/destroy/import extremely slow when the system or zfs has many mounted datasets.
A solution would be to use something better than a plain linear
structure for mtab, or better, to ignore mtab altogether: try to mount without checking
whether it is already mounted, and do the checking only in case of error.
Precise measurements are visualised in this plot: http://i.imgur.com/oZGlb.png
strace of zfs create / mount shows that a large amount of time is spent reading/writing /etc/mtab.
After each lseek, the whole /etc/mtab file is read.
/etc/mtab should be read only once, and stored in memory.
If possible it should not be used at all. Just try to mount; if it fails, check why;
if it succeeds, just append the proper line to /etc/mtab (with proper file locking).
Thanks,
Witek
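The "append the proper line to /etc/mtab with proper file locking" step suggested in the report could be sketched as below. This is an assumption-laden illustration, not what mount.zfs does: the function name `mtab_append_locked()` is hypothetical, and it uses a POSIX advisory `fcntl()` write lock on the file itself, whereas real mount tools may coordinate through a separate lock file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Append one mtab-style line to 'path' while holding an exclusive
 * advisory lock on the whole file. Returns 0 on success, -1 on error.
 */
int
mtab_append_locked(const char *path, const char *special,
    const char *mountpoint, const char *fstype, const char *opts)
{
	char line[1024];
	struct flock fl;
	int fd, len, rc = 0;

	len = snprintf(line, sizeof (line), "%s %s %s %s 0 0\n",
	    special, mountpoint, fstype, opts);
	if (len < 0 || (size_t)len >= sizeof (line))
		return (-1);

	/* O_APPEND makes the write land at the current end of file. */
	fd = open(path, O_WRONLY | O_APPEND);
	if (fd < 0)
		return (-1);

	/* l_start = 0 and l_len = 0 together cover the entire file. */
	memset(&fl, 0, sizeof (fl));
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	if (fcntl(fd, F_SETLKW, &fl) != 0) {
		(void) close(fd);
		return (-1);
	}

	if (write(fd, line, (size_t)len) != (ssize_t)len)
		rc = -1;

	fl.l_type = F_UNLCK;
	(void) fcntl(fd, F_SETLK, &fl);
	(void) close(fd);
	return (rc);
}
```

The point of the sketch is the cost model: one open, one lock, one write per mount, with no read of the existing file contents, so the work per mount does not grow with the number of mounted datasets.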