ZFS takes 5-6 seconds to import and mount a pool #2190

Closed
turl opened this issue Mar 14, 2014 · 19 comments
Labels: Type: Performance (performance improvement or performance problem)

Comments


turl commented Mar 14, 2014

I'm filing this issue at @ryao's request:

Turl> I'm playing with systemd-analyze
Turl> and saw zfs takes 5-6s to start
Turl> is that a normal time for it?
Turl> I have a single disk pool
ryao> Turl: One of the OSv developers had this problem, but I did not think it affected us.
ryao> Turl: What could be happening is that zil replay is being done on import because there was no flush on export.
ryao> I think raphaelsc in #osv was the person who caught it. I have a fast SSD, so I had not noticed. File an issue.
ryao> Turl: What I described is the issue OSv had.

[attached image: systemd-analyze plot of the boot timeline]

The ZFS unit contains

ExecStart=/usr/bin/modprobe zfs
ExecStart=/usr/bin/zpool import -c /etc/zfs/zpool.cache -aN
ExecStart=/usr/bin/zfs mount -a
ExecStart=/usr/bin/zfs share -a
ExecStop=/usr/bin/zfs umount -a

I'm using Arch Linux, ZoL 0.6.2, Linux 3.13.6

dweeezil (Contributor) commented:

In my experience, the big slowdown is all the partition scanning and udev interaction with zvols. It would be interesting to see what dmesg | grep zd shows on your system because it does look like you've got some zvols (swap on zvol, at least).
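
(For reference, a quick way to gauge whether zvol device scanning is a factor is to count the zvol nodes and look for partition-scan messages. The zvol_inhibit_dev option mentioned below is only an optional experiment, assuming that module parameter is available in your ZoL version; setting it means the zvols cannot be accessed as block devices from the host:)

    # count zvol block devices and look for partition-scan messages
    ls /dev/zd* 2>/dev/null | wc -l
    dmesg | grep -E 'zd[0-9]+'

    # optional experiment: skip creating zvol device nodes entirely by adding
    # the following to /etc/modprobe.d/zfs.conf (assumes the zvol_inhibit_dev
    # parameter exists in this ZoL version; the host then cannot use the zvols)
    # options zfs zvol_inhibit_dev=1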


turl commented Mar 14, 2014

This is from a different boot than the one pictured above, but it should give you an idea:

$ dmesg|grep -Ei '(zfs|spl|zd)'
[    2.034510] SPL: Loaded module v0.6.2-1
[    2.055611] ZFS: Loaded module v0.6.2-1, ZFS pool version 5000, ZFS filesystem version 5
[    4.438508] SPL: using hostid 0xa8c0017a
[    4.475161]  zd0: unknown partition table

Note that I'm not using that zvol (or any other for that matter) for swap or any other purposes at the moment.


ryao commented Mar 14, 2014

@turl My hunch is that other things known to affect import times are not at work here. Would you include the output of zfs list -t all to confirm that?


ryao commented Mar 14, 2014

@dweeezil Having given this some thought, I believe there is potential for unflushed transactions in the ZIL to force ZIL replay during a normal boot. The numbers that the OSv developers shared with me show a 5 to 6 second penalty on Amazon EC2 that can be alleviated by flushing the ZIL at shutdown to avoid ZIL replay. The addition of a flush made them rather happy, and they did not report any noticeable increase in shutdown time. This suggests that the time needed to flush at shutdown is much less than the time needed to perform replay at boot. Consequently, it would be useful to extend the code so that the shutdown process can forcibly flush these things and avoid the ZIL replay penalty at boot.
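
(For reference, a rough way to check whether the on-disk ZILs actually hold records that would need replay is zdb's intent-log dump. It is only meaningful before the datasets are mounted, e.g. after a manual zpool import -N, since replay happens at mount time; "home" below is the pool from this report:)

    # import without mounting, then display intent log (ZIL) entries per dataset;
    # non-empty logs here mean replay work will happen when the datasets are mounted
    zpool import -N home
    zdb -iv home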

@ahrens If you have some time, would you provide a second opinion on what I have said?


turl commented Mar 14, 2014

@ryao here it is

    $ sudo zfs list -t all
    NAME                                        USED  AVAIL  REFER  MOUNTPOINT
    home                                        411G   503G   184K  /home
    home/android-aw-jb-out                      136K   503G   136K  /home/emilio/android/jb/out/
    home/android-aw-jb-out@empty                   0      -   136K  -
    home/android-sony-jb-out                    136K   503G   136K  /home/emilio/android/jb-sony/out/
    home/android-sony-jb-out@empty                 0      -   136K  -
    home/android-sony-kk-out                   15,4G   503G  15,4G  /home/emilio/android/kk/out/
    home/android-sony-kk-out@empty               96K      -   136K  -
    home/apps                                  2,95G   503G  2,95G  /home/apps
    home/borrar                                 152K   503G   152K  /home/emilio/Descargas/borrar
    home/buildroot-out                          381M   503G   381M  /home/emilio/buildroot/output
    home/ccache                                 483M   503G   482M  /home/emilio/.ccache
    home/ccache@empty                           112K      -   152K  -
    home/emilio                                 341G   503G   308G  /home/emilio
    home/emilio@auto_monthly-2013-10-25-1303   4,30G      -   251G  -
    home/emilio@auto_monthly-2013-11-24-1317   1020M      -   289G  -
    home/emilio@auto_monthly-2013-12-26-0406    990M      -   300G  -
    home/emilio@auto_weekly-2014-01-23-0306    1,01G      -   304G  -
    home/emilio@auto_monthly-2014-01-26-0303    723M      -   305G  -
    home/emilio@auto_weekly-2014-01-30-0339    1,03G      -   307G  -
    home/emilio@auto_weekly-2014-02-06-1253     492M      -   307G  -
    home/emilio@auto_daily-2014-02-11-1518      414M      -   308G  -
    home/emilio@auto_daily-2014-02-12-1106     36,0M      -   308G  -
    home/emilio@auto_daily-2014-02-13-0945        8K      -   308G  -
    home/emilio@auto_weekly-2014-02-13-1005       8K      -   308G  -
    home/emilio@auto_daily-2014-02-15-1136     19,8M      -   308G  -
    home/emilio@auto_daily-2014-02-16-0341      314M      -   308G  -
    home/emilio@auto_daily-2014-02-17-1008      414M      -   308G  -
    home/emilio@auto_daily-2014-02-18-0927      308M      -   308G  -
    home/emilio@auto_daily-2014-02-19-1112      402M      -   308G  -
    home/emilio@auto_daily-2014-02-20-1242     67,4M      -   308G  -
    home/emilio@auto_weekly-2014-02-20-1302    63,2M      -   308G  -
    home/emilio@auto_daily-2014-02-21-1015      420M      -   308G  -
    home/emilio@auto_daily-2014-02-22-1311      400M      -   308G  -
    home/emilio@auto_monthly-2014-02-23-1152   21,1M      -   309G  -
    home/emilio@auto_daily-2014-02-23-1212     28,3M      -   309G  -
    home/emilio@auto_daily-2014-02-24-0818      268M      -   308G  -
    home/emilio@auto_daily-2014-02-25-1315      194M      -   309G  -
    home/emilio@auto_monthly-2014-02-26-0156   31,7M      -   309G  -
    home/emilio@auto_daily-2014-02-26-0315     29,8M      -   309G  -
    home/emilio@auto_daily-2014-02-27-1335     13,0M      -   308G  -
    home/emilio@auto_weekly-2014-02-27-1355    15,7M      -   308G  -
    home/emilio@auto_daily-2014-02-28-0310      390M      -   309G  -
    home/emilio@auto_daily-2014-03-01-1340      288M      -   310G  -
    home/emilio@auto_daily-2014-03-02-0323      278M      -   310G  -
    home/emilio@auto_daily-2014-03-03-1249      112M      -   309G  -
    home/emilio@auto_daily-2014-03-04-1044      269M      -   309G  -
    home/emilio@auto_daily-2014-03-05-0310      213M      -   309G  -
    home/emilio@auto_daily-2014-03-06-1108     33,5M      -   310G  -
    home/emilio@auto_weekly-2014-03-06-1128    35,6M      -   309G  -
    home/emilio@auto_daily-2014-03-07-0317      300M      -   310G  -
    home/emilio@auto_daily-2014-03-08-1109      364M      -   310G  -
    home/emilio@auto_daily-2014-03-09-1329      301M      -   310G  -
    home/emilio@auto_daily-2014-03-10-1030      214M      -   310G  -
    home/emilio@auto_daily-2014-03-11-0332      274M      -   310G  -
    home/emilio@auto_daily-2014-03-12-0313      316M      -   310G  -
    home/emilio@auto_hourly-2014-03-12-2001    25,5M      -   310G  -
    home/emilio@auto_hourly-2014-03-12-2101    19,7M      -   310G  -
    home/emilio@auto_hourly-2014-03-12-2201    24,0M      -   310G  -
    home/emilio@auto_hourly-2014-03-12-2301    23,9M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-0001    22,7M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-0101    35,7M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-0201    45,3M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-1201    14,1M      -   310G  -
    home/emilio@auto_daily-2014-03-13-1213     15,8M      -   310G  -
    home/emilio@auto_weekly-2014-03-13-1233    15,3M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-1301    34,0M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-1401    17,5M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-1501    16,6M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-1601    11,2M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-1701    10,5M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-1801    12,8M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-1901    29,8M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-2001    33,9M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-2101    31,5M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-2201    58,7M      -   310G  -
    home/emilio@auto_hourly-2014-03-13-2301    21,6M      -   309G  -
    home/emilio@auto_hourly-2014-03-14-0001    33,7M      -   309G  -
    home/emilio@auto_hourly-2014-03-14-0101    23,2M      -   308G  -
    home/emilio@auto_hourly-2014-03-14-0201    26,5M      -   308G  -
    home/emilio@auto_hourly-2014-03-14-1101    13,2M      -   308G  -
    home/emilio@auto_frequent-2014-03-14-1115  9,09M      -   308G  -
    home/emilio@auto_daily-2014-03-14-1123     6,32M      -   308G  -
    home/emilio@auto_frequent-2014-03-14-1130  8,63M      -   308G  -
    home/emilio@auto_frequent-2014-03-14-1145  9,02M      -   308G  -
    home/emilio@auto_frequent-2014-03-14-1200  2,54M      -   308G  -
    home/emilio@auto_hourly-2014-03-14-1201    2,46M      -   308G  -
    home/libvirt-storage                       10,7G   503G  10,7G  /var/lib/libvirt/images
    home/lxc                                   1,86G   503G  1,86G  /home/lxc
    home/supervisor                            32,5G   503G  32,5G  /home/supervisor
    home/swap                                  5,32G   508G    76K  -


ryao commented Mar 14, 2014

@turl What @dweeezil suggested could be in effect here, but I suspect that ZIL replay is partly responsible as well.

dweeezil (Contributor) commented:

I mainly mentioned the issue because I've experienced this delay myself and have been thinking of ways to mitigate it. On most of my machines, all the zvols are for virtual guests and I really don't care about accessing their partitions from the host system (at least not right away after a reboot). The system I'm on right now took about 14 seconds to scan its 32 zvols. Things can get a lot worse if the snapshots are exposed.

@ryao Shouldn't the ZIL be empty after a "clean" shutdown in which the filesystems are unmounted? Of course, I'm not totally sure whether a normal shutdown actually causes the zfs_umount->zfs_sb_teardown->zil_close chain to be called (for each filesystem).


ryao commented Mar 16, 2014

@dweeezil You would think that, but I see no reason to believe it is, and there is sufficient evidence that it is not.

behlendorf (Contributor) commented:

There's a lot going on in your zfs systemd unit, so it's hard to say for certain what exactly is responsible for the delay. It could certainly be any or all of the things mentioned above; it depends on your exact situation.

We should be able to get a much better idea of where the time is being spent in the 0.6.3 code. @Lalufu did some really nice work (881f45c) adding support for systemd upstream and splitting the bring-up into multiple logical units. In addition, certain parts of the module initialization are now done asynchronously, which I'd expect to speed things up.

I'd want to see the systemd-analyze output for your systems using the new systemd units and the latest ZoL code. That should give us more insight into where the time is going.
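
(For reference, with the split units something like the following should show which step is slow; exact unit names may vary by distribution:)

    # per-unit startup cost, filtered to the ZFS units
    systemd-analyze blame | grep -i zfs

    # the dependency chain leading up to the mount step
    systemd-analyze critical-chain zfs-mount.service

    # or render the full boot timeline to an SVG
    systemd-analyze plot > boot.svg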

behlendorf added this to the 0.6.4 milestone Mar 21, 2014

Lalufu commented Mar 22, 2014

The last boot on my small storage took ~30 seconds in zfs-mount.service, mounting 24 file systems. If a pool was not shut down cleanly (the system was hard reset), is the work needed to get the pool into a clean state done on import, or on mount?

behlendorf (Contributor) commented:

ZIL replay in particular will be done per-filesystem during mount. However, pool import can potentially also result in significant IO.


imp commented Mar 31, 2014

I thought I'd add a couple of our observations - we have a really pathological case where pool bring-up can last for 15 to 20 minutes. The system that we are using separates the import and the mount phases (canmount=noauto), so we have a pretty good idea of how much time is spent in each phase (a rough way to time the two phases separately is sketched after the list below):

  1. import - this depends on the number of devices in the pool and whether or not a cache file is in use.
    Import itself can be split into user-land and kernel-land phases. Both can be sped up quite a bit.
  2. mount - this is completely driven by ZIL replay. We have a corner case that generates over 50,000 ZIL records. Replaying the ZIL in such a situation easily takes over 15 minutes.
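
A minimal sketch of timing the two phases separately, assuming a pool named "tank" (placeholder) that can be imported by hand:

    # phase 1: import only, without mounting any datasets (-N)
    time zpool import -N tank

    # phase 2: mount the datasets; with canmount=noauto they would instead
    # be mounted one at a time with "zfs mount <dataset>"
    time zfs mount -a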

BTW, when looking at ZIL replay we added timer kstat support and a ZIL replay kstat that collects timing info on the various ZIL record types. If there is enough interest I can submit a pull request for it.

While there is some low-hanging fruit with pool import, there is no simple solution for optimizing ZIL replay during mount - it is done sequentially. And it is not trivial to take advantage of normal file operations - the filesystem does not exist until the mount is finished.


ryao commented Mar 31, 2014

Issue #1526 and this are related. The fix will likely be the same.


Lalufu commented Mar 31, 2014

This is probably a very stupid question, but here goes:

Why does the ZIL have to be replayed completely on startup? Can't we just reconstruct the ARC from the ZIL, write out the data, and age out the ZIL normally? If the system crashes again, we still have the ZIL on disk.


imp commented Mar 31, 2014

@Lalufu The question is by no means stupid. The deal is that the ZIL doesn't keep raw ARC data. It instead gets very short records describing what high-level operation is going to be executed at the filesystem level (well, with one small exception, but bear with me). For example, say you are creating a file named FILE1. The record itself is very simple: you just write down the object id of the directory and the name of the file (together with its permissions, timestamps, etc. - i.e. metadata). Overall the amount of data to be recorded is small. When it comes time to execute this operation, however, it translates into the quite involved work of creating ZPL objects and ultimately ends up updating a significant number of data blocks on disk.
When the pool is up and running, the ZIL gets only a record of intention (it is an intent log) and all the real work is done in the context of the current TXG, where it has a chance of being aggregated with many other similar updates. In addition, kernel objects (such as inodes, dentries, etc.) are created to facilitate the efficient processing of subsequent operations. When the ZIL is replayed, none of these are available yet - the filesystem in question is in the process of being mounted, but hasn't got there yet.

There are only a few record types that may carry the actual raw data to be written, but even then you need to replay them one by one; otherwise you risk losing data due to unintentional re-ordering.

Hm, I am not sure I explained it well, but I'll be glad to elaborate if more details are needed.


turl commented Apr 20, 2014

With Arch updates today, I got a ZoL git version with the new systemd units. I see a ~2.5s time span on zfs-import-cache.service ("/usr/bin/zpool import -c /etc/zfs/zpool.cache -aN") and a ~1.5s one on zfs-mount.service ("/usr/bin/zfs mount -a")


ryao commented Jul 11, 2014

@dweeezil If you are right, then @turl might see an improvement from #2487.

behlendorf removed this from the 0.6.4 milestone Oct 31, 2014
behlendorf added this to the 0.7.0 milestone Jul 16, 2016
behlendorf (Contributor) commented:

Import times may be further reduced with #4794.


behlendorf commented Oct 11, 2016

Closing. Import times have been reduced considerably by various commits. There's definitely still room for improvement in some areas, but let's file new issues as needed.
