
Resilvering extremely slow #1110

Closed

mattlqx opened this issue Nov 24, 2012 · 9 comments
Labels
Type: Documentation (Indicates a requested change to the documentation)

Comments

@mattlqx

mattlqx commented Nov 24, 2012

I had a drive that was operational but exhibited clicking in my array, so I swapped it out with a new drive and started a resilver on it (the array is a 4-drive raidz1). The resilver was slow going, at a rate of about 900KB/s. Overnight the host froze (as it does when there is load on the zpool, but that's a separate issue). The host doesn't boot with grub2 since the pool is in a degraded state.

I've booted from a USB thumb drive (Gentoo 12.1 Live) and am continuing the resilver there, but rates are even slower, around 225KB/s. When the array was in a good state, this setup was pulling at least 20MB/s from each drive, so I'm pretty confident this is not drive- or controller-related. There are no errors in dmesg.

The software versions for the thumb drive boot are kernel 3.3.0 and zfs 0.6.0-rc8.
The rpool root, when booted, runs kernel 3.5.7 and zfs 0.6.0-rc11.

Here's an example iostat:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   20.00    0.00    19.50     0.00     1.95     0.20   10.10   10.10    0.00   9.30  18.60
sdb               0.00     0.00   20.00    0.00    15.50     0.00     1.55     0.11    5.55    5.55    0.00   5.00  10.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00   20.00    0.00    19.00     0.00     1.90     0.19    9.65    9.65    0.00   8.40  16.80
sde               0.00     0.00    0.00   59.00     0.00    55.50     1.88    10.00  167.27    0.00  167.27  16.95 100.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Status:

  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Sat Nov 24 05:21:23 2012
    15.8G scanned out of 2.70T at 223K/s, (scan is slow, no estimated time)
    3.86G resilvered, 0.57% done
config:

    NAME             STATE     READ WRITE CKSUM
    rpool            DEGRADED     0     0     0
      raidz1-0       DEGRADED     0     0     0
        sda2         ONLINE       0     0     0
        sdb2         ONLINE       0     0     0
        sdd2         ONLINE       0     0     0
        replacing-3  DEGRADED     0     0     0
          old        OFFLINE      0     0     0
          sde2       ONLINE       0     0     0  (resilvering)
    cache
      sdf2           ONLINE       0     0     0

The host is pretty much idle:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 7634428  73956 194156    0    0    43    42  136  173  0  0 99  0
 0  0      0 7634428  73956 194152    0    0    28    47  208  239  0  0 100  0
 0  0      0 7634428  73956 194152    0    0     0    40  138  167  0  0 100  0
 0  0      0 7633932  73956 194152    0    0   115   388  694  928  0  0 100  0
 0  0      0 7633932  73956 194152    0    0    36    42  196  235  0  0 100  0

For brevity, the zfs/zpool process list can be found here: http://pastebin.com/99j2F4HM

@mattlqx
Author

mattlqx commented Nov 25, 2012

Trying to see if it had anything to do with hardware, I took the cache device offline and also traded the fresh drive with a drive on the other controller to try to narrow that down. Initial behavior is the same... a slow ~250KB/s rate. Then, after crunching for a while, the rates started to burst up and they're all over the place, from a couple of MB/s per device all the way up to 15MB/s per device.

Then it'll hit a spot where it goes back down to ~300KB/s, lingers there for a while, and then starts bursting up again.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda              93.00     0.00  428.00    0.00 12154.00     0.00    56.79     0.36    0.85    0.85    0.00   0.33  14.00
sdb               0.00   116.00    0.00  249.00     0.00 11430.50    91.81     3.44   13.15    0.00   13.15   3.67  91.30
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00  570.00    0.00 12270.50     0.00    43.05     0.70    1.23    1.23    0.00   0.33  18.70
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00  549.00    0.00 12099.00     0.00    44.08     0.78    1.41    1.41    0.00   0.36  19.70
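For reference, the per-vdev rates can also be watched continuously with ZFS's own iostat rather than taking iostat -x snapshots by hand (the 5-second interval below is arbitrary):

# report per-vdev operations and bandwidth for the pool every 5 seconds
zpool iostat -v rpool 5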

@behlendorf
Contributor

@mattlqx Based on the iostat output you provided, I don't think there's anything actually wrong. You're just bumping up against the per-device IOPS limit. See how sde in the output is 100% utilized and is pushing 59 small writes per second.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde               0.00     0.00    0.00   59.00     0.00    55.50     1.88    10.00  167.27    0.00  167.27  16.95 100.00

Unlike traditional RAID, which rebuilds the entire device sequentially, ZFS needs to walk the entire file system namespace, which means increased IOPS and often small reads/writes.
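A quick sanity check for whether a resilver is IOPS-bound rather than bandwidth-bound (the interval below is arbitrary) is to compare the per-device operations per second against what a 7200 RPM SATA drive can physically deliver, which is on the order of 100-200 random IOPS:

# extended per-device statistics, refreshed every 5 seconds
iostat -x 5
# the same picture from ZFS's side, broken down per vdev
zpool iostat -v rpool 5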

@mattlqx
Author

mattlqx commented Nov 29, 2012

That's a fair explanation, I suppose. I didn't think the bursty nature of the resilver was the norm. But again, this is only a 4-drive SATA pool.

@behlendorf
Contributor

@mattlqx Unfortunately, if you have a lot of small files in your pool, it's the norm. It's also not really acceptable for the enterprise, so there is a design for a fast resilver feature floating around that just needs to be implemented.
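In the meantime, the resilver pacing can be nudged with the ZFS module parameters; the parameter name below comes from the 0.6.x-era scan code, so verify it exists under /sys/module/zfs/parameters on your build before relying on it:

# list the scan/resilver related tunables this module build actually exposes
ls /sys/module/zfs/parameters | grep -i -e resilver -e scan
# example only: allow more time per txg for resilver I/O (default is 3000 ms)
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms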

@aruntomar

Hi!

It's been more than a month now, and resilvering is still not complete. Please help. Every day it remains incomplete, we are at risk of losing data if another hard disk fails.

[root@storage ~]# zpool status
  pool: school
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Sun Jul 27 02:29:56 2014
    3.75T scanned out of 4.19T at 18.8K/s, (scan is slow, no estimated time)
    960G resilvered, 89.66% done
config:

    NAME             STATE     READ WRITE CKSUM
    school           DEGRADED     0     0     0
      raidz1-0       DEGRADED     0     0     0
        sda          ONLINE       0     0     0
        replacing-1  DEGRADED     0     0     0
          old        UNAVAIL      0     0     0
          sdb        ONLINE       0     0     0  (resilvering)
        sdc          ONLINE       0     0     0
        sdd          ONLINE       0     0     0

errors: No known data errors
[root@storage ~]# iostat -x
Linux 2.6.32-358.2.1.el6.x86_64 (storage) 08/21/2014 x86_64 (6 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    0.62    0.03    0.00   99.34

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdd               0.21     0.33   68.16    8.70   318.10    51.67     4.81     1.48   19.32  10.31  79.25
sda               0.24     0.34   71.43    9.12   331.19    53.96     4.78     1.44   17.90   9.33  75.15
sde               0.01     0.48    0.05    0.23     2.21     5.34    26.95     0.00    8.09   6.28   0.18
sdb               0.00     0.76   10.73   12.78    14.95    60.29     3.20     0.13    5.34   5.04  11.85
sdc               0.24     0.34   69.61    9.14   330.79    53.96     4.89     1.43   18.17   9.24  72.78
sdf               0.00     0.00    0.00    0.00     0.00     0.00     8.03     0.00    1.37   1.37   0.00
dm-0              0.00     0.00    0.05    0.67     2.18     5.34    10.42     0.01    7.55   2.44   0.18
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     8.00     0.00    4.94   2.50   0.00
dm-2              0.00     0.00    0.00    0.00     0.01     0.00     7.97     0.00    2.82   1.13   0.00

system details:
[root@storage proc]# cat /etc/redhat-release
CentOS release 6.3 (Final)

[root@storage proc]# uname -a
Linux storage 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

[root@storage proc]# free -m
             total       used       free     shared    buffers     cached
Mem:          7618       3594       4023          0        221        265
-/+ buffers/cache:       3107       4510
Swap:         7759          0       7759

@behlendorf
Contributor

@aruntomar According to the iostat data you've posted the rebuild is taking a long time because the average block size in your pool is very small (a little over a sector). The drives are being forced to perform a lot of small random IO in order to complete the rebuild.
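One way to confirm the block size distribution (assuming you can afford the extra I/O while the resilver is running; a zdb traversal of a multi-TB pool can itself take hours) is to have zdb walk the pool and print its block statistics:

# traverse the pool read-only and print block counts and size statistics
zdb -bb school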

@aruntomar

@behlendorf, thanks for the information. When I started the resilver process it was going at a rate of more than 6 Mbps. After reaching, I believe, 80%, it started crawling. Anyway, just wanted to provide this info in case it was relevant.

How do I check and increase the block size? And what should the optimal block size be?

Thanks,
Arun

@behlendorf
Contributor

How do I check and increase the block size? And what should the optimal block size be?

The default value is 128K for file systems and 8K for ZVOLs. However, you may end up with smaller blocks if individual files are smaller than this, or if the pool is near capacity and gang blocks are in use.

To check the block size:

zfs get recordsize
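Note that recordsize is a per-dataset property and only affects data written after it is changed; existing small blocks are not rewritten. To change it (the dataset name below is just a placeholder):

# applies only to files written from this point on
zfs set recordsize=128K pool/dataset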

@aruntomar

[root@storage ~]# zfs get recordsize
NAME                         PROPERTY    VALUE    SOURCE
school                       recordsize  128K     default
school@2014Jul3              recordsize  -        -
school@2014Jul10             recordsize  -        -
school/backup                recordsize  128K     default
school/backup@2014Jul3       recordsize  -        -
school/backup@2014Jul10      recordsize  -        -
school/testing321            recordsize  128K     default
school/testing321@2014Jul3   recordsize  -        -
school/testing321@2014Jul10  recordsize  -        -
