Resilvering extremely slow #1110
Trying to see if it had anything to do with hardware, I took the cache device offline and also swapped the fresh drive with a drive on the other controller to try and narrow that down. Initial behavior is the same... a slow ~250KB/s rate. Then after crunching for a while, the rates started to burst up and they're all over the place, from a couple MB/s per device all the way up to 15MB/s per device. Then it'll hit a spot where it goes back down to ~300KB/s, linger there for a while, and then start bursting up again.
@mattlqx Based on the iostat output you provided, I don't think there's anything actually wrong. You're just bumping up against the per-device IOPS limit. See how sde in the output is 100% utilized while pushing only 59 small writes per second.
Unlike traditional RAID, which rebuilds the entire device sequentially, ZFS needs to walk the file system namespace, which means more IOPS and often small reads and writes.
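To watch this yourself while the resilver runs, the extended device statistics from sysstat's iostat are enough; a minimal sketch (the 5-second interval is arbitrary):

# Extended per-device statistics, refreshed every 5 seconds.
# An IOPS-bound resilver typically shows %util near 100 on the busy
# drives while r/s and w/s stay low and wsec/s (throughput) stays small.
iostat -x 5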
That's a fair explanation, I suppose. I didn't think the bursty nature of the resilver was the norm. But again, this is only a 4-drive SATA pool.
@mattlqx Unfortunately, if you have a lot of small files in your pool, it is the norm. It's also not really acceptable for the enterprise, so there is a design for a fast resilver feature floating around which just needs to be implemented.
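In the meantime, some builds expose scrub/resilver throttling knobs as ZFS module parameters; whether these particular names exist, and what their defaults are, depends on the release, so treat the snippet below as an assumption to verify against your own system:

# List whatever scrub/resilver tunables this build exposes (names vary by release).
ls /sys/module/zfs/parameters/ | grep -Ei 'resilver|scrub'
# If present, zfs_resilver_delay throttles scan I/O on a busy pool;
# inspect the current value before considering any change.
cat /sys/module/zfs/parameters/zfs_resilver_delay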
Hi! It's been more than a month now and the resilver is still not complete. Please help. Every day it remains incomplete we are at risk of losing data should another hard disk fail.

[root@storage ~]# zpool status
errors: No known data errors

avg-cpu: %user %nice %system %iowait %steal %idle
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util

System details:
[root@storage proc]# uname -a
[root@storage proc]# free -m
@aruntomar According to the
@behlendorf, thanks for the information. When I started the resilver process it was going at a rate of more than 6 Mbps. After reaching, I believe, 80%, it started crawling. Anyway, just wanted to provide this info in case it was relevant. How do I check and increase the block size? And what would the optimal value for the block size be? Thanks,
The default value is 128k for file systems and 8k for ZVOLs. However, you may end up with smaller blocks if individual files are smaller than this, or if the pool is near capacity and gang blocks are in use. To check the block size:
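A minimal sketch with the standard zfs command (the dataset and volume names below are placeholders for your own):

# Record size used for new blocks in a file system dataset.
zfs get recordsize tank/dataset
# Block size of a ZVOL (fixed when the volume is created).
zfs get volblocksize tank/somevol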
I had a drive in my array that was operational but exhibiting clicking, so I swapped it out with a new drive and started a resilver onto it (the array is a 4-drive raidz1). The resilver was slow going, at a rate of about 900KB/s. Overnight the host froze (as it does when there is load on the zpool, but that's a separate issue). The host doesn't boot with grub2 since the pool is in a degraded state.
I've booted into a USB thumb drive (Gentoo 12.1 Live) and am continuing with the resilver there, but rates are even slower, around 225KB/s. When the array was in a good state, this setup was pulling at least 20MB/s from each drive, so I'm pretty confident this is not drive or controller related. There are no errors in dmesg.
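For comparing per-device throughput during the resilver against that earlier baseline, something like the following works (the pool name rpool and the 5-second interval are assumptions, adjust to your pool):

# Per-vdev read/write operations and bandwidth, sampled every 5 seconds.
# During a healthy resilver the replacement drive should show sustained writes.
zpool iostat -v rpool 5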
The software versions for the thumb drive boot are: kernel 3.3.0, zfs 0.6.0-rc8.
When booted from the rpool root, it runs kernel 3.5.7, zfs 0.6.0-rc11.
Here's an example iostat:
Status:
The host is pretty much idle:
For brevity, the zfs/zpool process list can be found here http://pastebin.com/99j2F4HM