Resilver restarting? #840
Comments
I actually had this happen on vanilla Linux mdraid once. Basically, the Linux kernel tries to read the sector and the HDD fails to return it because the sector has become corrupted or unreadable for some reason. This in turn caused the RAID sync to fail. There is a dangerous "fix" available. In order to force a reallocation of the broken HDD sector, you can for example do hdparm --write-sector . Note that this is really very dangerous, as it effectively zeroes out a whole HDD sector, and I don't really know how ZFS will deal with the situation. Your pool is already degraded, which potentially worsens the situation; for example, data in those sectors might not be redundant anymore (you have several read errors across the raidz2-2). I've seen something similar happen on another storage server with WD disks, a kind of "WD bomb" if I may. I'd consider swapping them out for something else ASAP. I'm not really sure how ZFS should act in case of unreadable sectors, since Linux will never reallocate the sector by itself, so potentially this could kill a pool if unreadable sectors accumulate. In essence this is a form of silent data corruption that we aren't protected from, and it requires periodic active monitoring of pool health by the sysadmin. Perhaps @behlendorf can comment on this. edit: Or could this be a bug where ZFS on Linux does not try to get the broken sector's data from another device when the primary method fails, or could the redundancy already be exhausted here? |
A fast way to check this on a single device is, of course, to run badblocks -s -c256 /dev/device (this is read-only, so it should be safe; check the man page for options). |
Thanks for your response. I'm not actually concerned about the individual disk errors; I see them quite a bit here, and they're one of the things the redundancy in "RAID" is supposed to save you from. I've definitely seen ZFS recover from these at other times: it's designed to fill in the missing info from the other disks and rewrite it back onto the errored disk. I think it rewrites the data to the exact same sector, relying on the disk remapping a dead sector, rather than writing the data to a different sector and updating its own block pointers. The issue is the resilver restarting when it hits an error on that specific disk. FYI, since first posting I've had another 5 errors on
|
Firstly, I hope you have backups.
ZFS deals with the destroyed sector just fine, provided that there's redundancy available on other disk(s).
This is what worries me also. That many errors across the same vdev, and already 2 disks out of operation. A dangerous situation for sure. Also it's interesting to see only the WD disks are erroring like that, and having that many errors too in the scrub. For future reference, that's why one shouldn't put all similar disks in one redundant vdev if at all possible, but spread them around.
ZFS is designed to protect your data specifically from errors like this. It isn't "silent" since ZFS reports it, now is it? Linux mdraid and normal hardware raid solutions however are affected by silent corruption.
Not a bug, ZFS works perfectly fine. Might be that redundancy is lost already.
That's correct. See implementation details specific to raidz here: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c - line 2109. IIRC, this is the only case copy-on-write isn't used. Now, for what is happening... most likely the resilver restarts because ZFS gets errors from sdbc. If you're lucky, the read/checksum errors on other disks don't correspond to those sectors on sdbc and redundancy is intact. If you have backups, yank out sdbc completely and you should see resilver finish using parity data. Or stop ZFS, and run a This of course only works if parity data corresponding to those sectors you destroy is intact - hence the backups. But it beats yanking the whole disk, since most of sdbc is still readable I'd guess. |
Tbh, I'd be very concerned about all those read errors. Even though ZFS should be able to save your bacon most of the time, that's just spectacularly flaky, garbage hardware right there from WD :-/ In the very least, errors like that will slow down resilvers and pool performance. At worst, they will cause complete pool loss if they overlap just so. Middle ground seems to be issues like this you're having here - however this should still be recoverable from. My 2 cents: toss those WDs as far as you can as soon as possible. No sense in using faulty hardware... |
Oh, yes, those WDs will be replaced as soon as possible! But first I have an operational interest in how well this goes with flakey hardware and all. Note that it's raid-z2 and only one disk has actually disappeared altogether so there's still a parity disk left. Of course, if there are 2 or more disks with errors on the same ZFS block then it's lost data. However even in that case it should only mean a lost file if it's on a data block, which should then be reported at the end with Statistically... thus far a total of 169 sector errors have been reported since the resilver originally started, so 84.5 kB of data. There's nearly 25 TB of used data in the pool, so assuming it's evenly distributed over the 3 vdevs there's 8.3 TB on the failing vdev, and spread over the 10 remaining disks (+ 1 dead) that's around 830 GB per disk. So grossly speaking, 84.5 kB out of 830 GB means there's a pretty good chance the errors aren't in the same ZFS block. At least that's what I'm hoping! ...if the resilver is actually progressing and not simply going over the same ground again and again. Thanks for the suggestion of Cheers! |
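For anyone who wants to re-run the back-of-the-envelope numbers above, here is a minimal sketch. It assumes 512-byte sectors for the reported errors and decimal units for the capacity figures; neither is stated outright in the thread, and the variable names are made up for illustration.

```python
# Rough re-calculation of the figures quoted in the comment above.
# Assumptions (not confirmed in the thread): each reported error covers one
# 512-byte sector, and 1 TB = 1000 GB.

SECTOR_BYTES = 512      # assumed sector size for the reported errors
sector_errors = 169     # sector errors reported since the resilver started
pool_used_tb = 25       # "nearly 25 TB" of used data in the pool
vdevs = 3               # data assumed evenly spread over the vdevs
disks_in_vdev = 10      # remaining disks in the failing vdev

bad_bytes = sector_errors * SECTOR_BYTES
bad_kib = bad_bytes / 1024
per_vdev_tb = pool_used_tb / vdevs
per_disk_gb = per_vdev_tb * 1000 / disks_in_vdev
fraction_bad = bad_bytes / (per_disk_gb * 1e9)

print(f"unreadable data: {bad_bytes} bytes ({bad_kib:.1f} KiB)")
print(f"data per vdev:   {per_vdev_tb:.1f} TB")
print(f"data per disk:   {per_disk_gb:.0f} GB")
print(f"bad fraction:    {fraction_bad:.2e} of one disk's share")
```

The tiny fraction is only a rough indication that overlapping errors are unlikely; it says nothing about whether the bad sectors cluster together on the platters.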
Just to clarify, when I said "I'm not actually concerned about the individual disk errors", I should have continued on with "with respect to the immediate issue". Obviously a disk with a high error rate is something to be concerned about, and even more so a bunch of disks! However right now, with the errors in play, I'm wondering if the resilver is actually progressing or not. |
I also noticed you have mixed 512 byte (EADS) and 4k byte (EARX) sector disks there. That could complicate sector overwrite operations, as one faulty 4k internal sector could make 8 emulated sectors unreadable. Is the pool made with ashift=12 option? Hmm, only the EADS disks emit read errors so far? Interesting too, maybe the better ECC in "advanced format" disks helps... |
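To illustrate the "one faulty 4k internal sector could make 8 emulated sectors unreadable" point, here is a small sketch. It assumes the emulated 512-byte sectors are aligned to the physical 4096-byte sectors starting at LBA 0, which is common but not guaranteed; the function names are hypothetical.

```python
# Sketch: which 512-byte logical sectors share one 4096-byte physical sector
# on a disk that emulates 512-byte sectors. Assumes the logical-to-physical
# mapping is aligned at LBA 0 (an assumption, not something from the thread).

LOGICAL_BYTES = 512
PHYSICAL_BYTES = 4096
RATIO = PHYSICAL_BYTES // LOGICAL_BYTES  # 8 logical sectors per physical one


def affected_lbas(bad_lba: int) -> range:
    """Return the logical sectors lost if the physical sector holding bad_lba fails."""
    first = (bad_lba // RATIO) * RATIO
    return range(first, first + RATIO)


def ashift_block_bytes(ashift: int) -> int:
    """Smallest allocation unit for a given ashift: 2**ashift bytes."""
    return 1 << ashift


if __name__ == "__main__":
    print(list(affected_lbas(1003)))   # the 8 LBAs sharing a physical sector
    print(ashift_block_bytes(12))      # 4096-byte allocation unit
    print(ashift_block_bytes(9))       # 512-byte allocation unit
```

With ashift=12 every ZFS write is a multiple of 4096 bytes, so a rewrite of a repaired block would cover whole physical sectors rather than part of one.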
It's I hadn't really considered the ECC on the 4k disks before, but yes, that would make sense. The other thing about the EADS is they have TLER / SCTERC (set to 7s), as do the Hitachis, whereas the EARX and Seagates don't. So they're more likely to report bad sectors rather than try really, really, really hard (for minutes) to get the data back. But the EADS are also the oldest disks in the box which may have something to do with the errors seen there. |
Thanks for the detailed information, guys. I haven't looked into why the resilver keeps restarting, but if I were to venture a somewhat educated guess, it would be because the vdev is disappearing/reappearing. The lower-level mpt2sas driver may be attempting various recovery techniques to get access to the sector. These could involve resetting the device or the channel, which would percolate back up the stack to ZFS. But as I said, that's just a guess. Still, we should leave the bug open so we can determine exactly what's causing it and decide what to do about it. |
TLDR; having moved the disks to a different chassis the resilver is progressing nicely (7 hrs, 10 TB and counting) without any disk errors. So my immediate issue is resolved, and thanks to those who've responded! It's still of interest as to why errors from For those interested... I finally gave up on the resilvering as it really didn't seem to be getting anywhere and decided to do a disk to disk copy of the
This was a variation on the Much to my surprise the The difference was in the case of resilvering I had the 33 disks in the pool all being hit at the same time, versus the The actual physical layout is a SuperMicro 847E26-R1400LPB head box with a LSI SAS 9211-8i HBA, connected to another identical SuperMicro wired as a simple JBOD, then daisy-chained to a Norco RPC-4224 JBOD. All the disks that were erroring i.e. the WD EADSes were in the Norco, along with a bunch of the other disks that weren't erroring. On a hunch, I moved all the disks in the pool I was trying to resilver out of the Norco and into the 2nd Supermicro. It's now been resilvering away for 7 hours with 10 TB done, and not a single disk error from any of the disks. Only 20 TB to go. Augh! Suddenly I'm not looking kindly upon that Norco. Hmmm, I wonder, if I had a big enough blender, "will it blend"? |
Long story short: use nearline SAS disks instead of SATA with SAS expanders. Longer version: your SuperMicro 847E26-R1400LPB has a SAS2 backplane+2 expanders, correct? I'm playing with those at my current job, and that exact same HBA... good hardware. Anyhow, how's the 2nd chassis wired to the Norco's 6 iPASS connectors? Reports say that SAS expanders + SATA disks are a possible recipe for disaster, especially under moderate to heavy load. Even with the newest HBA+expander firmware, there's a possibility of reset storms (or other weird error conditions) that take out a whole backplane's worth of disks either momentarily or until power cycle, depending on how badly a misbehaving SATA disk crashes the SAS subsystem. Had you described that system layout in the 1st post... :-) Most likely the problem isn't your Norco chassis, the "backplanes" in it are simple passthrough devices... nothing fancy in them. Possibly you just moved the failure domain so that the HBA/expander(s) can better deal with flaky SATA disks. |
Mind, direct attaching SATA disks to SAS HBAs should work just fine. It's when you add SAS expanders (and SATA Tunneling Protocol) to the mix that you start encountering weird errors with misbehaving SATA disks. SATA simply isn't designed with a switching fabric in mind, and it looks like firmwares still aren't robust enough to deal with command tunneling reliably in the presence of errors. Too bad that nearline SAS disks are so much more expensive than SATA disks. One can hope prices come down sooner than later... |
Norco is connected using a Chenbro CK23601. I can definitely blend that if I need to! It uses the same LSI SAS2X36 expander as the Supermicro SC847E26-RJBOD1. I'm interested in discussing the hardware side of things further; perhaps we could move that part over to zfs-discuss for a wider audience? On the immediate issue: 27 hours in, and 21 TB of 30 TB completed. 2 disk errors reported, both on EADS disks, neither resulting in a resilver restart:
I guess that means I've just gotten lucky that the problem disk hasn't suffered an error and caused a resilver restart. Fingers crossed for the next 7 hrs! Current pool status:
Note that, with the move to the Supermicro chassis, the |
FYI, the resilvering completed with several more disk errors, but none on |
Just confirming this as more than just a once-off freak: I'm seeing this "resilver restarts on error from replacement disk" problem happening again with different disks. Once again I was replacing a faulted-out disk, using
...which precisely matches the timestamp of a disk error seen on the new replacement disk:
The relevant section of the
|
FYI, "resilver restarts on error from replacement disk" also occurs when the original disk is still present (previously I'd only seen it when the original disk had been faulted out). |
Yep, that looks like it addresses the problem. I don't have a test case to confirm, but I'd say the ticket can be closed. For anyone following the saga above, in the end my problems with multiple EADS disks going bad were traced down to a single dodgy EADS. It would do "something", and other EADS disks would experience errors. Once that one disk was replaced (during a program to replace all the EADS disks), all the problems went away. Go figure. I guess the EADS disks are marginal on their bus specs or something, perhaps only seen in conjunction with SATA over SAS expanders, etc. Since then the replaced EADS disks (minus the dodgy one) have been used in test pools, and we've not seen any similar issues, even during prolonged stress testing. |
I really hate bumping old tickets, but I'm not sure what to do. I have the exact same issue running 0.6.5-304_gf74b821. A failing drive was still present in the machine and repeatedly caused the pool to restart resilvering, possibly due to this problem. The resilver restarted about 5 times before I finally found this issue, offlined the bad device and removed it from the chassis. My setup is a Supermicro enclosure with SATA disks behind SAS HBA/expanders. |
Hey, 0.6.5.6-1 Some I/O Errors in log..
|
There's probably a bug to be found here, but as a mitigation for either of you: do you have TLER configured on the failing drives? (For example, smartctl -l scterc,RNUM,WNUM [drive], where RNUM/WNUM are the time in tenths of a second to attempt a read/write before giving up; 0 will generally cause the drive to never give up, and if it takes too long, that can be when the various other parts of the chain decide to reset it forcibly.) I'd suggest trying it if you can because, presuming the implementation isn't buggy, most parts of I/O stacks respond better to drives that reply after X seconds that they've had an error than to drives that stop responding entirely until you reset the device. |
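Since the scterc values are in tenths of a second, it is easy to be off by a factor of ten. A small sketch of the conversion follows; the helper names and the /dev/sdX path are placeholders for illustration, and the code only builds the command line without running it.

```python
# Sketch: turn timeouts in seconds into the RNUM,WNUM values that
# smartctl -l scterc,RNUM,WNUM expects (tenths of a second).
# Helper names and the device path are hypothetical; nothing is executed.


def scterc_args(read_seconds: float, write_seconds: float) -> str:
    """Build the 'scterc,RNUM,WNUM' argument from timeouts given in seconds."""
    rnum = round(read_seconds * 10)
    wnum = round(write_seconds * 10)
    if rnum < 0 or wnum < 0:
        raise ValueError("timeouts must not be negative")
    return f"scterc,{rnum},{wnum}"


def smartctl_command(device: str, read_seconds: float, write_seconds: float) -> list:
    """Return the argument list for setting the error recovery timeouts on one drive."""
    return ["smartctl", "-l", scterc_args(read_seconds, write_seconds), device]


if __name__ == "__main__":
    # A 7 second limit for both reads and writes becomes scterc,70,70.
    print(" ".join(smartctl_command("/dev/sdX", 7, 7)))
```

Whether a drive accepts or remembers the setting depends on the drive; checking it again after a power cycle is probably wise.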
Hey, thanks for the reply. I can test it. Which values for scterc would you recommend? Thank you! |
Tobias, I'm also suffering from this issue at the moment. I've been talking to the blokes on IRC and was given values of 70,70 to try (smartctl -l scterc,70,70). At this stage I'm on my 5th resilver restart. I'll report back on array type and failure/resolution with this setting (I'm applying it to all disks, however) in the near future. |
Thank you. |
Ok. Again restarted. :( |
Changing the values to different numbers shouldn't affect whether it helps or not, unfortunately. |
So @rincebrain you would suggest value 0 anyway? Thank you. |
No, I still claim that having it set to nonzero values in a RAID array is a good idea, but if one set of nonzero values didn't work around this issue, a different set probably won't either. |
I've just been running into this for the last few days on an array. Killing zed and letting it try again. |
I just ran into a similar/the same? issue. System: This is just a very simple mirrored pool of 2 USB attached HDDs.
Disabling What might be interesting is how I got into the situation in the first place:
How I eventually solved the issue on my machine:
I am not sure if 4. and either 2. or 3. are really necessary. Possibly there is no need for stopping the zfs-zed service. Possibly only zeroing or creating a different partition table/fs on the disk would be sufficient. I am not sure if the My guess to what happened: zfs tries to find out how much has already been resilvered to the new drive, however mysteriously fails because of the manual export and re-import during the first resilvering attempt. Maybe related to USB as well? Possibly related Maybe this warrants a separate issue, as I had NO drive errors and eventually succeeded with the method mentioned above? Possibly at the Illumos bugtracker? There definitely seems to be some bug hidden in one of the components. |
This issue saved my bacon. Got stuck in the resilver loop for days on end, with a constant fear of a failure of the remaining disk in the mirror vdev. Stopped zfs-zed, disk resilvered in minutes. |
For me I needed to do 2 things - kill zfs-zed and also set Ultimately it looks like I'll need to extract all the remaining data and rebuild from backups though. |
Hey all, original poster (from 7 years ago!) here... I was running into this "resilver constantly restarting" after replacing a drive, which naturally kicked off a resilver. One of the symptoms was these messages spewing into my
This is with 0.8.0, and WITHOUT the zfs feature As soon as I enabled For others in this situation the options seem to be:
WARNING enabling any 0.8.0 features, including (This is a completely different cause to the original problem from 7 years ago, but the symptoms are the same and this issue seems to be the place people come when they hit this "resilver constantly restarting".) |
For what it's worth, I also just hit this issue today, after having upgraded to 0.8.1 with a faulted drive, because with 0.7 the resilver was going to take a very long time, so I wanted sequential resilver. Stopping zed didn't work, but upgrading the pool and enabling resilver_defer solved the problem. |
If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise() where the DTL can only be excised if its max txg is in the resilvered range.
Reviewed-by: Brian Behlendorf
Reviewed-by: John Gallagher
Reviewed-by: Kjeld Schouten
Signed-off-by: John Poduska
Issue #840
Closes #9155
Closes #9378
Closes #9551
Closes #9588
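The idea in the commit message above can be pictured with a toy model. This is not the OpenZFS code: the function name, the representation of the DTL (dirty time log) as a list of inclusive txg ranges, and the scan range parameters are all assumptions made purely for illustration.

```python
# Toy model only, NOT the OpenZFS implementation. It treats a device's DTL
# as a list of inclusive (first_txg, last_txg) ranges that the device missed,
# and a running resilver as covering the txg range scan_min_txg..scan_max_txg.
# All names here are hypothetical.

from typing import List, Tuple

TxgRange = Tuple[int, int]  # inclusive (first_txg, last_txg)


def needs_restart(dtl: List[TxgRange], scan_min_txg: int, scan_max_txg: int) -> bool:
    """Return True if some missed txg lies outside the range the running scan covers."""
    if not dtl:
        return False  # nothing missed, so nothing to resilver
    return any(first < scan_min_txg or last > scan_max_txg for first, last in dtl)


if __name__ == "__main__":
    # DTL fully inside the scan range: the current resilver already covers it.
    print(needs_restart([(100, 200)], 0, 250))
    # DTL extends past the scan range: the current resilver would miss txgs.
    print(needs_restart([(100, 300)], 0, 250))
```

In this picture, reopening or probing a device whose missed range is already covered would not be a reason to start the scan over.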
I'm currently experiencing this on a pool running under 0.8.2, with feature@resilver_defer enabled (the pool was on 0.8.2, with that feature enabled, when the resilver started). The original disk is no longer present, the replacement disk encountered errors during the first resilver run (which I cleared), and finished the second resilver run without any errors. It's currently in the third resilver. |
|
This is still going on in 0.8.3.
pool: tv
errors: No known data errors
I offlined the disks and stopped ZED to see if that will fix the resilver, which apparently has been running non-stop for >15 days. |
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions. |
I just found this thread. Thank you to everyone who has posted over the years. I'm running into problems with ZFS suffering errors and degrading disks in any zpool I create or run through my CK23601. It doesn't matter if the controller is an LSI 2008 or an LSI 2308, the same problem is occurring. I do happen to have a single WD EADS disk. I will remove it and see what impact it has. I've tried using all SAS or all SATA disks in the same zpool without success. The entire setup is: (2x) Norco 4224 to (2x) CK23601 to a single LSI 2308. The LSI 2308 (LSI 9206-16e) is hosted in a Dell 910. Any insight would be appreciated. Thank you. |
I believe it was the EADS drive. I pulled it out. At the same time I also moved all of the SAS into one enclosure and the SATA into another enclosure. The piece I still haven't been able to resolve - the ZFS filesystem I created on SAS disks connected directly to the controller generates errors when connected via SAS-expander. Thanks again for the above discussion. |
I used zpool replace ${pool} ${guid} ${dev} to replace one dead and one dying disk, which started a resilver. I've been monitoring the progress and have seen it apparently restart, e.g.:
In conjunction with the apparent restart there's a disk error reported in the kern.log:
...where sdbc is one of the drives being replaced.
This has happened at least 4 or 5 times since the replaces were originally done some 24 hours ago, and the 3 times I've actually watched it happen have all been with errors on sdbc (at different sectors).
On the other hand, there've been other disk errors, including (I think) on sdbc, that haven't caused the resilver to apparently restart.
Is this normal, and/or can I expect the resilver to actually finish, or is it stuck in a loop?
Oh, I've just realised I have a few dumps of the complete zpool status taken during all this, which show that, as far as the "in progress since..." message is concerned, the resilver really is restarting. E.g. from this morning vs now:
The complete current zpool status, with the /dev/mapper devices annotated with their matching /dev device: