"attempt to access beyond end of device" and devices failing #15932
The last two times I saw something like this, it was because:
It displaying a disk size of exactly 2^32 suggests something somewhere got confused about the disk size, and that it's below ZFS, since Linux is what's claiming the disk is only that big at this point, and ZFS doesn't do anything more intimate with disks than "send a discard request" or "write a partition table and wait for the OS to rescan it", really.

So I would suggest you investigate how on earth those disks are reporting being 2T actual size, and whether that matches the specs of the device, since if it's entirely made of 8 and 10T disks, you should not be able to attach a 2T disk no matter how confused ZFS got.

Also, since 0.7 hasn't been updated since 2019, I would strongly doubt anyone is going to look at any bug you find, even if it is in ZFS, unless you try it on a version like 2.1 or 2.2 that is still getting fixes, to confirm it's not something that is long-fixed.

To be clear, I don't think, at least with the information at hand, this is a bug in ZFS, since the disks themselves appear to be reporting being 2T, and if the disk says it's not big enough, ZFS can't do all that much about it. But separate from that, if you do end up finding a bug in ZFS somewhere along the way, that would be my expectation.
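(For reference, a few ways to cross-check what size the kernel and the drive itself report; `/dev/sdaa` below is just a placeholder for whichever disk looks wrong, and the tools assumed are util-linux, smartmontools and sg3_utils.)

```sh
# Size as the Linux block layer sees it, in bytes
lsblk -b -o NAME,SIZE,MODEL /dev/sdaa
blockdev --getsize64 /dev/sdaa

# Identity data reported by the drive itself
smartctl -i /dev/sdaa

# For SAS drives, the raw READ CAPACITY(16) response
sg_readcap --long /dev/sdaa
```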
Indeed, I've seen some discussions about USB disks and 2TB. But there's no USB involved here, and I think no one has physically touched any disks since 2021. But the idea that the disks entered some "special" mode looks like a very plausible explanation to me. And I still have a faint hope that power-cycling the system would bring them back to normal mode...

I agree that the ...

Any comments about my plan about "how to proceed with this", in particular?
I'd reboot now, since you're just going to have to resilver again once the disks come back anyway, really, and then make sure you force it to not defer resilvering some of them with a forcible restart.
OK, thanks for the suggestion! But...
BTW, all ST8000NM0185 drives have PT51 firmware. Dell provides an "Urgent" PT55 firmware update, dated 15 Mar 2022. They say:
I would say a cascading failure of disks can be caused by the extra load on the PSU due to more and more disk / disk controller activity. I would rather stop / upgrade the PSU / ddrescue the failing disks. Resilver state should persist across reboot, but I don't know that for sure.
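(For anyone following the ddrescue suggestion, a minimal sketch; the device names and the map file are placeholders, and the target must be at least as large as the source.)

```sh
# First pass: grab everything that reads cleanly, skip the slow scraping phase
ddrescue -f -n /dev/sdaa /dev/sdX sdaa.map

# Second pass: go back over the bad areas with a few retries, reusing the map
ddrescue -f -r3 /dev/sdaa /dev/sdX sdaa.map
```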
Resilver state persists across reboot, yes; it checkpoints every so often, depending on whether it's post or pre sequential scrub. That suggestion makes no sense in this context. That's like saying you should fix your failing disks by doing a dance around them - it has no relation to the problem at hand. It shouldn't cause any problems to upgrade, and if it does, it's a bug.
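(A quick way to see what the pool itself records about the scan; the pool name `tank` is a placeholder, and the `resilver_defer` feature only exists on 0.8+ pools.)

```sh
# Scan/resilver progress; this state is checkpointed and survives a reboot
zpool status -v tank

# Whether the deferred-resilver feature is present/active on this pool (0.8+)
zpool get feature@resilver_defer tank
```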
@IvanVolosyuk, thanks for your suggestion!
Yes, it would restart, but the entire point of my advice was TO FORCIBLY TRIGGER THE RESTART, since that feature otherwise would make you trigger a resilver again after it finished what it was doing, which is usually not desirable, and the authors of that feature really didn't design it well.
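(For completeness: on releases that ship the deferred-resilver feature, the restart can be forced explicitly; `tank` is a placeholder pool name.)

```sh
# Start a new resilver right away instead of deferring it to the end
# of the currently running one (ZFS 0.8 and later)
zpool resilver tank
```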
@rincebrain, I think my situation here is just the opposite:

Or is any of this (or all of this) complete nonsense?
No, you're not losing progress on resilvering by restarting. Even if the checkpoint were old, it still wrote the new data. Please start a discussion on the mailing list or the Discussions tab if you want to ask questions about how ZFS works; this is a bug report, and it does not appear to be a bug in ZFS.
OK, fair...
@i3v I'm not sure if you solved your problem, but please check what you have in the smartctl report for this drive. Maybe newer firmware for your hard disk is available. I had a different kind of issue on ST8000NM0185 drives with firmware PT51 (https://www.dell.com/support/home/pl-pl/drivers/driversdetails?driverid=6421f).
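(A minimal smartctl check for the firmware revision; the device name is a placeholder, and on SAS drives the field is usually labelled "Revision" rather than "Firmware Version".)

```sh
# Identity block: vendor, product, firmware revision, capacity
smartctl -i /dev/sdaa | grep -iE 'revision|firmware'
```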
So, to wrap this up:
I'm happy zfs proved to be able to survive this "real-life resiliency test".
System information
Describe the problem you're observing
I am looking at a system that has not been maintained since 2021 or so... And I'm seeing an endless wall of similar error messages (about ~1GB of them in my `/var/log/messages`). The messages are:
I know #7906 is an old thread about a very old zfs version that no one cares about anymore... But, at least, I would like to report that I'm getting similar messages on 0.7.9 too.
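(For completeness, the running module version can be confirmed like this; note that the `zfs version` subcommand only exists on 0.8 and later.)

```sh
# Version of the loaded kernel module
cat /sys/module/zfs/version

# Userland and kernel module versions (0.8+)
zfs version
```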
This is happening during the nightmarish resilvering:
Note that there's a mix of 8TB and 10TB drives in the pool (maybe this is somehow related to this issue as well).
Just before I started that resilvering, things were scary already, but not that bad: thus I "happily" started `zpool replace`, then added a few spares, then manually started two additional `zpool replace` (with `spare` disks this time).
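(The command sequence was presumably along these lines; the pool and device names here are made-up placeholders, not the real ones from this system.)

```sh
# Replace a failing disk with a fresh one
zpool replace tank old_disk new_disk

# Add hot spares to the pool
zpool add tank spare spare1 spare2

# Replace two more failing disks with the spares
zpool replace tank failing1 spare1
zpool replace tank failing2 spare2
```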
I'm not sure if those messages actually have the same origin as reported in #7906 (for 0.7.10), because there's something weird with these disks:

Note `sdaa` and `sdaq` are suddenly 2TB disks now, while `sdy` looks perfectly normal. They were all reporting `7.3T` before resilvering started. The only thing I found that resembles this "weird capacity" is this post. Interestingly, it is also related to zfs.
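(If a drive really has started advertising a smaller capacity, one low-risk thing to try, assuming a plain SAS/SATA path rather than a RAID controller remapping things, is asking the SCSI layer to re-read the capacity; `sdaa` is a placeholder.)

```sh
# Ask the kernel to re-read the device's capacity, then compare
echo 1 > /sys/block/sdaa/device/rescan
blockdev --getsize64 /dev/sdaa
```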
The `sdaa` is a completely different story - it just disappeared from the system at about 30% resilvering progress.

I was monitoring the process, and it looked like those `attempt to access beyond end of device` messages started a few minutes before `lsblk` started to show this weird `2TB`. Thus, I'm not actually sure what was the cause and what is just the effect here.

All these 3 disks (`sdaa`, `sdac`, `sdaq`) were struggling to read from about 18% of resilvering:

- `zpool iostat`: the reading speed frequently dropped to about 1 MB/s (but mostly it was about 50 MB/s)
- `iostat -x`: one of them had ~100% utilization, while all other disks were about idle

I guess that CRC errors for the 10TB disks could be resolved with this. But even without that, there's a ton of `nonmed` errors for many disks, which, AFAIU, means that there could be something wrong with cabling / firmware / etc. (that is, something that could be fixed).
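(The SAS error counters can be pulled with smartmontools; the device name is a placeholder, and the exact fields differ between SATA and SAS drives.)

```sh
# SCSI/SAS error counter log: corrected/uncorrected read, write, verify errors
smartctl -l error /dev/sdaa

# Full report; on SAS drives this typically includes the non-medium error count
smartctl -a /dev/sdaa
```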
Personal concerns:

- `zed` ... and prevented any user data access. There's no rush.
- There's still a hope that the failed disks (`sdaa`, `sdac`, `sdaq`) would come online, which would allow the subsequent zfs resilver to rescue the data.
- `sdaa` would be automatically removed from the pool once the resilvering is finished, so that zfs would just ignore it, even if it were perfectly working and contained data that could have been put to good use... I'm not sure if that would be the case, how to work around it if that happens, and if there's a possibility to not let that happen. Theoretically, AFAIU, the original device (if different from the replacement) will be removed from the pool, but I've also ...

Any help is greatly appreciated...
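(A sketch of how to keep an eye on the replace/detach behaviour while the resilver runs; `tank` is a placeholder. The old disk stays visible under a `replacing` vdev until the resilver completes.)

```sh
# Watch the replacing vdevs and resilver progress
zpool status -v tank

# Pool history with internal events; shows when devices were replaced/detached
zpool history -il tank | grep -iE 'replace|detach'
```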
Describe how to reproduce the problem
No idea.
Include any warning/errors/backtraces from the system logs
Some fragments are included above. I can provide more if anyone's interested.