I/O errors shortly after pool creation #2663
Two followup questions I probably should have asked originally:
Thanks! /a |
@alexisrosen Presumably you have |
Yes, I do. |
@alexisrosen Unfortunately, it's not clear to me which file/directory might have a problem. I'm concerned there might be a problem with the SA. Could you please find the inode number of the bad file or directory with |
Thanks, I'll do that as soon as I can (burning in another disk right now, might be a couple days). In the meantime, any idea about the fundamental question of what an I/O error means for ZFS? |
@alexisrosen I was trying to avoid answering that part of your question :) An EIO is supposed to mean the same thing it means for any other filesystem; for ZFS it typically means an I/O error from which ZFS's normal recovery mechanisms couldn't recover. Could you please also verify that the failing files were created with a version of zfsonlinux >= 0.6.3? I was going out on a limb by suggesting the line of debugging outlined above, but it would be nice to eliminate a bad (SA) xattr as the culprit here. |
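(As an aside: the installed version can usually be confirmed with something along these lines; the package names below are the Ubuntu PPA ones and are assumptions, not taken from this report.)

```sh
# Version of the loaded kernel modules
cat /sys/module/zfs/version /sys/module/spl/version
# Versions of the packaged userland tools from the PPA
dpkg -l | grep -E '^ii +(zfs|spl|ubuntu-zfs)'
```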
It's definitely >=0.6.3 - it was installed on Ubuntu from the PPA about a week ago. Metapackage is "8 Beyond your suggestion (which I will definitely do, almost certainly on Tuesday), is there a way to get a raw block list for a file or directory? I should be able to probe those blocks (for example with dd) to provoke a kernel IO error. Hm... What happens when there is an unrecoverable error and the disk remaps the block? Presumably it returns all zeroes or garbage. Either way, that should fail the ZFS checksum and provoke... what behavior exactly? If there's no recoverable block for a file and you try to read it, what does ZFS do? Are directories stored multiple times, or are they not that kind of metadata? |
@alexisrosen The zdb command I mentioned above will try to read the directory in a similar manner as ZFS itself does. If you add an extra pair of "-d" options (-dddddd), it will display something like this in addition to the rest of its output:
The colon-separated tuples (DVAs) tell where the data is located. This example is from a pool with no redundancy and, as you can see, there are still 2 copies of it because it's considered to be metadata (the first field of the tuple is the top-level vdev on which the data are stored). Zdb can dump the contents of a given DVA with the "-R" option. During normal operation, ZFS will pick one of the DVAs to read and, if it fails for some reason (checksum error, etc.), it will try another one. |
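For concreteness, a "-R" invocation looks roughly like the sketch below; the pool name and DVA here are placeholders, not values from this pool.

```sh
# Dump the block behind one DVA (vdev:offset:size); an optional trailing
# flag such as :d asks zdb to decompress the block before printing it
zdb -R tank 0:2d8000:200:d
```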
Well, this is not encouraging. :-( I just plugged the zfs disk back in. "zfs mount -a" did nothing, so I did "zpool list" to see if ZFS could see the disk. This hung - I had to open a new window to get a new shell, as the process is stuck in diskwait:
Again, syslog suggests that there isn't anything wrong with the disk itself - instead, I see this (and nothing else!): Unfortunately, I don't know if that message was provoked by the "zfs mount", the "zpool list", or neither. What's up with the empty line after the SPLErr? Anyway, something seems extraordinarily fishy here. Why would the zpool be stuck in diskwait? And what damaged my zpool anyway? It was fine before I removed the disk. I might believe that this could be caused by removing the disk while it was mounted, but it was definitely unmounted before removal. And anyway, with ZFS, it shouldn't matter - I should be able to yank it at any time and the on-disk state should stay consistent. In general, is there anything I should do before removing removable media with zpools on them, beyond the simple unmount of the filesystem(s) they contain? BTW, further commands like "zpool status" also hang in diskwait. I am able to dd the first and last 500MB of the disk to /dev/null without error or delay. So... now what? |
@alexisrosen The "I/O failure" message indicates there is very likely something wrong with the disk (or I/O path). ZFS was getting I/O errors while trying to access it and there wasn't sufficient redundancy to continue operation so the pool was suspended. Unfortunately, there's a bug here which can prevent other Are you able to read the entire disk without error using something like |
So after a little thought, I looked for other processes in diskwait:
This made me wonder if it had perhaps gotten confused even before I removed the disk. So I rebooted, and wonder of wonders, the pool came up fine. I wonder what caused the txg_sync thread to hang, but in the meantime, after I put my kid to bed I'll do the debugging Tim asked for. And in the meantime I'll dd the entire disk to /dev/null (though I think that's likely a wild goose chase since there are no block errors logged in syslog). |
@alexisrosen FYI, you can |
Tim, here's what you asked for:
In syslog, I noted this, which happened during a dd of the whole disk, and not during the zdb:
I am not 100% certain how to read that, but my naive read on it is a short-term transport failure, not a disk error, which makes sense since it's sitting in an eSATA disk dock. The total cable length is probably outside the SATA standard, so the occasional transport error is no surprise. More in a few minutes. |
Here's the dir itself:
The rest of the block is all zeroes. All 4k was printed out. Looking at the first zdb output... looks like I get inode numbers for each file! So... Using zdb -dddddd and zdb -R, I was able to read each block, which, IIUC, is the inode. No IO errors reading the inodes. The last of them (#60099) starts like this:
This file appears to have no name. Is that because it's the ACL for the directory? Meanwhile, no further IO errors while I'm doing this. Is there a zdb or other command that will let me dump the actual data blocks of the files in question? And if none of these commands are generating errors, what does? |
One more thought. I don't know how likely this is to be coincidence or not, but here are the four problem dirs:
All four problem dirs are ".AppleDouble". (The ones named AD were renamed by me to make it easier to poke at them.) They were all created (on their original ext4 FS) by netatalk, and they're fine there. I was also able to recopy them into the zfs FS, with the result showing no errors, with xattrs/ACLs coming across correctly. The recopy was done with "cp -pr", while the original was done with rsync. However rsync copied over other trees with xattrs and ACLs without errors. So all in all I have no idea what to make of this data point, but I thought it was worth mentioning. |
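One way to diff the metadata on the two copies (all paths here are placeholders) might be:

```sh
# Compare xattrs, using relative paths so the two listings line up
(cd /mnt/ext4/parent && getfattr -d -m - -R .AppleDouble) > /tmp/src.xattr
(cd /tank/fs/parent  && getfattr -d -m - -R .AppleDouble) > /tmp/dst.xattr
diff /tmp/src.xattr /tmp/dst.xattr

# Compare POSIX ACLs the same way
(cd /mnt/ext4/parent && getfacl -R .AppleDouble) > /tmp/src.acl
(cd /tank/fs/parent  && getfacl -R .AppleDouble) > /tmp/dst.acl
diff /tmp/src.acl /tmp/dst.acl
```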
@alexisrosen The zdb output above doesn't show any xattrs on 60077. I find this surprising and it's likely wrong. The key piece of information in that output is SPILL_BLKPTR. Most of our issues regarding xattr SAs seem to have revolved around extension into spill blocks. Could you please strace the ls command on that directory to see which syscall is getting the IO error? Maybe just run "stat" on it to see what happens. It'd be interesting to know the exact series of system calls used when this directory was created. It sounds like it was created by rsync. Maybe you could try rsyncing just this directory from its original source on an ext4 file system (as produced by netatalk), see if you can reduce this to a simpler test case, and then strace the rsync once you can reproduce it. I've got a number of patches to zdb to help debug this type of problem. If you can't reliably reproduce this, then those might be useful. |
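A rough sketch of that tracing, with placeholder paths, might look like:

```sh
# Trace the metadata-related syscalls ls makes, to see which one returns EIO
strace -f -e trace=stat,lstat,open,openat,getdents,getxattr,lgetxattr,llistxattr \
    ls -l bad 2>&1 | tail -n 40

# And trace the rsync that recreates the directory from the ext4 source
strace -f -o /tmp/rsync.trace rsync -aAX /mnt/ext4/src/.AppleDouble /tank/fs/dst/
```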
Argh. Just now I tried to remount the FS to do the things you asked, and the mount hung again. Sure enough, just like last time, the thread "txg_sync" was stuck in diskwait. That despite my having unmounted all the filesystems before turning off that drive. What is going on here? Is this a known defect? Rebooting now... |
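For the next occurrence, the stack of a hung kernel thread like txg_sync can usually be captured without rebooting, for example (assuming a single pool, hence a single txg_sync thread, and root privileges):

```sh
# Kernel stack of the stuck thread
cat /proc/$(pgrep txg_sync)/stack

# Or, if sysrq is enabled, dump every blocked (D-state) task to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 100
```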
@alexisrosen If you're having problems importing or mounting, you might try importing in read-only mode with "zpool import -o readonly=on". Meanwhile, there are three pieces of zdb output I'd like to see. First, the SA registration object. It's typically in object 5 and will look like this:
If it's not object 5, you can fish around low-numbered objects with zdb until you find it. I'm interested in the "ZPL_..." set of lines at the end. This list shows the system attributes in use throughout the specified file system. Next, the SA layouts, which are typically in object 6:
Again, if it's not object 6, it should be in one of the low-numbered ones. We want to see the lines at the end of the form "2=..." as shown above. This list shows all of the combinations and ordering of system attributes used throughout the specified file system. Finally, I'd like to see the SA layout used for one of the broken files/directories such as 60077 above. Unfortunately, zdb doesn't print this information. You can compile my zdb branch as of dweeezil/zfs@495b189 (in https://github.com/dweeezil/zfs/tree/zdb) and run its zdb right out of the build directory as
The SA header size and layout are printed right after "dnode maxblkid". By combining those 3 pieces of information, we'll get a better read on the contents of the SA. AFAIK, there are no known bugs in this area at the moment. The debugging I've outlined above will allow us to tell whether there is still a problem lurking or whether the problem is more likely caused by something else on your system. |
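Putting those requests together, the zdb invocations would be roughly as follows; the dataset name is a placeholder and the object numbers may differ on other pools:

```sh
# SA attribute registration (usually object 5)
zdb -dddd tank/fs 5
# SA attribute layouts (usually object 6)
zdb -dddd tank/fs 6
# Full dnode detail for the problem directory, including SPILL_BLKPTR and,
# with the patched zdb, the SA header size and layout
zdb -dddddd tank/fs 60077
```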
First of all - no problem mounting the FS... as long as I reboot to clear the diskwait, which is most likely due to the txg_sync kernel thread being hung in diskwait. I didn't remember to get a stack trace before rebooting, sorry - next time I will. (And there will likely be a next time; this looks reproducible. To repeat, is this a known bug?) Anyway, there aren't any xattrs on 60077's analog on the source disk, so the zdb output saying that 60077 doesn't have any should also be correct. Here's the trace on "ls -l bad":
Wow, how totally unhelpful. That return val is not special to getxattr, and the man page implies that it comes from stat. Stat just says that EFAULT means... "Bad address". WTF? The stat itself returned 0, so what is going on? As for reproducing the error... I already tried rsyncing just that subtree. Works fine, no errors. |
I'm out of time, unfortunately, but will try to get you the other info you asked for tomorrow. |
@alexisrosen It's not clear to me exactly which steps you're doing that are causing the hung sync task. Going out on a limb, here are a few possibilities: I got the impression that something regarding the pool and/or device configuration may have changed; if As to the error on |
I wondered what an strace of "cd bad ; ls AD" would show. It was this:
Hm. Totally uninformative. EIO isn't even documented as a possible return code for openat() or open(), and again, there are NO messages in syslog (or dmesg) showing a block error. So where the heck is that IO error coming from? Shouldn't that be the focus? |
About the hung txg_sync: Nothing ever changes for the zpool... except that the underlying device (/dev/disk/by-id/whatever) isn't always there (since sometimes that dock is used for a different drive). (And of course, this isn't a final configuration, which will have mirrored disks that are never removed.) The hang looks like a bug to me, but may be driven by a lack of understanding on my part of the expected behavior. This is what I do:
At this point, the zpool command gets stuck in diskwait, I look around with ps, and find txg_sync in diskwait as well. Only a reboot clears this (and one out of two times, the system couldn't reboot, stuck with some daemons already down and some still up. Annoying...). But... If I don't do step 4 (wait a while), and do "zpool status" only a few minutes after turning off the disk, not only doesn't it hang, but it says (in part):
...which is clearly not the case, and must be the result of using cached info since the disk is off. Now, I would argue that hanging in diskwait is a bug even if I'm not following the rules, but... am I following the rules? Am I supposed to export that pool before removing (or turning off) the media? Or do something else so zfs doesn't hang trying to read it? |
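For what it's worth, the usual sequence for pools on removable media is an explicit export before power-off and an import afterwards, e.g. (pool name is a placeholder):

```sh
# Before powering off or unplugging the disk
zpool export tank

# After reattaching it
zpool import tank
# or, if the device path has changed, point the import at the device directory
zpool import -d /dev/disk/by-id tank
```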
Guys - sorry I disappeared on you. Some major real-life events forced me to put this whole project on hold, and I'm not finished yet, so the disk has just been sitting around waiting for me to find time for debugging. I'm glad to see that you think you've got this resolved, and I hope to be able to get back to working on this in a few weeks. I used the Ubuntu PPA to set up ZFS originally; what do I need to do if I want a version that picks up this fix? Thanks! |
If a spill block's dbuf hasn't yet been written when a spill block is freed, the unwritten version will still be written. This patch handles the case in which a spill block's dbuf is freed and undirties it to prevent it from being written.

The most common case in which this could happen is when xattr=sa is being used and a long xattr is immediately replaced by a short xattr, as in:

setfattr -n user.test -v very_very_very..._long_value <file>
setfattr -n user.test -v short_value <file>

The first value must be sufficiently long that a spill block is generated and the second value must be short enough to not require a spill block. In practice, this would typically happen due to internal xattr operations as a result of setting acltype=posixacl.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

Closes #2663
Closes #2700
Closes #2701
Closes #2717
Closes #2863
Closes #2884
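Based on that commit message, a reproduction sketch would be something like the following; the dataset and file names are hypothetical:

```sh
# Dataset using SA-based xattrs, as in the bug description
zfs create -o xattr=sa -o acltype=posixacl tank/satest
touch /tank/satest/file

# Long value: large enough that the SA overflows into a spill block
setfattr -n user.test -v "$(printf 'x%.0s' $(seq 1 2000))" /tank/satest/file

# Short value: small enough that the spill block is freed again
setfattr -n user.test -v short_value /tank/satest/file
```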
[Edited: fixed various markdown-related screwups, sorry.]
I just installed zfsonlinux on Ubuntu 14.04.1 using the PPA. On a new 4TB disk just tested clean with badblocks, I created a new pool and some filesystems. With a combination of "cp -pr" and rsync, I copied over about 1.5TB of data. When I then ran rsync on the entire hierarchy to make sure everything had been copied properly, it threw a bunch of errors in the one directory tree that I was using with ACLs; the errors looked like this:
rsync: get_acl: sys_acl_get_file(path/to/dir, ACL_TYPE_ACCESS): Input/output error (5)
rsync: set_acl: sys_acl_set_file(path/to/dir, ACL_TYPE_ACCESS): Bad address (14)
rsync: get_xattr_names: llistxattr(""/path/to/file"",1024) failed: Bad address (14)
rsync: recv_generator: failed to stat "/path/to/file": Input/output error (5)
At first I thought that there was some problem with ACLs on files moved from ext4 to zfs, but that's not the case - re-rsyncing that one tree to a new destination in the zfs FS worked fine, with xattrs and ACLs coming across correctly.
Trying to delete the "bad" tree, however, I discovered that it really was bad. "rm -r" failed to remove everything, and I was unable to rmdir two directories:
# rmdir bad/.AppleDouble
rmdir: failed to remove ‘bad/.AppleDouble’: Directory not empty
# ls bad/.AppleDouble/
ls: cannot open directory bad/.AppleDouble/: Input/output error
# mv bad/.AppleDouble bad/AD
# ls -l bad
ls: bad/AD: Bad address
total 17K
drwxr-sr-x 2 alexis alexis 9 Aug 6 19:22 AD
I was able to "mv bad/.AppleDouble bad/AD" but not move it to a different directory.
I see no errors, or messages of any sort, in /var/log/syslog for the underlying device (/dev/sdc).
The last of the four rsync errors above (and many similar ones) suggests that this error happened after the rsync started, since rsync had already built the list of files in that dir... if I'm reading its output right.
So... what the heck happened here? Is there any way to tell, or even approximate an answer? How can I prevent this from happening again?
Some other misc. info:
I've started a scrub, to see what happens... it should be done in 3 hours. Edit: Scrub completed, no errors.
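(For reference, the scrub and its check amount to something like this, with a placeholder pool name:)

```sh
zpool scrub tank       # start the scrub in the background
zpool status -v tank   # shows progress and, once finished, any errors found
```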
Thanks,
/a