I/O errors shortly after pool creation #2663

Closed
alexisrosen opened this issue Sep 5, 2014 · 25 comments

@alexisrosen

[Edited: fixed various markdown-related screwups, sorry.]

I just installed zfsonlinux on Ubuntu 14.04.1 using the PPA. On a new 4TB disk that had just tested clean with badblocks, I created a new pool and some filesystems. With a combination of "cp -pr" and rsync, I copied over about 1.5TB of data. When I then ran rsync over the entire hierarchy to verify that everything had been copied properly, it threw a bunch of errors in the one directory tree I was using with ACLs, like these:
rsync: get_acl: sys_acl_get_file(path/to/dir, ACL_TYPE_ACCESS): Input/output error (5)
rsync: set_acl: sys_acl_set_file(path/to/dir, ACL_TYPE_ACCESS): Bad address (14)
rsync: get_xattr_names: llistxattr("/path/to/file",1024) failed: Bad address (14)
rsync: recv_generator: failed to stat "/path/to/file": Input/output error (5)
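
For reference, an ACL- and xattr-preserving copy would be roughly of this form (paths here are placeholders, not my exact command line):

rsync -aAXv /mnt/ext4/source/ /z/shared/dest/    # -A preserves ACLs, -X preserves xattrs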

At first I thought that there was some problem with ACLs on files moved from ext4 to zfs, but that's not the case - re-rsyncing that one tree to a new destination in the zfs FS worked fine, with xattrs and ACLs coming across correctly.

Trying to delete the "bad" tree, however, I discovered that it really was bad. "rm -r" failed to remove everything, and I was unable to rmdir two directories:
# rmdir bad/.AppleDouble
rmdir: failed to remove ‘bad/.AppleDouble’: Directory not empty
# ls bad/.AppleDouble/
ls: cannot open directory bad/.AppleDouble/: Input/output error
# mv bad/.AppleDouble bad/AD
# ls -l bad
ls: bad/AD: Bad address
total 17K
drwxr-sr-x 2 alexis alexis 9 Aug 6 19:22 AD

I was able to "mv bad/.AppleDouble bad/AD" but not move it to a different directory.

I see no errors, or messages of any sort, in /var/log/syslog for the underlying device (/dev/sdc).

The last of the four rsync errors above (and many similar ones) suggests that the error happened after the rsync started, since rsync had already built the file list for that directory... if I'm reading its output right.

So... what the heck happened here? Is there any way to tell, or even approximate an answer? How can I prevent this from happening again?

Some other misc. info:

  • I have a second disk intended as a mirror; it's currently in burn-in, which is why the problem pool isn't mirrored yet.
  • I can risk destroying the problem disk.
  • This server has only 2GB RAM. The ARC cache is limited to 400MB.
  • The two problem dirs are both ".AppleDouble" directories, but the netatalk daemon was idle and has not touched the zfs FS since its creation, so it seems highly unlikely to be involved.

I've started a scrub, to see what happens... it should be done in 3 hours. Edit: Scrub completed, no errors.

Thanks,
/a

@alexisrosen
Author

Two followup questions I probably should have asked originally:

  1. If this is a directory-reading error, shouldn't there be multiple copies of that data, even within a single disk? I recall reading that metadata is backed up even in non-mirrored datasets. Or don't dirs count as metadata?
  2. Assume the following datasets exist: /vol, /vol/a, /vol/b, /vol/a/x, /vol/a/y, /vol/a/z. My I/O errors are all happening in /vol/a. I could "zfs rename" the child datasets out of /vol/a, then "zfs destroy" /vol/a, recreate it, recopy its data, and move the children back into place (roughly as sketched below). The big question: is this likely to be a good move, or will whatever damage or corruption currently exists just spread with further manipulation of the pool?
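
Roughly the sequence I have in mind, assuming the pool is named "vol" and dataset names mirror the mountpoints above:

zfs rename vol/a/x vol/a-x        # and likewise for y and z
zfs destroy vol/a
zfs create vol/a
# ...recopy /vol/a's own data from the source...
zfs rename vol/a-x vol/a/x        # and likewise for y and z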

Thanks!

/a

@dweeezil
Contributor

dweeezil commented Sep 7, 2014

@alexisrosen Presumably you have acltype=posixacl set. Do you also have xattr=sa?
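
For example, something like this (substituting your pool/filesystem) will show both properties:

# zfs get acltype,xattr <pool>/<fs>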

@alexisrosen
Author

Yes, I do.
I really want to understand what's going on here. On a traditional FS, "I/O error" would (assuming the transport was OK) mean that the disk I/O failed, and there would be a message from the kernel in syslog complaining about the block it was trying to read. But I don't see anything in syslog when I get the I/O error from ZFS.

@dweeezil
Contributor

dweeezil commented Sep 7, 2014

@alexisrosen Unfortunately, it's not clear to me which file/directory might have a problem. I'm concerned there might be a problem with the SA. Could you please find the inode number of the bad file or directory with ls -di <file or directory> and then run zdb -dddd <pool>/<fs> <inum> (where <pool> is the pool on which the file or directory resides and <fs> is the filesystem within that pool) and post the results.
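
Concretely, the two steps would look something like this (names are placeholders):

# ls -di /path/to/bad/directory
<inum> /path/to/bad/directory
# zdb -dddd <pool>/<fs> <inum>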

@alexisrosen
Author

Thanks, I'll do that as soon as I can (burning in another disk right now, might be a couple days). In the meantime, any idea about the fundamental question of what an I/O error means for ZFS?

@dweeezil
Contributor

dweeezil commented Sep 7, 2014

@alexisrosen I was trying to avoid answering that part of your question :) An EIO is supposed to mean the same thing it means for any other filesystem; for ZFS, it would typically mean an I/O error from which the normal ZFS recovery mechanisms couldn't recover. Could you please also verify that the failing files were created with a version of zfsonlinux >= 0.6.3? I was going out on a limb by suggesting the line of debugging outlined above, but it would be nice to eliminate a bad (SA) xattr as the culprit here.

@alexisrosen
Author

It's definitely >=0.6.3 - it was installed on Ubuntu from the PPA about a week ago. Metapackage is "8trusty", and the individual packages are all "0.6.3-2trusty". All pools, FSes, and files were created after that.

Beyond your suggestion (which I will definitely do, almost certainly on Tuesday), is there a way to get a raw block list for a file or directory? I should be able to probe those blocks (for example with dd) to provoke a kernel IO error.
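
(For example, given a sector offset from some block-mapping tool, I'd expect a probe along these lines, with the offset a placeholder:)

dd if=/dev/sdc of=/dev/null bs=512 skip=<sector> count=256 iflag=direct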

Hm... What happens when there is an unrecoverable error and the disk remaps the block? Presumably, it returns all zeroes or garbage. Either way, that should fail the ZFS checksum and provoke... what behavior exactly? If there's no recoverable copy of a block for a file and you try to read it, what does ZFS do?

Are directories stored multiple times or are they not that kind of metadata?

@dweeezil
Contributor

dweeezil commented Sep 8, 2014

@alexisrosen The zdb command I mentioned above will try to read the directory in much the same manner as ZFS itself does. If you add an extra pair of "-d" options (-dddddd), it will display something like this in addition to the rest of its output:

Indirect blocks:
               0 L0 0:1a000:c00 0:3c01a000:c00 3200L/c00P F=1 B=15/15

The colon-separated tuples (DVAs) tell where the data is located. This example is from a pool with no redundancy and, as you can see, there are still 2 copies of it because it's considered to be metadata (the first field of the tuple is the top-level vdev on which the data are stored). Zdb can dump the contents given a DVA with the "-R" option. For example, zdb -R tank 0:3c01a000:c00:d would dump the second copy in this case (the extra ":d" tells it to decompress the data; metadata is typically compressed).

During normal operation, ZFS will pick one of the DVAs to read and if it fails for some reason (checksum, etc.) it will try another one.

@alexisrosen
Author

Well, this is not encouraging. :-(

I just plugged the zfs disk back in. "zfs mount -a" did nothing, so I did "zpool list" to see if ZFS could see the disk. This hung - I had to open a new window to get a new shell, as the process is stuck in diskwait:

mserv-root ~ <15> ps up27085
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     27085  0.0  0.0  30772  1328 pts/2    D+   20:00   0:00 zpool list

Again, syslog suggests that there isn't anything wrong with the disk itself - instead, I see this (and nothing else!):
Sep 9 20:00:15 mserv kernel: [519679.685578] SPLError: 373:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp' has encountered an uncorrectable I/O failure and has been suspended.
Sep 9 20:00:15 mserv kernel: [519679.685578]

Unfortunately, I don't know if that message was provoked by the "zfs mount", the "zpool list", or neither. What's up with the empty line after the SPLErr?

Anyway, something seems extraordinarily fishy here. Why would the zpool be stuck in diskwait? And what damaged my zpool anyway? It was fine before I removed the disk. I might believe that this could be caused by removing the disk while it was mounted, but it was definitely unmounted before removal. And anyway, with ZFS, it shouldn't matter - I should be able to yank it at any time and state on disk should be consistent.

In general, is there anything I should do before removing removable media with zpools on them, beyond the simple unmount of the filesystem(s) they contain?

BTW, further commands like "zpool status" also hang in diskwait. I am able to dd the first and last 500MB of the disk to /dev/null without error or delay.
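
(A sketch of reads like those, with the tail offset computed from the disk size:)

dd if=/dev/sdc of=/dev/null bs=1M count=500
dd if=/dev/sdc of=/dev/null bs=1M skip=$(( $(blockdev --getsize64 /dev/sdc) / 1048576 - 500 ))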

So... now what?

@behlendorf
Contributor

@alexisrosen The "I/O failure" message indicates there is very likely something wrong with the disk (or I/O path). ZFS was getting I/O errors while trying to access it and there wasn't sufficient redundancy to continue operation so the pool was suspended. Unfortunately, there's a bug here which can prevent other zfs and zpool commands from working while the pool is suspended.

Are you able to read the entire disk without error using something like dd? Based on the console error ZFS was definitely seeing errors while trying to access it for some reason.

@alexisrosen
Author

So after a little thought, I looked for other processes in diskwait:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       451  0.2  0.0      0     0 ?        D<   Sep03  17:59 [txg_sync]
root     10690  0.0  0.0  30884   676 ?        D    Sep05   0:00 zfs snapshot -o com.sun:auto-snapshot-desc - -r zp zfs-auto-snap_daily-2014-09-05-1054

This made me wonder if it had perhaps gotten confused even before I removed the disk. So I rebooted, and wonder of wonders, the pool came up fine. I wonder what caused the txg_sync thread to hang, but after I put my kid to bed I'll do the debugging Tim asked for. In the meantime I'll dd the entire disk to /dev/null (though I suspect that's a wild goose chase, since there are no block errors logged in syslog).

@dweeezil
Contributor

@alexisrosen FYI, you can cat /proc/<pid>/stack to get stack traces for those blocked processes. That might be useful information if you encounter them again.
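
For the two blocked processes in your ps output above — txg_sync (451) and the hung zfs snapshot (10690) — that would be, for example:

# cat /proc/451/stack
# cat /proc/10690/stack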

@alexisrosen
Author

Tim, here's what you asked for:

mserv-root /z/shared <2> ls -l bad
ls: bad/AD: Bad address
total 17K
drwxr-sr-x 2 alexis alexis 9 Aug  6 19:22 AD
mserv-root /z/shared <3> cd bad
mserv-root /z/shared/bad <4> getfacl AD 
getfacl: AD: Input/output error
mserv-root /z/shared/bad <5> getfattr AD
getfattr: AD: Bad address
mserv-root /z/shared/bad <6> ls AD 
ls: cannot open directory AD: Input/output error
mserv-root /z/shared/bad <7> lld -i AD
ls: AD: Bad address
60077 drwxr-sr-x 2 alexis alexis 9 Aug  6 19:22 AD
mserv-root /z/shared/bad <8> zdb -dddd zp/shared 60077   
Dataset zp/shared [ZPL], ID 55, cr_txg 462, 391G, 59943 objects, rootbp DVA[0]=<0:1800007d000:1000> DVA[1]=<0:220000b2000:1000> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=52365L/52365P fill=59943 cksum=1a9ea44af2:8fc922b046c:1a066359a89bd:356ac3ffc60e9b

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     60077    1    16K     1K    16K     1K  100.00  ZFS directory
                                        196   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED SPILL_BLKPTR
    dnode maxblkid: 0
    path    /bad/AD
    uid     1000
    gid     1000
    atime   Thu Sep  4 19:21:09 2014
    mtime   Wed Aug  6 19:22:31 2014
    ctime   Fri Sep  5 01:56:38 2014
    crtime  Thu Sep  4 19:21:08 2014
    gen 42733
    mode    42755
    size    9
    parent  23
    links   2
    pflags  40800000044
    microzap: 1024 bytes, 7 entries

        2014 Rosen E-file Auth-signed.pdf = 60102 (type: Regular File)
        2013 Taxes summary.pdf = 60101 (type: Regular File)
        Fax Cover.doc = 60104 (type: Regular File)
        TWC bills.pdf = 60105 (type: Regular File)
        .Parent = 60100 (type: Regular File)
        Con Ed bills.pdf = 60103 (type: Regular File)
         ACL = 60099 (type: Regular File)

In syslog, I noted this, which happened during a dd of the whole disk, and not during the zdb:

Sep 10 01:16:01 mserv kernel: [14704.726436] ata3.00: exception Emask 0x10 SAct 0x600 SErr 0x1800000 action 0x6 frozen
Sep 10 01:16:01 mserv kernel: [14704.726459] ata3.00: irq_stat 0x08000008, interface fatal error
Sep 10 01:16:01 mserv kernel: [14704.726474] ata3: SError: { LinkSeq TrStaTrns }
Sep 10 01:16:01 mserv kernel: [14704.726488] ata3.00: failed command: READ FPDMA QUEUED
Sep 10 01:16:01 mserv kernel: [14704.726507] ata3.00: cmd 60/00:48:00:49:68/01:00:dc:00:00/40 tag 9 ncq 131072 in
Sep 10 01:16:01 mserv kernel: [14704.726507]          res 40/00:48:00:49:68/00:00:dc:00:00/40 Emask 0x10 (ATA bus error)
Sep 10 01:16:01 mserv kernel: [14704.726530] ata3.00: status: { DRDY }
Sep 10 01:16:01 mserv kernel: [14704.726541] ata3.00: failed command: READ FPDMA QUEUED
Sep 10 01:16:01 mserv kernel: [14704.726560] ata3.00: cmd 60/00:50:00:4a:68/01:00:dc:00:00/40 tag 10 ncq 131072 in
Sep 10 01:16:01 mserv kernel: [14704.726560]          res 40/00:48:00:49:68/00:00:dc:00:00/40 Emask 0x10 (ATA bus error)
Sep 10 01:16:01 mserv kernel: [14704.726582] ata3.00: status: { DRDY }
Sep 10 01:16:01 mserv kernel: [14704.726597] ata3: hard resetting link
Sep 10 01:16:02 mserv kernel: [14705.740045] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 10 01:16:02 mserv kernel: [14705.741342] ata3.00: configured for UDMA/133
Sep 10 01:16:02 mserv kernel: [14705.741375] ata3: EH complete

I am not 100% certain how to read that, but my naive reading is a short-term transport failure, not a disk error, which makes sense since the drive is sitting in an eSATA disk dock. The total cable length is probably outside the SATA standard, so the occasional transport error is no surprise.

More in a few minutes.

@alexisrosen
Author

Here's the dir itself:

mserv-root /z/shared/bad <11> zdb -R zp 0:16f0f86f000:1000  
Found vdev: /dev/disk/by-id/wwn-0x5000c50074929fa3-part1

0:16f0f86f000:1000
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  c043800108000304  eac601c000525b45  ......C.E[R.....
000010:  3200000608803508  00736f5220343130  .5.....2014 Ros.
000020:  6c69662d45206e65  2d68747541206500  en E-fil.e Auth-
000030:  2e64656e67690073  0620270c66166470  s.igned.pd.f.' .
000040:  4c164341204028c3  464028c80b801d20  .(@ AC.L ....(@F
000050:  65766f4320007861  3e80636f44642e72  ax. Cover.dDoc.>
000060:  50c02e4028c40000  0c3a88746e657261  ...(@..Parent.:.
000070:  006f434028c71225  6c6c6970622064f9  %..(@Co..d bpill
000080:  c50620350cef4873  5420334001402803  sH..5 ...(@.@3 T
000090:  7573207300657861  468c792e72616d6d  axe.s summar.y.F
0000a0:  548028c90bb05120  fc6cb0bda443fe57   Q...(.TW.C...l.
0000b0:  fc42fc42fc42fc2f  0000000042fc0142  /.B.B.B.B..B....
0000c0:  0000000000000000  0000000000000000  ................

The rest of the block is all zeroes. All 4k was printed out.

Looking at the first zdb output... it looks like I get inode numbers for each file! So... using zdb -dddddd and zdb -R, I was able to read each of those blocks, which, IIUC, hold the inodes. No I/O errors reading the inodes. The last of them (#60099) starts like this:

mserv-root /z/shared/bad <29> zdb -R zp 0:16f0f8e0000:1000         
Found vdev: /dev/disk/by-id/wwn-0x5000c50074929fa3-part1

0:16f0f8e0000:1000
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  1605007ec8000000  0d30000100020007  ....~.........0.
000010:  1be5022300180000  0cb6000000035300  ....#....S......
000020:  0cb5010000045300  7d020000080ef200  .S.............}
000030:  0900000010000000  200000008d020000  ............... 
000040:  c10200000b000000  0200000d00540020  ........ .T.....
000050:  b10200000e520030  0200000f00600044  0.R.....D.`.....
000060:  5645448080000cad  80810030c9020000  .....DEV....0...
000070:  0cd10200004f4e49  0200004e59537200  INO......rSYN...
000080:  00007e5660000cd9  0f007c030030e102  ...`V~....0..|..
000090:  123d1b40aeff0002  0f000c80130004d5  ....@.=.........
0000a0:  2411f2043f0e01d0  31001a0807220500  ...?...$.."....1
0000b0:  5bea41000916a0c7  50f2043100094c4b  .....A.[KL..1..P
0000c0:  0050fe00020f0008  0000000000000000  ......P.........
0000d0:  0000000000000000  0000000000000000  ................

This file appears to have no name. Is that because it's the ACL for the directory?

Meanwhile, no further IO errors while I'm doing this. Is there a zdb or other command that will let me dump the actual data blocks of the files in question? And if none of these commands are generating errors, what does?

@alexisrosen
Author

One more thought. I don't know how likely this is to be coincidence or not, but here are the four problem dirs:

mserv-root /z/shared <24> ll -Ra bad*
bad:
ls: bad/AD: Bad address
total 42K
drwxrws--x+  3 alexis   1004  3 Sep  5 01:56 .
drwxrwxrwt  24 root   video  28 Sep  9 21:24 ..
drwxr-sr-x   2 alexis alexis  9 Aug  6 19:22 AD
ls: cannot open directory bad/AD: Input/output error

bad1:
ls: bad1/AD: Bad address
total 74K
drwxrws--x+  3 alexis  1004  3 Sep  5 01:57 .
drwxrwxrwt  24 root   video 28 Sep  9 21:24 ..
drwxr-sr-x   2 alexis  1004  7 Jun 21 02:38 AD
ls: cannot open directory bad1/AD: Input/output error

bad2:
total 66K
drwxr-xr-x   4 root   root   4 Sep 10 03:17 .
drwxrwxrwt  24 root   video 28 Sep  9 21:24 ..
drwxrws--x+  3 alexis  1004  3 Sep 10 03:17 Shared Documents
drwxrws--x+  3 alexis  1004  3 Sep  9 21:24 Taxes

bad2/Shared Documents:
ls: bad2/Shared Documents/.AppleDouble: Bad address
total 42K
drwxrws--x+ 3 alexis   1004 3 Sep 10 03:17 .
drwxr-xr-x  4 root   root   4 Sep 10 03:17 ..
drwxr-sr-x  2 alexis alexis 9 Aug  6 19:22 .AppleDouble
ls: cannot open directory bad2/Shared Documents/.AppleDouble: Input/output error

bad2/Taxes:
ls: bad2/Taxes/.AppleDouble: Bad address
total 74K
drwxrws--x+ 3 alexis 1004 3 Sep  9 21:24 .
drwxr-xr-x  4 root   root 4 Sep 10 03:17 ..
drwxr-sr-x  2 alexis 1004 7 Jun 21 02:38 .AppleDouble
ls: cannot open directory bad2/Taxes/.AppleDouble: Input/output error

All four problem dirs are ".AppleDouble". (The ones named AD were renamed by me to make it easier to poke at them.) They were all created (on their original ext4 FS) by netatalk, and they're fine there. I was also able to recopy them into the zfs FS with no errors, and with xattrs/ACLs coming across correctly. The recopy was done with "cp -pr", while the original copy was done with rsync; however, rsync copied other trees with xattrs and ACLs without errors. So all in all I have no idea what to make of this data point, but I thought it was worth mentioning.

@dweeezil
Contributor

@alexisrosen The zdb output above suggests there aren't any xattrs on 60077. I find this surprising and it's likely wrong. The key piece of information in that output is SPILL_BLKPTR. Most of our issues regarding xattr SAs seem to have revolved around extension into spill blocks. Could you please strace the ls command on that directory to see which syscall is getting the I/O error? Maybe just run "stat" on it to see what happens. It'd be interesting to know the exact series of system calls used when this directory was created. It sounds like it was created by rsync. Maybe you could try rsyncing just this directory from its original source on an ext4 file system (as produced by netatalk), see if you can reduce this to a simpler test case, and then strace the rsync once you can reproduce it.
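
Something along these lines would capture the system calls (the output file names and rsync flags are just examples):

# strace -f -o /tmp/ls.strace ls -l bad
# strace -f -o /tmp/rsync.strace rsync -aAX <original ext4 source>/ <new zfs destination>/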

I've got a number of patches to zdb to help debug this type of problem. If you can't reliably reproduce this, then those might be useful.

@alexisrosen
Author

Argh. Just now I tried to remount the FS to do the things you asked, and the mount hung again. Sure enough, just like last time, the thread "txg_sync" was stuck in diskwait. That despite my having unmounted all the filesystems before turning off that drive. What is going on here? Is this a known defect?

Rebooting now...

@dweeezil
Contributor

@alexisrosen If you're having problems importing or mounting, you might try importing in read-only mode with zpool import -o readonly=on. You could also get some more information with zdb without importing the pool at all. In order to run zdb on a non-imported pool, you must use the "-e" and "-p" options. For example, zdb -e -p /dev/disk/by-id tank/fs -dddd 1234. I'm particularly interested in seeing 3 pieces of information.

First, the SA registration object. It's typically in object 5 and will look like this:

# zdb -e -p /dev/disk/by-id -dddd mytank/fs 5
Dataset mytank/fs [ZPL],...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         5    1    16K  1.50K     1K  1.50K  100.00  SA attr registration
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    microzap: 1536 bytes, 21 entries

        ZPL_RDEV =  800000a : [8:0:10]
        ZPL_XATTR =  8000009 : [8:0:9]

If it's not object 5, you can fish around low-numbered objects with zdb until you find it. I'm interested in the "ZPL_..." set of lines at the end. This list shows the system attributes in use throughout the specified file system.
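
For example, a quick way to fish is to dump the first dozen objects and look at the type line:

# for obj in $(seq 1 12); do echo "object $obj:"; zdb -e -p /dev/disk/by-id -dddd <pool>/<fs> $obj | grep 'SA attr'; done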

Next, the SA layouts which are typically in object 6:

# zdb -e -p /dev/disk/by-id -dddd mytank/fs 6
Dataset mytank/fs [ZPL], ... 
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         6    1    16K    16K  7.00K    32K  100.00  SA attr layouts
...
        2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]
        3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  20 ]

Again, if it's not object 6, it should be in one of the low-numbered ones. We want to see the lines at the end of the form "2=..." as shown above. This list shows all of the combinations and ordering of system attributes used throughout the specified file system.

Finally, I'd like to see the SA layout used for one of the broken files/directories such as 60077 above. Unfortunately, zdb doesn't print this information. You can compile my zdb branch as of dweeezil/zfs@495b189 (in https://github.com/dweeezil/zfs/tree/zdb) and run its zdb right out of the build directory as cmd/zdb/zdb <options> ... and it will show something like this:

# cmd/zdb/zdb -e -p /dev/disk/by-id -dddddd mytank/fs 7
Dataset mytank/fs [ZPL]...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         7    1    16K    512    512    512  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        240   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    SA hdrsize 16
    SA layout 3

The SA header size and layout are printed right after "dnode maxblkid".

By combining those 3 pieces of information, we'll get a better read on the contents of the SA.

AFAIK, there are no known bugs in this area at the moment. The debugging I've outlined above will allow us to tell whether there is still a problem lurking or whether the problem is more likely caused by something else on your system.

@alexisrosen
Author

First of all - no problem mounting the FS... as long as I reboot to clear the diskwait, which is most likely due to the txg_sync kernel thread being hung in diskwait. I didn't remember to get a stack trace before rebooting, sorry - next time I will. (And there will likely be a next time; this looks reproducible. To repeat: is this a known bug?)

Anyway, there aren't any xattrs on 60077's analog on the source disk, so the zdb output saying that 60077 doesn't have any should also be correct.

Here's the trace on "ls -l bad":

openat(AT_FDCWD, "/z/shared/bad", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, /* 3 entries */, 32768)     = 72
lstat("/z/shared/bad/AD", {st_mode=S_IFDIR|S_ISGID|0755, st_size=9, ...}) = 0
lgetxattr("/z/shared/bad/AD", "security.selinux", 0x1c464d0, 255) = -1 EFAULT (Bad address)

Wow, how totally unhelpful. That return val is not special to getxattr, and the man page implies that it comes from stat. Stat just says that EFAULT means... "Bad address". WTF? The stat itself returned 0, so what is going on?

As for reproducing the error... I already tried rsyncing just that subtree. Works fine, no errors.

@alexisrosen
Author

I'm out of time, unfortunately, but will try to get you the other info you asked for tomorrow.

@dweeezil
Contributor

@alexisrosen It's not clear to me exactly which steps you're taking that cause the hung sync task. Going out on a limb, here are a few possibilities: I got the impression that something regarding the pool and/or device configuration may have changed; if /etc/zfs/zpool.cache exists when the zfs module is loaded, it'll believe its contents unless the zfs_autoimport_disable module parameter is set. If the cache file is loaded and it has extra and/or incorrect information, the zfs module can get very confused. Another possibility is that you're running into a condition which has been fixed by 2d50158. Finally, there's a chance the system is running into a bad dnode (with a bad xattr?) and panicking internally, which may stop everything in its tracks.
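
To rule the cache file out, you could either remove /etc/zfs/zpool.cache before the module loads or set the parameter, e.g. (a sketch) in /etc/modprobe.d/zfs.conf:

options zfs zfs_autoimport_disable=1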

As to the error on lgetxattr(), that's kind of what I was expecting to see (not sure about the EFAULT, however). It does likely confirm there's an xattr of some sort on "/z/shared/bad/AD". BTW, do you also have selinux enabled on this system? That's another potentially big user of xattrs. Also, are you sure about the lack of xattrs on the source system? Did you examine it with getfattr -d -m - <name>? In any case, the inode on which I'd like you to run the various zdb commands is the one corresponding to any file returning a weird error on an xattr-related call (which would be "/z/shared/bad/AD" in this case). The EFAULT is actually a somewhat expected unexpected error (if that makes any sense) because it's returned when consistency problems are detected on nvlists.
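
Concretely, on the ext4 source that would be something like (the path is a placeholder):

# getfattr -d -m - /path/on/ext4/to/.AppleDouble
# getfacl /path/on/ext4/to/.AppleDouble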

@alexisrosen
Author

I wondered what an strace of "cd bad ; ls AD" would show. It was this:

stat("AD", {st_mode=S_IFDIR|S_ISGID|0755, st_size=9, ...}) = 0
openat(AT_FDCWD, "AD", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 EIO (Input/output error)

Hm. Totally uninformative. EIO isn't even documented as a possible return code for openat() or open(), and again, there are NO messages in syslog (or dmesg) showing a block error. So where the heck is that IO error coming from? Shouldn't that be the focus?

@alexisrosen
Author

About the hung txg_sync: nothing ever changes for the zpool... except that sometimes the underlying device (/dev/disk/by-id/whatever) isn't there, since that dock is sometimes used for a different drive. (And of course, this isn't the final configuration, which will have mirrored disks that are never removed.)

The hang looks like a bug to me, but may be driven by my not understanding the expected behavior. This is what I do:

  1. Use various filesystems in pool zp on disk /dev/disk/by-id/whatever
  2. Unmount all those filesystems and confirm they're unmounted with "df"
  3. Turn off (or remove) the disk /dev/disk/by-id/whatever
  4. Wait some amount of time (hours or more)
  5. Attempt to "zpool status" or "zpool list"

At this point, the zpool command gets stuck in diskwait, I look around with ps, and find txg_sync in diskwait as well. Only a reboot clears this (and one out of two times, the system couldn't reboot, stuck with some daemons already down and some still up. Annoying...).

But... If I don't do step 4 (wait a while), and do "zpool status" only a few minutes after turning off the disk, not only doesn't it hang, but it says (in part):

    NAME                      STATE     READ WRITE CKSUM
    zp                        ONLINE       0     0     0
      wwn-0x5000c50074929fa3  ONLINE       0     0     0

...which is clearly not the case, and must be the result of using cached info since the disk is off.

Now, I would argue that hanging in diskwait is a bug even if I'm not following the rules, but... am I following the rules? Am I supposed to export that pool before removing (or turning off) the media? Or do something else so zfs doesn't hang trying to read it?
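
(For what it's worth, my guess at the "right" sequence for removable media would be to export before pulling the disk and import after reattaching, something like:)

zfs unmount -a
zpool export zp
# ...power off / remove the disk, later reattach it...
zpool import zp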

@dweeezil
Contributor

Based on the EIO, I'd say this and #2700 are the same thing. I have a feeling that in both this issue and #2700, the dnode spill is getting wrecked. I'm going to try to work up a test case and also to enhance my debugging zdb branch a bit more to assist.

@alexisrosen
Author

Guys - sorry I disappeared on you. Some major real-life events forced me to put this whole project on hold, and I'm not finished yet, so the disk has just been sitting around waiting for me to find time for debugging. I'm glad to see that you think you've got this resolved, and I hope to be able to get back to working on this in a few weeks.

I used the Ubuntu PPA to set up ZFS originally; what do I need to do if I want a version that picks up this fix?

Thanks!

ryao pushed a commit to ryao/zfs that referenced this issue Nov 29, 2014
If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written.  This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.

The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:

	setfattr -n user.test -v very_very_very..._long_value  <file>
	setfattr -n user.test -v short_value  <file>

The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#2663
Closes openzfs#2700
Closes openzfs#2701
Closes openzfs#2717
Closes openzfs#2863
Closes openzfs#2884

Conflicts:
	module/zfs/dbuf.c
behlendorf pushed a commit that referenced this issue Dec 23, 2014
If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written.  This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.

The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:

	setfattr -n user.test -v very_very_very..._long_value  <file>
	setfattr -n user.test -v short_value  <file>

The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #2663
Closes #2700
Closes #2701
Closes #2717
Closes #2863
Closes #2884