PANIC: metaslab_free_dva(): bad DVA with zfs 0.6.5.2 #3937

xhernandez · 2015-10-19T12:16:43Z

I've found what seems to be an in-memory corruption in zfs.

I have created a pool using raidz1 with 6 disks and two child datasets. I've configured the following properties:

xattr=sa
acltype=posixacl
compression=lz4

And I've set a quota on each child dataset. I use each dataset as a brick for gluster.

When gluster rebuilds a brick (self-heal operation), it copies data into zfs that contains extended attributes and ACLs. While it was doing this, zfs detected a problem (see below).

After restarting the server and accessing the volume, no problem has been detected, so I assume that on-disk data is healthy. I'm currently trying to reproduce the problem on another test server because this one is in production.

The bad DVA is 201326592:14468632831362131968:0. I only have two vdevs (0 and 1), so 201326592 is clearly wrong (in hex it's 0xC000000; I'm not sure whether that means anything). The DVA's offset is more interesting: in hex it's 0xC8CAE8E6EAE4E800. Divided by two, that is 0x6465747375727400, whose bytes spell "trusted" in ASCII when read in little-endian order. Many of the extended attributes that gluster uses start with "trusted", so I'm guessing that some extended attribute manipulation has corrupted a memory block.
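The arithmetic above can be double-checked with a few lines of Python. This is only a sketch of the reasoning in this comment: the two numbers are copied from the panic message, and halving the offset follows the description above rather than any ZFS macro.

```python
# Values copied from the panic message
# "bad DVA 201326592:14468632831362131968:0".
vdev = 201326592
offset = 14468632831362131968

# Halve the offset, as described above, and view the result as
# 8 bytes in little-endian order.
half = offset >> 1
as_bytes = half.to_bytes(8, "little")

print(hex(vdev))    # vdev number in hex
print(hex(offset))  # DVA offset in hex
print(hex(half))    # offset divided by two
print(as_bytes)     # the printable part is the xattr name prefix
```

The last line prints a NUL byte followed by the seven letters of "trusted", which is what suggests that attribute-name bytes ended up where a block pointer was expected.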

kernel: [5292444.782956] PANIC: metaslab_free_dva(): bad DVA 201326592:14468632831362131968:0
kernel: [5292444.783282] Showing stack for process 61687
kernel: [5292444.783286] CPU: 6 PID: 61687 Comm: z_wr_int_4 Tainted: P           O--------------   3.10.0-5-pve #1
kernel: [5292444.783288] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
kernel: [5292444.783290]  0000000000000000 ffff8808e85b3728 ffffffff816125e0 ffff8808e85b3738
kernel: [5292444.783296]  ffffffffa062e784 ffff8808e85b3868 ffffffffa062e81e ffff8808e85b37b8
kernel: [5292444.783299]  62616c736174656d 76645f656572665f 646162203a292861 3130322041564420
kernel: [5292444.783302] Call Trace:
kernel: [5292444.783313]  [<ffffffff816125e0>] dump_stack+0x19/0x1b
kernel: [5292444.783323]  [<ffffffffa062e784>] spl_dumpstack+0x44/0x50 [spl]
kernel: [5292444.783328]  [<ffffffffa062e81e>] vcmn_err+0x8e/0x130 [spl]
kernel: [5292444.783355]  [<ffffffffa0262454>] ? isci_task_execute_task+0x204/0x320 [isci]
kernel: [5292444.783360]  [<ffffffff8118ed02>] ? kmem_cache_alloc+0x1b2/0x1e0
kernel: [5292444.783363]  [<ffffffff816161ed>] ? mutex_lock+0x1d/0x41
kernel: [5292444.783367]  [<ffffffffa062ae19>] ? spl_kmem_cache_alloc+0x69/0x150 [spl]
kernel: [5292444.783410]  [<ffffffffa0b00342>] zfs_panic_recover+0x52/0x60 [zfs]
kernel: [5292444.783414]  [<ffffffffa062ae19>] ? spl_kmem_cache_alloc+0x69/0x150 [spl]
kernel: [5292444.783435]  [<ffffffffa0ae3ff8>] metaslab_free_dva+0x1e8/0x3b0 [zfs]
kernel: [5292444.783456]  [<ffffffffa0ae6fdc>] metaslab_free+0x9c/0xe0 [zfs]
kernel: [5292444.783482]  [<ffffffffa0b4a1bc>] zio_dva_free+0x1c/0x30 [zfs]
kernel: [5292444.783504]  [<ffffffffa0b4e012>] zio_wait+0xd2/0x210 [zfs]
kernel: [5292444.783524]  [<ffffffffa0b4e21b>] zio_free+0xcb/0x120 [zfs]
kernel: [5292444.783544]  [<ffffffffa0addb21>] dsl_free+0x11/0x20 [zfs]
kernel: [5292444.783562]  [<ffffffffa0ac7e88>] dsl_dataset_block_kill+0x278/0x4c0 [zfs]
kernel: [5292444.783576]  [<ffffffffa0aa74ca>] dbuf_write_done+0x19a/0x240 [zfs]
kernel: [5292444.783588]  [<ffffffffa0a9e5fe>] arc_write_done+0x25e/0x3f0 [zfs]
kernel: [5292444.783609]  [<ffffffffa0b4ff59>] zio_done.part.11+0x259/0xed0 [zfs]
kernel: [5292444.783613]  [<ffffffffa06298ca>] ? spl_kmem_free+0x2a/0x40 [spl]
kernel: [5292444.783616]  [<ffffffff8118dfbd>] ? kfree+0xfd/0x130
kernel: [5292444.783618]  [<ffffffff816161ed>] ? mutex_lock+0x1d/0x41
kernel: [5292444.783638]  [<ffffffffa0b50c4a>] zio_done+0x7a/0x80 [zfs]
kernel: [5292444.783658]  [<ffffffffa0b506fc>] zio_done.part.11+0x9fc/0xed0 [zfs]
kernel: [5292444.783677]  [<ffffffffa0b50c4a>] zio_done+0x7a/0x80 [zfs]
kernel: [5292444.783696]  [<ffffffffa0b506fc>] zio_done.part.11+0x9fc/0xed0 [zfs]
kernel: [5292444.783716]  [<ffffffffa0ad7830>] ? dsl_pool_undirty_space+0xd0/0xe0 [zfs]
kernel: [5292444.783735]  [<ffffffffa0b50c4a>] zio_done+0x7a/0x80 [zfs]
kernel: [5292444.783755]  [<ffffffffa0b506fc>] zio_done.part.11+0x9fc/0xed0 [zfs]
kernel: [5292444.783774]  [<ffffffffa0b50c4a>] zio_done+0x7a/0x80 [zfs]
kernel: [5292444.783793]  [<ffffffffa0b4ad68>] zio_execute+0xc8/0x180 [zfs]
kernel: [5292444.783798]  [<ffffffffa062caee>] taskq_thread+0x1fe/0x3f0 [spl]
kernel: [5292444.783803]  [<ffffffff81094450>] ? try_to_wake_up+0x2a0/0x2a0
kernel: [5292444.783807]  [<ffffffffa062c8f0>] ? taskq_thread_spawn+0x70/0x70 [spl]
kernel: [5292444.783812]  [<ffffffff81083080>] kthread+0xc0/0xd0
kernel: [5292444.783815]  [<ffffffff81082fc0>] ? flush_kthread_worker+0x80/0x80
kernel: [5292444.783820]  [<ffffffff8162262c>] ret_from_fork+0x7c/0xb0
kernel: [5292444.783822]  [<ffffffff81082fc0>] ? flush_kthread_worker+0x80/0x80

xhernandez · 2015-11-04T08:50:47Z

I have been able to reproduce the problem once or twice per day; however, I haven't been able to identify the cause.

Some more info I have found:

When the panic happens, sometimes but not always, a user process gets stuck in 'D' state. In this case, the process has 3 threads in a zfs system call and they are always doing the same things: creating a hard link (linkat), writing to a file, and calling fsync.

I've also been able to see the contents of the blkptr_t whose DVAs are being freed:

ffff880ed13dd7f0 | 00 00 00 30 00 00 00 0c 74 72 75 73 74 65 64 2e | ...0....trusted.
ffff880ed13dd800 | 67 66 69 64 00 00 00 0a 00 00 00 10 cb da 87 1a | gfid............
ffff880ed13dd810 | 3b 77 40 7d 90 07 e5 53 9c ce fe 98 00 00 00 00 | ;w@}...S........
ffff880ed13dd820 | 00 00 00 00 00 00 2c 00 00 00 00 00 00 00 00 00 | ......,.........
ffff880ed13dd830 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff880ed13dd840 | 30 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 0...............
ffff880ed13dd850 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff880ed13dd860 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

As you can see, it seems to contain embedded data; however, the "embedded" flag is not set, which causes the panic. It also seems to be part of an nvlist containing one extended attribute. That extended attribute belongs to a directory.
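To make the two readings of this dump concrete, here is a minimal Python sketch that interprets the same 128 bytes once as a blkptr_t and once as a fragment of an XDR-encoded nvlist. The bit positions (vdev in the upper 32 bits of the first DVA word, offset in the low 63 bits of the second word in 512-byte units, embedded flag at bit 39 and object type at bits 48-55 of blk_prop) and the nvpair field order (size, name length, name, type code, element count, value) are assumptions taken from my reading of the ZFS headers of that era, not something the dump itself states.

```python
import struct

# The 128-byte blkptr_t dump quoted above, one row per line.
DUMP = """
00 00 00 30 00 00 00 0c 74 72 75 73 74 65 64 2e
67 66 69 64 00 00 00 0a 00 00 00 10 cb da 87 1a
3b 77 40 7d 90 07 e5 53 9c ce fe 98 00 00 00 00
00 00 00 00 00 00 2c 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""
raw = bytes.fromhex("".join(DUMP.split()))

# Reading 1: a blkptr_t, i.e. sixteen little-endian 64-bit words.
words = struct.unpack("<16Q", raw)
blk_prop = words[6]                      # assumed position of blk_prop
embedded = (blk_prop >> 39) & 1          # assumed "embedded" flag bit
obj_type = (blk_prop >> 48) & 0xFF       # assumed object type field
vdev0 = words[0] >> 32                   # DVA[0] vdev
offset0 = ((words[1] & ((1 << 63) - 1)) << 9) & ((1 << 64) - 1)  # DVA[0] offset

# Reading 2: big-endian XDR fields of a single nvpair.
size, name_len = struct.unpack(">II", raw[0:8])
name = raw[8:8 + name_len].decode("ascii")
type_code, nelem = struct.unpack(">II", raw[8 + name_len:16 + name_len])
value = raw[16 + name_len:16 + name_len + nelem]

print("blkptr view:", vdev0, offset0, "embedded =", embedded, "type =", obj_type)
print("nvpair view:", size, name, type_code, nelem, value.hex())
```

Under these assumptions the embedded bit comes out as 0, and DVA[0] decodes to the same vdev and offset that appeared in the first panic message, while the nvpair view yields a 16-byte `trusted.gfid` value.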

Not sure if all this information may be useful to find the root cause.

If anyone has any ideas for further tests, I'll be happy to try them.

xhernandez · 2015-11-04T09:16:26Z

More info. I've just compiled zfs with debugging and it has failed in another place:

kernel: [ 1305.076205] VERIFY3(space >= -delta) failed (0 >= 3536635740)
kernel: [ 1305.076463] PANIC at dnode.c:1803:dnode_diduse_space()
kernel: [ 1305.076690] Showing stack for process 147993
kernel: [ 1305.076697] CPU: 14 PID: 147993 Comm: z_wr_iss Tainted: P           O--------------   3.10.0-1-pve #1
kernel: [ 1305.076701] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
kernel: [ 1305.076704]  ffffffffa0836d18 ffff881d5a5dba68 ffffffff8161006a ffff881d5a5dba78
kernel: [ 1305.076711]  ffffffffa0302764 ffff881d5a5dbc18 ffffffffa030299d ffff881b39540d80
kernel: [ 1305.076716]  00000420198ee000 ffff881d00000030 ffff881d5a5dbc28 ffff881d5a5dbbb8
kernel: [ 1305.076721] Call Trace:
kernel: [ 1305.076744]  [<ffffffff8161006a>] dump_stack+0x19/0x1b
kernel: [ 1305.076759]  [<ffffffffa0302764>] spl_dumpstack+0x44/0x50 [spl]
kernel: [ 1305.076772]  [<ffffffffa030299d>] spl_panic+0xbd/0x100 [spl]
kernel: [ 1305.076833]  [<ffffffffa0775be9>] ? metaslab_block_alloc+0xb9/0x1c0 [zfs]
kernel: [ 1305.076868]  [<ffffffffa077697f>] ? metaslab_alloc_dva+0x8ef/0xda0 [zfs]
kernel: [ 1305.076911]  [<ffffffffa07f7136>] ? zio_execute+0x126/0x350 [zfs]
kernel: [ 1305.076949]  [<ffffffffa07fcc4f>] ? zio_nowait+0x10f/0x310 [zfs]
kernel: [ 1305.076955]  [<ffffffff81613d5d>] ? mutex_lock+0x1d/0x41
kernel: [ 1305.076960]  [<ffffffff81613d5d>] ? mutex_lock+0x1d/0x41
kernel: [ 1305.076998]  [<ffffffffa0796c58>] ? spa_config_held+0xb8/0xd0 [zfs]
kernel: [ 1305.077028]  [<ffffffffa074ccbc>] dnode_diduse_space+0x29c/0x310 [zfs]
kernel: [ 1305.077062]  [<ffffffffa0796cef>] ? dva_get_dsize_sync+0x7f/0xc0 [zfs]
kernel: [ 1305.077098]  [<ffffffffa0796d76>] ? bp_get_dsize_sync+0x46/0xa0 [zfs]
kernel: [ 1305.077122]  [<ffffffffa071faaf>] dbuf_write_ready+0xaf/0x4e0 [zfs]
kernel: [ 1305.077143]  [<ffffffffa070ee3c>] arc_write_ready+0x6c/0x1d0 [zfs]
kernel: [ 1305.077180]  [<ffffffffa07ff967>] zio_ready+0x97/0x7b0 [zfs]
kernel: [ 1305.077190]  [<ffffffffa02ffeb2>] ? taskq_member+0x62/0x70 [spl]
kernel: [ 1305.077246]  [<ffffffffa07f6fd2>] ? zio_taskq_member.isra.4+0x62/0xa0 [zfs]
kernel: [ 1305.077282]  [<ffffffffa07f7136>] zio_execute+0x126/0x350 [zfs]
kernel: [ 1305.077291]  [<ffffffffa0300aee>] taskq_thread+0x1fe/0x3f0 [spl]
kernel: [ 1305.077298]  [<ffffffff81091230>] ? try_to_wake_up+0x2b0/0x2b0
kernel: [ 1305.077306]  [<ffffffffa03008f0>] ? taskq_thread_spawn+0x70/0x70 [spl]
kernel: [ 1305.077310]  [<ffffffff81080700>] kthread+0xc0/0xd0
kernel: [ 1305.077314]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80
kernel: [ 1305.077320]  [<ffffffff8162022c>] ret_from_fork+0x7c/0xb0
kernel: [ 1305.077324]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80

xhernandez · 2015-11-04T10:59:18Z

In this last test, one thread of the user process was calling 'mkdir' and got stuck.

dweeezil · 2015-11-04T13:57:38Z

@xhernandez Could you please add some debugging like this (completely untested patch) to get the object number:

[~/src/zfs] cardinal% git diff
diff --git a/module/zfs/dnode.c b/module/zfs/dnode.c
index 2858bbf..9d76f53 100644
--- a/module/zfs/dnode.c
+++ b/module/zfs/dnode.c
@@ -1798,8 +1798,12 @@ dnode_diduse_space(dnode_t *dn, int64_t delta)
        mutex_enter(&dn->dn_mtx);
        space = DN_USED_BYTES(dn->dn_phys);
        if (delta > 0) {
+               if (!(space + delta >= space))
+                       printk("%s line %d: obj %lld\n", __FUNCTION__, __LINE__, (u_longlong_t)dn->dn_object);
                ASSERT3U(space + delta, >=, space); /* no overflow */
        } else {
+               if (!(space >= -delta))
+                       printk("%s line %d: obj %lld\n", __FUNCTION__, __LINE__, (u_longlong_t)dn->dn_object);
                ASSERT3U(space, >=, -delta); /* no underflow */
        }
        space += delta;

Then you can examine it with zdb -ddddd and get a better handle on the corruption involved. You can also use one of my enhanced zdb's available in https://github.com/dweeezil/zfs/tree/zdb (which I just rebased on current master code); it adds additional debugging up to 7 "-d" options.

This feels like one of the types of dnode SA corruption which should have been fixed a long time ago. How long has this pool existed? If it's been around for a long time, it's possible the corruption has been present for a while. Or, if the system doesn't have ECC memory, it could have been caused by a bit-flip.

In any case, get the object number, dump it with zdb and that should give a better idea as to what's happening.

xhernandez · 2015-11-04T14:12:30Z

@dweeezil I'll apply the patch and get the information you requested.

The pool is completely new. On the test machine I'm using, the pool is recreated before running each test. This problem has happened on two different servers, both using ECC memory.

I'll update as soon as I get more info. Thanks.

xhernandez · 2015-11-05T08:51:02Z

A new panic:

kernel: [ 1344.465350] dnode_diduse_space line 1806: obj 16388
kernel: [ 1344.465356] VERIFY3(space >= -delta) failed (0 >= 3536635740)
kernel: [ 1344.465638] PANIC at dnode.c:1810:dnode_diduse_space()
kernel: [ 1344.465874] Showing stack for process 139682
kernel: [ 1344.465880] CPU: 6 PID: 139682 Comm: z_wr_iss Tainted: P           O--------------   3.10.0-1-pve #1
kernel: [ 1344.465882] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
kernel: [ 1344.465886]  ffffffffa0522d18 ffff881b97c07a38 ffffffff8161006a ffff881b97c07a48
kernel: [ 1344.465893]  ffffffffa030a764 ffff881b97c07be8 ffffffffa030a99d ffff881b97c07ab8
kernel: [ 1344.465899]  ffffffff8105acdd 0000000000000030 ffff881b97c07bf8 ffff881b97c07b88
kernel: [ 1344.465905] Call Trace:
kernel: [ 1344.465929]  [<ffffffff8161006a>] dump_stack+0x19/0x1b
kernel: [ 1344.465941]  [<ffffffffa030a764>] spl_dumpstack+0x44/0x50 [spl]
kernel: [ 1344.465949]  [<ffffffffa030a99d>] spl_panic+0xbd/0x100 [spl]
kernel: [ 1344.465960]  [<ffffffff8105acdd>] ? msg_print_text+0xdd/0x1b0
kernel: [ 1344.465966]  [<ffffffff81000a29>] ? _stext+0x861/0xe38
kernel: [ 1344.465989]  [<ffffffff8105bfc9>] ? console_unlock+0x209/0x3f0
kernel: [ 1344.466002]  [<ffffffff8160978b>] ? printk+0x61/0x63
kernel: [ 1344.466047]  [<ffffffffa0438ddb>] dnode_diduse_space+0x38b/0x3a0 [zfs]
kernel: [ 1344.466086]  [<ffffffffa0460482>] ? memory_dump+0x142/0x160 [zfs]
kernel: [ 1344.466122]  [<ffffffffa048cb2e>] ? vdev_lookup_top+0x2e/0xd0 [zfs]
kernel: [ 1344.466144]  [<ffffffffa040bac2>] dbuf_write_ready+0xc2/0x510 [zfs]
kernel: [ 1344.466164]  [<ffffffffa03fae3c>] arc_write_ready+0x6c/0x1d0 [zfs]
kernel: [ 1344.466202]  [<ffffffffa04eba37>] zio_ready+0x97/0x7b0 [zfs]
kernel: [ 1344.466211]  [<ffffffffa0307eb2>] ? taskq_member+0x62/0x70 [spl]
kernel: [ 1344.466244]  [<ffffffffa04e30a2>] ? zio_taskq_member.isra.4+0x62/0xa0 [zfs]
kernel: [ 1344.466277]  [<ffffffffa04e3206>] zio_execute+0x126/0x350 [zfs]
kernel: [ 1344.466284]  [<ffffffff8161756b>] ? _raw_spin_unlock_irqrestore+0x1b/0x40
kernel: [ 1344.466292]  [<ffffffffa0308aee>] taskq_thread+0x1fe/0x3f0 [spl]
kernel: [ 1344.466302]  [<ffffffff81091230>] ? try_to_wake_up+0x2b0/0x2b0
kernel: [ 1344.466307]  [<ffffffffa03088f0>] ? taskq_thread_spawn+0x70/0x70 [spl]
kernel: [ 1344.466311]  [<ffffffff81080700>] kthread+0xc0/0xd0
kernel: [ 1344.466314]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80
kernel: [ 1344.466320]  [<ffffffff8162022c>] ret_from_fork+0x7c/0xb0
kernel: [ 1344.466322]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80

The object info:

# zdb -ddddddd pool-sata/brick2 16388
Dataset pool-sata/brick2 [ZPL], ID 49, cr_txg 6, 28.0G, 16404 objects, rootbp DVA[0]=<0:25cc464000:400> DVA[1]=<0:42016e0ec00:400> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=1378L/1378P fill=16404 cksum=160522d119:6e85e77d5b6:1351d97b37057:272acf11a4b6aa

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     16388    1    16K    512      0    512  100.00  ZFS directory (K=inherit) (Z=inherit)
                                        244   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 0
        SA hdrsize 16
        SA layout 3
        path    <hidden>
        uid     2502
        gid     2513
        atime   Thu Nov  5 09:30:59 2015
        mtime   Thu Nov  5 09:30:59 2015
        ctime   Thu Nov  5 09:30:59 2015
        crtime  Thu Nov  5 09:30:59 2015
        gen     1377
        mode    40770
        size    2
        parent  16370
        links   2
        pflags  40800000044
        ndacl   3
        Misc SA sizes:
                DACL_ACES = 24
                ZNODE_ACL = N/A
        dump_znode_sa_xattr: sa_xattr_size=68 sa_size error=0
        SA packed dump sa_xattr_size=68: \001\001\000\000\000\000\000\000\000\000\000\001\000\000\000\060\000\000\000\060\000\000\000\014\164\162\165\163\164\145\144\056\147\146\151\144\000\000\000\012\000\000\000\020\030\043\352\174\344\062\114\140\265\324\143\112\271\131\232\246\000\000\000\000\000\000\000\000
        SA xattr dump:
                trusted.gfid[0]: 24
                trusted.gfid[1]: 35
                trusted.gfid[2]: 234
                trusted.gfid[3]: 124
                trusted.gfid[4]: 228
                trusted.gfid[5]: 50
                trusted.gfid[6]: 76
                trusted.gfid[7]: 96
                trusted.gfid[8]: 181
                trusted.gfid[9]: 212
                trusted.gfid[10]: 99
                trusted.gfid[11]: 74
                trusted.gfid[12]: 185
                trusted.gfid[13]: 89
                trusted.gfid[14]: 154
                trusted.gfid[15]: 166
        SA xattrs: 68 bytes, 1 entries

                trusted.gfid = \030#\352|\3442L`\265\324cJ\271Y\232\246
        microzap: 512 bytes, 0 entries

Indirect blocks:
               0 L0 EMBEDDED et=0 200L/1dP B=1377

                segment [0000000000000000, 0000000000000200) size   512

dweeezil · 2015-11-05T15:02:12Z

@xhernandez I guess I jumped to the wrong conclusion; your dnode is perfectly fine insofar as ZFS is concerned. This is exactly what I'd expect a newly-created directory's dnode to look like. The problem is that a space delta is being calculated incorrectly. It's trying to free 3536635740 bytes of space from a 0 byte object, which triggers the ASSERT. I'm looking through the code right now to try to see how this might happen. Is this a problem you can reproduce easily? Is it always triggered on the same directory? Have you got any idea what types of operations might be happening to that directory at the time? Adding files to it, deleting files from it?
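For readers following along, the check that fires can be modelled in a few lines of Python. This is a simplified sketch of the two assertions visible in the debugging patch above, not the real dnode_diduse_space(); the function name and the exception types are made up for illustration.

```python
U64_MASK = (1 << 64) - 1

def dnode_diduse_space_model(space, delta):
    """Hypothetical model: return the new used-bytes count or raise."""
    if delta > 0:
        # "no overflow": the unsigned 64-bit sum must not wrap around
        if ((space + delta) & U64_MASK) < space:
            raise OverflowError("space + delta wrapped around")
    else:
        # "no underflow": cannot free more than the object currently uses
        if space < -delta:
            raise ArithmeticError(
                f"VERIFY3(space >= -delta) failed ({space} >= {-delta})")
    return (space + delta) & U64_MASK

# A new directory uses 0 bytes; the bogus delta asks to free ~3.5 GB.
try:
    dnode_diduse_space_model(0, -3536635740)
except ArithmeticError as err:
    print(err)
```

With a used-bytes count of 0, any negative delta fails the second check, so the object itself can be healthy while the delta handed to it is garbage.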

At this point, I'm suspicious as to whether there may be some place where proper handling of embedded data blkptrs isn't happening and/or that their related macros are even working properly at all. This could also be some sort of race condition.

xhernandez · 2015-11-05T16:55:16Z

@dweeezil Yes, it seems that I can recreate this problem quite easily now (in 1 or 2 hours), especially after having compiled with debugging. The directory is not always the same, but it's always a directory (at least so far). I think this is important because regular files also have ACLs and extended attributes but do not seem to have any problem.

It's hard to say exactly what it's doing when it fails. The process replicates information from another mount point (formatted using XFS) to the ZFS pool. In at least two of the failures, one thread of the user process has been blocked in an mkdir call. I cannot tell whether the mkdir is related to the failing directory, though. I'll try to find more detailed information about the steps it's doing.

I've done some more tests. It seems that the failure comes from the dbuf_write_ready() function, where delta is calculated. bp_get_dsize_sync(spa, bp) returns 1700, but bp_get_dsize_sync(spa, bp_orig) returns 3536637440 (zio->io_prev_space_delta is 0). bp_orig comes from zio->io_bp_orig and these are its contents:

00 00 00 30 00 00 00 0c 74 72 75 73 74 65 64 2e | ...0....trusted.
67 66 69 64 00 00 00 0a 00 00 00 10 5c bf 13 17 | gfid........\...
91 96 48 24 a4 cc 9d 92 60 22 d8 97 00 00 00 00 | ..H$....`"......
00 00 00 00 00 00 2c 00 00 00 00 00 00 00 00 00 | ......,.........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
9c 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

I'm not sure if it's important, but zio->io_bp_copy is exactly equal to zio->io_bp_orig.

The size is calculated from the DVAs, but since the blkptr_t appears to hold embedded data while the embedded flag is not set, the calculated size is incorrect. I don't know where or how it gets modified.
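As a rough cross-check of these numbers, the sketch below decodes the allocated-size field of the first two DVAs of bp_orig and recomputes the delta. It assumes the size field is the low 24 bits of the first DVA word, counted in 512-byte units; that layout is my assumption about the on-disk format, and the helper name is made up.

```python
import struct

SPA_MINBLOCKSHIFT = 9  # sizes are stored in 512-byte units (assumption)

# First 32 bytes of bp_orig from the dump above: DVA[0] and DVA[1].
bp_orig_head = bytes.fromhex(
    "000000300000000c" "747275737465642e"   # DVA[0]: "...0....trusted."
    "676669640000000a" "000000105cbf1317"   # DVA[1]: "gfid........\\..."
)

def dva_asize(raw16):
    """Hypothetical helper: allocated size encoded in one 16-byte DVA."""
    word0, _word1 = struct.unpack("<QQ", raw16)
    return (word0 & 0xFFFFFF) << SPA_MINBLOCKSHIFT

asize0 = dva_asize(bp_orig_head[0:16])
asize1 = dva_asize(bp_orig_head[16:32])

new_dsize = 1700          # bp_get_dsize_sync(spa, bp), as reported
old_dsize = 3536637440    # bp_get_dsize_sync(spa, bp_orig), as reported
delta = new_dsize - old_dsize

print(asize0, asize1, delta)
```

Under these assumptions DVA[0] has a size of 0 and DVA[1] decodes to 3536637440, the same value reported for bp_orig, with the size bits coming from the "gfi" characters of the attribute name. The resulting delta of -3536635740 matches the number in the failed VERIFY3. This is only an observation about the numbers, not a statement about the exact code path.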

I've dumped the full zio_t structure if it helps:

Raw:

ffff881f80350460 | 31 00 00 00 00 00 00 00 c6 48 00 00 00 00 00 00 | 1........H......
ffff881f80350470 | 00 00 00 00 00 00 00 00 fe ff ff ff ff ff ff ff | ................
ffff881f80350480 | 07 00 00 00 0f 00 00 00 2c 00 00 00 00 02 59 19 | ........,.....Y.
ffff881f80350490 | 00 00 00 00 00 00 00 00 00 00 00 00 02 00 00 00 | ................
ffff881f803504a0 | 03 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 | ................
ffff881f803504b0 | 9d 05 00 00 00 00 00 00 00 a0 22 bd 0f 88 ff ff | ..........".....
ffff881f803504c0 | 80 0d b8 fb 04 88 ff ff 00 00 00 00 00 00 00 00 | ................
ffff881f803504d0 | 00 00 00 30 00 00 00 0c 74 72 75 73 74 65 64 2e | ...0....trusted.
ffff881f803504e0 | 67 66 69 64 00 00 00 0a 00 00 00 10 5c bf 13 17 | gfid........\...
ffff881f803504f0 | 91 96 48 24 a4 cc 9d 92 60 22 d8 97 00 00 00 00 | ..H$....`"......
ffff881f80350500 | 00 00 00 00 00 00 2c 00 00 00 00 00 00 00 00 00 | ......,.........
ffff881f80350510 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350520 | 9c 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350530 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350540 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350550 | 30 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00 | 0...............
ffff881f80350560 | 20 75 67 d1 1a 88 ff ff 20 75 67 d1 1a 88 ff ff |  ug..... ug.....
ffff881f80350570 | 30 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00 | 0....... .......
ffff881f80350580 | 80 05 35 80 1f 88 ff ff 80 05 35 80 1f 88 ff ff | ..5.......5.....
ffff881f80350590 | 00 00 00 00 00 00 00 00 60 04 35 80 1f 88 ff ff | ........`.5.....
ffff881f803505a0 | 40 2a a1 67 0e 88 ff ff d0 7d 2c a0 ff ff ff ff | @*.g.....},.....
ffff881f803505b0 | a0 50 2c a0 ff ff ff ff 80 c3 2c a0 ff ff ff ff | .P,.......,.....
ffff881f803505c0 | c0 67 70 91 1b 88 ff ff 00 00 00 00 00 00 00 00 | .gp.............
ffff881f803505d0 | 00 00 00 30 00 00 00 0c 74 72 75 73 74 65 64 2e | ...0....trusted.
ffff881f803505e0 | 67 66 69 64 00 00 00 0a 00 00 00 10 5c bf 13 17 | gfid........\...
ffff881f803505f0 | 91 96 48 24 a4 cc 9d 92 60 22 d8 97 00 00 00 00 | ..H$....`"......
ffff881f80350600 | 00 00 00 00 00 00 2c 00 00 00 00 00 00 00 00 00 | ......,.........
ffff881f80350610 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350620 | 9c 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350630 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350640 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350650 | 00 94 df ab 1b 88 ff ff 00 54 fa e1 04 88 ff ff | .........T......
ffff881f80350660 | 00 02 00 00 00 00 00 00 00 04 00 00 00 00 00 00 | ................
ffff881f80350670 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350680 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350690 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803506a0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803506b0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803506c0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803506d0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 | ................
ffff881f803506e0 | 38 20 2f 00 00 00 00 00 01 00 00 00 38 20 2f 00 | 8 /.........8 /.
ffff881f803506f0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350700 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350710 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350720 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350730 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350740 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350750 | 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 | ................
ffff881f80350760 | 00 00 00 00 00 00 00 00 60 04 35 80 1f 88 ff ff | ........`.5.....
ffff881f80350770 | 00 00 00 00 00 00 00 00 20 ec 6c f1 0e 88 ff ff | ........ .l.....
ffff881f80350780 | 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 | ................
ffff881f80350790 | 90 07 35 80 1f 88 ff ff 90 07 35 80 1f 88 ff ff | ..5.......5.....
ffff881f803507a0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803507b0 | 0a 00 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803507c0 | f4 45 65 34 00 00 00 00 00 00 00 00 00 00 00 00 | .Ee4............
ffff881f803507d0 | d0 07 35 80 1f 88 ff ff d0 07 35 80 1f 88 ff ff | ..5.......5.....
ffff881f803507e0 | 00 00 00 00 00 00 00 00 e8 07 35 80 1f 88 ff ff | ..........5.....
ffff881f803507f0 | e8 07 35 80 1f 88 ff ff 01 00 00 00 00 00 00 00 | ..5.............
ffff881f80350800 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350810 | 00 00 00 00 00 00 00 00 04 00 04 00 00 00 00 00 | ................
ffff881f80350820 | 00 00 00 00 00 00 00 00 28 08 35 80 1f 88 ff ff | ........(.5.....
ffff881f80350830 | 28 08 35 80 1f 88 ff ff 00 00 00 00 00 00 00 00 | (.5.............
ffff881f80350840 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350850 | 80 da cb 81 ff ff ff ff 00 00 00 00 00 00 00 00 | ................
ffff881f80350860 | 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff | ................
ffff881f80350870 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350880 | 00 00 00 00 00 00 00 00 88 08 35 80 1f 88 ff ff | ..........5.....
ffff881f80350890 | 88 08 35 80 1f 88 ff ff 46 4d 1b 00 00 00 00 00 | ..5.....FM......
ffff881f803508a0 | 30 01 3b a0 ff ff ff ff 60 04 35 80 1f 88 ff ff | 0.;.....`.5.....
ffff881f803508b0 | 00 8b 3b e2 1f 88 ff ff 01 00 00 00 00 00 00 00 | ..;.............

Interpreted output (made by hand, I hope it's right):

zio {
    .io_bookmark = {
        .zb_objset       = 0x31,
        .zb_object       = 0x4004,
        .zb_level        = 0x0,
        .zb_blkid        = 0xfffffffffffffffe
    },
    .io_prop = {
        .zp_checksum     = ZIO_CHECKSUM_FLETCHER_4,
        .zp_compress     = ZIO_COMPRESS_LZ4,
        .zp_type         = DMU_OT_SA,
        .zp_level        = 0x0,
        .zp_copies       = 0x2,
        .zp_dedup        = B_FALSE,
        .zp_dedup_verify = B_FALSE,
        .zp_nopwrite     = B_FALSE
    },
    .io_type             = ZIO_TYPE_WRITE,
    .io_child_type       = ZIO_CHILD_LOGICAL,
    .io_cmd              = 0x0,
    .io_priority         = ZIO_PRIORITY_ASYNC_WRITE,
    .io_reexecute        = 0x0,
    .io_state            = { 0x0, 0x0 },
    .io_txg              = 0x563,
    .io_spa              = 0xffff880fdba02000,
    .io_bp               = 0xffff881b922b8980,
    .io_bp_override      = 0x0,
    .io_bp_copy = {

    },
    .io_parent_list = {
        .list_size       = 0x30,
        .list_offset     = 0x10,
        .list_head       = {
            .next        = 0xffff8802fe3fc5b0,
            .prev        = 0xffff8802fe3fc5b0
        },
    },
    .io_child_list = {
        .list_size       = 0x30,
        .list_offset     = 0x20,
        .list_head = {
            .next        = 0xffff8805d33fdd00,
            .prev        = 0xffff8805d33fdd00
        }
    },
    .io_walk_link        = 0x0,
    .io_logical          = 0xffff8805d33fdbe0,
    .io_transform_stack  = 0xffff8807f270fe80,
    .io_ready            = 0xffffffffa03fadd0,
    .io_physdone         = 0xffffffffa03f80a0,
    .io_done             = 0xffffffffa03ff380,
    .io_private          = 0xffff881ac0f58900,
    .io_prev_space_delta = 0x0,
    .io_bp_orig = {

    },
    .io_data             = 0xffff881b6f918800,
    .io_orig_data        = 0xffff8805c8403c00,
    .io_size             = 0x200,
    .io_orig_size        = 0x400,
    .io_vd               = 0x0,
    .io_vsd              = 0x0,
    .io_vsd_ops          = 0x0,
    .io_offset           = 0x0,
    .io_timestamp        = 0x0,
    .io_delta            = 0x0,
    .io_delay            = 0x0,
    .io_queue_node = {
        .avl_child       = { 0x0, 0x0 },
        .avl_pcb         = 0x0
    },
    .io_offset_node = {
        .avl_child       = { 0x0, 0x0 },
        .avl_pcb         = 0x0
    },
    .io_flags            = 0x0,
    .io_stage            = ZIO_STAGE_READY,
    .io_pipeline         = ZIO_STAGE_ISSUE_ASYNC |
                           ZIO_STAGE_WRITE_BP_INIT |
                           ZIO_STAGE_CHECKSUM_GENERATE |
                           ZIO_STAGE_DVA_ALLOCATE |
                           ZIO_STAGE_READY |
                           ZIO_STAGE_VDEV_IO_START |
                           ZIO_STAGE_VDEV_IO_DONE |
                           ZIO_STAGE_VDEV_IO_ASSESS |
                           ZIO_STAGE_DONE,
    .io_orig_flags       = 0x0,
    .io_orig_stage       = ZIO_STAGE_OPEN,
    .io_orig_pipeline    = ZIO_STAGE_ISSUE_ASYNC |
                           ZIO_STAGE_WRITE_BP_INIT |
                           ZIO_STAGE_CHECKSUM_GENERATE |
                           ZIO_STAGE_DVA_ALLOCATE |
                           ZIO_STAGE_READY |
                           ZIO_STAGE_VDEV_IO_START |
                           ZIO_STAGE_VDEV_IO_DONE |
                           ZIO_STAGE_VDEV_IO_ASSESS |
                           ZIO_STAGE_DONE,
    .io_error            = 0x0,
    .io_child_error      = { 0x0, 0x0, 0x0, 0x0 },
    .io_children         = { { 0x0, 0x0 }, { 0x0, 0x0 }, { 0x0, 0x0 }, { 0x0, 0x0 } },
    .io_child_count      = 0x0,
    .io_phys_children    = 0x0,
    .io_parent_count     = 0x1,
    .io_stall            = 0x0,
    .io_gang_leader      = 0xffff8805d33fdbe0,
    .io_gang_tree        = 0x0,
    .io_executor         = 0xffff881fe54495a0,
    .io_waiter           = 0x0,
    .io_lock = { },
    .io_cv = { },
    .io_chsum_report
    .io_ena
    .io_tqent
}

Hope it has some clue...

xhernandez · 2015-11-05T17:06:24Z

Not sure if it helps, but zio->io_data and zio->io_orig_data seem to contain extended attributes and zio->io_data seems corrupted (at least the header is quite different):

zio->io_orig_data

ffff8804e1fa5400 | 5a 50 2f 00 04 04 78 03 01 01 00 00 00 00 00 00 | ZP/...x.........
ffff8804e1fa5410 | 00 00 00 01 00 00 00 30 00 00 00 30 00 00 00 0c | .......0...0....
ffff8804e1fa5420 | 74 72 75 73 74 65 64 2e 67 66 69 64 00 00 00 0a | trusted.gfid....
ffff8804e1fa5430 | 00 00 00 10 5c bf 13 17 91 96 48 24 a4 cc 9d 92 | ....\.....H$....
ffff8804e1fa5440 | 60 22 d8 97 00 00 00 80 00 00 00 88 00 00 00 18 | `"..............
ffff8804e1fa5450 | 73 79 73 74 65 6d 2e 70 6f 73 69 78 5f 61 63 6c | system.posix_acl
ffff8804e1fa5460 | 5f 64 65 66 61 75 6c 74 00 00 00 0a 00 00 00 54 | _default.......T
ffff8804e1fa5470 | 02 00 00 00 01 00 07 00 ff ff ff ff 02 00 07 00 | ................
ffff8804e1fa5480 | c4 09 00 00 02 00 07 00 c2 c6 2d 00 04 00 00 00 | ..........-.....
ffff8804e1fa5490 | ff ff ff ff 08 00 07 00 04 00 00 00 08 00 07 00 | ................
ffff8804e1fa54a0 | d0 09 00 00 08 00 07 00 24 0c 00 00 08 00 07 00 | ........$.......
ffff8804e1fa54b0 | c2 c6 2d 00 10 00 07 00 ff ff ff ff 20 00 00 00 | ..-......... ...
ffff8804e1fa54c0 | ff ff ff ff 00 00 00 80 00 00 00 80 00 00 00 17 | ................
ffff8804e1fa54d0 | 73 79 73 74 65 6d 2e 70 6f 73 69 78 5f 61 63 6c | system.posix_acl
ffff8804e1fa54e0 | 5f 61 63 63 65 73 73 00 00 00 00 0a 00 00 00 54 | _access........T
ffff8804e1fa54f0 | 02 00 00 00 01 00 07 00 ff ff ff ff 02 00 07 00 | ................
ffff8804e1fa5500 | c4 09 00 00 02 00 07 00 c2 c6 2d 00 04 00 00 00 | ..........-.....
ffff8804e1fa5510 | ff ff ff ff 08 00 07 00 04 00 00 00 08 00 07 00 | ................
ffff8804e1fa5520 | d0 09 00 00 08 00 07 00 24 0c 00 00 08 00 07 00 | ........$.......
ffff8804e1fa5530 | c2 c6 2d 00 10 00 07 00 ff ff ff ff 20 00 00 00 | ..-......... ...
ffff8804e1fa5540 | ff ff ff ff 00 00 00 3c 00 00 00 38 00 00 00 15 | .......<...8....
ffff8804e1fa5550 | 74 72 75 73 74 65 64 2e 67 6c 75 73 74 65 72 66 | trusted.glusterf
ffff8804e1fa5560 | 73 2e 64 68 74 00 00 00 00 00 00 0a 00 00 00 10 | s.dht...........
ffff8804e1fa5570 | 00 00 00 01 00 00 00 00 00 00 00 00 52 a6 66 df | ............R.f.
ffff8804e1fa5580 | 00 00 00 a8 00 00 00 a8 00 00 00 17 74 72 75 73 | ............trus
ffff8804e1fa5590 | 74 65 64 2e 53 47 49 5f 41 43 4c 5f 44 45 46 41 | ted.SGI_ACL_DEFA
ffff8804e1fa55a0 | 55 4c 54 00 00 00 00 0a 00 00 00 7c 00 00 00 0a | ULT........|....
ffff8804e1fa55b0 | 00 00 00 01 ff ff ff ff 00 07 00 00 00 00 00 02 | ................
ffff8804e1fa55c0 | 00 00 09 c4 00 07 00 00 00 00 00 02 00 2d c6 c2 | .............-..
ffff8804e1fa55d0 | 00 07 00 00 00 00 00 04 ff ff ff ff 00 00 00 00 | ................
ffff8804e1fa55e0 | 00 00 00 08 00 00 00 04 00 07 00 00 00 00 00 08 | ................
ffff8804e1fa55f0 | 00 00 09 d0 00 07 00 00 00 00 00 08 00 00 0c 24 | ...............$
ffff8804e1fa5600 | 00 07 00 00 00 00 00 08 00 2d c6 c2 00 07 00 00 | .........-......
ffff8804e1fa5610 | 00 00 00 10 ff ff ff ff 00 07 00 00 00 00 00 20 | ...............
ffff8804e1fa5620 | ff ff ff ff 00 00 00 00 00 00 00 a4 00 00 00 a8 | ................
ffff8804e1fa5630 | 00 00 00 14 74 72 75 73 74 65 64 2e 53 47 49 5f | ....trusted.SGI_
ffff8804e1fa5640 | 41 43 4c 5f 46 49 4c 45 00 00 00 0a 00 00 00 7c | ACL_FILE.......|
ffff8804e1fa5650 | 00 00 00 0a 00 00 00 01 ff ff ff ff 00 07 00 00 | ................
ffff8804e1fa5660 | 00 00 00 02 00 00 09 c4 00 07 00 00 00 00 00 02 | ................
ffff8804e1fa5670 | 00 2d c6 c2 00 07 00 00 00 00 00 04 ff ff ff ff | .-..............
ffff8804e1fa5680 | 00 00 00 00 00 00 00 08 00 00 00 04 00 07 00 00 | ................
ffff8804e1fa5690 | 00 00 00 08 00 00 09 d0 00 07 00 00 00 00 00 08 | ................
ffff8804e1fa56a0 | 00 00 0c 24 00 07 00 00 00 00 00 08 00 2d c6 c2 | ...$.........-..
ffff8804e1fa56b0 | 00 07 00 00 00 00 00 10 ff ff ff ff 00 07 00 00 | ................
ffff8804e1fa56c0 | 00 00 00 20 ff ff ff ff 00 00 00 00 00 00 00 34 | ... ...........4
ffff8804e1fa56d0 | 00 00 00 38 00 00 00 11 74 72 75 73 74 65 64 2e | ...8....trusted.
ffff8804e1fa56e0 | 61 66 72 2e 64 69 72 74 79 00 00 00 00 00 00 0a | afr.dirty.......
ffff8804e1fa56f0 | 00 00 00 0c 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa5700 | 00 00 00 3c 00 00 00 40 00 00 00 19 74 72 75 73 | ...<...@....trus
ffff8804e1fa5710 | 74 65 64 2e 61 66 72 2e 73 61 74 61 2d 63 6c 69 | ted.afr.sata-cli
ffff8804e1fa5720 | 65 6e 74 2d 31 00 00 00 00 00 00 0a 00 00 00 0c | ent-1...........
ffff8804e1fa5730 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3c | ...............<
ffff8804e1fa5740 | 00 00 00 40 00 00 00 19 74 72 75 73 74 65 64 2e | ...@....trusted.
ffff8804e1fa5750 | 61 66 72 2e 73 61 74 61 2d 63 6c 69 65 6e 74 2d | afr.sata-client-
ffff8804e1fa5760 | 33 00 00 00 00 00 00 0a 00 00 00 0c 00 00 00 00 | 3...............
ffff8804e1fa5770 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa5780 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa5790 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57a0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57b0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57c0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57d0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57e0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57f0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

zio->io_data

ffff881babdf9400 | 00 00 01 ac b4 5a 50 2f 00 04 04 78 03 01 01 00 | .....ZP/...x....
ffff881babdf9410 | 01 00 00 0a 00 13 30 04 00 f3 3a 0c 74 72 75 73 | ......0...:.trus
ffff881babdf9420 | 74 65 64 2e 67 66 69 64 00 00 00 0a 00 00 00 10 | ted.gfid........
ffff881babdf9430 | 5c bf 13 17 91 96 48 24 a4 cc 9d 92 60 22 d8 97 | \.....H$....`"..
ffff881babdf9440 | 00 00 00 80 00 00 00 88 00 00 00 18 73 79 73 74 | ............syst
ffff881babdf9450 | 65 6d 2e 70 6f 73 69 78 5f 61 63 6c 5f 64 65 66 | em.posix_acl_def
ffff881babdf9460 | 61 75 6c 74 3c 00 21 54 02 61 00 e0 07 00 ff ff | ault<.!T.a......
ffff881babdf9470 | ff ff 02 00 07 00 c4 09 00 00 08 00 71 c2 c6 2d | ............q..-
ffff881babdf9480 | 00 04 00 00 18 00 31 08 00 07 0c 00 00 08 00 22 | ......1........"
ffff881babdf9490 | d0 09 08 00 22 24 0c 08 00 00 28 00 13 10 40 00 | ...."$....(...@.
ffff881babdf94a0 | 13 20 30 00 03 80 00 00 84 00 1d 17 80 00 60 61 | . 0...........`a
ffff881babdf94b0 | 63 63 65 73 73 dc 00 00 bc 00 0f 80 00 45 95 3c | ccess........E.<
ffff881babdf94c0 | 00 00 00 38 00 00 00 15 30 01 10 6c 38 01 60 72 | ...8....0..l8.`r
ffff881babdf94d0 | 66 73 2e 64 68 fd 00 04 3c 01 01 ff 00 01 0f 00 | fs.dh...<.......
ffff881babdf94e0 | a3 00 00 52 a6 66 df 00 00 00 a8 04 00 14 17 3c | ...R.f.........<
ffff881babdf94f0 | 00 f0 00 53 47 49 5f 41 43 4c 5f 44 45 46 41 55 | ...SGI_ACL_DEFAU
ffff881babdf9500 | 4c 54 2b 00 00 bc 00 13 7c 44 00 10 01 3c 01 20 | LT+.....|D...<.
ffff881babdf9510 | 00 07 17 00 65 00 02 00 00 09 c4 0c 00 33 2d c6 | ....e........3-.
ffff881babdf9520 | c2 0c 00 11 04 24 00 01 23 00 63 00 08 00 00 00 | .....$..#.c.....
ffff881babdf9530 | 04 18 00 56 08 00 00 09 d0 0c 00 25 0c 24 0c 00 | ...V.......%.$..
ffff881babdf9540 | 06 3c 00 11 10 3c 00 02 60 00 11 20 0c 00 02 48 | .<...<..`.. ...H
ffff881babdf9550 | 00 13 a4 a8 00 1c 14 a8 00 43 46 49 4c 45 9c 00 | .........CFILE..
ffff881babdf9560 | 0f a4 00 6d 13 34 88 01 14 11 a4 00 92 61 66 72 | ...m.4.......afr
ffff881babdf9570 | 2e 64 69 72 74 79 c4 00 00 48 01 12 0c 0b 00 05 | .dirty...H......
ffff881babdf9580 | 02 00 00 bc 01 58 40 00 00 00 19 34 00 d2 73 61 | .....X@....4..sa
ffff881babdf9590 | 74 61 2d 63 6c 69 65 6e 74 2d 31 2b 00 0f 3c 00 | ta-client-1+..<.
ffff881babdf95a0 | 22 1f 33 3c 00 07 0f 02 00 6d 50 00 00 00 00 00 | ".3<.....mP.....
ffff881babdf95b0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881babdf95c0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881babdf95d0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881babdf95e0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881babdf95f0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

dweeezil · 2015-11-05T17:13:48Z

@xhernandez That's all great information! The on-disk copy of the dnode, according to your previous zdb output, is just fine; the single blkptr is clearly set as embedded (otherwise zdb wouldn't display it as such). After looking through the code a bit more earlier today, I had kind of come to the conclusion that the bogus size was being computed due to an improperly constructed embedded data blkptr.

One very interesting outcome of the embedded data blkptr feature, which may very well have nothing to do with this issue, is that when a spill block is needed, sometimes the contents of the spill block can, itself, be squeezed into an embedded data blkptr which eliminates the need for a "true" spill block.

I've just started going over the data in your posting but I'm wondering if maybe this directory is being expanded to the point where it needs another data block to hold the zap and then also at the same time, the SA is getting kicked into a (embedded) spill block. As I mentioned, your on-disk copy according to zdb is a perfectly well-formed empty directory but the bug you're seeing is getting tripped when some subsequent operations are performed on it.

dweeezil · 2015-11-05T17:14:54Z

@xhernandez I almost forgot to ask: What are the contents of this directory on the source system? Is it empty? Does it contain a lot of files?

xhernandez · 2015-11-05T17:32:25Z

The last directory that failed contains only 2 regular files. In this case it seems that both files were already stored on disk before the crash (or the user process was able to continue creating files after the problem, but I doubt it). The zdb output for this one is:

Dataset pool-sata/brick2 [ZPL], ID 49, cr_txg 6, 28.4G, 19899 objects, rootbp DVA[0]=<0:4029ffbc00:400> DVA[1]=<0:42002413000:400> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=1457L/1457P fill=19899 cksum=1398d95339:69d9f4e9a4a:130643bfe43b7:268bd18915bf26

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     18630    1    16K    512      0    512  100.00  ZFS directory (K=inherit) (Z=inherit)
                                        244   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    <hidden>
        uid     2502
        gid     2513
        atime   Thu Nov  5 13:27:38 2015
        mtime   Thu Nov  5 13:39:18 2015
        ctime   Thu Nov  5 13:39:18 2015
        crtime  Thu Nov  5 13:27:38 2015
        gen     1435
        mode    40770
        size    4
        parent  12602
        links   2
        pflags  40800000044
        SA xattrs: 68 bytes, 1 entries

                trusted.gfid = \\277\023\027\221\226H$\244\314\235\222`"\330\227
        microzap: 512 bytes, 2 entries

                <file1> = 18655 (type: Regular File)
                <file2> = 18657 (type: Regular File)
Indirect blocks:
               0 L0 EMBEDDED et=0 200L/6eP B=1446

                segment [0000000000000000, 0000000000000200) size   512

I need to check it, but I'm pretty sure that after replicating the contents of a directory, the process updates some extended attributes of the parent directory. Maybe this is where the problem happens, at least this time.

dweeezil · 2015-11-05T18:06:34Z

@xhernandez Could you please post the output of zdb -dddd 5 6. I'm thinking your corrupted-looking zio buffer might be a spill block. Object 6 should show us the SA layouts it's trying to use and I'm curious if there are any layouts with only 2 entries.

dweeezil · 2015-11-05T18:07:29Z

@xhernandez Correction, please do zdb -dddd <pool>/<fs> 5 6.

xhernandez · 2015-11-05T21:11:41Z

# zdb -dddd pool-sata/brick2 5 6
Dataset pool-sata/brick2 [ZPL], ID 49, cr_txg 6, 28.4G, 19899 objects, rootbp DVA[0]=<0:4029ffbc00:400> DVA[1]=<0:42002413000:400> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=1457L/1457P fill=19899 cksum=1398d95339:69d9f4e9a4a:130643bfe43b7:268bd18915bf26

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         5    1    16K  1.50K  1.50K  1.50K  100.00  SA attr registration
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        microzap: 1536 bytes, 21 entries

                ZPL_PARENT =  8000007 : [8:0:7]
                ZPL_DACL_ACES =  40013 : [0:4:19]
                ZPL_UID =  800000c : [8:0:12]
                ZPL_DACL_COUNT =  8000010 : [8:0:16]
                ZPL_ATIME =  10000000 : [16:0:0]
                ZPL_LINKS =  8000008 : [8:0:8]
                ZPL_SYMLINK =  30011 : [0:3:17]
                ZPL_RDEV =  800000a : [8:0:10]
                ZPL_CRTIME =  10000003 : [16:0:3]
                ZPL_GEN =  8000004 : [8:0:4]
                ZPL_DXATTR =  30014 : [0:3:20]
                ZPL_CTIME =  10000002 : [16:0:2]
                ZPL_MTIME =  10000001 : [16:0:1]
                ZPL_SCANSTAMP =  20030012 : [32:3:18]
                ZPL_GID =  800000d : [8:0:13]
                ZPL_FLAGS =  800000b : [8:0:11]
                ZPL_PAD =  2000000e : [32:0:14]
                ZPL_ZNODE_ACL =  5803000f : [88:3:15]
                ZPL_SIZE =  8000006 : [8:0:6]
                ZPL_XATTR =  8000009 : [8:0:9]
                ZPL_MODE =  8000005 : [8:0:5]

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         6    1    16K    16K  10.0K    32K  100.00  SA attr layouts
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 1
        Fat ZAP stats:
                Pointer table:
                        1024 elements
                        zt_blk: 0
                        zt_numblks: 0
                        zt_shift: 10
                        zt_blks_copied: 0
                        zt_nextblk: 0
                ZAP entries: 5
                Leaf blocks: 1
                Total blocks: 2
                zap_block_type: 0x8000000000000001
                zap_magic: 0x2f52ab2ab
                zap_salt: 0x347d676d
                Leafs with 2^n pointers:
                          9:      1 *
                Blocks with n*5 entries:
                          1:      1 *
                Blocks n/10 full:
                          1:      1 *
                Entries with n chunks:
                          3:      2 **
                          4:      3 ***
                Buckets with n entries:
                          0:    507 ****************************************
                          1:      5 *

                4 = [ 20 ]
                3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  20 ]
                6 = [ 17 ]
                2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]
                5 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  17 ]
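
The bracketed triples in the object 5 listing can be read straight off the hex value next to them. Judging only from the entries shown (for example 5803000f : [88:3:15] and 40013 : [0:4:19]), the low 16 bits give the last number, the next 8 bits the middle one, and the bits above that the first one. The sketch below encodes that inference; the field names (length, bswap, num) and the sketch_ identifiers are my assumptions, not something the zdb output states.

```c
#include <stdint.h>

/* Hypothetical names for the three fields printed as [a:b:c]. */
typedef struct sketch_sa_reg {
    unsigned length; /* first number in the brackets  */
    unsigned bswap;  /* second number in the brackets */
    unsigned num;    /* third number in the brackets  */
} sketch_sa_reg_t;

/* Split a registration value using the bit positions inferred above. */
sketch_sa_reg_t sketch_decode_sa_reg(uint64_t value)
{
    sketch_sa_reg_t f;

    f.num = (unsigned)(value & 0xffff);            /* bits 0..15  */
    f.bswap = (unsigned)((value >> 16) & 0xff);    /* bits 16..23 */
    f.length = (unsigned)((value >> 24) & 0xffff); /* bits 24..39 */
    return f;
}
```

With this reading, the numbers in the object 6 layouts (for example 19 and 20) line up with the third field of ZPL_DACL_ACES and ZPL_DXATTR.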

xhernandez · 2015-11-06T11:57:39Z

I've seen that both zio->io_bp_copy and zio->io_bp_orig are equal, but the only place where they are set explicitly to the same value is zio_create() (at least I haven't been able to locate any other place).

I've added a check in zio_create() to verify that the passed blkptr_t is ok and it failed:

kernel: [ 2423.449656] PANIC at zio.c:542:zio_create()
kernel: [ 2423.449843] Showing stack for process 5263
kernel: [ 2423.449848] CPU: 11 PID: 5263 Comm: txg_sync Tainted: P           O--------------   3.10.0-1-pve #1
kernel: [ 2423.449850] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
kernel: [ 2423.449854]  ffffffffa0d6e180 ffff880fe4e994f8 ffffffff8161006a ffff880fe4e99508
kernel: [ 2423.449862]  ffffffffa0357764 ffff880fe4e996a8 ffffffffa035799d ffff880fe4e99568
kernel: [ 2423.449867]  ffffffffa0c975d0 ffff881a00000030 ffff880fe4e996b8 ffff880fe4e99648
kernel: [ 2423.449873] Call Trace:
kernel: [ 2423.449895]  [<ffffffff8161006a>] dump_stack+0x19/0x1b
kernel: [ 2423.449910]  [<ffffffffa0357764>] spl_dumpstack+0x44/0x50 [spl]
kernel: [ 2423.449919]  [<ffffffffa035799d>] spl_panic+0xbd/0x100 [spl]
kernel: [ 2423.449971]  [<ffffffffa0c975d0>] ? spa_taskq_dispatch_ent+0x90/0x120 [zfs]
kernel: [ 2423.449980]  [<ffffffffa0355e96>] ? taskq_dispatch_ent+0x66/0x170 [spl]
kernel: [ 2423.450022]  [<ffffffffa0d041c0>] ? zio_taskq_member.isra.4+0xa0/0xa0 [zfs]
kernel: [ 2423.450060]  [<ffffffffa0c975d0>] ? spa_taskq_dispatch_ent+0x90/0x120 [zfs]
kernel: [ 2423.450068]  [<ffffffff811890f5>] ? kmem_cache_alloc+0x35/0x1e0
kernel: [ 2423.450076]  [<ffffffffa0353e19>] ? spl_kmem_cache_alloc+0x69/0x150 [spl]
kernel: [ 2423.450084]  [<ffffffffa0353e19>] ? spl_kmem_cache_alloc+0x69/0x150 [spl]
kernel: [ 2423.450122]  [<ffffffffa0d07b66>] zio_create+0x236/0x800 [zfs]
kernel: [ 2423.450161]  [<ffffffffa0d0885b>] zio_write+0x12b/0x1f0 [zfs]
kernel: [ 2423.450185]  [<ffffffffa0c20380>] ? l2arc_feed_thread+0x780/0x780 [zfs]
kernel: [ 2423.450206]  [<ffffffffa0c1af60>] arc_write+0x130/0x290 [zfs]
kernel: [ 2423.450227]  [<ffffffffa0c1bdd0>] ? arc_cksum_compute.isra.10+0x130/0x130 [zfs]
kernel: [ 2423.450249]  [<ffffffffa0c190a0>] ? arc_evictable_memory+0x80/0x80 [zfs]
kernel: [ 2423.450273]  [<ffffffffa0c20380>] ? l2arc_feed_thread+0x780/0x780 [zfs]
kernel: [ 2423.450299]  [<ffffffffa0c2d1cf>] dbuf_write.isra.10+0x23f/0x6c0 [zfs]
kernel: [ 2423.450322]  [<ffffffffa0c2ca00>] ? dbuf_destroy+0x490/0x490 [zfs]
kernel: [ 2423.450344]  [<ffffffffa0c2b9d0>] ? dbuf_set_data+0x100/0x100 [zfs]
kernel: [ 2423.450366]  [<ffffffffa0c307f0>] ? dbuf_read_done+0x2d0/0x2d0 [zfs]
kernel: [ 2423.450401]  [<ffffffffa0c89313>] ? refcount_add_many+0xb3/0x150 [zfs]
kernel: [ 2423.450424]  [<ffffffffa0c33507>] dbuf_sync_leaf+0x197/0x910 [zfs]
kernel: [ 2423.450462]  [<ffffffffa0d09f90>] ? zio_nowait+0x190/0x310 [zfs]
kernel: [ 2423.450485]  [<ffffffffa0c33d3c>] ? dbuf_sync_list+0xbc/0x160 [zfs]
kernel: [ 2423.450508]  [<ffffffffa0c33d65>] dbuf_sync_list+0xe5/0x160 [zfs]
kernel: [ 2423.450538]  [<ffffffffa0c5d33d>] dnode_sync+0x51d/0xfc0 [zfs]
kernel: [ 2423.450573]  [<ffffffffa0c893c6>] ? refcount_add+0x16/0x20 [zfs]
kernel: [ 2423.450600]  [<ffffffffa0c44107>] dmu_objset_sync_dnodes+0x97/0x1f0 [zfs]
kernel: [ 2423.450626]  [<ffffffffa0c44415>] dmu_objset_sync+0x1b5/0x450 [zfs]
kernel: [ 2423.450651]  [<ffffffffa0c42430>] ? dmu_objset_userspace_present+0x20/0x20 [zfs]
kernel: [ 2423.450676]  [<ffffffffa0c42e40>] ? copies_changed_cb+0xa0/0xa0 [zfs]
kernel: [ 2423.450706]  [<ffffffffa0c662e2>] dsl_dataset_sync+0x82/0x160 [zfs]
kernel: [ 2423.450738]  [<ffffffffa0c72e6f>] dsl_pool_sync+0xef/0x5d0 [zfs]
kernel: [ 2423.450773]  [<ffffffffa0c9496d>] spa_sync+0x46d/0xdf0 [zfs]
kernel: [ 2423.450780]  [<ffffffff81089495>] ? __wake_up_common+0x55/0x90
kernel: [ 2423.450786]  [<ffffffff81019ae9>] ? read_tsc+0x9/0x20
kernel: [ 2423.450824]  [<ffffffffa0cac176>] txg_sync_thread+0x3d6/0x700 [zfs]
kernel: [ 2423.450860]  [<ffffffffa0cabda0>] ? txg_quiesce_thread+0x500/0x500 [zfs]
kernel: [ 2423.450869]  [<ffffffffa0354948>] thread_generic_wrapper+0x78/0x90 [spl]
kernel: [ 2423.450877]  [<ffffffffa03548d0>] ? spl_vmem_fini+0x10/0x10 [spl]
kernel: [ 2423.450883]  [<ffffffff81080700>] kthread+0xc0/0xd0
kernel: [ 2423.450887]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80
kernel: [ 2423.450894]  [<ffffffff8162022c>] ret_from_fork+0x7c/0xb0
kernel: [ 2423.450898]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80

dweeezil · 2015-11-06T13:14:38Z

@xhernandez Could you please post the contents (likely as a gist since it might be long) of /proc/spl/kstat/zfs/dbgmsg when the problem happens. Since you're running a debug build, it should be recording the debug messages there.

dweeezil · 2015-11-06T13:25:47Z

@xhernandez What check did you perform against the blkptr to trigger the panic above?

xhernandez · 2015-11-06T21:19:06Z

@dweeezil I've uploaded the dbgmsg file after a panic.

I did a simplified check to detect the corruption I'm seeing in my tests:

for (i = 0; i < BP_GET_NDVAS(bp); i++) {
    ASSERT3U(DVA_GET_VDEV(&bp->blk_dva[i]), <, 2);
}

When it failed, the vdev was 0xC000000.
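
To see how attribute bytes can produce exactly this vdev number, here is a small sketch that reinterprets 16 bytes of the XDR-encoded xattr data from the dump earlier in this thread (the 8 bytes at ffff8804e1fa5418 followed by "trusted." at ffff8804e1fa5420) as a little-endian DVA. The field layout used here (vdev in the upper 32 bits of word 0, offset in the low 63 bits of word 1 shifted left by 9) is my understanding of the DVA macros of that era and should be treated as an assumption, as should the choice of these particular bytes; all sketch_ names are hypothetical.

```c
#include <stdint.h>

/* Simplified stand-in for a DVA: two 64-bit words (assumed layout). */
typedef struct sketch_dva {
    uint64_t word[2];
} sketch_dva_t;

/* Read 8 bytes as a little-endian 64-bit integer, as an x86 CPU would. */
uint64_t sketch_load_le64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Overlay 16 raw bytes with the two DVA words. */
sketch_dva_t sketch_dva_from_bytes(const unsigned char raw[16])
{
    sketch_dva_t d;

    d.word[0] = sketch_load_le64(raw);
    d.word[1] = sketch_load_le64(raw + 8);
    return d;
}

/* Assumed: the vdev id lives in the upper 32 bits of word 0. */
uint64_t sketch_dva_vdev(const sketch_dva_t *d)
{
    return d->word[0] >> 32;
}

/* Assumed: the offset is the low 63 bits of word 1, in 512-byte units. */
uint64_t sketch_dva_offset(const sketch_dva_t *d)
{
    return (d->word[1] & 0x7fffffffffffffffULL) << 9;
}
```

Fed with 00 00 00 30 00 00 00 0c 74 72 75 73 74 65 64 2e, this yields vdev 0xC000000 and offset 0xC8CAE8E6EAE4E800, the two values from the original panic message. In that reading, the 0x0c is the big-endian length of the name "trusted.gfid".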

xhernandez · 2015-11-10T09:57:11Z

@dweeezil I've traced back the blkptr_t that gets corrupted and I see something that I'm not sure is right (I've just started to dig into the zfs code).

The blkptr_t belongs to a dmu_buf_impl_t that is added simultaneously to two different lists of dn->dn_dirty_records[x] in dbuf_dirty(). Both are added because db->db_blkid is DMU_BONUS_BLKID or DMU_SPILL_BLKID.

Later, the first one is removed from the list and dbuf_sync_leaf() is called. This leads to dbuf_write(). Before the dbuf_write(), the blkptr_t is ok.

Some time later, the second dirty record is removed from the list and dbuf_sync_leaf() is called. At this point, the blkptr_t is already corrupted.

Timing of the events (in seconds):

2975.674965: The first dirty record referencing the dmu_buf_impl_t is added to a list
2975.675777: The second dirty record referencing the same dmu_buf_impl_t is added to a list
2975.677525: The first dirty record is used in dbuf_sync_leaf().
2976.093830: The second dirty record is used in dbuf_sync_leaf().

Could this be the cause of the problem or is this a normal behaviour ?

dweeezil · 2015-11-10T13:29:03Z

@xhernandez Interesting bit of tracing there. I'll try to get back on this today. It was fairly clear to me that the problem is occurring either as part of the transition of a blkptr to/from BP_IS_EMBEDDED() or the transition to/from a dnode needing or not needing a spill block.

With embedded data blkptrs, there are a number of new ways to represent a dnode. This feels remarkably similar to the type of problem fixed by 4254acb.

samuelxhu · 2016-05-13T19:49:43Z

@xhernandez What is the current status of the debugging? I am keeping a close eye on the issues you encountered, as I may build a Gluster cluster on top of ZoL.

I just want to make sure whether this issue is a showstopper for Gluster on top of ZoL.

xhernandez · 2016-05-13T20:54:17Z

@samuelxhu Due to other priorities, I've been unable to continue debugging it until this very week. I've been able to reproduce the problem again and I expect to find enough information to solve the bug.

AFAIK the bug only happens when xattr=sa and acl's are used. We have been using Gluster with xattr=sa for a long time and we haven't seen any issue with the latest versions. However, I can't assure you that the bug won't manifest itself without acl's.

I'll post more information as soon as I have something interesting.

samuelxhu · 2016-05-14T07:53:48Z

@xhernandez Great information. Until the bug is fixed, it seems safer to use xattr=sa only and leave out the posixacl option. I just wonder why (or when) we need to set posixacl if Gluster on top of ZoL can work properly without it.

xhernandez · 2016-05-14T15:07:52Z

@samuelxhu ACL's are typically needed when you use samba on top of Gluster, for example.

xhernandez · 2016-05-26T16:23:47Z

@dweeezil I think I've found something about this bug.

The sequence of actions seems to be the following, however I'm unable to check it because I would need to force the creation of a new transaction group at certain points and I don't know how to do that.

1. Some processing is made to create an entry and add some attributes. I think the exact details of this step are not relevant for the bug.
2. After some processing, the entry has a single xattr that fits into the bonus buffer. No spill buffer is needed (though it was previously allocated because there were more xattrs, not sure if this is important). Note that this xattr uses the same space of the blkptr_t of the bonus buffer.
3. A new transaction group is created, but the previous one is not being synced yet.
4. Additional xattrs are added. The bonus buffer is dirtied. Since the previous transaction group is still pending, the current contents of this buffer are copied into a newly allocated buffer (BUF1) of the dirty record. After adding more xattrs, all of them are moved into a spill buffer. The blkptr_t for the spill buffer is taken from the corresponding address of the dnode.
5. A new transaction group is created.
6. More actions on the entry causes the bonus buffer to be dirtied and the current contents of the bonus buffer to be copied into a new buffer (BUF2) allocated for the dirty record.
7. The first transaction is processed. BUF1 is copied into the dnode memory buffer. This overwrites the blkptr_t structure stored at the end of the bonus buffer with the value of the xattr defined at step 2. Note that we were already using this blkptr_t since we have allocated a spill buffer later, at step 4.
8. From this point, any operation needing the blkptr_t of the spill buffer will have troubles. Note that even when BUF2 is copied into the dnode space, it only overwrites sa data, it doesn't touch the area reserved for blkptr_t since it's not used anymore.

Not sure if this is a complete (or correct) description of what happens or more information is needed. I still don't understand all the internals of ZFS so maybe I have misinterpreted something.
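
The overwrite described above can be replayed in a toy model. This is only a sketch under stated assumptions: the sizes (a 320-byte bonus area whose last 128 bytes double as the spill block pointer slot) reflect my understanding of the dnode layout, the byte patterns are made up, and every toy_ name is hypothetical. It ignores transaction groups entirely and just performs the copies in the order given.

```c
#include <string.h>

#define TOY_BONUS_LEN  320 /* assumed size of the bonus area     */
#define TOY_BLKPTR_LEN 128 /* assumed size of one block pointer  */

/* Toy dnode tail: the bonus area, whose end is also the spill slot. */
typedef struct toy_dnode {
    unsigned char bonus[TOY_BONUS_LEN];
} toy_dnode_t;

/* The spill block pointer slot overlaps the end of the bonus area. */
unsigned char *toy_spill_slot(toy_dnode_t *dn)
{
    return dn->bonus + TOY_BONUS_LEN - TOY_BLKPTR_LEN;
}

/* Replay the copies; returns 1 if the spill slot ends up clobbered. */
int toy_replay_scenario(void)
{
    toy_dnode_t dn;
    unsigned char buf1[TOY_BONUS_LEN];
    unsigned char good_bp[TOY_BLKPTR_LEN];

    /* A single xattr fills the bonus area, including the slot. */
    memset(dn.bonus, 't', sizeof (dn.bonus));

    /* The bonus buffer is dirtied: its contents are saved as BUF1. */
    memcpy(buf1, dn.bonus, sizeof (buf1));

    /* Xattrs move to a spill block: the slot now holds a real pointer. */
    memset(good_bp, 0xAB, sizeof (good_bp));
    memcpy(toy_spill_slot(&dn), good_bp, sizeof (good_bp));

    /* The older txg syncs: BUF1 is copied over the whole bonus area. */
    memcpy(dn.bonus, buf1, sizeof (buf1));

    /* The slot now contains xattr bytes instead of the pointer. */
    return memcmp(toy_spill_slot(&dn), good_bp, sizeof (good_bp)) != 0;
}
```

In this model the final copy is the only step that damages the slot, which matches the observation that the pointer is fine before the first dbuf_write() and corrupted afterwards.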

hsepeng · 2016-06-01T15:58:45Z

@javenwu and I have been working on a fix for this bug. Our latest patch is in the diff file below, which includes detailed comments about how the bug is solved.
Please help review it. Thanks for all your efforts and valuable comments.

coderevie.txt

ahrens · 2016-06-01T16:13:26Z

@hsepeng The change (in coderevie.txt) makes sense. Can you explain (in the comment) the code path that leads to trying to zio_free() the garbage dn_spill?

The comments could use some wordsmithing. Let me know if you want help with that.

Date: Thur Jun 2 13:59:06 2016 +0800

Fix the PANIC: metaslab_free_dva(): bad DVA with zfs openzfs#3937

The panic was introduced by the following scenario: in the previous transaction group, the bonus buffer was entirely used to store the attributes for the dnode, which overwrote the dn_spill field. However, when more attributes are added to the file, a spill block is needed to hold the extra attributes that overflow the bonus buffer. Make sure to clear the garbage left in the dn_spill field (the previous attributes from the bonus buffer); otherwise, after writing the spill block data out to the newly allocated DVA, ZFS will try to free the old block pointed to by the invalid dn_spill, which triggers the panic.
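
A minimal sketch of the idea in that commit message, using toy types rather than the real ZFS structures: if the dnode's flag says no spill block pointer has been recorded, whatever sits in the spill slot is leftover attribute data, so the dbuf's pointer is reset before the flag is set. The toy_ names and the flag value are hypothetical, locking is left out, and this is not the actual patch.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_FLAG_SPILL_BLKPTR 0x04u /* hypothetical flag bit */

typedef struct toy_blkptr {
    uint64_t words[16];
} toy_blkptr_t;

/* Toy on-disk dnode: a flags byte plus the spill pointer slot. */
typedef struct toy_dnode_phys {
    uint8_t flags;
    toy_blkptr_t spill;
} toy_dnode_phys_t;

/* Toy dbuf: only the block pointer reference matters here. */
typedef struct toy_dbuf {
    toy_blkptr_t *blkptr;
} toy_dbuf_t;

/* Run before writing out a spill dbuf. */
void toy_prepare_spill_sync(toy_dnode_phys_t *dnp, toy_dbuf_t *db)
{
    if ((dnp->flags & TOY_FLAG_SPILL_BLKPTR) == 0) {
        /* The slot held attribute bytes, not a pointer: forget it. */
        db->blkptr = NULL;
    }
    dnp->flags |= TOY_FLAG_SPILL_BLKPTR;
}
```

With the pointer reset, there is no "old block" for the write path to free, so the garbage never reaches the free routine.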

xhernandez · 2016-06-02T06:22:24Z

@hsepeng are you sure that checking the dn_flags inside the mutex is necessary?

If flags could be set concurrently, then this solution is not valid because another thread might set DNODE_FLAG_SPILL_BLKPTR before it's checked, even if the check is inside the mutex. In this case db->db_blkptr won't be cleared and the bug will appear again.

I don't think this possibility (concurrent updates) makes sense, because we are preparing the dnode to be written to disk, so there shouldn't be parallel updates.

Additionally, reading an aligned integer is an atomic operation (and even if it weren't atomic, we are only testing a single bit, independently of the others).

hsepeng · 2016-06-02T06:38:33Z

@xhernandez I agree with you that the clearing and setting of dn_flags happen in the same thread context, without concurrent updates, in the current code base.
I put the dn_flags test and set under mutex protection for code maintainability and as a just-in-case measure for the future, since the overhead is negligible.

xhernandez · 2016-06-02T07:07:50Z

@hsepeng if in the future the flag is touched anywhere else, this piece of code will need to be changed also, or the bug will appear again. Note that other places where dn_flags is checked are done outside the mutex protection.

If sometime a change is made that sets the flags in another place, the mutex doesn't guarantee anything. It can be set before the mutex is entered, so the check will fail and db->db_blkptr won't be cleared. Additionally, if it has already been set it's not necessary to set it again. So in all cases we can move the check of the flag outside the mutex safely.

It could even be considered to use an atomic set or test_and_set operation to completely remove the mutex, but this would require a bigger change.

Having unneeded mutexes might increase the risk of deadlocks if lock order is not correctly checked in all places. Having fewer mutexes minimizes this problem and simplifies future changes.
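
As a rough illustration of the atomic alternative mentioned above: C11's atomic_fetch_or sets a bit and returns the previous value in one step, so the caller learns whether the flag was already set without taking a mutex. This is a userspace sketch only; the flag value and the sketch_ names are hypothetical, and kernel code would use its own atomic primitives instead of <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FLAG_SPILL_BLKPTR 0x04u /* hypothetical flag bit */

/*
 * Atomically set the flag. Returns true if this call changed the bit
 * from clear to set, false if it was already set.
 */
bool sketch_set_spill_flag(_Atomic uint8_t *flags)
{
    uint8_t old = atomic_fetch_or(flags, SKETCH_FLAG_SPILL_BLKPTR);

    return (old & SKETCH_FLAG_SPILL_BLKPTR) == 0;
}
```

A caller could clear its block pointer only when the function returns true, which folds the test and the set into a single operation.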

dweeezil · 2016-06-02T13:13:24Z

The missing piece of this puzzle is how dbuf_sync_leaf() can be entered for a spill block when db_blkptr is not NULL and points to SA leftovers in db_spill while DNODE_FLAG_SPILL_BLKPTR is clear. The patch proposed here clearly fixes the problem (and actually NULLs db_blkptr in many cases where it is already NULL), but I wonder if the real issue is how this condition is happening in the first place.

xhernandez · 2016-06-02T17:24:01Z

@dweeezil I'm trying to trace the path followed by a dbuf that works fine and a dbuf that causes a panic and I've seen that they start to differ when dbuf_undirty() is called.

In the bad case dbuf_undirty() returns false, but not because the dbuf is still referenced by anyone else, as I previously thought. It returns false because the dbuf is not dirty in the current transaction group. It's dirty in a previous one or not dirty at all (if I correctly understand the code). This is the check that causes the return:

if (dr == NULL || dr->dr_txg < txg)
        return (B_FALSE);

I'll analyze the remaining data to post a more detailed description.

dweeezil · 2016-06-02T23:39:44Z

@xhernandez I was able to exercise that code path in my testing, but it never resulted in non NULL db_blkptr for a spill block pointer pointing at trash.

I'll keep trying to reproduce this, but think your patch makes sense right now even though I'm unclear of the code paths which can cause the problem.

xhernandez · 2016-06-04T17:28:37Z

@dweeezil I think I've identified the code path that leads to this situation.

Current txg = A.
A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied.
Current txg = B.
The spill buffer is modified. It's marked as dirty in this txg.
Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed.
Current txg = C.
Starts syncing of txg A
dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called.
The dbuf starts being written and it reaches the ready state (not done yet).
A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL.
txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed.
Current txg = D.
Starts syncing of txg B
dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr.
At this point, the db_blkptr of the spill buffer used in txg C gets corrupted.

dweeezil · 2016-06-06T03:28:21Z

@xhernandez I finally got a chance to go over your dbgmsg output. One of the key observations was the failures of arc_tempreserve(). When this happens, the creation of new txgs is stalled and this is exactly the type of thing which would likely be required in order that the scenario you describe could happen. Your dmu_tx kstat would likely bear that out. I'm still working on a reliable reproducer.

xhernandez · 2016-06-08T09:23:58Z

@behlendorf will this patch be included in the next 0.6.5.x release ?

behlendorf · 2016-06-09T16:48:58Z

It's possible if we can get a few developers to review and sign off on the proposed change in #4743. The fix itself looks reasonable to me but I think the comment could be a little more concise. Maybe @dweeezil @ahrens or @xhernandez can propose something. Including the detailed walk-thru from the above comment in the commit comment would also be very useful.

dweeezil · 2016-06-09T16:59:49Z

I had started working on a somewhat more concise description of the conditions required for this issue. As mentioned earlier, I think one important prerequisite is dmu tx assignment stalls of some sort, so that multiple references to the spill and/or bonus buffer actually exist at the same time. So far, I've not come up with a reproducer. I'd at least like to see the description include the open/quiesce/sync/close state of the txgs involved.

That said, the fix does look perfectly reasonable.

* Consistently use parsable instead of parseable This is a purely cosmetical change, to consistently prefer one of two (both acceptable) choises for the word parsable in documentation and code. I don't really care which to use, but acording to wiktionary https://en.wiktionary.org/wiki/parsable#English parsable is preferred. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4682 * Add missing RPM BuildRequires Both libudev and libattr are recommended build requirements. As such their development headers should lists in the rpm spec file so those dependencies are pulled in when building rpm packages. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4676 * Skip ctldir znode in zfs_rezget to fix snapdir issues Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This will cause funny behaviour for the mounted snapdirs. Especially for Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone automount it again as long as someone is still using the detached mount. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4514 Closes #4661 Closes #4672 * Improve zfs-module-parameters(5) Various rewrites to the descriptions of module parameters. Corrects spelling mistakes, makes descriptions them more user-friendly and describes some ZFS quirks which should be understood before changing parameter values. Signed-off-by: DHE <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4671 * Fix arc_prune_task use-after-free arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent the underlying zsb from disappearing if there's a concurrent umount. We fix this by force the caller of arc_remove_prune_callback to wait for arc_prune_taskq to finish. 
Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4687 Closes #4690 * Add request size histograms (-r) to zpool iostat, minor man page fix Add -r option to "zpool iostat" to print request size histograms for the leaf ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs ("agg"). These stats can be useful for seeing how well the ZFS IO aggregator is working. $ zpool iostat -r mypool sync_read sync_write async_read async_write scrub req_size ind agg ind agg ind agg ind agg ind agg ---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 512 0 0 0 0 0 0 530 0 0 0 1K 0 0 260 0 0 0 116 246 0 0 2K 0 0 0 0 0 0 0 431 0 0 4K 0 0 0 0 0 0 3 107 0 0 8K 15 0 35 0 0 0 0 6 0 0 16K 0 0 0 0 0 0 0 39 0 0 32K 0 0 0 0 0 0 0 0 0 0 64K 20 0 40 0 0 0 0 0 0 0 128K 0 0 20 0 0 0 0 0 0 0 256K 0 0 0 0 0 0 0 0 0 0 512K 0 0 0 0 0 0 0 0 0 0 1M 0 0 0 0 0 0 0 0 0 0 2M 0 0 0 0 0 0 0 0 0 0 4M 0 0 0 0 0 0 155 19 0 0 8M 0 0 0 0 0 0 0 811 0 0 16M 0 0 0 0 0 0 0 68 0 0 -------------------------------------------------------------------------------- Also rename the stray "-G" in the man page to be "-w" for latency histograms. Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #4659 * OpenZFS 6531 - Provide mechanism to artificially limit disk performance Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> Ported by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6531 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130 Porting notes: - Added new IO delay tracepoints, and moved common ZIO tracepoint macros to a new trace_common.h file. 
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function. - Updated zinject man page - Updated zpool_scrub test files * Systemd configuration fixes * Disable zfs-import-scan.service by default. This ensures that pools will not be automatically imported unless they appear in the cache file. When this service is explicitly enabled pools will be imported with the "cachefile=none" property set. This prevents the creation of, or update to, an existing cache file. $ systemctl list-unit-files | grep zfs zfs-import-cache.service enabled zfs-import-scan.service disabled zfs-mount.service enabled zfs-share.service enabled zfs-zed.service enabled zfs.target enabled * Change services to dynamic from static by adding an [Install] section and adding 'WantedBy' tags in favor of 'Requires' tags. This allows for easier customization of the boot behavior. * Start the zfs-import-cache.service after the root pivot so the cache file is available in the standard location. * Start the zfs-mount.service after the systemd-remount-fs.service to ensure the root fs is writeable and the ZFS filesystems can create their mount points. * Change the default behavior to only load the ZFS kernel modules in zfs-import-*.service or when blkid(8) detects a pool. Users who wish to unconditionally load the kernel modules must uncomment the list of modules in /lib/modules-load.d/zfs.conf. Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4325 Closes #4496 Closes #4658 Closes #4699 * Fix self-healing IO prior to dsl_pool_init() completion Async writes triggered by a self-healing IO may be issued before the pool finishes the process of initialization. This results in a NULL dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes(). George Wilson recommended addressing this issue by initializing the passed `dsl_pool_t **` prior to dmu_objset_open_impl(). 
Since the caller is passing the `spa->spa_dsl_pool` this has the effect of ensuring it's initialized. However, since this depends on the caller knowing they must pass the `spa->spa_dsl_pool` an additional NULL check was added to vdev_queue_max_async_writes(). This guards against any future restructuring of the code which might result in dsl_pool_init() being called differently. Signed-off-by: GeLiXin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4652 * Add isa_defs for MIPS GCC for MIPS only defines _LP64 when 64bit, while no _ILP32 defined when 32bit. Signed-off-by: YunQiang Su <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4712 * Fix out-of-bound access in zfs_fillpage The original code will do an out-of-bound access on pl[] during last iteration. ================================================================== BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs] Read of size 8 by task tmpfile/7850 page:ffffea00017c6dc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0xffff8000000000() page dumped because: kasan: bad access detected CPU: 3 PID: 7850 Comm: tmpfile Tainted: G OE 4.6.0+ #3 ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618 ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8 ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3 Call Trace: [<ffffffff81635618>] dump_stack+0x63/0x8b [<ffffffff81313ee8>] kasan_report_error+0x528/0x560 [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0 [<ffffffff813144b8>] kasan_report+0x58/0x60 [<ffffffffc12250dc>] ? 
zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffff81312e4e>] __asan_load8+0x5e/0x70 [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs] [<ffffffff81353c3a>] SyS_execve+0x3a/0x50 [<ffffffff810058ef>] do_syscall_64+0xef/0x180 [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25 Memory state around the buggy address: ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 ^ ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4705 Issue #4708 * Fix memleak in zpl_parse_options strsep() will advance tmp_mntopts, and will change it to NULL on last iteration. This will cause strfree(tmp_mntopts) to not free anything. 
unreferenced object 0xffff8800883976c0 (size 64): comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s) hex dump (first 32 bytes): 72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a rw.strictatime.z 66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d fsutil.mntpoint= backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811f9cac>] __kmalloc+0x16c/0x250 [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl] [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs] [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs] [<ffffffff81222dc8>] mount_fs+0x38/0x160 [<ffffffff81240097>] vfs_kern_mount+0x67/0x110 [<ffffffff812428e0>] do_mount+0x250/0xe20 [<ffffffff812437d5>] SyS_mount+0x95/0xe0 [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4706 Issue #4708 * Fix memleak in vdev_config_generate_stats fnvlist_add_nvlist will copy the contents of nvx, so we need to free it here. unreferenced object 0xffff8800a6934e80 (size 64): comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s) hex dump (first 32 bytes): 60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff `..s.....|.s.... 00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff [email protected]..... 
backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310 [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl] [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl] [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair] [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair] [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair] [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair] [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs] [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs] [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs] [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs] [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs] [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs] [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0 [<ffffffff812333b9>] SyS_ioctl+0x79/0x90 Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4707 Issue #4708 * Linux 4.7 compat: handler->set() takes both dentry and inode Counterpart to fd4c7b7, the same approach was taken to resolve the compatibility issue. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4717 Issue #4665 * Implementation of AVX2 optimized Fletcher-4 New functionality: - Preserves existing scalar implementation. - Adds AVX2 optimized Fletcher-4 computation. - Fastest routines selected on module load (benchmark). - Test case for Fletcher-4 added to ztest. New zcommon module parameters: - zfs_fletcher_4_impl (str): selects the implementation to use. 
"fastest" - use the fastest version available "cycle" - cycle trough all available impl for ztest "scalar" - use the original version "avx2" - new AVX2 implementation if available Performance comparison (Intel i7 CPU, 1MB data buffers): - Scalar: 4216 MB/s - AVX2: 14499 MB/s See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl` to get list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. Signed-off-by: Jinshan Xiong <[email protected]> Signed-off-by: Andreas Dilger <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4330 * Fix cstyle.pl warnings As of perl v5.22.1 the following warnings are generated: * Redundant argument in printf at scripts/cstyle.pl line 194 * Unescaped left brace in regex is deprecated, passed through in regex; marked by <-- HERE in m/\S{ <-- HERE / at scripts/cstyle.pl line 608. They have been addressed by escaping the left braces and by providing the correct number of arguments to printf based on the fmt specifier set by the verbose option. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4723 * Fix minor spelling mistakes Trivial spelling mistake fix in error message text. * Fix spelling mistake "adminstrator" -> "administrator" * Fix spelling mistake "specificed" -> "specified" * Fix spelling mistake "interperted" -> "interpreted" Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4728 * Add `zfs allow` and `zfs unallow` support ZFS allows for specific permissions to be delegated to normal users with the `zfs allow` and `zfs unallow` commands. In addition, non- privileged users should be able to run all of the following commands: * zpool [list | iostat | status | get] * zfs [list | get] Historically this functionality was not available on Linux. 
In order to add it the secpolicy_* functions needed to be implemented and mapped to the equivalent Linux capability. Only then could the permissions on the `/dev/zfs` be relaxed and the internal ZFS permission checks used. Even with this change some limitations remain. Under Linux only the root user is allowed to modify the namespace (unless it's a private namespace). This means the mount, mountpoint, canmount, unmount, and remount delegations cannot be supported with the existing code. It may be possible to add this functionality in the future. This functionality was validated with the cli_user and delegation test cases from the ZFS Test Suite. These tests exhaustively verify each of the supported permissions which can be delegated and ensures only an authorized user can perform it. Two minor bug fixes were required for test-running.py. First, the Timer() object cannot be safely created in a `try:` block when there is an unconditional `finally` block which references it. Second, when running as a normal user also check for scripts using the both the .ksh and .sh suffixes. Finally, existing users who are simulating delegations by setting group permissions on the /dev/zfs device should revert that customization when updating to a version with this change. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #362 Closes #434 Closes #4100 Closes #4394 Closes #4410 Closes #4487 * Remove libzfs_graph.c The libzfs_graph.c source file should have been removed in 330d06f, it is entirely unused. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4766 * Linux 4.6 compat: Fall back to d_prune_aliases() if necessary As of 4.6, the icache and dcache LRUs are memcg aware insofar as the kernel's per-superblock shrinker is concerned. The effect is that dcache or icache entries added by a task in a non-root memcg won't be scanned by the shrinker in the context of the root (or NULL) memcg. 
This defeats the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to grow uncontrollably. This patch reverts to the d_prune_aliaes() method in case the kernel's per-superblock shrinker is not able to free anything. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes: #4726 * SIMD implementation of vdev_raidz generate and reconstruct routines This is a new implementation of RAIDZ1/2/3 routines using x86_64 scalar, SSE, and AVX2 instruction sets. Included are 3 parity generation routines (P, PQ, and PQR) and 7 reconstruction routines, for all RAIDZ level. On module load, a quick benchmark of supported routines will select the fastest for each operation and they will be used at runtime. Original implementation is still present and can be selected via module parameter. Patch contains: - specialized gen/rec routines for all RAIDZ levels, - new scalar raidz implementation (unrolled), - two x86_64 SIMD implementations (SSE and AVX2 instructions sets), - fastest routines selected on module load (benchmark). - cmd/raidz_test - verify and benchmark all implementations - added raidz_test to the ZFS Test Suite New zfs module parameters: - zfs_vdev_raidz_impl (str): selects the implementation to use. On module load, the parameter will only accept first 3 options, and the other implementations can be set once module is finished loading. Possible values for this option are: "fastest" - use the fastest math available "original" - use the original raidz code "scalar" - new scalar impl "sse" - new SSE impl if available "avx2" - new AVX2 impl if available See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to get the list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. 
Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4328 * Fix NFS credential The commit f74b821 caused a regression where creating file through NFS will always create a file owned by root. This is because the patch enables the KSID code in zfs_acl_ids_create, which it would use euid and egid of the current process. However, on Linux, we should use fsuid and fsgid for file operations, which is the original behaviour. So we revert this part of code. The patch also enables secpolicy_vnode_*, since they are also used in file operations, we change them to use fsuid and fsgid. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4772 Closes #4758 * OpenZFS 6513 - partially filled holes lose birth time Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Boris Protopopov <[email protected]> Approved by: Richard Lowe <[email protected]>a Ported by: Boris Protopopov <[email protected]> Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6513 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0 If a ZFS object contains a hole at level one, and then a data block is created at level 0 underneath that l1 block, l0 holes will be created. However, these l0 holes do not have the birth time property set; as a result, incremental sends will not send those holes. Fix is to modify the dbuf_read code to fill in birth time data. 
* Add a test case for dmu_free_long_range() to ztest Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4754 * Revert "Add a test case for dmu_free_long_range() to ztest" This reverts commit d0de2e82df579f4e4edf5643b674a1464fae485f which introduced a new test case to ztest which is failing occasionally during automated testing. The change is being reverted until the issue can be fully investigated. Signed-off-by: Brian Behlendorf <[email protected]> Issue #4754 * OpenZFS 6878 - Add scrub completion info to "zpool history" Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Approved by: Dan McDonald <[email protected]> Authored by: Nav Ravindranath <[email protected]> Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6878 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5 Closes #4787 * FreeBSD rS271776 - Persist vdev_resilver_txg changes Persist vdev_resilver_txg changes to avoid panic caused by validation vs a vdev_resilver_txg value from a previous resilver. Authored-by: smh <[email protected]> Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/5154 FreeBSD-issue: https://reviews.freebsd.org/rS271776 FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf Closes #4790 * xattrtest: allow verify with -R and other improvements - Use a fixed buffer of random bytes when random xattr values are in effect. This eliminates the potential performance bottleneck of reading from /dev/urandom for each file. This also allows us to verify xattrs in random value mode. - Show the rate of operations per second in addition to elapsed time for each phase of the test. This may be useful for benchmarking. 
- Set default xattr size to 6 so that verify doesn't fail if user doesn't specify a size. We need at least six bytes to store the leading "size=X" string that is used for verification. - Allow user to execute just one phase of the test. Acceptable values for -o and their meanings are: 1 - run the create phase 2 - run the setxattr phase 3 - run the getxattr phase 4 - run the unlink phase Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> * Backfill metadnode more intelligently Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in #4636: "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don’t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can’t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2." Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. 
Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limit the rescan to at most once per TXG. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> * Implement large_dnode pool feature Justification ------------- This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced to from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant. ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill bocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. 
Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. 
Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value. The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interfaces names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size. New DMU interfaces: dmu_object_alloc_dnsize() dmu_object_claim_dnsize() dmu_object_reclaim_dnsize() New ZAP interfaces: zap_create_dnsize() zap_create_norm_dnsize() zap_create_flags_dnsize() zap_create_claim_norm_dnsize() zap_create_link_dnsize() The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool. These are a few noteworthy changes to key functions: * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by an large dnode, in which case it returns ENOENT. 
* The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. 
A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. 
Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3542 * Sync DMU_BACKUP_FEATURE_* flags Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING. The DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and then reserved in the upstream OpenZFS implementation. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #4795 * OpenZFS 2605, 6980, 6902 2605 want to resume interrupted zfs send Reviewed by: George Wilson <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Xin Li <[email protected]> Reviewed by: Arne Jansen <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: kernelOfTruth <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/2605 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6980 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f Porting 
notes: - All rsend and snapshop tests enabled and updated for Linux. - Fix misuse of input argument in traverse_visitbp(). - Fix ISO C90 warnings and errors. - Fix gcc 'missing braces around initializer' in 'struct send_thread_arg to_arg =' warning. - Replace 4 argument fletcher_4_native() with 3 argument version, this change was made in OpenZFS 4185 which has not been ported. - Part of the sections for 'zfs receive' and 'zfs send' was rewritten and reordered to approximate upstream. - Fix mktree xattr creation, 'user.' prefix required. - Minor fixes to newly enabled test cases - Long holds for volumes allowed during receive for minor registration. * OpenZFS 6051 - lzc_receive: allow the caller to read the begin record Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6051 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322 * OpenZFS 6393 - zfs receive a full send as a clone Authored by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6394 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e * OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS Authored by: Andrew Stormont <[email protected]> Reviewed by: Anil Vijarnia <[email protected]> Reviewed by: Kim Shrier <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6536 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b * OpenZFS 6738 - zfs send 
stream padding needs documentation Authored by: Eli Rosenthal <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6738 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff * OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota Authored by: Dan McDonald <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Gordon Ross <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/4986 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad * OpenZFS 6562 - Refquota on receive doesn't account for overage Authored by: Dan McDonald <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Reviewed by: Toomas Soome <[email protected]> Approved by: Gordon Ross <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6562 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6 * Implement zfs_ioc_recv_new() for OpenZFS 2605 Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy ZFS_IOC_RECV user/kernel interface. The new interface supports all stream options but is currently only used for resumable streams. This way updated user space utilities will interoperate with older kernel modules. ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW handler. Non-Linux OpenZFS platforms have opted to change the legacy interface in an incompatible fashion instead of adding a new ioctl. 
Signed-off-by: Brian Behlendorf <[email protected]> * OpenZFS 6314 - buffer overflow in dsl_dataset_name Reviewed by: George Wilson <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6314 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee * OpenZFS 6876 - Stack corruption after importing a pool with a too-long name Reviewed by: Prakash Surya <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes. OpenZFS-issue: https://www.illumos.org/issues/6876 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e * Vectorized fletcher_4 must be 128-bit aligned The fletcher_4_native() and fletcher_4_byteswap() functions may only safely use the vectorized implementations when the buffer is 128-bit aligned. This is because both the AVX2 and SSE implementations process four 32-bit words per iterations. Fallback to the scalar implementation which only processes a single 32-bit word for unaligned buffers. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Issue #4330 * Allow building with `CFLAGS="-O0"` If compiled with -O0, gcc doesn't do any stack frame coalescing and -Wframe-larger-than=1024 is triggered in debug mode. Starting with gcc 4.8, new opt level -Og is introduced for debugging, which does not trigger this warning. 
Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4799 * Don't allow accessing XATTR via export handle Allow accessing XATTR through export handle is a very bad idea. It would allow user to write whatever they want in fields where they otherwise could not. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4828 * Fix get_zfs_sb race with concurrent umount Certain ioctl operations will call get_zfs_sb, which will holds an active count on sb without checking whether it's active or not. This will result in use-after-free. We fix this by using atomic_inc_not_zero to make sure we got an active sb. P1 P2 --- --- deactivate_locked_super(): s_active = 0 zfs_sb_hold() ->get_zfs_sb(): s_active = 1 ->zpl_kill_sb() -->zpl_put_super() --->zfs_umount() ---->zfs_sb_free(zsb) zfs_sb_rele(zsb) Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> * Fix Large kmem_alloc in vdev_metaslab_init This allocation can go way over 1MB, so we should use vmem_alloc instead of kmem_alloc. Large kmem_alloc(1430784, 0x1000), please file an issue... Call Trace: [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl] [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs] [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs] [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs] [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs] [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs] [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs] [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs] [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0 [<ffffffff811baf41>] ? 
SyS_ioctl+0x81/0xa0 Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4752 * Add configure result for xattr_handler Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4828 * fh_to_dentry should return ESTALE when generation mismatch When generation mismatch, it usually means the file pointed by the file handle was deleted. We should return ESTALE to indicate this. We return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4828 * xattr dir doesn't get purged during iput We need to set inode->i_nlink to zero so iput will purge it. Without this, it will get purged during shrink cache or umount, which would likely result in deadlock due to zfs_zget waiting forever on its children which are in the dispose_list of the same thread. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Issue #4359 Issue #3508 Issue #4413 Issue #4827 * Kill zp->z_xattr_parent to prevent pinning zp->z_xattr_parent will pin the parent. This will cause huge issue when unlink a file with xattr. Because the unlinked file is pinned, it will never get purged immediately. And because of that, the xattr stuff will never be marked as unlinked. So the whole unlinked stuff will stay there until shrink cache or umount. This change partially reverts e89260a. This is safe because only the zp->z_xattr_parent optimization is removed, zpl_xattr_security_init() is still called from the zpl outside the inode lock. 
Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Issue #4359 Issue #3508 Issue #4413 Issue #4827 * Fix RAIDZ_TEST tests Remove stray trailing } which prevented the raidz stress tests from running in-tree. Signed-off-by: Brian Behlendorf <[email protected]> * Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z The following scenario can result in garbage in the dn_spill field. The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR is clear to ensure the dn_spill field is cleared. Current txg = A. * A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied. Current txg = B. * The spill buffer is modified. It's marked as dirty in this txg. * Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed. Current txg = C. * Starts syncing of txg A * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called. * The dbuf starts being written and it reaches the ready state (not done yet). * A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL. * txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed. Current txg = D. * Starts syncing of txg B * dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr. * At this point, the db_blkptr of the spill buffer used in txg C gets corrupted. 
Signed-off-by: Peng <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3937 * Fix handling of errors nvlist in zfs_ioc_recv_new() zfs_ioc_recv_impl() is changed to always allocate the 'errors' nvlist, its callers are responsible for freeing it. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4829 * Add RAID-Z routines for SSE2 instruction set, in x86_64 mode. The patch covers low-end and older x86 CPUs. Parity generation is equivalent to SSSE3 implementation, but reconstruction is somewhat slower. Previous 'sse' implementation is renamed to 'ssse3' to indicate highest instruction set used. Benchmark results: scalar_rec_p 4 720476442 scalar_rec_q 4 187462804 scalar_rec_r 4 138996096 scalar_rec_pq 4 140834951 scalar_rec_pr 4 129332035 scalar_rec_qr 4 81619194 scalar_rec_pqr 4 53376668 sse2_rec_p 4 2427757064 sse2_rec_q 4 747120861 sse2_rec_r 4 499871637 sse2_rec_pq 4 522403710 sse2_rec_pr 4 464632780 sse2_rec_qr 4 319124434 sse2_rec_pqr 4 205794190 ssse3_rec_p 4 2519939444 ssse3_rec_q 4 1003019289 ssse3_rec_r 4 616428767 ssse3_rec_pq 4 706326396 ssse3_rec_pr 4 570493618 ssse3_rec_qr 4 400185250 ssse3_rec_pqr 4 377541245 original_rec_p 4 691658568 original_rec_q 4 195510948 original_rec_r 4 26075538 original_rec_pq 4 103087368 original_rec_pr 4 15767058 original_rec_qr 4 15513175 original_rec_pqr 4 10746357 Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4783 * Enable zpool_upgrade test cases Creating the pool in a striped rather than mirrored configuration provides enough space for all upgrade tests to run. Test case zpool_upgrade_007_pos still fails and must be investigated so it has been left disabled. 
Signed-off-by: Brian Behlendorf <[email protected]> Closes #4852 * Prevent null dereferences when accessing dbuf kstat In arc_buf_info(), the arc_buf_t may have no header. If not, don't try to fetch the arc buffer stats and instead just zero them. The null dereferences were observed while accessing the dbuf kstat with awk on a system in which millions of small files were being created in order to overflow the system's metadata limit. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4837 * Fix dbuf_stats_hash_table_data race Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf can be freed at any time. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4846 * Use native inode->i_nlink instead of znode->z_links A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's 64 bit on-disk link count. We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a more Linux-integrated fix for the same issue. In addition, setting the initial link count on a new node has been changed from setting one less than required in zfs_mknode() then incrementing to the correct count in zfs_link_create() (which was somewhat bizarre in the first place), to setting the correct count in zfs_mknode() and not incrementing it in zfs_link_create(). This both means we no longer set the link count in sa_bulk_update() twice (once for the initial incorrect count then again for the correct count), as well as adhering to the Linux requirement of not incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c). 
Signed-off-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4838 Issue #227 * Implementation of SSE optimized Fletcher-4 Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4) This commit adds another implementation of the Fletcher-4 algorithm. It is automatically selected at module load if it benchmarks higher than all other available implementations. The module benchmark was also amended to analyze the performance of the byteswap-ed version of Fletcher-4, as well as the non-byteswaped version. The average performance of the two is used to select the the fastest implementation available on the host system. Adds a pair of fields to an existing zcommon module parameter: - zfs_fletcher_4_impl (str) "sse2" - new SSE2 implementation if available "ssse3" - new SSSE3 implementation if available Signed-off-by: Tyler J. Stachecki <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4789 * Fix filesystem destroy with receive_resume_token It is possible that the given DS may have hidden child (%recv) datasets - "leftovers" resulting from the previously interrupted 'zfs receieve'. Try to remove the hidden child (%recv) and after that try to remove the target dataset. If the hidden child (%recv) does not exist the original error (EEXIST) will be returned. Signed-off-by: Roman Strashkin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4818 * Prevent segfaults in SSE optimized Fletcher-4 In some cases, the compiler was not respecting the GNU aligned attribute for stack variables in 35a76a0. This was resulting in a segfault on CentOS 6.7 hosts using gcc 4.4.7-17. This issue was fixed in gcc 4.6. To prevent this from occurring, use unaligned loads and stores for all stack and global memory references in the SSE optimized Fletcher-4 code. 
Disable zimport testing against master where this flaw exists: TEST_ZIMPORT_VERSIONS="installed" Signed-off-by: Tyler J. Stachecki <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4862 * Update arc_summary.py for prefetch changes Commit 7f60329 removed several kstats which arc_summary.py read. Remove these kstats from arc_summary.py in the same way this was handled in FreeNAS. FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73 Signed-off-by: Brian Behlendorf <[email protected]> Closes #4695 * Wait iput_async before evict_inodes to prevent race Wait for iput_async before entering evict_inodes in generic_shutdown_super. The reason we must finish before evict_inodes is when lazytime is on, or when zfs_purgedir calls zfs_zget, iput would bump i_count from 0 to 1. This would race with the i_count check in evict_inodes. This means it could destroy the inode while we are still using it. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4854 * Fixes and enhancements of SIMD raidz parity - Implementation lock replaced with atomic variable - Trailing whitespace is removed from user specified parameter, to enhance experience when using commands that add newline, e.g. `echo` - raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813 - silence `cppcheck` in vdev_raidz, partial solution of Issue #1392 - Minor fixes and cleanups - Enable use of original parity methods in [fastest] configuration. New opaque original ops structure, representing native methods, is added to supported raidz methods. Original parity methods are executed if selected implementation has NULL fn pointer. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4813 Issue #1392 * RAIDZ parity kstat rework Print table with speed of methods for each implementation. 
Last line describes contents of [fastest] selection. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4860 * Fix NULL pointer in zfs_preumount from 1d9b3bd When zfs_domount fails zsb will be freed, and its caller mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into zfs_preumount. In order to make sure we don't touch any nonexistent stuff, we must make sure s_fs_info is NULL in the fail path so zfs_preumount can easily check that. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4867 Issue #4854 * Illumos Crypto Port module added to enable native encryption in zfs A port of the Illumos Crypto Framework to a Linux kernel module (found in module/icp). This is needed to do the actual encryption work. We cannot use the Linux kernel's built in crypto api because it is only exported to GPL-licensed modules. Having the ICP also means the crypto code can run on any of the other kernels under OpenZFS. I ended up porting over most of the internals of the framework, which means that porting over other API calls (if we need them) should be fairly easy. Specifically, I have ported over the API functions related to encryption, digests, macs, and crypto templates. The ICP is able to use assembly-accelerated encryption on amd64 machines and AES-NI instructions on Intel chips that support it. There are place-holder directories for similar assembly optimizations for other architectures (although they have not been written). 
Signed-off-by: Tom Caputi <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4329 * Fix for compilation error when using the kernel's CONFIG_LOCKDEP Signed-off-by: Tom Caputi <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4329 * zloop: print backtrace from core files Find the core file by using `/proc/sys/kernel/core_pattern` Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4874 * Fix for metaslab_fastwrite_unmark() assert failure Currently there is an issue where metaslab_fastwrite_unmark() unmarks fastwrites on vdev_t's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that the vdev_pending_fastwrite field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assigns vdev_pending_fastwrite from the old vdev to the new one so this count is not lost. Signed-off-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4267 * Remove znode's z_uid/z_gid member Remove duplicate z_uid/z_gid member which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid|gid)_(read|write) functions to properly save the uids rather than the internal kernel representation. 
Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4685 Issue #227 * Check whether the kernel supports i_uid/gid_read/write helpers Since the concept of a kuid and the need to translate from it to ordinary integer type was added in kernel version 3.5 implement necessary plumbing to be able to detect this condition during compile time. If the kernel doesn't support the kuid then just fall back to directly accessing the respective struct inode's members Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4685 Issue #227 * Fix uninitialized variable in avl_add() Silence the following warning when compiling with gcc 5.4.0. Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609. module/avl/avl.c: In function ‘avl_add’: module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized in this function [-Wmaybe-uninitialized] avl_insert(tree, new_node, where); Signed-off-by: Brian Behlendorf <[email protected]> * Fix sync behavior for disk vdevs Prior to b39c22b, which was first generally available in the 0.6.5 release as b39c22b, ZoL never actually submitted synchronous read or write requests to the Linux block layer. This means the vdev_disk_dio_is_sync() function had always returned false and, therefore, the completion in dio_request_t.dr_comp was never actually used. In b39c22b, synchronous ZIO operations were translated to synchronous BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and aa159af fixed several problems introduced by b39c22b. In particular, 5592404 introduced the new flag parameter "wait" to __vdev_disk_physio() but under ZoL, since vdev_disk_physio() is never actually used, the wait flag was always zero so the new code had no effect other than to cause a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af. 
The original rationale for introducing synchronous operations in b39c22b was to hurry certains requests through the BIO layer which would have otherwise been subject to its unplug timer which would increase the latency. This behavior of the unplug timer, however, went away during the transition of the plug/unplug system between kernels 2.6.32 and 2.6.39. To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior. For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and ise used for the same purpose. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4858 * Limit the amount of dnode metadata in the ARC Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. 
Observe arc_meta_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_used as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on th…

The following scenario can result in garbage in the dn_spill field. The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR is clear to ensure the dn_spill field is cleared.

Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied.

Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed.

Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state (not done yet).
* A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed.

Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C gets corrupted.

Signed-off-by: Peng <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#3937
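The overlap at the heart of this scenario can be illustrated with a toy model. The sketch below is not ZFS code: `ToyDnode`, `sync_bonus`, `stale_spill_words` and `checked_spill_words` are hypothetical names, and the 320-byte bonus area with a 128-byte pointer slot at its end is an assumed layout chosen for illustration. It only shows why a pointer into the spill slot reads xattr text once the bonus data has grown over it, and why consulting the flag first avoids acting on that garbage.

```python
import struct

# Toy sizes, assumed for illustration: a bonus area whose last bytes double
# as the spill block pointer when a spill block exists.
BONUS_LEN = 320
BLKPTR_LEN = 128
SPILL_OFF = BONUS_LEN - BLKPTR_LEN


class ToyDnode:
    """Hypothetical stand-in for the on-disk dnode tail, not ZFS code."""

    def __init__(self):
        self.tail = bytearray(BONUS_LEN)
        self.spill_flag = False  # plays the role of DNODE_FLAG_SPILL_BLKPTR


def sync_bonus(dn, payload):
    """Copy bonus contents into the dnode tail, as in the txg B sync step."""
    if len(payload) > BONUS_LEN:
        raise ValueError("payload does not fit in the bonus area")
    dn.tail[:len(payload)] = payload
    # A payload that reaches past SPILL_OFF means no spill pointer lives there.
    if len(payload) > SPILL_OFF:
        dn.spill_flag = False


def stale_spill_words(dn):
    """What a stale pointer into the spill slot would read: two raw words."""
    return struct.unpack_from("<2Q", dn.tail, SPILL_OFF)


def checked_spill_words(dn):
    """Treat the slot as empty unless the flag says a pointer is stored."""
    if not dn.spill_flag:
        return None
    return stale_spill_words(dn)


def demo():
    dn = ToyDnode()
    xattrs = b"trusted.gfid\0" * 24  # 312 bytes of xattr-like bonus data
    sync_bonus(dn, xattrs)
    return stale_spill_words(dn), checked_spill_words(dn)
```

Calling `demo()` returns the two words a stale reference would see, which are simply bytes of the xattr names, together with `None` from the flag-checked accessor. This mirrors, in spirit only, the rule stated in the commit message that db_blkptr must be NULL whenever DNODE_FLAG_SPILL_BLKPTR is clear.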

behlendorf added the Bug - Major label Oct 20, 2015

behlendorf added this to the 0.7.0 milestone Oct 20, 2015

behlendorf modified the milestones: 0.8.0, 0.7.0 Mar 26, 2016

behlendorf mentioned this issue Jun 2, 2016

Author: PengXie <[email protected]> #4719

Closed

behlendorf modified the milestones: 0.6.5.8, 0.8.0 Jul 12, 2016

behlendorf closed this as completed in 81edd3e Jul 12, 2016


