PANIC: metaslab_free_dva(): bad DVA with zfs 0.6.5.2 #3937

Closed
xhernandez opened this issue Oct 19, 2015 · 62 comments
@xhernandez

I've found what seems to be in-memory corruption in ZFS.

I have created a pool using raidz1 with 6 disks and two child datasets. I've configured the following properties:

xattr=sa
acltype=posixacl
compression=lz4

And I've set a quota on each child dataset. I use each dataset as a brick for gluster.

When gluster rebuilds a brick (self-heal operation), it copies data carrying extended attributes and ACLs into ZFS. While it was doing this, ZFS detected a problem (see below).

After restarting the server and accessing the volume, no problem has been detected, so I assume that on-disk data is healthy. I'm currently trying to reproduce the problem on another test server because this one is in production.

The bad DVA is 201326592:14468632831362131968:0. I only have two vdevs (0 and 1), so 201326592 is clearly wrong (in hex it's 0xC000000; I'm not sure whether that means anything). The DVA's offset is more interesting: in hex it's 0xC8CAE8E6EAE4E800. Divided by two it's 0x6465747375727400, whose bytes spell "detsurt" in ASCII, that is, "trusted" stored in little-endian order. Many of the extended attributes that gluster uses start with "trusted", so I'm guessing that some extended attribute manipulation has corrupted a memory block.
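
The arithmetic in this comment can be checked with a few lines of Python. This is only a sketch of the reasoning, using the two numbers printed in the panic message; the variable names are made up for illustration and nothing here comes from the ZFS sources.

```python
# Values copied from "bad DVA 201326592:14468632831362131968:0".
vdev = 201326592
offset = 14468632831362131968

print(hex(vdev))    # 0xc000000
print(hex(offset))  # 0xc8cae8e6eae4e800

# Halve the offset and look at the resulting 64-bit word as raw bytes.
half = offset // 2
as_big_endian = half.to_bytes(8, "big")        # byte order as written in hex
as_little_endian = half.to_bytes(8, "little")  # byte order in x86 memory

print(as_big_endian)     # b'detsurt\x00'
print(as_little_endian)  # b'\x00trusted'
```

The little-endian view is the one that matters on this machine: the word holds the string "trusted" preceded by a zero byte.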

kernel: [5292444.782956] PANIC: metaslab_free_dva(): bad DVA 201326592:14468632831362131968:0
kernel: [5292444.783282] Showing stack for process 61687
kernel: [5292444.783286] CPU: 6 PID: 61687 Comm: z_wr_int_4 Tainted: P           O--------------   3.10.0-5-pve #1
kernel: [5292444.783288] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
kernel: [5292444.783290]  0000000000000000 ffff8808e85b3728 ffffffff816125e0 ffff8808e85b3738
kernel: [5292444.783296]  ffffffffa062e784 ffff8808e85b3868 ffffffffa062e81e ffff8808e85b37b8
kernel: [5292444.783299]  62616c736174656d 76645f656572665f 646162203a292861 3130322041564420
kernel: [5292444.783302] Call Trace:
kernel: [5292444.783313]  [<ffffffff816125e0>] dump_stack+0x19/0x1b
kernel: [5292444.783323]  [<ffffffffa062e784>] spl_dumpstack+0x44/0x50 [spl]
kernel: [5292444.783328]  [<ffffffffa062e81e>] vcmn_err+0x8e/0x130 [spl]
kernel: [5292444.783355]  [<ffffffffa0262454>] ? isci_task_execute_task+0x204/0x320 [isci]
kernel: [5292444.783360]  [<ffffffff8118ed02>] ? kmem_cache_alloc+0x1b2/0x1e0
kernel: [5292444.783363]  [<ffffffff816161ed>] ? mutex_lock+0x1d/0x41
kernel: [5292444.783367]  [<ffffffffa062ae19>] ? spl_kmem_cache_alloc+0x69/0x150 [spl]
kernel: [5292444.783410]  [<ffffffffa0b00342>] zfs_panic_recover+0x52/0x60 [zfs]
kernel: [5292444.783414]  [<ffffffffa062ae19>] ? spl_kmem_cache_alloc+0x69/0x150 [spl]
kernel: [5292444.783435]  [<ffffffffa0ae3ff8>] metaslab_free_dva+0x1e8/0x3b0 [zfs]
kernel: [5292444.783456]  [<ffffffffa0ae6fdc>] metaslab_free+0x9c/0xe0 [zfs]
kernel: [5292444.783482]  [<ffffffffa0b4a1bc>] zio_dva_free+0x1c/0x30 [zfs]
kernel: [5292444.783504]  [<ffffffffa0b4e012>] zio_wait+0xd2/0x210 [zfs]
kernel: [5292444.783524]  [<ffffffffa0b4e21b>] zio_free+0xcb/0x120 [zfs]
kernel: [5292444.783544]  [<ffffffffa0addb21>] dsl_free+0x11/0x20 [zfs]
kernel: [5292444.783562]  [<ffffffffa0ac7e88>] dsl_dataset_block_kill+0x278/0x4c0 [zfs]
kernel: [5292444.783576]  [<ffffffffa0aa74ca>] dbuf_write_done+0x19a/0x240 [zfs]
kernel: [5292444.783588]  [<ffffffffa0a9e5fe>] arc_write_done+0x25e/0x3f0 [zfs]
kernel: [5292444.783609]  [<ffffffffa0b4ff59>] zio_done.part.11+0x259/0xed0 [zfs]
kernel: [5292444.783613]  [<ffffffffa06298ca>] ? spl_kmem_free+0x2a/0x40 [spl]
kernel: [5292444.783616]  [<ffffffff8118dfbd>] ? kfree+0xfd/0x130
kernel: [5292444.783618]  [<ffffffff816161ed>] ? mutex_lock+0x1d/0x41
kernel: [5292444.783638]  [<ffffffffa0b50c4a>] zio_done+0x7a/0x80 [zfs]
kernel: [5292444.783658]  [<ffffffffa0b506fc>] zio_done.part.11+0x9fc/0xed0 [zfs]
kernel: [5292444.783677]  [<ffffffffa0b50c4a>] zio_done+0x7a/0x80 [zfs]
kernel: [5292444.783696]  [<ffffffffa0b506fc>] zio_done.part.11+0x9fc/0xed0 [zfs]
kernel: [5292444.783716]  [<ffffffffa0ad7830>] ? dsl_pool_undirty_space+0xd0/0xe0 [zfs]
kernel: [5292444.783735]  [<ffffffffa0b50c4a>] zio_done+0x7a/0x80 [zfs]
kernel: [5292444.783755]  [<ffffffffa0b506fc>] zio_done.part.11+0x9fc/0xed0 [zfs]
kernel: [5292444.783774]  [<ffffffffa0b50c4a>] zio_done+0x7a/0x80 [zfs]
kernel: [5292444.783793]  [<ffffffffa0b4ad68>] zio_execute+0xc8/0x180 [zfs]
kernel: [5292444.783798]  [<ffffffffa062caee>] taskq_thread+0x1fe/0x3f0 [spl]
kernel: [5292444.783803]  [<ffffffff81094450>] ? try_to_wake_up+0x2a0/0x2a0
kernel: [5292444.783807]  [<ffffffffa062c8f0>] ? taskq_thread_spawn+0x70/0x70 [spl]
kernel: [5292444.783812]  [<ffffffff81083080>] kthread+0xc0/0xd0
kernel: [5292444.783815]  [<ffffffff81082fc0>] ? flush_kthread_worker+0x80/0x80
kernel: [5292444.783820]  [<ffffffff8162262c>] ret_from_fork+0x7c/0xb0
kernel: [5292444.783822]  [<ffffffff81082fc0>] ? flush_kthread_worker+0x80/0x80
@behlendorf added this to the 0.7.0 milestone Oct 20, 2015
@xhernandez
Author

I have been able to reproduce the problem once or twice per day, but I haven't been able to identify the cause.

Some more info I have found:

When the panic happens, sometimes (but not always) a user process gets stuck in 'D' state. In this case, the process has 3 threads doing a ZFS system call, and they are always doing the same thing: creating a hard link (linkat), writing to a file and calling fsync.

I've also been able to see the contents of the blkptr_t whose DVAs are being freed:

ffff880ed13dd7f0 | 00 00 00 30 00 00 00 0c 74 72 75 73 74 65 64 2e | ...0....trusted.
ffff880ed13dd800 | 67 66 69 64 00 00 00 0a 00 00 00 10 cb da 87 1a | gfid............
ffff880ed13dd810 | 3b 77 40 7d 90 07 e5 53 9c ce fe 98 00 00 00 00 | ;w@}...S........
ffff880ed13dd820 | 00 00 00 00 00 00 2c 00 00 00 00 00 00 00 00 00 | ......,.........
ffff880ed13dd830 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff880ed13dd840 | 30 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 0...............
ffff880ed13dd850 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff880ed13dd860 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

As you can see, it seems to contain embedded data, but the "embedded" flag is not set, which causes the panic. It also seems to be part of an nvlist containing one extended attribute. That extended attribute belongs to a directory.

I'm not sure whether this information is useful for finding the root cause.

If anyone has ideas for further tests, I'll be happy to try them.
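
To make the link between this dump and the panic message explicit, here is a minimal sketch that decodes the first DVA and the property word from the bytes above. The bit positions are assumptions based on my understanding of the blkptr_t layout (vdev in the upper 32 bits of the first DVA word, a 24-bit size and a 63-bit offset counted in 512-byte sectors, the "embedded" flag in bit 39 of blk_prop and the object type in bits 48-55). They have not been checked against the 0.6.5.2 sources.

```python
import struct

# Bytes 0x00-0x0f (DVA[0]) and 0x30-0x37 (blk_prop) of the dump above.
dva0_raw = bytes.fromhex("000000300000000c747275737465642e")
blk_prop_raw = bytes.fromhex("0000000000002c00")

word0, word1 = struct.unpack("<QQ", dva0_raw)
(blk_prop,) = struct.unpack("<Q", blk_prop_raw)

MASK64 = (1 << 64) - 1

vdev = word0 >> 32                                   # assumed: upper 32 bits
asize = (word0 & 0xFFFFFF) << 9                      # assumed: 24-bit sector count
offset = ((word1 & ((1 << 63) - 1)) << 9) & MASK64   # assumed: 63-bit sector offset

embedded = (blk_prop >> 39) & 1                      # assumed: bit 39
obj_type = (blk_prop >> 48) & 0xFF                   # assumed: bits 48-55

print(f"{vdev}:{offset}:{asize}")  # 201326592:14468632831362131968:0
print(embedded, obj_type)          # 0 44
```

Under these assumptions the decoded triple is exactly the "bad DVA" from the first panic, the embedded bit is clear, and the type field is 44, which I believe corresponds to DMU_OT_SA.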

@xhernandez
Author

More info. I've just compiled zfs with debugging and it has failed in another place:

kernel: [ 1305.076205] VERIFY3(space >= -delta) failed (0 >= 3536635740)
kernel: [ 1305.076463] PANIC at dnode.c:1803:dnode_diduse_space()
kernel: [ 1305.076690] Showing stack for process 147993
kernel: [ 1305.076697] CPU: 14 PID: 147993 Comm: z_wr_iss Tainted: P           O--------------   3.10.0-1-pve #1
kernel: [ 1305.076701] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
kernel: [ 1305.076704]  ffffffffa0836d18 ffff881d5a5dba68 ffffffff8161006a ffff881d5a5dba78
kernel: [ 1305.076711]  ffffffffa0302764 ffff881d5a5dbc18 ffffffffa030299d ffff881b39540d80
kernel: [ 1305.076716]  00000420198ee000 ffff881d00000030 ffff881d5a5dbc28 ffff881d5a5dbbb8
kernel: [ 1305.076721] Call Trace:
kernel: [ 1305.076744]  [<ffffffff8161006a>] dump_stack+0x19/0x1b
kernel: [ 1305.076759]  [<ffffffffa0302764>] spl_dumpstack+0x44/0x50 [spl]
kernel: [ 1305.076772]  [<ffffffffa030299d>] spl_panic+0xbd/0x100 [spl]
kernel: [ 1305.076833]  [<ffffffffa0775be9>] ? metaslab_block_alloc+0xb9/0x1c0 [zfs]
kernel: [ 1305.076868]  [<ffffffffa077697f>] ? metaslab_alloc_dva+0x8ef/0xda0 [zfs]
kernel: [ 1305.076911]  [<ffffffffa07f7136>] ? zio_execute+0x126/0x350 [zfs]
kernel: [ 1305.076949]  [<ffffffffa07fcc4f>] ? zio_nowait+0x10f/0x310 [zfs]
kernel: [ 1305.076955]  [<ffffffff81613d5d>] ? mutex_lock+0x1d/0x41
kernel: [ 1305.076960]  [<ffffffff81613d5d>] ? mutex_lock+0x1d/0x41
kernel: [ 1305.076998]  [<ffffffffa0796c58>] ? spa_config_held+0xb8/0xd0 [zfs]
kernel: [ 1305.077028]  [<ffffffffa074ccbc>] dnode_diduse_space+0x29c/0x310 [zfs]
kernel: [ 1305.077062]  [<ffffffffa0796cef>] ? dva_get_dsize_sync+0x7f/0xc0 [zfs]
kernel: [ 1305.077098]  [<ffffffffa0796d76>] ? bp_get_dsize_sync+0x46/0xa0 [zfs]
kernel: [ 1305.077122]  [<ffffffffa071faaf>] dbuf_write_ready+0xaf/0x4e0 [zfs]
kernel: [ 1305.077143]  [<ffffffffa070ee3c>] arc_write_ready+0x6c/0x1d0 [zfs]
kernel: [ 1305.077180]  [<ffffffffa07ff967>] zio_ready+0x97/0x7b0 [zfs]
kernel: [ 1305.077190]  [<ffffffffa02ffeb2>] ? taskq_member+0x62/0x70 [spl]
kernel: [ 1305.077246]  [<ffffffffa07f6fd2>] ? zio_taskq_member.isra.4+0x62/0xa0 [zfs]
kernel: [ 1305.077282]  [<ffffffffa07f7136>] zio_execute+0x126/0x350 [zfs]
kernel: [ 1305.077291]  [<ffffffffa0300aee>] taskq_thread+0x1fe/0x3f0 [spl]
kernel: [ 1305.077298]  [<ffffffff81091230>] ? try_to_wake_up+0x2b0/0x2b0
kernel: [ 1305.077306]  [<ffffffffa03008f0>] ? taskq_thread_spawn+0x70/0x70 [spl]
kernel: [ 1305.077310]  [<ffffffff81080700>] kthread+0xc0/0xd0
kernel: [ 1305.077314]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80
kernel: [ 1305.077320]  [<ffffffff8162022c>] ret_from_fork+0x7c/0xb0
kernel: [ 1305.077324]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80

@xhernandez
Author

In this last test, one thread of the user process was calling 'mkdir' and got stuck.

@dweeezil
Contributor

dweeezil commented Nov 4, 2015

@xhernandez Could you please add some debugging like this (completely un-tested patch) to get the object number:

[~/src/zfs] cardinal% git diff
diff --git a/module/zfs/dnode.c b/module/zfs/dnode.c
index 2858bbf..9d76f53 100644
--- a/module/zfs/dnode.c
+++ b/module/zfs/dnode.c
@@ -1798,8 +1798,12 @@ dnode_diduse_space(dnode_t *dn, int64_t delta)
        mutex_enter(&dn->dn_mtx);
        space = DN_USED_BYTES(dn->dn_phys);
        if (delta > 0) {
+               if (!(space + delta >= space))
+                       printk("%s line %d: obj %lld\n", __FUNCTION__, __LINE__, (u_longlong_t)dn->dn_object);
                ASSERT3U(space + delta, >=, space); /* no overflow */
        } else {
+               if (!(space >= -delta))
+                       printk("%s line %d: obj %lld\n", __FUNCTION__, __LINE__, (u_longlong_t)dn->dn_object);
                ASSERT3U(space, >=, -delta); /* no underflow */
        }
        space += delta;

Then you can examine it with zdb -ddddd and get a better handle on the corruption involved. You can also use one of my enhanced zdb's available in https://github.com/dweeezil/zfs/tree/zdb (which I just rebased on current master code); it adds additional debugging up to 7 "-d" options.

This feels like one of the types of dnode SA corruption which should have been fixed a long time ago. How long has this pool existed? If it's been around for a long time, it's possible the corruption has been present for a while. Or, if the system doesn't have ECC memory, it could have been caused by a bit-flip.

In any case, get the object number, dump it with zdb and that should give a better idea as to what's happening.

@xhernandez
Author

@dweeezil I'll apply the patch and get the information you requested.

The pool is completely new. On the test machine I'm using, the pool is recreated before running each test. This problem has happened on two different servers, both using ECC memory.

I'll update as soon as I get more info. Thanks.

@xhernandez
Author

A new panic:

kernel: [ 1344.465350] dnode_diduse_space line 1806: obj 16388
kernel: [ 1344.465356] VERIFY3(space >= -delta) failed (0 >= 3536635740)
kernel: [ 1344.465638] PANIC at dnode.c:1810:dnode_diduse_space()
kernel: [ 1344.465874] Showing stack for process 139682
kernel: [ 1344.465880] CPU: 6 PID: 139682 Comm: z_wr_iss Tainted: P           O--------------   3.10.0-1-pve #1
kernel: [ 1344.465882] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
kernel: [ 1344.465886]  ffffffffa0522d18 ffff881b97c07a38 ffffffff8161006a ffff881b97c07a48
kernel: [ 1344.465893]  ffffffffa030a764 ffff881b97c07be8 ffffffffa030a99d ffff881b97c07ab8
kernel: [ 1344.465899]  ffffffff8105acdd 0000000000000030 ffff881b97c07bf8 ffff881b97c07b88
kernel: [ 1344.465905] Call Trace:
kernel: [ 1344.465929]  [<ffffffff8161006a>] dump_stack+0x19/0x1b
kernel: [ 1344.465941]  [<ffffffffa030a764>] spl_dumpstack+0x44/0x50 [spl]
kernel: [ 1344.465949]  [<ffffffffa030a99d>] spl_panic+0xbd/0x100 [spl]
kernel: [ 1344.465960]  [<ffffffff8105acdd>] ? msg_print_text+0xdd/0x1b0
kernel: [ 1344.465966]  [<ffffffff81000a29>] ? _stext+0x861/0xe38
kernel: [ 1344.465989]  [<ffffffff8105bfc9>] ? console_unlock+0x209/0x3f0
kernel: [ 1344.466002]  [<ffffffff8160978b>] ? printk+0x61/0x63
kernel: [ 1344.466047]  [<ffffffffa0438ddb>] dnode_diduse_space+0x38b/0x3a0 [zfs]
kernel: [ 1344.466086]  [<ffffffffa0460482>] ? memory_dump+0x142/0x160 [zfs]
kernel: [ 1344.466122]  [<ffffffffa048cb2e>] ? vdev_lookup_top+0x2e/0xd0 [zfs]
kernel: [ 1344.466144]  [<ffffffffa040bac2>] dbuf_write_ready+0xc2/0x510 [zfs]
kernel: [ 1344.466164]  [<ffffffffa03fae3c>] arc_write_ready+0x6c/0x1d0 [zfs]
kernel: [ 1344.466202]  [<ffffffffa04eba37>] zio_ready+0x97/0x7b0 [zfs]
kernel: [ 1344.466211]  [<ffffffffa0307eb2>] ? taskq_member+0x62/0x70 [spl]
kernel: [ 1344.466244]  [<ffffffffa04e30a2>] ? zio_taskq_member.isra.4+0x62/0xa0 [zfs]
kernel: [ 1344.466277]  [<ffffffffa04e3206>] zio_execute+0x126/0x350 [zfs]
kernel: [ 1344.466284]  [<ffffffff8161756b>] ? _raw_spin_unlock_irqrestore+0x1b/0x40
kernel: [ 1344.466292]  [<ffffffffa0308aee>] taskq_thread+0x1fe/0x3f0 [spl]
kernel: [ 1344.466302]  [<ffffffff81091230>] ? try_to_wake_up+0x2b0/0x2b0
kernel: [ 1344.466307]  [<ffffffffa03088f0>] ? taskq_thread_spawn+0x70/0x70 [spl]
kernel: [ 1344.466311]  [<ffffffff81080700>] kthread+0xc0/0xd0
kernel: [ 1344.466314]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80
kernel: [ 1344.466320]  [<ffffffff8162022c>] ret_from_fork+0x7c/0xb0
kernel: [ 1344.466322]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80

The object info:

# zdb -ddddddd pool-sata/brick2 16388
Dataset pool-sata/brick2 [ZPL], ID 49, cr_txg 6, 28.0G, 16404 objects, rootbp DVA[0]=<0:25cc464000:400> DVA[1]=<0:42016e0ec00:400> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=1378L/1378P fill=16404 cksum=160522d119:6e85e77d5b6:1351d97b37057:272acf11a4b6aa

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     16388    1    16K    512      0    512  100.00  ZFS directory (K=inherit) (Z=inherit)
                                        244   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 0
        SA hdrsize 16
        SA layout 3
        path    <hidden>
        uid     2502
        gid     2513
        atime   Thu Nov  5 09:30:59 2015
        mtime   Thu Nov  5 09:30:59 2015
        ctime   Thu Nov  5 09:30:59 2015
        crtime  Thu Nov  5 09:30:59 2015
        gen     1377
        mode    40770
        size    2
        parent  16370
        links   2
        pflags  40800000044
        ndacl   3
        Misc SA sizes:
                DACL_ACES = 24
                ZNODE_ACL = N/A
        dump_znode_sa_xattr: sa_xattr_size=68 sa_size error=0
        SA packed dump sa_xattr_size=68: \001\001\000\000\000\000\000\000\000\000\000\001\000\000\000\060\000\000\000\060\000\000\000\014\164\162\165\163\164\145\144\056\147\146\151\144\000\000\000\012\000\000\000\020\030\043\352\174\344\062\114\140\265\324\143\112\271\131\232\246\000\000\000\000\000\000\000\000
        SA xattr dump:
                trusted.gfid[0]: 24
                trusted.gfid[1]: 35
                trusted.gfid[2]: 234
                trusted.gfid[3]: 124
                trusted.gfid[4]: 228
                trusted.gfid[5]: 50
                trusted.gfid[6]: 76
                trusted.gfid[7]: 96
                trusted.gfid[8]: 181
                trusted.gfid[9]: 212
                trusted.gfid[10]: 99
                trusted.gfid[11]: 74
                trusted.gfid[12]: 185
                trusted.gfid[13]: 89
                trusted.gfid[14]: 154
                trusted.gfid[15]: 166
        SA xattrs: 68 bytes, 1 entries

                trusted.gfid = \030#\352|\3442L`\265\324cJ\271Y\232\246
        microzap: 512 bytes, 0 entries

Indirect blocks:
               0 L0 EMBEDDED et=0 200L/1dP B=1377

                segment [0000000000000000, 0000000000000200) size   512
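
GlusterFS uses a 16-byte UUID as the gfid, so the byte list printed by zdb can be converted to the usual textual form, which may help when matching this directory against Gluster logs. A minimal sketch, using only the values shown above:

```python
import uuid

# trusted.gfid[0..15] as printed by zdb above (decimal byte values).
gfid_bytes = bytes([24, 35, 234, 124, 228, 50, 76, 96,
                    181, 212, 99, 74, 185, 89, 154, 166])

gfid = uuid.UUID(bytes=gfid_bytes)
print(gfid)  # 1823ea7c-e432-4c60-b5d4-634ab9599aa6
```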

@dweeezil
Contributor

dweeezil commented Nov 5, 2015

@xhernandez I guess I jumped to the wrong conclusion: your dnode is perfectly fine insofar as ZFS is concerned. This is exactly what I'd expect a newly-created directory's dnode to look like. The problem is that a space delta is being calculated incorrectly. It's trying to free 3536635740 bytes of space from a 0 byte object, which triggers the ASSERT. I'm looking through the code right now to try to see how this might happen. Is this a problem you can reproduce easily? Is it always triggered on the same directory? Have you got any idea what types of operations might be happening to that directory at the time? Adding files to it, deleting files from it?

At this point, I'm suspicious as to whether there may be some place where proper handling of embedded data blkptrs isn't happening and/or that their related macros are even working properly at all. This could also be some sort of race condition.

@xhernandez
Author

@dweeezil Yes, it seems that I can recreate this problem quite easily now (in 1 or 2 hours), especially after having compiled with debugging. The directory is not always the same, but it's always a directory (at least so far). I think this is important because regular files also have ACLs and extended attributes but do not seem to have any problem.

It's hard to say what it's doing exactly when it fails. The process replicates information from another mount point (formatted using XFS) to the ZFS pool. In at least two of the failures, one thread of the user process was blocked in an mkdir call. I cannot tell whether the mkdir is related to the failing directory, though. I'll try to find more detailed information about the steps it's doing.

I've done some more tests. It seems that the failure comes from the dbuf_write_ready() function, where delta is calculated. bp_get_dsize_sync(spa, bp) returns 1700, but bp_get_dsize_sync(spa, bp_orig) returns 3536637440 (zio->io_prev_space_delta is 0). bp_orig comes from zio->io_bp_orig and these are its contents:

00 00 00 30 00 00 00 0c 74 72 75 73 74 65 64 2e | ...0....trusted.
67 66 69 64 00 00 00 0a 00 00 00 10 5c bf 13 17 | gfid........\...
91 96 48 24 a4 cc 9d 92 60 22 d8 97 00 00 00 00 | ..H$....`"......
00 00 00 00 00 00 2c 00 00 00 00 00 00 00 00 00 | ......,.........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
9c 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

I'm not sure if it's important, but zio->io_bp_copy is exactly equal to zio->io_bp_orig.

The size is calculated from the DVAs, but because the blkptr_t seems to hold embedded data while its "embedded" flag is not set, the calculated size is wrong. I don't know where or how it gets modified.

I've dumped the full zio_t structure in case it helps:
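
The reported numbers can be reproduced from the bp_orig bytes with a short sketch. It rests on several assumptions that I have not verified against the 0.6.5.2 sources: each DVA's allocated size is the low 24 bits of its first word in 512-byte sectors, the number of DVAs of a non-embedded bp is the count of non-zero sizes, only that many leading DVA slots are summed, and no deflate ratio applies because the bogus vdev id does not resolve to a real vdev.

```python
import struct

# First 48 bytes of bp_orig (the three DVA slots), copied from the dump above.
bp_orig_dvas = bytes.fromhex(
    "000000300000000c747275737465642e"
    "676669640000000a000000105cbf1317"
    "91964824a4cc9d926022d89700000000"
)

def asize(dva_word0):
    # Assumed layout: low 24 bits of the first DVA word, in 512-byte sectors.
    return (dva_word0 & 0xFFFFFF) << 9

words = struct.unpack("<6Q", bp_orig_dvas)
asizes = [asize(words[i]) for i in (0, 2, 4)]

# Assumption: the DVA count is the number of non-zero sizes, and only that
# many leading DVA slots contribute to the total.
ndvas = sum(1 for a in asizes if a != 0)
dsize_orig = sum(asizes[:ndvas])

dsize_new = 1700        # bp_get_dsize_sync(spa, bp), as reported above
prev_space_delta = 0    # zio->io_prev_space_delta, as reported above
delta = dsize_new - dsize_orig - prev_space_delta

print(dsize_orig)  # 3536637440
print(delta)       # -3536635740
```

If these assumptions hold, the bogus size comes entirely from the second DVA slot, whose size field is made of the bytes "gfi" from the attribute name, and the resulting delta is the value that appears in the failed VERIFY3.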

Raw:

ffff881f80350460 | 31 00 00 00 00 00 00 00 c6 48 00 00 00 00 00 00 | 1........H......
ffff881f80350470 | 00 00 00 00 00 00 00 00 fe ff ff ff ff ff ff ff | ................
ffff881f80350480 | 07 00 00 00 0f 00 00 00 2c 00 00 00 00 02 59 19 | ........,.....Y.
ffff881f80350490 | 00 00 00 00 00 00 00 00 00 00 00 00 02 00 00 00 | ................
ffff881f803504a0 | 03 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 | ................
ffff881f803504b0 | 9d 05 00 00 00 00 00 00 00 a0 22 bd 0f 88 ff ff | ..........".....
ffff881f803504c0 | 80 0d b8 fb 04 88 ff ff 00 00 00 00 00 00 00 00 | ................
ffff881f803504d0 | 00 00 00 30 00 00 00 0c 74 72 75 73 74 65 64 2e | ...0....trusted.
ffff881f803504e0 | 67 66 69 64 00 00 00 0a 00 00 00 10 5c bf 13 17 | gfid........\...
ffff881f803504f0 | 91 96 48 24 a4 cc 9d 92 60 22 d8 97 00 00 00 00 | ..H$....`"......
ffff881f80350500 | 00 00 00 00 00 00 2c 00 00 00 00 00 00 00 00 00 | ......,.........
ffff881f80350510 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350520 | 9c 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350530 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350540 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350550 | 30 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00 | 0...............
ffff881f80350560 | 20 75 67 d1 1a 88 ff ff 20 75 67 d1 1a 88 ff ff |  ug..... ug.....
ffff881f80350570 | 30 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00 | 0....... .......
ffff881f80350580 | 80 05 35 80 1f 88 ff ff 80 05 35 80 1f 88 ff ff | ..5.......5.....
ffff881f80350590 | 00 00 00 00 00 00 00 00 60 04 35 80 1f 88 ff ff | ........`.5.....
ffff881f803505a0 | 40 2a a1 67 0e 88 ff ff d0 7d 2c a0 ff ff ff ff | @*.g.....},.....
ffff881f803505b0 | a0 50 2c a0 ff ff ff ff 80 c3 2c a0 ff ff ff ff | .P,.......,.....
ffff881f803505c0 | c0 67 70 91 1b 88 ff ff 00 00 00 00 00 00 00 00 | .gp.............
ffff881f803505d0 | 00 00 00 30 00 00 00 0c 74 72 75 73 74 65 64 2e | ...0....trusted.
ffff881f803505e0 | 67 66 69 64 00 00 00 0a 00 00 00 10 5c bf 13 17 | gfid........\...
ffff881f803505f0 | 91 96 48 24 a4 cc 9d 92 60 22 d8 97 00 00 00 00 | ..H$....`"......
ffff881f80350600 | 00 00 00 00 00 00 2c 00 00 00 00 00 00 00 00 00 | ......,.........
ffff881f80350610 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350620 | 9c 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350630 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350640 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350650 | 00 94 df ab 1b 88 ff ff 00 54 fa e1 04 88 ff ff | .........T......
ffff881f80350660 | 00 02 00 00 00 00 00 00 00 04 00 00 00 00 00 00 | ................
ffff881f80350670 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350680 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350690 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803506a0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803506b0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803506c0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803506d0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 | ................
ffff881f803506e0 | 38 20 2f 00 00 00 00 00 01 00 00 00 38 20 2f 00 | 8 /.........8 /.
ffff881f803506f0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350700 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350710 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350720 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350730 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350740 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350750 | 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 | ................
ffff881f80350760 | 00 00 00 00 00 00 00 00 60 04 35 80 1f 88 ff ff | ........`.5.....
ffff881f80350770 | 00 00 00 00 00 00 00 00 20 ec 6c f1 0e 88 ff ff | ........ .l.....
ffff881f80350780 | 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 | ................
ffff881f80350790 | 90 07 35 80 1f 88 ff ff 90 07 35 80 1f 88 ff ff | ..5.......5.....
ffff881f803507a0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803507b0 | 0a 00 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f803507c0 | f4 45 65 34 00 00 00 00 00 00 00 00 00 00 00 00 | .Ee4............
ffff881f803507d0 | d0 07 35 80 1f 88 ff ff d0 07 35 80 1f 88 ff ff | ..5.......5.....
ffff881f803507e0 | 00 00 00 00 00 00 00 00 e8 07 35 80 1f 88 ff ff | ..........5.....
ffff881f803507f0 | e8 07 35 80 1f 88 ff ff 01 00 00 00 00 00 00 00 | ..5.............
ffff881f80350800 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350810 | 00 00 00 00 00 00 00 00 04 00 04 00 00 00 00 00 | ................
ffff881f80350820 | 00 00 00 00 00 00 00 00 28 08 35 80 1f 88 ff ff | ........(.5.....
ffff881f80350830 | 28 08 35 80 1f 88 ff ff 00 00 00 00 00 00 00 00 | (.5.............
ffff881f80350840 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350850 | 80 da cb 81 ff ff ff ff 00 00 00 00 00 00 00 00 | ................
ffff881f80350860 | 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff | ................
ffff881f80350870 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881f80350880 | 00 00 00 00 00 00 00 00 88 08 35 80 1f 88 ff ff | ..........5.....
ffff881f80350890 | 88 08 35 80 1f 88 ff ff 46 4d 1b 00 00 00 00 00 | ..5.....FM......
ffff881f803508a0 | 30 01 3b a0 ff ff ff ff 60 04 35 80 1f 88 ff ff | 0.;.....`.5.....
ffff881f803508b0 | 00 8b 3b e2 1f 88 ff ff 01 00 00 00 00 00 00 00 | ..;.............

Interpreted output (made by hand, I hope it's right):

zio {
    .io_bookmark = {
        .zb_objset       = 0x31,
        .zb_object       = 0x4004,
        .zb_level        = 0x0,
        .zb_blkid        = 0xfffffffffffffffe
    },
    .io_prop = {
        .zp_checksum     = ZIO_CHECKSUM_FLETCHER_4,
        .zp_compress     = ZIO_COMPRESS_LZ4,
        .zp_type         = DMU_OT_SA,
        .zp_level        = 0x0,
        .zp_copies       = 0x2,
        .zp_dedup        = B_FALSE,
        .zp_dedup_verify = B_FALSE,
        .zp_nopwrite     = B_FALSE
    },
    .io_type             = ZIO_TYPE_WRITE,
    .io_child_type       = ZIO_CHILD_LOGICAL,
    .io_cmd              = 0x0,
    .io_priority         = ZIO_PRIORITY_ASYNC_WRITE,
    .io_reexecute        = 0x0,
    .io_state            = { 0x0, 0x0 },
    .io_txg              = 0x563,
    .io_spa              = 0xffff880fdba02000,
    .io_bp               = 0xffff881b922b8980,
    .io_bp_override      = 0x0,
    .io_bp_copy = {

    },
    .io_parent_list = {
        .list_size       = 0x30,
        .list_offset     = 0x10,
        .list_head       = {
            .next        = 0xffff8802fe3fc5b0,
            .prev        = 0xffff8802fe3fc5b0
        },
    },
    .io_child_list = {
        .list_size       = 0x30,
        .list_offset     = 0x20,
        .list_head = {
            .next        = 0xffff8805d33fdd00,
            .prev        = 0xffff8805d33fdd00
        }
    },
    .io_walk_link        = 0x0,
    .io_logical          = 0xffff8805d33fdbe0,
    .io_transform_stack  = 0xffff8807f270fe80,
    .io_ready            = 0xffffffffa03fadd0,
    .io_physdone         = 0xffffffffa03f80a0,
    .io_done             = 0xffffffffa03ff380,
    .io_private          = 0xffff881ac0f58900,
    .io_prev_space_delta = 0x0,
    .io_bp_orig = {

    },
    .io_data             = 0xffff881b6f918800,
    .io_orig_data        = 0xffff8805c8403c00,
    .io_size             = 0x200,
    .io_orig_size        = 0x400,
    .io_vd               = 0x0,
    .io_vsd              = 0x0,
    .io_vsd_ops          = 0x0,
    .io_offset           = 0x0,
    .io_timestamp        = 0x0,
    .io_delta            = 0x0,
    .io_delay            = 0x0,
    .io_queue_node = {
        .avl_child       = { 0x0, 0x0 },
        .avl_pcb         = 0x0
    },
    .io_offset_node = {
        .avl_child       = { 0x0, 0x0 },
        .avl_pcb         = 0x0
    },
    .io_flags            = 0x0,
    .io_stage            = ZIO_STAGE_READY,
    .io_pipeline         = ZIO_STAGE_ISSUE_ASYNC |
                           ZIO_STAGE_WRITE_BP_INIT |
                           ZIO_STAGE_CHECKSUM_GENERATE |
                           ZIO_STAGE_DVA_ALLOCATE |
                           ZIO_STAGE_READY |
                           ZIO_STAGE_VDEV_IO_START |
                           ZIO_STAGE_VDEV_IO_DONE |
                           ZIO_STAGE_VDEV_IO_ASSESS |
                           ZIO_STAGE_DONE,
    .io_orig_flags       = 0x0,
    .io_orig_stage       = ZIO_STAGE_OPEN,
    .io_orig_pipeline    = ZIO_STAGE_ISSUE_ASYNC |
                           ZIO_STAGE_WRITE_BP_INIT |
                           ZIO_STAGE_CHECKSUM_GENERATE |
                           ZIO_STAGE_DVA_ALLOCATE |
                           ZIO_STAGE_READY |
                           ZIO_STAGE_VDEV_IO_START |
                           ZIO_STAGE_VDEV_IO_DONE |
                           ZIO_STAGE_VDEV_IO_ASSESS |
                           ZIO_STAGE_DONE,
    .io_error            = 0x0,
    .io_child_error      = { 0x0, 0x0, 0x0, 0x0 },
    .io_children         = { { 0x0, 0x0 }, { 0x0, 0x0 }, { 0x0, 0x0 }, { 0x0, 0x0 } },
    .io_child_count      = 0x0,
    .io_phys_children    = 0x0,
    .io_parent_count     = 0x1,
    .io_stall            = 0x0,
    .io_gang_leader      = 0xffff8805d33fdbe0,
    .io_gang_tree        = 0x0,
    .io_executor         = 0xffff881fe54495a0,
    .io_waiter           = 0x0,
    .io_lock = { },
    .io_cv = { },
    .io_chsum_report
    .io_ena
    .io_tqent
}

I hope it gives some clue...

@xhernandez
Author

I'm not sure if it helps, but zio->io_data and zio->io_orig_data seem to contain extended attributes, and zio->io_data seems corrupted (at least its header is quite different):

zio->io_orig_data

ffff8804e1fa5400 | 5a 50 2f 00 04 04 78 03 01 01 00 00 00 00 00 00 | ZP/...x.........
ffff8804e1fa5410 | 00 00 00 01 00 00 00 30 00 00 00 30 00 00 00 0c | .......0...0....
ffff8804e1fa5420 | 74 72 75 73 74 65 64 2e 67 66 69 64 00 00 00 0a | trusted.gfid....
ffff8804e1fa5430 | 00 00 00 10 5c bf 13 17 91 96 48 24 a4 cc 9d 92 | ....\.....H$....
ffff8804e1fa5440 | 60 22 d8 97 00 00 00 80 00 00 00 88 00 00 00 18 | `"..............
ffff8804e1fa5450 | 73 79 73 74 65 6d 2e 70 6f 73 69 78 5f 61 63 6c | system.posix_acl
ffff8804e1fa5460 | 5f 64 65 66 61 75 6c 74 00 00 00 0a 00 00 00 54 | _default.......T
ffff8804e1fa5470 | 02 00 00 00 01 00 07 00 ff ff ff ff 02 00 07 00 | ................
ffff8804e1fa5480 | c4 09 00 00 02 00 07 00 c2 c6 2d 00 04 00 00 00 | ..........-.....
ffff8804e1fa5490 | ff ff ff ff 08 00 07 00 04 00 00 00 08 00 07 00 | ................
ffff8804e1fa54a0 | d0 09 00 00 08 00 07 00 24 0c 00 00 08 00 07 00 | ........$.......
ffff8804e1fa54b0 | c2 c6 2d 00 10 00 07 00 ff ff ff ff 20 00 00 00 | ..-......... ...
ffff8804e1fa54c0 | ff ff ff ff 00 00 00 80 00 00 00 80 00 00 00 17 | ................
ffff8804e1fa54d0 | 73 79 73 74 65 6d 2e 70 6f 73 69 78 5f 61 63 6c | system.posix_acl
ffff8804e1fa54e0 | 5f 61 63 63 65 73 73 00 00 00 00 0a 00 00 00 54 | _access........T
ffff8804e1fa54f0 | 02 00 00 00 01 00 07 00 ff ff ff ff 02 00 07 00 | ................
ffff8804e1fa5500 | c4 09 00 00 02 00 07 00 c2 c6 2d 00 04 00 00 00 | ..........-.....
ffff8804e1fa5510 | ff ff ff ff 08 00 07 00 04 00 00 00 08 00 07 00 | ................
ffff8804e1fa5520 | d0 09 00 00 08 00 07 00 24 0c 00 00 08 00 07 00 | ........$.......
ffff8804e1fa5530 | c2 c6 2d 00 10 00 07 00 ff ff ff ff 20 00 00 00 | ..-......... ...
ffff8804e1fa5540 | ff ff ff ff 00 00 00 3c 00 00 00 38 00 00 00 15 | .......<...8....
ffff8804e1fa5550 | 74 72 75 73 74 65 64 2e 67 6c 75 73 74 65 72 66 | trusted.glusterf
ffff8804e1fa5560 | 73 2e 64 68 74 00 00 00 00 00 00 0a 00 00 00 10 | s.dht...........
ffff8804e1fa5570 | 00 00 00 01 00 00 00 00 00 00 00 00 52 a6 66 df | ............R.f.
ffff8804e1fa5580 | 00 00 00 a8 00 00 00 a8 00 00 00 17 74 72 75 73 | ............trus
ffff8804e1fa5590 | 74 65 64 2e 53 47 49 5f 41 43 4c 5f 44 45 46 41 | ted.SGI_ACL_DEFA
ffff8804e1fa55a0 | 55 4c 54 00 00 00 00 0a 00 00 00 7c 00 00 00 0a | ULT........|....
ffff8804e1fa55b0 | 00 00 00 01 ff ff ff ff 00 07 00 00 00 00 00 02 | ................
ffff8804e1fa55c0 | 00 00 09 c4 00 07 00 00 00 00 00 02 00 2d c6 c2 | .............-..
ffff8804e1fa55d0 | 00 07 00 00 00 00 00 04 ff ff ff ff 00 00 00 00 | ................
ffff8804e1fa55e0 | 00 00 00 08 00 00 00 04 00 07 00 00 00 00 00 08 | ................
ffff8804e1fa55f0 | 00 00 09 d0 00 07 00 00 00 00 00 08 00 00 0c 24 | ...............$
ffff8804e1fa5600 | 00 07 00 00 00 00 00 08 00 2d c6 c2 00 07 00 00 | .........-......
ffff8804e1fa5610 | 00 00 00 10 ff ff ff ff 00 07 00 00 00 00 00 20 | ...............
ffff8804e1fa5620 | ff ff ff ff 00 00 00 00 00 00 00 a4 00 00 00 a8 | ................
ffff8804e1fa5630 | 00 00 00 14 74 72 75 73 74 65 64 2e 53 47 49 5f | ....trusted.SGI_
ffff8804e1fa5640 | 41 43 4c 5f 46 49 4c 45 00 00 00 0a 00 00 00 7c | ACL_FILE.......|
ffff8804e1fa5650 | 00 00 00 0a 00 00 00 01 ff ff ff ff 00 07 00 00 | ................
ffff8804e1fa5660 | 00 00 00 02 00 00 09 c4 00 07 00 00 00 00 00 02 | ................
ffff8804e1fa5670 | 00 2d c6 c2 00 07 00 00 00 00 00 04 ff ff ff ff | .-..............
ffff8804e1fa5680 | 00 00 00 00 00 00 00 08 00 00 00 04 00 07 00 00 | ................
ffff8804e1fa5690 | 00 00 00 08 00 00 09 d0 00 07 00 00 00 00 00 08 | ................
ffff8804e1fa56a0 | 00 00 0c 24 00 07 00 00 00 00 00 08 00 2d c6 c2 | ...$.........-..
ffff8804e1fa56b0 | 00 07 00 00 00 00 00 10 ff ff ff ff 00 07 00 00 | ................
ffff8804e1fa56c0 | 00 00 00 20 ff ff ff ff 00 00 00 00 00 00 00 34 | ... ...........4
ffff8804e1fa56d0 | 00 00 00 38 00 00 00 11 74 72 75 73 74 65 64 2e | ...8....trusted.
ffff8804e1fa56e0 | 61 66 72 2e 64 69 72 74 79 00 00 00 00 00 00 0a | afr.dirty.......
ffff8804e1fa56f0 | 00 00 00 0c 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa5700 | 00 00 00 3c 00 00 00 40 00 00 00 19 74 72 75 73 | ...<...@....trus
ffff8804e1fa5710 | 74 65 64 2e 61 66 72 2e 73 61 74 61 2d 63 6c 69 | ted.afr.sata-cli
ffff8804e1fa5720 | 65 6e 74 2d 31 00 00 00 00 00 00 0a 00 00 00 0c | ent-1...........
ffff8804e1fa5730 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3c | ...............<
ffff8804e1fa5740 | 00 00 00 40 00 00 00 19 74 72 75 73 74 65 64 2e | ...@....trusted.
ffff8804e1fa5750 | 61 66 72 2e 73 61 74 61 2d 63 6c 69 65 6e 74 2d | afr.sata-client-
ffff8804e1fa5760 | 33 00 00 00 00 00 00 0a 00 00 00 0c 00 00 00 00 | 3...............
ffff8804e1fa5770 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa5780 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa5790 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57a0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57b0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57c0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57d0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57e0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff8804e1fa57f0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

zio->io_data

ffff881babdf9400 | 00 00 01 ac b4 5a 50 2f 00 04 04 78 03 01 01 00 | .....ZP/...x....
ffff881babdf9410 | 01 00 00 0a 00 13 30 04 00 f3 3a 0c 74 72 75 73 | ......0...:.trus
ffff881babdf9420 | 74 65 64 2e 67 66 69 64 00 00 00 0a 00 00 00 10 | ted.gfid........
ffff881babdf9430 | 5c bf 13 17 91 96 48 24 a4 cc 9d 92 60 22 d8 97 | \.....H$....`"..
ffff881babdf9440 | 00 00 00 80 00 00 00 88 00 00 00 18 73 79 73 74 | ............syst
ffff881babdf9450 | 65 6d 2e 70 6f 73 69 78 5f 61 63 6c 5f 64 65 66 | em.posix_acl_def
ffff881babdf9460 | 61 75 6c 74 3c 00 21 54 02 61 00 e0 07 00 ff ff | ault<.!T.a......
ffff881babdf9470 | ff ff 02 00 07 00 c4 09 00 00 08 00 71 c2 c6 2d | ............q..-
ffff881babdf9480 | 00 04 00 00 18 00 31 08 00 07 0c 00 00 08 00 22 | ......1........"
ffff881babdf9490 | d0 09 08 00 22 24 0c 08 00 00 28 00 13 10 40 00 | ...."$....(...@.
ffff881babdf94a0 | 13 20 30 00 03 80 00 00 84 00 1d 17 80 00 60 61 | . 0...........`a
ffff881babdf94b0 | 63 63 65 73 73 dc 00 00 bc 00 0f 80 00 45 95 3c | ccess........E.<
ffff881babdf94c0 | 00 00 00 38 00 00 00 15 30 01 10 6c 38 01 60 72 | ...8....0..l8.`r
ffff881babdf94d0 | 66 73 2e 64 68 fd 00 04 3c 01 01 ff 00 01 0f 00 | fs.dh...<.......
ffff881babdf94e0 | a3 00 00 52 a6 66 df 00 00 00 a8 04 00 14 17 3c | ...R.f.........<
ffff881babdf94f0 | 00 f0 00 53 47 49 5f 41 43 4c 5f 44 45 46 41 55 | ...SGI_ACL_DEFAU
ffff881babdf9500 | 4c 54 2b 00 00 bc 00 13 7c 44 00 10 01 3c 01 20 | LT+.....|D...<.
ffff881babdf9510 | 00 07 17 00 65 00 02 00 00 09 c4 0c 00 33 2d c6 | ....e........3-.
ffff881babdf9520 | c2 0c 00 11 04 24 00 01 23 00 63 00 08 00 00 00 | .....$..#.c.....
ffff881babdf9530 | 04 18 00 56 08 00 00 09 d0 0c 00 25 0c 24 0c 00 | ...V.......%.$..
ffff881babdf9540 | 06 3c 00 11 10 3c 00 02 60 00 11 20 0c 00 02 48 | .<...<..`.. ...H
ffff881babdf9550 | 00 13 a4 a8 00 1c 14 a8 00 43 46 49 4c 45 9c 00 | .........CFILE..
ffff881babdf9560 | 0f a4 00 6d 13 34 88 01 14 11 a4 00 92 61 66 72 | ...m.4.......afr
ffff881babdf9570 | 2e 64 69 72 74 79 c4 00 00 48 01 12 0c 0b 00 05 | .dirty...H......
ffff881babdf9580 | 02 00 00 bc 01 58 40 00 00 00 19 34 00 d2 73 61 | .....X@....4..sa
ffff881babdf9590 | 74 61 2d 63 6c 69 65 6e 74 2d 31 2b 00 0f 3c 00 | ta-client-1+..<.
ffff881babdf95a0 | 22 1f 33 3c 00 07 0f 02 00 6d 50 00 00 00 00 00 | ".3<.....mP.....
ffff881babdf95b0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881babdf95c0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881babdf95d0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881babdf95e0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
ffff881babdf95f0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
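
One possible reading of the two dumps, not confirmed elsewhere in this thread: the dataset uses lz4 compression, and the start of zio->io_data looks like an LZ4 block behind a 4-byte big-endian length prefix, with literals that match the start of zio->io_orig_data. The sketch below decodes only that prefix and the first LZ4 sequence from the bytes shown above. The struct and function names are made up for illustration and are not ZFS code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 19 bytes of the zio->io_data dump (ffff881babdf9400 onwards). */
static const uint8_t io_data_head[19] = {
	0x00, 0x00, 0x01, 0xac, 0xb4, 0x5a, 0x50, 0x2f,
	0x00, 0x04, 0x04, 0x78, 0x03, 0x01, 0x01, 0x00,
	0x01, 0x00, 0x00,
};

/* Hypothetical result type, not a ZFS structure. */
struct toy_lz4_head {
	uint32_t payload_len;	/* value of the big-endian length prefix */
	size_t literal_len;	/* literal bytes in the first sequence */
	size_t match_offset;	/* distance back into the output */
	size_t match_len;	/* bytes copied by the first match */
};

/* Decode the assumed length prefix and the first LZ4 sequence. */
static struct toy_lz4_head
toy_decode_lz4_head(const uint8_t *buf, size_t len)
{
	struct toy_lz4_head h;
	size_t pos = 4;
	uint8_t token, b;

	assert(buf != NULL && len >= 5);
	h.payload_len = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	    ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
	token = buf[pos++];

	/* High nibble: literal count, extended while the extra byte is 255. */
	h.literal_len = token >> 4;
	if (h.literal_len == 15) {
		do {
			assert(pos < len);
			b = buf[pos++];
			h.literal_len += b;
		} while (b == 255);
	}
	pos += h.literal_len;

	/* Two-byte little-endian match offset follows the literals. */
	assert(pos + 2 <= len);
	h.match_offset = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
	pos += 2;

	/* Low nibble: match length minus 4, with the same extension rule. */
	h.match_len = token & 0x0f;
	if (h.match_len == 15) {
		do {
			assert(pos < len);
			b = buf[pos++];
			h.match_len += b;
		} while (b == 255);
	}
	h.match_len += 4;
	return (h);
}
```

With these bytes the prefix is 0x1ac (428), which together with the 4-byte prefix ends at offset 0x1b0, where the non-zero data in the dump stops. The first sequence is 11 literals (the same "ZP/" header bytes as in zio->io_orig_data) followed by an 8-byte match at offset 1, that is, a run of zeros. If this reading is right, the different-looking header would be expected rather than a sign of corruption in this buffer.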

@dweeezil
Contributor

dweeezil commented Nov 5, 2015

@xhernandez That's all great information! The on-disk copy of the dnode, according to your previous zdb output, is just fine; the single blkptr is clearly set as embedded (otherwise zdb wouldn't display it as such). After looking through the code a bit more earlier today, I had kind of come to the conclusion that the bogus size was being computed due to an improperly constructed embedded data blkptr.

One very interesting outcome of the embedded data blkptr feature, which may very well have nothing to do with this issue, is that when a spill block is needed, the contents of the spill block can sometimes themselves be squeezed into an embedded data blkptr, which eliminates the need for a "true" spill block.

I've just started going over the data in your posting, but I'm wondering if maybe this directory is being expanded to the point where it needs another data block to hold the zap and, at the same time, the SA is getting kicked into an (embedded) spill block. As I mentioned, your on-disk copy according to zdb is a perfectly well-formed empty directory, but the bug you're seeing is getting tripped when some subsequent operations are performed on it.

@dweeezil
Contributor

dweeezil commented Nov 5, 2015

@xhernandez I almost forgot to ask: What are the contents of this directory on the source system? Is it empty? Does it contain a lot of files?

@xhernandez
Author

The last directory that failed contains only 2 regular files. In this case it seems that both files were already stored on disk before the crash (or the user process was able to continue creating files after the problem, but I doubt it). The zdb output for this one is:

Dataset pool-sata/brick2 [ZPL], ID 49, cr_txg 6, 28.4G, 19899 objects, rootbp DVA[0]=<0:4029ffbc00:400> DVA[1]=<0:42002413000:400> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=1457L/1457P fill=19899 cksum=1398d95339:69d9f4e9a4a:130643bfe43b7:268bd18915bf26

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     18630    1    16K    512      0    512  100.00  ZFS directory (K=inherit) (Z=inherit)
                                        244   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    <hidden>
        uid     2502
        gid     2513
        atime   Thu Nov  5 13:27:38 2015
        mtime   Thu Nov  5 13:39:18 2015
        ctime   Thu Nov  5 13:39:18 2015
        crtime  Thu Nov  5 13:27:38 2015
        gen     1435
        mode    40770
        size    4
        parent  12602
        links   2
        pflags  40800000044
        SA xattrs: 68 bytes, 1 entries

                trusted.gfid = \\277\023\027\221\226H$\244\314\235\222`"\330\227
        microzap: 512 bytes, 2 entries

                <file1> = 18655 (type: Regular File)
                <file2> = 18657 (type: Regular File)
Indirect blocks:
               0 L0 EMBEDDED et=0 200L/6eP B=1446

                segment [0000000000000000, 0000000000000200) size   512

I need to check it, but I'm pretty sure that after replicating the contents of a directory, the process updates some extended attributes of the parent directory. Maybe that is when the problem happens, at least this time.

@dweeezil
Contributor

dweeezil commented Nov 5, 2015

@xhernandez Could you please post the output of zdb -dddd 5 6. I'm thinking your corrupted-looking zio buffer might be a spill block. Object 6 should show us the SA layouts it's trying to use and I'm curious if there are any layouts with only 2 entries.

@dweeezil
Contributor

dweeezil commented Nov 5, 2015

@xhernandez Correction, please do zdb -dddd <pool>/<fs> 5 6.

@xhernandez
Author

# zdb -dddd pool-sata/brick2 5 6
Dataset pool-sata/brick2 [ZPL], ID 49, cr_txg 6, 28.4G, 19899 objects, rootbp DVA[0]=<0:4029ffbc00:400> DVA[1]=<0:42002413000:400> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=1457L/1457P fill=19899 cksum=1398d95339:69d9f4e9a4a:130643bfe43b7:268bd18915bf26

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         5    1    16K  1.50K  1.50K  1.50K  100.00  SA attr registration
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        microzap: 1536 bytes, 21 entries

                ZPL_PARENT =  8000007 : [8:0:7]
                ZPL_DACL_ACES =  40013 : [0:4:19]
                ZPL_UID =  800000c : [8:0:12]
                ZPL_DACL_COUNT =  8000010 : [8:0:16]
                ZPL_ATIME =  10000000 : [16:0:0]
                ZPL_LINKS =  8000008 : [8:0:8]
                ZPL_SYMLINK =  30011 : [0:3:17]
                ZPL_RDEV =  800000a : [8:0:10]
                ZPL_CRTIME =  10000003 : [16:0:3]
                ZPL_GEN =  8000004 : [8:0:4]
                ZPL_DXATTR =  30014 : [0:3:20]
                ZPL_CTIME =  10000002 : [16:0:2]
                ZPL_MTIME =  10000001 : [16:0:1]
                ZPL_SCANSTAMP =  20030012 : [32:3:18]
                ZPL_GID =  800000d : [8:0:13]
                ZPL_FLAGS =  800000b : [8:0:11]
                ZPL_PAD =  2000000e : [32:0:14]
                ZPL_ZNODE_ACL =  5803000f : [88:3:15]
                ZPL_SIZE =  8000006 : [8:0:6]
                ZPL_XATTR =  8000009 : [8:0:9]
                ZPL_MODE =  8000005 : [8:0:5]

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         6    1    16K    16K  10.0K    32K  100.00  SA attr layouts
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 1
        Fat ZAP stats:
                Pointer table:
                        1024 elements
                        zt_blk: 0
                        zt_numblks: 0
                        zt_shift: 10
                        zt_blks_copied: 0
                        zt_nextblk: 0
                ZAP entries: 5
                Leaf blocks: 1
                Total blocks: 2
                zap_block_type: 0x8000000000000001
                zap_magic: 0x2f52ab2ab
                zap_salt: 0x347d676d
                Leafs with 2^n pointers:
                          9:      1 *
                Blocks with n*5 entries:
                          1:      1 *
                Blocks n/10 full:
                          1:      1 *
                Entries with n chunks:
                          3:      2 **
                          4:      3 ***
                Buckets with n entries:
                          0:    507 ****************************************
                          1:      5 *

                4 = [ 20 ]
                3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  20 ]
                6 = [ 17 ]
                2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]
                5 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  17 ]
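
The registration values that zdb prints for object 5 appear to pack three fields, which zdb shows in brackets as [length:byteswap:attribute number]. Below is a minimal sketch of that unpacking. The bit positions are inferred from the values above, the field widths are an assumption, and the names are hypothetical rather than taken from the ZFS headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded form of one "SA attr registration" value. */
struct toy_sa_reg {
	unsigned length;	/* attribute size in bytes, 0 if variable */
	unsigned bswap;		/* byteswap function index */
	unsigned attr_num;	/* attribute number used in the layouts */
};

/* Split a registration value as printed by zdb (in hex) into fields. */
static struct toy_sa_reg
toy_sa_reg_decode(uint64_t v)
{
	struct toy_sa_reg r;

	assert((v >> 40) == 0);	/* assumed: nothing stored above bit 39 */
	r.length = (unsigned)((v >> 24) & 0xffff);
	r.bswap = (unsigned)((v >> 16) & 0xff);
	r.attr_num = (unsigned)(v & 0xffff);
	return (r);
}
```

For example, 0x5803000f (ZPL_ZNODE_ACL) decodes to [88:3:15] and 0x40013 (ZPL_DACL_ACES) to [0:4:19], matching the zdb output. Read this way, the layouts of object 6 are lists of attribute numbers: layout 2 and layout 3 differ only in the trailing 20, which is the number registered for ZPL_DXATTR.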

@xhernandez
Author

I've seen that both zio->io_bp_copy and zio->io_bp_orig are equal, but the only place where they are set explicitly to the same value is zio_create() (at least I haven't been able to locate any other place).

I've added a check in zio_create() to verify that the passed blkptr_t is ok and it failed:

kernel: [ 2423.449656] PANIC at zio.c:542:zio_create()
kernel: [ 2423.449843] Showing stack for process 5263
kernel: [ 2423.449848] CPU: 11 PID: 5263 Comm: txg_sync Tainted: P           O--------------   3.10.0-1-pve #1
kernel: [ 2423.449850] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
kernel: [ 2423.449854]  ffffffffa0d6e180 ffff880fe4e994f8 ffffffff8161006a ffff880fe4e99508
kernel: [ 2423.449862]  ffffffffa0357764 ffff880fe4e996a8 ffffffffa035799d ffff880fe4e99568
kernel: [ 2423.449867]  ffffffffa0c975d0 ffff881a00000030 ffff880fe4e996b8 ffff880fe4e99648
kernel: [ 2423.449873] Call Trace:
kernel: [ 2423.449895]  [<ffffffff8161006a>] dump_stack+0x19/0x1b
kernel: [ 2423.449910]  [<ffffffffa0357764>] spl_dumpstack+0x44/0x50 [spl]
kernel: [ 2423.449919]  [<ffffffffa035799d>] spl_panic+0xbd/0x100 [spl]
kernel: [ 2423.449971]  [<ffffffffa0c975d0>] ? spa_taskq_dispatch_ent+0x90/0x120 [zfs]
kernel: [ 2423.449980]  [<ffffffffa0355e96>] ? taskq_dispatch_ent+0x66/0x170 [spl]
kernel: [ 2423.450022]  [<ffffffffa0d041c0>] ? zio_taskq_member.isra.4+0xa0/0xa0 [zfs]
kernel: [ 2423.450060]  [<ffffffffa0c975d0>] ? spa_taskq_dispatch_ent+0x90/0x120 [zfs]
kernel: [ 2423.450068]  [<ffffffff811890f5>] ? kmem_cache_alloc+0x35/0x1e0
kernel: [ 2423.450076]  [<ffffffffa0353e19>] ? spl_kmem_cache_alloc+0x69/0x150 [spl]
kernel: [ 2423.450084]  [<ffffffffa0353e19>] ? spl_kmem_cache_alloc+0x69/0x150 [spl]
kernel: [ 2423.450122]  [<ffffffffa0d07b66>] zio_create+0x236/0x800 [zfs]
kernel: [ 2423.450161]  [<ffffffffa0d0885b>] zio_write+0x12b/0x1f0 [zfs]
kernel: [ 2423.450185]  [<ffffffffa0c20380>] ? l2arc_feed_thread+0x780/0x780 [zfs]
kernel: [ 2423.450206]  [<ffffffffa0c1af60>] arc_write+0x130/0x290 [zfs]
kernel: [ 2423.450227]  [<ffffffffa0c1bdd0>] ? arc_cksum_compute.isra.10+0x130/0x130 [zfs]
kernel: [ 2423.450249]  [<ffffffffa0c190a0>] ? arc_evictable_memory+0x80/0x80 [zfs]
kernel: [ 2423.450273]  [<ffffffffa0c20380>] ? l2arc_feed_thread+0x780/0x780 [zfs]
kernel: [ 2423.450299]  [<ffffffffa0c2d1cf>] dbuf_write.isra.10+0x23f/0x6c0 [zfs]
kernel: [ 2423.450322]  [<ffffffffa0c2ca00>] ? dbuf_destroy+0x490/0x490 [zfs]
kernel: [ 2423.450344]  [<ffffffffa0c2b9d0>] ? dbuf_set_data+0x100/0x100 [zfs]
kernel: [ 2423.450366]  [<ffffffffa0c307f0>] ? dbuf_read_done+0x2d0/0x2d0 [zfs]
kernel: [ 2423.450401]  [<ffffffffa0c89313>] ? refcount_add_many+0xb3/0x150 [zfs]
kernel: [ 2423.450424]  [<ffffffffa0c33507>] dbuf_sync_leaf+0x197/0x910 [zfs]
kernel: [ 2423.450462]  [<ffffffffa0d09f90>] ? zio_nowait+0x190/0x310 [zfs]
kernel: [ 2423.450485]  [<ffffffffa0c33d3c>] ? dbuf_sync_list+0xbc/0x160 [zfs]
kernel: [ 2423.450508]  [<ffffffffa0c33d65>] dbuf_sync_list+0xe5/0x160 [zfs]
kernel: [ 2423.450538]  [<ffffffffa0c5d33d>] dnode_sync+0x51d/0xfc0 [zfs]
kernel: [ 2423.450573]  [<ffffffffa0c893c6>] ? refcount_add+0x16/0x20 [zfs]
kernel: [ 2423.450600]  [<ffffffffa0c44107>] dmu_objset_sync_dnodes+0x97/0x1f0 [zfs]
kernel: [ 2423.450626]  [<ffffffffa0c44415>] dmu_objset_sync+0x1b5/0x450 [zfs]
kernel: [ 2423.450651]  [<ffffffffa0c42430>] ? dmu_objset_userspace_present+0x20/0x20 [zfs]
kernel: [ 2423.450676]  [<ffffffffa0c42e40>] ? copies_changed_cb+0xa0/0xa0 [zfs]
kernel: [ 2423.450706]  [<ffffffffa0c662e2>] dsl_dataset_sync+0x82/0x160 [zfs]
kernel: [ 2423.450738]  [<ffffffffa0c72e6f>] dsl_pool_sync+0xef/0x5d0 [zfs]
kernel: [ 2423.450773]  [<ffffffffa0c9496d>] spa_sync+0x46d/0xdf0 [zfs]
kernel: [ 2423.450780]  [<ffffffff81089495>] ? __wake_up_common+0x55/0x90
kernel: [ 2423.450786]  [<ffffffff81019ae9>] ? read_tsc+0x9/0x20
kernel: [ 2423.450824]  [<ffffffffa0cac176>] txg_sync_thread+0x3d6/0x700 [zfs]
kernel: [ 2423.450860]  [<ffffffffa0cabda0>] ? txg_quiesce_thread+0x500/0x500 [zfs]
kernel: [ 2423.450869]  [<ffffffffa0354948>] thread_generic_wrapper+0x78/0x90 [spl]
kernel: [ 2423.450877]  [<ffffffffa03548d0>] ? spl_vmem_fini+0x10/0x10 [spl]
kernel: [ 2423.450883]  [<ffffffff81080700>] kthread+0xc0/0xd0
kernel: [ 2423.450887]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80
kernel: [ 2423.450894]  [<ffffffff8162022c>] ret_from_fork+0x7c/0xb0
kernel: [ 2423.450898]  [<ffffffff81080640>] ? flush_kthread_worker+0x80/0x80

@dweeezil
Contributor

dweeezil commented Nov 6, 2015

@xhernandez Could you please post the contents (likely as a gist since it might be long) of /proc/spl/kstat/zfs/dbgmsg when the problem happens. Since you're running a debug build, it should be recording the debug messages there.

@dweeezil
Contributor

dweeezil commented Nov 6, 2015

@xhernandez What check did you perform against the blkptr to trigger the panic above?

@xhernandez
Author

@dweeezil I've uploaded the dbgmsg file after a panic.

I did a simplified check to detect the corruption I'm seeing in my tests:

for (i = 0; i < BP_GET_NDVAS(bp); i++) {
    ASSERT3U(DVA_GET_VDEV(&bp->blk_dva[i]), <, 2);
}

When it failed, the vdev was 0xC000000.
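
To see where a vdev of 0xC000000 can come from, one can take the 16 bytes at offset 0x18 of the zio->io_orig_data dump above (the end of the SA header followed by "trusted.") and interpret them as the two 64-bit words of a DVA. This is a sketch under assumptions: a little-endian host, a 32-bit vdev id in the upper half of word 0, a 24-bit allocated size in the low bits of word 0, and a 63-bit offset in word 1, with size and offset stored in 512-byte units. The helper names are hypothetical and are not the ZFS macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes 0x18..0x27 of the zio->io_orig_data dump: SA header tail + "trusted." */
static const uint8_t xattr_bytes[16] = {
	0x00, 0x00, 0x00, 0x30, 0x00, 0x00, 0x00, 0x0c,
	0x74, 0x72, 0x75, 0x73, 0x74, 0x65, 0x64, 0x2e,
};

/* Hypothetical decoded DVA, not the ZFS dva_t. */
struct toy_dva {
	uint64_t vdev;
	uint64_t offset;
	uint64_t asize;
};

/* Load 8 bytes as a little-endian 64-bit word. */
static uint64_t
toy_load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return (v);
}

/* Interpret 16 raw bytes as a DVA under the layout assumed above. */
static struct toy_dva
toy_dva_decode(const uint8_t *raw)
{
	struct toy_dva d;
	uint64_t w0, w1;

	assert(raw != NULL);
	w0 = toy_load_le64(raw);
	w1 = toy_load_le64(raw + 8);
	d.vdev = w0 >> 32;
	d.asize = (w0 & 0xffffffULL) << 9;
	d.offset = (w1 & 0x7fffffffffffffffULL) << 9;	/* wraps at 64 bits */
	return (d);
}
```

Under these assumptions the bytes decode to vdev 0xC000000, offset 0xC8CAE8E6EAE4E800 and size 0, the same triple as in the original panic message, which supports the idea that xattr data is being read as a block pointer.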

@xhernandez
Author

@dweeezil I've backtraced the blkptr_t that gets corrupted and I see something that may or may not be right (I've only just started to dig into the zfs code).

The blkptr_t belongs to a dmu_buf_impl_t that is added simultaneously to two different lists of dn->dn_dirty_records[x] in dbuf_dirty(). Both are added because db->db_blkid is DMU_BONUS_BLKID or DMU_SPILL_BLKID.

Later, the first one is removed from the list and dbuf_sync_leaf() is called. This leads to dbuf_write(). Before the dbuf_write(), the blkptr_t is ok.

Some time later, the second dirty record is removed from the list and dbuf_sync_leaf() is called. At this point, the blkptr_t is already corrupted.

Timing of the events (in seconds):

2975.674965: The first dirty record referencing the dmu_buf_impl_t is added to a list
2975.675777: The second dirty record referencing the same dmu_buf_impl_t is added to a list
2975.677525: The first dirty record is used in dbuf_sync_leaf().
2976.093830: The second dirty record is used in dbuf_sync_leaf().

Could this be the cause of the problem, or is this normal behaviour?

@dweeezil
Contributor

@xhernandez Interesting bit of tracing there. I'll try to get back on this today. It was fairly clear to me that the problem is occurring either as part of the transition of a blkptr to/from BP_IS_EMBEDDED() or the transition to/from a dnode needing or not needing a spill block.

With embedded data blkptrs, there are a number of new ways to represent a dnode. This feels remarkably similar to the type of problem fixed by 4254acb.

@behlendorf behlendorf modified the milestones: 0.8.0, 0.7.0 Mar 26, 2016
@samuelxhu

@xhernandez What is the current status of the debugging? I am keeping a close eye on the issues you encountered, as I may build a Gluster cluster on top of ZoL.

I just want to know whether this issue is a showstopper for Gluster on top of ZoL.

@xhernandez
Author

@samuelxhu Due to other priorities, I've been unable to continue debugging it until this week. I've been able to reproduce the problem again and I expect to find enough information to solve the bug.

AFAIK the bug only happens when xattr=sa and ACLs are used together. We have been using Gluster with xattr=sa for a long time and we haven't seen any issue with the latest versions. However, I can't guarantee that the bug won't manifest itself without ACLs.

I'll post more information as soon as I have something interesting.

@samuelxhu

samuelxhu commented May 14, 2016

@xhernandez Great information. Until the bug is fixed, it seems safer to use xattr=sa only and leave out the posixacl option. I just wonder why (or when) we need to set posixacl if Gluster on top of ZoL can work properly without it.

@xhernandez
Author

@samuelxhu ACL's are typically needed when you use samba on top of Gluster, for example.

@xhernandez
Author

@dweeezil I think I've found something about this bug.

The sequence of actions seems to be the following. However, I'm unable to verify it because I would need to force the creation of a new transaction group at certain points and I don't know how to do that.

  1. Some processing is done to create an entry and add some attributes. I think the exact details of this step are not relevant to the bug.
  2. After some processing, the entry has a single xattr that fits into the bonus buffer. No spill buffer is needed (though one was previously allocated because there were more xattrs; not sure if this is important). Note that this xattr occupies the same space as the blkptr_t of the bonus buffer.
  3. A new transaction group is created, but the previous one has not been synced yet.
  4. Additional xattrs are added. The bonus buffer is dirtied. Since the previous transaction group is still pending, the current contents of this buffer are copied into a newly allocated buffer (BUF1) of the dirty record. After adding more xattrs, all of them are moved into a spill buffer. The blkptr_t for the spill buffer is taken from the corresponding address of the dnode.
  5. A new transaction group is created.
  6. More actions on the entry cause the bonus buffer to be dirtied and the current contents of the bonus buffer to be copied into a new buffer (BUF2) allocated for the dirty record.
  7. The first transaction group is processed. BUF1 is copied into the dnode memory buffer. This overwrites the blkptr_t structure stored at the end of the bonus buffer with the value of the xattr defined at step 2. Note that this blkptr_t was already in use, since a spill buffer was allocated later, at step 4.
  8. From this point on, any operation needing the blkptr_t of the spill buffer will run into trouble. Note that even when BUF2 is copied into the dnode space, it only overwrites SA data; it doesn't touch the area reserved for the blkptr_t, since the bonus data no longer extends into it.
Not sure if this is a complete (or correct) description of what happens or more information is needed. I still don't understand all the internals of ZFS so maybe I have misinterpreted something.
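
The core of steps 2, 4 and 7 can be illustrated with a toy model, assuming (as the description above implies) that the spill block pointer occupies the tail of the bonus area. The sizes and all names below are chosen for illustration only; this is not ZFS code and it ignores transaction groups entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sizes: a bonus area whose last bytes double as a blkptr. */
enum { TOY_BONUS_LEN = 320, TOY_BLKPTR_LEN = 128 };

struct toy_dnode {
	uint8_t bonus[TOY_BONUS_LEN];
};

/* The spill pointer lives in the tail of the bonus area. */
static uint8_t *
toy_spill_bp(struct toy_dnode *dn)
{
	return (dn->bonus + TOY_BONUS_LEN - TOY_BLKPTR_LEN);
}

/*
 * Replay steps 2, 4 and 7 for an xattr payload of xattr_len bytes.
 * Returns 1 if the spill pointer was clobbered by the stale snapshot.
 */
static int
toy_replay(size_t xattr_len)
{
	struct toy_dnode dn;
	uint8_t buf1[TOY_BONUS_LEN];
	uint8_t expected[TOY_BLKPTR_LEN];

	assert(xattr_len <= TOY_BONUS_LEN);
	memset(&dn, 0, sizeof (dn));

	/* Steps 2 and 4: BUF1 snapshots a bonus buffer full of xattr data. */
	memset(buf1, 0, sizeof (buf1));
	memset(buf1, 't', xattr_len);

	/* Step 4: a later txg starts using the tail as the spill pointer. */
	memset(toy_spill_bp(&dn), 0xAB, TOY_BLKPTR_LEN);
	memcpy(expected, toy_spill_bp(&dn), TOY_BLKPTR_LEN);

	/* Step 7: the older txg syncs and copies BUF1 into the dnode. */
	memcpy(dn.bonus, buf1, xattr_len);

	return (memcmp(expected, toy_spill_bp(&dn), TOY_BLKPTR_LEN) != 0);
}
```

In this model a snapshot that stays within the first 192 bytes leaves the pointer alone, while anything longer overwrites it with xattr bytes, which is consistent with the "trusted" pattern seen in the bad DVA.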

@hsepeng
Contributor

hsepeng commented Jun 1, 2016

@javenwu and I have been working on a fix for this bug. Our latest patch is attached below as a diff; it includes detailed comments about how it solves the bug.
Please help review it. Thanks for all your efforts and valuable comments.

coderevie.txt

@ahrens
Member

ahrens commented Jun 1, 2016

@hsepeng The change (in coderevie.txt) makes sense. Can you explain (in the comment) the code path that leads to trying to zio_free() the garbage dn_spill?

The comments could use some wordsmithing. Let me know if you want help with that.

hsepeng added a commit to hsepeng/zfs that referenced this issue Jun 2, 2016
Date:   Thur Jun 2 13:59:06 2016 +0800

    fix the PANIC: metaslab_free_dva(): bad DVA with zfs openzfs#3937

    the panic was introduced by the following scenario:
    in the previous transaction group, the bonus buffer
    was entirely used to store the attributes for the
    dnode which override the dn_spill field.
    however, when adding more attributes to the file,
    it will need the spill block to hold the extra
    attributes overflowing the bonus buffer.
    make sure to clear the garbage left in the dn_spill
    field which was the previous attributes in bonus
    buffer, otherwise, after writing out the spill block
    data to the new allocated dva, it will try to free
    the old block pointed by the invalid dn_spill, that
    would introduce the panic
@xhernandez
Author

@hsepeng Are you sure that checking dn_flags inside the mutex is necessary?

If the flags could be set concurrently, then this solution is not valid, because another thread might set DNODE_FLAG_SPILL_BLKPTR before it's checked, even if the check is inside the mutex. In that case db->db_blkptr won't be cleared and the bug will appear again.

I don't think this possibility (concurrent updates) makes sense, because we are preparing the dnode to be written to disk, so there shouldn't be parallel updates.

Additionally, reading an aligned integer is an atomic operation (and even if it weren't atomic, we are only testing a single bit, independently of the others).
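
For readers without the patch at hand, the check under discussion can be pictured roughly as follows. This is a toy model based only on the description in this thread (drop the dbuf's block pointer when the spill flag is not yet set, then set the flag). The types, the flag value and the function name are hypothetical, and the locking being debated is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for DNODE_FLAG_SPILL_BLKPTR. */
#define	TOY_FLAG_SPILL_BLKPTR	0x04u

/* Toy versions of the on-disk dnode and the in-memory dbuf. */
struct toy_dnode_phys {
	uint8_t flags;
	uint64_t spill[16];	/* area shared with the tail of the bonus */
};

struct toy_dbuf {
	uint64_t *blkptr;	/* NULL means "no previous block to free" */
};

/* Prepare a spill dbuf for writing, as described in the thread. */
static void
toy_prepare_spill_sync(struct toy_dnode_phys *dnp, struct toy_dbuf *db)
{
	assert(dnp != NULL && db != NULL);

	/*
	 * Flag clear: the shared area still holds bonus leftovers, so it
	 * must not be treated as a pointer to an old block.
	 */
	if (!(dnp->flags & TOY_FLAG_SPILL_BLKPTR))
		db->blkptr = NULL;
	dnp->flags |= TOY_FLAG_SPILL_BLKPTR;
}
```

In this model the first call for a dnode whose flag is clear always discards the stale pointer, and later calls leave a valid pointer untouched because the flag is already set.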

@hsepeng
Contributor

hsepeng commented Jun 2, 2016

@xhernandez I agree with you that, in the current code base, dn_flags is cleared and set in the same thread context, without concurrent updates.
I put the test and set of dn_flags under the mutex for the sake of code maintenance and as a just-in-case measure for the future, since the overhead is negligible.

@xhernandez
Author

@hsepeng If in the future the flag is touched anywhere else, this piece of code will also need to be changed, or the bug will appear again. Note that the other places that check dn_flags do so outside the mutex.

If at some point a change is made that sets the flag in another place, the mutex doesn't guarantee anything. The flag can be set before the mutex is entered, so the check will fail and db->db_blkptr won't be cleared. Additionally, if it has already been set, it's not necessary to set it again. So in all cases we can safely move the check of the flag outside the mutex.

We could even consider using an atomic set or test_and_set operation to remove the mutex completely, but this would require a bigger change.

Having unneeded mutexes might increase the risk of deadlocks if lock order is not correctly checked in all places. Having fewer mutexes minimizes this problem and simplifies future changes.
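
As an illustration of the atomic alternative mentioned above, here is a user-space sketch using C11 atomics. It is not kernel code and does not use the ZFS atomic helpers; the flag value and the function name are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for DNODE_FLAG_SPILL_BLKPTR. */
#define	TOY_FLAG_SPILL_BLKPTR	0x04u

/*
 * Set the spill flag and report whether it was already set.
 * atomic_fetch_or() returns the previous value, so the test and the
 * set happen as one indivisible step and no mutex is needed.
 */
static bool
toy_test_and_set_spill(atomic_uint *flags)
{
	unsigned old;

	assert(flags != NULL);
	old = atomic_fetch_or(flags, TOY_FLAG_SPILL_BLKPTR);
	return ((old & TOY_FLAG_SPILL_BLKPTR) != 0);
}
```

A caller would clear its block pointer only when the function returns false, that is, when this call was the one that turned the flag on.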

@dweeezil
Contributor

dweeezil commented Jun 2, 2016

The missing piece of this puzzle is how dbuf_sync_leaf() can be entered for a spill block when db_blkptr is not NULL and points to SA leftovers in dn_spill while DNODE_FLAG_SPILL_BLKPTR is clear. The patch proposed here clearly fixes the problem (and actually NULLs db_blkptr in many cases where it is already NULL), but I wonder if the real issue is how this condition arises in the first place.

@xhernandez
Author

xhernandez commented Jun 2, 2016

@dweeezil I'm trying to trace the path followed by a dbuf that works fine and a dbuf that causes a panic and I've seen that they start to differ when dbuf_undirty() is called.

In the bad case dbuf_undirty() returns false, but not because the dbuf is still referenced by anyone else, as I previously thought. It returns false because the dbuf is not dirty in the current transaction group. It's dirty in a previous one or not dirty at all (if I correctly understand the code). This is the check that causes the return:

if (dr == NULL || dr->dr_txg < txg)
        return (B_FALSE);

I'll analyze the remaining data to post a more detailed description.

@dweeezil
Contributor

dweeezil commented Jun 2, 2016

@xhernandez I was able to exercise that code path in my testing, but it never resulted in a non-NULL db_blkptr for a spill block pointing at trash.

I'll keep trying to reproduce this, but I think your patch makes sense right now even though I'm unclear about the code paths which can cause the problem.

@xhernandez
Author

xhernandez commented Jun 4, 2016

@dweeezil I think I've identified the code path that leads to this situation.

  1. Current txg = A.
  2. A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied.
  3. Current txg = B.
  4. The spill buffer is modified. It's marked as dirty in this txg.
  5. Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed.
  6. Current txg = C.
  7. Syncing of txg A starts.
  8. dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called.
  9. The dbuf starts being written and it reaches the ready state (not done yet).
  10. A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL.
  11. The txg A write completes and the dbuf is released. However, it's still referenced, so it's not destroyed.
  12. Current txg = D.
  13. Syncing of txg B starts.
  14. dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr.
  15. At this point, the db_blkptr of the spill buffer used in txg C gets corrupted.

hsepeng added a commit to hsepeng/zfs that referenced this issue Jun 5, 2016
@dweeezil
Contributor

dweeezil commented Jun 6, 2016

@xhernandez I finally got a chance to go over your dbgmsg output. One of the key observations was the failures of arc_tempreserve(). When this happens, the creation of new txgs is stalled and this is exactly the type of thing which would likely be required in order that the scenario you describe could happen. Your dmu_tx kstat would likely bear that out. I'm still working on a reliable reproducer.

hsepeng added a commit to hsepeng/zfs that referenced this issue Jun 7, 2016
hsepeng added a commit to hsepeng/zfs that referenced this issue Jun 8, 2016
@xhernandez
Author

@behlendorf will this patch be included in the next 0.6.5.x release ?

@behlendorf
Contributor

It's possible if we can get a few developers to review and sign off on the proposed change in #4743. The fix itself looks reasonable to me but I think the comment could be a little more concise. Maybe @dweeezil @ahrens or @xhernandez can propose something. Including the detailed walk-thru from the above comment in the commit comment would also be very useful.

@dweeezil
Contributor

dweeezil commented Jun 9, 2016

I had started working on a somewhat more concise description of the conditions required for this issue. As mentioned earlier, I think one important prerequisite is a dmu tx assignment stall of some sort, so that multiple references to the spill and/or bonus buffers actually exist at the same time. So far, I've not come up with a reproducer. I'd at least like to see the description include the open/quiesce/sync/close state of the txgs involved.

That said, the fix does look perfectly reasonable.

@behlendorf modified the milestones: 0.6.5.8, 0.8.0 Jul 12, 2016
GeLiXin added a commit to GeLiXin/zfs that referenced this issue Aug 1, 2016
* Consistently use parsable instead of parseable

This is a purely cosmetic change, to consistently prefer one of
two (both acceptable) choices for the word parsable in documentation and
code. I don't really care which to use, but according to wiktionary
https://en.wiktionary.org/wiki/parsable#English parsable is preferred.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4682

* Add missing RPM BuildRequires

Both libudev and libattr are recommended build requirements.  As
such their development headers should be listed in the rpm spec file
so those dependencies are pulled in when building rpm packages.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4676

* Skip ctldir znode in zfs_rezget to fix snapdir issues

Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This
will cause funny behaviour for the mounted snapdirs. Especially for
Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone
from automounting it again as long as someone is still using the detached mount.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4514
Closes #4661
Closes #4672

* Improve zfs-module-parameters(5)

Various rewrites to the descriptions of module parameters. Corrects
spelling mistakes, makes the descriptions more user-friendly and
describes some ZFS quirks which should be understood before changing
parameter values.

Signed-off-by: DHE <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4671

* Fix arc_prune_task use-after-free

arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent
the underlying zsb from disappearing if there's a concurrent umount. We fix
this by forcing the caller of arc_remove_prune_callback to wait for
arc_prune_taskq to finish.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4687
Closes #4690

* Add request size histograms (-r) to zpool iostat, minor man page fix

Add -r option to "zpool iostat" to print request size histograms for the leaf
ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs
("agg"). These stats can be useful for seeing how well the ZFS IO aggregator
is working.

$ zpool iostat -r
mypool        sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0    530      0      0      0
1K              0      0    260      0      0      0    116    246      0      0
2K              0      0      0      0      0      0      0    431      0      0
4K              0      0      0      0      0      0      3    107      0      0
8K             15      0     35      0      0      0      0      6      0      0
16K             0      0      0      0      0      0      0     39      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K            20      0     40      0      0      0      0      0      0      0
128K            0      0     20      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0    155     19      0      0
8M              0      0      0      0      0      0      0    811      0      0
16M             0      0      0      0      0      0      0     68      0      0
--------------------------------------------------------------------------------

Also rename the stray "-G" in the man page to be "-w" for latency histograms.

Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #4659
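
As a rough illustration of how to read the rows of the histogram in the commit message above, the sketch below sorts request sizes into power-of-two buckets labelled like the req_size column. It is not the zpool implementation: it assumes each row counts requests whose size is at least the label and below the next label, and the function names and sample sizes are invented.

```python
from collections import Counter


def bucket_label(size):
    """Return the power-of-two bucket label (512, 1K, ..., 16M) for a size in bytes."""
    bucket = 512
    while bucket * 2 <= size:
        bucket *= 2
    if bucket >= 1 << 20:
        return f"{bucket >> 20}M"
    if bucket >= 1 << 10:
        return f"{bucket >> 10}K"
    return str(bucket)


def histogram(sizes):
    """Count how many requests fall into each bucket."""
    return Counter(bucket_label(s) for s in sizes)


# Made-up request sizes in bytes.
sample = [512, 700, 1024, 4096, 6000, 131072, 8 * 1024 * 1024]
for label, count in histogram(sample).items():
    print(f"{label:>6} {count}")
```

Many small individual ("ind") requests alongside fewer, larger aggregate ("agg") requests would then suggest that the aggregator is merging I/O as intended.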

* OpenZFS 6531 - Provide mechanism to artificially limit disk performance

Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
  to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files

* Systemd configuration fixes

* Disable zfs-import-scan.service by default.  This ensures that
pools will not be automatically imported unless they appear in
the cache file.  When this service is explicitly enabled pools
will be imported with the "cachefile=none" property set.  This
prevents the creation of, or update to, an existing cache file.

    $ systemctl list-unit-files | grep zfs
    zfs-import-cache.service                  enabled
    zfs-import-scan.service                   disabled
    zfs-mount.service                         enabled
    zfs-share.service                         enabled
    zfs-zed.service                           enabled
    zfs.target                                enabled

* Change services to dynamic from static by adding an [Install]
section and adding 'WantedBy' tags in favor of 'Requires' tags.
This allows for easier customization of the boot behavior.

* Start the zfs-import-cache.service after the root pivot so
the cache file is available in the standard location.

* Start the zfs-mount.service after the systemd-remount-fs.service
to ensure the root fs is writeable and the ZFS filesystems can
create their mount points.

* Change the default behavior to only load the ZFS kernel modules
in zfs-import-*.service or when blkid(8) detects a pool.  Users
who wish to unconditionally load the kernel modules must uncomment
the list of modules in /lib/modules-load.d/zfs.conf.

Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4325
Closes #4496
Closes #4658
Closes #4699

* Fix self-healing IO prior to dsl_pool_init() completion

Async writes triggered by a self-healing IO may be issued before the
pool finishes the process of initialization.  This results in a NULL
dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes().

George Wilson recommended addressing this issue by initializing the
passed `dsl_pool_t **` prior to dmu_objset_open_impl().  Since the
caller is passing the `spa->spa_dsl_pool` this has the effect of
ensuring it's initialized.

However, since this depends on the caller knowing they must pass
the `spa->spa_dsl_pool` an additional NULL check was added to
vdev_queue_max_async_writes().  This guards against any future
restructuring of the code which might result in dsl_pool_init()
being called differently.

Signed-off-by: GeLiXin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4652

* Add isa_defs for MIPS

GCC for MIPS defines _LP64 only for 64-bit targets,
and does not define _ILP32 for 32-bit targets.

Signed-off-by: YunQiang Su <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4712

* Fix out-of-bound access in zfs_fillpage

The original code will do an out-of-bounds access on pl[] during the last
iteration.

 ==================================================================
 BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs]
 Read of size 8 by task tmpfile/7850
 page:ffffea00017c6dc0 count:0 mapcount:0 mapping:          (null) index:0x0
 flags: 0xffff8000000000()
 page dumped because: kasan: bad access detected
 CPU: 3 PID: 7850 Comm: tmpfile Tainted: G           OE   4.6.0+ #3
  ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618
  ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8
  ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3
 Call Trace:
  [<ffffffff81635618>] dump_stack+0x63/0x8b
  [<ffffffff81313ee8>] kasan_report_error+0x528/0x560
  [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0
  [<ffffffff813144b8>] kasan_report+0x58/0x60
  [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffff81312e4e>] __asan_load8+0x5e/0x70
  [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs]

  [<ffffffff81353c3a>] SyS_execve+0x3a/0x50
  [<ffffffff810058ef>] do_syscall_64+0xef/0x180
  [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25
 Memory state around the buggy address:
  ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4
                                                                 ^
  ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ==================================================================

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4705
Issue #4708

* Fix memleak in zpl_parse_options

strsep() will advance tmp_mntopts, and will change it to NULL on the last
iteration.  This will cause strfree(tmp_mntopts) to not free anything.

unreferenced object 0xffff8800883976c0 (size 64):
  comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s)
  hex dump (first 32 bytes):
    72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a  rw.strictatime.z
    66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d  fsutil.mntpoint=
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811f9cac>] __kmalloc+0x16c/0x250
    [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl]
    [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs]
    [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs]
    [<ffffffff81222dc8>] mount_fs+0x38/0x160
    [<ffffffff81240097>] vfs_kern_mount+0x67/0x110
    [<ffffffff812428e0>] do_mount+0x250/0xe20
    [<ffffffff812437d5>] SyS_mount+0x95/0xe0
    [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4706
Issue #4708

* Fix memleak in vdev_config_generate_stats

fnvlist_add_nvlist will copy the contents of nvx, so we need to
free it here.

unreferenced object 0xffff8800a6934e80 (size 64):
  comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s)
  hex dump (first 32 bytes):
    60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff  `..s.....|.s....
    00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff  ........@.p.....
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310
    [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl]
    [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl]
    [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
    [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair]
    [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair]
    [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair]
    [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs]
    [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs]
    [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs]
    [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs]
    [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs]
    [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs]
    [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0
    [<ffffffff812333b9>] SyS_ioctl+0x79/0x90

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4707
Issue #4708

* Linux 4.7 compat: handler->set() takes both dentry and inode

Counterpart to fd4c7b7, the same approach was taken to resolve
the compatibility issue.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4717 
Issue #4665

* Implementation of AVX2 optimized Fletcher-4

New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.

New zcommon module parameters:
-  zfs_fletcher_4_impl (str): selects the implementation to use.
    "fastest" - use the fastest version available
    "cycle"   - cycle through all available impl for ztest
    "scalar"  - use the original version
    "avx2"    - new AVX2 implementation if available

Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar:  4216 MB/s
- AVX2:   14499 MB/s

See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Jinshan Xiong <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4330
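
For readers unfamiliar with the checksum that the commit message above refers to, here is a minimal sketch of the scalar Fletcher-4 computation as I understand it: four 64-bit running sums updated once per 32-bit word. It assumes little-endian input words and is not the code from this patch; an optimized variant such as the AVX2 one is expected to produce the same four sums.

```python
import struct

MASK64 = (1 << 64) - 1


def fletcher4(data):
    """Scalar Fletcher-4 over little-endian 32-bit words; returns (a, b, c, d)."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64   # sum of words
        b = (b + a) & MASK64      # sum of sums
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


print(fletcher4(struct.pack("<2I", 1, 2)))  # (3, 4, 5, 6)
```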

* Fix cstyle.pl warnings

As of perl v5.22.1 the following warnings are generated:

* Redundant argument in printf at scripts/cstyle.pl line 194

* Unescaped left brace in regex is deprecated, passed through
  in regex; marked by <-- HERE in m/\S{ <-- HERE / at
  scripts/cstyle.pl line 608.

They have been addressed by escaping the left braces and by
providing the correct number of arguments to printf based on
the fmt specifier set by the verbose option.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4723

* Fix minor spelling mistakes

Trivial spelling mistake fix in error message text.

* Fix spelling mistake "adminstrator" -> "administrator"
* Fix spelling mistake "specificed" -> "specified"
* Fix spelling mistake "interperted" -> "interpreted"

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4728

* Add `zfs allow` and `zfs unallow` support

ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands.  In addition, non-
privileged users should be able to run all of the following commands:

  * zpool [list | iostat | status | get]
  * zfs [list | get]

Historically this functionality was not available on Linux.  In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability.  Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.

Even with this change some limitations remain.  Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace).  This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code.  It
may be possible to add this functionality in the future.

This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite.  These tests exhaustively verify each
of the supported permissions which can be delegated and ensure only
an authorized user can perform it.

Two minor bug fixes were required for test-running.py.  First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it.  Second,
when running as a normal user also check for scripts using
both the .ksh and .sh suffixes.

Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #362 
Closes #434 
Closes #4100
Closes #4394 
Closes #4410 
Closes #4487

* Remove libzfs_graph.c

The libzfs_graph.c source file should have been removed in 330d06f,
it is entirely unused.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4766

* Linux 4.6 compat: Fall back to d_prune_aliases() if necessary

As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliases() method
in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes: #4726

* SIMD implementation of vdev_raidz generate and reconstruct routines

This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ levels. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.

Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
  module load, the parameter will only accept first 3 options, and
  the other implementations can be set once module is finished
  loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4328
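
To make "generate" and "reconstruct" in the commit message above concrete, the sketch below covers only the simplest case, single parity (P), where the parity column is the byte-wise XOR of the data columns. The Q and R parities rely on Galois-field arithmetic and are not shown. The function names are invented and this is not the code from the patch.

```python
from functools import reduce


def xor_columns(columns):
    """XOR equally sized byte columns together."""
    return bytes(reduce(lambda x, y: x ^ y, group) for group in zip(*columns))


def generate_p(data_columns):
    """P parity is the byte-wise XOR of all data columns."""
    return xor_columns(data_columns)


def reconstruct_with_p(data_columns, missing, parity):
    """Rebuild one missing data column from the surviving columns and P."""
    survivors = [c for i, c in enumerate(data_columns) if i != missing]
    return xor_columns(survivors + [parity])


data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
p = generate_p(data)
# Pretend column 1 was lost; only the other columns and P are used to rebuild it.
rebuilt = reconstruct_with_p(data, 1, p)
print(rebuilt == data[1])
```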

* Fix NFS credential

The commit f74b821 caused a regression where creating a file through NFS will
always create a file owned by root. This is because the patch enables the KSID
code in zfs_acl_ids_create, which uses the euid and egid of the current
process. However, on Linux, we should use fsuid and fsgid for file operations,
which is the original behaviour. So we revert this part of the code.

The patch also enables secpolicy_vnode_*. Since they are also used in file
operations, we change them to use fsuid and fsgid.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4772
Closes #4758

* OpenZFS 6513 - partially filled holes lose birth time

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Boris Protopopov <[email protected]>
Approved by: Richard Lowe <[email protected]>
Ported by: Boris Protopopov <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6513
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0

If a ZFS object contains a hole at level one, and then a data block is
created at level 0 underneath that l1 block, l0 holes will be created.
However, these l0 holes do not have the birth time property set; as a
result, incremental sends will not send those holes.

Fix is to modify the dbuf_read code to fill in birth time data.

* Add a test case for dmu_free_long_range() to ztest

Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4754

* Revert "Add a test case for dmu_free_long_range() to ztest"

This reverts commit d0de2e82df579f4e4edf5643b674a1464fae485f which
introduced a new test case to ztest which is failing occasionally
during automated testing.  The change is being reverted until
the issue can be fully investigated.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4754

* OpenZFS 6878 - Add scrub completion info to "zpool history"

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Approved by: Dan McDonald <[email protected]>
Authored by: Nav Ravindranath <[email protected]>
Ported-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6878
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5
Closes #4787

* FreeBSD rS271776 - Persist vdev_resilver_txg changes

Persist vdev_resilver_txg changes to avoid panic caused by validation
vs a vdev_resilver_txg value from a previous resilver.

Authored-by: smh <[email protected]>
Ported-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/5154
FreeBSD-issue: https://reviews.freebsd.org/rS271776
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf
Closes #4790

* xattrtest: allow verify with -R and other improvements

- Use a fixed buffer of random bytes when random xattr values are in
  effect.  This eliminates the potential performance bottleneck of
  reading from /dev/urandom for each file. This also allows us to
  verify xattrs in random value mode.

- Show the rate of operations per second in addition to elapsed time
  for each phase of the test. This may be useful for benchmarking.

- Set default xattr size to 6 so that verify doesn't fail if user
  doesn't specify a size. We need at least six bytes to store the
  leading "size=X" string that is used for verification.

- Allow user to execute just one phase of the test. Acceptable
  values for -o and their meanings are:

   1 - run the create phase
   2 - run the setxattr phase
   3 - run the getxattr phase
   4 - run the unlink phase

Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

* Backfill metadnode more intelligently

Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates.  As summarized by
@mahrens in #4636:

"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects).  When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.

The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."

Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limiting the rescan to at most once per TXG.

Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
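
The rescan rule described in the commit message above (at least 4096 objects freed since the last rescan, and at most one rescan per transaction group) can be modelled in a few lines. This is a hypothetical sketch: the class and method names are invented, and the real allocator tracks this state differently.

```python
RESCAN_THRESHOLD = 4096  # default described for dmu_rescan_dnode_threshold


class BackfillPolicy:
    """Hypothetical model of when the allocator may rescan lower object numbers."""

    def __init__(self, threshold=RESCAN_THRESHOLD):
        self.threshold = threshold
        self.freed_since_rescan = 0
        self.last_rescan_txg = None

    def object_freed(self, count=1):
        """Record that `count` objects were freed."""
        self.freed_since_rescan += count

    def should_rescan(self, txg):
        """Allow a rescan only if enough objects were freed and none ran in this txg."""
        if self.freed_since_rescan < self.threshold:
            return False
        if self.last_rescan_txg == txg:
            return False
        self.last_rescan_txg = txg
        self.freed_since_rescan = 0
        return True


policy = BackfillPolicy()
policy.object_freed(4096)
first = policy.should_rescan(txg=10)    # enough objects freed: rescan allowed
policy.object_freed(4096)
second = policy.should_rescan(txg=10)   # same txg: rescan refused
third = policy.should_rescan(txg=11)    # next txg: rescan allowed again
print(first, second, third)
```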

* Implement large_dnode pool feature

Justification
-------------

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks.  Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided.  Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.  Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
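
The worst-case arithmetic in the paragraph above can be checked with a short calculation. The sketch assumes a 16k dnode block and one extra read per dnode that needs a spill block; the function name is invented.

```python
import math

DNODE_BLOCK_SIZE = 16 * 1024  # 16k dnode block, as in the text


def worst_case_reads(num_dnodes, dnode_size, needs_spill):
    """Dnode block reads plus one spill block read per dnode that needs it."""
    dnode_blocks = math.ceil(num_dnodes * dnode_size / DNODE_BLOCK_SIZE)
    spill_reads = num_dnodes if needs_spill else 0
    return dnode_blocks + spill_reads


print(worst_case_reads(32, 512, needs_spill=True))    # 33
print(worst_case_reads(32, 1024, needs_spill=False))  # 2
```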

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
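
The size rules above, together with the dn_num_slots and dn_extra_slots convention described earlier in this commit message, can be summarized in a small sketch. The helper names are invented for illustration and are not part of the DMU API.

```python
DNODE_SLOT_SIZE = 512          # size of one dnode_phys_t slot
DNODE_BLOCK_SIZE = 16 * 1024   # current dnode block size


def is_legal_dnode_size(size):
    """Internally, any multiple of 512 up to 16k is acceptable."""
    return 0 < size <= DNODE_BLOCK_SIZE and size % DNODE_SLOT_SIZE == 0


def is_legal_property_value(size):
    """Literal dnodesize property values: powers of two from 1k to 16k."""
    return 1024 <= size <= DNODE_BLOCK_SIZE and (size & (size - 1)) == 0


def slots_for(size):
    """Return (dn_num_slots, dn_extra_slots) for a dnode of the given size."""
    if not is_legal_dnode_size(size):
        raise ValueError(f"illegal dnode size: {size}")
    num_slots = size // DNODE_SLOT_SIZE
    return num_slots, num_slots - 1


print(slots_for(512))    # (1, 0): a legacy dnode uses no extra slots
print(slots_for(1024))   # (2, 1)
```

A size such as 1536 passes is_legal_dnode_size() but not is_legal_property_value(), which mirrors the distinction drawn above between the internal limit and the user interface.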

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The slots parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by a large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.
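
The slot-aware traversal described in the last item can be sketched with
a toy model. Everything here is a simplification for illustration: the
struct layout, the 32-slot block (a 16k block of 512-byte slots), and
the convention that a slot count of 0 marks a hole are assumptions, not
the real dnode_phys_t layout.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOT_SIZE       512   /* one dnode slot, per the text */
#define SLOTS_PER_BLOCK 32    /* assumes a 16k dnode block */

/* Toy slot header: 0 slots means the slot is a hole (hypothetical). */
struct toy_slot {
    uint8_t ts_num_slots;            /* slots used by the dnode starting here */
    uint8_t ts_pad[SLOT_SIZE - 1];   /* rest of the slot: dnode or bonus data */
};

/*
 * Count the dnodes in one block, advancing by each dnode's slot count so
 * that the interior slots of a large dnode are never read as headers.
 * Returns -1 if a dnode would run past the end of the block.
 */
static int
toy_count_dnodes(const struct toy_slot *blk)
{
    int count = 0;
    int i = 0;

    while (i < SLOTS_PER_BLOCK) {
        int n = blk[i].ts_num_slots;

        if (n == 0) {                   /* hole: step one slot */
            i++;
            continue;
        }
        if (i + n > SLOTS_PER_BLOCK)    /* would span dnode blocks */
            return (-1);
        count++;
        i += n;                         /* skip the dnode's extra slots */
    }
    return (count);
}
```

With simple array indexing, the bonus data in slots 1 through n-1 of a
large dnode would be read as if it were a dnode header, which is the
misinterpretation the change guards against.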

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number.  This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.

ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
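
As a rough sketch of this packing, assuming the slot count is stored
directly in bits 56..63 (the text does not say whether the real code
stores the count itself or an encoded form), the helper names below are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define FOID_OBJ_BITS   48   /* object ids are capped at 48 bits */
#define FOID_SLOT_SHIFT 56   /* uppermost 8 bits of a 64-bit word */
#define FOID_OBJ_MASK   ((UINT64_C(1) << FOID_OBJ_BITS) - 1)

/* Pack an object id and a slot count into one 64-bit lr_foid-style value. */
static uint64_t
foid_pack(uint64_t obj, uint8_t slots)
{
    assert((obj & ~FOID_OBJ_MASK) == 0);   /* id must fit in 48 bits */
    return (obj | ((uint64_t)slots << FOID_SLOT_SHIFT));
}

/* Recover the object id from the low 48 bits. */
static uint64_t
foid_obj(uint64_t foid)
{
    return (foid & FOID_OBJ_MASK);
}

/* Recover the slot count from the uppermost 8 bits. */
static uint8_t
foid_slots(uint64_t foid)
{
    return ((uint8_t)(foid >> FOID_SLOT_SHIFT));
}
```

A record written with a slot count of 0 is bit-for-bit identical to one
written before the field existed, which is why reusing the unused bits
is backward compatible.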

Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3542

* Sync DMU_BACKUP_FEATURE_* flags

Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING.  The
DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and
then reserved in the upstream OpenZFS implementation.
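
A minimal sketch of how a receiver can reject a stream that carries an
unknown feature flag, assuming "20" and "21" above are bit positions in
a 64-bit flags word. The macro and function names are illustrative and
are not the definitions from the ZFS headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as described in the text; names are illustrative. */
#define FEATURE_RESUMING    (UINT64_C(1) << 20)
#define FEATURE_LARGE_DNODE (UINT64_C(1) << 21)

/* True if the stream uses only features the receiver understands. */
static bool
stream_features_supported(uint64_t stream_flags, uint64_t receiver_mask)
{
    return ((stream_flags & ~receiver_mask) == 0);
}
```

If both features shared bit 20, a receiver that only knows about resuming
would accept a large dnode stream and misinterpret it, so the flags must
not collide across implementations.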

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes #4795

* OpenZFS 2605, 6980, 6902

2605 want to resume interrupted zfs send
Reviewed by: George Wilson <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Xin Li <[email protected]>
Reviewed by: Arne Jansen <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: kernelOfTruth <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12

6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f

Porting notes:
- All rsend and snapshot tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
  'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
  this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
  rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.

* OpenZFS 6051 - lzc_receive: allow the caller to read the begin record

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6051
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322

* OpenZFS 6393 - zfs receive a full send as a clone

Authored by: Paul Dagnelie <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6394
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e

* OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS

Authored by: Andrew Stormont <[email protected]>
Reviewed by: Anil Vijarnia <[email protected]>
Reviewed by: Kim Shrier <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6536
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b

* OpenZFS 6738 - zfs send stream padding needs documentation

Authored by: Eli Rosenthal <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6738
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff

* OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota

Authored by: Dan McDonald <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Gordon Ross <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/4986
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad

* OpenZFS 6562 - Refquota on receive doesn't account for overage

Authored by: Dan McDonald <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Reviewed by: Toomas Soome <[email protected]>
Approved by: Gordon Ross <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6562
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6

* Implement zfs_ioc_recv_new() for OpenZFS 2605

Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy
ZFS_IOC_RECV user/kernel interface.  The new interface supports all
stream options but is currently only used for resumable streams.
This way updated user space utilities will interoperate with older
kernel modules.

ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW
handler.  Non-Linux OpenZFS platforms have opted to change the
legacy interface in an incompatible fashion instead of adding a
new ioctl.

Signed-off-by: Brian Behlendorf <[email protected]>

* OpenZFS 6314 - buffer overflow in dsl_dataset_name

Reviewed by: George Wilson <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Igor Kozhukhov <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6314
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee

* OpenZFS 6876 - Stack corruption after importing a pool with a too-long name

Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking
for trouble. We should check every dataset on import, using a 1024 byte
buffer and checking each time to see if the dataset's new name is longer
than 256 bytes.

OpenZFS-issue: https://www.illumos.org/issues/6876
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e

* Vectorized fletcher_4 must be 128-bit aligned

The fletcher_4_native() and fletcher_4_byteswap() functions may only
safely use the vectorized implementations when the buffer is 128-bit
aligned.  This is because both the AVX2 and SSE implementations process
four 32-bit words per iteration.  Fall back to the scalar implementation,
which processes a single 32-bit word at a time, for unaligned buffers.
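
The alignment test and fallback can be sketched as follows. The scalar
loop is the standard Fletcher-4 recurrence; the f4_* names and the
function-pointer dispatch are hypothetical and do not mirror the actual
fletcher_4_native() code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Running Fletcher-4 state: four 64-bit accumulators. */
struct f4_cksum {
    uint64_t a, b, c, d;
};

typedef void (*f4_fn)(const uint32_t *, size_t, struct f4_cksum *);

/* True if buf sits on a 128-bit (16-byte) boundary. */
static bool
f4_is_aligned_128(const void *buf)
{
    return (((uintptr_t)buf & 15) == 0);
}

/* Scalar Fletcher-4: consume one 32-bit word per iteration. */
static void
f4_scalar(const uint32_t *words, size_t nwords, struct f4_cksum *ck)
{
    uint64_t a = 0, b = 0, c = 0, d = 0;

    for (size_t i = 0; i < nwords; i++) {
        a += words[i];
        b += a;
        c += b;
        d += c;
    }
    ck->a = a;
    ck->b = b;
    ck->c = c;
    ck->d = d;
}

/* Use the vectorized routine only when the buffer is 128-bit aligned. */
static f4_fn
f4_pick(const void *buf, f4_fn simd, f4_fn scalar)
{
    return (f4_is_aligned_128(buf) ? simd : scalar);
}
```

A caller would pass its vectorized routine as simd; any buffer that is
not 16-byte aligned is routed to the scalar loop.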

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Issue #4330

* Allow building with `CFLAGS="-O0"`

If compiled with -O0, gcc doesn't do any stack frame coalescing
and -Wframe-larger-than=1024 is triggered in debug mode.
Starting with gcc 4.8, a new optimization level, -Og, is available for
debugging; it does not trigger this warning.

Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4799

* Don't allow accessing XATTR via export handle

Allowing access to XATTRs through an export handle is a very bad idea. It
would allow users to write whatever they want in fields where they
otherwise could not.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4828

* Fix get_zfs_sb race with concurrent umount

Certain ioctl operations will call get_zfs_sb, which will hold an active
count on the sb without checking whether it's active or not. This can
result in a use-after-free. We fix this by using atomic_inc_not_zero to
make sure we get an active sb.

P1                                          P2
---                                         ---
deactivate_locked_super(): s_active = 0
                                            zfs_sb_hold()
                                            ->get_zfs_sb(): s_active = 1
->zpl_kill_sb()
-->zpl_put_super()
--->zfs_umount()
---->zfs_sb_free(zsb)
                                            zfs_sb_rele(zsb)
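
The "increment only if non-zero" idea can be sketched in user space with
C11 atomics. This is an analogue of the kernel's atomic_inc_not_zero(),
not the kernel implementation, and ref_get_not_zero() is a hypothetical
name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is already non-zero.  Returns false,
 * leaving the count untouched, when the object is being torn down.
 */
static bool
ref_get_not_zero(atomic_int *count)
{
    int old = atomic_load(count);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return (true);
    }
    return (false);
}
```

In the race shown above, P2 would see s_active == 0 and back off instead
of raising it to 1 on a superblock that P1 is about to free.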

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

* Fix Large kmem_alloc in vdev_metaslab_init

This allocation can go way over 1MB, so we should use vmem_alloc
instead of kmem_alloc.

  Large kmem_alloc(1430784, 0x1000), please file an issue...
  Call Trace:
   [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl]
   [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs]
   [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs]
   [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs]
   [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs]
   [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs]
   [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs]
   [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs]
   [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0
   [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4752

* Add configure result for xattr_handler

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4828

* fh_to_dentry should return ESTALE when generation mismatch

When the generation mismatches, it usually means the file pointed to by the
file handle was deleted. We should return ESTALE to indicate this. We return
ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4828

* xattr dir doesn't get purged during iput

We need to set inode->i_nlink to zero so iput will purge it. Without this, it
will get purged during shrink cache or umount, which would likely result in
deadlock due to zfs_zget waiting forever on its children which are in the
dispose_list of the same thread.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Kill zp->z_xattr_parent to prevent pinning

zp->z_xattr_parent will pin the parent. This causes a serious problem
when unlinking a file with xattrs. Because the unlinked file is pinned,
it is not purged immediately. As a result, its xattrs are never
marked as unlinked, and all of the unlinked state stays around
until shrink cache or umount.

This change partially reverts e89260a.  This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Fix RAIDZ_TEST tests

Remove stray trailing } which prevented the raidz stress tests from
running in-tree.

Signed-off-by: Brian Behlendorf <[email protected]>

* Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z

The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.

Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
  db_blkptr = NULL and it's dirtied.

Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
  xattr fits into the bonus buffer, so it's removed. The dbuf is
  undirtied in this txg, but it's still referenced and cannot be
  destroyed.

Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
  is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
  (not done yet).
* A new change makes the spill buffer necessary again.
  sa_build_layouts() ends up calling dbuf_find() to locate the
  dbuf.  It finds the old dbuf because it has not been destroyed yet
  (it will be destroyed when the previous write is done and there
  are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
  referenced, so it's not destroyed.

Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
  directly copied into the dnode, overwriting the blkptr area because,
  in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
  gets corrupted.

Signed-off-by: Peng <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3937

* Fix handling of errors nvlist in zfs_ioc_recv_new()

zfs_ioc_recv_impl() is changed to always allocate the 'errors'
nvlist, its callers are responsible for freeing it.

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4829

* Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.

The patch covers low-end and older x86 CPUs.  Parity generation is
equivalent to SSSE3 implementation, but reconstruction is somewhat
slower.  Previous 'sse' implementation is renamed to 'ssse3' to
indicate highest instruction set used.

Benchmark results:
scalar_rec_p                    4    720476442
scalar_rec_q                    4    187462804
scalar_rec_r                    4    138996096
scalar_rec_pq                   4    140834951
scalar_rec_pr                   4    129332035
scalar_rec_qr                   4    81619194
scalar_rec_pqr                  4    53376668

sse2_rec_p                      4    2427757064
sse2_rec_q                      4    747120861
sse2_rec_r                      4    499871637
sse2_rec_pq                     4    522403710
sse2_rec_pr                     4    464632780
sse2_rec_qr                     4    319124434
sse2_rec_pqr                    4    205794190

ssse3_rec_p                     4    2519939444
ssse3_rec_q                     4    1003019289
ssse3_rec_r                     4    616428767
ssse3_rec_pq                    4    706326396
ssse3_rec_pr                    4    570493618
ssse3_rec_qr                    4    400185250
ssse3_rec_pqr                   4    377541245

original_rec_p                  4    691658568
original_rec_q                  4    195510948
original_rec_r                  4    26075538
original_rec_pq                 4    103087368
original_rec_pr                 4    15767058
original_rec_qr                 4    15513175
original_rec_pqr                4    10746357

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4783

* Enable zpool_upgrade test cases

Creating the pool in a striped rather than mirrored configuration
provides enough space for all upgrade tests to run.  Test case
zpool_upgrade_007_pos still fails and must be investigated so
it has been left disabled.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4852

* Prevent null dereferences when accessing dbuf kstat

In arc_buf_info(), the arc_buf_t may have no header.  If it has none,
don't try to fetch the arc buffer stats and instead just zero them.

The null dereferences were observed while accessing the dbuf kstat with
awk on a system in which millions of small files were being created in
order to overflow the system's metadata limit.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4837

* Fix dbuf_stats_hash_table_data race

Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf
can be freed at any time.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4846

* Use native inode->i_nlink instead of znode->z_links

A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's
64 bit on-disk link count.

We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a
more Linux-integrated fix for the same issue.

In addition, setting the initial link count on a new node has been changed
from setting one less than required in zfs_mknode() then incrementing to the
correct count in zfs_link_create() (which was somewhat bizarre in the first
place), to setting the correct count in zfs_mknode() and not incrementing it
in zfs_link_create(). This both means we no longer set the link count in
sa_bulk_update() twice (once for the initial incorrect count then again for
the correct count), as well as adhering to the Linux requirement of not
incrementing a zero link count without I_LINKABLE (see linux commit
f4e0c30c).

Signed-off-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4838
Issue #227

* Implementation of SSE optimized Fletcher-4

Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.

The module benchmark was also amended to analyze the performance of
the byteswapped version of Fletcher-4, as well as the non-byteswapped
version. The average performance of the two is used to select the
fastest implementation available on the host system.
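
A sketch of the selection rule, assuming each implementation has been
benchmarked once per variant. The struct and function names are
hypothetical; the real module code may organize its results differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Benchmark result for one implementation (all names hypothetical). */
struct f4_bench {
    const char *name;
    uint64_t native_bw;     /* bytes/s, native byte order */
    uint64_t byteswap_bw;   /* bytes/s, byteswapped input */
};

/* Return the index of the entry with the best average, or -1 if empty. */
static int
f4_pick_fastest(const struct f4_bench *res, size_t n)
{
    int best = -1;
    uint64_t best_avg = 0;

    for (size_t i = 0; i < n; i++) {
        /* Halve each term first so the sum cannot overflow. */
        uint64_t avg = res[i].native_bw / 2 + res[i].byteswap_bw / 2;

        if (best < 0 || avg > best_avg) {
            best = (int)i;
            best_avg = avg;
        }
    }
    return (best);
}
```

Averaging keeps an implementation that is fast in only one byte order
from being chosen over one that is consistently fast in both.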

Adds a pair of fields to an existing zcommon module parameter:
-  zfs_fletcher_4_impl (str)
    "sse2"    - new SSE2 implementation if available
    "ssse3"   - new SSSE3 implementation if available

Signed-off-by: Tyler J. Stachecki <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4789

* Fix filesystem destroy with receive_resume_token

It is possible that the given DS may have hidden child (%recv)
datasets - "leftovers" resulting from the previously interrupted
'zfs receive'.  Try to remove the hidden child (%recv) and after
that try to remove the target dataset.   If the hidden child
(%recv) does not exist the original error (EEXIST) will be returned.

Signed-off-by: Roman Strashkin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4818

* Prevent segfaults in SSE optimized Fletcher-4

In some cases, the compiler was not respecting the GNU aligned
attribute for stack variables in 35a76a0. This was resulting in
a segfault on CentOS 6.7 hosts using gcc 4.4.7-17.  This issue
was fixed in gcc 4.6.

To prevent this from occurring, use unaligned loads and stores
for all stack and global memory references in the SSE optimized
Fletcher-4 code.

Disable zimport testing against master where this flaw exists:

TEST_ZIMPORT_VERSIONS="installed"

Signed-off-by: Tyler J. Stachecki <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4862

* Update arc_summary.py for prefetch changes

Commit 7f60329 removed several kstats which arc_summary.py read.
Remove these kstats from arc_summary.py in the same way this was
handled in FreeNAS.

FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4695

* Wait iput_async before evict_inodes to prevent race

Wait for iput_async before entering evict_inodes in
generic_shutdown_super. We must finish before evict_inodes because,
when lazytime is on or when zfs_purgedir calls zfs_zget, iput would
bump i_count from 0 to 1. This would race with the i_count check in
evict_inodes, which could then destroy the inode while we are still
using it.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4854

* Fixes and enhancements of SIMD raidz parity

- Implementation lock replaced with atomic variable

- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`

- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813

- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392

- Minor fixes and cleanups

- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.
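
The trailing-whitespace handling mentioned in the list above can be
sketched as below. strip_trailing_ws() is a hypothetical user-space
helper, not the function used by the module parameter code.

```c
#include <ctype.h>
#include <string.h>

/* Remove trailing whitespace (including '\n') from buf, in place. */
static char *
strip_trailing_ws(char *buf)
{
    size_t len = strlen(buf);

    while (len > 0 && isspace((unsigned char)buf[len - 1]))
        buf[--len] = '\0';
    return (buf);
}
```

This matters because `echo fastest` writes "fastest\n", which would not
match the implementation name "fastest" without the strip.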

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4813
Issue #1392

* RAIDZ parity kstat rework

Print table with speed of methods for each implementation.
Last line describes contents of [fastest] selection.

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4860

* Fix NULL pointer in zfs_preumount from 1d9b3bd

When zfs_domount fails, zsb will be freed, and its caller
mount_nodev/get_sb_nodev will call deactivate_locked_super, which calls
into zfs_preumount.

In order to make sure we don't touch any nonexistent stuff, we must make sure
s_fs_info is NULL in the fail path so zfs_preumount can easily check that.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4867
Issue #4854

* Illumos Crypto Port module added to enable native encryption in zfs

A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).

Signed-off-by: Tom Caputi <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4329

* Fix for compilation error when using the kernel's CONFIG_LOCKDEP

Signed-off-by: Tom Caputi <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4329

* zloop: print backtrace from core files

Find the core file by using `/proc/sys/kernel/core_pattern`

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4874

* Fix for metaslab_fastwrite_unmark() assert failure

Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.

Signed-off-by: Tom Caputi <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4267

* Remove znode's z_uid/z_gid member

Remove the duplicate z_uid/z_gid members, which are also held in the
generic vfs inode struct. This is done by first removing the members
from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID
macros to access the respective member from struct inode. In cases
where the uid/gids are being marshalled from/to disk, use the newly
introduced zfs_(uid|gid)_(read|write) functions to properly
save the uids rather than the internal kernel representation.

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4685
Issue #227

* Check whether the kernel supports i_uid/gid_read/write helpers

Since the concept of a kuid, and the need to translate from it to an
ordinary integer type, was added in kernel version 3.5, implement the
necessary plumbing to detect this condition at compile time. If
the kernel doesn't support kuid then just fall back to directly
accessing the respective struct inode members.

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4685
Issue #227

* Fix uninitialized variable in avl_add()

Silence the following warning when compiling with gcc 5.4.0.
Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.

module/avl/avl.c: In function ‘avl_add’:
module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized
    in this function [-Wmaybe-uninitialized]
  avl_insert(tree, new_node, where);

Signed-off-by: Brian Behlendorf <[email protected]>

* Fix sync behavior for disk vdevs

Prior to b39c22b, which was first generally available in the 0.6.5
release, ZoL never actually submitted synchronous read or write
requests to the Linux block layer.  This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.

In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start().  The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b.  In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.

The original rationale for introducing synchronous operations in b39c22b
was to hurry certain requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency.  This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.

To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.

For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
is used for the same purpose.
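
The version ranges above can be summarized as a small lookup. This is only
a sketch: TOY_KVER and unplug_hint_for() are hypothetical names, the flag
names are returned as strings rather than used as kernel symbols, and the
real build may well detect these flags with configure-time checks instead
of comparing version numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's KERNEL_VERSION() encoding. */
#define	TOY_KVER(a, b, c)	(((a) << 16) + ((b) << 8) + (c))

/*
 * Hypothetical helper: name of the unplug hint described for a given
 * kernel version, or NULL when the text above names no hint for it.
 */
const char *
unplug_hint_for(int kver)
{
	assert(kver > 0);
	if (kver >= TOY_KVER(2, 6, 32) && kver < TOY_KVER(2, 6, 36))
		return ("BIO_RW_UNPLUG");	/* 2.6.32-2.6.35 */
	if (kver >= TOY_KVER(2, 6, 36) && kver < TOY_KVER(2, 6, 39))
		return ("REQ_UNPLUG");		/* 2.6.36-2.6.38 */
	return (NULL);
}
```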

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4858

* Limit the amount of dnode metadata in the ARC

Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).
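
The 25% figure is simply the share of the ARC left over once metadata has
filled arc_meta_limit. A minimal arithmetic sketch, with hypothetical
helper names that are not part of ZFS:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes left for data when metadata fills its limit. */
uint64_t
arc_data_headroom(uint64_t arc_c_max, uint64_t arc_meta_limit)
{
	assert(arc_meta_limit <= arc_c_max);
	return (arc_c_max - arc_meta_limit);
}

/* The same quantity as a whole percentage of arc_c_max. */
unsigned
arc_data_headroom_pct(uint64_t arc_c_max, uint64_t arc_meta_limit)
{
	assert(arc_c_max > 0);
	return ((unsigned)(arc_data_headroom(arc_c_max, arc_meta_limit) *
	    100 / arc_c_max));
}
```

With a 4GiB ARC and a 3GiB metadata limit, the headroom is 1GiB, which is
25% of the total.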

In order to help track metadata usage more precisely, the other_size
metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes.  Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
it tries to unpin 10% of the dnodes.
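
The relationship between the two tunables can be sketched as plain
arithmetic. The helper names below are hypothetical, and the real code may
round differently or convert the byte count into a number of objects to
prune; only the percentages come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Default described above: the dnode limit is 10% of the metadata limit. */
uint64_t
dnode_limit_default(uint64_t arc_meta_limit)
{
	return (arc_meta_limit / 10);
}

/*
 * Hypothetical helper: bytes of dnode space to try to unpin.  Returns 0
 * while dnode_size stays within dnode_limit; otherwise it returns
 * reduce_percent of the excess.
 */
uint64_t
dnode_prune_target(uint64_t dnode_size, uint64_t dnode_limit,
    unsigned reduce_percent)
{
	assert(reduce_percent <= 100);
	if (dnode_size <= dnode_limit)
		return (0);
	return ((dnode_size - dnode_limit) * reduce_percent / 100);
}
```

For example, with a limit of 100 bytes and 600 bytes of dnodes, the excess
is 500 bytes and a 10% setting yields a prune target of 50 bytes.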

The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):

    - Create a large number of small files until arc_meta_used exceeds
      arc_meta_limit (3GiB with default tuning) and arc_prune
      starts increasing.

    - Create a 3GiB file with dd.  Observe arc_meta_used.  It will still
      be around 3GiB.

    - Repeatedly read the 3GiB file and observe arc_meta_used as before.
      It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made
available as subsequent demands on th…
nedbass pushed a commit to nedbass/zfs that referenced this issue Aug 26, 2016
The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.

Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
  db_blkptr = NULL and it's dirtied.

Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
  xattr fits into the bonus buffer, so it's removed. The dbuf is
  undirtied in this txg, but it's still referenced and cannot be
  destroyed.

Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
  is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
  (not done yet).
* A new change makes the spill buffer necessary again.
  sa_build_layouts() ends up calling dbuf_find() to locate the
  dbuf.  It finds the old dbuf because it has not been destroyed yet
  (it will be destroyed when the previous write is done and there
  are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
  referenced, so it's not destroyed.

Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
  directly copied into the dnode, overwriting the blkptr area because,
  in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
  gets corrupted.
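
The sequence above can be modelled with a toy dnode whose bonus area
overlaps the slot that holds the spill block pointer. This is only an
illustrative sketch: every toy_* name, the sizes, and the xattr string are
made up, and it does not reproduce the real dnode layout or the actual
patch. It only shows why a still-referenced dbuf that keeps pointing at
the slot sees garbage, and why resetting db_blkptr to NULL avoids that.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	TOY_TAIL_LEN	32	/* toy size, not the real dnode layout */
#define	TOY_SPILL_OFF	24	/* the last 8 bytes double as the spill slot */

typedef struct toy_dnode {
	int		has_spill;	/* stands in for DNODE_FLAG_SPILL_BLKPTR */
	unsigned char	tail[TOY_TAIL_LEN];	/* bonus area over the slot */
} toy_dnode_t;

typedef struct toy_dbuf {
	unsigned char	*db_blkptr;	/* NULL, or the slot inside the dnode */
} toy_dbuf_t;

/* txg A sync: the spill dbuf gets a block pointer stored in the dnode. */
void
toy_sync_spill(toy_dnode_t *dn, toy_dbuf_t *db, uint64_t addr)
{
	dn->has_spill = 1;
	db->db_blkptr = dn->tail + TOY_SPILL_OFF;
	memcpy(db->db_blkptr, &addr, sizeof (addr));
}

/* txg B sync: the xattr fits in the bonus area and is copied over the slot. */
void
toy_sync_bonus(toy_dnode_t *dn, toy_dbuf_t *db, const char *xattr, int fixed)
{
	size_t len = strlen(xattr);

	assert(len <= TOY_TAIL_LEN);
	dn->has_spill = 0;
	memcpy(dn->tail, xattr, len);
	if (fixed)
		db->db_blkptr = NULL;	/* flag is clear: drop the stale pointer */
}

/* Returns 1 if the still-referenced dbuf ends up with a clobbered pointer. */
int
toy_scenario_corrupts(int fixed)
{
	toy_dnode_t dn;
	toy_dbuf_t db;
	uint64_t seen;

	memset(&dn, 0, sizeof (dn));
	db.db_blkptr = NULL;
	toy_sync_spill(&dn, &db, 0x1000);
	/* 32 bytes of xattr text, long enough to cover the spill slot. */
	toy_sync_bonus(&dn, &db, "trusted.example=0123456789abcdef", fixed);
	if (db.db_blkptr == NULL)
		return (0);
	memcpy(&seen, db.db_blkptr, sizeof (seen));
	return (seen != 0x1000);
}
```

Calling toy_scenario_corrupts(0) models the unpatched behavior, where the
dbuf reads xattr bytes in place of its block pointer; calling it with 1
models clearing db_blkptr when the flag is clear.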

Signed-off-by: Peng <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#3937
nedbass pushed a commit to nedbass/zfs that referenced this issue Sep 3, 2016
nedbass pushed a commit to nedbass/zfs that referenced this issue Sep 5, 2016
nedbass pushed a commit to nedbass/zfs that referenced this issue Sep 5, 2016
tuxoko pushed a commit to tuxoko/zfs that referenced this issue Sep 8, 2016
DeHackEd pushed a commit to DeHackEd/zfs that referenced this issue Oct 19, 2016
DeHackEd pushed a commit to DeHackEd/zfs that referenced this issue Oct 29, 2016