Large dnode pool feature #3542

nedbass · 2015-06-29T18:38:24Z

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

adilger · 2015-08-21T22:21:33Z

Ned, have you done any performance testing on this patch with Lustre or other xattr-heavy workload?

I can imagine that increasing the dnode size itself would reduce performance to some extent, since more IO is needed for a given number of dnodes. Once xattrs are included into the test mix, and a real world workload where the dnode attributes and/or xattr are modified after initial creation the performance would swing in favor of the large dnodes due to avoiding seeking for the spill blocks.

jxiong · 2015-09-04T23:10:52Z

On my testing node, I noticed that once I enabled this feature, the xattr is not stored as SA any more.

[root@centos7 testpool]# ~/srcs/zfs/zfs/cmd/zfs/zfs get all zfstest/testpool |egrep 'dnodesize|xattr'
zfstest/testpool xattr sa local
zfstest/testpool dnodesize 1K local

touch a file and set xattr:

[root@centos7 testpool]# rm -rf tf
[root@centos7 testpool]# touch tf
[root@centos7 testpool]# setfattr -n user.val -v "The size of a dnode may be a multiple of 512 bytes up to the size of" tf
[root@centos7 testpool]# stat tf
File: ‘tf’
Size: 0 Blocks: 1 IO Block: 131072 regular empty file
Device: 26h/38d Inode: 15 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Context: unconfined_u:object_r:unlabeled_t:s0
Access: 2015-09-04 10:47:39.241747282 -0400
Modify: 2015-09-04 10:47:39.241747282 -0400
Change: 2015-09-04 10:47:39.241747282 -0400
Birth: -

And then I found that user.val is actually stored in a separate object:

Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    17    1    16K    512      0      1K    512  100.00  ZFS directory
                                           168   bonus  System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED 
dnode maxblkid: 0
path    /tf/<xattrdir>
uid     0
gid     0
atime   Fri Sep  4 10:47:39 2015
mtime   Fri Sep  4 10:47:42 2015
ctime   Fri Sep  4 10:47:42 2015
crtime  Fri Sep  4 10:47:39 2015
gen 658
mode    41777
size    4
parent  15
links   2
pflags  40800000145
microzap: 512 bytes, 2 entries

    user.val = 21 (type: Regular File)
    security.selinux = 19 (type: Regular File)

Indirect blocks:
0 L0 EMBEDDED et=0 200L/53P B=659

Is this feature exclusive with SA based xattr, or I just missed something in my setup?

behlendorf · 2015-09-09T16:04:41Z

@jxiong enabling the pool feature and setting both the setting xattr=sa and dnodesize=N properties should be sufficient. Could you post a test script or an example commands for how you ran the test and we'll try and reproduce the issue. I know Ned tested this pretty extensively before posting anything for review. But on the surface it sure looks like you did everything right.

jxiong · 2015-09-09T22:14:25Z

@behlendorf the commands I used to do the test are regular ones, just create a testpool with xattr set to sa and dnodesize set to 1K. I was trying to see how large lustre layout can be stored in 1K dnode. The commands are as follows:

[root@centos7 zfs]# zpool create -f zfstest raidz2 /Users/jxiong/vdevs/d{0..9}
[root@centos7 zfs]# zfs create -o xattr=sa -o dnodesize=1k zfstest/testpool
[root@centos7 zfs]# zfs get all zfstest/testpool |egrep 'dnodesize|xattr'
zfstest/testpool xattr sa local
zfstest/testpool dnodesize 1K local
[root@centos7 zfs]# touch /zfstest/testpool/tf
[root@centos7 zfs]# setfattr -n user.val -v "helloworkd" /zfstest/testpool/tf
[root@centos7 zfs]# stat --printf="%i\n" /zfstest/testpool/tf
7
[root@centos7 zfs]# zdb -vvvv zfstest/testpool 7
....

Before the patch is applied, I should be able to see that user.val is stored in bonus buffer, but now I can see the user.val is stored in another object just like 'xattr=on' wasn't taken effect.

[root@centos7 zfs]# zdb -vvvv zfstest/testpool 11
Dataset zfstest/testpool [ZPL], ID 71, cr_txg 5, 37.1K, 9 objects, rootbp DVA[0]=<0:16800:600> DVA[1]=<0:2e00de00:600> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=89L/89P fill=9 cksum=b30a661a1:44c445301f7:d9f4cd414a58:1db06447b7fe03

Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    11    1    16K    512     1K      1K    512  100.00  ZFS plain file
                                           168   bonus  System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED 
dnode maxblkid: 0
path    /tf/<xattrdir>/user.val
uid     0
gid     0
atime   Wed Sep  9 18:04:27 2015
mtime   Wed Sep  9 18:04:27 2015
ctime   Wed Sep  9 18:04:27 2015
crtime  Wed Sep  9 18:04:27 2015
gen 89
mode    100644
size    10
parent  9
links   1
pflags  40800000005

Indirect blocks:
0 L0 0:12c00:600 200L/200P F=1 B=89/89

    segment [0000000000000000, 0000000000000200) size   512

nedbass · 2015-09-13T23:27:38Z

@jxiong I'm unable to reproduce your results. It is working as expected for me. Can you post the git commit revision you are testing, and confirm that your user space tools match the running kernel?

Below is a transcript of me running your test case. You can see from the zdb output that the xattrs are stored in the bonus buffer as expected.

$  bass6@firstdown ~/zfs /dev/pts/0 Sun Sep 13 16:20:17 (large_dnode_feat u=) >
git log --pretty=oneline | head -1
0da0e790a28488e8cf1e6ea6a1c0a6640a6f0b15 Implement large_dnode pool feature
$  bass6@firstdown ~/zfs /dev/pts/0 Sun Sep 13 16:20:25 (large_dnode_feat u=) >
sudo ./cmd/zpool/zpool create -f zfstest raidz2 /home/bass6/vdevs/d{0..9}
$  bass6@firstdown ~/zfs /dev/pts/0 Sun Sep 13 16:20:57 (large_dnode_feat u=) >
sudo ./cmd/zfs/zfs create -o xattr=sa -o dnodesize=1k zfstest/testpool
$  bass6@firstdown ~/zfs /dev/pts/0 Sun Sep 13 16:21:03 (large_dnode_feat u=) >
sudo ./cmd/zfs/zfs get all zfstest/testpool |egrep 'dnodesize|xattr'
zfstest/testpool  xattr                 sa                     local
zfstest/testpool  dnodesize             1K                     local
$  bass6@firstdown ~/zfs /dev/pts/0 Sun Sep 13 16:21:14 (large_dnode_feat u=) >
sudo touch /zfstest/testpool/tf
$  bass6@firstdown ~/zfs /dev/pts/0 Sun Sep 13 16:21:31 (large_dnode_feat u=) >
sudo  setfattr -n user.val -v "helloworkd" /zfstest/testpool/tf
$  bass6@firstdown ~/zfs /dev/pts/0 Sun Sep 13 16:21:45 (large_dnode_feat u=) >
stat --printf="%i\n" /zfstest/testpool/tf
7
$  bass6@firstdown ~/zfs /dev/pts/0 Sun Sep 13 16:21:59 (large_dnode_feat u=) >
sudo ./cmd/zdb/zdb -vvvv zfstest/testpool 7
Dataset zfstest/testpool [ZPL], ID 71, cr_txg 11, 35.9K, 7 objects, rootbp DVA[0]=<0:cc00:600> DVA[1]=<0:f00cc00:600> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=21L/21P fill=7 cksum=d11c61e75:50fe17f8e3a:102dc32f47fb3:23860111b46c19

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         7    1    16K    512      0      1K    512    0.00  ZFS plain file
                                               236   bonus  System attributes
    dnode flags: USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /tf
    uid     0
    gid     0
    atime   Sun Sep 13 16:21:31 2015
    mtime   Sun Sep 13 16:21:31 2015
    ctime   Sun Sep 13 16:21:31 2015
    crtime  Sun Sep 13 16:21:31 2015
    gen 18
    mode    100644
    size    0
    parent  4
    links   1
    pflags  40800000004
    SA xattrs: 60 bytes, 1 entries

        user.val = helloworkd
Indirect blocks:


$  bass6@firstdown ~/zfs /dev/pts/0 Sun Sep 13 16:22:06 (large_dnode_feat u=) >
sudo ./cmd/zdb/zdb -vvvv zfstest/testpool 9
Dataset zfstest/testpool [ZPL], ID 71, cr_txg 11, 35.9K, 7 objects, rootbp DVA[0]=<0:cc00:600> DVA[1]=<0:f00cc00:600> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=21L/21P fill=7 cksum=d11c61e75:50fe17f8e3a:102dc32f47fb3:23860111b46c19

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
zdb: dmu_bonus_hold(9) failed, errno 2
$  bass6@firstdown ~/zfs /dev/pts/0 Sun Sep 13 16:22:18 (large_dnode_feat u=) >
sudo ./cmd/zdb/zdb -vvvv zfstest/testpool 11
Dataset zfstest/testpool [ZPL], ID 71, cr_txg 11, 35.9K, 7 objects, rootbp DVA[0]=<0:cc00:600> DVA[1]=<0:f00cc00:600> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=21L/21P fill=7 cksum=d11c61e75:50fe17f8e3a:102dc32f47fb3:23860111b46c19

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
zdb: dmu_bonus_hold(11) failed, errno 2

nedbass · 2015-09-14T06:04:23Z

I can reproduce the problem after rebasing the patch on master. Thanks for the bug report, I'll run down the root cause.

nedbass · 2015-09-16T10:20:10Z

@jxiong I've identified the cause of the issue you reported. Pull request #3788 should address it.

jxiong · 2015-09-16T22:26:33Z

@nedbass I will try it out. Thanks a lot.

behlendorf · 2015-10-20T22:37:07Z

@ahrens the proposed large dnode feature is ready for review. Can you please look this patch over. The only major missing bit of functionality missing is send/recv support to pools which don't have large dnode support. What are your thoughts on this?

I've created a new 'OpenZFS Review' tag so we can more easily track these reviews going forward.

https://github.com/zfsonlinux/zfs/labels/OpenZFS%20Review

ahrens · 2015-10-23T23:15:45Z

include/sys/dnode.h

@@ -202,6 +221,7 @@ typedef struct dnode {
 	uint32_t dn_datablksz;		/* in bytes */
 	uint64_t dn_maxblkid;
 	uint8_t dn_next_type[TXG_SIZE];
+	uint8_t dn_count;		/* metadnode slots consumed on disk */


good that there's a comment explaining what this is, but even better would be to name the variable more evocatively. e.g. dn_num_slots

ahrens · 2015-10-23T23:44:39Z

Can you remind me why you need to be able to change the dnode size after the fact, as opposed to setting it only at filesystem creation time? Seems like that would be simpler so I'd like to understand the use case.

ahrens · 2015-10-24T06:18:25Z

Having zil replay create the object with a different size seems wrong to me. What if a subsequent log record could require the additional space in the bonus buffer? It seems like the ZIL code is assuming that the bonus buffer is only used by the SA code which can (I presume) dynamically deal with any bonus buffer size.

I would suggest you find a way to store the num_slots in the lr_create_t (or at worst, add a new log record type). For example, you could use some of the high 16 bits of lr_foid, which is currently at most DN_MAX_OBJECT. (use the BF64_* macros to carve it up into a bitfield).

adilger · 2015-10-24T07:13:11Z

On Oct 23, 2015, at 17:44, Matthew Ahrens [email protected] wrote:

Can you remind me why you need to be able to change the dnode size after the fact, as opposed to setting it only at filesystem creation time? Seems like that would be simpler so I'd like to understand the use case.

I think the main benefit is that the administrator doesn't need to guess in advance what dnode size to select, nor whether that size is appropriate or optimal for all users/apps in the filesystem. Different usage patterns may create different amounts of extended metadata (inline xattrs/SAs) that are difficult to know in advance.

Another major benefit of allowing dynamic dnode sizes is that this would make the feature immediately useful for existing datasets rather than only new ones.

behlendorf · 2015-10-26T20:50:36Z

Can you remind me why you need to be able to change the dnode size after the fact.

Making the dnode size fixed at dataset create time would have definitely simplify things. Although I think it would also make the feature far less useful for exactly the reasons @adilger mentions.

I know internally we have large existing filesystems (with ~3+ billion files) which we rely on 24/7. We'd benefit greatly from being able to enable this feature on an existing dataset and have all new dnodes properly sized. Even better I could see this feature being enhanced so existing dnodes could be grown/shrunk in place if the needed slots were available. That's a huge win for existing filesystems.

Having zil replay create the object with a different size seems wrong to me.

This is another area where @nedbass and I went back and forth. If at all possible we wanted to avoid adding a new log record type and we were leery of squatting on any existing fields. That led us to deciding that large dnodes were just an optimization and that in the unlikely case of a crash creating a few of non-optimally sized dnodes wouldn't be the worst thing. That said, if you're OK with us using the upper bits of lr_foid I like that much better.

@ahrens do you have any ideas about how send/recv should be handled to a pool which doesn't support this feature? Can we just punt on it and avoid handling this case (my preference), or must we be excessively clever and generate a stream which is backwards compatible on the fly?

Thanks for the review! I'm sure @nedbass will be able to chime in the next week or so.

ahrens · 2015-10-26T23:20:38Z

That led us to deciding that large dnodes were just an optimization and that in the unlikely case of a crash creating a few of non-optimally sized dnodes wouldn't be the worst thing. That said, if you're OK with us using the upper bits of lr_foid I like that much better.

@behlendorf I think if you wanted to make large dnodes truly and optimization, it would need to be tightly integrated with the SA code and not usable any other way. Contrast with the spill block, which is only used the SA (I think) but it works correctly without the SA and it would be easy to correctly add a different consumer.

Given that large dnodes is are already changing the on-disk format, I don't see why modifying the ZIL format (either to use upper bits of lr_foid or to add a new log record type) is a big deal.

do you have any ideas about how send/recv should be handled to a pool which doesn't support this feature?

The recv should gracefully fail. You can handle this like the spill block, by adding a new DRR_BACKUP_FEATURE_*.

behlendorf · 2015-10-27T16:59:28Z

Thanks for the feedback and pointer to DMU_BACKUP_FEATURE_* flags. We'll try and get it refreshed in the next few weeks.

nedbass · 2016-03-25T21:02:39Z

@bzzz77 I updated the patch to address the performance regression. With large dnodes dmu_object_next() needs to call dmu_object_info() to figure out how many dnodes to skip in the current block. It turns out this introduces a significant throughput penalty on create rates. To fix this I updated dmu_object_alloc() to avoid the call to dmu_object_next() altogether, and instead simply retry the allocation starting from the next dnode block. This approach seems to be significantly faster, even for master without the large dnode patch. I plan to submit that change as a separate pull request for master to elicit some discussion and testing.

bzzz77 · 2016-03-28T08:21:37Z

I can confirm this patch is doing much better.

adilger · 2016-05-13T01:39:33Z

module/zfs/dmu.c

@@ -1929,6 +1930,18 @@ dmu_object_size_from_db(dmu_buf_t *db_fake, uint32_t *blksize,
 }


Looks like dmu_object_size_from_db() also needs to be updated. The "add 1 for dnode space" is incorrect for large dnodes, so this should be:

/* add in the number of slots used for the dnode itself */ *nbk512 = ... SPA_MINBLOCKSHIFT) + (dn->dn_num_slots << DNODE_SHIFT);

Good catch.

nedbass · 2016-06-03T21:23:19Z

Refreshed patch rebased on #4711. Other changes:

Further reworked metadnode backfill algorithm to avoid potential pathology with large dnodes. A low blk_fill count could make full dnode blocks appear sparse. The allocator would then repeatedly attempt and fail to allocate from these full blocks. For example, a full block with 4k dnodes would appear to be 25% full according to the blk_fill field, so it would be selected for backfill. To mitigate this the code now switches to searching for empty blocks after too many failures.
Changed the allowed values for the dnodesize property from a numeric value to legacy | auto. This avoids burdening users with having to figure out a reasonable number. Currently auto just sets a default size of 1k, which is a good choice for most common xattr workloads (e.g. Lustre, SELinux). Future code improvements could dynamically choose a dnode size based on observed workloads.

TODO: test cases

adilger · 2016-06-04T03:41:48Z

While I see the "auto" and "legacy" options as the best useful useful for most users, I don't think that allowing a numerical value is overly complex. Users don't have to specify the dnode size if they don't want, but this could be very useful for benchmarking and performance tuning in the future. That is doubly true because this is a new feature and we don't necessarily have a good idea of what the optimum value is for different workloads. For example, Samba makes heavy use of xattrs and may benefit from larger dnodes than other configurations.

As for the fill count vs. large dnode size, this could negate some of the gains being made on the file creation front if there is a mixed create/unlink workload since it means that more full blocks may be reused before their time...

Does it make sense to keep an in-memory counter of dnode sizes and dnode counts to work out the running average dnode size (possibly taking into account the desired dnode size even when there wasn't enough space for it). The counters could be per-CPU and do not need to be perfectly uptodate, so long as they are consistent in the sum and count

That would help in both of these cases. It would give the expansion factor for the full count, to determine when a dnode leaf block should be considered full. It would also help the "auto" case know how large to allocate dnodes rather than having a fixed size and assuming that is best.

nedbass · 2016-06-04T10:02:26Z

@adilger thanks for the feedback. I just refreshed the patch with a version that allows some literal values to be specified. For simplicity of the UI I restricted it to powers of 2 from 1k to 16k. I was having trouble coming up with a UI I was happy with that allowed both integer and string property values. The existing ZFS property interfaces don't support that in a flexible way. So I settled on limiting allowed values to legacy | auto | 1k | 2k | 4k | 8k | 16k. Ideally I would have liked to support integer and shorthand formats for input and output, and to support arbitrary multiples of 512, but it didn't seem worth the code complexity that would entail. So just exposing the shorthand powers-of-2 seemed like a reasonable compromise.

nedbass · 2016-06-23T02:58:33Z

Refreshed with test cases added to ZFS Test Suite. Also rebased to include some improvements to xattrtest which the test suite takes advantage of.

- Use a fixed buffer of random bytes when random xattr values are in effect. This eliminates the potential performance bottleneck of reading from /dev/urandom for each file. This also allows us to verify xattrs in random value mode. - Show the rate of operations per second in addition to elapsed time for each phase of the test. This may be useful for benchmarking. - Set default xattr size to 6 so that verify doesn't fail if user doesn't specify a size. We need at least six bytes to store the leading "size=X" string that is used for verification. - Allow user to execute just one phase of the test. Acceptable values for -o and their meanings are: 1 - run the create phase 2 - run the setxattr phase 3 - run the getxattr phase 4 - run the unlink phase Signed-off-by: Ned Bass <[email protected]>

@mahrens

Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in openzfs#4636: "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don’t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can’t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2." Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limit the rescan to at most once per TXG. Signed-off-by: Ned Bass <[email protected]>

Justification ------------- This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced to from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant. ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill bocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value. The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interfaces names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size. New DMU interfaces: dmu_object_alloc_dnsize() dmu_object_claim_dnsize() dmu_object_reclaim_dnsize() New ZAP interfaces: zap_create_dnsize() zap_create_norm_dnsize() zap_create_flags_dnsize() zap_create_claim_norm_dnsize() zap_create_link_dnsize() The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool. These are a few noteworthy changes to key functions: * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by an large dnode, in which case it returns ENOENT. * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <[email protected]>

behlendorf · 2016-06-24T21:26:12Z

Thanks everybody, merged to master:

50c957f Implement large_dnode pool feature
68cbd56 Backfill metadnode more intelligently
8128558 xattrtest: allow verify with -R and other improvements

* Consistently use parsable instead of parseable This is a purely cosmetical change, to consistently prefer one of two (both acceptable) choises for the word parsable in documentation and code. I don't really care which to use, but acording to wiktionary https://en.wiktionary.org/wiki/parsable#English parsable is preferred. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4682 * Add missing RPM BuildRequires Both libudev and libattr are recommended build requirements. As such their development headers should lists in the rpm spec file so those dependencies are pulled in when building rpm packages. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4676 * Skip ctldir znode in zfs_rezget to fix snapdir issues Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This will cause funny behaviour for the mounted snapdirs. Especially for Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone automount it again as long as someone is still using the detached mount. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4514 Closes #4661 Closes #4672 * Improve zfs-module-parameters(5) Various rewrites to the descriptions of module parameters. Corrects spelling mistakes, makes descriptions them more user-friendly and describes some ZFS quirks which should be understood before changing parameter values. Signed-off-by: DHE <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4671 * Fix arc_prune_task use-after-free arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent the underlying zsb from disappearing if there's a concurrent umount. We fix this by force the caller of arc_remove_prune_callback to wait for arc_prune_taskq to finish. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4687 Closes #4690 * Add request size histograms (-r) to zpool iostat, minor man page fix Add -r option to "zpool iostat" to print request size histograms for the leaf ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs ("agg"). These stats can be useful for seeing how well the ZFS IO aggregator is working. $ zpool iostat -r mypool sync_read sync_write async_read async_write scrub req_size ind agg ind agg ind agg ind agg ind agg ---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 512 0 0 0 0 0 0 530 0 0 0 1K 0 0 260 0 0 0 116 246 0 0 2K 0 0 0 0 0 0 0 431 0 0 4K 0 0 0 0 0 0 3 107 0 0 8K 15 0 35 0 0 0 0 6 0 0 16K 0 0 0 0 0 0 0 39 0 0 32K 0 0 0 0 0 0 0 0 0 0 64K 20 0 40 0 0 0 0 0 0 0 128K 0 0 20 0 0 0 0 0 0 0 256K 0 0 0 0 0 0 0 0 0 0 512K 0 0 0 0 0 0 0 0 0 0 1M 0 0 0 0 0 0 0 0 0 0 2M 0 0 0 0 0 0 0 0 0 0 4M 0 0 0 0 0 0 155 19 0 0 8M 0 0 0 0 0 0 0 811 0 0 16M 0 0 0 0 0 0 0 68 0 0 -------------------------------------------------------------------------------- Also rename the stray "-G" in the man page to be "-w" for latency histograms. Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #4659 * OpenZFS 6531 - Provide mechanism to artificially limit disk performance Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> Ported by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6531 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130 Porting notes: - Added new IO delay tracepoints, and moved common ZIO tracepoint macros to a new trace_common.h file. - Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function. - Updated zinject man page - Updated zpool_scrub test files * Systemd configuration fixes * Disable zfs-import-scan.service by default. This ensures that pools will not be automatically imported unless they appear in the cache file. When this service is explicitly enabled pools will be imported with the "cachefile=none" property set. This prevents the creation of, or update to, an existing cache file. $ systemctl list-unit-files | grep zfs zfs-import-cache.service enabled zfs-import-scan.service disabled zfs-mount.service enabled zfs-share.service enabled zfs-zed.service enabled zfs.target enabled * Change services to dynamic from static by adding an [Install] section and adding 'WantedBy' tags in favor of 'Requires' tags. This allows for easier customization of the boot behavior. * Start the zfs-import-cache.service after the root pivot so the cache file is available in the standard location. * Start the zfs-mount.service after the systemd-remount-fs.service to ensure the root fs is writeable and the ZFS filesystems can create their mount points. * Change the default behavior to only load the ZFS kernel modules in zfs-import-*.service or when blkid(8) detects a pool. Users who wish to unconditionally load the kernel modules must uncomment the list of modules in /lib/modules-load.d/zfs.conf. Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4325 Closes #4496 Closes #4658 Closes #4699 * Fix self-healing IO prior to dsl_pool_init() completion Async writes triggered by a self-healing IO may be issued before the pool finishes the process of initialization. This results in a NULL dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes(). George Wilson recommended addressing this issue by initializing the passed `dsl_pool_t **` prior to dmu_objset_open_impl(). Since the caller is passing the `spa->spa_dsl_pool` this has the effect of ensuring it's initialized. However, since this depends on the caller knowing they must pass the `spa->spa_dsl_pool` an additional NULL check was added to vdev_queue_max_async_writes(). This guards against any future restructuring of the code which might result in dsl_pool_init() being called differently. Signed-off-by: GeLiXin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4652 * Add isa_defs for MIPS GCC for MIPS only defines _LP64 when 64bit, while no _ILP32 defined when 32bit. Signed-off-by: YunQiang Su <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4712 * Fix out-of-bound access in zfs_fillpage The original code will do an out-of-bound access on pl[] during last iteration. ================================================================== BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs] Read of size 8 by task tmpfile/7850 page:ffffea00017c6dc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0xffff8000000000() page dumped because: kasan: bad access detected CPU: 3 PID: 7850 Comm: tmpfile Tainted: G OE 4.6.0+ #3 ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618 ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8 ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3 Call Trace: [<ffffffff81635618>] dump_stack+0x63/0x8b [<ffffffff81313ee8>] kasan_report_error+0x528/0x560 [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0 [<ffffffff813144b8>] kasan_report+0x58/0x60 [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffff81312e4e>] __asan_load8+0x5e/0x70 [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs] [<ffffffff81353c3a>] SyS_execve+0x3a/0x50 [<ffffffff810058ef>] do_syscall_64+0xef/0x180 [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25 Memory state around the buggy address: ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 ^ ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4705 Issue #4708 * Fix memleak in zpl_parse_options strsep() will advance tmp_mntopts, and will change it to NULL on last iteration. This will cause strfree(tmp_mntopts) to not free anything. unreferenced object 0xffff8800883976c0 (size 64): comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s) hex dump (first 32 bytes): 72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a rw.strictatime.z 66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d fsutil.mntpoint= backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811f9cac>] __kmalloc+0x16c/0x250 [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl] [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs] [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs] [<ffffffff81222dc8>] mount_fs+0x38/0x160 [<ffffffff81240097>] vfs_kern_mount+0x67/0x110 [<ffffffff812428e0>] do_mount+0x250/0xe20 [<ffffffff812437d5>] SyS_mount+0x95/0xe0 [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4706 Issue #4708 * Fix memleak in vdev_config_generate_stats fnvlist_add_nvlist will copy the contents of nvx, so we need to free it here. unreferenced object 0xffff8800a6934e80 (size 64): comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s) hex dump (first 32 bytes): 60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff `..s.....|.s.... 00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff [email protected]..... backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310 [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl] [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl] [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair] [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair] [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair] [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair] [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs] [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs] [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs] [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs] [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs] [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs] [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0 [<ffffffff812333b9>] SyS_ioctl+0x79/0x90 Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4707 Issue #4708 * Linux 4.7 compat: handler->set() takes both dentry and inode Counterpart to fd4c7b7, the same approach was taken to resolve the compatibility issue. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4717 Issue #4665 * Implementation of AVX2 optimized Fletcher-4 New functionality: - Preserves existing scalar implementation. - Adds AVX2 optimized Fletcher-4 computation. - Fastest routines selected on module load (benchmark). - Test case for Fletcher-4 added to ztest. New zcommon module parameters: - zfs_fletcher_4_impl (str): selects the implementation to use. "fastest" - use the fastest version available "cycle" - cycle trough all available impl for ztest "scalar" - use the original version "avx2" - new AVX2 implementation if available Performance comparison (Intel i7 CPU, 1MB data buffers): - Scalar: 4216 MB/s - AVX2: 14499 MB/s See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl` to get list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. Signed-off-by: Jinshan Xiong <[email protected]> Signed-off-by: Andreas Dilger <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4330 * Fix cstyle.pl warnings As of perl v5.22.1 the following warnings are generated: * Redundant argument in printf at scripts/cstyle.pl line 194 * Unescaped left brace in regex is deprecated, passed through in regex; marked by <-- HERE in m/\S{ <-- HERE / at scripts/cstyle.pl line 608. They have been addressed by escaping the left braces and by providing the correct number of arguments to printf based on the fmt specifier set by the verbose option. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4723 * Fix minor spelling mistakes Trivial spelling mistake fix in error message text. * Fix spelling mistake "adminstrator" -> "administrator" * Fix spelling mistake "specificed" -> "specified" * Fix spelling mistake "interperted" -> "interpreted" Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4728 * Add `zfs allow` and `zfs unallow` support ZFS allows for specific permissions to be delegated to normal users with the `zfs allow` and `zfs unallow` commands. In addition, non- privileged users should be able to run all of the following commands: * zpool [list | iostat | status | get] * zfs [list | get] Historically this functionality was not available on Linux. In order to add it the secpolicy_* functions needed to be implemented and mapped to the equivalent Linux capability. Only then could the permissions on the `/dev/zfs` be relaxed and the internal ZFS permission checks used. Even with this change some limitations remain. Under Linux only the root user is allowed to modify the namespace (unless it's a private namespace). This means the mount, mountpoint, canmount, unmount, and remount delegations cannot be supported with the existing code. It may be possible to add this functionality in the future. This functionality was validated with the cli_user and delegation test cases from the ZFS Test Suite. These tests exhaustively verify each of the supported permissions which can be delegated and ensures only an authorized user can perform it. Two minor bug fixes were required for test-running.py. First, the Timer() object cannot be safely created in a `try:` block when there is an unconditional `finally` block which references it. Second, when running as a normal user also check for scripts using the both the .ksh and .sh suffixes. Finally, existing users who are simulating delegations by setting group permissions on the /dev/zfs device should revert that customization when updating to a version with this change. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #362 Closes #434 Closes #4100 Closes #4394 Closes #4410 Closes #4487 * Remove libzfs_graph.c The libzfs_graph.c source file should have been removed in 330d06f, it is entirely unused. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4766 * Linux 4.6 compat: Fall back to d_prune_aliases() if necessary As of 4.6, the icache and dcache LRUs are memcg aware insofar as the kernel's per-superblock shrinker is concerned. The effect is that dcache or icache entries added by a task in a non-root memcg won't be scanned by the shrinker in the context of the root (or NULL) memcg. This defeats the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to grow uncontrollably. This patch reverts to the d_prune_aliaes() method in case the kernel's per-superblock shrinker is not able to free anything. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes: #4726 * SIMD implementation of vdev_raidz generate and reconstruct routines This is a new implementation of RAIDZ1/2/3 routines using x86_64 scalar, SSE, and AVX2 instruction sets. Included are 3 parity generation routines (P, PQ, and PQR) and 7 reconstruction routines, for all RAIDZ level. On module load, a quick benchmark of supported routines will select the fastest for each operation and they will be used at runtime. Original implementation is still present and can be selected via module parameter. Patch contains: - specialized gen/rec routines for all RAIDZ levels, - new scalar raidz implementation (unrolled), - two x86_64 SIMD implementations (SSE and AVX2 instructions sets), - fastest routines selected on module load (benchmark). - cmd/raidz_test - verify and benchmark all implementations - added raidz_test to the ZFS Test Suite New zfs module parameters: - zfs_vdev_raidz_impl (str): selects the implementation to use. On module load, the parameter will only accept first 3 options, and the other implementations can be set once module is finished loading. Possible values for this option are: "fastest" - use the fastest math available "original" - use the original raidz code "scalar" - new scalar impl "sse" - new SSE impl if available "avx2" - new AVX2 impl if available See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to get the list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4328 * Fix NFS credential The commit f74b821 caused a regression where creating file through NFS will always create a file owned by root. This is because the patch enables the KSID code in zfs_acl_ids_create, which it would use euid and egid of the current process. However, on Linux, we should use fsuid and fsgid for file operations, which is the original behaviour. So we revert this part of code. The patch also enables secpolicy_vnode_*, since they are also used in file operations, we change them to use fsuid and fsgid. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4772 Closes #4758 * OpenZFS 6513 - partially filled holes lose birth time Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Boris Protopopov <[email protected]> Approved by: Richard Lowe <[email protected]>a Ported by: Boris Protopopov <[email protected]> Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6513 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0 If a ZFS object contains a hole at level one, and then a data block is created at level 0 underneath that l1 block, l0 holes will be created. However, these l0 holes do not have the birth time property set; as a result, incremental sends will not send those holes. Fix is to modify the dbuf_read code to fill in birth time data. * Add a test case for dmu_free_long_range() to ztest Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4754 * Revert "Add a test case for dmu_free_long_range() to ztest" This reverts commit d0de2e82df579f4e4edf5643b674a1464fae485f which introduced a new test case to ztest which is failing occasionally during automated testing. The change is being reverted until the issue can be fully investigated. Signed-off-by: Brian Behlendorf <[email protected]> Issue #4754 * OpenZFS 6878 - Add scrub completion info to "zpool history" Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Approved by: Dan McDonald <[email protected]> Authored by: Nav Ravindranath <[email protected]> Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6878 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5 Closes #4787 * FreeBSD rS271776 - Persist vdev_resilver_txg changes Persist vdev_resilver_txg changes to avoid panic caused by validation vs a vdev_resilver_txg value from a previous resilver. Authored-by: smh <[email protected]> Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/5154 FreeBSD-issue: https://reviews.freebsd.org/rS271776 FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf Closes #4790 * xattrtest: allow verify with -R and other improvements - Use a fixed buffer of random bytes when random xattr values are in effect. This eliminates the potential performance bottleneck of reading from /dev/urandom for each file. This also allows us to verify xattrs in random value mode. - Show the rate of operations per second in addition to elapsed time for each phase of the test. This may be useful for benchmarking. - Set default xattr size to 6 so that verify doesn't fail if user doesn't specify a size. We need at least six bytes to store the leading "size=X" string that is used for verification. - Allow user to execute just one phase of the test. Acceptable values for -o and their meanings are: 1 - run the create phase 2 - run the setxattr phase 3 - run the getxattr phase 4 - run the unlink phase Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> * Backfill metadnode more intelligently Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in #4636: "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don’t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can’t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2." Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limit the rescan to at most once per TXG. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> * Implement large_dnode pool feature Justification ------------- This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced to from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant. ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill bocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value. The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interfaces names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size. New DMU interfaces: dmu_object_alloc_dnsize() dmu_object_claim_dnsize() dmu_object_reclaim_dnsize() New ZAP interfaces: zap_create_dnsize() zap_create_norm_dnsize() zap_create_flags_dnsize() zap_create_claim_norm_dnsize() zap_create_link_dnsize() The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool. These are a few noteworthy changes to key functions: * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by an large dnode, in which case it returns ENOENT. * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3542 * Sync DMU_BACKUP_FEATURE_* flags Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING. The DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and then reserved in the upstream OpenZFS implementation. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #4795 * OpenZFS 2605, 6980, 6902 2605 want to resume interrupted zfs send Reviewed by: George Wilson <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Xin Li <[email protected]> Reviewed by: Arne Jansen <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: kernelOfTruth <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/2605 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6980 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f Porting notes: - All rsend and snapshop tests enabled and updated for Linux. - Fix misuse of input argument in traverse_visitbp(). - Fix ISO C90 warnings and errors. - Fix gcc 'missing braces around initializer' in 'struct send_thread_arg to_arg =' warning. - Replace 4 argument fletcher_4_native() with 3 argument version, this change was made in OpenZFS 4185 which has not been ported. - Part of the sections for 'zfs receive' and 'zfs send' was rewritten and reordered to approximate upstream. - Fix mktree xattr creation, 'user.' prefix required. - Minor fixes to newly enabled test cases - Long holds for volumes allowed during receive for minor registration. * OpenZFS 6051 - lzc_receive: allow the caller to read the begin record Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6051 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322 * OpenZFS 6393 - zfs receive a full send as a clone Authored by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Richard Elling <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6394 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e * OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS Authored by: Andrew Stormont <[email protected]> Reviewed by: Anil Vijarnia <[email protected]> Reviewed by: Kim Shrier <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6536 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b * OpenZFS 6738 - zfs send stream padding needs documentation Authored by: Eli Rosenthal <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6738 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff * OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota Authored by: Dan McDonald <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Gordon Ross <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/4986 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad * OpenZFS 6562 - Refquota on receive doesn't account for overage Authored by: Dan McDonald <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Reviewed by: Toomas Soome <[email protected]> Approved by: Gordon Ross <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6562 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6 * Implement zfs_ioc_recv_new() for OpenZFS 2605 Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy ZFS_IOC_RECV user/kernel interface. The new interface supports all stream options but is currently only used for resumable streams. This way updated user space utilities will interoperate with older kernel modules. ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW handler. Non-Linux OpenZFS platforms have opted to change the legacy interface in an incompatible fashion instead of adding a new ioctl. Signed-off-by: Brian Behlendorf <[email protected]> * OpenZFS 6314 - buffer overflow in dsl_dataset_name Reviewed by: George Wilson <[email protected]> Reviewed by: Prakash Surya <[email protected]> Reviewed by: Igor Kozhukhov <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6314 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee * OpenZFS 6876 - Stack corruption after importing a pool with a too-long name Reviewed by: Prakash Surya <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Ported-by: Brian Behlendorf <[email protected]> Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes. OpenZFS-issue: https://www.illumos.org/issues/6876 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e * Vectorized fletcher_4 must be 128-bit aligned The fletcher_4_native() and fletcher_4_byteswap() functions may only safely use the vectorized implementations when the buffer is 128-bit aligned. This is because both the AVX2 and SSE implementations process four 32-bit words per iterations. Fallback to the scalar implementation which only processes a single 32-bit word for unaligned buffers. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Issue #4330 * Allow building with `CFLAGS="-O0"` If compiled with -O0, gcc doesn't do any stack frame coalescing and -Wframe-larger-than=1024 is triggered in debug mode. Starting with gcc 4.8, new opt level -Og is introduced for debugging, which does not trigger this warning. Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4799 * Don't allow accessing XATTR via export handle Allow accessing XATTR through export handle is a very bad idea. It would allow user to write whatever they want in fields where they otherwise could not. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4828 * Fix get_zfs_sb race with concurrent umount Certain ioctl operations will call get_zfs_sb, which will holds an active count on sb without checking whether it's active or not. This will result in use-after-free. We fix this by using atomic_inc_not_zero to make sure we got an active sb. P1 P2 --- --- deactivate_locked_super(): s_active = 0 zfs_sb_hold() ->get_zfs_sb(): s_active = 1 ->zpl_kill_sb() -->zpl_put_super() --->zfs_umount() ---->zfs_sb_free(zsb) zfs_sb_rele(zsb) Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> * Fix Large kmem_alloc in vdev_metaslab_init This allocation can go way over 1MB, so we should use vmem_alloc instead of kmem_alloc. Large kmem_alloc(1430784, 0x1000), please file an issue... Call Trace: [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl] [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs] [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs] [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs] [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs] [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs] [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs] [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs] [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0 [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0 Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4752 * Add configure result for xattr_handler Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4828 * fh_to_dentry should return ESTALE when generation mismatch When generation mismatch, it usually means the file pointed by the file handle was deleted. We should return ESTALE to indicate this. We return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4828 * xattr dir doesn't get purged during iput We need to set inode->i_nlink to zero so iput will purge it. Without this, it will get purged during shrink cache or umount, which would likely result in deadlock due to zfs_zget waiting forever on its children which are in the dispose_list of the same thread. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Issue #4359 Issue #3508 Issue #4413 Issue #4827 * Kill zp->z_xattr_parent to prevent pinning zp->z_xattr_parent will pin the parent. This will cause huge issue when unlink a file with xattr. Because the unlinked file is pinned, it will never get purged immediately. And because of that, the xattr stuff will never be marked as unlinked. So the whole unlinked stuff will stay there until shrink cache or umount. This change partially reverts e89260a. This is safe because only the zp->z_xattr_parent optimization is removed, zpl_xattr_security_init() is still called from the zpl outside the inode lock. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Issue #4359 Issue #3508 Issue #4413 Issue #4827 * Fix RAIDZ_TEST tests Remove stray trailing } which prevented the raidz stress tests from running in-tree. Signed-off-by: Brian Behlendorf <[email protected]> * Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z The following scenario can result in garbage in the dn_spill field. The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR is clear to ensure the dn_spill field is cleared. Current txg = A. * A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied. Current txg = B. * The spill buffer is modified. It's marked as dirty in this txg. * Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed. Current txg = C. * Starts syncing of txg A * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called. * The dbuf starts being written and it reaches the ready state (not done yet). * A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL. * txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed. Current txg = D. * Starts syncing of txg B * dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr. * At this point, the db_blkptr of the spill buffer used in txg C gets corrupted. Signed-off-by: Peng <[email protected]> Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3937 * Fix handling of errors nvlist in zfs_ioc_recv_new() zfs_ioc_recv_impl() is changed to always allocate the 'errors' nvlist, its callers are responsible for freeing it. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4829 * Add RAID-Z routines for SSE2 instruction set, in x86_64 mode. The patch covers low-end and older x86 CPUs. Parity generation is equivalent to SSSE3 implementation, but reconstruction is somewhat slower. Previous 'sse' implementation is renamed to 'ssse3' to indicate highest instruction set used. Benchmark results: scalar_rec_p 4 720476442 scalar_rec_q 4 187462804 scalar_rec_r 4 138996096 scalar_rec_pq 4 140834951 scalar_rec_pr 4 129332035 scalar_rec_qr 4 81619194 scalar_rec_pqr 4 53376668 sse2_rec_p 4 2427757064 sse2_rec_q 4 747120861 sse2_rec_r 4 499871637 sse2_rec_pq 4 522403710 sse2_rec_pr 4 464632780 sse2_rec_qr 4 319124434 sse2_rec_pqr 4 205794190 ssse3_rec_p 4 2519939444 ssse3_rec_q 4 1003019289 ssse3_rec_r 4 616428767 ssse3_rec_pq 4 706326396 ssse3_rec_pr 4 570493618 ssse3_rec_qr 4 400185250 ssse3_rec_pqr 4 377541245 original_rec_p 4 691658568 original_rec_q 4 195510948 original_rec_r 4 26075538 original_rec_pq 4 103087368 original_rec_pr 4 15767058 original_rec_qr 4 15513175 original_rec_pqr 4 10746357 Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4783 * Enable zpool_upgrade test cases Creating the pool in a striped rather than mirrored configuration provides enough space for all upgrade tests to run. Test case zpool_upgrade_007_pos still fails and must be investigated so it has been left disabled. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4852 * Prevent null dereferences when accessing dbuf kstat In arc_buf_info(), the arc_buf_t may have no header. If not, don't try to fetch the arc buffer stats and instead just zero them. The null dereferences were observed while accessing the dbuf kstat with awk on a system in which millions of small files were being created in order to overflow the system's metadata limit. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4837 * Fix dbuf_stats_hash_table_data race Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf can be freed at any time. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4846 * Use native inode->i_nlink instead of znode->z_links A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's 64 bit on-disk link count. We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a more Linux-integrated fix for the same issue. In addition, setting the initial link count on a new node has been changed from setting one less than required in zfs_mknode() then incrementing to the correct count in zfs_link_create() (which was somewhat bizarre in the first place), to setting the correct count in zfs_mknode() and not incrementing it in zfs_link_create(). This both means we no longer set the link count in sa_bulk_update() twice (once for the initial incorrect count then again for the correct count), as well as adhering to the Linux requirement of not incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c). Signed-off-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4838 Issue #227 * Implementation of SSE optimized Fletcher-4 Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4) This commit adds another implementation of the Fletcher-4 algorithm. It is automatically selected at module load if it benchmarks higher than all other available implementations. The module benchmark was also amended to analyze the performance of the byteswap-ed version of Fletcher-4, as well as the non-byteswaped version. The average performance of the two is used to select the the fastest implementation available on the host system. Adds a pair of fields to an existing zcommon module parameter: - zfs_fletcher_4_impl (str) "sse2" - new SSE2 implementation if available "ssse3" - new SSSE3 implementation if available Signed-off-by: Tyler J. Stachecki <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4789 * Fix filesystem destroy with receive_resume_token It is possible that the given DS may have hidden child (%recv) datasets - "leftovers" resulting from the previously interrupted 'zfs receieve'. Try to remove the hidden child (%recv) and after that try to remove the target dataset. If the hidden child (%recv) does not exist the original error (EEXIST) will be returned. Signed-off-by: Roman Strashkin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4818 * Prevent segfaults in SSE optimized Fletcher-4 In some cases, the compiler was not respecting the GNU aligned attribute for stack variables in 35a76a0. This was resulting in a segfault on CentOS 6.7 hosts using gcc 4.4.7-17. This issue was fixed in gcc 4.6. To prevent this from occurring, use unaligned loads and stores for all stack and global memory references in the SSE optimized Fletcher-4 code. Disable zimport testing against master where this flaw exists: TEST_ZIMPORT_VERSIONS="installed" Signed-off-by: Tyler J. Stachecki <[email protected]> Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4862 * Update arc_summary.py for prefetch changes Commit 7f60329 removed several kstats which arc_summary.py read. Remove these kstats from arc_summary.py in the same way this was handled in FreeNAS. FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73 Signed-off-by: Brian Behlendorf <[email protected]> Closes #4695 * Wait iput_async before evict_inodes to prevent race Wait for iput_async before entering evict_inodes in generic_shutdown_super. The reason we must finish before evict_inodes is when lazytime is on, or when zfs_purgedir calls zfs_zget, iput would bump i_count from 0 to 1. This would race with the i_count check in evict_inodes. This means it could destroy the inode while we are still using it. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4854 * Fixes and enhancements of SIMD raidz parity - Implementation lock replaced with atomic variable - Trailing whitespace is removed from user specified parameter, to enhance experience when using commands that add newline, e.g. `echo` - raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813 - silence `cppcheck` in vdev_raidz, partial solution of Issue #1392 - Minor fixes and cleanups - Enable use of original parity methods in [fastest] configuration. New opaque original ops structure, representing native methods, is added to supported raidz methods. Original parity methods are executed if selected implementation has NULL fn pointer. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4813 Issue #1392 * RAIDZ parity kstat rework Print table with speed of methods for each implementation. Last line describes contents of [fastest] selection. Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4860 * Fix NULL pointer in zfs_preumount from 1d9b3bd When zfs_domount fails zsb will be freed, and its caller mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into zfs_preumount. In order to make sure we don't touch any nonexistent stuff, we must make sure s_fs_info is NULL in the fail path so zfs_preumount can easily check that. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4867 Issue #4854 * Illumos Crypto Port module added to enable native encryption in zfs A port of the Illumos Crypto Framework to a Linux kernel module (found in module/icp). This is needed to do the actual encryption work. We cannot use the Linux kernel's built in crypto api because it is only exported to GPL-licensed modules. Having the ICP also means the crypto code can run on any of the other kernels under OpenZFS. I ended up porting over most of the internals of the framework, which means that porting over other API calls (if we need them) should be fairly easy. Specifically, I have ported over the API functions related to encryption, digests, macs, and crypto templates. The ICP is able to use assembly-accelerated encryption on amd64 machines and AES-NI instructions on Intel chips that support it. There are place-holder directories for similar assembly optimizations for other architectures (although they have not been written). Signed-off-by: Tom Caputi <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4329 * Fix for compilation error when using the kernel's CONFIG_LOCKDEP Signed-off-by: Tom Caputi <[email protected]> Signed-off-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4329 * zloop: print backtrace from core files Find the core file by using `/proc/sys/kernel/core_pattern` Signed-off-by: Gvozden Neskovic <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4874 * Fix for metaslab_fastwrite_unmark() assert failure Currently there is an issue where metaslab_fastwrite_unmark() unmarks fastwrites on vdev_t's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that the vdev_pending_fastwrite field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assigns vdev_pending_fastwrite from the old vdev to the new one so this count is not lost. Signed-off-by: Tom Caputi <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4267 * Remove znode's z_uid/z_gid member Remove duplicate z_uid/z_gid member which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid|gid)_(read|write) functions to properly save the uids rather than the internal kernel representation. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4685 Issue #227 * Check whether the kernel supports i_uid/gid_read/write helpers Since the concept of a kuid and the need to translate from it to ordinary integer type was added in kernel version 3.5 implement necessary plumbing to be able to detect this condition during compile time. If the kernel doesn't support the kuid then just fall back to directly accessing the respective struct inode's members Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #4685 Issue #227 * Fix uninitialized variable in avl_add() Silence the following warning when compiling with gcc 5.4.0. Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609. module/avl/avl.c: In function ‘avl_add’: module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized in this function [-Wmaybe-uninitialized] avl_insert(tree, new_node, where); Signed-off-by: Brian Behlendorf <[email protected]> * Fix sync behavior for disk vdevs Prior to b39c22b, which was first generally available in the 0.6.5 release as b39c22b, ZoL never actually submitted synchronous read or write requests to the Linux block layer. This means the vdev_disk_dio_is_sync() function had always returned false and, therefore, the completion in dio_request_t.dr_comp was never actually used. In b39c22b, synchronous ZIO operations were translated to synchronous BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and aa159af fixed several problems introduced by b39c22b. In particular, 5592404 introduced the new flag parameter "wait" to __vdev_disk_physio() but under ZoL, since vdev_disk_physio() is never actually used, the wait flag was always zero so the new code had no effect other than to cause a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af. The original rationale for introducing synchronous operations in b39c22b was to hurry certains requests through the BIO layer which would have otherwise been subject to its unplug timer which would increase the latency. This behavior of the unplug timer, however, went away during the transition of the plug/unplug system between kernels 2.6.32 and 2.6.39. To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior. For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and ise used for the same purpose. Signed-off-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4858 * Limit the amount of dnode metadata in the ARC Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. Observe arc_mata_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_limit as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on th…

nedbass force-pushed the large_dnode_feat branch 2 times, most recently from 6e1ec51 to c5e0c50 Compare June 30, 2015 20:48

nedbass changed the title ~~UNSAFE: Large dnode pool feature~~ EXPERIMENTAL: Large dnode pool feature Jun 30, 2015

nedbass force-pushed the large_dnode_feat branch 2 times, most recently from 37868f0 to 71acdca Compare July 7, 2015 18:32

nedbass force-pushed the large_dnode_feat branch from 71acdca to 0da0e79 Compare July 9, 2015 18:48

behlendorf added this to the 0.7.0 milestone Sep 21, 2015

behlendorf added the Type: Feature Feature request or new feature label Sep 21, 2015

behlendorf added the OpenZFS Review label Oct 20, 2015

ahrens reviewed Oct 23, 2015
View reviewed changes

nedbass force-pushed the large_dnode_feat branch from 0da0e79 to 50edf53 Compare December 17, 2015 02:38

nedbass mentioned this pull request Mar 26, 2016

Backfilling metadnode degrades object create rates #4460

Closed

adilger reviewed May 13, 2016
View reviewed changes

nedbass force-pushed the large_dnode_feat branch from 975a1a8 to 1130c24 Compare June 3, 2016 21:14

nedbass force-pushed the large_dnode_feat branch from 1130c24 to ea35843 Compare June 3, 2016 22:06

nedbass force-pushed the large_dnode_feat branch from ea35843 to 8346411 Compare June 4, 2016 09:54

nedbass force-pushed the large_dnode_feat branch from 8346411 to 40a8bf7 Compare June 23, 2016 02:56

nedbass force-pushed the large_dnode_feat branch 4 times, most recently from 28378bc to 1f1e19f Compare June 23, 2016 14:54

nedbass added 2 commits June 23, 2016 16:50

nedbass force-pushed the large_dnode_feat branch from 1f1e19f to 1a02d65 Compare June 23, 2016 17:07

nedbass force-pushed the large_dnode_feat branch from 1a02d65 to d8bb5f6 Compare June 23, 2016 20:21

behlendorf changed the title ~~WIP: Large dnode pool feature~~ Large dnode pool feature Jun 24, 2016

behlendorf closed this in 50c957f Jun 24, 2016

nedbass mentioned this pull request Jun 24, 2016

Backfill metadnode more intelligently #4711

Closed

lundman mentioned this pull request Jul 6, 2016

dnode_sync panic openzfsonosx/zfs#519

Open

behlendorf mentioned this pull request May 17, 2017

8265 Reserve send stream flag for large dnode feature openzfs/openzfs#381

Closed

jgottula mentioned this pull request Sep 29, 2017

Receive may skip objects in FREEOBJECTS record #6694

Closed

rrevans mentioned this pull request Dec 28, 2023

dnode_is_dirty: use dn_dirty_txg to check dirtiness #15615

Draft

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large dnode pool feature #3542

Large dnode pool feature #3542

nedbass commented Jun 29, 2015 •

edited

Loading

adilger commented Aug 21, 2015

jxiong commented Sep 4, 2015

behlendorf commented Sep 9, 2015

jxiong commented Sep 9, 2015

nedbass commented Sep 13, 2015

nedbass commented Sep 14, 2015

nedbass commented Sep 16, 2015

jxiong commented Sep 16, 2015

behlendorf commented Oct 20, 2015

ahrens Oct 23, 2015

ahrens commented Oct 23, 2015

ahrens commented Oct 24, 2015

adilger commented Oct 24, 2015

behlendorf commented Oct 26, 2015

ahrens commented Oct 26, 2015

behlendorf commented Oct 27, 2015

nedbass commented Mar 25, 2016

bzzz77 commented Mar 28, 2016

adilger May 13, 2016

nedbass Jun 3, 2016

nedbass commented Jun 3, 2016 •

edited

Loading

adilger commented Jun 4, 2016

nedbass commented Jun 4, 2016

nedbass commented Jun 23, 2016

behlendorf commented Jun 24, 2016

		@@ -1929,6 +1930,18 @@ dmu_object_size_from_db(dmu_buf_t db_fake, uint32_t blksize,
		}

Large dnode pool feature #3542

Large dnode pool feature #3542

Conversation

nedbass commented Jun 29, 2015 • edited Loading

adilger commented Aug 21, 2015

jxiong commented Sep 4, 2015

behlendorf commented Sep 9, 2015

jxiong commented Sep 9, 2015

nedbass commented Sep 13, 2015

nedbass commented Sep 14, 2015

nedbass commented Sep 16, 2015

jxiong commented Sep 16, 2015

behlendorf commented Oct 20, 2015

ahrens Oct 23, 2015

Choose a reason for hiding this comment

ahrens commented Oct 23, 2015

ahrens commented Oct 24, 2015

adilger commented Oct 24, 2015

behlendorf commented Oct 26, 2015

ahrens commented Oct 26, 2015

behlendorf commented Oct 27, 2015

nedbass commented Mar 25, 2016

bzzz77 commented Mar 28, 2016

adilger May 13, 2016

Choose a reason for hiding this comment

nedbass Jun 3, 2016

Choose a reason for hiding this comment

nedbass commented Jun 3, 2016 • edited Loading

adilger commented Jun 4, 2016

nedbass commented Jun 4, 2016

nedbass commented Jun 23, 2016

behlendorf commented Jun 24, 2016

nedbass commented Jun 29, 2015 •

edited

Loading

nedbass commented Jun 3, 2016 •

edited

Loading