[RFC] SIMD implementation of vdev_raidz generate and reconstruct routines #4328

Closed

ironMann
Contributor

@ironMann ironMann commented Feb 11, 2016

based on top of @zfsonlinux:master

This is a new implementation of the RAIDZ1/2/3 routines using x86_64 scalar, SSE, and AVX2 instruction sets. Included are 3 parity generation routines (P, PQ, and PQR) and 7 reconstruction routines, covering all RAIDZ levels. On module load, a quick benchmark of the supported routines selects the fastest for each operation, and those are used at runtime. The original implementation is still present and can be selected via a module parameter.

Patch contains:

  • specialized gen/rec routines for all RAIDZ levels,
  • new scalar raidz implementation (unrolled for speed),
  • two x86_64 SIMD implementations (SSE and AVX2 instruction sets),
  • fastest routines selected on module load (benchmark),
  • cmd/raidz_test tool for math exactness verification, code coverage and benchmark
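
The benchmark-based selection can be sketched roughly as follows. This is a minimal userspace illustration and not the code from the patch: `raidz_impl_t`, `gen_p_bytewise`, `gen_p_wordwise`, and `pick_fastest` are hypothetical names, only P parity is timed, and `clock()` stands in for whatever timer the kernel module uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical descriptor for one parity implementation. */
typedef struct raidz_impl {
    const char *name;
    int (*is_supported)(void);
    void (*gen_p)(uint8_t *p, const uint8_t *d, size_t ncols, size_t len);
} raidz_impl_t;

static int always_supported(void) { return 1; }

/* P parity: XOR of all data columns, one byte at a time. */
static void gen_p_bytewise(uint8_t *p, const uint8_t *d, size_t ncols, size_t len)
{
    memset(p, 0, len);
    for (size_t c = 0; c < ncols; c++)
        for (size_t i = 0; i < len; i++)
            p[i] ^= d[c * len + i];
}

/* Same result, 8 bytes at a time (len must be a multiple of 8). */
static void gen_p_wordwise(uint8_t *p, const uint8_t *d, size_t ncols, size_t len)
{
    memset(p, 0, len);
    for (size_t c = 0; c < ncols; c++) {
        for (size_t i = 0; i < len; i += 8) {
            uint64_t a, b;
            memcpy(&a, p + i, 8);
            memcpy(&b, d + c * len + i, 8);
            a ^= b;
            memcpy(p + i, &a, 8);
        }
    }
}

static const raidz_impl_t impls[] = {
    { "bytewise", always_supported, gen_p_bytewise },
    { "wordwise", always_supported, gen_p_wordwise },
};
#define N_IMPLS (sizeof(impls) / sizeof(impls[0]))

/* Time every supported implementation on the same buffer, keep the quickest. */
static const raidz_impl_t *pick_fastest(const raidz_impl_t *tbl, size_t n)
{
    enum { NCOLS = 8, LEN = 4096, ROUNDS = 200 };
    static uint8_t data[NCOLS * LEN], parity[LEN];
    const raidz_impl_t *best = NULL;
    clock_t best_time = 0;

    for (size_t i = 0; i < n; i++) {
        if (!tbl[i].is_supported())
            continue;
        clock_t start = clock();
        for (int r = 0; r < ROUNDS; r++)
            tbl[i].gen_p(parity, data, NCOLS, LEN);
        clock_t elapsed = clock() - start;
        if (best == NULL || elapsed < best_time) {
            best = &tbl[i];
            best_time = elapsed;
        }
    }
    return best;
}
```

Calling `pick_fastest(impls, N_IMPLS)` once at startup and caching the returned pointer mirrors the "benchmark on module load" idea; which entry wins depends on the machine and compiler.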

New zfs module parameter:

- zfs_vdev_raidz_impl (string, can be changed online): selects the implementation to use
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "cycle" - cycle through all available impl for each new vdev_raidz (default in userspace tests)
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

@behlendorf
Contributor

Nice work. It would be great to see this kind of optimization added to ZFS. In order to make that happen it would be great if you (@ironMann) and @rdolbeau, who has an implementation in #3374, could review each other's work. Then come up with a proposed patch set you're both happy with, which we can review and start rigorously validating for correctness and performance.

Part of what's prevented this functionality from being added is the need for some kind of torture test to verify it. That's functionality which should be added to ztest. Maybe something like having it randomly change the raidz implementation every few seconds. Or even a dedicated test which verifies that all implementations generate the same parity.
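
A dedicated "all implementations generate the same parity" check could look something like the sketch below. It is an assumption-laden illustration, not ztest code: the function names are hypothetical, only P and Q are covered, and Q is computed over GF(2^8) with the 0x11d polynomial by Horner's rule. The "wide" variant processes eight bytes per 64-bit word and stands in for a SIMD implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multiply one GF(2^8) element by 2, reducing by the polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
}

/* Multiply eight packed GF(2^8) elements by 2 in one 64-bit word. */
static uint64_t gf_mul2_x8(uint64_t x)
{
    uint64_t mask = x & 0x8080808080808080ULL;   /* bytes with high bit set */
    mask = (mask << 1) - (mask >> 7);            /* expand each to 0xff */
    return ((x << 1) & 0xfefefefefefefefeULL) ^ (mask & 0x1d1d1d1d1d1d1d1dULL);
}

/* Reference: P is the XOR of the columns, Q is a Horner evaluation at 2. */
static void gen_pq_ref(uint8_t *p, uint8_t *q, const uint8_t *d,
    size_t ncols, size_t len)
{
    memset(p, 0, len);
    memset(q, 0, len);
    for (size_t c = 0; c < ncols; c++) {
        for (size_t i = 0; i < len; i++) {
            p[i] ^= d[c * len + i];
            q[i] = (uint8_t)(gf_mul2(q[i]) ^ d[c * len + i]);
        }
    }
}

/* Candidate: same math, eight bytes at a time (len multiple of 8). */
static void gen_pq_wide(uint8_t *p, uint8_t *q, const uint8_t *d,
    size_t ncols, size_t len)
{
    memset(p, 0, len);
    memset(q, 0, len);
    for (size_t c = 0; c < ncols; c++) {
        for (size_t i = 0; i < len; i += 8) {
            uint64_t pw, qw, dw;
            memcpy(&pw, p + i, 8);
            memcpy(&qw, q + i, 8);
            memcpy(&dw, d + c * len + i, 8);
            pw ^= dw;
            qw = gf_mul2_x8(qw) ^ dw;
            memcpy(p + i, &pw, 8);
            memcpy(q + i, &qw, 8);
        }
    }
}

/* Returns 1 when both implementations produce identical P and Q. */
static int parity_impls_agree(void)
{
    enum { NCOLS = 5, LEN = 64 };
    uint8_t d[NCOLS * LEN], p1[LEN], q1[LEN], p2[LEN], q2[LEN];
    uint32_t seed = 12345;

    for (size_t i = 0; i < sizeof(d); i++) {
        seed = seed * 1103515245u + 12345u;   /* deterministic filler data */
        d[i] = (uint8_t)(seed >> 16);
    }
    gen_pq_ref(p1, q1, d, NCOLS, LEN);
    gen_pq_wide(p2, q2, d, NCOLS, LEN);
    return memcmp(p1, p2, LEN) == 0 && memcmp(q1, q2, LEN) == 0;
}
```

A real torture test would also vary the column count and block size, and would cover R parity and the reconstruction paths.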

@ironMann
Contributor Author

I was not aware of previous efforts. I took a look at @rdolbeau's work. Basically, the two implementations differ only in minor details. He has more hand-inlined assembly, whereas I use smaller asm macros to achieve practically the same thing (both unroll 4x). I opted for this because it lets me have generic algorithm functions for all methods (vdev_raidz_math_impl.h and vdev_raidz_math_x86simd.h) that use asm macros for the different instruction sets (SSE and AVX2). I chose not to do an AVX128 version because it lacks the shuffle instructions needed for the reconstruction methods, and also because even the SSE variant approaches the memory throughput of a single CPU core. Lastly, I used non-temporal prefetch and streaming store instructions so that the CPU caches are not trashed by the parity operations.
These are the differences I noticed, but performance-wise, I bet they are pretty close.

Regarding testing, I have an external test tool that verifies bit exactness for all the methods I wrote. I'll try to port it into ztest. I also left the 'old' implementation in, so switching between them at runtime is not a problem.
Suggestions and comments are welcome.

@rdolbeau
Contributor

The 'good' branch for my code is abd2, since at the time the abd changes were meant to be merged. I don't know what the status is now; I haven't had time to follow ZFS development closely.

abd2 also has simple benchmarking, AVX-512 (deactivated, as I've yet to be able to test it on actual hardware) & ARMv8/NEON. However, I don't have reconstruction and I didn't add explicit prefetch since I wasn't sure whether there was any data reuse between the parity code and other parts of ZFS.

I agree that AVX-128 is a bit redundant with SSE, since the kernel is probably not affected by the SSE/AVX transition (all SSE/AVX usage has to be explicit).

@ironMann ironMann force-pushed the wip-simd branch 2 times, most recently from 01bce6e to 7e8471f Compare February 15, 2016 20:55
@sempervictus
Contributor

I think the ABD changes are still meant to be merged, and @tuxoko keeps improving the stack anyway, but the fact that it's still out of tree is kind of a problem for those of us building test systems with physical hardware for things like this (lots of rebuilds).

We still have a production host with SSE running work derived from the abd2 branch, though I believe it's from @tomgarcia's fork of @rdolbeau's work.

If there are advantages to this approach, whether in maintainability or performance, it would be nice to see work along this vein merged in sooner rather than later, as it's esoteric enough to become difficult to merge after any significant divergence upstream.

Thank you @ironMann and @rdolbeau for building out these low level solutions.

@ironMann ironMann force-pushed the wip-simd branch 2 times, most recently from 6773297 to 2e741d3 Compare February 18, 2016 15:02
@ironMann
Contributor Author

@behlendorf: I've added some code to use ztest for testing. The default behaviour for userspace is now to cycle through all supported implementations on each new zio. This should help with math stress testing. I'm running `ztest -r 11 -R 3 -a 0 -T 3600`. Maybe ztest should corrupt data more frequently?
@rdolbeau: wow, AVX-512, that was preemptive 👍 I also agree with you. In any case, each instruction set will be in a separate compilation unit, so tweaking it later for maximal performance should not be an issue.
@sempervictus: I'll bootstrap a couple of VMs to lighten the load on the test servers. It's surprisingly hard to keep up with all the different platforms, distros, kernels, and compilers. As for performance, here is a kernel-mode benchmark (including FPU costs) from our test server: results

@rdolbeau
Contributor

2016-02-18 17:31 GMT+01:00 Gvozden Neskovic:

> @rdolbeau: wow, AVX-512, that was preemptive

With the masking and scatter/gather and CDI, AVX-512 seems to be "SIMD done
right at last" :-) There's two versions in the code, without (KNL) or with
(SKX) AVX512BW.

But they apparently forgot some instructions - there's no vpxorb in
AVX512BW. Someone must have thought that since there was just the one
variant in SSE and AVX, they didn't need more in AVX-512 since it's
bit-wise... completely forgetting about the masking :-( So we can't easily
do XOR-with-masking from a mask register. Too bad.

Cordially,

Romain Dolbeau

@behlendorf
Contributor

@ironMann I know this is still a WIP but I made a few comments inline. It looks like this is moving in a good direction, thanks for working on it. I'm happy to make another pass perhaps once it's passing the buildbot.

> Maybe ztest should corrupt data more frequently?

For testing purposes you could definitely increase the frequency of ztest_fault_inject() in ztest.c.

@behlendorf
Contributor

To aid in the automated testing of these optimizations, an Intel Xeon E5-2676v3 (Haswell) has been added to the TEST builders. It supports the following CPU flags.

model name  : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
              clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl
              xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic
              movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor
              lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt

@ironMann
Contributor Author

@behlendorf @sempervictus I adapted the code to the ABD patch (in #4439), but in the process I had to reimplement all of the vector code to be able to consume HighMem pages efficiently. From this point on, I would prefer to concentrate on a single branch for further improvements. Basically, the vector code is more or less stable on both branches, and now I want to work more on testing, verification, and glue code for all implementations.
Any preference on which branch to proceed with?

@behlendorf It seems that 2 zfstests on the SIMD machine are consistently failing:

Test: /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_scrub_002_pos (run as root) [00:01] [FAIL]
Test: /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_scrub_003_pos (run as root) [00:02] [FAIL]

@behlendorf
Contributor

@ironMann I'd recommend working on top of @tuxoko's work. There's definitely ongoing work to get those changes into a state which can be maintained long term, but the intention is to move in that direction.

I suspect this is caused by a racy test case. It looks like the scrub is either already finished by the time the test tries to stop it, or maybe it hasn't started yet. Either way, I've seen this occasionally on other test instances and it's unrelated to these changes. I'll open a new issue so we can track and fix the tests.

From the log:

Test: /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_scrub_002_pos (run as root) [00:01] [FAIL]
18:37:37.51 ASSERTION: Verify scrub -s works correctly.
18:37:38.55 SUCCESS: /sbin/zpool scrub testpool.1405
18:37:38.55 cannot cancel scrubbing testpool.1405: there is no active scrub
18:37:38.55 ERROR: /sbin/zpool scrub -s testpool.1405 exited 1
Test: /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_scrub_003_pos (run as root) [00:02] [FAIL]
18:37:38.56 ASSERTION: scrub command terminates the existing scrub process and starts a new scrub.
18:37:39.81 SUCCESS: /sbin/zpool scrub testpool.1405
18:37:40.96 SUCCESS: /sbin/zpool scrub testpool.1405
18:37:40.96 zpool scrub don't stop existing scrubbing process.

@ironMann ironMann changed the title [WIP][comments] SIMD implementation of vdev_raidz generate and reconstruct routines [RFC] SIMD implementation of vdev_raidz generate and reconstruct routines Apr 26, 2016
@ofaaland
Contributor

@ironMann thanks for the explanation. Do you allow those options for testing purposes? Or are they there so the user can choose the implementation, for cases when vdev_raidz_math_init() chooses an implementation that is problematic?

@ironMann
Contributor Author

@ofaaland Well, initially it was for testing, but then it morphed into a 'freedom of choice' kind of thing. I'm still not sure if the 'cycle' option should be given to users (it is only useful for unit and coverage testing). Maybe it should be removed in non-debug builds.

In theory, vdev_raidz_math_init() could choose a slower method for a bunch of reasons, but that is highly unlikely. It would not, however, choose a problematic implementation, in the sense of one that does not work correctly.

@behlendorf
Contributor

@ironMann having 'cycle' available for testing has been helpful to me to build confidence in the patch. But now that most of the testing is done I wouldn't object to removing it as an option from the kernel module. We'll of course want to leave it in user space for ztest though.

Thus far I haven't observed any problems with the patch. I'll leave my testing running over the weekend and if I don't encounter any problems I'll sign off on this patch so it can be merged. That just leaves the very minor issue of removing 'cycle' as mentioned above.

@ofaaland
Contributor

ofaaland commented Jun 17, 2016

@ironMann, feel free to reject it, but how about this text for the manpage?

Parameter for selecting raidz implementation to use.

Options marked (always) below may be selected on module load as they are
supported on all systems.

The remaining options may only be set after the module is loaded, as they
are available only if the implementations are compiled in and supported
on the running system.

Once the module is loaded, the content of
/sys/module/zfs/parameters/zfs_vdev_raidz_impl will show available options
with the currently selected one enclosed in [].

Possible options are:
  fastest  - (always) implementation selected using built-in benchmark
  original - (always) original raidz implementation
  scalar   - (always) scalar raidz implementation
  sse      - implementation using SSE instruction set (64bit x86 only)
  avx2     - implementation using AVX2 instruction set (64bit x86 only)
.sp
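
For illustration, the bracketed listing described in this text could be produced along these lines. This is a hedged userspace sketch: `format_impl_list` and `show_impls` are hypothetical helpers, not functions from the patch, and the name table simply assumes a system where all five options are available.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed set of options on a system that supports every implementation. */
static const char *const impl_names[] = {
    "fastest", "original", "scalar", "sse", "avx2"
};
#define N_NAMES (sizeof(impl_names) / sizeof(impl_names[0]))

/*
 * Write the names into buf separated by spaces, with the selected one
 * enclosed in []. Returns 0 on success, -1 if buf is too small.
 */
static int format_impl_list(char *buf, size_t len, const char *const *names,
    size_t n, size_t selected)
{
    size_t off = 0;

    if (len == 0)
        return -1;
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int w;
        if (i == selected)
            w = snprintf(buf + off, len - off, "[%s] ", names[i]);
        else
            w = snprintf(buf + off, len - off, "%s ", names[i]);
        if (w < 0 || (size_t)w >= len - off)
            return -1;
        off += (size_t)w;
    }
    if (off > 0)
        buf[off - 1] = '\0';   /* drop the trailing space */
    return 0;
}

/* Convenience wrapper around a static buffer, for demonstration only. */
static const char *show_impls(size_t selected)
{
    static char buf[128];

    if (format_impl_list(buf, sizeof(buf), impl_names, N_NAMES, selected) != 0)
        return "";
    return buf;
}
```

With the first entry selected, `show_impls(0)` yields `[fastest] original scalar sse avx2`.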

@ironMann
Contributor Author

@ofaaland thanks, that's clearer.

@behlendorf In addition, I would also skip the benchmark in userspace and use the highest supported impl as the 'fastest'. This avoids the startup delay of tools that call kernel_init().

@behlendorf
Contributor

@ironMann the kernel_init() optimization is a nice idea even if it should really only impact ztest. It also looks like vdev_zaps_005_pos is failing on all the testers so that's going to need to be explained.

- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations against original

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On module load, the
parameter will only accept first 3 options, and other implementations can be set once
module is finished loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

see contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to get list of supported
values. If an implementation is not supported on the system, it will not be shown. Currently
selected option is enclosed in `[]`.

Added raidz_test to the ZFS Test Suite
raidz sweep test is running for 300s
Each configuration runs in a separate thread (one running thread per CPU core)
@rdolbeau
Contributor

2016-06-18 12:53 GMT+02:00 Gvozden Neskovic:

> @behlendorf In addition, I would also skip benchmark in userspace, and
> use the highest supported impl as the 'fastest'. This avoids startup delay
> of tools that call kernel_init()?

Which is the fastest might be obvious on X86-64 (it's likely
AVX512>AVX2>SSE>scalar), but for other architectures it's not necessarily
the case. There's a lot of e.g. Aarch64 cores out there, and it's possible
a wide-issue scalar core would outperform a 64-bit-wide NEON pipe in some
cases. And the NEON itself might need to be tuned differently for different
cores.

Cordially,

Romain Dolbeau

@ironMann
Contributor Author

@rdolbeau You are right, but there's currently no real use-case for having 'real fastest' impl in userspace tools. These apps cycle through all supported impl. for better test coverage. Real benchmark is still performed on 'zfs' module load, but removing it in userspace saves a lot of time on startup of ztest and zdb.
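
The userspace shortcut (skip the benchmark and take the highest supported implementation) can be sketched like this. It is only an illustration: `impl_entry_t` and `pick_highest` are hypothetical, and the table order encodes the assumption that later entries are faster, which, as noted above, is not guaranteed outside x86-64.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: a name plus a "supported on this CPU" flag. */
typedef struct impl_entry {
    const char *name;
    int supported;
} impl_entry_t;

/* Example tables, ordered from assumed-slowest to assumed-fastest. */
static const impl_entry_t with_avx2[] = {
    { "scalar", 1 }, { "sse", 1 }, { "avx2", 1 }
};
static const impl_entry_t without_avx2[] = {
    { "scalar", 1 }, { "sse", 1 }, { "avx2", 0 }
};

/* Walk the table backwards and return the last supported entry, or NULL. */
static const impl_entry_t *pick_highest(const impl_entry_t *tbl, size_t n)
{
    for (size_t i = n; i-- > 0; ) {
        if (tbl[i].supported)
            return &tbl[i];
    }
    return NULL;
}
```

This costs a handful of comparisons at startup instead of a timed run, which is the trade-off being proposed for tools such as ztest and zdb.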

@behlendorf vdev_zaps_* seem to lack proper setup. If I run them standalone on my dev VM 005 fails, but they all pass in the whole suite. I've messed with the balance by adding default_cleanup in /raidz/raidz_*. I've tried adding default_setup_noexit ${DISKS%% *} to zaps/setup.ksh but that caused zaps_001 to fail, and all others to pass... Sorry, I don't have any more time now to debug this further.

@behlendorf
Contributor

@ironMann thanks for looking at the vdev_zaps_* failures. It's clear these failures are unrelated to this PR and the test cases themselves need to be improved. Given that they're transient and seem to be infrequent I'm OK with tackling them as their own issue. So they don't need to hold this up.

The testing I've done over the last week on real hardware has been as abusive as I could make it, and it was designed to cover as much of the testing space as possible: raidz[123] geometries, 4k-16M block sizes, 1-20 device-per-raidz groups, full device rebuilds, all new algorithms checked. I wasn't able to uncover any problems with the patch.

This patch LGTM. Nice job! Let me know if you're happy with this as a final version so it can be merged. We can always tweak the administrative aspects of it a little if needed after it's merged. But the core of it looks completely solid.

@ironMann
Contributor Author

@behlendorf Thanks, I think the patch is in OK shape, too. The core raidz framework should be generic enough to support the addition of new instruction set implementations with minimal effort (also 32-bit SSE/AVX2 variants if relevant).

@behlendorf
Contributor

behlendorf commented Jun 21, 2016

Merged to master as:

ab9f4b0 SIMD implementation of vdev_raidz generate and reconstruct routines

@ironMann thanks again for implementing this and building a solid generic framework we can extend. @rdolbeau you should be all set to extend this for NEON.

@thegreatgazoo

I'm testing master branch today. A few comments/questions on this patch:

# cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
[fastest] original scalar sse avx2

So I'm using the fastest option, but which one is that? As an end-user, I'd want to know which implementation is actually in use, but "fastest" doesn't give me that information.

# ls -l /sys/module/zfs/parameters/zfs_vdev_raidz_impl
-rw-r--r-- 1 root root 4096 Jul 12 17:15 /sys/module/zfs/parameters/zfs_vdev_raidz_impl
# echo original > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
-bash: echo: write error: Invalid argument
# echo "original" > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
-bash: echo: write error: Invalid argument

The file is writable so I guess I'd be able to change it at run-time, but it ended up with write errors. Did I do something wrong here?

\fBzfs_vdev_raidz_impl\fR (string)
.ad
.RS 12n
Parameter for selecting raidz implementation to use.


I'd rather call it raidz parity implementation, just to be clear.

@ironMann
Contributor Author

@thegreatgazoo The fastest impl. is envisioned to be found at runtime. So there's 10 raidz parity routines in total, and in theory the fastest can pick up function pointers from all supported implementations, if that is preferable. But in practice, it's probably going to use the widest SIMD implementation your CPU supports.

> The file is writable so I guess I'd be able to change it at run-time, but it ended up with write errors. Did I do something wrong here?

That's news. Can you try with `printf "original" > /sys/module/zfs/parameters/zfs_vdev_raidz_impl`?
It might be that echo adds a newline.

@behlendorf
Contributor

> It might be that echo adds a newline.

This is exactly what's happening, we'll want to trim the trailing white space. I definitely tested this so I'm not sure how I missed it.
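
A minimal sketch of the trimming fix, assuming the setter receives the raw string written to the parameter file. `parse_impl` and the name table are hypothetical and only illustrate the idea; they are not the code that was eventually merged.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static const char *const impl_names[] = {
    "fastest", "original", "scalar", "sse", "avx2"
};
#define N_NAMES (sizeof(impl_names) / sizeof(impl_names[0]))

/*
 * Return the index of the named implementation, or -1 if it is unknown.
 * Trailing whitespace, such as the newline that echo appends, is ignored.
 */
static int parse_impl(const char *val)
{
    size_t len = strlen(val);

    while (len > 0 && isspace((unsigned char)val[len - 1]))
        len--;

    for (size_t i = 0; i < N_NAMES; i++) {
        if (strlen(impl_names[i]) == len &&
            strncmp(val, impl_names[i], len) == 0)
            return (int)i;
    }
    return -1;
}
```

With this, `"original"` and `"original\n"` both resolve to the same entry, so a plain `echo` and `echo -n` behave identically.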

@ironMann
Contributor Author

@behlendorf It seems that echo is inconsistent with that. I'll fold these fixes into #4815

@rdolbeau
Contributor

rdolbeau commented Jul 12, 2016

2016-07-12 20:05 GMT+02:00 Gvozden Neskovic:

> But in practice, it's probably going to use the widest SIMD
> implementation your CPU supports.

... on x86-64. On Aarch64 (#4801), it depends on the core. Some have fast, wide
NEON support... some don't :-(

Cordially,

Romain Dolbeau

@thegreatgazoo

@ironMann Yes newline was the problem, echo -n worked. Since mere echo works with setting other parameters under /sys/module/zfs/parameters/, I'd think it should be fixed. BTW, I saw in the code:

module_param_call(zfs_vdev_raidz_impl, zfs_vdev_raidz_impl_set,
        zfs_vdev_raidz_impl_get, NULL, 0644);

The zfs_vdev_raidz_impl_get() grabs vdev_raidz_impl_lock, but zfs_vdev_raidz_impl_set() does not - that doesn't seem right to me.

@thegreatgazoo

@ironMann Thanks for the explanation of [fastest]. My real question is this: as the admin, I want to know exactly what [fastest] is, but /sys/module/zfs/parameters/zfs_vdev_raidz_impl doesn't tell me that. Another scenario would be bug reporting: if a user suspects or hits a bug in the parity routine, we'd at least need to know what [fastest] points to, preferably without patching and rebooting, since technically [fastest] is dynamic.

@ironMann
Contributor Author

@thegreatgazoo That locking is indeed a bug. The thing is, that method can be called before the lock is even initialized (when the module parameter is specified), and also through the API from userspace. Will fix.

That's a legitimate point about [fastest] option. Currently there's a kstat for measured throughput of all methods, and contents of [fastest] can be deduced from that, but there should exist a better way. I can extend kstat data to explicitly show what instruction set is used.
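
One way to make the setter and getter take the same lock, even when the setter runs before any init code, is a statically initialized lock. The sketch below shows the idea in userspace with a pthread mutex; it is an assumption about the shape of a fix, not the actual one, and `impl_set`, `impl_get`, and `impl_is` are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Statically initialized, so it is usable before any init function runs. */
static pthread_mutex_t impl_lock = PTHREAD_MUTEX_INITIALIZER;
static char impl_selected[16] = "fastest";

/* Store a new selection under the lock. Returns -1 if the name is too long. */
static int impl_set(const char *name)
{
    if (strlen(name) >= sizeof(impl_selected))
        return -1;
    pthread_mutex_lock(&impl_lock);
    strcpy(impl_selected, name);
    pthread_mutex_unlock(&impl_lock);
    return 0;
}

/* Copy the current selection out under the same lock. */
static int impl_get(char *buf, size_t len)
{
    pthread_mutex_lock(&impl_lock);
    int n = snprintf(buf, len, "%s", impl_selected);
    pthread_mutex_unlock(&impl_lock);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Demonstration helper: does the current selection equal name? */
static int impl_is(const char *name)
{
    char buf[sizeof(impl_selected)];

    if (impl_get(buf, sizeof(buf)) != 0)
        return 0;
    return strcmp(buf, name) == 0;
}
```

Because both paths go through `impl_lock`, a reader can never observe a half-written name, and there is no window in which the lock exists but is uninitialized.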

GeLiXin added a commit to GeLiXin/zfs that referenced this pull request Aug 1, 2016
* Consistently use parsable instead of parseable

This is a purely cosmetic change, to consistently prefer one of
two (both acceptable) choices for the word parsable in documentation and
code. I don't really care which to use, but according to wiktionary
https://en.wiktionary.org/wiki/parsable#English parsable is preferred.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4682

* Add missing RPM BuildRequires

Both libudev and libattr are recommended build requirements.  As
such their development headers should be listed in the rpm spec file
so those dependencies are pulled in when building rpm packages.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4676

* Skip ctldir znode in zfs_rezget to fix snapdir issues

Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This
will cause funny behaviour for the mounted snapdirs. Especially for
Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone
from automounting it again as long as someone is still using the detached mount.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4514
Closes #4661
Closes #4672

* Improve zfs-module-parameters(5)

Various rewrites to the descriptions of module parameters. Corrects
spelling mistakes, makes the descriptions more user-friendly and
describes some ZFS quirks which should be understood before changing
parameter values.

Signed-off-by: DHE <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4671

* Fix arc_prune_task use-after-free

arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent
the underlying zsb from disappearing if there's a concurrent umount. We fix
this by forcing the caller of arc_remove_prune_callback to wait for
arc_prune_taskq to finish.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4687
Closes #4690

* Add request size histograms (-r) to zpool iostat, minor man page fix

Add -r option to "zpool iostat" to print request size histograms for the leaf
ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs
("agg"). These stats can be useful for seeing how well the ZFS IO aggregator
is working.

$ zpool iostat -r
mypool        sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0    530      0      0      0
1K              0      0    260      0      0      0    116    246      0      0
2K              0      0      0      0      0      0      0    431      0      0
4K              0      0      0      0      0      0      3    107      0      0
8K             15      0     35      0      0      0      0      6      0      0
16K             0      0      0      0      0      0      0     39      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K            20      0     40      0      0      0      0      0      0      0
128K            0      0     20      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0    155     19      0      0
8M              0      0      0      0      0      0      0    811      0      0
16M             0      0      0      0      0      0      0     68      0      0
--------------------------------------------------------------------------------

Also rename the stray "-G" in the man page to be "-w" for latency histograms.

Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Closes #4659

* OpenZFS 6531 - Provide mechanism to artificially limit disk performance

Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
  to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files

* Systemd configuration fixes

* Disable zfs-import-scan.service by default.  This ensures that
pools will not be automatically imported unless they appear in
the cache file.  When this service is explicitly enabled pools
will be imported with the "cachefile=none" property set.  This
prevents the creation of, or update to, an existing cache file.

    $ systemctl list-unit-files | grep zfs
    zfs-import-cache.service                  enabled
    zfs-import-scan.service                   disabled
    zfs-mount.service                         enabled
    zfs-share.service                         enabled
    zfs-zed.service                           enabled
    zfs.target                                enabled

* Change services to dynamic from static by adding an [Install]
section and adding 'WantedBy' tags in favor of 'Requires' tags.
This allows for easier customization of the boot behavior.

* Start the zfs-import-cache.service after the root pivot so
the cache file is available in the standard location.

* Start the zfs-mount.service after the systemd-remount-fs.service
to ensure the root fs is writeable and the ZFS filesystems can
create their mount points.

* Change the default behavior to only load the ZFS kernel modules
in zfs-import-*.service or when blkid(8) detects a pool.  Users
who wish to unconditionally load the kernel modules must uncomment
the list of modules in /lib/modules-load.d/zfs.conf.

Reviewed-by: Richard Laager <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4325
Closes #4496
Closes #4658
Closes #4699

* Fix self-healing IO prior to dsl_pool_init() completion

Async writes triggered by a self-healing IO may be issued before the
pool finishes the process of initialization.  This results in a NULL
dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes().

George Wilson recommended addressing this issue by initializing the
passed `dsl_pool_t **` prior to dmu_objset_open_impl().  Since the
caller is passing the `spa->spa_dsl_pool` this has the effect of
ensuring it's initialized.

However, since this depends on the caller knowing they must pass
the `spa->spa_dsl_pool` an additional NULL check was added to
vdev_queue_max_async_writes().  This guards against any future
restructuring of the code which might result in dsl_pool_init()
being called differently.

Signed-off-by: GeLiXin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4652

* Add isa_defs for MIPS

GCC for MIPS only defines _LP64 when 64bit,
while no _ILP32 defined when 32bit.

Signed-off-by: YunQiang Su <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4712

* Fix out-of-bound access in zfs_fillpage

The original code will do an out-of-bound access on pl[] during last
iteration.

 ==================================================================
 BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs]
 Read of size 8 by task tmpfile/7850
 page:ffffea00017c6dc0 count:0 mapcount:0 mapping:          (null) index:0x0
 flags: 0xffff8000000000()
 page dumped because: kasan: bad access detected
 CPU: 3 PID: 7850 Comm: tmpfile Tainted: G           OE   4.6.0+ #3
  ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618
  ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8
  ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3
 Call Trace:
  [<ffffffff81635618>] dump_stack+0x63/0x8b
  [<ffffffff81313ee8>] kasan_report_error+0x528/0x560
  [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0
  [<ffffffff813144b8>] kasan_report+0x58/0x60
  [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffff81312e4e>] __asan_load8+0x5e/0x70
  [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs]

  [<ffffffff81353c3a>] SyS_execve+0x3a/0x50
  [<ffffffff810058ef>] do_syscall_64+0xef/0x180
  [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25
 Memory state around the buggy address:
  ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4
                                                                 ^
  ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ==================================================================

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4705
Issue #4708

* Fix memleak in zpl_parse_options

strsep() will advance tmp_mntopts, and will change it to NULL on last
iteration.  This will cause strfree(tmp_mntopts) to not free anything.

unreferenced object 0xffff8800883976c0 (size 64):
  comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s)
  hex dump (first 32 bytes):
    72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a  rw.strictatime.z
    66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d  fsutil.mntpoint=
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811f9cac>] __kmalloc+0x16c/0x250
    [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl]
    [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs]
    [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs]
    [<ffffffff81222dc8>] mount_fs+0x38/0x160
    [<ffffffff81240097>] vfs_kern_mount+0x67/0x110
    [<ffffffff812428e0>] do_mount+0x250/0xe20
    [<ffffffff812437d5>] SyS_mount+0x95/0xe0
    [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4706
Issue #4708

* Fix memleak in vdev_config_generate_stats

fnvlist_add_nvlist will copy the contents of nvx, so we need to
free it here.

unreferenced object 0xffff8800a6934e80 (size 64):
  comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s)
  hex dump (first 32 bytes):
    60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff  `..s.....|.s....
    00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff  ........@.p.....
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310
    [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl]
    [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl]
    [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
    [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair]
    [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair]
    [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair]
    [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs]
    [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs]
    [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs]
    [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs]
    [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs]
    [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs]
    [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0
    [<ffffffff812333b9>] SyS_ioctl+0x79/0x90

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4707
Issue #4708

* Linux 4.7 compat: handler->set() takes both dentry and inode

Counterpart to fd4c7b7, the same approach was taken to resolve
the compatibility issue.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4717 
Issue #4665

* Implementation of AVX2 optimized Fletcher-4

New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.

New zcommon module parameters:
-  zfs_fletcher_4_impl (str): selects the implementation to use.
    "fastest" - use the fastest version available
    "cycle"   - cycle through all available impl for ztest
    "scalar"  - use the original version
    "avx2"    - new AVX2 implementation if available

Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar:  4216 MB/s
- AVX2:   14499 MB/s

See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
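
For reference, the scalar computation that the AVX2 routine speeds up
can be sketched as below. This is a minimal userspace sketch, assuming
native-endian 32-bit input words and four 64-bit running sums; the
type and function names are illustrative and are not the ones used in
the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four 64-bit running sums, as in a Fletcher-4 checksum. */
typedef struct fletcher4_sums {
	uint64_t a, b, c, d;
} fletcher4_sums_t;

/*
 * Scalar Fletcher-4 over native-endian 32-bit words.
 * size is in bytes and must be a multiple of 4.
 */
static fletcher4_sums_t
fletcher4_scalar(const uint32_t *words, size_t size)
{
	fletcher4_sums_t s = { 0, 0, 0, 0 };
	size_t n = size / sizeof (uint32_t);

	assert(size % sizeof (uint32_t) == 0);

	for (size_t i = 0; i < n; i++) {
		s.a += words[i];	/* plain sum of the words */
		s.b += s.a;		/* sum of the running a values */
		s.c += s.b;
		s.d += s.c;
	}
	return (s);
}
```

Each sum depends on the previous one within an iteration, which is why
a vectorized version has to process several independent streams and
recombine them at the end.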

Signed-off-by: Jinshan Xiong <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4330

* Fix cstyle.pl warnings

As of perl v5.22.1 the following warnings are generated:

* Redundant argument in printf at scripts/cstyle.pl line 194

* Unescaped left brace in regex is deprecated, passed through
  in regex; marked by <-- HERE in m/\S{ <-- HERE / at
  scripts/cstyle.pl line 608.

They have been addressed by escaping the left braces and by
providing the correct number of arguments to printf based on
the fmt specifier set by the verbose option.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4723

* Fix minor spelling mistakes

Trivial spelling mistake fix in error message text.

* Fix spelling mistake "adminstrator" -> "administrator"
* Fix spelling mistake "specificed" -> "specified"
* Fix spelling mistake "interperted" -> "interpreted"

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4728

* Add `zfs allow` and `zfs unallow` support

ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands.  In addition, non-
privileged users should be able to run all of the following commands:

  * zpool [list | iostat | status | get]
  * zfs [list | get]

Historically this functionality was not available on Linux.  In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability.  Only then could the permissions on
`/dev/zfs` be relaxed and the internal ZFS permission checks used.

Even with this change some limitations remain.  Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace).  This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code.  It
may be possible to add this functionality in the future.

This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite.  These tests exhaustively verify each
of the supported permissions which can be delegated and ensure only
an authorized user can perform it.

Two minor bug fixes were required for test-running.py.  First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it.  Second,
when running as a normal user, also check for scripts using
both the .ksh and .sh suffixes.

Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #362 
Closes #434 
Closes #4100
Closes #4394 
Closes #4410 
Closes #4487

* Remove libzfs_graph.c

The libzfs_graph.c source file should have been removed in 330d06f;
it is entirely unused.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4766

* Linux 4.6 compat: Fall back to d_prune_aliases() if necessary

As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliases() method
in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes: #4726

* SIMD implementation of vdev_raidz generate and reconstruct routines

This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ levels. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.

Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instruction sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
  module load, the parameter will only accept the first 3 options, and
  the other implementations can be set once the module has finished
  loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
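
The parity math behind the P and Q routines can be sketched one byte
at a time. This is a simplified illustration, assuming the usual
RAID-6 style field GF(2^8) with polynomial 0x11d and generator 2; the
function names are hypothetical, and the routines in this patch work
on wider words, are unrolled, and use SIMD registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply by 2 in GF(2^8) with the polynomial 0x11d. */
static uint8_t
gf_mul2(uint8_t x)
{
	return ((uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0)));
}

/*
 * Byte-at-a-time P and Q parity over ncols data columns of len bytes.
 * P is the XOR of all columns; Q is accumulated with Horner's rule,
 * so column 0 ends up multiplied by the highest power of 2.
 */
static void
gen_pq_scalar(const uint8_t *const *cols, size_t ncols, size_t len,
    uint8_t *p, uint8_t *q)
{
	assert(ncols > 0);

	for (size_t i = 0; i < len; i++) {
		uint8_t pv = 0, qv = 0;

		for (size_t c = 0; c < ncols; c++) {
			pv ^= cols[c][i];
			qv = gf_mul2(qv) ^ cols[c][i];
		}
		p[i] = pv;
		q[i] = qv;
	}
}
```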

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4328

* Fix NFS credential

Commit f74b821 caused a regression where a file created through NFS is
always owned by root. This is because the patch enabled the KSID
code in zfs_acl_ids_create, which uses the euid and egid of the current
process. However, on Linux, we should use fsuid and fsgid for file operations,
which is the original behaviour. So we revert this part of the code.

The patch also enabled secpolicy_vnode_*; since they are also used in file
operations, we change them to use fsuid and fsgid.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4772
Closes #4758

* OpenZFS 6513 - partially filled holes lose birth time

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Boris Protopopov <[email protected]>
Approved by: Richard Lowe <[email protected]>
Ported by: Boris Protopopov <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6513
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0

If a ZFS object contains a hole at level one, and then a data block is
created at level 0 underneath that l1 block, l0 holes will be created.
However, these l0 holes do not have the birth time property set; as a
result, incremental sends will not send those holes.

Fix is to modify the dbuf_read code to fill in birth time data.

* Add a test case for dmu_free_long_range() to ztest

Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4754

* Revert "Add a test case for dmu_free_long_range() to ztest"

This reverts commit d0de2e82df579f4e4edf5643b674a1464fae485f which
introduced a new test case to ztest which is failing occasionally
during automated testing.  The change is being reverted until
the issue can be fully investigated.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4754

* OpenZFS 6878 - Add scrub completion info to "zpool history"

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Approved by: Dan McDonald <[email protected]>
Authored by: Nav Ravindranath <[email protected]>
Ported-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6878
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5
Closes #4787

* FreeBSD rS271776 - Persist vdev_resilver_txg changes

Persist vdev_resilver_txg changes to avoid panic caused by validation
vs a vdev_resilver_txg value from a previous resilver.

Authored-by: smh <[email protected]>
Ported-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/5154
FreeBSD-issue: https://reviews.freebsd.org/rS271776
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf
Closes #4790

* xattrtest: allow verify with -R and other improvements

- Use a fixed buffer of random bytes when random xattr values are in
  effect.  This eliminates the potential performance bottleneck of
  reading from /dev/urandom for each file. This also allows us to
  verify xattrs in random value mode.

- Show the rate of operations per second in addition to elapsed time
  for each phase of the test. This may be useful for benchmarking.

- Set default xattr size to 6 so that verify doesn't fail if user
  doesn't specify a size. We need at least six bytes to store the
  leading "size=X" string that is used for verification.

- Allow user to execute just one phase of the test. Acceptable
  values for -o and their meanings are:

   1 - run the create phase
   2 - run the setxattr phase
   3 - run the getxattr phase
   4 - run the unlink phase

Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

* Backfill metadnode more intelligently

Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates.  As summarized by
@mahrens in #4636:

"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects).  When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.

The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."

Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limiting the rescan to at most once per TXG.
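
The two conditions can be sketched as a small gate function. This is a
hypothetical model only: the struct, its fields, and should_rescan()
are invented for illustration and do not mirror how the patch tracks
this state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-objset allocator bookkeeping. */
typedef struct alloc_state {
	uint64_t freed_since_rescan;	/* objects freed since last rescan */
	uint64_t last_rescan_txg;	/* txg of the last rescan */
} alloc_state_t;

/*
 * Allow a backfill rescan only if enough objects were freed and no
 * rescan has happened yet in this transaction group.
 */
static bool
should_rescan(alloc_state_t *st, uint64_t txg, uint64_t threshold)
{
	assert(st != NULL);

	if (st->freed_since_rescan < threshold)
		return (false);
	if (st->last_rescan_txg == txg)
		return (false);

	st->freed_since_rescan = 0;
	st->last_rescan_txg = txg;
	return (true);
}
```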

Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

* Implement large_dnode pool feature

Justification
-------------

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks.  Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided.  Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.  Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
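
The relationship between the two conventions can be sketched as
follows. The macro and function names here are illustrative
assumptions, not identifiers from the patch; only the 512 byte slot
size and 16384 byte dnode block size come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	DNODE_SLOT_BYTES	512	/* one dnode_phys_t slot */
#define	DNODE_BLOCK_BYTES	16384	/* one dnode block */

/*
 * Total slots consumed by a dnode of the given size, as kept in the
 * in-memory dn_num_slots. Returns 0 for an illegal size.
 */
static unsigned int
size_to_num_slots(size_t size)
{
	if (size == 0 || size > DNODE_BLOCK_BYTES ||
	    size % DNODE_SLOT_BYTES != 0)
		return (0);
	return ((unsigned int)(size / DNODE_SLOT_BYTES));
}

/* On-disk convention: slots beyond the first, so 512 bytes maps to 0. */
static uint8_t
num_slots_to_extra(unsigned int num_slots)
{
	assert(num_slots >= 1 &&
	    num_slots <= DNODE_BLOCK_BYTES / DNODE_SLOT_BYTES);
	return ((uint8_t)(num_slots - 1));
}
```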

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The slots parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by a large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.
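
The slot-skipping walk from the last item can be sketched with a toy
model of a dnode block. slot_t is a hypothetical stand-in for the
header of a dnode_phys_t, and count_dnodes() is not a function from
the patch; the point is only that the index advances by the dnode's
slot count rather than by one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the start of a dnode_phys_t slot. */
typedef struct slot {
	uint8_t in_use;		/* nonzero if a dnode starts here */
	uint8_t extra_slots;	/* slots consumed beyond the first */
} slot_t;

/*
 * Count the dnodes in a block of nslots slots. Interior slots of a
 * large dnode hold bonus data, so they are stepped over instead of
 * being inspected as if they were dnode headers.
 */
static int
count_dnodes(const slot_t *blk, int nslots)
{
	int count = 0;
	int i = 0;

	assert(blk != NULL);

	while (i < nslots) {
		if (!blk[i].in_use) {
			i++;			/* hole: advance one slot */
			continue;
		}
		count++;
		i += 1 + blk[i].extra_slots;	/* skip the bonus area */
	}
	return (count);
}
```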

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number.  This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.

ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
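
The packing can be sketched as below. The helper names and macros are
assumptions made for this example and may not match the accessors
added by the patch; only the layout (object id in the low 48 bits,
slot count in the top 8 bits) is taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define	FOID_OBJ_BITS		48
#define	FOID_OBJ_MASK		((UINT64_C(1) << FOID_OBJ_BITS) - 1)
#define	FOID_SLOTS_SHIFT	56

/* Combine an object id and a slot count into one 64-bit field. */
static uint64_t
foid_pack(uint64_t obj, uint8_t slots)
{
	assert((obj & ~FOID_OBJ_MASK) == 0);	/* id must fit in 48 bits */
	return (obj | ((uint64_t)slots << FOID_SLOTS_SHIFT));
}

/* Extract the object id from the low 48 bits. */
static uint64_t
foid_obj(uint64_t foid)
{
	return (foid & FOID_OBJ_MASK);
}

/* Extract the slot count from the uppermost 8 bits. */
static uint8_t
foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> FOID_SLOTS_SHIFT));
}
```

A record written with a slot count of 0 is bit-for-bit identical to
one written without the extra field, so in this sketch older log
records decode to the same object id.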

Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3542

* Sync DMU_BACKUP_FEATURE_* flags

Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING.  The
DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and
then reserved in the upstream OpenZFS implementation.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes #4795

* OpenZFS 2605, 6980, 6902

2605 want to resume interrupted zfs send
Reviewed by: George Wilson <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Xin Li <[email protected]>
Reviewed by: Arne Jansen <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: kernelOfTruth <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12

6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f

Porting notes:
- All rsend and snapshot tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
  'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
  this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
  rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.

* OpenZFS 6051 - lzc_receive: allow the caller to read the begin record

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6051
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322

* OpenZFS 6393 - zfs receive a full send as a clone

Authored by: Paul Dagnelie <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6394
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e

* OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS

Authored by: Andrew Stormont <[email protected]>
Reviewed by: Anil Vijarnia <[email protected]>
Reviewed by: Kim Shrier <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6536
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b

* OpenZFS 6738 - zfs send stream padding needs documentation

Authored by: Eli Rosenthal <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Reviewed by: Dan McDonald <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6738
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff

* OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota

Authored by: Dan McDonald <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Gordon Ross <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/4986
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad

* OpenZFS 6562 - Refquota on receive doesn't account for overage

Authored by: Dan McDonald <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Reviewed by: Toomas Soome <[email protected]>
Approved by: Gordon Ross <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6562
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6

* Implement zfs_ioc_recv_new() for OpenZFS 2605

Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy
ZFS_IOC_RECV user/kernel interface.  The new interface supports all
stream options but is currently only used for resumable streams.
This way updated user space utilities will interoperate with older
kernel modules.

ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW
handler.  Non-Linux OpenZFS platforms have opted to change the
legacy interface in an incompatible fashion instead of adding a
new ioctl.

Signed-off-by: Brian Behlendorf <[email protected]>

* OpenZFS 6314 - buffer overflow in dsl_dataset_name

Reviewed by: George Wilson <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Igor Kozhukhov <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6314
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee

* OpenZFS 6876 - Stack corruption after importing a pool with a too-long name

Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking
for trouble. We should check every dataset on import, using a 1024 byte
buffer and checking each time to see if the dataset's new name is longer
than 256 bytes.

OpenZFS-issue: https://www.illumos.org/issues/6876
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e

* Vectorized fletcher_4 must be 128-bit aligned

The fletcher_4_native() and fletcher_4_byteswap() functions may only
safely use the vectorized implementations when the buffer is 128-bit
aligned.  This is because both the AVX2 and SSE implementations process
four 32-bit words per iteration.  Fall back to the scalar implementation
which only processes a single 32-bit word for unaligned buffers.
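
A dispatch check of this kind could look like the sketch below. It is
an assumption that both the buffer address and its length are tested
against 16 bytes; the actual patch may test only one of them, and
fletcher4_can_vectorize() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	VECTOR_ALIGN	16	/* 128 bits = four 32-bit words */

/*
 * Return true if the vectorized path may be used for this buffer;
 * otherwise the caller should fall back to the scalar routine.
 */
static bool
fletcher4_can_vectorize(const void *buf, size_t size)
{
	assert(buf != NULL);
	return (((uintptr_t)buf % VECTOR_ALIGN) == 0 &&
	    (size % VECTOR_ALIGN) == 0);
}
```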

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Issue #4330

* Allow building with `CFLAGS="-O0"`

If compiled with -O0, gcc doesn't do any stack frame coalescing
and -Wframe-larger-than=1024 is triggered in debug mode.
Starting with gcc 4.8, new opt level -Og is introduced for debugging, which
does not trigger this warning.

Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4799

* Don't allow accessing XATTR via export handle

Allowing access to XATTR through an export handle is a very bad idea. It
would allow users to write whatever they want in fields where they
otherwise could not.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4828

* Fix get_zfs_sb race with concurrent umount

Certain ioctl operations will call get_zfs_sb, which will hold an active
count on sb without checking whether it's active or not. This can result
in a use-after-free. We fix this by using atomic_inc_not_zero to make sure
we got an active sb.

P1                                          P2
---                                         ---
deactivate_locked_super(): s_active = 0
                                            zfs_sb_hold()
                                            ->get_zfs_sb(): s_active = 1
->zpl_kill_sb()
-->zpl_put_super()
--->zfs_umount()
---->zfs_sb_free(zsb)
                                            zfs_sb_rele(zsb)
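
The "increment only if not zero" idea can be illustrated in userspace with
C11 atomics. This is a sketch only: `struct sb_sketch` and
`sb_hold_if_active()` are hypothetical stand-ins, not the kernel's
`atomic_inc_not_zero()` or the actual get_zfs_sb() change.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a superblock with an s_active counter. */
struct sb_sketch {
	atomic_int s_active;
};

/*
 * Take a reference only while the count is still non-zero.  Once
 * the count has dropped to 0 (P1 in the diagram above), the hold
 * fails instead of resurrecting the count from 0 to 1.
 */
static bool
sb_hold_if_active(struct sb_sketch *sb)
{
	int old;

	assert(sb != NULL);
	old = atomic_load(&sb->s_active);
	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&sb->s_active, &old,
		    old + 1))
			return (true);
	}
	return (false);
}
```

With this pattern, P2 observes s_active == 0 and backs off rather than
using a superblock that P1 is about to free.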

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

* Fix Large kmem_alloc in vdev_metaslab_init

This allocation can go way over 1MB, so we should use vmem_alloc
instead of kmem_alloc.

  Large kmem_alloc(1430784, 0x1000), please file an issue...
  Call Trace:
   [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl]
   [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs]
   [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs]
   [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs]
   [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs]
   [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs]
   [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs]
   [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs]
   [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0
   [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4752

* Add configure result for xattr_handler

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4828

* fh_to_dentry should return ESTALE when generation mismatch

A generation mismatch usually means the file pointed to by the file handle
was deleted. We should return ESTALE to indicate this. We return ENOENT in
zfs_vget since zpl_fh_to_dentry will convert it to ESTALE.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4828

* xattr dir doesn't get purged during iput

We need to set inode->i_nlink to zero so iput will purge it. Without this, it
will get purged during shrink cache or umount, which would likely result in
deadlock due to zfs_zget waiting forever on its children which are in the
dispose_list of the same thread.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Kill zp->z_xattr_parent to prevent pinning

zp->z_xattr_parent will pin the parent. This causes a serious problem
when unlinking a file with xattrs. Because the unlinked file is pinned,
it is never purged immediately, and as a result the xattr objects are
never marked as unlinked. All of the unlinked objects therefore remain
until shrink cache or umount.

This change partially reverts e89260a.  This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Fix RAIDZ_TEST tests

Remove stray trailing } which prevented the raidz stress tests from
running in-tree.

Signed-off-by: Brian Behlendorf <[email protected]>

* Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z

The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.

Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
  db_blkptr = NULL and it's dirtied.

Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
  xattr fits into the bonus buffer, so it's removed. The dbuf is
  undirtied in this txg, but it's still referenced and cannot be
  destroyed.

Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
  is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
  (not done yet).
* A new change makes the spill buffer necessary again.
  sa_build_layouts() ends up calling dbuf_find() to locate the
  dbuf.  It finds the old dbuf because it has not been destroyed yet
  (it will be destroyed when the previous write is done and there
  are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
  referenced, so it's not destroyed.

Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
  directly copied into the dnode, overwriting the blkptr area because,
  in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
  gets corrupted.

Signed-off-by: Peng <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3937

* Fix handling of errors nvlist in zfs_ioc_recv_new()

zfs_ioc_recv_impl() is changed to always allocate the 'errors'
nvlist, its callers are responsible for freeing it.

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4829

* Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.

The patch covers low-end and older x86 CPUs.  Parity generation is
equivalent to SSSE3 implementation, but reconstruction is somewhat
slower.  Previous 'sse' implementation is renamed to 'ssse3' to
indicate highest instruction set used.

Benchmark results:
scalar_rec_p                    4    720476442
scalar_rec_q                    4    187462804
scalar_rec_r                    4    138996096
scalar_rec_pq                   4    140834951
scalar_rec_pr                   4    129332035
scalar_rec_qr                   4    81619194
scalar_rec_pqr                  4    53376668

sse2_rec_p                      4    2427757064
sse2_rec_q                      4    747120861
sse2_rec_r                      4    499871637
sse2_rec_pq                     4    522403710
sse2_rec_pr                     4    464632780
sse2_rec_qr                     4    319124434
sse2_rec_pqr                    4    205794190

ssse3_rec_p                     4    2519939444
ssse3_rec_q                     4    1003019289
ssse3_rec_r                     4    616428767
ssse3_rec_pq                    4    706326396
ssse3_rec_pr                    4    570493618
ssse3_rec_qr                    4    400185250
ssse3_rec_pqr                   4    377541245

original_rec_p                  4    691658568
original_rec_q                  4    195510948
original_rec_r                  4    26075538
original_rec_pq                 4    103087368
original_rec_pr                 4    15767058
original_rec_qr                 4    15513175
original_rec_pqr                4    10746357

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4783

* Enable zpool_upgrade test cases

Creating the pool in a striped rather than mirrored configuration
provides enough space for all upgrade tests to run.  Test case
zpool_upgrade_007_pos still fails and must be investigated so
it has been left disabled.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4852

* Prevent null dereferences when accessing dbuf kstat

In arc_buf_info(), the arc_buf_t may have no header.  If not, don't try
to fetch the arc buffer stats and instead just zero them.

The null dereferences were observed while accessing the dbuf kstat with
awk on a system in which millions of small files were being created in
order to overflow the system's metadata limit.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4837

* Fix dbuf_stats_hash_table_data race

Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf
can be freed at any time.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4846

* Use native inode->i_nlink instead of znode->z_links

A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's
64 bit on-disk link count.

We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a
more Linux-integrated fix for the same issue.

In addition, setting the initial link count on a new node has been changed
from setting one less than required in zfs_mknode() then incrementing to the
correct count in zfs_link_create() (which was somewhat bizarre in the first
place), to setting the correct count in zfs_mknode() and not incrementing it
in zfs_link_create(). This both means we no longer set the link count in
sa_bulk_update() twice (once for the initial incorrect count then again for
the correct count), as well as adhering to the Linux requirement of not
incrementing a zero link count without I_LINKABLE (see linux commit
f4e0c30c).

Signed-off-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4838
Issue #227

* Implementation of SSE optimized Fletcher-4

Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.

The module benchmark was also amended to analyze the performance of
the byteswapped version of Fletcher-4, as well as the non-byteswapped
version. The average performance of the two is used to select the
fastest implementation available on the host system.
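
The selection rule can be sketched as follows. The struct, field names,
and function name are hypothetical and chosen for illustration; only the
rule itself (average of the native and byteswap results) comes from the
description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-implementation benchmark result. */
struct fletcher_4_bench_sketch {
	const char *name;
	uint64_t native_bw;	/* native variant, bytes/s */
	uint64_t byteswap_bw;	/* byteswap variant, bytes/s */
};

/* Return the index of the entry with the highest average bandwidth. */
static size_t
fletcher_4_pick_fastest_sketch(const struct fletcher_4_bench_sketch *res,
    size_t n)
{
	size_t best = 0;
	uint64_t best_avg = 0;

	assert(res != NULL && n > 0);
	for (size_t i = 0; i < n; i++) {
		/* Halve each term first so the sum cannot overflow. */
		uint64_t avg = res[i].native_bw / 2 +
		    res[i].byteswap_bw / 2;
		if (avg > best_avg) {
			best_avg = avg;
			best = i;
		}
	}
	return (best);
}
```

An implementation that is fastest on native data but slow when
byteswapping can therefore lose to one that is more balanced.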

Adds a pair of fields to an existing zcommon module parameter:
-  zfs_fletcher_4_impl (str)
    "sse2"    - new SSE2 implementation if available
    "ssse3"   - new SSSE3 implementation if available

Signed-off-by: Tyler J. Stachecki <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4789

* Fix filesystem destroy with receive_resume_token

It is possible that the given DS may have hidden child (%recv)
datasets - "leftovers" resulting from the previously interrupted
'zfs receive'.  Try to remove the hidden child (%recv) and after
that try to remove the target dataset.   If the hidden child
(%recv) does not exist the original error (EEXIST) will be returned.

Signed-off-by: Roman Strashkin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4818

* Prevent segfaults in SSE optimized Fletcher-4

In some cases, the compiler was not respecting the GNU aligned
attribute for stack variables in 35a76a0. This was resulting in
a segfault on CentOS 6.7 hosts using gcc 4.4.7-17.  This issue
was fixed in gcc 4.6.

To prevent this from occurring, use unaligned loads and stores
for all stack and global memory references in the SSE optimized
Fletcher-4 code.

Disable zimport testing against master where this flaw exists:

TEST_ZIMPORT_VERSIONS="installed"

Signed-off-by: Tyler J. Stachecki <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4862

* Update arc_summary.py for prefetch changes

Commit 7f60329 removed several kstats which arc_summary.py read.
Remove these kstats from arc_summary.py in the same way this was
handled in FreeNAS.

FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4695

* Wait iput_async before evict_inodes to prevent race

Wait for iput_async before entering evict_inodes in
generic_shutdown_super. The reason we must finish before
evict_inodes is when lazytime is on, or when zfs_purgedir calls
zfs_zget, iput would bump i_count from 0 to 1. This would race
with the i_count check in evict_inodes.  This means it could
destroy the inode while we are still using it.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4854

* Fixes and enhancements of SIMD raidz parity

- Implementation lock replaced with atomic variable

- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`

- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813

- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392

- Minor fixes and cleanups

- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.
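
A minimal sketch of the NULL function pointer fallback described in the
last item, using hypothetical types and names (`raidz_ops_sketch`,
`raidz_gen_p_dispatch()`); the real ops structure carries the full set of
generate and reconstruct methods:

```c
#include <assert.h>
#include <stddef.h>

/* Which routine ran; used here only to make the sketch observable. */
enum raidz_ran_sketch { RAN_ORIGINAL, RAN_SIMD };

typedef enum raidz_ran_sketch (*raidz_gen_fn_sketch)(void);

/* Hypothetical ops table; a NULL entry means "use the original code". */
struct raidz_ops_sketch {
	const char *name;
	raidz_gen_fn_sketch gen_p;
};

static enum raidz_ran_sketch
original_gen_p_sketch(void)
{
	return (RAN_ORIGINAL);
}

static enum raidz_ran_sketch
simd_gen_p_sketch(void)
{
	return (RAN_SIMD);
}

/* Run the selected method, or the original one if it is missing. */
static enum raidz_ran_sketch
raidz_gen_p_dispatch(const struct raidz_ops_sketch *ops)
{
	assert(ops != NULL);
	if (ops->gen_p == NULL)
		return (original_gen_p_sketch());
	return (ops->gen_p());
}
```

This lets a [fastest] table mix new and original methods per operation
without every implementation having to provide every routine.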

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4813
Issue #1392

* RAIDZ parity kstat rework

Print table with speed of methods for each implementation.
Last line describes contents of [fastest] selection.

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4860

* Fix NULL pointer in zfs_preumount from 1d9b3bd

When zfs_domount fails, zsb will be freed, and its caller
mount_nodev/get_sb_nodev will call deactivate_locked_super, which calls
into zfs_preumount.

In order to make sure we don't touch any nonexistent stuff, we must make sure
s_fs_info is NULL in the fail path so zfs_preumount can easily check that.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4867
Issue #4854

* Illumos Crypto Port module added to enable native encryption in zfs

A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).

Signed-off-by: Tom Caputi <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4329

* Fix for compilation error when using the kernel's CONFIG_LOCKDEP

Signed-off-by: Tom Caputi <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4329

* zloop: print backtrace from core files

Find the core file by using `/proc/sys/kernel/core_pattern`

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4874

* Fix for metaslab_fastwrite_unmark() assert failure

Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.

Signed-off-by: Tom Caputi <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4267

* Remove znode's z_uid/z_gid member

Remove duplicate z_uid/z_gid member which are also held in the
generic vfs inode struct. This is done by first removing the members
from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID
macros to access the respective member from struct inode. In cases
where the uid/gids are being marshalled from/to disk, use the newly
introduced zfs_(uid|gid)_(read|write) functions to properly
save the uids rather than the internal kernel representation.

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4685
Issue #227

* Check whether the kernel supports i_uid/gid_read/write helpers

Since the concept of a kuid, and the need to translate from it to an
ordinary integer type, was added in kernel version 3.5, implement the
necessary plumbing to detect this condition at compile time. If
the kernel doesn't support the kuid, then just fall back to directly
accessing the respective struct inode's members.

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4685
Issue #227

* Fix uninitialized variable in avl_add()

Silence the following warning when compiling with gcc 5.4.0.
Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.

module/avl/avl.c: In function ‘avl_add’:
module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized
    in this function [-Wmaybe-uninitialized]
  avl_insert(tree, new_node, where);

Signed-off-by: Brian Behlendorf <[email protected]>

* Fix sync behavior for disk vdevs

Prior to b39c22b, which was first generally available in the 0.6.5
release, ZoL never actually submitted synchronous read or write
requests to the Linux block layer.  This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.

In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start().  The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b.  In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.

The original rationale for introducing synchronous operations in b39c22b
was to hurry certain requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency.  This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.

To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.

For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
is used for the same purpose.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4858

* Limit the amount of dnode metadata in the ARC

Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).

In order to help track metadata usage more precisely, the other_size
metadata arcstat has been replaced with dbuf_size, dnode_size and
bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes.  Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
it tries to unpin 10% of the dnodes.
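
The arithmetic behind the two tunables can be sketched as below. The
function names are hypothetical, and the sketch works purely in bytes; how
the actual ARC code turns this amount into prune requests is not shown
here.

```c
#include <assert.h>
#include <stdint.h>

/* Default dnode limit: 10% of the metadata limit, per the text above. */
static uint64_t
dnode_limit_default_sketch(uint64_t arc_meta_limit)
{
	return (arc_meta_limit / 10);
}

/*
 * Bytes of dnode space to try to prune: reduce_percent of the
 * amount by which dnode_size exceeds dnode_limit, or 0 if the
 * limit is not exceeded.
 */
static uint64_t
dnode_prune_bytes_sketch(uint64_t dnode_size, uint64_t dnode_limit,
    uint64_t reduce_percent)
{
	assert(reduce_percent <= 100);
	if (dnode_size <= dnode_limit)
		return (0);
	return ((dnode_size - dnode_limit) * reduce_percent / 100);
}
```

For example, with a limit of 300 units and 500 units in use, a 10%
setting asks for 20 units to be pruned in this sketch.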

The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):

    - Create a large number of small files until arc_meta_used exceeds
      arc_meta_limit (3GiB with default tuning) and arc_prune
      starts increasing.

    - Create a 3GiB file with dd.  Observe arc_meta_used.  It will still
      be around 3GiB.

    - Repeatedly read the 3GiB file and observe arc_meta_limit as before.
      It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made
available as subsequent demands on th…