[RFC] SIMD implementation of vdev_raidz generate and reconstruct routines #4328

ironMann · 2016-02-11T15:36:03Z

based on top of @zfsonlinux:master

This is a new implementation of RAIDZ1/2/3 routines using x86_64 scalar, SSE, and AVX2 instruction sets. Included are 3 parity generation routines (P, PQ, and PQR) and 7 reconstruction routines, for all RAIDZ levels. On module load, a quick benchmark of supported routines selects the fastest for each operation, and those are used at runtime. The original implementation is still present and can be selected via a module parameter.

Patch contains:

specialized gen/rec routines for all RAIDZ levels,
new scalar raidz implementation (unrolled for speed),
two x86_64 SIMD implementations (SSE and AVX2 instruction sets),
fastest routines selected on module load (benchmark).
cmd/raidz_test tool for math exactness verification, code coverage, and benchmarking.

New zfs module parameter:

- zfs_vdev_raidz_impl (string, can be changed online): selects the implementation to use
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "cycle" - cycle through all available impl for each new vdev_raidz (default in userspace tests)
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

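The per-operation selection described above can be pictured as a small table lookup: for each of the 10 routines (3 generation, 7 reconstruction), keep the supported implementation with the best measured throughput. The following is only a sketch under assumptions; `impl_result_t`, `select_fastest()`, and `N_OPS` are hypothetical names and are not taken from the patch, and the throughput numbers are supplied by the caller rather than measured.

```c
#include <stddef.h>

/* Hypothetical names throughout; not the identifiers used by the patch. */
#define	N_OPS	10	/* 3 generation + 7 reconstruction routines */

typedef struct impl_result {
	const char *name;		/* e.g. "scalar", "sse", "avx2" */
	int supported;			/* nonzero if usable on this CPU */
	unsigned long bw[N_OPS];	/* measured throughput per routine */
} impl_result_t;

/*
 * For each routine, store in best[op] the index of the supported
 * implementation with the highest measured throughput.
 * Returns 0 on success, -1 if no implementation is supported.
 */
int
select_fastest(const impl_result_t *res, size_t n, size_t best[N_OPS])
{
	int found = 0;

	for (size_t op = 0; op < N_OPS; op++) {
		unsigned long top = 0;
		int have = 0;

		for (size_t i = 0; i < n; i++) {
			if (!res[i].supported)
				continue;	/* never pick an unusable impl */
			if (!have || res[i].bw[op] > top) {
				top = res[i].bw[op];
				best[op] = i;
				have = 1;
			}
		}
		found |= have;
	}
	return (found ? 0 : -1);
}
```

Because the choice is made per routine, the result can mix implementations, for example one implementation for generation and another for a particular reconstruction case.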
behlendorf · 2016-02-11T21:11:43Z

Nice work. It would be great to see this kind of optimization added to ZFS. In order to make that happen it would be great if you (@ironMann) and @rdolbeau, who has an implementation in #3374, could review each other's work. Then come up with a proposed patch set you're both happy with to review and start rigorously validating for correctness and performance.

Part of what's prevented this functionality from being added is the need for some kind of torture test to verify it. That's functionality which should be added to ztest. Maybe something like having it randomly change the raidz implementation every few seconds. Or even a dedicated test which verifies that all implementations generate the same parity.
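A dedicated "all implementations generate the same parity" check could look roughly like the sketch below. It is not code from either patch: it assumes the usual RAID-6 style math (P is the XOR of the data columns, Q is accumulated Horner-style in GF(2^8) with the 0x11d polynomial) and compares a byte-at-a-time reference against a variant that processes eight bytes per step in a `uint64_t`. All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Multiply one byte by 2 in GF(2^8), polynomial 0x11d (assumed here). */
static uint8_t
gf_mul2(uint8_t b)
{
	return ((uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1d : 0)));
}

/* Reference: byte-at-a-time P (XOR) and Q (Horner in GF(2^8)) parity. */
void
gen_pq_ref(const uint8_t *const *cols, size_t ncols, size_t len,
    uint8_t *p, uint8_t *q)
{
	memset(p, 0, len);
	memset(q, 0, len);
	for (size_t c = 0; c < ncols; c++) {
		for (size_t i = 0; i < len; i++) {
			p[i] ^= cols[c][i];
			q[i] = gf_mul2(q[i]) ^ cols[c][i];
		}
	}
}

/* Variant: same math, eight byte lanes per step; len must be a multiple of 8. */
void
gen_pq_wide(const uint8_t *const *cols, size_t ncols, size_t len,
    uint8_t *p, uint8_t *q)
{
	for (size_t i = 0; i < len; i += 8) {
		uint64_t pw = 0, qw = 0, d, m;

		for (size_t c = 0; c < ncols; c++) {
			memcpy(&d, cols[c] + i, 8);
			pw ^= d;
			/* Build 0xff in every lane whose top bit is set. */
			m = qw & 0x8080808080808080ULL;
			m = (m << 1) - (m >> 7);
			/* Lane-wise multiply by 2, then add the column. */
			qw = ((qw << 1) & 0xfefefefefefefefeULL) ^
			    (m & 0x1d1d1d1d1d1d1d1dULL);
			qw ^= d;
		}
		memcpy(p + i, &pw, 8);
		memcpy(q + i, &qw, 8);
	}
}

/* Returns 1 if both variants agree, 0 if they differ, -1 on bad input. */
int
pq_impls_agree(const uint8_t *const *cols, size_t ncols, size_t len)
{
	if (len == 0 || len % 8 != 0)
		return (-1);

	uint8_t *buf = malloc(4 * len);
	if (buf == NULL)
		return (-1);

	uint8_t *p1 = buf, *q1 = buf + len;
	uint8_t *p2 = buf + 2 * len, *q2 = buf + 3 * len;

	gen_pq_ref(cols, ncols, len, p1, q1);
	gen_pq_wide(cols, ncols, len, p2, q2);

	int same = memcmp(p1, p2, len) == 0 && memcmp(q1, q2, len) == 0;
	free(buf);
	return (same);
}
```

A torture test would presumably call a check like this with random column counts, sizes, and contents, and extend the comparison to every registered implementation and to the reconstruction paths.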

ironMann · 2016-02-12T14:18:57Z

I was not aware of previous efforts. I took a look at @rdolbeau's work. Basically, the two implementations differ only in minor details. He has more hand-inlined assembly, whereas I use smaller asm macros to achieve practically the same result (both unroll 4x). I opted for this because it lets me have generic algorithm functions for all methods (vdev_raidz_math_impl.h and vdev_raidz_math_x86simd.h) that use asm macros for different instruction sets (SSE and AVX2). I chose not to do an AVX128 version because it lacks the shuffle instructions needed for reconstruction methods, and also because even the SSE variant approaches the memory throughput of a single CPU core. Lastly, I used non-temporal prefetch and streaming store instructions so that CPU caches are not trashed by the parity operations.
These are the differences I noticed, but performance wise, I bet they are pretty close.

Regarding testing, I have an external test tool that verifies bit exactness for all methods I wrote. I'll try to port it into ztest. I also left the 'old' implementation in, so switching in runtime between them is not a problem.
Suggestions and comments are welcome.

rdolbeau · 2016-02-12T14:45:04Z

The 'good' branch for my code is abd2, since at the time the abd changes were meant to be merged. I don't know what the status is now, I haven't had time to follow ZFS development closely.

abd2 also has simple benchmarking, AVX-512 (deactivated, as I've yet to be able to test it on actual hardware) & ARMv8/NEON. However, I don't have reconstruction and I didn't add explicit prefetch since I wasn't sure whether there was any data reuse between the parity code and other parts of ZFS.

I agree that AVX-128 is a bit redundant with SSE, since the kernel is probably not affected by the SSE/AVX transition (all SSE/AVX usage has to be explicit).

sempervictus · 2016-02-16T04:53:42Z

I think the ABD changes are still meant to be merged, @tuxoko keeps improving the stack anyway, but it is kind of a problem for those of us building test systems with physical hardware for things like this (lots of rebuilds) that it's still out of tree.

We still have a production host with SSE running the abd2 branch derived work, though I believe it's from @tomgarcia's fork of @rdolbeau's work.

If there are advantages to this approach, whether in maintainability or performance, it would be nice to see work along this vein merged in sooner rather than later, as it's esoteric enough to become difficult to merge after any significant divergence upstream.

Thank you @ironMann and @rdolbeau for building out these low level solutions.

ironMann · 2016-02-18T16:31:28Z

@behlendorf: I've added some code to use ztest for testing. Default behaviour for userspace now is to cycle through all supported implementations on each new zio. This should help with math stress testing. I'm running ztest -r 11 -R 3 -a 0 -T 3600. Maybe ztest should corrupt data more frequently?
@rdolbeau: wow, AVX-512, that was preemptive 👍 I also agree with you. In any case, each inst set will be in a separate compilation unit, so tweaking it later for maximal performance should not be an issue.
@sempervictus: I'll bootstrap a couple of VMs to lighten the load on test servers. It's surprisingly not easy to keep up with all the different platforms, distros, kernels, and compilers. As for performance, here is a kernel-mode benchmark (including fpu costs) from our test server: results

rdolbeau · 2016-02-18T16:54:37Z

2016-02-18 17:31 GMT+01:00 Gvozden Neskovic [email protected]:

@rdolbeau: wow, AVX-512, that was preemptive

With the masking and scatter/gather and CDI, AVX-512 seems to be "SIMD done
right at last" :-) There's two versions in the code, without (KNL) or with
(SKX) AVX512BW.

But they apparently forgot some instructions - there's no vpxorb in
AVX512BW. Someone must have thought that since there was just the one
variant in SSE and AVX, they didn't need more in AVX-512 since it's
bit-wise... completely forgetting about the masking :-( So we can't easily
do XOR-with-masking from a mask register. Too bad.

Cordially,

Romain Dolbeau

behlendorf · 2016-02-19T21:42:39Z

@ironMann I know this is still a WIP but I made a few comments inline. It looks like this is moving in a good direction, thanks for working on it. I'm happy to make another pass perhaps once it's passing the buildbot.

Maybe ztest should corrupt data more frequently?

For testing purposes you could definitely increase the frequency of ztest_fault_inject() in ztest.c.

behlendorf · 2016-03-21T17:32:17Z

To aid in the automated testing of these optimizations an Intel Xeon E5-2676v3 (Haswell) has been added to the TEST builders. It supports the following cpu flags.

model name  : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
              clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl
              xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic
              movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor
              lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt

ironMann · 2016-03-24T20:08:18Z

@behlendorf @sempervictus I adapted the code to the ABD patch (in #4439), but in the process, I had to reimplement all of the vector code to be able to consume HighMem pages efficiently. From this point on, I would prefer to concentrate on a single branch for further improvements. Basically the vector code is more or less stable on both branches, and now I want to work more on testing, verification and glue code for all implementations.
Any preference on which branch to proceed?

@behlendorf It seems that 2 zfstests on the SIMD machine are consistently failing:

Test: /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_scrub_002_pos (run as root) [00:01] [FAIL]
Test: /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_scrub_003_pos (run as root) [00:02] [FAIL]

behlendorf · 2016-03-24T20:36:46Z

@ironMann I'd recommend working on top of @tuxoko's work. There's definitely ongoing work to get those changes into a state which can be maintained long term, but the intention is to move in that direction.

I suspect this is caused by a racy test case. It looks like the scrub is either already finished by the time the test tries to stop it, or maybe it hasn't started yet. Either way, I've seen this occasionally on other test instances and it's unrelated to these changes. I'll open a new issue so we can track and fix the tests.

From the log:

Test: /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_scrub_002_pos (run as root) [00:01] [FAIL]
18:37:37.51 ASSERTION: Verify scrub -s works correctly.
18:37:38.55 SUCCESS: /sbin/zpool scrub testpool.1405
18:37:38.55 cannot cancel scrubbing testpool.1405: there is no active scrub
18:37:38.55 ERROR: /sbin/zpool scrub -s testpool.1405 exited 1
Test: /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_scrub/zpool_scrub_003_pos (run as root) [00:02] [FAIL]
18:37:38.56 ASSERTION: scrub command terminates the existing scrub process and starts a new scrub.
18:37:39.81 SUCCESS: /sbin/zpool scrub testpool.1405
18:37:40.96 SUCCESS: /sbin/zpool scrub testpool.1405
18:37:40.96 zpool scrub don't stop existing scrubbing process.

ofaaland · 2016-06-13T21:13:12Z

@ironMann thanks for the explanation. Do you allow those options for testing purposes? Or are they there so the user can choose the implementation, for cases when vdev_raidz_math_init() chooses an implementation that is problematic?

ironMann · 2016-06-16T08:41:43Z

@ofaaland Well, initially it was for testing, but then it morphed to 'freedom of choice' kind of thing. I'm still not sure if 'cycle' option should be given to users (it is only useful for unit and coverage testing). Maybe it should be removed in non-debug builds.

In theory, vdev_raidz_math_init() could choose a slower method for a number of reasons, but that is highly unlikely. It would not, however, choose a problematic implementation, in the sense of one that does not work correctly.

behlendorf · 2016-06-16T21:56:38Z

@ironMann having 'cycle' available for testing has been helpful to me to build confidence in the patch. But now that most of the testing is done I wouldn't object to removing it as an option from the kernel module. We'll of course want to leave it in user space for ztest though.

Thus far I haven't observed any problems with the patch. I'll leave my testing running over the weekend and if I don't encounter any problems I'll sign off on this patch so it can be merged. That just leaves the very minor issue of removing 'cycle' as mentioned above.

ofaaland · 2016-06-17T05:09:32Z

@ironMann, feel free to reject it, but how about this text for the manpage?

Parameter for selecting raidz implementation to use.

Options marked (always) below may be selected on module load as they are
supported on all systems.

The remaining options may only be set after the module is loaded, as they
are available only if the implementations are compiled in and supported
on the running system.

Once the module is loaded, the content of
/sys/module/zfs/parameters/zfs_vdev_raidz_impl will show available options
with the currently selected one enclosed in [].

Possible options are:
  fastest  - (always) implementation selected using built-in benchmark
  original - (always) original raidz implementation
  scalar   - (always) scalar raidz implementation
  sse      - implementation using SSE instruction set (64bit x86 only)
  avx2     - implementation using AVX2 instruction set (64bit x86 only)
.sp
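For illustration, the bracketed listing described in the proposed text could be produced along these lines. This is a hedged sketch, not the module's actual getter; `impl_show()` and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write the names separated by single spaces into buf, enclosing the
 * selected one in []. Returns the number of characters written (not
 * counting the terminating NUL), or -1 if buf is too small.
 */
int
impl_show(char *buf, size_t size, const char *const *names, size_t n,
    size_t selected)
{
	size_t off = 0;

	if (size == 0)
		return (-1);
	buf[0] = '\0';

	for (size_t i = 0; i < n; i++) {
		int sel = (i == selected);
		int w = snprintf(buf + off, size - off, "%s%s%s%s",
		    i ? " " : "", sel ? "[" : "", names[i], sel ? "]" : "");

		/* snprintf reports the length it wanted; detect truncation. */
		if (w < 0 || (size_t)w >= size - off)
			return (-1);
		off += (size_t)w;
	}
	return ((int)off);
}
```

Only implementations that are compiled in and supported on the running system would be passed in `names`, which matches the "not supported means not listed" behaviour described above.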

ironMann · 2016-06-18T10:53:16Z

@ofaaland thanks, it's more clear.

@behlendorf In addition, I would also skip benchmark in userspace, and use the highest supported impl as the 'fastest'. This avoids startup delay of tools that call kernel_init()?

behlendorf · 2016-06-18T19:44:50Z

@ironMann the kernel_init() optimization is a nice idea even if it should really only impact ztest. It also looks like vdev_zaps_005_pos is failing on all the testers so that's going to need to be explained.

- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instruction sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations against original

New zfs module parameters:

- zfs_vdev_raidz_impl (str): selects the implementation to use. On module load, the parameter will only accept the first 3 options; the other implementations can be set once the module has finished loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See the contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to get the list of supported values. If an implementation is not supported on the system, it will not be shown. The currently selected option is enclosed in `[]`.

Added raidz_test to the ZFS Test Suite:

- raidz sweep test runs for 300s
- each configuration runs in a separate thread (one running thread per CPU core)

rdolbeau · 2016-06-19T07:02:42Z

2016-06-18 12:53 GMT+02:00 Gvozden Neskovic [email protected]:

@behlendorf In addition, I would also skip benchmark in userspace, and
use the highest supported impl as the 'fastest'. This avoids startup delay
of tools that call kernel_init()?

Which is the fastest might be obvious on X86-64 (it's likely
AVX512>AVX2>SSE>scalar), but for other architectures it's not necessarily
the case. There's a lot of e.g. Aarch64 cores out there, and it's possible
a wide-issue scalar core would outperform a 64-bits wide NEON pipe in some
cases. And the NEON itself might need to be tuned differently for different
cores.

Cordially,

Romain Dolbeau

ironMann · 2016-06-20T13:15:06Z

@rdolbeau You are right, but there's currently no real use-case for having 'real fastest' impl in userspace tools. These apps cycle through all supported impl. for better test coverage. Real benchmark is still performed on 'zfs' module load, but removing it in userspace saves a lot of time on startup of ztest and zdb.

@behlendorf vdev_zaps_* seem to lack proper setup. If I run them standalone on my dev VM 005 fails, but they all pass in the whole suite. I've messed with the balance by adding default_cleanup in /raidz/raidz_*. I've tried adding default_setup_noexit ${DISKS%% *} to zaps/setup.ksh but that caused zaps_001 to fail, and all others to pass... Sorry, I don't have any more time now to debug this further.

behlendorf · 2016-06-20T18:56:46Z

@ironMann thanks for looking at the vdev_zaps_* failures. It's clear these failures are unrelated to this PR and the test cases themselves need to be improved. Given that they're transient and seem to be infrequent I'm OK with tackling them as their own issue. So they don't need to hold this up.

The testing I've done over the last week on real hardware has been as abusive as I could make it, and it was designed to cover as much of the testing space as possible: raidz[123] geometries, 4k-16M block sizes, 1-20 device-per-raidz groups, full device rebuilds, all new algorithms checked. I wasn't able to uncover any problems with the patch.

This patch LGTM. Nice job! Let me know if you're happy with this as a final version so it can be merged. We can always tweak the administrative aspects of it a little if needed after it's merged. But the core of it looks completely solid.

ironMann · 2016-06-20T21:54:37Z

@behlendorf Thanks, I think the patch is in OK shape, too. The core raidz framework should be generic enough to support the addition of new instruction set implementations with minimal effort (also 32bit SSE/AVX2 variants if relevant).

behlendorf · 2016-06-21T16:44:07Z

Merged to master as:

ab9f4b0 SIMD implementation of vdev_raidz generate and reconstruct routines

@ironMann thanks again for implementing this and building a solid generic framework we can extend. @rdolbeau you should be all set to extend this for NEON.

thegreatgazoo · 2016-07-12T17:25:04Z

I'm testing master branch today. A few comments/questions on this patch:

# cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
[fastest] original scalar sse avx2

So I'm using the fastest option, but which one is that? As an end-user, I'd want to know which implementation is actually in use, but "fastest" doesn't give me that information.

# ls -l /sys/module/zfs/parameters/zfs_vdev_raidz_impl
-rw-r--r-- 1 root root 4096 Jul 12 17:15 /sys/module/zfs/parameters/zfs_vdev_raidz_impl
# echo original > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
-bash: echo: write error: Invalid argument
# echo "original" > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
-bash: echo: write error: Invalid argument

The file is writable so I guess I'd be able to change it at run-time, but it ended up with write errors. Did I do something wrong here?

thegreatgazoo · 2016-07-12T17:28:33Z

man/man5/zfs-module-parameters.5

+\fBzfs_vdev_raidz_impl\fR (string)
+.ad
+.RS 12n
+Parameter for selecting raidz implementation to use.


I'd rather call it raidz parity implementation, just to be clear.

ironMann · 2016-07-12T18:04:59Z

@thegreatgazoo The fastest impl. is envisioned to be found at runtime. So there's 10 raidz parity routines in total, and in theory the fastest can pick up function pointers from all supported implementations, if that is preferable. But in practice, it's probably going to use the widest SIMD implementation your CPU supports.

The file is writable so I guess I'd be able to change it at run-time, but it ended up with write errors. Did I do something wrong here?

That's news. Can you try with printf "original" > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
It might be that echo adds a newline.

behlendorf · 2016-07-12T18:06:56Z

It might be that echo adds a newline.

This is exactly what's happening, we'll want to trim the trailing white space. I definitely tested this so I'm not sure how I missed it.
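A minimal sketch of the kind of fix being discussed: ignore trailing whitespace (such as the newline that echo appends) before matching the value against the known names. The table and the `impl_lookup()` helper are hypothetical and only mirror the option names listed in this thread; the real setter may be structured differently.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option table; mirrors the names listed in this thread. */
static const char *const impl_names[] = {
	"fastest", "original", "scalar", "sse", "avx2"
};

/*
 * Return the index of the implementation named by val, ignoring any
 * trailing whitespace, or -1 if the name is unknown.
 */
int
impl_lookup(const char *val)
{
	size_t len = strlen(val);

	/* Drop the trailing newline/space/tab that echo or a shell may add. */
	while (len > 0 && isspace((unsigned char)val[len - 1]))
		len--;

	for (size_t i = 0; i < sizeof (impl_names) / sizeof (impl_names[0]);
	    i++) {
		/* Require an exact match of the trimmed value. */
		if (strlen(impl_names[i]) == len &&
		    strncmp(val, impl_names[i], len) == 0)
			return ((int)i);
	}
	return (-1);
}
```

With this, `"original"` and `"original\n"` resolve to the same entry, while a longer string such as `"sse4"` is still rejected.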

ironMann · 2016-07-12T18:09:34Z

@behlendorf It seems that echo is inconsistent with that. I'll fold these fixes into #4815

rdolbeau · 2016-07-12T18:11:55Z

2016-07-12 20:05 GMT+02:00 Gvozden Neskovic [email protected]:

But in practice, it's probably going to use the widest SIMD
implementation your CPU supports.

... on x86-64. On Aarch64 (#4801), it depends on the core. Some have fast, wide
NEON support... some don't :-(

Cordially,

Romain Dolbeau

thegreatgazoo · 2016-07-12T19:44:07Z

@ironMann Yes newline was the problem, echo -n worked. Since mere echo works with setting other parameters under /sys/module/zfs/parameters/, I'd think it should be fixed. BTW, I saw in the code:

module_param_call(zfs_vdev_raidz_impl, zfs_vdev_raidz_impl_set,
        zfs_vdev_raidz_impl_get, NULL, 0644);

The zfs_vdev_raidz_impl_get() grabs vdev_raidz_impl_lock, but zfs_vdev_raidz_impl_set() does not - that doesn't seem right to me.

thegreatgazoo · 2016-07-12T19:52:17Z

@ironMann Thanks for the explanation on [fastest]. My real question is, as the admin, I want to know exactly what that [fastest] is, but /sys/module/zfs/parameters/zfs_vdev_raidz_impl doesn't tell me that. Another scenario would be bug reporting, if a user suspects/hits a bug in the parity routine, we'd at least need to know what that [fastest] points to, preferably without patching and rebooting since technically [fastest] is dynamic.

ironMann · 2016-07-12T20:41:40Z

@thegreatgazoo That locking is indeed a bug. Thing is, that method can be called before the lock is even initialized (when module parameter is specified), and also through api from userspace. will fix.

That's a legitimate point about [fastest] option. Currently there's a kstat for measured throughput of all methods, and contents of [fastest] can be deduced from that, but there should exist a better way. I can extend kstat data to explicitly show what instruction set is used.
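On the locking point, one way to picture the intended behaviour is a getter and setter that take the same lock, with the lock initialized statically so it is valid even if the setter runs before any init routine. The sketch below is a userspace analogy using a pthread mutex, not the kernel code; `impl_set()`, `impl_get()`, and the variables are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

static const char *const names[] = { "fastest", "original", "scalar" };
#define	N_NAMES	(sizeof (names) / sizeof (names[0]))

/* Statically initialized, so it is valid before any init routine runs. */
static pthread_mutex_t impl_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t impl_selected = 0;	/* index into names[], under impl_lock */

/* Setter: takes the same lock as the getter. Returns 0, or -1 if unknown. */
int
impl_set(const char *val)
{
	int err = -1;

	pthread_mutex_lock(&impl_lock);
	for (size_t i = 0; i < N_NAMES; i++) {
		if (strcmp(val, names[i]) == 0) {
			impl_selected = i;
			err = 0;
			break;
		}
	}
	pthread_mutex_unlock(&impl_lock);
	return (err);
}

/* Getter: reads the current selection under the lock. */
const char *
impl_get(void)
{
	pthread_mutex_lock(&impl_lock);
	const char *name = names[impl_selected];
	pthread_mutex_unlock(&impl_lock);
	return (name);
}
```

In the module itself the equivalent would depend on which kernel locking primitive is used and when it becomes usable during load, which is exactly the ordering problem mentioned above.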

* Consistently use parsable instead of parseable This is a purely cosmetical change, to consistently prefer one of two (both acceptable) choises for the word parsable in documentation and code. I don't really care which to use, but acording to wiktionary https://en.wiktionary.org/wiki/parsable#English parsable is preferred. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4682 * Add missing RPM BuildRequires Both libudev and libattr are recommended build requirements. As such their development headers should lists in the rpm spec file so those dependencies are pulled in when building rpm packages. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4676 * Skip ctldir znode in zfs_rezget to fix snapdir issues Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This will cause funny behaviour for the mounted snapdirs. Especially for Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone automount it again as long as someone is still using the detached mount. Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4514 Closes #4661 Closes #4672 * Improve zfs-module-parameters(5) Various rewrites to the descriptions of module parameters. Corrects spelling mistakes, makes descriptions them more user-friendly and describes some ZFS quirks which should be understood before changing parameter values. Signed-off-by: DHE <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4671 * Fix arc_prune_task use-after-free arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent the underlying zsb from disappearing if there's a concurrent umount. We fix this by force the caller of arc_remove_prune_callback to wait for arc_prune_taskq to finish. 
Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4687 Closes #4690 * Add request size histograms (-r) to zpool iostat, minor man page fix Add -r option to "zpool iostat" to print request size histograms for the leaf ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs ("agg"). These stats can be useful for seeing how well the ZFS IO aggregator is working. $ zpool iostat -r mypool sync_read sync_write async_read async_write scrub req_size ind agg ind agg ind agg ind agg ind agg ---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 512 0 0 0 0 0 0 530 0 0 0 1K 0 0 260 0 0 0 116 246 0 0 2K 0 0 0 0 0 0 0 431 0 0 4K 0 0 0 0 0 0 3 107 0 0 8K 15 0 35 0 0 0 0 6 0 0 16K 0 0 0 0 0 0 0 39 0 0 32K 0 0 0 0 0 0 0 0 0 0 64K 20 0 40 0 0 0 0 0 0 0 128K 0 0 20 0 0 0 0 0 0 0 256K 0 0 0 0 0 0 0 0 0 0 512K 0 0 0 0 0 0 0 0 0 0 1M 0 0 0 0 0 0 0 0 0 0 2M 0 0 0 0 0 0 0 0 0 0 4M 0 0 0 0 0 0 155 19 0 0 8M 0 0 0 0 0 0 0 811 0 0 16M 0 0 0 0 0 0 0 68 0 0 -------------------------------------------------------------------------------- Also rename the stray "-G" in the man page to be "-w" for latency histograms. Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tim Chase <[email protected]> Closes #4659 * OpenZFS 6531 - Provide mechanism to artificially limit disk performance Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> Ported by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6531 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130 Porting notes: - Added new IO delay tracepoints, and moved common ZIO tracepoint macros to a new trace_common.h file. 
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function. - Updated zinject man page - Updated zpool_scrub test files * Systemd configuration fixes * Disable zfs-import-scan.service by default. This ensures that pools will not be automatically imported unless they appear in the cache file. When this service is explicitly enabled pools will be imported with the "cachefile=none" property set. This prevents the creation of, or update to, an existing cache file. $ systemctl list-unit-files | grep zfs zfs-import-cache.service enabled zfs-import-scan.service disabled zfs-mount.service enabled zfs-share.service enabled zfs-zed.service enabled zfs.target enabled * Change services to dynamic from static by adding an [Install] section and adding 'WantedBy' tags in favor of 'Requires' tags. This allows for easier customization of the boot behavior. * Start the zfs-import-cache.service after the root pivot so the cache file is available in the standard location. * Start the zfs-mount.service after the systemd-remount-fs.service to ensure the root fs is writeable and the ZFS filesystems can create their mount points. * Change the default behavior to only load the ZFS kernel modules in zfs-import-*.service or when blkid(8) detects a pool. Users who wish to unconditionally load the kernel modules must uncomment the list of modules in /lib/modules-load.d/zfs.conf. Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4325 Closes #4496 Closes #4658 Closes #4699 * Fix self-healing IO prior to dsl_pool_init() completion Async writes triggered by a self-healing IO may be issued before the pool finishes the process of initialization. This results in a NULL dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes(). George Wilson recommended addressing this issue by initializing the passed `dsl_pool_t **` prior to dmu_objset_open_impl(). 
Since the caller is passing the `spa->spa_dsl_pool` this has the effect of ensuring it's initialized. However, since this depends on the caller knowing they must pass the `spa->spa_dsl_pool` an additional NULL check was added to vdev_queue_max_async_writes(). This guards against any future restructuring of the code which might result in dsl_pool_init() being called differently. Signed-off-by: GeLiXin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4652 * Add isa_defs for MIPS GCC for MIPS only defines _LP64 when 64bit, while no _ILP32 defined when 32bit. Signed-off-by: YunQiang Su <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4712 * Fix out-of-bound access in zfs_fillpage The original code will do an out-of-bound access on pl[] during last iteration. ================================================================== BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs] Read of size 8 by task tmpfile/7850 page:ffffea00017c6dc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0xffff8000000000() page dumped because: kasan: bad access detected CPU: 3 PID: 7850 Comm: tmpfile Tainted: G OE 4.6.0+ #3 ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618 ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8 ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3 Call Trace: [<ffffffff81635618>] dump_stack+0x63/0x8b [<ffffffff81313ee8>] kasan_report_error+0x528/0x560 [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0 [<ffffffff813144b8>] kasan_report+0x58/0x60 [<ffffffffc12250dc>] ? 
zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffff81312e4e>] __asan_load8+0x5e/0x70 [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs] [<ffffffff81353c3a>] SyS_execve+0x3a/0x50 [<ffffffff810058ef>] do_syscall_64+0xef/0x180 [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25 Memory state around the buggy address: ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 ^ ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4705 Issue #4708 * Fix memleak in zpl_parse_options strsep() will advance tmp_mntopts, and will change it to NULL on last iteration. This will cause strfree(tmp_mntopts) to not free anything. 
unreferenced object 0xffff8800883976c0 (size 64): comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s) hex dump (first 32 bytes): 72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a rw.strictatime.z 66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d fsutil.mntpoint= backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811f9cac>] __kmalloc+0x16c/0x250 [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl] [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs] [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs] [<ffffffff81222dc8>] mount_fs+0x38/0x160 [<ffffffff81240097>] vfs_kern_mount+0x67/0x110 [<ffffffff812428e0>] do_mount+0x250/0xe20 [<ffffffff812437d5>] SyS_mount+0x95/0xe0 [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4706 Issue #4708 * Fix memleak in vdev_config_generate_stats fnvlist_add_nvlist will copy the contents of nvx, so we need to free it here. unreferenced object 0xffff8800a6934e80 (size 64): comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s) hex dump (first 32 bytes): 60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff `..s.....|.s.... 00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff [email protected]..... 
backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310 [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl] [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl] [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair] [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair] [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair] [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair] [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs] [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs] [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs] [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs] [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs] [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs] [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0 [<ffffffff812333b9>] SyS_ioctl+0x79/0x90 Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4707 Issue #4708 * Linux 4.7 compat: handler->set() takes both dentry and inode Counterpart to fd4c7b7, the same approach was taken to resolve the compatibility issue. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes #4717 Issue #4665 * Implementation of AVX2 optimized Fletcher-4 New functionality: - Preserves existing scalar implementation. - Adds AVX2 optimized Fletcher-4 computation. - Fastest routines selected on module load (benchmark). - Test case for Fletcher-4 added to ztest. New zcommon module parameters: - zfs_fletcher_4_impl (str): selects the implementation to use. 
"fastest" - use the fastest version available "cycle" - cycle trough all available impl for ztest "scalar" - use the original version "avx2" - new AVX2 implementation if available Performance comparison (Intel i7 CPU, 1MB data buffers): - Scalar: 4216 MB/s - AVX2: 14499 MB/s See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl` to get list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. Signed-off-by: Jinshan Xiong <[email protected]> Signed-off-by: Andreas Dilger <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4330 * Fix cstyle.pl warnings As of perl v5.22.1 the following warnings are generated: * Redundant argument in printf at scripts/cstyle.pl line 194 * Unescaped left brace in regex is deprecated, passed through in regex; marked by <-- HERE in m/\S{ <-- HERE / at scripts/cstyle.pl line 608. They have been addressed by escaping the left braces and by providing the correct number of arguments to printf based on the fmt specifier set by the verbose option. Signed-off-by: Brian Behlendorf <[email protected]> Closes #4723 * Fix minor spelling mistakes Trivial spelling mistake fix in error message text. * Fix spelling mistake "adminstrator" -> "administrator" * Fix spelling mistake "specificed" -> "specified" * Fix spelling mistake "interperted" -> "interpreted" Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4728 * Add `zfs allow` and `zfs unallow` support ZFS allows for specific permissions to be delegated to normal users with the `zfs allow` and `zfs unallow` commands. In addition, non- privileged users should be able to run all of the following commands: * zpool [list | iostat | status | get] * zfs [list | get] Historically this functionality was not available on Linux. 
In order to add it the secpolicy_* functions needed to be implemented and mapped to the equivalent Linux capability. Only then could the permissions on the `/dev/zfs` be relaxed and the internal ZFS permission checks used.

Even with this change some limitations remain. Under Linux only the root user is allowed to modify the namespace (unless it's a private namespace). This means the mount, mountpoint, canmount, unmount, and remount delegations cannot be supported with the existing code. It may be possible to add this functionality in the future.

This functionality was validated with the cli_user and delegation test cases from the ZFS Test Suite. These tests exhaustively verify each of the supported permissions which can be delegated and ensure only an authorized user can perform it.

Two minor bug fixes were required for test-running.py. First, the Timer() object cannot be safely created in a `try:` block when there is an unconditional `finally` block which references it. Second, when running as a normal user also check for scripts using both the .ksh and .sh suffixes.

Finally, existing users who are simulating delegations by setting group permissions on the /dev/zfs device should revert that customization when updating to a version with this change.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes #362
Closes #434
Closes #4100
Closes #4394
Closes #4410
Closes #4487

* Remove libzfs_graph.c

The libzfs_graph.c source file should have been removed in 330d06f, it is entirely unused.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4766

* Linux 4.6 compat: Fall back to d_prune_aliases() if necessary

As of 4.6, the icache and dcache LRUs are memcg aware insofar as the kernel's per-superblock shrinker is concerned. The effect is that dcache or icache entries added by a task in a non-root memcg won't be scanned by the shrinker in the context of the root (or NULL) memcg.
This defeats the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to grow uncontrollably. This patch reverts to the d_prune_aliases() method in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes: #4726

* SIMD implementation of vdev_raidz generate and reconstruct routines

This is a new implementation of RAIDZ1/2/3 routines using x86_64 scalar, SSE, and AVX2 instruction sets. Included are 3 parity generation routines (P, PQ, and PQR) and 7 reconstruction routines, for all RAIDZ levels. On module load, a quick benchmark of supported routines will select the fastest for each operation and they will be used at runtime. The original implementation is still present and can be selected via module parameter.

Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instruction sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On module load, the parameter will only accept the first 3 options, and the other implementations can be set once the module has finished loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to get the list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`.
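The bracket convention described above can be read back programmatically. The following is a minimal sketch in C, assuming the parameter file holds the supported names on one line separated by whitespace (the text only states that the selected option is enclosed in `[]`). `selected_impl` is a hypothetical helper for illustration and is not part of ZFS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed entry of a parameter listing, for example
 * "fastest [original] scalar sse avx2", into out.
 * The whitespace-separated layout is an assumption of this sketch.
 * Returns 0 on success, -1 if no non-empty "[name]" fits in out.
 */
int
selected_impl(const char *listing, char *out, size_t outlen)
{
	const char *open, *close;
	size_t len;

	assert(listing != NULL && out != NULL);

	open = strchr(listing, '[');
	if (open == NULL)
		return (-1);
	close = strchr(open + 1, ']');
	if (close == NULL)
		return (-1);

	len = (size_t)(close - (open + 1));
	if (len == 0 || len >= outlen)
		return (-1);

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return (0);
}
```

A caller would read the contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` into a buffer and pass it as `listing`; the same helper would apply to `zfs_fletcher_4_impl`, which uses the same bracket convention.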
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4328

* Fix NFS credential

The commit f74b821 caused a regression where creating a file through NFS will always create a file owned by root. This is because the patch enables the KSID code in zfs_acl_ids_create, which would use the euid and egid of the current process. However, on Linux, we should use fsuid and fsgid for file operations, which is the original behaviour. So we revert this part of the code.

The patch also enables secpolicy_vnode_*. Since they are also used in file operations, we change them to use fsuid and fsgid.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4772
Closes #4758

* OpenZFS 6513 - partially filled holes lose birth time

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Boris Protopopov <[email protected]>
Approved by: Richard Lowe <[email protected]>
Ported by: Boris Protopopov <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6513
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0

If a ZFS object contains a hole at level one, and then a data block is created at level 0 underneath that l1 block, l0 holes will be created. However, these l0 holes do not have the birth time property set; as a result, incremental sends will not send those holes.

Fix is to modify the dbuf_read code to fill in birth time data.
* Add a test case for dmu_free_long_range() to ztest Signed-off-by: Boris Protopopov <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #4754 * Revert "Add a test case for dmu_free_long_range() to ztest" This reverts commit d0de2e82df579f4e4edf5643b674a1464fae485f which introduced a new test case to ztest which is failing occasionally during automated testing. The change is being reverted until the issue can be fully investigated. Signed-off-by: Brian Behlendorf <[email protected]> Issue #4754 * OpenZFS 6878 - Add scrub completion info to "zpool history" Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Approved by: Dan McDonald <[email protected]> Authored by: Nav Ravindranath <[email protected]> Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6878 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5 Closes #4787 * FreeBSD rS271776 - Persist vdev_resilver_txg changes Persist vdev_resilver_txg changes to avoid panic caused by validation vs a vdev_resilver_txg value from a previous resilver. Authored-by: smh <[email protected]> Ported-by: Chris Dunlop <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/5154 FreeBSD-issue: https://reviews.freebsd.org/rS271776 FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf Closes #4790 * xattrtest: allow verify with -R and other improvements - Use a fixed buffer of random bytes when random xattr values are in effect. This eliminates the potential performance bottleneck of reading from /dev/urandom for each file. This also allows us to verify xattrs in random value mode. - Show the rate of operations per second in addition to elapsed time for each phase of the test. This may be useful for benchmarking. 
- Set default xattr size to 6 so that verify doesn't fail if user doesn't specify a size. We need at least six bytes to store the leading "size=X" string that is used for verification. - Allow user to execute just one phase of the test. Acceptable values for -o and their meanings are: 1 - run the create phase 2 - run the setxattr phase 3 - run the getxattr phase 4 - run the unlink phase Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> * Backfill metadnode more intelligently Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in #4636: "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don’t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can’t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2." Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. 
Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limiting the rescan to at most once per TXG.

Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

* Implement large_dnode pool feature

Justification
-------------

This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced from 33 to two, one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant.

ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill blocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names.
Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. 
Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value.

The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interface names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
    dmu_object_alloc_dnsize()
    dmu_object_claim_dnsize()
    dmu_object_reclaim_dnsize()

New ZAP interfaces:
    zap_create_dnsize()
    zap_create_norm_dnsize()
    zap_create_flags_dnsize()
    zap_create_claim_norm_dnsize()
    zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by a large dnode, in which case it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. 
A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. 
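The ZIL Replay layout described above (slot count in the uppermost 8 bits of a 64-bit `lr_foid`, object id limited to 48 bits) can be sketched as plain bit manipulation. This is an illustrative sketch only: the macro and function names are hypothetical, and it stores the 8-bit value as given, since the text does not say whether the stored count is biased.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the text above. */
#define	FOID_SLOTS_SHIFT	56	/* uppermost 8 bits: 56..63 */
#define	FOID_OBJ_BITS		48	/* object ids are capped at 48 bits */
#define	FOID_OBJ_MASK		((UINT64_C(1) << FOID_OBJ_BITS) - 1)

/* Combine an object id and an 8-bit slot value into one 64-bit field. */
uint64_t
foid_pack(uint64_t obj, uint8_t slots)
{
	/* The object id must fit in the low 48 bits. */
	assert((obj & ~FOID_OBJ_MASK) == 0);
	return (obj | ((uint64_t)slots << FOID_SLOTS_SHIFT));
}

/* Recover the object id from the low 48 bits. */
uint64_t
foid_object(uint64_t foid)
{
	return (foid & FOID_OBJ_MASK);
}

/* Recover the slot value from the uppermost 8 bits. */
uint8_t
foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> FOID_SLOTS_SHIFT));
}
```

Because the upper bits were previously unused, a record written with a slot value of 0 is bit-for-bit identical to a plain object id, which is the property that lets older records keep their meaning in this sketch.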
Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #3542 * Sync DMU_BACKUP_FEATURE_* flags Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING. The DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and then reserved in the upstream OpenZFS implementation. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Closes #4795 * OpenZFS 2605, 6980, 6902 2605 want to resume interrupted zfs send Reviewed by: George Wilson <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Richard Elling <[email protected]> Reviewed by: Xin Li <[email protected]> Reviewed by: Arne Jansen <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: kernelOfTruth <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/2605 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6980 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f Porting 
notes:
- All rsend and snapshot tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in 'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version, this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.

* OpenZFS 6051 - lzc_receive: allow the caller to read the begin record

Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6051
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322

* OpenZFS 6393 - zfs receive a full send as a clone

Authored by: Paul Dagnelie <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6394
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e

* OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS

Authored by: Andrew Stormont <[email protected]>
Reviewed by: Anil Vijarnia <[email protected]>
Reviewed by: Kim Shrier <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6536
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b

* OpenZFS 6738 - zfs send
stream padding needs documentation Authored by: Eli Rosenthal <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Dan Kimmel <[email protected]> Reviewed by: Paul Dagnelie <[email protected]> Reviewed by: Dan McDonald <[email protected]> Approved by: Robert Mustacchi <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6738 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff * OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota Authored by: Dan McDonald <[email protected]> Reviewed by: John Kennedy <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Approved by: Gordon Ross <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/4986 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad * OpenZFS 6562 - Refquota on receive doesn't account for overage Authored by: Dan McDonald <[email protected]> Reviewed by: Matthew Ahrens <[email protected]> Reviewed by: Yuri Pankov <[email protected]> Reviewed by: Toomas Soome <[email protected]> Approved by: Gordon Ross <[email protected]> Ported-by: Brian Behlendorf <[email protected]> OpenZFS-issue: https://www.illumos.org/issues/6562 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6 * Implement zfs_ioc_recv_new() for OpenZFS 2605 Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy ZFS_IOC_RECV user/kernel interface. The new interface supports all stream options but is currently only used for resumable streams. This way updated user space utilities will interoperate with older kernel modules. ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW handler. Non-Linux OpenZFS platforms have opted to change the legacy interface in an incompatible fashion instead of adding a new ioctl. 
Signed-off-by: Brian Behlendorf <[email protected]>

* OpenZFS 6314 - buffer overflow in dsl_dataset_name

Reviewed by: George Wilson <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Igor Kozhukhov <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/6314
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee

* OpenZFS 6876 - Stack corruption after importing a pool with a too-long name

Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Yuri Pankov <[email protected]>
Ported-by: Brian Behlendorf <[email protected]>

Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes.

OpenZFS-issue: https://www.illumos.org/issues/6876
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e

* Vectorized fletcher_4 must be 128-bit aligned

The fletcher_4_native() and fletcher_4_byteswap() functions may only safely use the vectorized implementations when the buffer is 128-bit aligned. This is because both the AVX2 and SSE implementations process four 32-bit words per iteration. Fall back to the scalar implementation, which only processes a single 32-bit word, for unaligned buffers.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Issue #4330

* Allow building with `CFLAGS="-O0"`

If compiled with -O0, gcc doesn't do any stack frame coalescing and -Wframe-larger-than=1024 is triggered in debug mode. Starting with gcc 4.8, new opt level -Og is introduced for debugging, which does not trigger this warning.
Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4799

* Don't allow accessing XATTR via export handle

Allowing access to XATTR through an export handle is a very bad idea. It would allow users to write whatever they want in fields where they otherwise could not.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4828

* Fix get_zfs_sb race with concurrent umount

Certain ioctl operations will call get_zfs_sb, which will hold an active count on sb without checking whether it's active or not. This will result in use-after-free. We fix this by using atomic_inc_not_zero to make sure we got an active sb.

    P1                                          P2
    ---                                         ---
    deactivate_locked_super(): s_active = 0
                                                zfs_sb_hold()
                                                ->get_zfs_sb(): s_active = 1
    ->zpl_kill_sb()
    -->zpl_put_super()
    --->zfs_umount()
    ---->zfs_sb_free(zsb)
                                                zfs_sb_rele(zsb)

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

* Fix Large kmem_alloc in vdev_metaslab_init

This allocation can go way over 1MB, so we should use vmem_alloc instead of kmem_alloc.

Large kmem_alloc(1430784, 0x1000), please file an issue...
Call Trace:
    [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl]
    [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs]
    [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs]
    [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs]
    [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs]
    [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs]
    [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs]
    [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs]
    [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0
    [<ffffffff811baf41>] ?
SyS_ioctl+0x81/0xa0

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4752

* Add configure result for xattr_handler

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4828

* fh_to_dentry should return ESTALE when generation mismatch

When the generation mismatches, it usually means the file pointed to by the file handle was deleted. We should return ESTALE to indicate this. We return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4828

* xattr dir doesn't get purged during iput

We need to set inode->i_nlink to zero so iput will purge it. Without this, it will get purged during shrink cache or umount, which would likely result in deadlock due to zfs_zget waiting forever on its children which are in the dispose_list of the same thread.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Kill zp->z_xattr_parent to prevent pinning

zp->z_xattr_parent will pin the parent. This will cause a huge issue when unlinking a file with xattr. Because the unlinked file is pinned, it will never get purged immediately. And because of that, the xattr stuff will never be marked as unlinked. So the whole unlinked stuff will stay there until shrink cache or umount.

This change partially reverts e89260a. This is safe because only the zp->z_xattr_parent optimization is removed, zpl_xattr_security_init() is still called from the zpl outside the inode lock.
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Fix RAIDZ_TEST tests

Remove stray trailing } which prevented the raidz stress tests from running in-tree.

Signed-off-by: Brian Behlendorf <[email protected]>

* Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z

The following scenario can result in garbage in the dn_spill field. The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR is clear to ensure the dn_spill field is cleared.

Current txg = A.
    * A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied.

Current txg = B.
    * The spill buffer is modified. It's marked as dirty in this txg.
    * Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed.

Current txg = C.
    * Starts syncing of txg A
    * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called.
    * The dbuf starts being written and it reaches the ready state (not done yet).
    * A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL.
    * txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed.

Current txg = D.
    * Starts syncing of txg B
    * dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr.
    * At this point, the db_blkptr of the spill buffer used in txg C gets corrupted.

Signed-off-by: Peng <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #3937

* Fix handling of errors nvlist in zfs_ioc_recv_new()

zfs_ioc_recv_impl() is changed to always allocate the 'errors' nvlist, its callers are responsible for freeing it.

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4829

* Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.

The patch covers low-end and older x86 CPUs. Parity generation is equivalent to SSSE3 implementation, but reconstruction is somewhat slower. Previous 'sse' implementation is renamed to 'ssse3' to indicate highest instruction set used.

Benchmark results:

    scalar_rec_p      4   720476442
    scalar_rec_q      4   187462804
    scalar_rec_r      4   138996096
    scalar_rec_pq     4   140834951
    scalar_rec_pr     4   129332035
    scalar_rec_qr     4    81619194
    scalar_rec_pqr    4    53376668

    sse2_rec_p        4  2427757064
    sse2_rec_q        4   747120861
    sse2_rec_r        4   499871637
    sse2_rec_pq       4   522403710
    sse2_rec_pr       4   464632780
    sse2_rec_qr       4   319124434
    sse2_rec_pqr      4   205794190

    ssse3_rec_p       4  2519939444
    ssse3_rec_q       4  1003019289
    ssse3_rec_r       4   616428767
    ssse3_rec_pq      4   706326396
    ssse3_rec_pr      4   570493618
    ssse3_rec_qr      4   400185250
    ssse3_rec_pqr     4   377541245

    original_rec_p    4   691658568
    original_rec_q    4   195510948
    original_rec_r    4    26075538
    original_rec_pq   4   103087368
    original_rec_pr   4    15767058
    original_rec_qr   4    15513175
    original_rec_pqr  4    10746357

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4783

* Enable zpool_upgrade test cases

Creating the pool in a striped rather than mirrored configuration provides enough space for all upgrade tests to run. Test case zpool_upgrade_007_pos still fails and must be investigated so it has been left disabled.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4852

* Prevent null dereferences when accessing dbuf kstat

In arc_buf_info(), the arc_buf_t may have no header. If not, don't try to fetch the arc buffer stats and instead just zero them.

The null dereferences were observed while accessing the dbuf kstat with awk on a system in which millions of small files were being created in order to overflow the system's metadata limit.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4837

* Fix dbuf_stats_hash_table_data race

Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf can be freed at any time.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4846

* Use native inode->i_nlink instead of znode->z_links

A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's 64 bit on-disk link count.

We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a more Linux-integrated fix for the same issue.

In addition, setting the initial link count on a new node has been changed from setting one less than required in zfs_mknode() then incrementing to the correct count in zfs_link_create() (which was somewhat bizarre in the first place), to setting the correct count in zfs_mknode() and not incrementing it in zfs_link_create(). This both means we no longer set the link count in sa_bulk_update() twice (once for the initial incorrect count then again for the correct count), as well as adhering to the Linux requirement of not incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c).

Signed-off-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Closes #4838
Issue #227

* Implementation of SSE optimized Fletcher-4

Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)

This commit adds another implementation of the Fletcher-4 algorithm. It is automatically selected at module load if it benchmarks higher than all other available implementations.

The module benchmark was also amended to analyze the performance of the byteswapped version of Fletcher-4, as well as the non-byteswapped version. The average performance of the two is used to select the fastest implementation available on the host system.

Adds a pair of fields to an existing zcommon module parameter:

- zfs_fletcher_4_impl (str)
    "sse2" - new SSE2 implementation if available
    "ssse3" - new SSSE3 implementation if available

Signed-off-by: Tyler J. Stachecki <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4789

* Fix filesystem destroy with receive_resume_token

It is possible that the given DS may have hidden child (%recv) datasets - "leftovers" resulting from the previously interrupted 'zfs receive'. Try to remove the hidden child (%recv) and after that try to remove the target dataset. If the hidden child (%recv) does not exist the original error (EEXIST) will be returned.

Signed-off-by: Roman Strashkin <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4818

* Prevent segfaults in SSE optimized Fletcher-4

In some cases, the compiler was not respecting the GNU aligned attribute for stack variables in 35a76a0. This was resulting in a segfault on CentOS 6.7 hosts using gcc 4.4.7-17. This issue was fixed in gcc 4.6.

To prevent this from occurring, use unaligned loads and stores for all stack and global memory references in the SSE optimized Fletcher-4 code.

Disable zimport testing against master where this flaw exists:

    TEST_ZIMPORT_VERSIONS="installed"

Signed-off-by: Tyler J. Stachecki <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4862

* Update arc_summary.py for prefetch changes

Commit 7f60329 removed several kstats which arc_summary.py read. Remove these kstats from arc_summary.py in the same way this was handled in FreeNAS.

FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4695

* Wait iput_async before evict_inodes to prevent race

Wait for iput_async before entering evict_inodes in generic_shutdown_super. The reason we must finish before evict_inodes is when lazytime is on, or when zfs_purgedir calls zfs_zget, iput would bump i_count from 0 to 1. This would race with the i_count check in evict_inodes. This means it could destroy the inode while we are still using it.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4854

* Fixes and enhancements of SIMD raidz parity

- Implementation lock replaced with atomic variable
- Trailing whitespace is removed from user specified parameter, to enhance experience when using commands that add newline, e.g. `echo`
- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813
- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392
- Minor fixes and cleanups
- Enable use of original parity methods in [fastest] configuration. New opaque original ops structure, representing native methods, is added to supported raidz methods. Original parity methods are executed if selected implementation has NULL fn pointer.

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4813
Issue #1392

* RAIDZ parity kstat rework

Print table with speed of methods for each implementation.
Last line describes contents of [fastest] selection.

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4860

* Fix NULL pointer in zfs_preumount from 1d9b3bd

When zfs_domount fails zsb will be freed, and its caller mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into zfs_preumount.

In order to make sure we don't touch any nonexistent stuff, we must make sure s_fs_info is NULL in the fail path so zfs_preumount can easily check that.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4867
Issue #4854

* Illumos Crypto Port module added to enable native encryption in zfs

A port of the Illumos Crypto Framework to a Linux kernel module (found in module/icp). This is needed to do the actual encryption work. We cannot use the Linux kernel's built in crypto api because it is only exported to GPL-licensed modules. Having the ICP also means the crypto code can run on any of the other kernels under OpenZFS. I ended up porting over most of the internals of the framework, which means that porting over other API calls (if we need them) should be fairly easy. Specifically, I have ported over the API functions related to encryption, digests, macs, and crypto templates. The ICP is able to use assembly-accelerated encryption on amd64 machines and AES-NI instructions on Intel chips that support it. There are place-holder directories for similar assembly optimizations for other architectures (although they have not been written).

Signed-off-by: Tom Caputi <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4329

* Fix for compilation error when using the kernel's CONFIG_LOCKDEP

Signed-off-by: Tom Caputi <[email protected]>
Signed-off-by: Chris Dunlop <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4329

* zloop: print backtrace from core files

Find the core file by using `/proc/sys/kernel/core_pattern`

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4874

* Fix for metaslab_fastwrite_unmark() assert failure

Currently there is an issue where metaslab_fastwrite_unmark() unmarks fastwrites on vdev_t's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that the vdev_pending_fastwrite field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assigns vdev_pending_fastwrite from the old vdev to the new one so this count is not lost.

Signed-off-by: Tom Caputi <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4267

* Remove znode's z_uid/z_gid member

Remove duplicate z_uid/z_gid member which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid|gid)_(read|write) functions to properly save the uids rather than the internal kernel representation.

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4685
Issue #227

* Check whether the kernel supports i_uid/gid_read/write helpers

Since the concept of a kuid and the need to translate from it to ordinary integer type was added in kernel version 3.5 implement necessary plumbing to be able to detect this condition during compile time. If the kernel doesn't support the kuid then just fall back to directly accessing the respective struct inode's members

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4685
Issue #227

* Fix uninitialized variable in avl_add()

Silence the following warning when compiling with gcc 5.4.0. Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.

    module/avl/avl.c: In function ‘avl_add’:
    module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    avl_insert(tree, new_node, where);

Signed-off-by: Brian Behlendorf <[email protected]>

* Fix sync behavior for disk vdevs

Prior to b39c22b, which was first generally available in the 0.6.5 release as b39c22b, ZoL never actually submitted synchronous read or write requests to the Linux block layer. This means the vdev_disk_dio_is_sync() function had always returned false and, therefore, the completion in dio_request_t.dr_comp was never actually used.

In b39c22b, synchronous ZIO operations were translated to synchronous BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and aa159af fixed several problems introduced by b39c22b. In particular, 5592404 introduced the new flag parameter "wait" to __vdev_disk_physio() but under ZoL, since vdev_disk_physio() is never actually used, the wait flag was always zero so the new code had no effect other than to cause a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.

The original rationale for introducing synchronous operations in b39c22b was to hurry certain requests through the BIO layer which would have otherwise been subject to its unplug timer which would increase the latency. This behavior of the unplug timer, however, went away during the transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.

To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior. For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and is used for the same purpose.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4858

* Limit the amount of dnode metadata in the ARC

Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit).

In order to help track metadata usage more precisely, the other_size metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes.

The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB):

- Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing.
- Create a 3GiB file with dd.
  Observe arc_meta_used. It will still be around 3GiB.
- Repeatedly read the 3GiB file and observe arc_meta_limit as before. It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made available as subsequent demands on th…
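
For readers who want to check the numbers in the dnode-limit commit message above, here is a minimal Python sketch of the arithmetic it describes. It is not the kernel code: the function names `dnode_limit` and `prune_target` are hypothetical, the integer rounding is an assumption, and only the tunable names and the two 10% defaults come from the commit message.

```python
GIB = 1 << 30

def dnode_limit(meta_limit_bytes, limit_percent=10):
    """Hypothetical helper: zfs_arc_dnode_limit as a percentage of
    zfs_arc_meta_limit (10% by default, per the commit message)."""
    return meta_limit_bytes * limit_percent // 100

def prune_target(dnode_bytes, limit_bytes, reduce_percent=10):
    """Hypothetical helper: bytes of dnode space a prune pass would aim for,
    taken as zfs_arc_dnode_reduce_percent of the excess over the limit.
    Returns 0 when dnode usage is at or under the limit."""
    excess = max(0, dnode_bytes - limit_bytes)
    return excess * reduce_percent // 100

# Example from the commit message: zfs_arc_max = 4GiB, arc_meta_limit = 3GiB.
meta_limit = 3 * GIB
limit = dnode_limit(meta_limit)
print("dnode limit (bytes):", limit)
# Assumed scenario: dnodes currently pin 2GiB of the ARC.
print("prune target (bytes):", prune_target(2 * GIB, limit))
```

With the default tuning in the example, the limit works out to roughly 307MiB, so a workload that pins gigabytes of dnodes would exceed it and trigger pruning.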

ironMann force-pushed the wip-simd branch from 2bbdf07 to e5361b9 Compare February 12, 2016 14:27

ironMann force-pushed the wip-simd branch 2 times, most recently from 01bce6e to 7e8471f Compare February 15, 2016 20:55

ironMann force-pushed the wip-simd branch 2 times, most recently from 6773297 to 2e741d3 Compare February 18, 2016 15:02

ironMann force-pushed the wip-simd branch from 2e741d3 to 4d0bb18 Compare February 18, 2016 16:51

behlendorf mentioned this pull request Feb 26, 2016

compute fletcher 4 with avx instructions #4330

Closed

ironMann mentioned this pull request Feb 29, 2016

Support for vectorized algorithms on x86 #4381

Closed

ironMann force-pushed the wip-simd branch 4 times, most recently from db01fae to 331af94 Compare March 12, 2016 08:42

ironMann force-pushed the wip-simd branch from 331af94 to b175a2b Compare March 21, 2016 17:22

ironMann force-pushed the wip-simd branch from b175a2b to 47c25ac Compare March 22, 2016 15:21

ironMann mentioned this pull request Mar 22, 2016

[RFC][WIP] Vectorized RAIDZ generate and reconstruct methods [abd] #4439

Closed

ironMann force-pushed the wip-simd branch from 47c25ac to bfe56ad Compare March 22, 2016 16:15

ironMann mentioned this pull request Apr 7, 2016

CPU cache miss is very high for RAID-Z1/2/3 parity computation #4497

Closed

ironMann force-pushed the wip-simd branch from bfe56ad to d57f5a0 Compare April 26, 2016 17:33

ironMann changed the title ~~[WIP][comments] SIMD implementation of vdev_raidz generate and reconstruct routines~~ [RFC] SIMD implementation of vdev_raidz generate and reconstruct routines Apr 26, 2016

ironMann force-pushed the wip-simd branch from d57f5a0 to ee7ce8a Compare April 27, 2016 05:38

ironMann force-pushed the wip-simd branch from a9a6f2b to 77f4ed1 Compare June 18, 2016 14:10

ironMann force-pushed the wip-simd branch from 77f4ed1 to a31e1d9 Compare June 18, 2016 23:06

behlendorf closed this in ab9f4b0 Jun 21, 2016

angstymeat mentioned this pull request Jun 21, 2016

SSE not available in New RAIDZ implementation #4783

Closed

thegreatgazoo reviewed Jul 12, 2016

behlendorf mentioned this pull request Jul 12, 2016

Add RAID-Z routines for SSE2 instruction set, in x86_64 mode. #4815

Closed

ironMann commented Feb 11, 2016 (edited)

behlendorf commented Feb 11, 2016

ironMann commented Feb 12, 2016

rdolbeau commented Feb 12, 2016

sempervictus commented Feb 16, 2016

ironMann commented Feb 18, 2016

rdolbeau commented Feb 18, 2016

behlendorf commented Feb 19, 2016

behlendorf commented Mar 21, 2016

ironMann commented Mar 24, 2016

behlendorf commented Mar 24, 2016

ofaaland commented Jun 13, 2016

ironMann commented Jun 16, 2016

behlendorf commented Jun 16, 2016

ofaaland commented Jun 17, 2016 (edited by behlendorf)

ironMann commented Jun 18, 2016

behlendorf commented Jun 18, 2016

rdolbeau commented Jun 19, 2016

ironMann commented Jun 20, 2016

behlendorf commented Jun 20, 2016

ironMann commented Jun 20, 2016

behlendorf commented Jun 21, 2016 (edited)

thegreatgazoo commented Jul 12, 2016

thegreatgazoo Jul 12, 2016

ironMann commented Jul 12, 2016

behlendorf commented Jul 12, 2016

ironMann commented Jul 12, 2016

rdolbeau commented Jul 12, 2016 (edited)

thegreatgazoo commented Jul 12, 2016

thegreatgazoo commented Jul 12, 2016

ironMann commented Jul 12, 2016
