-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TRIM support #661
Add TRIM support #661
Conversation
* #define DF_WAIT_SYNC 0x00000001 / * Wait for full write-out of free. * / | ||
* IOStorageUnmapOptions is only 0 | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using command supplied by rottegift fs_usage -f filesys -w | egrep -i 'unmap|trim'
shows that the ldi_vnode.c
ioctl works, but the ldi_iokit
call to unmap does nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any ideas @evansus ? Since it works in vnode mode, I assume it has nothing to do with trimforce
etc. With calling unmap()
the only things that can be wrong is media
and client
, but hard to see what they could otherwise be. I suppose I could also drop the iokit version, and find a way to call over to vnode version for trim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed on IRC, the error return should be checked as
if ((error = <assignment>) != kIOReturnSuccess) {
<debug print error return>
<handle error>
}
As opposed to checking if the error == 0 and ignoring other results (kIOReturnSuccess == 0)
And if this returns failure, zfs won’t retry trim/unmap on this vdev until it is reopened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - fixed error check and changed returncode so it will give up trims to the device. Also know now that unmap
calls fails due to IOSCSIBlockCommandsDevice::IsUnmapAllowed
. Which could be due to VMware disks - might need testing on real hardware.
c511844
to
31330b7
Compare
Latest version of code triggers:
|
Above panic appears to be due to bad merge, apologies for the noise. |
Ok, the IOkit unmap still does not pass |
This PR will be replaced by the refreshed ZOL PR openzfs/zfs#8419 once it is ready. Meanwhile, we have the issue of IOKit Unmap being rejected. With the help of Hopper I took a look at Which in turn appears to use something along the lines of So it should simply be a matter of
So there should be no reason for |
We want newer versions of libzfs_core to run against an existing zfs kernel module (i.e. a deferred reboot or module reload after an update). Programmatically document, via a zfs_ioc_key_t, the valid arguments for the ioc commands that rely on nvpair input arguments (i.e. non legacy commands from libzfs_core). Automatically verify the expected pairs before dispatching a command. This initial phase focuses on the non-legacy ioctls. A follow-on change can address the legacy ioctl input from the zfs_cmd_t. The zfs_ioc_key_t for zfs_keys_channel_program looks like: static const zfs_ioc_key_t zfs_keys_channel_program[] = { {"program", DATA_TYPE_STRING, 0}, {"arg", DATA_TYPE_UNKNOWN, 0}, {"sync", DATA_TYPE_BOOLEAN_VALUE, ZK_OPTIONAL}, {"instrlimit", DATA_TYPE_UINT64, ZK_OPTIONAL}, {"memlimit", DATA_TYPE_UINT64, ZK_OPTIONAL}, }; Introduce four input errors to identify specific input failures (in addition to generic argument value errors like EINVAL, ERANGE, EBADF, and E2BIG). ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]>
Added -n flag to zpool reopen that allows a running scrub operation to continue if there is a device with Dirty Time Log. By default if a component device has a DTL and zpool reopen is executed all running scan operations will be restarted. Added functional tests for `zpool reopen` Tests covers following scenarios: * `zpool reopen` without arguments, * `zpool reopen` with pool name as argument, * `zpool reopen` while scrubbing, * `zpool reopen -n` while scrubbing, * `zpool reopen -n` while resilvering, * `zpool reopen` with bad arguments. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Arkadiusz Bubała <[email protected]>
PROBLEM ======== When invoking "zpool initialize" on a pool the command will create a thread to initialize each disk. Unfortunately, it does this serially across many transaction groups which can result in commands taking a long time to return to the user and may appear hung. The same thing is true when trying to suspend/cancel the operation. SOLUTION ========= This change refactors the way we invoke the initialize interface to ensure we can start or stop the intialization in just a few transaction groups. When stopping or cancelling a vdev initialization perform it in two phases. First signal each vdev initialization thread that it should exit, then after all threads have been signaled wait for them to exit. On a pool with 40 leaf vdevs this reduces the vdev initialize stop/cancel time from ~10 minutes to under a second. The reason for this is spa_vdev_initialize() no longer needs to wait on multiple full TXGs per leaf vdev being stopped. This commit additionally adds some missing checks for the passed "initialize_vdevs" input nvlist. The contents of the user provided input "initialize_vdevs" nvlist must be validated to ensure all values are uint64s. This is done in zfs_ioc_pool_initialize() in order to keep all of these checks in a single location. Updated the innvl and outnvl comments to match the formatting used for all other new sytle ioctls. Reviewed by: Matt Ahrens <[email protected]> Reviewed-by: loli10K <[email protected]> Reviewed-by: Tim Chase <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: George Wilson <[email protected]>
Original-patch-by: Saso Kiselkov <[email protected]> Contributions-by: Tim Chase <[email protected]> Contributions-by: Chunwei Chen <[email protected]> Contributions-by: Brian Behlendorf <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Fix extended iostat reporting. We should consider bundling the trim and autotrim IOs together to keep the reporting clear. While useful during development I don't see a use case for keeping them split. * Include aggregate TRIMs in request size histograms. Currently TRIMs will never be aggregated but that's not set in stone. So let's report this in the same way as all other IOs. A module options was added to allow aggreation for testing. * Adding missing NULL to IOS_LATENCY headers. * Prevent print_iostat_labels from underflowing when calculating the number of spaces to print. * Reduce number of threads allowing in TRIM taskq * Updated man pages. * Switched the functional/trim/* ZTS tests to file based vdev, this allows for testing raidz3. Additionally, reduced the maximum file size and number of directories to speed up the test. * In vdev_autotrim_thread() check the vdev_autotrim_thread under the appropriate lock. * Fix ztest ASSERT where a top-level device is replaced by its sole child. In this case the autotrim process must be stopped on the tlv and then restarted if needed on the new tld. Signed-off-by: Brian Behlendorf <[email protected]>
While useful for debugging this particular distinction does not need to be presented to users. Consolidating the types allows for further simplification of the code and CLI reporting. Signed-off-by: Brian Behlendorf <[email protected]>
* Adding missing zpool_trim.kshlib to Makefile.am * space_map_allocated() should be metslab_allocated_space() * The `zpool trim` command should not print an error for devices which do not support TRIM when only the pool name was given. Signed-off-by: Brian Behlendorf <[email protected]>
* Generate full path names for vdevs. When automatically adding vdevs to be trimmed for a pool use the full path name. This ensures the correct device will be found even when using symlinks from non-standard locations. * Assorted ZTS reliability improvements for issues encountered by the CI. Signed-off-by: Brian Behlendorf <[email protected]>
Initially, metaslabs and space maps used to be the same thing in ZFS. Later, we started differentiating them by referring to the space map as the on-disk state of the metaslab, making the metaslab a higher-level concept that is metadata that deals with space accounting. Today we've managed to split that code furthermore, with the space map being its own on-disk data structure used in areas of ZFS besides metaslabs (e.g. the vdev-wide space maps used for zpool checkpoint or vdev removal features). This patch refactors the space map code to further split the space map code from the metaslab code. It does so by getting rid of the idea that the space map can have a different in-core and on-disk length (sm_length vs smp_length) which is something that is only used for the metaslab code, and other consumers of space maps just have to deal with. Instead, this patch introduces changes that move the old in-core length of the metaslab's space map to the metaslab structure itself (see ms_synced_length field) while making the space map code only care about the actual space map's length on-disk. The result of this is that space map consumers no longer have to deal with syncing two different lengths for the same structure (e.g. space_map_update() goes away) while metaslab specific behavior stays within the metaslab code. Specifically, the ms_synced_length field keeps track of the amount of data metaslab_load() can read from the metaslab's space map while working concurrently with metaslab_sync() that may be appending to that same space map. As a side note, the patch also adds a few comments around the metaslab code documenting some assumptions and expected behavior. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed by: Pavel Zakharov <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]>
This PR has now been refreshed with the ZOL PR openzfs/zfs#8419 |
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Before a metaslab can be unmarked and new allocations allowed, all inflight TRIMs still being handled need to be complete. Signed-off-by: Brian Behlendorf <[email protected]>
Add DKIOCGETFEATURES to probe for UNMAP support of the device. Correct use of VDEV_NAME_PATH to make sure shortnames work.
* Start the autotrim thread as needed when adding a new top level device to the pool. * Reverse the order metaslabs are trimmed in, they are now handled from highest to lowest in order to front load the largest expected benefits from TRIM. Metaslabs nearer the end of the device are expected to contain more free space. * Provide a larger tolence for the zpool_trim_partial and zpool_trim_verify_trimmed test cases. Spurious failures observed when running in the CI. Additionally fix an issue where the partial TRIM flag was being preserved between subsequent manual runs. * Add zfs_vdev_trim_min_active and zfs_vdev_trim_max_active module options and documenation. Reduced default maximum number of outstanding TRIMs to 2 to minimize impact of TRIM during normal pool operation. * ztest: Always stop an autotrim thread in vdev_remove_complete(). * Added gettext() calls when printing initialize/trim status. * Use destroy_pool for cleanup in ZTS tests.
* Speed up test cases which will be frequently run and were observed to be slow in the CI. Locally, in a VM it takes roughly 5 minutes to run all of the ZTS trim tests. * Use the standard short header for all ZTS tests. * Update several tests to resolve some spurious failures. * Add trim.cfg for convenience. * Minor man page rewording. Signed-off-by: Brian Behlendorf <[email protected]>
This functionality was originally added to aid debugging by writing a known pattern. It is no longer required and can be removed. Signed-off-by: Brian Behlendorf <[email protected]>
Minor updates, more are expected. What missing is still the big theory section which I'll follow up with. Signed-off-by: Brian Behlendorf <[email protected]>
* Added ioctl coverage for ZFS_IOC_POOL_TRIM. * Renamed POOL_INITIALIZE_DO to POOL_INITIALIZE_START * Renamed POOL_TRIM_DO to POOL_TRIM_START Signed-off-by: Brian Behlendorf <[email protected]>
While it would be nice to be able to migrate autotrim threads from top-level vdev to children and visa-versa that's not easy to accomplish due to the required coordination. Therefore, since configuation changes are relatively infrequent always stop and restart the autotrim threads for the correct vdevs. Signed-off-by: Brian Behlendorf <[email protected]>
* Convert the vdev initialize metaslab exclusion code in to generic functionality which can be used by initialize, TRIM, and other planned features. * Includes vdev initialize fix for 8461 Original-patch-by: Matt Ahrens <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]>
* Added -d flag to `zpool trim` to request a secure discard. This operation will fail if not supported by the underlying device. * When performing a secure discard the minimum extent size is reduced to SPA_MINBLOCKSIZE so no ranges are skipped. * The autotrim has not been updated to perform a secure trim. This could be supported by adding a 'securetrim=on|off' property if this is a desirable feature. * Added autoconf code to determine if the kernel supports secure trimming. 2.6.36 and newers kernels all provide this support. When the kernel or does not support the flag the `zpool trim -d` command with fail * Minor additional changes. - s/fulltrim/partial/ this cleanup was previous overlooked. - Renamed vd->vdev_notrim to vd->vdev_trim and fixed callers. This is consistent with the existing vdev naming convertions. * Updated man pages and added a partial test case. Additional infrasture may need to be added to reliably verify this functionality in the ZTS. Signed-off-by: Brian Behlendorf <[email protected]>
* Resolve vdev_initialize crash uncovered by ztest. This issue was not introduced by TRIM but the same case exists for TRIM and the same fix was applied. When restarting check that a removal wasn't started while not initializing. * Fix clearing kstat * Removed minimum rate limit. Signed-off-by: Brian Behlendorf <[email protected]>
Refactoring of metaslab_enable/disable, there is a race where the sync doesn't happen because two threads were exiting simultaneously. Remove this optimization. Signed-off-by: Brian Behlendorf <[email protected]>
* Updated documentation and improved comments. * Changed delay() to an updated version of txg_wait_open() which does not always force a txg sync. This allows the TRIM code to patiently wait for the next open transaction. * SPA_ASYNC_AUTOTRIM -> SPA_ASYNC_AUTOTRIM_RESTART * P2PHASE changes Signed-off-by: Brian Behlendorf <[email protected]>
* Added and improved existing comments. * Resolve false positive in zpool_trim_verify_checksums.ksh. The pool must be exported in order to use zdb to reliably validate it. * Moved vdev_autotrim_restart() under the spa_namespace_lock to serialize when the autotrim threads can be restarted. This resolves a failure observed by ztest when detaching. The vdev_trim_restart() and vdev_initialize_restart() functions resolved this issue in the same fashion. Signed-off-by: Brian Behlendorf <[email protected]>
When a pool is full the sync task may not run due to lack of space. This can result in a memory leak when the sync task provied to dsl_sync_task_nowait() is responsible for freeing the passed arg. For now, handle this by switch to ZFS_SPACE_CHECK_NONE to ensure it always gets called which is already what's done for the other dsl_sync_task_nowait() callers. A more complete fix has been provided for initial review as openzfs/zfs#8497. Signed-off-by: Brian Behlendorf <[email protected]>
* Updated comments. * Converted ZIO_PRIORITY_TRIM to ASSERTs in vdev_queue_io() Signed-off-by: Brian Behlendorf <[email protected]>
* Added Big Theory section which discusses the high level details of both manual and automatic TRIM. * Updated partial TRIM description in zpool(8) using Richard Laagers proposed language. * Documented the hot spare / replacing limitation. * Brought zio_vdev_io_assess() back inline with the existing code and fixed the 'persistent bit' portion of the comment. * Fixed typos in comments. Signed-off-by: Brian Behlendorf <[email protected]>
During review it was observed that the current semantics of a partial TRIM are not well-defined. If we make the current partial trim a first class feature, folks will come to rely on the current behavior, despite any disclaimers in the manpage. This creates risk to future changes, which must now maintain the behavior of trying not to touch new metaslabs until absolutely necessary, lest partial trims become less useful. For this reason, until the work is done to fully design a well-defined partial TRIM feature it has been removed from the `zpool trim` command. That said, the existing partial TRIM behavior is still highly useful for very large thinly-provisioned pools. Therefore, it has been converted to a new 'zfs_trim_metaslab_skip' module option which is the appropriate place for this kind of setting. All of the existing partial TRIM behavior has been retained. To issue a partial TRIM set 'zfs_trim_metaslab_skip=1' then initiate a manual TRIM. After this point the module option may be changed without any effect on the running TRIM. The setting is stored on-disk and will persist for the duration of the requested TRIM. $ echo 1 >/sys/module/zfs/parameters/zfs_trim_metaslab_skip $ zpool trim partial-pool This means that a full TRIM can be requested for a different pool while the previous partial TRIM is still running. $ echo 0 >/sys/module/zfs/parameters/zfs_trim_metaslab_skip $ zpool trim full-pool Signed-off-by: Brian Behlendorf <[email protected]> On OSX the instructions are: $ sysctl kstat.zfs.darwin.tunable.zfs_trim_metaslab_skip = 1
The issued_trim flag must always be set to B_TRUE before calling vdev_trim_ranges() since this function may be interrupted. In which case there can be outstanding TRIM I/Os which must complete before allowing new allocations. Signed-off-by: Brian Behlendorf <[email protected]>
Observed while running ztest. The SCL_CONFIG read lock should not be held when calling metaslab_disable() because the following dead lock is possible. This was introduced by allowing the metaslab_enable() function to wait for a txg sync. Thread 1191 (Thread 0x7ffa908c1700 (LWP 16776)): cv_wait spa_config_enter -> Waiting for SCL_CONFIG read lock, blocked due to pending writer (994) spa_txg_history_init_io txg_sync_thread Thread 994 (Thread 0x7ffa903a4700 (LWP 16791)): cv_wait spa_config_enter -> Waiting for SCL_ALL write lock, blocked due to holder (983) spa_vdev_exit spa_vdev_add ztest_vdev_aux_add_remove -> Holding ztest_vdev_lock ztest_execute Thread 1001 (Thread 0x7ffa90902700 (LWP 18020)): cv_wait txg_wait_synced -> Waiting on txg sync to decrement mg_ms_disabled (1191) metaslab_enable vdev_autotrim_thread Thread 983 (Thread 0x7ffa63451700 (LWP 18019)): cv_wait metaslab_group_disabled_increment -> Waiting for mg_ms_disabled to be less than max_disabled_ms (1001) metaslab_disable vdev_initialize_thread -> *** Holding SCL_CONFIG read lock *** Signed-off-by: Brian Behlendorf <[email protected]>
Always wait for the next txg after checking a metaslab group. When no TRIM commands were issued then don't force a quiesce so idle pools continue to generate no-op txgs. Signed-off-by: Brian Behlendorf <[email protected]>
Merged with 38a6bed |
No description provided.