Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TRIM support #661

Closed
wants to merge 36 commits into from
Closed

Add TRIM support #661

wants to merge 36 commits into from

Conversation

lundman
Copy link
Contributor

@lundman lundman commented Oct 12, 2018

No description provided.

* #define DF_WAIT_SYNC 0x00000001 / * Wait for full write-out of free. * /
* IOStorageUnmapOptions is only 0
*/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using command supplied by rottegift fs_usage -f filesys -w | egrep -i 'unmap|trim' shows that the ldi_vnode.c ioctl works, but the ldi_iokit call to unmap does nothing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any ideas @evansus ? Since it works in vnode mode, I assume it has nothing to do with trimforce etc. With calling unmap() the only things that can be wrong is media and client, but hard to see what they could otherwise be. I suppose I could also drop the iokit version, and find a way to call over to vnode version for trim.

Copy link
Contributor

@evansus evansus Oct 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on IRC, the error return should be checked as

if ((error = <assignment>) != kIOReturnSuccess) {
   <debug print error return>
   <handle error>
}

As opposed to checking if the error == 0 and ignoring other results (kIOReturnSuccess == 0)

And if this returns failure, zfs won’t retry trim/unmap on this vdev until it is reopened.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - fixed error check and changed returncode so it will give up trims to the device. Also know now that unmap calls fails due to IOSCSIBlockCommandsDevice::IsUnmapAllowed. Which could be due to VMware disks - might need testing on real hardware.

@lundman
Copy link
Contributor Author

lundman commented Nov 5, 2018

Latest version of code triggers:

panic(cpu 0 caller 0xffffff7f8373b9d0): zfs: allocating allocated segment(offset=512 size=134217728) of (offset=512 size=2147483136)
Backtrace (CPU 0), Frame : Return Address
0xffffff9121abbad0 : 0xffffff8000cdab52 mach_kernel : _panic + 0xe2
0xffffff9121abbb50 : 0xffffff7f8373b9d0 net.lundman.spl : _vcmn_err + 0x90
0xffffff9121abbc80 : 0xffffff7f849f8c9f net.lundman.zfs : _zfs_panic_recover + 0x6f
0xffffff9121abbcf0 : 0xffffff7f849aa562 net.lundman.zfs : _range_tree_add_impl + 0x282
0xffffff9121abbd90 : 0xffffff7f849ab611 net.lundman.zfs : _range_tree_walk + 0x31
0xffffff9121abbdc0 : 0xffffff7f849a9905 net.lundman.zfs : _metaslab_trim_done + 0x55
0xffffff9121abbdf0 : 0xffffff7f84a3362c net.lundman.zfs : _zio_done + 0xc2c
0xffffff9121abbe50 : 0xffffff7f84a2da1d net.lundman.zfs : _zio_execute + 0x2cd
0xffffff9121abbea0 : 0xffffff7f84a3251b net.lundman.zfs : _zio_vdev_io_start + 0x37b
0xffffff9121abbee0 : 0xffffff7f84a2dd9d net.lundman.zfs : ___zio_execute + 0x2cd
0xffffff9121abbf30 : 0xffffff7f8374e8f5 net.lundman.spl : _taskq_thread + 0x265
0xffffff9121abbfb0 : 0xffffff8000dca257 mach_kernel : _call_continuation + 0x17

@lundman
Copy link
Contributor Author

lundman commented Nov 5, 2018

Above panic appears to be due to bad merge, apologies for the noise.

@lundman
Copy link
Contributor Author

lundman commented Nov 5, 2018

Ok, the IOkit unmap still does not pass IsUnmapAllowed on hardware.

@lundman
Copy link
Contributor Author

lundman commented Feb 21, 2019

This PR will be replaced by the refreshed ZOL PR openzfs/zfs#8419 once it is ready.

Meanwhile, we have the issue of IOKit Unmap being rejected. With the help of Hopper I took a look at IOSCSIBlockCommandsDevice::IsUnmapAllowed() which just returns the this value at offset 0x31. It is only set by IOSCSIBlockCommandsDevice::LogicalBlockProvisioningUnmapSupport.

Which in turn appears to use something along the lines of if (exists boolean of values of entries whose (key of it = "Unmap") of dictionaries of values of entries whose (key of it = "IOStorageFeatures") of (dictionaries of nodes of nodes of nodes of it; dictionaries of it) of (if (name of it = "SATA") of node "AppleACPIPCI" of node "PCI0" of node "AppleACPIPlatformExpert" of service plane of iokit registry)

So it should simply be a matter of ioreg -al having an entry like:


        <key>IORegistryEntryName</key>
        <string>Macintosh HD</string>
        <key>IOStorageFeatures</key>
        <dict>
                <key>Unmap</key>
                <true/>
        </dict>

So there should be no reason for ::Unmap() to not work and it must be a bug in how we call it.

Don Brady and others added 9 commits February 22, 2019 12:12
We want newer versions of libzfs_core to run against an existing
zfs kernel module (i.e. a deferred reboot or module reload after
an update).

Programmatically document, via a zfs_ioc_key_t, the valid arguments
for the ioc commands that rely on nvpair input arguments (i.e. non
legacy commands from libzfs_core). Automatically verify the expected
pairs before dispatching a command.

This initial phase focuses on the non-legacy ioctls. A follow-on
change can address the legacy ioctl input from the zfs_cmd_t.

The zfs_ioc_key_t for zfs_keys_channel_program looks like:

static const zfs_ioc_key_t zfs_keys_channel_program[] = {
       {"program",     DATA_TYPE_STRING,               0},
       {"arg",         DATA_TYPE_UNKNOWN,              0},
       {"sync",        DATA_TYPE_BOOLEAN_VALUE,        ZK_OPTIONAL},
       {"instrlimit",  DATA_TYPE_UINT64,               ZK_OPTIONAL},
       {"memlimit",    DATA_TYPE_UINT64,               ZK_OPTIONAL},
};

Introduce four input errors to identify specific input failures
(in addition to generic argument value errors like EINVAL, ERANGE,
EBADF, and E2BIG).

ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel
ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel
ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing
ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Added -n flag to zpool reopen that allows a running scrub
operation to continue if there is a device with Dirty Time Log.

By default if a component device has a DTL and zpool reopen
is executed all running scan operations will be restarted.

Added functional tests for `zpool reopen`

Tests covers following scenarios:
* `zpool reopen` without arguments,
* `zpool reopen` with pool name as argument,
* `zpool reopen` while scrubbing,
* `zpool reopen -n` while scrubbing,
* `zpool reopen -n` while resilvering,
* `zpool reopen` with bad arguments.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Arkadiusz Bubała <[email protected]>
PROBLEM
========

When invoking "zpool initialize" on a pool the command will
create a thread to initialize each disk. Unfortunately, it does
this serially across many transaction groups which can result
in commands taking a long time to return to the user and may
appear hung. The same thing is true when trying to suspend/cancel
the operation.

SOLUTION
=========

This change refactors the way we invoke the initialize interface
to ensure we can start or stop the intialization in just a few
transaction groups.

When stopping or cancelling a vdev initialization perform it
in two phases.  First signal each vdev initialization thread
that it should exit, then after all threads have been signaled
wait for them to exit.

On a pool with 40 leaf vdevs this reduces the vdev initialize
stop/cancel time from ~10 minutes to under a second.  The reason
for this is spa_vdev_initialize() no longer needs to wait on
multiple full TXGs per leaf vdev being stopped.

This commit additionally adds some missing checks for the passed
"initialize_vdevs" input nvlist.  The contents of the user provided
input "initialize_vdevs" nvlist must be validated to ensure all
values are uint64s.  This is done in zfs_ioc_pool_initialize() in
order to keep all of these checks in a single location.

Updated the innvl and outnvl comments to match the formatting used
for all other new sytle ioctls.

Reviewed by: Matt Ahrens <[email protected]>
Reviewed-by: loli10K <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: George Wilson <[email protected]>
Original-patch-by: Saso Kiselkov <[email protected]>
Contributions-by: Tim Chase <[email protected]>
Contributions-by: Chunwei Chen <[email protected]>
Contributions-by: Brian Behlendorf <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
* Fix extended iostat reporting.  We should consider bundling
  the trim and autotrim IOs together to keep the reporting
  clear.  While useful during development I don't see a use
  case for keeping them split.

* Include aggregate TRIMs in request size histograms.  Currently
  TRIMs will never be aggregated but that's not set in stone.  So
  let's report this in the same way as all other IOs.  A module
  options was added to allow aggreation for testing.

* Adding missing NULL to IOS_LATENCY headers.

* Prevent print_iostat_labels from underflowing when calculating
  the number of spaces to print.

* Reduce number of threads allowing in TRIM taskq

* Updated man pages.

* Switched the functional/trim/* ZTS tests to file based vdev,
  this allows for testing raidz3.  Additionally, reduced the
  maximum file size and number of directories to speed up the
  test.

* In vdev_autotrim_thread() check the vdev_autotrim_thread under
  the appropriate lock.

* Fix ztest ASSERT where a top-level device is replaced by its
  sole child.  In this case the autotrim process must be stopped
  on the tlv and then restarted if needed on the new tld.

Signed-off-by: Brian Behlendorf <[email protected]>
While useful for debugging this particular distinction does not
need to be presented to users.  Consolidating the types allows
for further simplification of the code and CLI reporting.

Signed-off-by: Brian Behlendorf <[email protected]>
* Adding missing zpool_trim.kshlib to Makefile.am

* space_map_allocated() should be metslab_allocated_space()

* The `zpool trim` command should not print an error for
  devices which do not support TRIM when only the pool name
  was given.

Signed-off-by: Brian Behlendorf <[email protected]>
* Generate full path names for vdevs.  When automatically adding
  vdevs to be trimmed for a pool use the full path name.  This
  ensures the correct device will be found even when using
  symlinks from non-standard locations.

* Assorted ZTS reliability improvements for issues encountered
  by the CI.

Signed-off-by: Brian Behlendorf <[email protected]>
Initially, metaslabs and space maps used to be the same thing
in ZFS. Later, we started differentiating them by referring
to the space map as the on-disk state of the metaslab, making
the metaslab a higher-level concept that is metadata that deals
with space accounting. Today we've managed to split that code
furthermore, with the space map being its own on-disk data
structure used in areas of ZFS besides metaslabs (e.g. the
vdev-wide space maps used for zpool checkpoint or vdev removal
features).

This patch refactors the space map code to further split the
space map code from the metaslab code. It does so by getting
rid of the idea that the space map can have a different in-core
and on-disk length (sm_length vs smp_length) which is something
that is only used for the metaslab code, and other consumers
of space maps just have to deal with. Instead, this patch
introduces changes that move the old in-core length of the
metaslab's space map to the metaslab structure itself (see
ms_synced_length field) while making the space map code only
care about the actual space map's length on-disk.

The result of this is that space map consumers no longer have
to deal with syncing two different lengths for the same
structure (e.g. space_map_update() goes away) while metaslab
specific behavior stays within the metaslab code. Specifically,
the ms_synced_length field keeps track of the amount of data
metaslab_load() can read from the metaslab's space map while
working concurrently with metaslab_sync() that may be
appending to that same space map.

As a side note, the patch also adds a few comments around
the metaslab code documenting some assumptions and expected
behavior.

Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
@lundman lundman changed the title OS-7058 need zfs trim support Add TRIM support Feb 22, 2019
@lundman
Copy link
Contributor Author

lundman commented Feb 22, 2019

This PR has now been refreshed with the ZOL PR openzfs/zfs#8419

@lundman lundman mentioned this pull request Feb 22, 2019
12 tasks
lundman and others added 5 commits February 22, 2019 17:04
Before a metaslab can be unmarked and new allocations allowed, all
inflight TRIMs still being handled need to be complete.

Signed-off-by: Brian Behlendorf <[email protected]>
Add DKIOCGETFEATURES to probe for UNMAP support of the device.

Correct use of VDEV_NAME_PATH to make sure shortnames work.
* Start the autotrim thread as needed when adding a new top
  level device to the pool.

* Reverse the order metaslabs are trimmed in, they are now
  handled from highest to lowest in order to front load the
  largest expected benefits from TRIM.  Metaslabs nearer the
  end of the device are expected to contain more free space.

* Provide a larger tolence for the zpool_trim_partial and
  zpool_trim_verify_trimmed test cases.  Spurious failures
  observed when running in the CI.  Additionally fix an
  issue where the partial TRIM flag was being preserved
  between subsequent manual runs.

* Add zfs_vdev_trim_min_active and zfs_vdev_trim_max_active
  module options and documenation.  Reduced default maximum
  number of outstanding TRIMs to 2 to minimize impact of TRIM
  during normal pool operation.

* ztest: Always stop an autotrim thread in vdev_remove_complete().

* Added gettext() calls when printing initialize/trim status.

* Use destroy_pool for cleanup in ZTS tests.
lundman and others added 21 commits February 27, 2019 11:05
* Speed up test cases which will be frequently run and were observed
  to be slow in the CI.  Locally, in a VM it takes roughly 5 minutes
  to run all of the ZTS trim tests.

* Use the standard short header for all ZTS tests.

* Update several tests to resolve some spurious failures.

* Add trim.cfg for convenience.

* Minor man page rewording.

Signed-off-by: Brian Behlendorf <[email protected]>
This functionality was originally added to aid debugging by writing
a known pattern.  It is no longer required and can be removed.

Signed-off-by: Brian Behlendorf <[email protected]>
Minor updates, more are expected.  What missing is still the
big theory section which I'll follow up with.

Signed-off-by: Brian Behlendorf <[email protected]>
* Added ioctl coverage for ZFS_IOC_POOL_TRIM.
* Renamed POOL_INITIALIZE_DO to POOL_INITIALIZE_START
* Renamed POOL_TRIM_DO to POOL_TRIM_START

Signed-off-by: Brian Behlendorf <[email protected]>
While it would be nice to be able to migrate autotrim threads
from top-level vdev to children and visa-versa that's not easy
to accomplish due to the required coordination.  Therefore,
since configuation changes are relatively infrequent always
stop and restart the autotrim threads for the correct vdevs.

Signed-off-by: Brian Behlendorf <[email protected]>
* Convert the vdev initialize metaslab exclusion code in to
  generic functionality which can be used by initialize,
  TRIM, and other planned features.

* Includes vdev initialize fix for 8461

Original-patch-by: Matt Ahrens <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
* Added -d flag to `zpool trim` to request a secure discard.  This
  operation will fail if not supported by the underlying device.

* When performing a secure discard the minimum extent size is
  reduced to SPA_MINBLOCKSIZE so no ranges are skipped.

* The autotrim has not been updated to perform a secure trim.
  This could be supported by adding a 'securetrim=on|off' property
  if this is a desirable feature.

* Added autoconf code to determine if the kernel supports secure
  trimming.  2.6.36 and newers kernels all provide this support.
  When the kernel or does not support the flag the `zpool trim -d`
  command with fail

* Minor additional changes.
  - s/fulltrim/partial/ this cleanup was previous overlooked.
  - Renamed vd->vdev_notrim to vd->vdev_trim and fixed callers.
    This is consistent with the existing vdev naming convertions.

* Updated man pages and added a partial test case.  Additional
  infrasture may need to be added to reliably verify this
  functionality in the ZTS.

Signed-off-by: Brian Behlendorf <[email protected]>
* Resolve vdev_initialize crash uncovered by ztest.  This issue was
  not introduced by TRIM but the same case exists for TRIM and the
  same fix was applied.  When restarting check that a removal wasn't
  started while not initializing.

* Fix clearing kstat

* Removed minimum rate limit.

Signed-off-by: Brian Behlendorf <[email protected]>
Refactoring of metaslab_enable/disable, there is a race where the
sync doesn't happen because two threads were exiting simultaneously.
Remove this optimization.

Signed-off-by: Brian Behlendorf <[email protected]>
* Updated documentation and improved comments.

* Changed delay() to an updated version of txg_wait_open()
  which does not always force a txg sync.  This allows the
  TRIM code to patiently wait for the next open transaction.

* SPA_ASYNC_AUTOTRIM -> SPA_ASYNC_AUTOTRIM_RESTART

* P2PHASE changes

Signed-off-by: Brian Behlendorf <[email protected]>
* Added and improved existing comments.

* Resolve false positive in zpool_trim_verify_checksums.ksh.  The
  pool must be exported in order to use zdb to reliably validate it.

* Moved vdev_autotrim_restart() under the spa_namespace_lock to
  serialize when the autotrim threads can be restarted.  This
  resolves a failure observed by ztest when detaching.  The
  vdev_trim_restart() and vdev_initialize_restart() functions
  resolved this issue in the same fashion.

Signed-off-by: Brian Behlendorf <[email protected]>
When a pool is full the sync task may not run due to lack of space.
This can result in a memory leak when the sync task provied to
dsl_sync_task_nowait() is responsible for freeing the passed arg.
For now, handle this by switch to ZFS_SPACE_CHECK_NONE to ensure
it always gets called which is already what's done for the other
dsl_sync_task_nowait() callers.

A more complete fix has been provided for initial review as
openzfs/zfs#8497.

Signed-off-by: Brian Behlendorf <[email protected]>
* Updated comments.
* Converted ZIO_PRIORITY_TRIM to ASSERTs in vdev_queue_io()

Signed-off-by: Brian Behlendorf <[email protected]>
* Added Big Theory section which discusses the high level details
  of both manual and automatic TRIM.

* Updated partial TRIM description in zpool(8) using Richard
  Laagers proposed language.

* Documented the hot spare / replacing limitation.

* Brought zio_vdev_io_assess()  back inline with the existing code
  and fixed the 'persistent bit' portion of the comment.

* Fixed typos in comments.

Signed-off-by: Brian Behlendorf <[email protected]>
During review it was observed that the current semantics of a partial
TRIM are not well-defined.  If we make the current partial trim a first
class feature, folks will come to rely on the current behavior, despite
any disclaimers in the manpage. This creates risk to future changes, which
must now maintain the behavior of trying not to touch new metaslabs until
absolutely necessary, lest partial trims become less useful.

For this reason, until the work is done to fully design a well-defined
partial TRIM feature it has been removed from the `zpool trim` command.

That said, the existing partial TRIM behavior is still highly useful for
very large thinly-provisioned pools.  Therefore, it has been converted to
a new 'zfs_trim_metaslab_skip' module option which is the appropriate
place for this kind of setting.  All of the existing partial TRIM behavior
has been retained.

To issue a partial TRIM set 'zfs_trim_metaslab_skip=1' then initiate a
manual TRIM.  After this point the module option may be changed without
any effect on the running TRIM.  The setting is stored on-disk and will
persist for the duration of the requested TRIM.

$ echo 1 >/sys/module/zfs/parameters/zfs_trim_metaslab_skip
$ zpool trim partial-pool

This means that a full TRIM can be requested for a different pool while
the previous partial TRIM is still running.

$ echo 0 >/sys/module/zfs/parameters/zfs_trim_metaslab_skip
$ zpool trim full-pool

Signed-off-by: Brian Behlendorf <[email protected]>

On OSX the instructions are:

$ sysctl kstat.zfs.darwin.tunable.zfs_trim_metaslab_skip = 1
The issued_trim flag must always be set to B_TRUE before calling
vdev_trim_ranges() since this function may be interrupted.  In
which case there can be outstanding TRIM I/Os which must complete
before allowing new allocations.

Signed-off-by: Brian Behlendorf <[email protected]>
Observed while running ztest.  The SCL_CONFIG read lock should
not be held when calling metaslab_disable() because the following
dead lock is possible.  This was introduced by allowing the
metaslab_enable() function to wait for a txg sync.

Thread 1191 (Thread 0x7ffa908c1700 (LWP 16776)):
  cv_wait
  spa_config_enter		-> Waiting for SCL_CONFIG read lock,
				   blocked due to pending writer (994)
  spa_txg_history_init_io
  txg_sync_thread

Thread 994 (Thread 0x7ffa903a4700 (LWP 16791)):
  cv_wait
  spa_config_enter		-> Waiting for SCL_ALL write lock,
				   blocked due to holder (983)
  spa_vdev_exit
  spa_vdev_add
  ztest_vdev_aux_add_remove	-> Holding ztest_vdev_lock
  ztest_execute

Thread 1001 (Thread 0x7ffa90902700 (LWP 18020)):
  cv_wait
  txg_wait_synced		-> Waiting on txg sync to decrement
				   mg_ms_disabled (1191)
  metaslab_enable
  vdev_autotrim_thread

Thread 983 (Thread 0x7ffa63451700 (LWP 18019)):
  cv_wait
  metaslab_group_disabled_increment -> Waiting for mg_ms_disabled to be
				       less than max_disabled_ms (1001)
  metaslab_disable
  vdev_initialize_thread	-> *** Holding SCL_CONFIG read lock ***

Signed-off-by: Brian Behlendorf <[email protected]>
Always wait for the next txg after checking a metaslab group.
When no TRIM commands were issued then don't force a quiesce
so idle pools continue to generate no-op txgs.

Signed-off-by: Brian Behlendorf <[email protected]>
@lundman
Copy link
Contributor Author

lundman commented Mar 29, 2019

Merged with 38a6bed

@lundman lundman closed this Mar 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants