Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TRIM support #8419

Merged
merged 7 commits into from
Mar 29, 2019
Merged

Add TRIM support #8419

merged 7 commits into from
Mar 29, 2019

Conversation

behlendorf
Copy link
Contributor

@behlendorf behlendorf commented Feb 16, 2019

Motivation and Context

UNMAP/TRIM support is a frequently-requested feature to help prevent performance from degrading on SSDs and on various other SAN-like storage back-ends. By issuing UNMAP/TRIM commands for sectors which are no longer allocated the underlying device can often more efficiently manage itself.

Description

This TRIM implementation is modeled on the zpool initialize feature which writes a pattern to all unallocated space in the pool. The new zpool trim command uses the same vdev_xlate() code to calculate what sectors are unallocated, the same per-vdev TRIM thread model and locking, and the same basic CLI for a consistent user experience. The core difference is that instead of writing a pattern it will issue UNMAP/TRIM commands for those extents.

The zio pipeline was updated to accomidate this by adding a new ZIO_TYPE_TRIM type and associated spa taskq. This new type makes is straight forward to add the platform specific TRIM/UNMAP calls to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs. This made it possible to largely avoid changing the pipeline, one exception is that TRIM zio's may exceed the 16M block size limit since they contain no data.

In addition to the manual zpool trim command a background automatic TRIM was added and is controlled by the 'autotrim' property. It relies on the exact same infrastructure as the manual TRIM. However, instead of relying on the extents in a metaslab's ms_allocatable range tree, a ms_trim tree is kept per metaslab. When 'autotrim=on' ranges added back to the ms_allocatable tree are also added to the ms_trim tree. The ms_trim tree is then periodically consumed by an autotrim thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small there is still value in occasionally running a full zpool trim when autotrim=on. This may be useful when the freed blocks are small and not enough time was allowed to aggregate them. An automatic TRIM and a manual zpool trim may be run concurrently, in which case the automatic TRIM will yield to the manual TRIM.

Added commands and example output.

# Reports TRIM progress.
$ zpool status -t

  pool: testpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:09:35 with 0 errors on Mon Feb 25 09:54:39 2019
config:

	NAME        STATE     READ WRITE CKSUM
	slag4       ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    U18     ONLINE       0     0     0  (100% trimmed, completed at Sat Feb 23 07:28:43 2019)
	    U19     ONLINE       0     0     0  (100% trimmed, completed at Sat Feb 23 07:28:43 2019)
	    U20     ONLINE       0     0     0  (100% trimmed, completed at Sat Feb 23 07:28:44 2019)
# TRIM request size histogram (right two columns)
$ zpool iostat -r

testpool      sync_read    sync_write    async_read    async_write      scrub         trim    
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0
4K          14.3M      0  1.86M      0   821K      0  48.3M      0  1.17M      0      0      0
8K          22.6K  17.4K  17.5K      4     42  67.1K  3.63M  20.1M   177K  59.0K      0      0
16K         11.8K  17.3K  56.2K    890  3.23K  62.4K  2.52M  13.8M  61.8K  80.0K      0      0
32K         47.7K  8.82K  29.8K  8.20K  11.7K  65.1K  3.38M  10.2M  26.1K   361K  1.77M      0
64K            36   1005   107K  12.3K      0  17.3K  3.12M  6.85M  10.1K   472K  1.20M      0
128K            0  1.66K   888K  43.2K      0  1.31K  13.5M  5.19M  47.9K   505K   946K      0
256K            0  1.78K      0   363K      0     60      0  7.52M      0   290K   542K      0
512K            0      0      0  1.58M      0      1      0  7.84M      0   367K   354K      0
1M              0      0      0  16.9M      0      0      0  13.2M      0  1.62M   231K      0
2M              0      0      0      0      0      0      0      0      0      0   177K      0
4M              0      0      0      0      0      0      0      0      0      0   130K      0
8M              0      0      0      0      0      0      0      0      0      0  89.9K      0
16M             0      0      0      0      0      0      0      0      0      0   193K      0
----------------------------------------------------------------------------------------------
# TRIM latency histogram (right column)
$ zpool iostat -w

slag4        total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      0      0      0
511ns           0      0      0      0  21.4K  19.8K     22  18.0K      3      0
1us             0      0      0      0  5.96M  1.51M  49.6K  1.72M   171K  44.9K
2us             0      0      0      0  7.39M   683K   552K  10.0M   911K  1.40M
4us             0      0      0      0   522K   117K  90.4K  10.9M   199K  2.94M
8us             0      0      0      0   101K  26.4K  5.70K  3.81M  17.6K  1.03M
16us            0      0      0      0  96.0K  9.39K  2.98K  2.64M  9.80K  75.8K
32us           94      1    405    339   120K  6.93K  4.04K  4.26M  12.1K  74.2K
65us        6.50K  11.2M  10.5K  61.1M   128K  9.13K  6.18K  8.38M  21.4K  2.75K
131us       2.23M  14.5M  2.65M  19.4M  97.7K  16.2K  25.5K  11.6M   364K    912
262us       13.5M  19.7M  13.6M  18.1M  25.0K  23.8K  51.8K  15.7M   419K  1.56K
524us       1020K  29.9M  1.57M  19.8M  5.00K  30.8K  58.3K  26.1M   653K  2.85K
1ms         1.02M  29.8M   709K  16.9M  2.45K  40.8K  72.8K  22.8M   262K  1.78K
2ms          578K  20.1M  1.92M  9.46M  2.06K  54.2K  73.5K  12.5M   184K    587
4ms          420K  12.6M   186K  6.61M  1.01K  67.4K  41.9K  10.4M   343K     62
8ms          875K  8.29M  8.91K  4.08M    486  88.7K  7.53K  6.61M  1.37M      4
16ms        1.02M  6.10M  6.28K  9.10M    386   122K  2.97K  5.12M   244K      4
33ms        54.8K  5.13M  4.83K  14.4M    394   148K  2.18K  3.70M  45.6K      2
67ms        31.3K  2.58M  1.48K  2.02M    256   248K    881  1.45M  28.4K      0
134ms       15.0K  2.12M      9  18.4K     56   457K    323  1.04M  14.5K      0
268ms       2.29K  1.38M      3    478      0   886K     29   338K  2.24K      0
536ms          96  1.87M      0    171      0  1.76M      0  98.0K     96      0
1s            105  3.93M      0      0      0  3.89M      0  34.0K    105      0
2s              0  6.18M      0     30      0  6.10M      0  15.2K      0      0
4s              0  5.29M      0      0      0  5.23M      0  5.57K      0      0
8s              0   367K      0      0      0   344K      0   1010      0      0
17s             0      7      0      0      0      0      0      7      0      0
34s             0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

How Has This Been Tested?

The ztest command has been updated to periodically initiate a zpool trim and the autotrim property will be randomly toggled on/off. This has been proven to provide excellent coverage for the TRIM functionality under a wide range of pool configurations and to expose otherwise hard to hit race conditions. In particular ztest does an excellent job verifying that a new feature interacts correctly with the existing features.

25 new test cases where added to the ZFS Test Suite to provide additional test coverage. These include slightly modified versions of all the zpool initialize test cases which verify that zpool trim CLI behaves as intended (in a similar way to initialize). New zpool trim tests were added to test -p partial TRIM, -r rate limited TRIM, and that multiple zpool trim commands can be issued. Additionally, stress tests were added to verify that zpool trim and zpool set autotrim=on only discard unallocated space when under a heavy alloc/free workload.

$ scripts/zfs-tests.sh -T zpool_trim,trim 

Test: cli_root/zpool_trim/setup (run as root) [00:00] [PASS]
Test: cli_root/zpool_trim/zpool_trim_attach_detach_add_remove (run as root) [00:01] [PASS]
Test: cli_root/zpool_trim/zpool_trim_import_export (run as root) [00:03] [PASS]
Test: cli_root/zpool_trim/zpool_trim_multiple (run as root) [00:08] [PASS]
Test: cli_root/zpool_trim/zpool_trim_neg (run as root) [00:00] [PASS]
Test: cli_root/zpool_trim/zpool_trim_offline_export_import_online (run as root) [00:01] [PASS]
Test: cli_root/zpool_trim/zpool_trim_online_offline (run as root) [00:01] [PASS]
Test: cli_root/zpool_trim/zpool_trim_partial (run as root) [00:30] [PASS]
Test: cli_root/zpool_trim/zpool_trim_rate (run as root) [00:17] [PASS]
Test: cli_root/zpool_trim/zpool_trim_rate_neg (run as root) [00:00] [PASS]
Test: cli_root/zpool_trim/zpool_trim_split (run as root) [00:01] [PASS]
Test: cli_root/zpool_trim/zpool_trim_start_and_cancel_neg (run as root) [00:00] [PASS]
Test: cli_root/zpool_trim/zpool_trim_start_and_cancel_pos (run as root) [00:00] [PASS]
Test: cli_root/zpool_trim/zpool_trim_suspend_resume (run as root) [00:08] [PASS]
Test: cli_root/zpool_trim/zpool_trim_unsupported_vdevs (run as root) [00:02] [PASS]
Test: cli_root/zpool_trim/zpool_trim_verify_checksums (run as root) [00:14] [PASS]
Test: cli_root/zpool_trim/zpool_trim_verify_trimmed (run as root) [00:04] [PASS]
Test: cli_root/zpool_trim/cleanup (run as root) [00:00] [PASS]
Test: trim/setup (run as root) [00:00] [PASS]
Test: trim/autotrim_integrity (run as root) [00:35] [PASS]
Test: trim/autotrim_config (run as root) [00:44] [PASS]
Test: trim/autotrim_trim_integrity (run as root) [01:10] [PASS]
Test: trim/trim_integrity (run as root) [00:41] [PASS]
Test: trim/trim_config (run as root) [00:32] [PASS]
Test: trim/cleanup (run as root) [00:00] [PASS]

Results Summary
PASS      25

Running Time:   00:05:23
Percent passed: 100.0%

Performance testing was done using both mirrored and raidz configurations in order to assess the overall impact on applications while the autotrim=on property is set. The test case consists of:

  • Create a pool. Both a mirror and raidz2 pool were tested, three 800GB Seagate ST800FM0173 devices were used for each configuration. This has the advantage that total usable pool capacity was approximately the same for both configurations (800G).

  • Warm up - Repeat the following commands for 2000 iterations. For each iteration, N is randomly selected between 1 and 200 and copy of the Linux kernel git tree is removed and then copied back from tmpfs (~3.4GB). This was done to both fill and fragment the unallocated pool space until a steady-state was reached. The value 200 was selected for N in order to keep the pool at ~76% capacity.

    • rm -r /testpool/fs/linux-$N
    • time (cp -a /tmp/linux /testpool/fs/Linux-$N; sync)
  • Manual TRIM - Issue a manual zpool trim, then repeat the commands from the previous step for 250 iterations. This was done to both verify a manual TRIM restores the initial performance and to determine approximately how long it takes to return to a steady-state. The manual TRIM phase was repeated twice.

  • Automatic TRIM - Set the autotrim=on property, then repeat the commands from the previous step for 1000 iterations. As blocks are freed and added back in to the unallocated pool space those extents are asynchronously trimmed. What we want to verify is that the asynchronous TRIM does not negatively impact overall pool performance, that after enough iterations performance returns to close it its initial levels, and finally that performance is maintained at that level.

The following graph shows pool performance for all of the phases described above. On the Y-axis the wall clock time in seconds is plotted for each cp -a; sync command. On the X-axis is each iteration, from 1 to 3500 (2000 warm up, 250 + 250 manual TRIM, 1000 automatic TRIM). Multiple test runs were performed for each pool configuration and the results of these runs then averaged together and graphed. To further reduce the noise a moving average was then plotted using the previously averaged data.

image

Configuration:

  • RHEL 7 - 3.10.0-957.0.0 kernel
  • 3 Seagate ST800FM0173 devices
  • Dual 12Gb/s SAS attached

[edit] Updated motivation and description to reflect current PR.
[edit] Removed work in progress warning.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added Type: Feature Feature request or new feature Status: Work in Progress Not yet ready for general review labels Feb 16, 2019
@bgly
Copy link

bgly commented Feb 18, 2019

Testing on this version vs the ntrim3 from @dweeezil I am not able to run zpool trim -p PoolName

[root@bl]# zpool trim -p rpoole66c72b7534b499f965c331515ca3020
cannot trim 'part1': no such device in pool

When I switch back to ntrim3 it works again.

Some debug info:

[root@bl-vsnap-100 ~]# zpool status -t
pool: rpoole66c72b7534b499f965c331515ca3020
state: ONLINE
scan: none requested
config:

    NAME                                   STATE     READ WRITE CKSUM
    rpoole66c72b7534b499f965c331515ca3020  ONLINE       0     0     0
      part1                                ONLINE       0     0     0  (untrimmed)

errors: No known data errors

1550458286 spa_misc.c:408:spa_load_note(): spa_load($import, config trusted): LOADED
1550458286 spa_misc.c:408:spa_load_note(): spa_load($import, config trusted): UNLOADING
1550458286 spa.c:5366:spa_import(): spa_import: importing rpoole66c72b7534b499f965c331515ca3020
1550458286 spa_misc.c:408:spa_load_note(): spa_load(rpoole66c72b7534b499f965c331515ca3020, config trusted): LOADING
1550458286 vdev.c:124:vdev_dbgmsg(): disk vdev '/dev/bdg/da9ce426-945e-4fa6-a357-893d5dd3f82e/part1': best uberblock found for spa rpoole66c72b7534b499f965c331515ca3020. txg 91
1550458286 spa_misc.c:408:spa_load_note(): spa_load(rpoole66c72b7534b499f965c331515ca3020, config untrusted): using uberblock with txg=91
1550458286 vdev.c:2061:vdev_copy_path_impl(): vdev_copy_path: vdev 18321001255259253189: path changed from '/dev/sdd1' to '/dev/bdg/da9ce426-945e-4fa6-a357-893d5dd3f82e/part1'
1550458287 spa.c:7435:spa_async_request(): spa=rpoole66c72b7534b499f965c331515ca3020 async request task=1
1550458287 spa_misc.c:408:spa_load_note(): spa_load(rpoole66c72b7534b499f965c331515ca3020, config trusted): LOADED
1550458287 spa_history.c:319:spa_history_log_sync(): txg 93 open pool version 5000; software version unknown; uts bl-vsnap-100.devad.catalogic.us 4.19.8-1.el7.elrepo.x86_64 #1 SMP Sat Dec 8 10:07:47 EST 2018 x86_64
1550458287 spa.c:7435:spa_async_request(): spa=rpoole66c72b7534b499f965c331515ca3020 async request task=32
1550458287 spa_history.c:319:spa_history_log_sync(): txg 95 import pool version 5000; software version unknown; uts bl-vsnap-100.devad.catalogic.us 4.19.8-1.el7.elrepo.x86_64 #1 SMP Sat Dec 8 10:07:47 EST 2018 x86_64
1550458287 spa_history.c:306:spa_history_log_sync(): command: zpool import -m -d /dev/bdg/da9ce426-945e-4fa6-a357-893d5dd3f82e -o altroot=/vsnap -o cachefile=/opt/vsnap-data/restore_7861a25e3cb54d35b2057701e3e75f01_1550458283/pool.cachefile opool7861a25e3cb54d35b2057701e3e75f01 rpoole66c72b7534b499f965c331515ca3020
1550458287 spa_history.c:319:spa_history_log_sync(): txg 97 guid change old=5064559353479793234 new=17919770463577801432
1550458288 spa_history.c:306:spa_history_log_sync(): command: zpool reguid rpoole66c72b7534b499f965c331515ca3020
1550458288 spa_history.c:314:spa_history_log_sync(): txg 98 set rpoole66c72b7534b499f965c331515ca3020 (id 54) atime=0
1550458288 spa_history.c:306:spa_history_log_sync(): command: zfs set atime=off rpoole66c72b7534b499f965c331515ca3020
1550458288 spa_history.c:314:spa_history_log_sync(): txg 99 set rpoole66c72b7534b499f965c331515ca3020 (id 54) sync=2
1550458288 spa_history.c:306:spa_history_log_sync(): command: zfs set sync=disabled rpoole66c72b7534b499f965c331515ca3020
1550458288 spa_history.c:314:spa_history_log_sync(): txg 103 rename rpoole66c72b7534b499f965c331515ca3020/fsoffload (id 144) -> rpoole66c72b7534b499f965c331515ca3020/fse66c72b7534b499f965c331515ca3020
1550458293 spa_history.c:306:spa_history_log_sync(): command: zfs rename -fp rpoole66c72b7534b499f965c331515ca3020/fsoffload rpoole66c72b7534b499f965c331515ca3020/fse66c72b7534b499f965c331515ca3020

Copy link
Contributor

@dweeezil dweeezil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posting my comments so far. I've not fully reviewed the new vdev_trim.c file. That should be happening soon.

cmd/zpool/zpool_main.c Outdated Show resolved Hide resolved
module/zfs/vdev_trim.c Outdated Show resolved Hide resolved
include/sys/spa.h Show resolved Hide resolved
man/man8/zpool.8 Outdated Show resolved Hide resolved
man/man8/zpool.8 Outdated Show resolved Hide resolved
module/zfs/spa.c Outdated Show resolved Hide resolved
module/zfs/spa_stats.c Show resolved Hide resolved
module/zfs/spa_stats.c Outdated Show resolved Hide resolved
module/zfs/vdev_queue.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
@behlendorf
Copy link
Contributor Author

@bgly thanks for taking the time to test this PR. Unfortunately, I wasn't able to reproduce your failure. Can you describe your configuration, in particular and what kind of device part1 is and how the pool created.

@bgly
Copy link

bgly commented Feb 20, 2019

@bgly thanks for taking the time to test this PR. Unfortunately, I wasn't able to reproduce your failure. Can you describe your configuration, in particular and what kind of device part1 is and how the pool created.

The exposed device is via loopback from LIO. So think of it as a virtual disk exposed as a block device via loopback. part1 is just /dev/sdd1, which in the dbgmsg is:

1550634583 vdev.c:2067:vdev_copy_path_impl(): vdev_copy_path: vdev 6718177743574283759: path changed from '/dev/sdd1' to '/dev/bdg/7035cc52-4a75-4ced-b392-ef6bfa63fb2b/part1'

/dev/bdg/7035cc52-4a75-4ced-b392-ef6bfa63fb2b - This is the virtual storage device
/dev/sdd1 = loopback making it look like block device in linux

[root@bl-vsnap-100 ~]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdd               8:48   0    4P  0 disk
├─sdd9            8:57   0   64M  0 part
└─sdd1            8:49   0    4P  0 part
sdb               8:16   0  100G  0 disk
├─sdb9            8:25   0    8M  0 part
└─sdb1            8:17   0  100G  0 part
sr0              11:0    1 1024M  0 rom
fd0               2:0    1    4K  0 disk
sdc               8:32   0  100G  0 disk /mnt
sda               8:0    0   50G  0 disk
├─sda2            8:2    0   49G  0 part
│ ├─centos-swap 253:1    0  7.8G  0 lvm  [SWAP]
│ └─centos-root 253:0    0 41.2G  0 lvm  /
└─sda1            8:1    0    1G  0 part /boot

You can see the commands ran to create the pool also in the dbgmsg:

1550634582   spa_misc.c:408:spa_load_note(): spa_load($import, config trusted): LOADING
1550634582   vdev.c:124:vdev_dbgmsg(): disk vdev '/dev/bdg/7035cc52-4a75-4ced-b392-ef6bfa63fb2b/part1': best uberblock found for spa $import. txg 88
1550634582   spa_misc.c:408:spa_load_note(): spa_load($import, config untrusted): using uberblock with txg=88
1550634582   vdev.c:2067:vdev_copy_path_impl(): vdev_copy_path: vdev 6718177743574283759: path changed from '/dev/sdd1' to '/dev/bdg/7035cc52-4a75-4ced-b392-ef6bfa63fb2b/part1'
1550634583   spa_misc.c:408:spa_load_note(): spa_load($import, config trusted): LOADED
1550634583   spa_misc.c:408:spa_load_note(): spa_load($import, config trusted): UNLOADING
1550634583   spa.c:5366:spa_import(): spa_import: importing rpool5e2f64bc38e3475eb5395c37e1e836d8
1550634583   spa_misc.c:408:spa_load_note(): spa_load(rpool5e2f64bc38e3475eb5395c37e1e836d8, config trusted): LOADING
1550634583   vdev.c:124:vdev_dbgmsg(): disk vdev '/dev/bdg/7035cc52-4a75-4ced-b392-ef6bfa63fb2b/part1': best uberblock found for spa rpool5e2f64bc38e3475eb5395c37e1e836d8. txg 88
1550634583   spa_misc.c:408:spa_load_note(): spa_load(rpool5e2f64bc38e3475eb5395c37e1e836d8, config untrusted): using uberblock with txg=88
1550634583   vdev.c:2067:vdev_copy_path_impl(): vdev_copy_path: vdev 6718177743574283759: path changed from '/dev/sdd1' to '/dev/bdg/7035cc52-4a75-4ced-b392-ef6bfa63fb2b/part1'
1550634584   spa.c:7441:spa_async_request(): spa=rpool5e2f64bc38e3475eb5395c37e1e836d8 async request task=1
1550634584   spa_misc.c:408:spa_load_note(): spa_load(rpool5e2f64bc38e3475eb5395c37e1e836d8, config trusted): LOADED
1550634584   spa_history.c:319:spa_history_log_sync(): txg 90 open pool version 5000; software version unknown; uts bl-vsnap-100.devad.catalogic.us 4.19.8-1.el7.elrepo.x86_64 #1 SMP Sat Dec 8 10:07:47 EST 2018 x86_64
1550634584   spa.c:7441:spa_async_request(): spa=rpool5e2f64bc38e3475eb5395c37e1e836d8 async request task=32
1550634584   spa_history.c:319:spa_history_log_sync(): txg 92 import pool version 5000; software version unknown; uts bl-vsnap-100.devad.catalogic.us 4.19.8-1.el7.elrepo.x86_64 #1 SMP Sat Dec 8 10:07:47 EST 2018 x86_64
1550634584   spa_history.c:306:spa_history_log_sync(): command: zpool import -m -d /dev/bdg/7035cc52-4a75-4ced-b392-ef6bfa63fb2b -o altroot=/vsnap -o cachefile=/opt/vsnap-data/restore_7861a25e3cb54d35b2057701e3e75f01_1550634580/pool.cachefile opool7861a25e3cb54d35b2057701e3e75f01 rpool5e2f64bc38e3475eb5395c37e1e836d8
1550634584   spa_history.c:319:spa_history_log_sync(): txg 94 guid change old=3408341575703009700 new=11403916235060901602
1550634584   spa_history.c:306:spa_history_log_sync(): command: zpool reguid rpool5e2f64bc38e3475eb5395c37e1e836d8
1550634584   spa_history.c:314:spa_history_log_sync(): txg 95 set rpool5e2f64bc38e3475eb5395c37e1e836d8 (id 54) atime=0
1550634584   spa_history.c:306:spa_history_log_sync(): command: zfs set atime=off rpool5e2f64bc38e3475eb5395c37e1e836d8
1550634584   spa_history.c:314:spa_history_log_sync(): txg 96 set rpool5e2f64bc38e3475eb5395c37e1e836d8 (id 54) sync=2
1550634585   spa_history.c:306:spa_history_log_sync(): command: zfs set sync=disabled rpool5e2f64bc38e3475eb5395c37e1e836d8
1550634585   spa_history.c:314:spa_history_log_sync(): txg 100 rename rpool5e2f64bc38e3475eb5395c37e1e836d8/fsoffload (id 68) -> rpool5e2f64bc38e3475eb5395c37e1e836d8/fs5e2f64bc38e3475eb5395c37e1e836d8
1550634590   spa_history.c:306:spa_history_log_sync(): command: zfs rename -fp rpool5e2f64bc38e3475eb5395c37e1e836d8/fsoffload rpool5e2f64bc38e3475eb5395c37e1e836d8/fs5e2f64bc38e3475eb5395c37e1e836d8

Btw I just tried your latest push and I do not have the issue of zpool iostat printing spaces and overflowing my text file.

@bgly
Copy link

bgly commented Feb 20, 2019

[/home/kash]-[root@fezzik]-[0]-[10085]
[:)] # zpool trim rpool
cannot trim 'wwn-0x5000cca25ccde9bc': trim operations are not supported by this device
cannot trim 'wwn-0x5000cca25cce59c1': trim operations are not supported by this device
[/home/kash]-[root@fezzik]-[255]-[10086]
[:(] # zpool status -v 
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 8 days 18:42:35 with 0 errors on Sun Feb 10 15:03:32 2019
config:

        NAME                        STATE     READ WRITE CKSUM
        rpool                       ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000cca25ccde9bc  ONLINE       0     0     0
            wwn-0x5000cca25cce59c1  ONLINE       0     0     0
        special 
          mirror-1                  ONLINE       0     0     0
            nvme0n1p4               ONLINE       0     0     0  (trimming)
            nvme1n1p4               ONLINE       0     0     0  (trimming)
        logs
          mirror-2                  ONLINE       0     0     0
            nvme0n1p2               ONLINE       0     0     0
            nvme1n1p2               ONLINE       0     0     0

errors: No known data errors

interestingly, it does not trim SLOG devices and has non-zero return when it does start trimming..

Are you using like an ISCSI lun or something? I believe if you do that you have to make sure you enable TPU via targetcli if that is what you are using for your storage mappings.

@codecov
Copy link

codecov bot commented Feb 20, 2019

Codecov Report

Merging #8419 into master will increase coverage by 0.17%.
The diff coverage is 93.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8419      +/-   ##
==========================================
+ Coverage    78.6%   78.77%   +0.17%     
==========================================
  Files         380      381       +1     
  Lines      115903   116951    +1048     
==========================================
+ Hits        91105    92130    +1025     
- Misses      24798    24821      +23
Flag Coverage Δ
#kernel 79.15% <92.1%> (+0.12%) ⬆️
#user 67.55% <89.55%> (+0.31%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb1be77...73ae8f3. Read the comment docs.

@behlendorf
Copy link
Contributor Author

Refreshed PR, to resolve conflicts add address a few issues observed while testing and by the CI.

cannot trim 'wwn-0x5000cca25ccde9bc': trim operations are not supported by this device

I've suppressed this warnings when only the pool name was specified since some devices may correctly not support discard.

well, eventually the (trimming) disappeared but I did not get any status lines about it.

Similar to other vdev specific operations (initializing, resilvering, repairing) a short version of the status is printed after the vdev while in progress. For trim in particular additional detail can be requested with the new zpool status -t option. There is no longer a global top level trim status.

$ zpool status -t
  pool: tank
 state: ONLINE
  scan: none requested
config:

	NAME              STATE     READ WRITE CKSUM
	tank              ONLINE       0     0     0
	  /var/tmp/vdev1  ONLINE       0     0     0  (100% trimmed, completed at Wed 20 Feb 2019 06:12:21 PM PST)
	  vdd             ONLINE       0     0     0  (trim unsupported)
	logs	
	  /var/tmp/log    ONLINE       0     0     0  (100% trimmed, completed at Wed 20 Feb 2019 06:12:21 PM PST)

Lastly for clarity, autotrim and trim statistics are no longer reported separately in the extended iostats. For example:

$ zpool iostat -r

testpool      sync_read    sync_write    async_read    async_write      scrub         trim    
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0  21.2K      0    113      0      0      0
1K              0      0  3.06K      0      0      0  22.2K  5.49K     61     18      0      0
2K              0      0      0      0      0      0  15.4K  10.4K    129     46      0      0
4K              0      0      0      0      0      0  7.42K  8.29K     16     86  1.43K      0
8K             24      0     40      0      0      0      8  5.37K      0     48  1.57K      0
16K             0      0      0      0      0      0  4.46K  2.03K      2     85    808      0
32K             0      0      0      0      0      0     56  2.19K      0    110    328      0
64K            16      0     64      0      0      0    502  1.16K     15     33    328      0
128K            0      0     16      0      0      0  2.28K  1.64K     24     55    416      0
256K            0      0      0      0      0      0      0  1.24K      0    129    236      0
512K            0      0      0      0      0      0      0  1.41K      0    492    204      0
1M              0      0      0      0      0      0      0      7      0      0    228      0
2M              0      0      0      0      0      0      0      0      0      0     72      0
4M              0      0      0      0      0      0      0      0      0      0     80      0
8M              0      0      0      0      0      0      0      0      0      0     12      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------

$ zpool iostat -w

testpool     total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      0      0      0
511ns           0      0      0      0      0     19      0      4      0     86
1us             0      0      0      0      0    592      0  1.73K      0    260
2us             0      0      0      0     10    688      0  8.07K      0  1.34K
4us             0      0      0      1      0    129      0  2.87K      0  1.11K
8us             0     23      0    192      2     82      0    881      0    141
16us            0    948      0  2.13K      0      8      0  1.11K      0      9
32us            0  3.37K      0  5.85K      0      2      0  1.77K      0      2
65us            4  5.73K      4  8.21K      0      0      0  2.41K      0      0
131us           8  6.81K      8  6.56K      0      0      0  2.81K      0      0
262us           0  5.83K      0  3.27K      0      0      0  2.93K      0      0
524us           0  3.54K      0  1.62K      0      0      0  1.81K      0      0
1ms             0  1.65K      0    987      0      0      0    871      0      0
2ms             0    890      0    425      0      0      0    456      0      0
4ms             0    493      0    180      0      0      0    201      0      0
8ms             0    200      0     68      0      0      0     98      0      0
16ms            0     28      0      7      0      0      0      6      0      0

@bgly I wasn't able to look at your issue today, but you're analysis in on the right track. The label will always exist but for some reason with the LIO devices you're using the new code is failing to locate it. I hope to have some time to look at this tomorrow. If you could reproduce the issue with something other than an LIO device that would be helpful.

@bgly
Copy link

bgly commented Feb 21, 2019

@behlendorf

I have found the issue. When I create a symbolic link to the device and re-import with zpool import -d then this is when I get the issue.

zpool trim works in this case:

[root@bl-vsnap-100 ~]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdd               8:48   0    4P  0 disk
├─sdd9            8:57   0    8M  0 part
└─sdd1            8:49   0    4P  0 part
sdb               8:16   0  100G  0 disk
├─sdb9            8:25   0    8M  0 part
└─sdb1            8:17   0  100G  0 part
sr0              11:0    1 1024M  0 rom
fd0               2:0    1    4K  0 disk
sdc               8:32   0  100G  0 disk /mnt
sda               8:0    0   50G  0 disk
├─sda2            8:2    0   49G  0 part
│ ├─centos-swap 253:1    0  7.8G  0 lvm  [SWAP]
│ └─centos-root 253:0    0 41.2G  0 lvm  /
└─sda1            8:1    0    1G  0 part /boot

[root@bl-vsnap-100 ~]# zpool import test_pool
[root@bl-vsnap-100 ~]# zpool status
  pool: test_pool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test_pool   ONLINE       0     0     0
          sdd       ONLINE       0     0     0

errors: No known data errors

  pool: vpool2
 state: ONLINE
  scan: none requested
config:

        NAME                                      STATE     READ WRITE CKSUM
        vpool2                                    ONLINE       0     0     0
          scsi-36000c290f7e3d3e0c3644005631d8107  ONLINE       0     0     0

errors: No known data errors

trim does not work in this case:

[root@bl-vsnap-100 ~]# ls -lah test_link
lrwxrwxrwx. 1 root root 9 Feb 21 17:53 test_link -> /dev/sdd1

[root@bl-vsnap-100 ~]# zpool import -d /root/test_link
   pool: test_pool
     id: 396397379647903103
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        test_pool    ONLINE
          test_link  ONLINE
[root@bl-vsnap-100 ~]# zpool status
  pool: vpool2
 state: ONLINE
  scan: none requested
config:

        NAME                                      STATE     READ WRITE CKSUM
        vpool2                                    ONLINE       0     0     0
          scsi-36000c290f7e3d3e0c3644005631d8107  ONLINE       0     0     0

errors: No known data errors

[root@bl-vsnap-100 ~]# zpool trim -p test_pool
cannot open 'test_pool': no such pool

The reason we do a symbolic link is to be able to import without allowing zfs to scan through bad devices if we specify a specific path.

@bgly
Copy link

bgly commented Feb 21, 2019

@behlendorf

It seems like I found another bug somewhere with trim. With ntrim3 everything seems to work fine, but your latest branch it looks like I am again having issues.

[root@bl-vsnap-100 vsnap]# zpool status -t
  pool: rpool73ac2ec5b7134ca192b020681ab913d8
 state: ONLINE
  scan: none requested
config:

        NAME                                   STATE     READ WRITE CKSUM
        rpool73ac2ec5b7134ca192b020681ab913d8  ONLINE       0     0     0
          sdd                                  ONLINE       0     0     0  (0% trimmed, started at Thu 21 Feb 2019 07:32:44 PM UTC)

errors: No known data errors

  pool: vpool2
 state: ONLINE
  scan: none requested
config:

        NAME                                      STATE     READ WRITE CKSUM
        vpool2                                    ONLINE       0     0     0
          scsi-36000c290f7e3d3e0c3644005631d8107  ONLINE       0     0     0  (trim unsupported)

errors: No known data errors

I took off the symlink code when importing and I am now able to trim, but the status doesn't show progression. Or it is not getting completion somewhere since:

1550777558   spa_history.c:306:spa_history_log_sync(): command: zfs rename -fp rpool73ac2ec5b7134ca192b020681ab913d8/fsoffload rpool73ac2ec5b7134ca192b020681ab913d8/fs73ac2ec5b7134ca192b020681ab913d8
1550777564   spa_history.c:319:spa_history_log_sync(): txg 108 trim vdev=/dev/sdd1 activated
1550777570   spa_history.c:339:spa_history_log_sync(): ioctl trim
1550777570   spa_history.c:306:spa_history_log_sync(): command: zpool trim -p rpool73ac2ec5b7134ca192b020681ab913d8
[root@bl-vsnap-100 vsnap]#

I am monitoring SCSI commands going through block layer/scsi layer/LIO/tcmu so I am able to see that all the unmap commands the large amounts of them going through and completing, but the progress bar in zpool status -t does not move? I noticed a bunch of unmap's coming through and then when it finished it just did some small writes. Thus I would have expected the trim status to show 100% completed.

I also want to add that this is with your latest changes.

@bgly
Copy link

bgly commented Feb 21, 2019

@behlendorf

It seems like I found another bug somewhere with trim. With ntrim3 everything seems to work fine, but your latest branch it looks like I am again having issues.

[root@bl-vsnap-100 vsnap]# zpool status -t
  pool: rpool73ac2ec5b7134ca192b020681ab913d8
 state: ONLINE
  scan: none requested
config:

        NAME                                   STATE     READ WRITE CKSUM
        rpool73ac2ec5b7134ca192b020681ab913d8  ONLINE       0     0     0
          sdd                                  ONLINE       0     0     0  (0% trimmed, started at Thu 21 Feb 2019 07:32:44 PM UTC)

errors: No known data errors

  pool: vpool2
 state: ONLINE
  scan: none requested
config:

        NAME                                      STATE     READ WRITE CKSUM
        vpool2                                    ONLINE       0     0     0
          scsi-36000c290f7e3d3e0c3644005631d8107  ONLINE       0     0     0  (trim unsupported)

errors: No known data errors

I took off the symlink code when importing and I am now able to trim, but the status doesn't show progression. Or it is not getting completion somewhere since:

1550777558   spa_history.c:306:spa_history_log_sync(): command: zfs rename -fp rpool73ac2ec5b7134ca192b020681ab913d8/fsoffload rpool73ac2ec5b7134ca192b020681ab913d8/fs73ac2ec5b7134ca192b020681ab913d8
1550777564   spa_history.c:319:spa_history_log_sync(): txg 108 trim vdev=/dev/sdd1 activated
1550777570   spa_history.c:339:spa_history_log_sync(): ioctl trim
1550777570   spa_history.c:306:spa_history_log_sync(): command: zpool trim -p rpool73ac2ec5b7134ca192b020681ab913d8
[root@bl-vsnap-100 vsnap]#

I am monitoring SCSI commands going through block layer/scsi layer/LIO/tcmu so I am able to see that all the unmap commands the large amounts of them going through and completing, but the progress bar in zpool status -t does not move? I noticed a bunch of unmap's coming through and then when it finished it just did some small writes. Thus I would have expected the trim status to show 100% completed.

I also want to add that this is with your latest changes.

I left it running and it actually completed roughly 28 minutes later. I didn't even delete anything, so there was nothing to trim. There must be something that is taking a long time to show completion?

1550777570   spa_history.c:306:spa_history_log_sync(): command: zpool trim -p rpool73ac2ec5b7134ca192b020681ab913d8
1550779214   spa_history.c:319:spa_history_log_sync(): txg 433 trim vdev=/dev/sdd1 complete
[root@bl-vsnap-100 vsnap]# zpool status -t
  pool: rpool73ac2ec5b7134ca192b020681ab913d8
 state: ONLINE
  scan: none requested
config:

        NAME                                   STATE     READ WRITE CKSUM
        rpool73ac2ec5b7134ca192b020681ab913d8  ONLINE       0     0     0
          sdd                                  ONLINE       0     0     0  (100% trimmed, completed at Thu 21 Feb 2019 08:00:14 PM UTC)

errors: No known data errors

If you compare the start time from my previous post and the completion time it is ~28 minutes to basically trim no data. - Also note that from the SCSI layer the unmap was almost instantaneous in terms of completion. I saw the unmap commands coming through and it took at most ~5 secs to handle all the unmap commands.

@behlendorf
Copy link
Contributor Author

@bgly can you check the contents of /proc/spl/kstat/zfs/tank/iostats. It may give us some insight in to what you're seeing. Can you also verify that you weren't seeing this issue with commit 2a114a4.

@bgly
Copy link

bgly commented Feb 22, 2019

@bgly can you check the contents of /proc/spl/kstat/zfs/tank/iostats. It may give us some insight in to what you're seeing. Can you also verify that you weren't seeing this issue with commit 2a114a4.

[root@bl-vsnap-100 vsnap]# cat /proc/spl/kstat/zfs/rpool73ac2ec5b7134ca192b020681ab913d8/iostats
98 1 0x01 12 3264 8083609007857 23563313212151
name                            type data
trim_extents_written            4    9420
trim_bytes_written              4    1228180611072
trim_extents_skipped            4    545
trim_bytes_skipped              4    8245248
trim_extents_failed             4    0
trim_bytes_failed               4    0
autotrim_extents_written        4    0
autotrim_bytes_written          4    0
autotrim_extents_skipped        4    0
autotrim_bytes_skipped          4    0
autotrim_extents_failed         4    0
autotrim_bytes_failed           4    0

@behlendorf
The same stats as above are shown after it shows 100% completion. This also occurs on 2a114a4

I'm assuming its a reporting issue? I do not see any further SCSI UNMAP commands processing from the point where it starts to 30 minutes later when zpool status -t shows completion.

@behlendorf
Copy link
Contributor Author

@bgly thanks. Nothing out of place there, it does seem like a reporting issue. You could watch that kstat, run zpool iostat -r 5, or zpool iostat -w 5 to get more visibility in to how the TRIM/UNMAPs are being issued. I'll see if I can reproduce it locally.

I've updated the PR with another commit to address a few additional issues, including the symlink one.
Incidentally, thanks for finding that it was only an issue since your symlinks were in a non-standard location which wasn't under /dev/disk/.

@bgly
Copy link

bgly commented Feb 22, 2019

@bgly thanks. Nothing out of place there, it does seem like a reporting issue. You could watch that kstat, run zpool iostat -r 5, or zpool iostat -w 5 to get more visibility in to how the TRIM/UNMAPs are being issued. I'll see if I can reproduce it locally.

I've updated the PR with another commit to address a few additional issues, including the symlink one.
Incidentally, thanks for finding that it was only an issue since your symlinks were in a non-standard location which wasn't under /dev/disk/.

@behlendorf Thanks for fixing the symlink issue, that works for me now.

I will work on monitoring the trim, but while I was testing some more I found another bug.

1550805781   spa_history.c:319:spa_history_log_sync(): txg 109 trim vdev=/dev/bdg/7762094b-1a5a-4eae-9c54-d1fc71dde0d6/part1 activated
1550805786   spa_history.c:339:spa_history_log_sync(): ioctl trim
1550805786   spa_history.c:306:spa_history_log_sync(): command: zpool trim rpoolc6a158f1ca914a5aae30247c0ac48c68
1550806174   spa_history.c:319:spa_history_log_sync(): txg 188 trim vdev=/dev/bdg/7762094b-1a5a-4eae-9c54-d1fc71dde0d6/part1 suspended
1550806179   spa_history.c:339:spa_history_log_sync(): ioctl trim
1550806179   spa_history.c:306:spa_history_log_sync(): command: zpool trim -s rpoolc6a158f1ca914a5aae30247c0ac48c68
1550806203   spa_history.c:319:spa_history_log_sync(): txg 199 trim vdev=/dev/bdg/7762094b-1a5a-4eae-9c54-d1fc71dde0d6/part1 canceled
1550806208   spa_history.c:339:spa_history_log_sync(): ioctl trim
1550806208   spa_history.c:306:spa_history_log_sync(): command: zpool trim -c rpoolc6a158f1ca914a5aae30247c0ac48c68
1550806218   spa_history.c:319:spa_history_log_sync(): txg 204 trim vdev=/dev/bdg/7762094b-1a5a-4eae-9c54-d1fc71dde0d6/part1 activated
1550806223   spa_history.c:339:spa_history_log_sync(): ioctl trim
1550806223   spa_history.c:306:spa_history_log_sync(): command: zpool trim -p rpoolc6a158f1ca914a5aae30247c0ac48c68
1550806269   spa_history.c:319:spa_history_log_sync(): txg 216 trim vdev=/dev/bdg/7762094b-1a5a-4eae-9c54-d1fc71dde0d6/part1 canceled

When I issued a full device trim, it looked like zpool iostat -r 5 was showing:

rpoolc6a158f1ca914a5aae30247c0ac48c68    sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                 ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
-------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----512                                        0      0      0      0      0      0      0      0      0      0      0      0
1K                                         0      0      0      0      0      0      0      0      0      0      0      02K                                         0      0      0      0      0      0      0      0      0      0      0      0
4K                                         0      0      0      0      0      0      8      0      0      0      0      0
8K                                         0      0      0      0      0      0      0      1      0      0      0      0
16K                                        0      0      0      0      0      0      0      1      0      0      0      0
32K                                        0      0      0      0      0      0      0      0      0      0      0      0
64K                                        0      0      0      0      0      0      0      0      0      0      0      0
128K                                       0      0      0      0      0      0      0      0      0      0      0      0
256K                                       0      0      0      0      0      0      0      0      0      0      0      0
512K                                       0      0      0      0      0      0      0      0      0      0      0      0
1M                                         0      0      0      0      0      0      0      0      0      0      0      0
2M                                         0      0      0      0      0      0      0      0      0      0      0      0
4M                                         0      0      0      0      0      0      0      0      0      0      0      0
8M                                         0      0      0      0      0      0      0      0      0      0      0      0
16M                                        0      0      0      0      0      0      0      0      0      0  5.31K      0
-------------------------------------------------------------------------------------------------------------------------

This was good because I was consistently getting data in the SCSI layer doing UNMAP. As this was happening zpool status -t was progressing, which I was happy with.

I then realized that trimming was taking awhile so I then suspended it and canceled it. Afterwards I immediately issued a partial trim on the same device thinking this would work a lot faster. This is where issues started to occur. zpool status -t kept going from 9%-99% and it kept repeating infinitely. I also noticed zpool iostat -r 5 was not showing any trim ops flowing through and SCSI layer also had nothing. I then proceeded to try to cancel the trim with zpool trim -c pool_name and that is just hanging. Hopefully some of the stack traces below would help? - Let me know what you would need.

[root@bl-vsnap-100 ~]# grep -H . /proc/*/stack | grep zfs
/proc/12535/stack:[<0>] txg_quiesce_thread+0x39b/0x3a0 [zfs]
/proc/12536/stack:[<0>] txg_sync_thread+0x185/0x4a0 [zfs]
/proc/12537/stack:[<0>] mmp_thread+0x266/0x4b0 [zfs]
/proc/12538/stack:[<0>] zthr_procedure+0x75/0x130 [zfs]
/proc/12539/stack:[<0>] zthr_procedure+0x75/0x130 [zfs]
/proc/12878/stack:[<0>] spa_open_common+0x68/0x4c0 [zfs]
/proc/12878/stack:[<0>] spa_get_stats+0x57/0x560 [zfs]
/proc/12878/stack:[<0>] zfs_ioc_pool_stats+0x39/0x90 [zfs]
/proc/12878/stack:[<0>] zfsdev_ioctl+0x82e/0x890 [zfs]
/proc/14543/stack:[<0>] vdev_trim_stop_wait+0xa2/0x100 [zfs]
/proc/14543/stack:[<0>] spa_vdev_trim+0x1ef/0x570 [zfs]
/proc/14543/stack:[<0>] zfs_ioc_pool_trim+0x1cb/0x220 [zfs]
/proc/14543/stack:[<0>] zfsdev_ioctl+0x5a9/0x890 [zfs]
/proc/14620/stack:[<0>] spa_open_common+0x68/0x4c0 [zfs]
/proc/14620/stack:[<0>] spa_get_stats+0x57/0x560 [zfs]
/proc/14620/stack:[<0>] zfs_ioc_pool_stats+0x39/0x90 [zfs]
/proc/14620/stack:[<0>] zfsdev_ioctl+0x82e/0x890 [zfs]
/proc/14888/stack:[<0>] spa_open_common+0x68/0x4c0 [zfs]
/proc/14888/stack:[<0>] spa_get_stats+0x57/0x560 [zfs]
/proc/14888/stack:[<0>] zfs_ioc_pool_stats+0x39/0x90 [zfs]
/proc/14888/stack:[<0>] zfsdev_ioctl+0x82e/0x890 [zfs]
/proc/3093/stack:[<0>] zthr_procedure+0x75/0x130 [zfs]
/proc/3106/stack:[<0>] zthr_procedure+0xdd/0x130 [zfs]
/proc/3133/stack:[<0>] dbuf_evict_thread+0x7c/0x1e0 [zfs]
/proc/3193/stack:[<0>] l2arc_feed_thread+0x6e/0x440 [zfs]
/proc/5044/stack:[<0>] txg_quiesce_thread+0x39b/0x3a0 [zfs]
/proc/5045/stack:[<0>] txg_sync_thread+0x185/0x4a0 [zfs]
/proc/5046/stack:[<0>] mmp_thread+0x266/0x4b0 [zfs]
/proc/5059/stack:[<0>] zthr_procedure+0x75/0x130 [zfs]
/proc/5060/stack:[<0>] zthr_procedure+0x75/0x130 [zfs]

--- Additional Note:

It looks like it finally freed or timed out whatever the case, it is no longer hung.

[root@bl-vsnap-100 ~]# grep -H . /proc/*/stack | grep zfs
/proc/12535/stack:[<0>] txg_quiesce_thread+0x39b/0x3a0 [zfs]
/proc/12536/stack:[<0>] txg_sync_thread+0x185/0x4a0 [zfs]
/proc/12537/stack:[<0>] mmp_thread+0x266/0x4b0 [zfs]
/proc/12538/stack:[<0>] zthr_procedure+0x75/0x130 [zfs]
/proc/12539/stack:[<0>] zthr_procedure+0x75/0x130 [zfs]
/proc/3093/stack:[<0>] zthr_procedure+0x75/0x130 [zfs]
/proc/3106/stack:[<0>] zthr_procedure+0xdd/0x130 [zfs]
/proc/3133/stack:[<0>] dbuf_evict_thread+0x7c/0x1e0 [zfs]
/proc/3193/stack:[<0>] l2arc_feed_thread+0x6e/0x440 [zfs]
/proc/5044/stack:[<0>] txg_quiesce_thread+0x39b/0x3a0 [zfs]
/proc/5045/stack:[<0>] txg_sync_thread+0x185/0x4a0 [zfs]
/proc/5046/stack:[<0>] mmp_thread+0x266/0x4b0 [zfs]
/proc/5059/stack:[<0>] zthr_procedure+0x75/0x130 [zfs]
/proc/5060/stack:[<0>] zthr_procedure+0x75/0x130 [zfs]

@bgly
Copy link

bgly commented Feb 22, 2019

@bgly thanks. Nothing out of place there, it does seem like a reporting issue. You could watch that kstat, run zpool iostat -r 5, or zpool iostat -w 5 to get more visibility in to how the TRIM/UNMAPs are being issued. I'll see if I can reproduce it locally.

If I were to guess it seems like for a partial trim in which nothing was deleted, then the UNMAP occurs so fast that I do not even get a chance to see anything from zpool iostat -r 5. I do see the commands in SCSI layer, again very few UNMAP commands and is completed in less than a few seconds. This seems to be consistent with zpool status -t showing 0% for ~30 minutes. The only problem I see with this being at 0% is that I can not go ahead and delete date and re-run the partial trim. If i do that results in: [root@bl-vsnap-100 ~]# zpool trim -p rpool0b423ce29c2c49d7b1c39c81d829e3be cannot trim '/dev/bdg/865eaa0e-786c-482b-96c2-472be1beb525/part1': currently trimming If I attempt to cancel the trim then it hangs for like 45mins. I have not gone through the code in depth yet, but there must be some lock held waiting for completion. The weird thing is I would have assumed zpool iostat -r 5 to atleast show some I/O in regards to unmap occuring...

@lundman
Copy link
Contributor

lundman commented Feb 22, 2019

I have refreshed the OSX PR openzfsonosx/zfs#661 to be based on this one - plus a bunch of commits I needed to make it merge easier.

Testing is a little hampered by that we frequently use shortnames with disks:

	NAME        STATE     READ WRITE CKSUM
	zones       ONLINE       0     0     0
	  disk2     ONLINE       0     0     0  (untrimmed)

# ./cmd.sh zpool trim zones
cannot trim 'disk2': no such device in pool

It appears the low-level Linux TRIM takes one (block,szie) pair only - so the vdev_disk/vdev_file call is no longer an array. Can I confirm it will stay a single entry each time, will simplify my own code permanently.

@behlendorf
Copy link
Contributor Author

@bgly am I correct in understanding the issue you're currently seeing only occurs when using the partial trim option? It sounds as if a normal trim proceeded as expected.

@lundman thanks for picking this up for OSX. Regarding the short names this code was picked up from zpool initialize, do you not have the same issue there?

Can I confirm it will stay a single entry each time

Yes, the expectation is we'll keep it this way.

@kpande thanks for letting me know, I'll run that down.

@bgly
Copy link

bgly commented Feb 22, 2019

The issue with completion taking awhile seems to be on partial trim. The issue with the infinite looping of stat going from 0-100 I only tried with full trim, suspend, cancel, then partial. This then causes that infinite loop to occur until an export and import is done.

@behlendorf
Copy link
Contributor Author

@bgly I wasn't able to reproduce your issue on my test setup. I didn't see any kind of looping behavior, but I did fix two issues related to cancelling a TRIM. When a TRIM was cancelled (instead of completed) it would not restart at the beginning nor would it clear a previously requested TRIM rate. Both issues have been fixed, and might explain some of the slowness you were seeing if you used the -r option.

@dweeezil
Copy link
Contributor

I'd like to update the status of the issue uncovered by the autotrim_trim_integrity test. @behlendorf wrote in the initial PR comment above:

... Additionally, the new autotrim_trim_intergrity ZTS test case which verifies correctness when autotrim=on, zpool trim is run frequently, and the pool in allocating/freed has been observed to occasionally fail with checksum errors. This is currently being investigated, the issue has been non-reproducible when running autotrim or manual TRIM independently.

Investigation of the issue over the past several days has shown the problem appears to actually be with manual trim and can occur any time a manual trim is performed while other allocations are occurring. I've been able to trivially reproduce the problem with specially created small pools and tuning set to reduce the number of metaslabs.

I have pushed e196c2a to dweeezil:ntrim4 as heavy-handed fix. This patch is highly unlikely to be the ultimate solution but it does seem to make manual trim safe to use.

UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more effeciently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and assoicated spa taskq.  This new type makes
is straightforward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This made it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree.  The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Contributions-by: Saso Kiselkov <[email protected]>
Contributions-by: Tim Chase <[email protected]>
Contributions-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
* Updated comments.
* Converted ZIO_PRIORITY_TRIM to ASSERTs in vdev_queue_io()

Signed-off-by: Brian Behlendorf <[email protected]>
* Added Big Theory section which discusses the high level details
  of both manual and automatic TRIM.

* Updated partial TRIM description in zpool(8) using Richard
  Laagers proposed language.

* Documented the hot spare / replacing limitation.

* Brought zio_vdev_io_assess()  back inline with the existing code
  and fixed the 'persistent bit' portion of the comment.

* Fixed typos in comments.

Signed-off-by: Brian Behlendorf <[email protected]>
During review it was observed that the current semantics of a partial
TRIM are not well-defined.  If we make the current partial trim a first
class feature, folks will come to rely on the current behavior, despite
any disclaimers in the manpage. This creates risk to future changes, which
must now maintain the behavior of trying not to touch new metaslabs until
absolutely necessary, lest partial trims become less useful.

For this reason, until the work is done to fully design a well-defined
partial TRIM feature it has been removed from the `zpool trim` command.

That said, the existing partial TRIM behavior is still highly useful for
very large thinly-provisioned pools.  Therefore, it has been converted to
a new 'zfs_trim_metaslab_skip' module option which is the appropriate
place for this kind of setting.  All of the existing partial TRIM behavior
has been retained.

To issue a partial TRIM set 'zfs_trim_metaslab_skip=1' then initiate a
manual TRIM.  After this point the module option may be changed without
any effect on the running TRIM.  The setting is stored on-disk and will
persist for the duration of the requested TRIM.

$ echo 1 >/sys/module/zfs/parameters/zfs_trim_metaslab_skip
$ zpool trim partial-pool

This means that a full TRIM can be requested for a different pool while
the previous partial TRIM is still running.

$ echo 0 >/sys/module/zfs/parameters/zfs_trim_metaslab_skip
$ zpool trim full-pool

Signed-off-by: Brian Behlendorf <[email protected]>
The issued_trim flag must always be set to B_TRUE before calling
vdev_trim_ranges() since this function may be interrupted.  In
which case there can be outstanding TRIM I/Os which must complete
before allowing new allocations.

Signed-off-by: Brian Behlendorf <[email protected]>
@behlendorf
Copy link
Contributor Author

@rugubara thanks for putting together that history. I was able to confirm that the versions you listed should never have suffered from the defect I referenced. While re-reviewing the code for potential issues I did find and fix a potential problem, see cd265fd, but given the scenario you described in the #8419 (comment) above I don't believe you could have hit it.

I'll go over everything again, but at the moment I don't have an explanation for how the pool could have been damaged.

@zerk666
Copy link

zerk666 commented Mar 27, 2019

@rugubara might have missed it, but which SSD brand/models were used when the corruption happened?

@rugubara
Copy link
Contributor

@rugubara might have missed it, but which SSD brand/models were used when the corruption happened?

These are SATA toshiba 60GB (ZIL) and SATA ocz vertex 120GB (l2arc). My understanding is that zil and l2arc are protected by checksums as well - i.e. if I have an abnormal restart and zil is corrupted, the import will stop complaining about checksum errors in ZIL. I can't imagine how a failed ZIL can corrupt the pool.

Observed while running ztest.  The SCL_CONFIG read lock should
not be held when calling metaslab_disable() because the following
dead lock is possible.  This was introduced by allowing the
metaslab_enable() function to wait for a txg sync.

Thread 1191 (Thread 0x7ffa908c1700 (LWP 16776)):
  cv_wait
  spa_config_enter		-> Waiting for SCL_CONFIG read lock,
				   blocked due to pending writer (994)
  spa_txg_history_init_io
  txg_sync_thread

Thread 994 (Thread 0x7ffa903a4700 (LWP 16791)):
  cv_wait
  spa_config_enter		-> Waiting for SCL_ALL write lock,
				   blocked due to holder (983)
  spa_vdev_exit
  spa_vdev_add
  ztest_vdev_aux_add_remove	-> Holding ztest_vdev_lock
  ztest_execute

Thread 1001 (Thread 0x7ffa90902700 (LWP 18020)):
  cv_wait
  txg_wait_synced		-> Waiting on txg sync to decrement
				   mg_ms_disabled (1191)
  metaslab_enable
  vdev_autotrim_thread

Thread 983 (Thread 0x7ffa63451700 (LWP 18019)):
  cv_wait
  metaslab_group_disabled_increment -> Waiting for mg_ms_disabled to be
				       less than max_disabled_ms (1001)
  metaslab_disable
  vdev_initialize_thread	-> *** Holding SCL_CONFIG read lock ***

Signed-off-by: Brian Behlendorf <[email protected]>
module/zfs/vdev_trim.c Outdated Show resolved Hide resolved
@rugubara
Copy link
Contributor

@rugubara thanks for putting together that history. I was able to confirm that the versions you listed should never have suffered from the defect I referenced. While re-reviewing the code for potential issues I did find and fix a potential problem, see cd265fd, but given the scenario you described in the #8419 (comment) above I don't believe you could have hit it.

I'll go over everything again, but at the moment I don't have an explanation for how the pool could have been damaged.

Colleagues, should I keep a damaged pool for investigation? I don't have enough space to copy the data and I would like to rebuild the system if there is no hope of recovering the data. Are you interested in poking the damaged pool for details?

Always wait for the next txg after checking a metaslab group.
When no TRIM commands were issued then don't force a quiesce
so idle pools continue to generate no-op txgs.

Signed-off-by: Brian Behlendorf <[email protected]>
@behlendorf
Copy link
Contributor Author

behlendorf commented Mar 27, 2019

@rugubara unfortunately I don't think the damaged files are recoverable. The scrub would have repaired them if it were possible However, before you do rebuild the pool could you spot check a few of the damaged files and check when they were last updated (atime / mtime). That may provide some insight if this is somehow related to replaying the zil after an abnormal shutdown.

The pool is 6 HDD RAIDZ2 + 1 SLOG (SSD) + 1 L2ARC (SSD).
These are SATA toshiba 60GB (ZIL) and SATA ocz vertex 120GB (l2arc).

Do I understand correctly that the primary pool RAIDZ2 storage is using HDDs and SSDs are only used for the ZIL and L2ARC devices? This would mean that TRIM commands would only be sent to the ZIL device and never the primary pool. Could you please run zpool status -t to verify that only the log device supports TRIM. And if it would make sense, you can run zpool iostat -r and check the "trim / ind" column to determine if any TRIM commands have been sent to the device since the pool was imported.

My understanding is that zil and l2arc are protected by checksums as well - i.e. if I have an abnormal restart and zil is corrupted, the import will stop complaining about checksum errors in ZIL. I can't imagine how a failed ZIL can corrupt the pool.

That's right. The ZIL is checksummed and it won't be replayed if it's corrupted. I don't see how it could have caused this issue either.

@rugubara
Copy link
Contributor

rugubara commented Mar 28, 2019

@rugubara ... could you spot check a few of the damaged files and check when they were last updated (atime / mtime). That may provide some insight if this is somehow related to replaying the zil after an abnormal shutdown.

The list of damaged files starts with a very long list of metadata. I think it is the reason why I can't mount majority of my datasets.
The second biggest contributors to the error list are the snapshots.
As for the files in the datasets that I can mount and look at the atime/mtime, they are very rare - 4 entries out of 246000.

3 are directories
/usr/lib32 mtime 2019-03-21 08:36
/usr/lib64 mtime 2019-03-21 08:37
/usr/libexec mtime 2019-03-21 08:37

and 1 file
/bin/hostname mtime 2019-03-20 07:42

The pool is 6 HDD RAIDZ2 + 1 SLOG (SSD) + 1 L2ARC (SSD).
These are SATA toshiba 60GB (ZIL) and SATA ocz vertex 120GB (l2arc).

... Could you please run zpool status -t to verify that only the log device supports TRIM. And if it would make sense, you can run zpool iostat -r and check the "trim / ind" column to determine if any TRIM commands have been sent to the device since the pool was imported.

zpool status reports that main pool devices don't support trim, and log and cache devices suppot it.
zpool iostat shows some activity in trim columns.

@behlendorf
Copy link
Contributor Author

@rugubara thank you, I wish I had an explanation for you. But I don't see anyway the TRIM code could have caused this. Nor was I able to reproduce any problem locally by repeatedly power cycling a trimming pool which was making heavy use of a dedicated log device. Additionally, ztest(1) which is great at uncovering these kind of issues hasn't encountered anything like this after mutliple weeks or run time. Whatever issue you encountered, I don't believe it was related this PR.

@behlendorf behlendorf merged commit 1b93956 into openzfs:master Mar 29, 2019
allanjude pushed a commit to allanjude/zfs that referenced this pull request Jun 29, 2019
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree.  The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Contributions-by: Saso Kiselkov <[email protected]>
Contributions-by: Tim Chase <[email protected]>
Contributions-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#8419
Closes openzfs#598
@ghost
Copy link

ghost commented Aug 9, 2020

"Be aware that automatic trimming of recently freed data blocks can put significant stress on the underlying storage devices." Link

Does this still apply? Since turning on autotrim=on is the default from now on in some Ubuntu packages and components.
I would like to know if this is still correct. I also saw some load fixes/commits

@behlendorf

@pdf
Copy link

pdf commented Aug 9, 2020

That statement applies regardless of the filesystem you're using, though batching helps somewhat, and the severity of the impact may vary by storage device @TB-G . Manual trim on a schedule is often the optimal approach.

@ghost
Copy link

ghost commented Aug 9, 2020

Im curious about this part "put significant stress on the underlying storage devices". Of course autotrim takes some performance for a few seconds, but is this worse than with other (file)systems?

This is listed separately in the ZFS documentation and so I wonder why they (Ubuntu) turn on or recommend autotrim=on (sometimes in combination with zpool trim) on Desktop and Server OS. Does it have more advantages than disadvantages? Are we talking about NVMe vs SSD? And why (and when) does it cause significant stress on the underlying storage devices?

@pdf
Copy link

pdf commented Aug 9, 2020

This is a hardware issue, not a filesystem issue @TB-G I'd recommend doing some research.

@ghost
Copy link

ghost commented Aug 9, 2020

This is a hardware issue, not a filesystem issue @TB-G I'd recommend doing some research.

I'm not talking about hardware or filesystems.

I want to know why this is explicitly stated in the documentation and I want to know how the implementation of ZFS handles this and why ubuntu turns this on by default. See my other questions.

@pdf
Copy link

pdf commented Aug 9, 2020

You need to be talking about hardware, that's what this is about. It's explicitly stated in the documentation, because it can be important enough that users should be aware of it. You will need to ask the Canonical maintainer for their rationale in enabling it by default there, this is not the correct place for that.

@ghost
Copy link

ghost commented Aug 9, 2020

You need to be talking about hardware, that's what this is about. It's explicitly stated in the documentation, because it can be important enough that users should be aware of it. You will need to ask the Canonical maintainer for their rationale in enabling it by default there, this is not the correct place for that.

I will certainly ask. And I understand that the underlying storage applies, but that is not my question. I want to know how ZFS handles this even though it depends on the underlying hardware. So my conclusion is that autotrim can cause higher iowait in large SSD/NVMe pools and that this still applies to ZFS pools. In this case it is better for me to turn off autotrim and run zpool trim periodically. I just wanted to make this clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested) Type: Feature Feature request or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.