Write throttle WIP #1696

nedbass · 2013-09-04T20:30:28Z

4045 zfs write throttle & i/o scheduler performance work

ZFS on Linux port of Illumos patch to implement a new IO scheduler.

ryao · 2013-09-08T21:50:45Z

@nedbass ryao/spl@bdc9489 fixes the constructor/destructor calls. It would be better to adopt that patch to fix the SPL than it is to workaround it.

nedbass · 2013-09-09T16:57:53Z

Good idea @ryao. I suggest we handle your patch in a separate pull request from this one. I want to minimize the amount of code to review here. Your SPL patch should have a companion ZFS patch to clean up all of the zio_t initialization that would become superfluous in zio_create().

ryao · 2013-11-04T17:52:28Z

It looks like the changes made in c418410 were dropped in this patch. Also, this needs to be refreshed now that the kstat changes have been merged.

nedbass · 2013-11-04T18:09:40Z

Thanks @ryao, I'll get this refreshed this week.

behlendorf · 2013-11-04T20:56:55Z

@nedbass Most (maybe all) of the dependent changes have been merged in to master. This should make it pretty easy to rebase when you get to it.

stevenj · 2013-11-15T01:07:44Z

A Real World use case:
Netbook AMD E2 CPU (1.6Ghz), Single Seagate Momentus 5400.6 (ST9500325AS, 500 GB)
Single disk zpool with ashift=12, a swapper in a zdev and lz4 compression on most file sets.
Usage: Single user desktop usage, coding, chromium, compiling, testing, email, etc.

My problem:
I was using Gentoo's 0.6.2 ZFS Ebuild. What I was seeing was progressively worse IO. When i first installed gentoo/zfs i didnt notice these issues, but as time progressed they got worse, and lately much much worse.

Example, if nothing else was happening on the machine, I could start a Terminal and in bash type "di" I would then get 10 seconds or more of 100% IO for the bash process, before it would finally give me any completion results. Doing it again and i got the same thing. Running builds would take a subjective eternity to complete because Auto-tools configuration tests would take many seconds each to complete, also showing 100% IO in iotop. txg_sync was sitting high in iotop a lot. Even though IO was pegged at 100%, disk throughput was very poor (low 10's of MB/s). At the same time for these tasks CPU utilization as shown by top would not be anywhere near 100%. This was endemic, everything was showing these symptoms to a lesser or greater extent. rsync to/from the netbook also took an eternity, with 100% IO, and little CPU.

I tried messing with arc max/min parameters which provided no change in the issue.

Changed to Gentoo-9999 Ebuild and 3.12 Linux kernel, and there was no un-subjectively discernible improvement.

Applied this patch to Gentoo-9999 and my IO problems disappeared. All tunables were left at default, they seemed to just work.

Everything seems to run as I would normally expect and is as expected, limited by CPU performance and not IO performance. I have had no strange IO issues, or slow downs, or corruptions, and interactive performance under X has also improved. txg_sync no longer hovers at the top of iotop it comes momentarily and just as quickly disappears. Its been in use for about a week continuously on this machine.

Upshot, I will not stop using this patch because without it this netbook had become unusable.
Thanks Ned. (and Ryao for pointing the patch out to me)

PS. I plan on installing it on my NAS box for video serving over the weekend, as it gets horrible IO problems when rar is executing at the same time as video is playing (ie, rar stops video dead). Hopefully it might help that problem. Will report again once tested.

nedbass · 2013-11-15T01:44:45Z

Thanks for the feedback @stevenj. We've been hoping to get some real world testing on this (but back up important data, please :)) I do need to refresh the patch to fix a bug that could cause a lockup, so watch for a new version after this post.

nedbass · 2013-11-18T07:50:07Z

I've written up some benchmark results for this patch here.

behlendorf · 2013-11-20T22:06:52Z

Nice graphs! It's pretty clear that before this can be merged we'll need to resolve the performance drop off for large thread counts.

nedbass · 2013-11-26T18:54:30Z

I haven't finished re-running the read tests with Illumos 4347, but I've posted new results for the write tests and they look really good now. https://github.com/nedbass/zfs/wiki/ZFS-Write-Throttle-Patch-Benchmarks

nedbass · 2013-11-27T20:17:50Z

Read results are now posted.

behlendorf · 2013-11-29T21:18:50Z

@nedbass Nice results! It's clear that pulling in 57b270f addressed our performance concerns. I don't see any significant performance regressions in the data and it's clear that certain pool configurations and workloads will see a nice boost.

The only remaining wrinkle is correctly handling the putpage() case. We're not going to be able to use TXG_WAIT here like on Illumos because we need to avoid blocking here. This code really needs to continue to use TXG_NOWAIT but be update to return AOP_WRITEPAGE_ACTIVATE back to .writepage() so the VM will attempt to write out a different page.

Fix a lock contention issue by allowing threads not holding ZPL locks to block when waiting to assign a transaction. Porting Notes: zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version. This case may be a contention point just like zfs_write(), however it is not safe to block here since it may be called during memory reclaim. Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Dan McDonald <[email protected]> Reviewed by: Boris Protopopov <[email protected]> Approved by: Dan McDonald <[email protected]> References: illumos/illumos-gate@e722410 https://www.illumos.org/issues/4347 Ported-by: Ned Bass <[email protected]>

nedbass · 2013-12-03T20:17:12Z

Refreshed with per-queue active I/O limits exported as module options and zfs-module-parameters man page updates. I think this addresses all outstanding cleanup issues for this patch.

4045 zfs write throttle & i/o scheduler performance work 1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from this class using either an elevator algorithem (async, scrub classes) or FIFO (sync classes). The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c (reproduced below) for more details. 2. The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for more details. This diff has several other effects, including: * the commonly-tuned global variable zfs_vdev_max_pending has been removed; use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead. * the size of each txg (meaning the amount of dirty data written, and thus the time it takes to write out) is now controlled differently. There is no longer an explicit time goal; the primary determinant is amount of dirty data. Systems that are under light or medium load will now often see that a txg is always syncing, but the impact to performance (e.g. read latency) is minimal. Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this. * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression, checksum, etc. This improves latency by not allowing these CPU-intensive tasks to consume all CPU (on machines with at least 4 CPU's; the percentage is rounded up). --matt APPENDIX: problems with the current i/o scheduler The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem with this is that if there are always i/os pending, then certain classes of i/os can see very long delays. For example, if there are always synchronous reads outstanding, then no async writes will be serviced until they become "past due". One symptom of this situation is that each pass of the txg sync takes at least several seconds (typically 3 seconds). If many i/os become "past due" (their deadline is in the past), then we must service all of these overdue i/os before any new i/os. This happens when we enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in the future. If we can't complete all the i/os in 2.5 seconds (e.g. because there were always reads pending), then these i/os will become past due. Now we must service all the "async" writes (which could be hundreds of megabytes) before we service any reads, introducing considerable latency to synchronous i/os (reads or ZIL writes). Notes on porting to ZFS on Linux: - zio_t gained new members io_physdone and io_phys_children. Because object caches in the Linux port call the constructor only once at allocation time, objects may contain residual data when retrieved from the cache. Therefore zio_create() was updated to zero out the two new fields. - vdev_mirror_pending() relied on the depth of the per-vdev pending queue (vq->vq_pending_tree) to select the least-busy leaf vdev to read from. This tree has been replaced by vq->vq_active_tree which is now used for the same purpose. - vdev_queue_init() used the value of zfs_vdev_max_pending to determine the number of vdev I/O buffers to pre-allocate. That global no longer exists, so we instead use the sum of the *_max_active values for each of the five I/O classes described above. - The Illumos implementation of dmu_tx_delay() delays a transaction by sleeping in condition variable embedded in the thread (curthread->t_delay_cv). We do not have an equivalent CV to use in Linux, so this change replaced the delay logic with a wrapper called zfs_sleep_until(). This wrapper could be adopted upstream and in other downstream ports to abstract away operating system-specific delay logic. - These tunables are added as module parameters, and descriptions added to the zfs-module-parameters.5 man page. spa_asize_inflation zfs_deadman_synctime_ms zfs_vdev_max_active zfs_vdev_async_write_active_min_dirty_percent zfs_vdev_async_write_active_max_dirty_percent zfs_vdev_async_read_max_active zfs_vdev_async_read_min_active zfs_vdev_async_write_max_active zfs_vdev_async_write_min_active zfs_vdev_scrub_max_active zfs_vdev_scrub_min_active zfs_vdev_sync_read_max_active zfs_vdev_sync_read_min_active zfs_vdev_sync_write_max_active zfs_vdev_sync_write_min_active zfs_dirty_data_max_percent zfs_delay_min_dirty_percent zfs_dirty_data_max_max_percent zfs_dirty_data_max zfs_dirty_data_max_max zfs_dirty_data_sync zfs_delay_scale The latter four have type unsigned long, whereas they are uint64_t in Illumos. This accommodates Linux's module_param() supported types, but means they may overflow on 32-bit architectures. The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most likely to overflow on 32-bit systems, since they express physical RAM sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to 2^32 which does overflow. To resolve that, this port instead initializes it in arc_init() to 25% of physical RAM, and adds the tunable zfs_dirty_data_max_max_percent to override that percentage. While this solution doesn't completely avoid the overflow issue, it should be a reasonable default for most systems, and the minority of affected systems can work around the issue by overriding the defaults. - Fixed reversed logic in comment above zfs_delay_scale declaration. - Clarified comments in vdev_queue.c regarding when per-queue minimums take effect. - Replaced dmu_tx_write_limit in the dmu_tx kstat file with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts how many times a transaction has been delayed because the pool dirty data has exceeded zfs_delay_min_dirty_percent. The latter counts how many times the pool dirty data has exceeded zfs_dirty_data_max (which we expect to never happen). - The original patch would have regressed the bug fixed in openzfs/zfs@c418410, which prevented users from setting the zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE. A similar fix is added to vdev_queue_aggregate(). - In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the heap instead of the stack. In Linux we can't afford such large structures on the stack. References: http://www.illumos.org/issues/4045 illumos/illumos-gate@69962b5 Ported-by: Ned Bass <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed by: Adam Leventhal <[email protected]> Reviewed by: Christopher Siden <[email protected]> Reviewed by: Ned Bass <[email protected]> Reviewed by: Brendan Gregg <[email protected]> Approved by: Robert Mustacchi <[email protected]> Closes openzfs#1913

nedbass · 2013-12-04T20:49:17Z

Refreshed with zfs_dirty_data_max_max initialized as a percentage of RAM to avoid overflow on 32-bit systems. Also incorporated tunable from #1913.

nedbass · 2013-12-04T22:07:12Z

@behlendorf the fc19 VM that hung running filebench with this patch seems to be waiting on a "lost" zio. I've posted zpool events -v and task state. It's not immediately clear if or how this is related to this patch set.

behlendorf · 2013-12-06T19:08:42Z

Merged as:

e8b96c6 Illumos #4045 write throttle & i/o scheduler performance work
384f8a0 Illumos #4347 ZPL can use dmu_tx_assign(TXG_WAIT)

The nreserved column in the txgs kstat file always contains 0 following the write throttle restructuring of commit e8b96c6. Prior to that commit, the nreserved column showed the number of bytes temporarily reserved in the pool by a transaction group at sync time. The new write throttle did away with temporary reservations and uses the amount of dirty data instead. To approximate the old output of the txgs kstat, the number of dirty bytes per-txg was passed in as the nreserved value to spa_txg_history_set_io(). This approach did not work as intended, because the per-txg dirty value is decremented as data is written out to disk, so it is zero by the time we call spa_txg_history_set_io(). To fix this, save the number of dirty bytes before calling spa_sync(), and pass this value in to spa_txg_history_set_io(). Also, since the name "nreserved" is now a misnomer, the column heading is now labeled "ndirty". Signed-off-by: Ned Bass <[email protected]> Issue openzfs#1696

The nreserved column in the txgs kstat file always contains 0 following the write throttle restructuring of commit e8b96c6. Prior to that commit, the nreserved column showed the number of bytes temporarily reserved in the pool by a transaction group at sync time. The new write throttle did away with temporary reservations and uses the amount of dirty data instead. To approximate the old output of the txgs kstat, the number of dirty bytes per-txg was passed in as the nreserved value to spa_txg_history_set_io(). This approach did not work as intended, because the per-txg dirty value is decremented as data is written out to disk, so it is zero by the time we call spa_txg_history_set_io(). To fix this, save the number of dirty bytes before calling spa_sync(), and pass this value in to spa_txg_history_set_io(). Also, since the name "nreserved" is now a misnomer, the column heading is now labeled "ndirty". Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1696

ryao mentioned this pull request Sep 8, 2013

limited performance #1705

Closed

ryao mentioned this pull request Oct 9, 2013

Newer Illumos code #1775

Closed

kdavyd mentioned this pull request Dec 1, 2013

Added a tunable for spa_get_asize #1913

Closed

nedbass mentioned this pull request Dec 2, 2013

Record tx_assign_histogram correctly #1915

Closed

nedbass mentioned this pull request Dec 4, 2013

txg_sync hung in zio_wait() #1928

Closed

DeHackEd mentioned this pull request Dec 5, 2013

Heavy lock contention in zfs_zinactive causing ARC metadata to grow #1932

Closed

behlendorf closed this Dec 6, 2013

evujumenuk mentioned this pull request Apr 26, 2017

zfs_dirty_data_max_max grows without bound as memory is installed #6072

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write throttle WIP #1696

Write throttle WIP #1696

nedbass commented Sep 4, 2013

ryao commented Sep 8, 2013

nedbass commented Sep 9, 2013

ryao commented Nov 4, 2013

nedbass commented Nov 4, 2013

behlendorf commented Nov 4, 2013

stevenj commented Nov 15, 2013

nedbass commented Nov 15, 2013

nedbass commented Nov 18, 2013

behlendorf commented Nov 20, 2013

nedbass commented Nov 26, 2013

nedbass commented Nov 27, 2013

behlendorf commented Nov 29, 2013

nedbass commented Dec 3, 2013

nedbass commented Dec 4, 2013

nedbass commented Dec 4, 2013

behlendorf commented Dec 6, 2013

Write throttle WIP #1696

Write throttle WIP #1696

Conversation

nedbass commented Sep 4, 2013

ryao commented Sep 8, 2013

nedbass commented Sep 9, 2013

ryao commented Nov 4, 2013

nedbass commented Nov 4, 2013

behlendorf commented Nov 4, 2013

stevenj commented Nov 15, 2013

nedbass commented Nov 15, 2013

nedbass commented Nov 18, 2013

behlendorf commented Nov 20, 2013

nedbass commented Nov 26, 2013

nedbass commented Nov 27, 2013

behlendorf commented Nov 29, 2013

nedbass commented Dec 3, 2013

nedbass commented Dec 4, 2013

nedbass commented Dec 4, 2013

behlendorf commented Dec 6, 2013