Limit async IO starvation caused by elevator sorting. #9714
Conversation
Async IO starvation is rare, since for reads important IO can be promoted by a demand read, and for writes it is limited by the time of a single TXG commit. Unfortunately we saw multiple cases in the wild where some very slow (probably failing) drive extends the TXG commit for many minutes, which combined with IO starvation makes vdev_deadman() crash the system without any IO request really being stuck at the block layer. This change introduces hard ~4 second time intervals, limiting the scope of elevator sorting and so the maximal starvation time. With allocation throttling limiting the maximal queue depth, this effectively prevents vdev_deadman() from false positive reactions. In a manually simulated environment that reliably triggered the panic within minutes, this patch solves the problem as soon as allocation throttling is enabled.
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Curious why you wouldn't want deadman kicking in, but also what values do you have for zfs_vdev_async_*? How many IO/s was your test case? The deadman solved a few issues we were having with drives that didn't report any error at all but were in fact dead or dying. S'pose there is a case to be had for highly optimized IO, blasting as much async IO as possible and letting the kernel/zfs sort it out.
In the case of the real failed drive I am testing now, a reboot fixes nothing, so it is pointless for deadman to activate. Plus the drive still works correctly, just very slowly, so if not for the unfair scheduling in ZFS there would be no reason for deadman to activate, since there are no I/Os stuck in the disk. zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter what the sizes or locations are. The workload is just a single ZVOL write with a small block size, intentionally killed aggregation, and slightly hacked allocation to simulate a fragmented pool.
As I said, a reboot solves nothing in this case, it just creates more downtime and questions. As for sorting, I doubt that sorting beyond multiple seconds makes any sense, considering seek times of a dozen milliseconds. Plus allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.
That sounds oddly like what we were hitting with Ironwolf and WD Red drives supposedly flashed with "NAS firmware". See #6885
So how do you know sorting is causing it, and not the fact that the drive is failing or you've over-provisioned something? zfs_vdev_max_active is, I believe, 1000.
Same reasons you'd delay transaction commit and buffer. There can be an enormous benefit with optimized random IO and thrashing.
Indeed, which is why I'm a bit puzzled at what bottleneck you're hitting and why; vdev_queue_class_min_active, for example. I was expecting you to come back with "we're throwing a few million IOs at it", or some such.
I was able to insert a few printf's to track queue depth and request latency, and at the same time monitor the IOPS done by the drive. As I said, the failing drive is on my desk and I can measure anything. I see that while some requests are progressing, the depth of the active queue remains constant, IIRC about 10, and vq_write_offset_tree about 60, while the worst request latency grows linearly, which tells me that that request was never executed.
No. The benefit cannot be enormous after you have already waited 4 seconds trying to save 10 milliseconds. The elevator algorithm is questionable in general these days if blocks are not directly consecutive, considering that in some cases a seek may be faster than waiting for a full platter rotation to reach a position that is closer in LBA.
Yes, we are throwing ~10K IOs at it, even if not at the same time. Just imagine requests in such LBA order: 1, 2, 10000, 3, 4, 5, ... 9999. How long do you think the request at 10000 will wait in the queue? I tell you: till the end. New requests will come in while earlier ones complete, and that 10000 will still wait indefinitely.
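A toy model of this (hypothetical demo code, not vdev_queue.c) shows the effect: if the scheduler always dispatches the queued request closest to the current head position, and new nearby requests keep arriving as earlier ones complete, the distant request at LBA 10000 is never chosen:

```c
/*
 * Toy model of pure offset ("elevator") scheduling and the starvation it
 * can cause.  Hypothetical, for illustration only.
 */
#include <stdio.h>
#include <stdlib.h>

#define	QUEUE_DEPTH	8

int
main(void)
{
	long lba[QUEUE_DEPTH] = { 1, 2, 10000, 3, 4, 5, 6, 7 };
	long next_arrival = 8;	/* new requests keep arriving at low LBAs */
	long head = 0;		/* current head position */

	for (int dispatched = 0; dispatched < 20; dispatched++) {
		int best = 0;

		/* Always pick the queued request closest to the head. */
		for (int i = 1; i < QUEUE_DEPTH; i++) {
			if (labs(lba[i] - head) < labs(lba[best] - head))
				best = i;
		}
		printf("dispatching LBA %ld\n", lba[best]);
		head = lba[best];
		lba[best] = next_arrival++;	/* replaced by a new nearby one */
	}

	for (int i = 0; i < QUEUE_DEPTH; i++) {
		if (lba[i] == 10000)
			printf("LBA 10000 is still waiting in the queue\n");
	}
	return (0);
}
```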
Then we agree it's a hardware problem and not scheduler, no??
10ms is indeed inconsequential, but given I do exactly that, I disagree. I'm referring to the aggregate on non-interactive workloads.
You're throwing 10k IO's at a single platter disk?? Something isn't adding up.
I know that this drive is failing! But I don't want my system to reboot every 15 minutes because of it, blaming requests that were not even sent to the disk, waiting in a queue for 15 minutes!
I'm sorry, I lost you here. I am saying that breaking sequential operation once per 4 seconds for some fairness is not a high price. Do you agree or not?
You are doing just the same. Just create a single-disk pool on a machine with lots of RAM. In that case the dirty data maximum will be some GBs of data that may be assigned to the same transaction group before write throttling activates. Divide those few GBs by some small block/record size and you may get millions, not just some 10k IOs, at a single platter disk.
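As a rough illustration with hypothetical numbers: assuming a ~4 GiB dirty data limit and a 4 KiB volblocksize, 4 GiB / 4 KiB = 1,048,576 writes that could be assigned to a single TXG and handed to that one disk's queue.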
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #9714      +/-   ##
==========================================
+ Coverage      67%      80%      +13%
==========================================
  Files         337      288       -49
  Lines      106369    82326    -24043
==========================================
- Hits        71480    65821     -5659
+ Misses      34889    16505    -18384
```
Continue to review full report at Codecov.
Since I do mostly high speed storage, I can say I'd rather not have more code trying to sort out HDD optimizations in the data path -- it can be faster to just put it on disk.
Seriously though, I'd think that depends on why you're waiting. What is the source of synchronous IO to such a degree that it is starving other classes?
That can be a negative outcome. What type of IO and what priorities are you referring to? Asynchronous IO is, by definition, asynchronous. That 4 seconds can be 2 requests or millions. I wouldn't agree that IO scheduled at batch priority should just preempt realtime.
These are different things though.
OK, if there is no interest in this approach, I'll close this PR for now and think more.
Async IO starvation is rare, since for reads important IO can be promoted by a demand read, and for writes it is limited by the time of a single TXG commit. Unfortunately we saw multiple cases in the wild where some very slow (probably failing) drive extends the TXG commit for many minutes, which combined with IO starvation makes vdev_deadman() crash the system without any IO request really being stuck at the block layer.
This change introduces hard ~4 second time intervals, limiting the scope of elevator sorting and so the maximal starvation time. With allocation throttling limiting the maximal queue depth, this effectively prevents vdev_deadman() from false positive reactions. In a manually simulated environment that reliably triggered the panic within minutes, this patch solves the problem as soon as allocation throttling is enabled.
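Conceptually, the idea can be sketched as follows (a minimal illustration under my own naming, not the code in this PR; demo_io_t and the 4 second constant are assumptions): the elevator comparator treats the arrival time, rounded to ~4 second windows, as its primary key, so sorting never reorders I/Os across window boundaries and no request can be starved for more than roughly one window behind later arrivals.

```c
/* Sketch only: time-window-bounded elevator comparator. */
#include <stdint.h>

#define	SORT_WINDOW_NS	(4ULL * 1000000000ULL)	/* hard ~4 second interval */

typedef struct demo_io {
	uint64_t io_timestamp;	/* arrival time, nanoseconds */
	uint64_t io_offset;	/* byte offset on the vdev */
} demo_io_t;

static int
demo_elevator_compare(const void *x1, const void *x2)
{
	const demo_io_t *a = x1, *b = x2;
	uint64_t wa = a->io_timestamp / SORT_WINDOW_NS;
	uint64_t wb = b->io_timestamp / SORT_WINDOW_NS;

	/* Primary key: coarse time window; no sorting across windows. */
	if (wa != wb)
		return (wa < wb ? -1 : 1);

	/* Within a window, fall back to the usual offset (elevator) order. */
	if (a->io_offset != b->io_offset)
		return (a->io_offset < b->io_offset ? -1 : 1);
	return (0);
}
```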
I also switched vq_active_tree to vdev_queue_timestamp_compare(), since its only consumer is vdev_deadman(), which, to work most efficiently, would prefer sorting by time rather than by offset.
Also, while being there, I've added an additional offset comparison into vdev_queue_timestamp_compare(). It is rare, but on systems with low timer resolution I think it is better to sort IOs by a meaningful offset rather than by essentially random memory addresses.
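A sketch of such a comparator (illustrative names, not the exact comparator from the tree): arrival time is the primary key, which is what vdev_deadman() cares about, and offset is the tiebreaker for systems with low timer resolution where many I/Os share a timestamp.

```c
/* Sketch only: timestamp comparator with an offset tiebreaker. */
#include <stdint.h>

typedef struct demo_io {
	uint64_t io_timestamp;	/* arrival time */
	uint64_t io_offset;	/* byte offset on the vdev */
} demo_io_t;

static int
demo_timestamp_compare(const void *x1, const void *x2)
{
	const demo_io_t *a = x1, *b = x2;

	/* Sort by arrival time first, oldest I/O at the head of the tree. */
	if (a->io_timestamp != b->io_timestamp)
		return (a->io_timestamp < b->io_timestamp ? -1 : 1);

	/* Tiebreak by offset rather than by random memory addresses. */
	if (a->io_offset != b->io_offset)
		return (a->io_offset < b->io_offset ? -1 : 1);
	return (0);
}
```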
If somebody sees a more efficient way to limit elevator sorting while using AVL trees, or has good ideas about the maximum starvation time instead of my 4 seconds, I'll be happy to discuss.
How Has This Been Tested?
I've manually crafted the panic scenario using a real failing HDD doing only 60 write operations per second on FreeBSD head, and confirmed that this change fixes it.