
Limit async IO starvation caused by elevator sorting. #9714

Closed
wants to merge 1 commit

Conversation

amotin
Member

@amotin amotin commented Dec 11, 2019

Async IO starvation is rare: for reads, important IO can be promoted by a demand read, and for writes it is limited by the duration of a single TXG commit. Unfortunately, we have seen multiple cases in the wild where some very slow (probably failing) drive extends a TXG commit to many minutes, which combined with IO starvation makes vdev_deadman() crash the system even though no IO request is really stuck at the block layer.

This change introduces hard ~4-second time intervals that limit the scope of elevator sorting, and with it the maximum starvation time. With allocation throttling limiting the maximum queue depth, this effectively prevents vdev_deadman() false positives. In a manually simulated environment that reliably triggered the panic within minutes, this patch solves the problem once allocation throttling is enabled.

I also switched vq_active_tree to vdev_queue_timestamp_compare(), since its only consumer is vdev_deadman(), which works most efficiently when the tree is sorted by time rather than by offset.

While there, I also added an additional offset comparison to vdev_queue_timestamp_compare(). It rarely matters, but on systems with low timer resolution it seems better to sort IOs by a meaningful offset rather than by essentially random memory addresses.
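To make the mechanism concrete, here is a minimal, self-contained sketch of the two comparators described above. It is not the actual patch: the struct, field names, and constants are illustrative stand-ins for the queued-I/O fields the real vdev queue code uses, and the real AVL comparators would end with a pointer tie-break to keep tree keys unique, which is omitted here.

```c
#include <stdint.h>

#define SORT_INTERVAL_NS	(4ULL * 1000000000ULL)	/* the ~4 s window */

/* Illustrative stand-in for the queued-I/O fields that matter here. */
typedef struct qio {
	uint64_t q_timestamp;	/* time the I/O entered the queue, in ns */
	uint64_t q_offset;	/* offset used for elevator sorting */
} qio_t;

/*
 * Offset (elevator) comparator: compare a coarse time bucket first, so
 * offset sorting only reorders I/Os queued within the same ~4 s interval
 * and nothing can be bypassed for much longer than one interval.
 */
static int
qio_offset_compare(const qio_t *a, const qio_t *b)
{
	uint64_t ba = a->q_timestamp / SORT_INTERVAL_NS;
	uint64_t bb = b->q_timestamp / SORT_INTERVAL_NS;

	if (ba != bb)				/* older interval sorts first */
		return (ba < bb ? -1 : 1);
	if (a->q_offset != b->q_offset)		/* elevator order within it */
		return (a->q_offset < b->q_offset ? -1 : 1);
	return (0);
}

/*
 * Timestamp comparator (what vq_active_tree would use for vdev_deadman()):
 * primary key is queue time; offset is a meaningful secondary key for the
 * low-timer-resolution case, instead of falling straight back to comparing
 * memory addresses.
 */
static int
qio_timestamp_compare(const qio_t *a, const qio_t *b)
{
	if (a->q_timestamp != b->q_timestamp)
		return (a->q_timestamp < b->q_timestamp ? -1 : 1);
	if (a->q_offset != b->q_offset)
		return (a->q_offset < b->q_offset ? -1 : 1);
	return (0);
}
```

Within one ~4-second interval the queue still behaves as a normal elevator; across intervals the coarse age ordering caps how long a queued I/O can be bypassed by later, lower-offset arrivals.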

If somebody sees a more efficient way to limit elevator sorting while still using AVL trees, or has a better idea for the maximum starvation time than my 4 seconds, I'll be happy to discuss it.

How Has This Been Tested?

I manually crafted a panic scenario using a real failing HDD that does only 60 write operations per second on FreeBSD head, and confirmed that this change fixes it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

Async IO starvation is rare, since for read important IO can be
promoted by demand read, and for write it is limited by the time of
a single TXG commit.  Unfortunately we saw multiple cases in the wild
when some very slow (probably failing) drive extends TXG commit for
many minutes, which combined with IO starvation makes vdev_deadman()
crash the system without any IO request really stuck at the block
layer.

This change introduces hard ~4 second time intervals, limiting the
scope of elevator sorting and so the maximal starvation time.  With
allocation throttling limiting maximal queue depth, this effectively
prevents vdev_deadman() false positives.  In a manually simulated
environment, reliably triggering the panic within minutes, this patch
solves the problem once allocation throttling is enabled.

Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
@ahrens ahrens self-requested a review December 11, 2019 21:12
@h1z1

h1z1 commented Dec 11, 2019

Curious why you wouldn't want the deadman kicking in, but also what values do you use for zfs_vdev_async_*? How many IO/s was your test case?

The deadman solved a few issues we were having with drives that didn't report any error at all but were in fact dead or dying. S'pose there is a case to be made for highly optimized IO blasting as much async IO as possible and letting the kernel/zfs sort it out.

@amotin
Member Author

amotin commented Dec 11, 2019

Curious why you wouldn't want the deadman kicking in, but also what values do you use for zfs_vdev_async_*? How many IO/s was your test case?

In the case of the real failed drive I am testing now, a reboot fixes nothing, so it is pointless for the deadman to activate. Plus, the drive still works correctly, just very slowly, so if not for ZFS's unfair scheduling there would be no reason for the deadman to activate, since no I/O is stuck in the disk. zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter the sizes or locations. The workload is just a single ZVOL write with a small block size, intentionally disabled aggregation, and slightly hacked allocation to simulate a fragmented pool.

The deadman solved a few issues we were having with drives that didn't report any error at all but were in fact dead or dying. S'pose there is a case to be made for highly optimized IO blasting as much async IO as possible and letting the kernel/zfs sort it out.

As I said, a reboot solves nothing in this case, it just creates more downtime and questions. As for sorting, I doubt that sorting beyond multiple seconds makes any sense, considering seek times of a dozen milliseconds. Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

@h1z1

h1z1 commented Dec 12, 2019

Curious why you wouldn't want the deadman kicking in, but also what values do you use for zfs_vdev_async_*? How many IO/s was your test case?

In the case of the real failed drive I am testing now, a reboot fixes nothing, so it is pointless for the deadman to activate. Plus, the drive still works correctly, just very slowly, so if not for ZFS's unfair scheduling there would be no reason for the deadman to activate, since no I/O is stuck in the disk.

That sounds oddly like what we were hitting with Ironwolf and WD Red drives supposedly flashed with "NAS firmware". See #6885

zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter the sizes or locations. The workload is just a single ZVOL write with a small block size, intentionally disabled aggregation, and slightly hacked allocation to simulate a fragmented pool.

So how do you know sorting is causing it and not the fact that the drive is failing or you've over-provisioned something? zfs_vdev_max_active is, I believe, 1000.

The deadman solved a few issues we were having with drives that didn't report any error at all but were in fact dead or dying. S'pose there is a case to be made for highly optimized IO blasting as much async IO as possible and letting the kernel/zfs sort it out.

As I said, a reboot solves nothing in this case, it just creates more downtime and questions. As for sorting, I doubt that sorting beyond multiple seconds makes any sense, considering seek times of a dozen milliseconds.

Same reasons you'd delay transaction commit and buffer. There can be an enormous benefit with optimized random IO and thrashing.

Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

Indeed which is why I'm a bit puzzled at what bottleneck you're hitting and why. vdev_queue_class_min_active for example. I was expecting you to come back with "we're throwing a few million IO's at it", or some such.

@amotin
Member Author

amotin commented Dec 12, 2019

zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter the sizes or locations. The workload is just a single ZVOL write with a small block size, intentionally disabled aggregation, and slightly hacked allocation to simulate a fragmented pool.

So how do you know sorting is causing it and not the fact that the drive is failing or you've over-provisioned something?

I am able to insert a few printf's to track queue depths and request latency, and at the same time monitor the IOPS done by the drive. As I said, the failing drive is on my desk and I can measure anything. I see that while some requests are progressing, the depth of the active queue remains constant (IIRC about 10, with vq_write_offset_tree at about 60), while the worst request latency grows linearly, which tells me that that request was never executed.

The deadman solved a few issues we were having with drives that didn't report any error at all but were in fact dead or dying. S'pose there is a case to be made for highly optimized IO blasting as much async IO as possible and letting the kernel/zfs sort it out.

As I said, a reboot solves nothing in this case, it just creates more downtime and questions. As for sorting, I doubt that sorting beyond multiple seconds makes any sense, considering seek times of a dozen milliseconds.

Same reasons you'd delay transaction commit and buffer. There can be an enormous benefit with optimized random IO and thrashing.

No. The benefit cannot be enormous after you have already waited 4 seconds trying to save 10 milliseconds. The elevator algorithm is questionable in general these days if blocks are not directly consecutive, considering that in some cases a seek may be faster than waiting for a full platter rotation to reach a position that is closer in LBA.

Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

Indeed which is why I'm a bit puzzled at what bottleneck you're hitting and why. vdev_queue_class_min_active for example. I was expecting you to come back with "we're throwing a few million IO's at it", or some such.

Yes, we are throwing ~10K IOs at it, even if not all at the same time. Just imagine requests arriving in this LBA order: 1, 2, 10000, 3, 4, 5, ... 9999. How long do you think the request at 10000 will wait in the queue? I'll tell you: till the end. New requests keep coming in as earlier ones complete, and that 10000 will still wait indefinitely.
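To make that concrete, here is a toy, self-contained model (not ZFS code) of the pattern: an elevator that always issues the pending request with the lowest offset at or above the last issued one keeps choosing the newly arrived nearby offsets, so the request at offset 10000 is issued only at the very end. The window of 8 pending requests and the offsets are made-up numbers for illustration.

```c
#include <stdio.h>

#define NREQ   10000	/* offsets 1..10000, with 10000 arriving third */
#define WINDOW 8	/* pending requests visible to the sorter at once */

int main(void)
{
	static int arrival[NREQ];
	static int pending[NREQ];
	int npending = 0, next = 0, last = 0, issued = 0;

	/* Arrival order: 1, 2, 10000, 3, 4, ..., 9999 */
	arrival[0] = 1; arrival[1] = 2; arrival[2] = 10000;
	for (int i = 3; i < NREQ; i++)
		arrival[i] = i;

	while (issued < NREQ) {
		/* Keep only a small window queued, as throttling would. */
		while (npending < WINDOW && next < NREQ)
			pending[npending++] = arrival[next++];

		/* Elevator: lowest pending offset >= last issued offset. */
		int best = -1;
		for (int i = 0; i < npending; i++) {
			if (pending[i] >= last &&
			    (best < 0 || pending[i] < pending[best]))
				best = i;
		}
		if (best < 0) {		/* nothing ahead of the head: wrap */
			for (int i = 0; i < npending; i++) {
				if (best < 0 || pending[i] < pending[best])
					best = i;
			}
		}

		last = pending[best];
		issued++;
		if (last == 10000)
			printf("offset 10000 issued as request #%d of %d\n",
			    issued, NREQ);
		pending[best] = pending[--npending];	/* remove it */
	}
	return (0);
}
```

Run to completion, the request at offset 10000 comes out last of all 10,000; with a continuous stream of arrivals instead of a fixed set it simply never comes up, and that ever-growing queue latency is what vdev_deadman() ends up reacting to.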

@h1z1

h1z1 commented Dec 12, 2019

zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter the sizes or locations. The workload is just a single ZVOL write with a small block size, intentionally disabled aggregation, and slightly hacked allocation to simulate a fragmented pool.

So how do you know sorting is causing it and not the fact that the drive is failing or you've over-provisioned something?

I am able to insert a few printf's to track queue depths and request latency, and at the same time monitor the IOPS done by the drive. As I said, the failing drive is on my desk and I can measure anything. I see that while some requests are progressing, the depth of the active queue remains constant, IIRC about 60, while the worst request latency grows linearly, which tells me that that request was never executed.

Then we agree it's a hardware problem and not scheduler, no??

Same reasons you'd delay transaction commit and buffer. There can be an enormous benefit with optimized random IO and thrashing.

No. The benefit cannot be enormous after you have already waited 4 seconds trying to save 10 milliseconds. The elevator algorithm is questionable in general these days if blocks are not directly consecutive, considering that in some cases a seek may be faster than waiting for a full platter rotation to reach a position that is closer in LBA.

10ms is indeed inconsequential, but given I do exactly that, I disagree. I'm referring to the aggregate on non-interactive workloads.

Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

Indeed which is why I'm a bit puzzled at what bottleneck you're hitting and why. vdev_queue_class_min_active for example. I was expecting you to come back with "we're throwing a few million IO's at it", or some such.

Yes, we are throwing ~10K IOs at it, even if not all at the same time. Just imagine requests arriving in this LBA order: 1, 2, 10000, 3, 4, 5, ... 9999. How long do you think the request at 10000 will wait in the queue?

You're throwing 10k IO's at a single platter disk?? Something isn't adding up.

@amotin
Member Author

amotin commented Dec 12, 2019

zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter the sizes or locations. The workload is just a single ZVOL write with a small block size, intentionally disabled aggregation, and slightly hacked allocation to simulate a fragmented pool.

So how do you know sorting is causing it and not the fact that the drive is failing or you've over-provisioned something?

I am able to insert a few printf's to track queue depths and request latency, and at the same time monitor the IOPS done by the drive. As I said, the failing drive is on my desk and I can measure anything. I see that while some requests are progressing, the depth of the active queue remains constant, IIRC about 60, while the worst request latency grows linearly, which tells me that that request was never executed.

Then we agree it's a hardware problem and not scheduler, no??

I know that this drive is failing! But I don't want my system to reboot every 15 minutes because of it, blaming requests that were not even sent to the disk, waiting in a queue for 15 minutes!

Same reasons you'd delay transaction commit and buffer. There can be an enormous benefit with optimized random IO and thrashing.

No. The benefit cannot be enormous after you have already waited 4 seconds trying to save 10 milliseconds. The elevator algorithm is questionable in general these days if blocks are not directly consecutive, considering that in some cases a seek may be faster than waiting for a full platter rotation to reach a position that is closer in LBA.

10ms is indeed inconsequential, but given I do exactly that, I disagree. I'm referring to the aggregate on non-interactive workloads.

I'm sorry, I lost you here. I am saying that breaking sequential operation once every 4 seconds for some fairness is not a high price. Do you agree or not?

Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

Indeed which is why I'm a bit puzzled at what bottleneck you're hitting and why. vdev_queue_class_min_active for example. I was expecting you to come back with "we're throwing a few million IO's at it", or some such.

Yes, we are throwing ~10K IOs at it, even if not all at the same time. Just imagine requests arriving in this LBA order: 1, 2, 10000, 3, 4, 5, ... 9999. How long do you think the request at 10000 will wait in the queue?

You're throwing 10k IO's at a single platter disk?? Something isn't adding up.

You are doing just the same. Just create a single-disk pool on a machine with lots of RAM. In that case ARC dirty_max will be several GBs of data, which may be assigned to the same transaction group before write throttling activates. Divide those few GBs by some small block/record size and you may get millions, not just some 10k IOs, at a single platter disk.
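As a rough worked example with assumed numbers: if the dirty data limit lets roughly 4 GiB into one TXG and the ZVOL uses a 4 KiB block size, that is 4 GiB / 4 KiB = 1,048,576 writes queued up for a single disk before write throttling pushes back.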

@codecov

codecov bot commented Dec 12, 2019

Codecov Report

Merging #9714 into master will increase coverage by 13%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #9714      +/-   ##
=========================================
+ Coverage      67%     80%     +13%     
=========================================
  Files         337     288      -49     
  Lines      106369   82326   -24043     
=========================================
- Hits        71480   65821    -5659     
+ Misses      34889   16505   -18384
Flag Coverage Δ
#kernel 80% <100%> (?)
#user ?

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0bf435...9d32840.

@richardelling
Contributor

zfs_vdev_max_active isn't what you want; you might want zfs_vdev_*_max_active, which for HDDs should be closer to 3 or 4 than 10 for crappy drives (and likely SMR drives, too). This will have the effect of not queuing a bunch of stuff to the drive, where it can get stuck in the drive's elevator for a long time.

Since I do mostly high-speed storage, I can say I'd rather not have more code trying to sort out HDD optimizations in the data path -- it can be faster to just put it on disk.

@h1z1

h1z1 commented Dec 12, 2019

Then we agree it's a hardware problem and not scheduler, no??

I know that this drive is failing! But I don't want my system to reboot every 15 minutes because of it, blaming requests that were not even sent to the disk, waiting in a queue for 15 minutes!

echo wait > /sys/module/zfs/parameters/zfs_deadman_failmode :)

Seriously though, I'd think that depends on why you're waiting. What is the source of synchronous IO to such a degree that it is starving other classes?

10ms is indeed inconsequential, but given I do exactly that, I disagree. I'm referring to the aggregate on non-interactive workloads.

I'm sorry, I lost you here. I am saying that breaking sequential operation once every 4 seconds for some fairness is not a high price. Do you agree or not?

That can be a negative outcome. What type of IO and what priorities are you referring to? Asynchronous IO is, by definition, asynchronous. That 4 seconds can be 2 requests or millions... I wouldn't agree that IO scheduled at batch priority should just preempt realtime.

Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

Indeed which is why I'm a bit puzzled at what bottleneck you're hitting and why. vdev_queue_class_min_active for example. I was expecting you to come back with "we're throwing a few million IO's at it", or some such.

Yes, we are throwing ~10K IOs at it, even if not all at the same time. Just imagine requests arriving in this LBA order: 1, 2, 10000, 3, 4, 5, ... 9999. How long do you think the request at 10000 will wait in the queue?

You're throwing 10k IO's at a single platter disk?? Something isn't adding up.

You are doing just the same. Just create a single-disk pool on a machine with lots of RAM. In that case ARC dirty_max will be several GBs of data, which may be assigned to the same transaction group before write throttling activates. Divide those few GBs by some small block/record size and you may get millions, not just some 10k IOs, at a single platter disk.

These are different things though.

@amotin
Member Author

amotin commented Dec 12, 2019

OK, if there is no interest in this approach, I'll close this PR for now and think about it some more.

@amotin amotin closed this Dec 12, 2019
@amotin amotin deleted the tsoff branch August 24, 2021 20:17