
Limit async IO starvation caused by elevator sorting. #9714

Closed
wants to merge 1 commit

Conversation

amotin
Member

@amotin amotin commented Dec 11, 2019

Async IO starvation is rare: for reads, important IO can be promoted by a demand read, and for writes it is limited by the duration of a single TXG commit. Unfortunately, we have seen multiple cases in the wild where some very slow (probably failing) drive extends a TXG commit to many minutes, which combined with IO starvation makes vdev_deadman() crash the system even though no IO request is really stuck at the block layer.

This change introduces hard ~4-second time intervals that limit the scope of elevator sorting, and with it the maximum starvation time. With allocation throttling limiting the maximum queue depth, this effectively prevents vdev_deadman() false positives. In a manually simulated environment that reliably triggered the panic within minutes, this patch solves the problem once allocation throttling is enabled.

I also switched vq_active_tree to vdev_queue_timestamp_compare(), since its only consumer is vdev_deadman(), which works most efficiently when the tree is sorted by time rather than by offset.

While there, I also added an additional offset comparison to vdev_queue_timestamp_compare(). It rarely matters, but on systems with low timer resolution it seems better to sort IOs by a meaningful offset rather than by essentially random memory addresses.
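To make the mechanism concrete, here is a minimal, self-contained sketch of the two comparators described above. It is not the actual patch: the struct, field names, and constants are illustrative stand-ins for the queued-I/O fields the real vdev queue code uses, and the real AVL comparators would end with a pointer tie-break to keep tree keys unique, which is omitted here.

```c
#include <stdint.h>

#define SORT_INTERVAL_NS	(4ULL * 1000000000ULL)	/* the ~4 s window */

/* Illustrative stand-in for the queued-I/O fields that matter here. */
typedef struct qio {
	uint64_t q_timestamp;	/* time the I/O entered the queue, in ns */
	uint64_t q_offset;	/* offset used for elevator sorting */
} qio_t;

/*
 * Offset (elevator) comparator: compare a coarse time bucket first, so
 * offset sorting only reorders I/Os queued within the same ~4 s interval
 * and nothing can be bypassed for much longer than one interval.
 */
static int
qio_offset_compare(const qio_t *a, const qio_t *b)
{
	uint64_t ba = a->q_timestamp / SORT_INTERVAL_NS;
	uint64_t bb = b->q_timestamp / SORT_INTERVAL_NS;

	if (ba != bb)				/* older interval sorts first */
		return (ba < bb ? -1 : 1);
	if (a->q_offset != b->q_offset)		/* elevator order within it */
		return (a->q_offset < b->q_offset ? -1 : 1);
	return (0);
}

/*
 * Timestamp comparator (what vq_active_tree would use for vdev_deadman()):
 * primary key is queue time; offset is a meaningful secondary key for the
 * low-timer-resolution case, instead of falling straight back to comparing
 * memory addresses.
 */
static int
qio_timestamp_compare(const qio_t *a, const qio_t *b)
{
	if (a->q_timestamp != b->q_timestamp)
		return (a->q_timestamp < b->q_timestamp ? -1 : 1);
	if (a->q_offset != b->q_offset)
		return (a->q_offset < b->q_offset ? -1 : 1);
	return (0);
}
```

Within one ~4-second interval the queue still behaves as a normal elevator; across intervals the coarse age ordering caps how long a queued I/O can be bypassed by later, lower-offset arrivals.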

If somebody sees a more efficient way to limit elevator sorting while still using AVL trees, or has a better idea for the maximum starvation time than my 4 seconds, I'll be happy to discuss it.

How Has This Been Tested?

I manually crafted a panic scenario using a real failing HDD that does only 60 write operations per second on FreeBSD head, and confirmed that this change fixes it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

Async IO starvation is rare, since for read important IO can be
promoted by demand read, and for write it is limited by the time of
a single TXG commit.  Unfortunately we saw multiple cases in the wild
when some very slow (probably failing) drive extends TXG commit for
many minutes, which combined with IO starvation makes vdev_deadman()
crash the system without any IO request really stuck at the block
layer.

This change introduces hard ~4 second time intervals, limiting the
scope of elevator sorting and so the maximal starvation time.  With
allocation throttling limiting maximal queue depth, this effectively
prevents vdev_deadman() false positives.  In a manually simulated
environment, reliably triggering the panic within minutes, this patch
solves the problem once allocation throttling is enabled.

Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
@ahrens ahrens self-requested a review December 11, 2019 21:12
@h1z1

h1z1 commented Dec 11, 2019

Curious why you wouldn't want the deadman kicking in, but also what values do you use for zfs_vdev_async_*? How many IO/s was your test case?

The deadman solved a few issues we were having with drives that didn't report any error at all but were in fact dead or dying. S'pose there is a case to be made for highly optimized IO blasting as much async IO as possible and letting the kernel/zfs sort it out.

@amotin
Member Author

amotin commented Dec 11, 2019

Curious why you wouldn't want the deadman kicking in, but also what values do you use for zfs_vdev_async_*? How many IO/s was your test case?

In the case of the real failed drive I am testing now, a reboot fixes nothing, so it is pointless for the deadman to activate. Plus, the drive still works correctly, just very slowly, so if not for ZFS's unfair scheduling there would be no reason for the deadman to activate, since no I/O is stuck in the disk. zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter the sizes or locations. The workload is just a single ZVOL write with a small block size, intentionally disabled aggregation, and slightly hacked allocation to simulate a fragmented pool.

The deadman solved a few issues we were having with drives that didn't report any error at all but were in fact dead or dying. S'pose there is a case to be made for highly optimized IO blasting as much async IO as possible and letting the kernel/zfs sort it out.

As I said, a reboot solves nothing in this case, it just creates more downtime and questions. As for sorting, I doubt that sorting beyond multiple seconds makes any sense, considering seek times of a dozen milliseconds. Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

@h1z1

h1z1 commented Dec 12, 2019

Curious why you wouldn't want the deadman kicking in, but also what values do you use for zfs_vdev_async_*? How many IO/s was your test case?

In the case of the real failed drive I am testing now, a reboot fixes nothing, so it is pointless for the deadman to activate. Plus, the drive still works correctly, just very slowly, so if not for ZFS's unfair scheduling there would be no reason for the deadman to activate, since no I/O is stuck in the disk.

That sounds oddly like what we were hitting with Ironwolf and WD Red drives supposedly flashed with "NAS firmware". See #6885

zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter the sizes or locations. The workload is just a single ZVOL write with a small block size, intentionally disabled aggregation, and slightly hacked allocation to simulate a fragmented pool.

So how do you know sorting is causing it and not the fact that the drive is failing or you've over-provisioned something? zfs_vdev_max_active is, I believe, 1000.

The deadman solved a few issues we were having with drives that didn't report any error at all but were in fact dead or dying. S'pose there is a case to be made for highly optimized IO blasting as much async IO as possible and letting the kernel/zfs sort it out.

As I said, a reboot solves nothing in this case, it just creates more downtime and questions. As for sorting, I doubt that sorting beyond multiple seconds makes any sense, considering seek times of a dozen milliseconds.

Same reasons you'd delay transaction commit and buffer. There can be an enormous benefit with optimized random IO and thrashing.

Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

Indeed which is why I'm a bit puzzled at what bottleneck you're hitting and why. vdev_queue_class_min_active for example. I was expecting you to come back with "we're throwing a few million IO's at it", or some such.

@amotin
Member Author

amotin commented Dec 12, 2019

zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter the sizes or locations. The workload is just a single ZVOL write with a small block size, intentionally disabled aggregation, and slightly hacked allocation to simulate a fragmented pool.

So how do you know sorting is causing it and not the fact that the drive is failing or you've over-provisioned something?

I am able to insert a few printf's to track queue depths and request latency, and at the same time monitor the IOPS done by the drive. As I said, the failing drive is on my desk and I can measure anything. I see that while some requests are progressing, the depth of the active queue remains constant (IIRC about 10, with vq_write_offset_tree at about 60), while the worst request latency grows linearly, which tells me that that request was never executed.

The deadman solved a few issues we were having with drives that didn't report any error at all but were in fact dead or dying. S'pose there is a case to be made for highly optimized IO blasting as much async IO as possible and letting the kernel/zfs sort it out.

As I said, a reboot solves nothing in this case, it just creates more downtime and questions. As for sorting, I doubt that sorting beyond multiple seconds makes any sense, considering seek times of a dozen milliseconds.

Same reasons you'd delay transaction commit and buffer. There can be an enormous benefit with optimized random IO and thrashing.

No. The benefit cannot be enormous after you have already waited 4 seconds trying to save 10 milliseconds. The elevator algorithm is questionable in general these days if blocks are not directly consecutive, considering that in some cases a seek may be faster than waiting for a full platter rotation to reach a position that is closer in LBA.

Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

Indeed which is why I'm a bit puzzled at what bottleneck you're hitting and why. vdev_queue_class_min_active for example. I was expecting you to come back with "we're throwing a few million IO's at it", or some such.

Yes, we are throwing ~10K IOs at it, even if not all at the same time. Just imagine requests arriving in this LBA order: 1, 2, 10000, 3, 4, 5, ... 9999. How long do you think the request at 10000 will wait in the queue? I'll tell you: till the end. New requests keep coming in as earlier ones complete, and that 10000 will still wait indefinitely.
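To make that concrete, here is a toy, self-contained model (not ZFS code) of the pattern: an elevator that always issues the pending request with the lowest offset at or above the last issued one keeps choosing the newly arrived nearby offsets, so the request at offset 10000 is issued only at the very end. The window of 8 pending requests and the offsets are made-up numbers for illustration.

```c
#include <stdio.h>

#define NREQ   10000	/* offsets 1..10000, with 10000 arriving third */
#define WINDOW 8	/* pending requests visible to the sorter at once */

int main(void)
{
	static int arrival[NREQ];
	static int pending[NREQ];
	int npending = 0, next = 0, last = 0, issued = 0;

	/* Arrival order: 1, 2, 10000, 3, 4, ..., 9999 */
	arrival[0] = 1; arrival[1] = 2; arrival[2] = 10000;
	for (int i = 3; i < NREQ; i++)
		arrival[i] = i;

	while (issued < NREQ) {
		/* Keep only a small window queued, as throttling would. */
		while (npending < WINDOW && next < NREQ)
			pending[npending++] = arrival[next++];

		/* Elevator: lowest pending offset >= last issued offset. */
		int best = -1;
		for (int i = 0; i < npending; i++) {
			if (pending[i] >= last &&
			    (best < 0 || pending[i] < pending[best]))
				best = i;
		}
		if (best < 0) {		/* nothing ahead of the head: wrap */
			for (int i = 0; i < npending; i++) {
				if (best < 0 || pending[i] < pending[best])
					best = i;
			}
		}

		last = pending[best];
		issued++;
		if (last == 10000)
			printf("offset 10000 issued as request #%d of %d\n",
			    issued, NREQ);
		pending[best] = pending[--npending];	/* remove it */
	}
	return (0);
}
```

Run to completion, the request at offset 10000 comes out last of all 10,000; with a continuous stream of arrivals instead of a fixed set it simply never comes up, and that ever-growing queue latency is what vdev_deadman() ends up reacting to.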

@h1z1

h1z1 commented Dec 12, 2019

zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter the sizes or locations. The workload is just a single ZVOL write with a small block size, intentionally disabled aggregation, and slightly hacked allocation to simulate a fragmented pool.

So how do you know sorting is causing it and not the fact that the drive is failing or you've over-provisioned something?

I am able to insert a few printf's to track queue depths and request latency, and at the same time monitor the IOPS done by the drive. As I said, the failing drive is on my desk and I can measure anything. I see that while some requests are progressing, the depth of the active queue remains constant, IIRC about 60, while the worst request latency grows linearly, which tells me that that request was never executed.

Then we agree it's a hardware problem and not scheduler, no??

Same reasons you'd delay transaction commit and buffer. There can be an enormous benefit with optimized random IO and thrashing.

No. The benefit cannot be enormous after you have already waited 4 seconds trying to save 10 milliseconds. The elevator algorithm is questionable in general these days if blocks are not directly consecutive, considering that in some cases a seek may be faster than waiting for a full platter rotation to reach a position that is closer in LBA.

10ms is indeed inconsequential, but given I do exactly that, I disagree. I'm referring to the aggregate on non-interactive workloads.

Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

Indeed which is why I'm a bit puzzled at what bottleneck you're hitting and why. vdev_queue_class_min_active for example. I was expecting you to come back with "we're throwing a few million IO's at it", or some such.

Yes, we are throwing ~10K IOs at it, even if not all at the same time. Just imagine requests arriving in this LBA order: 1, 2, 10000, 3, 4, 5, ... 9999. How long do you think the request at 10000 will wait in the queue?

You're throwing 10k IO's at a single platter disk?? Something isn't adding up.

@amotin
Member Author

amotin commented Dec 12, 2019

zfs_vdev_async_* are set to defaults. The disk does a stable 60 writes per second, no matter the sizes or locations. The workload is just a single ZVOL write with a small block size, intentionally disabled aggregation, and slightly hacked allocation to simulate a fragmented pool.

So how do you know sorting is causing it and not the fact that the drive is failing or you've over-provisioned something?

I am able to insert a few printf's to track queue depths and request latency, and at the same time monitor the IOPS done by the drive. As I said, the failing drive is on my desk and I can measure anything. I see that while some requests are progressing, the depth of the active queue remains constant, IIRC about 60, while the worst request latency grows linearly, which tells me that that request was never executed.

Then we agree it's a hardware problem and not scheduler, no??

I know that this drive is failing! But I don't want my system to reboot every 15 minutes because of it, blaming requests that were not even sent to the disk, waiting in a queue for 15 minutes!

Same reasons you'd delay transaction commit and buffer. There can be an enormous benefit with optimized random IO and thrashing.

No. The benefit cannot be enormous after you have already waited 4 seconds trying to save 10 milliseconds. The elevator algorithm is questionable in general these days if blocks are not directly consecutive, considering that in some cases a seek may be faster than waiting for a full platter rotation to reach a position that is closer in LBA.

10ms is indeed inconsequential, but given I do exactly that, I disagree. I'm referring to the aggregate on non-interactive workloads.

I'm sorry, I lost you here. I am saying that breaking sequential operation once every 4 seconds for some fairness is not a high price. Do you agree or not?

Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

Indeed which is why I'm a bit puzzled at what bottleneck you're hitting and why. vdev_queue_class_min_active for example. I was expecting you to come back with "we're throwing a few million IO's at it", or some such.

Yes, we are throwing ~10K IOs at it, even if not all at the same time. Just imagine requests arriving in this LBA order: 1, 2, 10000, 3, 4, 5, ... 9999. How long do you think the request at 10000 will wait in the queue?

You're throwing 10k IO's at a single platter disk?? Something isn't adding up.

You are doing just the same. Just create a single-disk pool on a machine with lots of RAM. In that case ARC dirty_max will be several GBs of data, which may be assigned to the same transaction group before write throttling activates. Divide those few GBs by some small block/record size and you may get millions, not just some 10k IOs, at a single platter disk.
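As a rough worked example with assumed numbers: if the dirty data limit lets roughly 4 GiB into one TXG and the ZVOL uses a 4 KiB block size, that is 4 GiB / 4 KiB = 1,048,576 writes queued up for a single disk before write throttling pushes back.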

@codecov

codecov bot commented Dec 12, 2019

Codecov Report

Merging #9714 into master will increase coverage by 13%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #9714      +/-   ##
=========================================
+ Coverage      67%     80%     +13%     
=========================================
  Files         337     288      -49     
  Lines      106369   82326   -24043     
=========================================
- Hits        71480   65821    -5659     
+ Misses      34889   16505   -18384
Flag Coverage Δ
#kernel 80% <100%> (?)
#user ?

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0bf435...9d32840.

@richardelling
Contributor

zfs_vdev_max_active isn't what you want; you might want zfs_vdev_*_max_active, which for HDDs should be closer to 3 or 4 than 10 for crappy drives (and likely SMR drives, too). This will have the effect of not queuing a bunch of stuff to the drive, where it can get stuck in the drive's elevator for a long time.

Since I do mostly high-speed storage, I can say I'd rather not have more code trying to sort out HDD optimizations in the data path -- it can be faster to just put it on disk.

@h1z1

h1z1 commented Dec 12, 2019

Then we agree it's a hardware problem and not scheduler, no??

I know that this drive is failing! But I don't want my system to reboot every 15 minutes because of it, blaming requests that were not even sent to the disk, waiting in a queue for 15 minutes!

echo wait > /sys/module/zfs/parameters/zfs_deadman_failmode :)

Seriously though, I'd think that depends on why you're waiting. What is the source of synchronous IO to such a degree that it is starving other classes?

10ms is indeed inconsequential, but given I do exactly that, I disagree. I'm referring to the aggregate on non-interactive workloads.

I'm sorry, I lost you here. I am saying that breaking sequential operation once every 4 seconds for some fairness is not a high price. Do you agree or not?

That can be a negative outcome. What type of IO and what priorities are you referring to? Asynchronous IO is, by definition, asynchronous. That 4 seconds can be 2 requests or millions... I wouldn't agree that IO scheduled at batch priority should just preempt realtime.

Plus, allocation throttling already prevents all possible requests from reaching the sorter, passing through only small portions at a time.

Indeed which is why I'm a bit puzzled at what bottleneck you're hitting and why. vdev_queue_class_min_active for example. I was expecting you to come back with "we're throwing a few million IO's at it", or some such.

Yes, we are throwing ~10K IOs at it, even if not all at the same time. Just imagine requests arriving in this LBA order: 1, 2, 10000, 3, 4, 5, ... 9999. How long do you think the request at 10000 will wait in the queue?

You're throwing 10k IO's at a single platter disk?? Something isn't adding up.

You are doing just the same. Just create a single-disk pool on a machine with lots of RAM. In that case ARC dirty_max will be several GBs of data, which may be assigned to the same transaction group before write throttling activates. Divide those few GBs by some small block/record size and you may get millions, not just some 10k IOs, at a single platter disk.

These are different things though.

@amotin
Member Author

amotin commented Dec 12, 2019

OK, if there is no interest in this approach, I'll close this PR for now and think about it some more.

@amotin amotin closed this Dec 12, 2019
@amotin amotin deleted the tsoff branch August 24, 2021 20:17