Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay injection can cause indefinitely hung zios #8404

Merged
merged 1 commit into from
Feb 15, 2019

Conversation

shartse
Copy link
Contributor

@shartse shartse commented Feb 13, 2019

Motivation and Context

I ran into this while doing some performance testing using zinject - sometimes a zio would just never complete.

Looking at how the delay injection is implemented we can see that if we hit the (NSEC_TO_TICK(diff) == 0) condition in zio_delay_interrupt (i.e. there's very little left to delay, so just spin) then after waiting the function returns immediately and zio_interrupt is never called. Therefore, the zio cannot progress.

Description

I added a call to zio_interrupt after zfs_sleep_until completes.

How Has This Been Tested?

I re-ran my performance tests on the VM that had been consistently hitting this issue and was able to complete them without incident. I verifying that I was hitting condition in question by adding some temporary logging.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@shartse shartse requested a review from tonyhutter February 13, 2019 21:37
@ahrens ahrens added Status: Code Review Needed Ready for review and testing Type: Defect Incorrect behavior (e.g. crash, hang) and removed Type: Defect Incorrect behavior (e.g. crash, hang) labels Feb 13, 2019
@shartse
Copy link
Contributor Author

shartse commented Feb 13, 2019

@tonyhutter I noticed that you ported this feature from Illumos - I was wondering if you had thoughts on the use of ticks for taskq_dispatch_delay? It seems like the granularity is pretty low, potentially leading to wakeups before the full time has elapsed. Maybe it could be improved by calling a helper function from taskq_dispatch_delay which calls zfs_sleep_until followed by zio_interrupt to make sure we complete the entire delay.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find, thanks!

If we hit the (NSEC_TO_TICK(diff) == 0) condition in
zio_delay_interrupt, zio_interrupt is never called and the
zio does not progress.

Signed-off-by: sara hartse <[email protected]>
@codecov
Copy link

codecov bot commented Feb 14, 2019

Codecov Report

Merging #8404 into master will decrease coverage by 0.01%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8404      +/-   ##
==========================================
- Coverage   78.57%   78.56%   -0.02%     
==========================================
  Files         380      380              
  Lines      115839   115840       +1     
==========================================
- Hits        91019    91007      -12     
- Misses      24820    24833      +13
Flag Coverage Δ
#kernel 78.93% <0%> (-0.06%) ⬇️
#user 67.24% <ø> (-0.18%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf89a4e...7fe33b5. Read the comment docs.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Feb 15, 2019
@behlendorf behlendorf merged commit f545b6a into openzfs:master Feb 15, 2019
@behlendorf
Copy link
Contributor

I've gone ahead and merged this to resolve the outstanding issue. Let's tackle possible improvements in a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants