Use smaller default slack/delta value for schedule_hrtimeout_range() #9217

Merged
merged 1 commit into from
Aug 28, 2019

@tonynguien tonynguien commented Aug 26, 2019

Motivation and Context

Improve small write performance

Description

For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta
for calls to schedule_hrtimeout_range(). This 100us slack can be costly
for small writes.

This change improves small write performance by passing the resolution
parameter res to schedule_hrtimeout_range() for use as the delta/slack. A new
tunable, spl_schedule_hrtimeout_slack_us, is added to preserve the old
behavior when desired.
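
As a rough illustration of the description above, the slack selection could look something like the following userland sketch. It is not the code from this PR: the helper name `compute_slack_ns`, the idea that the tunable acts as a floor on the slack, and the 1000us cap (taken from the limit discussed later in this conversation) are all assumptions made for the example.

```c
#include <stdint.h>

#define NSEC_PER_USEC	1000ULL
#define MAX_SLACK_US	1000ULL	/* assumed cap, from the 0-1000us limit */

/*
 * Hypothetical helper: choose the slack, in nanoseconds, to hand to the
 * timer. Assumes the tunable is a lower bound on the caller's resolution
 * and that the result is capped at MAX_SLACK_US.
 */
static uint64_t
compute_slack_ns(uint64_t res_ns, uint64_t tunable_slack_us)
{
	uint64_t floor_ns = tunable_slack_us * NSEC_PER_USEC;
	uint64_t cap_ns = MAX_SLACK_US * NSEC_PER_USEC;
	uint64_t slack_ns = (res_ns > floor_ns) ? res_ns : floor_ns;

	return ((slack_ns < cap_ns) ? slack_ns : cap_ns);
}
```

Under these assumptions, a tunable of 0 lets the caller's resolution decide the slack, while a tunable of 100 restores the previous 100us slack for any caller whose resolution is smaller than that.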

How Has This Been Tested?

Performance observations on 8K recordsize filesystem:

  • 8K random writes at 1-64 threads, up to 60% improvement for one thread
    and smaller gains as thread count increases. At >64 threads, 2-5%
    decrease in performance was observed.
  • 8K sequential writes, similar 60% improvement for one thread and
    leveling out around 64 threads. At >64 threads, 5-10% decrease in
    performance was observed.
  • 128K sequential writes see a 1-5% improvement. No regression was
    observed at high thread count.

Testing was done on Ubuntu 18.04 with a 4.15 kernel, 8 vCPUs, and SSD storage on VMware ESX.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation (a change to man pages or other documentation)

@behlendorf added the labels "Status: Code Review Needed" (Ready for review and testing) and "Type: Performance" (Performance improvement or performance problem) on Aug 27, 2019
@tonynguien

I updated the PR to make the parameter modifiable at runtime and to limit its value to 0-1000us.

Testing was done to confirm that changes to /sys/module/spl/parameters/spl_schedule_hrtimeout_slack_us take effect and that values <0 or >1000 are rejected with Invalid argument errors.
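
For readers who want to picture the range check, here is a minimal userland sketch of a setter that accepts 0-1000 and rejects anything else with -EINVAL. The function name `set_slack_us`, the `slack_us` variable, and the parsing details are hypothetical stand-ins, not the kernel handler added by this PR.

```c
#include <errno.h>
#include <stdlib.h>

#define MAX_SLACK_US	1000L

static long slack_us = 0;	/* stand-in for the module tunable */

/*
 * Hypothetical setter: parse a decimal string and store it only when it
 * lies in [0, MAX_SLACK_US]. Returns 0 on success, -EINVAL otherwise.
 */
static int
set_slack_us(const char *buf)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return (-EINVAL);
	/* Tolerate the single trailing newline that echo appends. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return (-EINVAL);
	if (val < 0 || val > MAX_SLACK_US)
		return (-EINVAL);

	slack_us = val;
	return (0);
}
```

A rejected write leaves the previous value in place, which matches the behavior one would expect when an out-of-range value is written to the sysfs file.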

@tonynguien

Some builds are failing. Kernels without module_param_cb() have a different function prototype. I believe I need to do something similar to what's in spl-taskq.c.

For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta
for calls to schedule_hrtimeout_range(). This 100us slack can be costly
for small writes.

This change improves small write performance by passing resolution `res`
parameter to schedule_hrtimeout_range() to be used as delta/slack. A new
tunable `spl_schedule_hrtimeout_slack_us` is added to preserve old
behavior when desired.

Performance observations on 8K recordsize filesystem:
- 8K random writes at 1-64 threads, up to 60% improvement for one thread
  and smaller gains as thread count increases. At >64 threads, 2-5%
  decrease in performance was observed.
- 8K sequential writes, similar 60% improvement for one thread and
  leveling out around 64 threads. At >64 threads, 5-10% decrease in
  performance was observed.
- 128K sequential writes see a 1-5% improvement. No regression was
  observed at high thread count.

Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.

Signed-off-by: Tony Nguyen <[email protected]>
@codecov

codecov bot commented Aug 28, 2019

Codecov Report

Merging #9217 into master will decrease coverage by 0.02%.
The diff coverage is 30%.

@@            Coverage Diff             @@
##           master    #9217      +/-   ##
==========================================
- Coverage   79.24%   79.22%   -0.03%     
==========================================
  Files         400      400              
  Lines      122012   122006       -6     
==========================================
- Hits        96687    96656      -31     
- Misses      25325    25350      +25

Flag      Coverage Δ
#kernel   79.76% <30%> (-0.03%) ⬇️
#user     67.15% <ø> (-0.22%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@behlendorf added the label "Status: Accepted" (Ready to integrate (reviewed, tested)) and removed the label "Status: Code Review Needed" (Ready for review and testing) on Aug 28, 2019
@behlendorf behlendorf merged commit 8d04284 into openzfs:master Aug 28, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 24, 2019
For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta
for calls to schedule_hrtimeout_range(). This 100us slack can be costly
for small writes.

This change improves small write performance by passing resolution `res`
parameter to schedule_hrtimeout_range() to be used as delta/slack. A new
tunable `spl_schedule_hrtimeout_slack_us` is added to preserve old
behavior when desired.

Performance observations on 8K recordsize filesystem:
- 8K random writes at 1-64 threads, up to 60% improvement for one thread
  and smaller gains as thread count increases. At >64 threads, 2-5%
  decrease in performance was observed.
- 8K sequential writes, similar 60% improvement for one thread and
  leveling out around 64 threads. At >64 threads, 5-10% decrease in
  performance was observed.
- 128K sequential writes see a 1-5% improvement. No regression was
  observed at high thread count.

Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.

Reviewed-by: Richard Elling <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Closes openzfs#9217
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 27, 2019
tonyhutter pushed a commit that referenced this pull request Jan 23, 2020