Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vol_volsize_to_reservation does not account for raidz skip blocks #8973

Closed
wants to merge 4 commits into from

Conversation

mgerdts
Copy link
Contributor

@mgerdts mgerdts commented Jul 1, 2019

This is a port of illumos#9318. When a volume is created
in a pool with raidz vdevs and volblocksize != 128k, the volume can
reference more space than is reserved with the automatically
calculated refreservation. There are two deficiencies in
vol_volsize_to_reservation that contribute to this:

  1. Skip blocks may be added to keep each allocation a multiple of
    parity + 1. This is the dominating factor when volblocksize is close
    to 2^ashift.
  2. raidz deflation for 128 KB blocks is different for most other block
    sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Motivation and Context

As mentioned above, this is a port of illumos#9318 which was recently integrated into illumos-gate. I would like to integrate this into zfsonlinux to minimize divergence.

When virtual machines are created on top of zvols, they commonly use 4k or 8k volblocksize. With 4Kn disks becoming quite common, zvols on raidz end up consuming significantly more space than estimated. The common case is also the pathological case. A zvol with volblocksize=8k on raidz2 will allocate as much space as triple mirroring, but the automatically calculated refreservation will be be only a few percent larger than volsize.

The underlying issues are fully explained in a new theory comment and a recording of a Joyent Technical Discussion.

Description

zvol_volsize_to_reservation() is responsible for the calculation of the default refreservation. Originally, it assumed that the pool layout had no bearing on the amount of space that could be consumed by a volume. I've changed that so that if one or more raidz vdevs exist in the pool, the amount of reserved space will assume that all the writes to go the most inefficient vdev.

The files that his PR changes patched into zfsonlinux with the only conflicts being in copyrights.

How Has This Been Tested?

New tests were written to cover the areas of changed functionality. Results from testing on illumos are available in illumos#9318.

Small test changes were required when porting from illumos. Those include:

  • Use lsblock rathern than prtvtoc to figure out disk sector size
  • ksh on CentOS seems to have a different mixture of bugs from ksh on illumos, requiring additional checks when iterating associative array keys.
  • Adjust the /dev/zvol path to match the Linux way.

After those changes, scripts/zfs-tests.sh -v -d /data -T refreserv matches expectations (refreserv_004_pos fails, but that is expected). Results for running the full test suite are pending.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

This is a port of illumos/illumos-gate#9318.  When a volume is created
in a pool with raidz vdevs and volblocksize != 128k, the volume can
reference more space than is reserved with the automatically
calculated refreservation.  There are two deficiencies in
vol_volsize_to_reservation that contribute to this:
1) Skip blocks may be added to keep each allocation a multiple of
parity + 1. This is the dominating factor when volblocksize is close
to 2^ashift.
2) raidz deflation for 128 KB blocks is different for most other block
sizes.
See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.
@mgerdts mgerdts mentioned this pull request Jul 1, 2019
12 tasks
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jul 1, 2019
@behlendorf
Copy link
Contributor

Thanks for porting this, I made a few small comments in #8974 before seeing this PR.

@codecov
Copy link

codecov bot commented Jul 1, 2019

Codecov Report

Merging #8973 into master will decrease coverage by 0.19%.
The diff coverage is 97.36%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #8973     +/-   ##
=========================================
- Coverage   78.74%   78.54%   -0.2%     
=========================================
  Files         388      388             
  Lines      119972   120003     +31     
=========================================
- Hits        94466    94255    -211     
- Misses      25506    25748    +242
Flag Coverage Δ
#kernel 79.32% <ø> (-0.19%) ⬇️
#user 66.18% <97.36%> (-0.51%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 681a85c...d6febf1. Read the comment docs.

@behlendorf behlendorf requested review from ahrens and jwk404 July 2, 2019 18:48
behlendorf
behlendorf previously approved these changes Jul 3, 2019
@behlendorf behlendorf dismissed their stale review July 3, 2019 17:38

Accidentally approved the wrong PR

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 5, 2019
@behlendorf behlendorf closed this in 341166c Jul 5, 2019
@behlendorf
Copy link
Contributor

@mgerdts merged, thank you for porting this!

mgerdts pushed a commit to mgerdts/zfs that referenced this pull request Jul 8, 2019
…skip blocks

When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation.  There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:

  1) Skip blocks may be added to keep each allocation a multiple
     of parity + 1. This is the dominating factor when volblocksize
     is close to 2^ashift.

  2) raidz deflation for 128 KB blocks is different for most other
     block sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Authored by: Mike Gerdts <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Sanjay Nadkarni <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Kody Kantor <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Mike Gerdts <[email protected]>

Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy

OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: illumos/illumos-gate@b73ccab0
Closes openzfs#8973
TulsiJain pushed a commit to TulsiJain/zfs that referenced this pull request Jul 20, 2019
…skip blocks

When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation.  There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:

  1) Skip blocks may be added to keep each allocation a multiple
     of parity + 1. This is the dominating factor when volblocksize
     is close to 2^ashift.

  2) raidz deflation for 128 KB blocks is different for most other
     block sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Authored by: Mike Gerdts <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Sanjay Nadkarni <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Kody Kantor <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Mike Gerdts <[email protected]>

Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy

OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: illumos/illumos-gate@b73ccab0
Closes openzfs#8973
TulsiJain pushed a commit to TulsiJain/zfs that referenced this pull request Jul 20, 2019
…skip blocks

When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation.  There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:

  1) Skip blocks may be added to keep each allocation a multiple
     of parity + 1. This is the dominating factor when volblocksize
     is close to 2^ashift.

  2) raidz deflation for 128 KB blocks is different for most other
     block sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Authored by: Mike Gerdts <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Sanjay Nadkarni <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Kody Kantor <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Mike Gerdts <[email protected]>

Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy

OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: illumos/illumos-gate@b73ccab0
Closes openzfs#8973
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 13, 2019
…skip blocks

When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation.  There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:

  1) Skip blocks may be added to keep each allocation a multiple
     of parity + 1. This is the dominating factor when volblocksize
     is close to 2^ashift.

  2) raidz deflation for 128 KB blocks is different for most other
     block sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Authored by: Mike Gerdts <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Sanjay Nadkarni <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Kody Kantor <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Mike Gerdts <[email protected]>

Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy

OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: illumos/illumos-gate@b73ccab0
Closes openzfs#8973
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 24, 2019
…skip blocks

When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation.  There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:

  1) Skip blocks may be added to keep each allocation a multiple
     of parity + 1. This is the dominating factor when volblocksize
     is close to 2^ashift.

  2) raidz deflation for 128 KB blocks is different for most other
     block sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Authored by: Mike Gerdts <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Sanjay Nadkarni <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Kody Kantor <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Mike Gerdts <[email protected]>

Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy

OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: illumos/illumos-gate@b73ccab0
Closes openzfs#8973
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 27, 2019
…skip blocks

When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation.  There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:

  1) Skip blocks may be added to keep each allocation a multiple
     of parity + 1. This is the dominating factor when volblocksize
     is close to 2^ashift.

  2) raidz deflation for 128 KB blocks is different for most other
     block sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Authored by: Mike Gerdts <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Sanjay Nadkarni <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Kody Kantor <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Mike Gerdts <[email protected]>

Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy

OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: illumos/illumos-gate@b73ccab0
Closes openzfs#8973
tonyhutter pushed a commit that referenced this pull request Jan 23, 2020
…skip blocks

When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation.  There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:

  1) Skip blocks may be added to keep each allocation a multiple
     of parity + 1. This is the dominating factor when volblocksize
     is close to 2^ashift.

  2) raidz deflation for 128 KB blocks is different for most other
     block sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Authored by: Mike Gerdts <[email protected]>
Reviewed by: Richard Elling <[email protected]>
Reviewed by: Sanjay Nadkarni <[email protected]>
Reviewed by: Jerry Jelinek <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Kody Kantor <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Mike Gerdts <[email protected]>

Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy

OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: illumos/illumos-gate@b73ccab0
Closes #8973
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants