
Highly inefficient use of space observed when using raidz2 with ashift=12 #548

Closed
ryao opened this issue Feb 2, 2012 · 25 comments
Labels
Type: Documentation (indicates a requested change to the documentation), Type: Performance (performance improvement or performance problem)

Comments

@ryao
Contributor

ryao commented Feb 2, 2012

I am using the latest Git code, e29be02, on a VMware Player VM. I booted the VM using the Ubuntu Linux 11.10 LiveCD, which runs Linux 3.0. The VM contains 6x1GB disks in a raidz2 pool with ashift=12. The pool reports 4GB of space after creation. I unpacked a copy of the portage tree, which requires about 672M on ext4 with a 4K block size, onto a ZFS dataset in the pool, where it required 1.5GB of space.

I then tried using zvols, with similar results. The space consumed on the pool is consistently double the space required by the hosted filesystems:

(df reported usage) - (zfs list reported usage)/(size provided at creation time) - (filesystem, mkfs options)
743688 - 1.58G/1G - ext4 (normal, zvol) -E discard -N 1048576
688912 - 1.58G/1G - ext4 (extra options, zvol) -E discard -N 1048576 -I 128 -m 0 -O ^ext_attr,^resize_inode,^has_journal,^large_file,^huge_file,^dir_nlink,^extra_isize
687856 - 1.40G/1G - ext4 (extra options, zvol) -E discard -N 262144 -I 128 -m 0 -O ^ext_attr,^resize_inode,^has_journal,^large_file,^huge_file,^dir_nlink,^extra_isize
301684 - 607M/1G - reiserfs (default options, zvol)

You can obtain a snapshot of the portage tree from the following link to verify my results:

http://mirrors.rit.edu/gentoo/snapshots/portage-latest.tar.xz

I am linking to the latest tarball rather than a dated one, mostly because dated tarballs are not hosted for very long. I expect that others can reproduce my findings regardless of whether the exact same snapshot is used.

I also tried making a 512MB file, formatting it as reiserfs, and mounting it via a loopback device. I then extracted the portage tree into it. Afterward, I examined the disk space used: it came to 513MB.

Lastly, I tried this on my physical system with a 2GB sparse file, formatted as a single-disk ZFS pool on top of ext4, with ashift=12 and without any raidz, mirroring or striping. The occupied space reported by df was 830208 KB, which is a dramatic improvement over raidz2.

I thought I paid the price of parity at the beginning when 1/3 of my array's space was missing, but it seems that I am paying for it twice, even when using zvols which I would expect space-wise to be the equivalent of a giant file. I pay once at pool creation and then again when I do many small writes. Does anyone have any idea why?

@ryao
Contributor Author

ryao commented Feb 2, 2012

rlaager and I were able to narrow things down in IRC. It seems that a single-disk pool is fine with both ashift=9 and ashift=12. raidz2 is also fine with ashift=9, but with ashift=12, space requirements explode.

I did an unpack of the portage tree on a raidz2 ashift=9 pool that I made on my VM host. It used only 436MB:

rpool 437M 3.49G 436M /rpool

I also tried a 1GB zvol formatted ext4 with 2^20 inodes:

/dev/zd0 786112 743000 0 100% /rpool/portage

That is consistent with the host usage and it seems that the 5% space reserved for root enabled the extraction to run to completion.

The explosion in disk usage seems to be caused by a bad interaction between ashift=12 and raidz.
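
For anyone who wants to reproduce this without a VM, here is a minimal sketch using file-backed vdevs (the file paths, pool name and tarball location are placeholders, not taken from the report above):

for i in 1 2 3 4 5 6; do truncate -s 1G /var/tmp/zdisk$i; done
zpool create -o ashift=12 testpool raidz2 /var/tmp/zdisk{1..6}
zfs create testpool/portage
tar -xJf portage-latest.tar.xz -C /testpool/portage
zfs list testpool/portage    # compare USED here against the ~670M the same tree needs on ext4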

@behlendorf
Contributor

You're absolutely right; this is something of a known, although not widely discussed, issue for ZFS, and it's one of the reasons why we've left ashift=9 as the default. Small files will balloon the storage requirements; for large files things should be much more reasonable.

@ryao
Contributor Author

ryao commented Feb 2, 2012

behlendorf, would you clarify why this affects not only small files on a ZFS dataset, but also small files stored on a zvol formatted with a completely different filesystem?

My understanding of a zvol was that it should reserve all of the space that it would ever use and never grow past that unless explicitly resized.

@behlendorf
Contributor

You see the impact when using a zvol because zvols default to an 8k block size. If you increase the zvol block size, the overhead will decrease. It's analogous to creating files in a ZFS filesystem with an 8k block size, which is what happens when you create a small file.

As for zvols, they don't reserve their space at creation time. While you do set a maximum volume size, they should only allocate space as they are written to, like any other object. If you want the behavior you're describing for your zvol, you need to set a reservation: zfs set reservation=N dataset.
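
For example (pool and volume names here are placeholders), the two knobs mentioned above look like this:

zfs create -V 1G -o volblocksize=64K rpool/bigvol   # larger block size, lower per-block overhead
zfs set reservation=1G rpool/bigvol                 # explicitly reserve the space up front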

@ryao
Contributor Author

ryao commented Feb 2, 2012

That would explain why there appears to be a factor of 2 discrepancy between what the filesystem on the zvol reported and what the zvol actually used.

I suspect that when ashift=12, the zvol will allocate two blocks and only use one (i.e. zero-pad it), as opposed to the typical unaligned write behavior where a 4KB logical sector would map to either the upper or lower half of an 8KB physical sector.

@rlaager
Member

rlaager commented Feb 2, 2012

behlendorf: When creating a zvol, a refreservation (a reservation on zpool version <= 8) is created by default. This is covered in the zfs man page and matches my experience with Solaris 11 Express as well as a quick test just now with ZFS on Linux. To get the behavior you describe, you need to add the -s option: zfs create -s -V <size> <dataset_name>
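
A quick way to see the default rlaager describes versus the -s behavior (dataset names are placeholders):

zfs create -V 1G rpool/thickvol        # gets a refreservation (reservation on zpool <= 8) by default
zfs create -s -V 1G rpool/sparsevol    # -s skips it
zfs get refreservation,reservation rpool/thickvol rpool/sparsevol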

@rlaager
Member

rlaager commented Feb 2, 2012

gentoofan: Are you seeing writes to a zvol (with the default refreservation) fail with ENOSPC? If so, that seems like a bug (i.e. the reservation isn't properly accounting for the worst-case overhead).

@ryao
Contributor Author

ryao commented Feb 2, 2012

I did some more tests. The following is on ashift=12, and I can't see any difference in terms of reported available space when setting a reservation:

localhost ~ # zfs create -V 400M -o reservation=400M rpool/test
localhost ~ # zfs list rpool/test
NAME USED AVAIL REFER MOUNTPOINT
rpool/test 413M 2.08G 144K -
localhost ~ # zfs destroy rpool/test
localhost ~ # zfs create -V 400M rpool/test
localhost ~ # zfs list rpool/test
NAME USED AVAIL REFER MOUNTPOINT
rpool/test 413M 2.08G 144K -

I also tested making a zvol on ashift=9 and I observed the 1.5GB space usage that I had seen on ashift=12. I then repeated with ext4 on top of 'zfs create -V 1G -o volblocksize=4K rpool/ROOT/portage', and the space usage of the zvol correlated with the space usage reported by ext4. Specifically:

root@ubuntu:# df /dev/zd0
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/zd0          917184 687852    229332  75% /mnt/gentoo/usr/portage
root@ubuntu:# zfs list rpool/ROOT/portage
NAME                 USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/portage  1.03G  1.35G   817M  -

Accounting for filesystem overhead, 2^20 KB - 229332 KB = 819244 KB, or approximately 800M. Allowing for ZFS' internal bookkeeping, the ~17MB difference from the 817M the zvol refers to seems reasonable.

I will retest with ashift=12 soon, although given that I had reproduced the 1.5GB zvol usage with ashift=9, I suspect that the volblocksize change will fix things there as well. There is still the issue of why a 1GB zvol with ashift=9 uses 1.5GB to store an ext4 filesystem containing files that a ZFS dataset with ashift=9 only needs 347MB to store.

@ryao
Contributor Author

ryao commented Feb 3, 2012

Dagger2, rlaager and dajhorn worked this out in IRC. The issue is that ashift=12 enforces a 4KB minimum block size. The two parity blocks required by raidz2 are both 4KB in size because of ashift=12.

The zvol has a default block size of 8KB, so 2x4KB are written as data along with 2x4KB of parity. Since the corresponding parity blocks have been consumed, the other two data blocks are marked as in use, even though they aren't storing anything. The consequence is that the smallest amount of data that can be written to a raidz pool is (disks - raidz level) * 2^ashift, which in my situation is 16KB.

This explains why filesystems on the zvol and on a ZFS dataset would require roughly the same amount of space, despite requiring much less on a single physical disk.
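
As a quick check of that formula for the pool in this report (a sketch assuming the 6-disk raidz2, ashift=12 layout described above):

disks=6 raidz_level=2 ashift=12
echo $(( (disks - raidz_level) * (1 << ashift) ))   # 16384 bytes, i.e. a 16KB minimum allocation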

@ryao
Contributor Author

ryao commented Feb 3, 2012

rlaager, I have not observed any write failures, although I imagine one would occur if I made a 3GB zvol in my configuration with the default 8KB volblocksize and then proceeded to fill it.

By the way, I had been sitting on the comment I made immediately after yours while I was running tests, so I didn't see your comment until now.

@rlaager
Member

rlaager commented Feb 3, 2012

I just tested this scenario. I created a zpool with ashift=9 on six 300 GiB files. I ran zfs create -V 1G tank/foo. I filled it with dd if=/dev/zero of=/dev/zd0. This worked (I believe), as it showed me having 111M AVAIL left on tank/foo. I then repeated the process with ashift=12. Partway through, I started seeing errors like this:
[85659.830922] Buffer I/O error on device zd0, logical block 37960
[85659.830923] lost page write due to I/O error on zd0

dd completed without error. So, as far as I can tell, there are two bugs here:

  1. when this error occurs, it's not being relayed to the application as a failed write.
  2. reservations do not properly take into account the blocksize and ashift values under some conditions (this being an example). In the ashift=12 case, the reservation would have to be much larger with the default volblocksize of 8K, which means that the zvol creation should have failed.

@ryao
Contributor Author

ryao commented Feb 3, 2012

rlaager, I believe your issue is what I described earlier, which involves a bad configuration. Once a zvol with such a configuration exists, there is not much that the code can do about it. If the actual volblocksize < (disks - raidz level) * 2^ashift, things like this can happen when you try to fill the zvol.

It might be worthwhile to make the default volblocksize vary depending on the raidz level and number of disks to prevent users from hitting these configurations by default. It might also be worthwhile to refuse to make zvols that violate this constraint. Of course, that won't help people who have pre-existing zvols that were made by older modules or other implementations.

@ryao
Contributor Author

ryao commented Feb 3, 2012

After talking about this with rlaager in IRC for a bit, I would like to suggest that the zvol code be patched to accomplish 4 things, where formula = (disks - raidz_level) * 2^ashift:

  1. Set the default volblocksize to max(2^(4 + ashift), formula[vdev0], formula[vdev1]...).
  2. Refuse to create a zvol that violates volblocksize < formula for any vdev in the pool.
  3. Make zvols that violate volblocksize < formula for a vdev readonly devices at pool import time.
  4. Refuse to add a vdev if it will cause a zvol to violate volblocksize < formula.

That would prevent the issues we encountered from occurring.
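
A rough userland sketch of the proposed defaulting rule (illustration only, not how the module currently behaves; the 6-disk raidz2 layout is assumed):

ashift=12
floor=$(( 1 << (4 + ashift) ))            # proposed floor of 2^(4 + ashift), i.e. 64K at ashift=12
formula=$(( (6 - 2) * (1 << ashift) ))    # (disks - raidz_level) * 2^ashift for the vdev, i.e. 16K
echo $(( formula > floor ? formula : floor ))   # proposed default volblocksize in bytes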

@rlaager
Member

rlaager commented Feb 3, 2012

Assuming that our idea of "formula" is correct (which probably needs more testing):

3: we should print a kernel message. Also, we should implement the "readonly" part by setting readonly=on on the zvol, which would allow the admin to override it. Imagine, "I upgraded ZFS on Linux and rebooted. All of my virtual machines failed because their zvols went readonly." They can continue at the same risk as before (though now it's known) until they have time to recreate the zvols with a different volblocksize and transfer the data.

4: If you want this to be constant-time (and not have to iterate over the zvols to check), then make the condition "if it would raise the default volblocksize". This is just as safe, but may have false positives (i.e. it may refuse even though no zvols exist with a volblocksize less than the new value).

We might want to relax the "zvol" conditions to "non-sparse zvols", where "non-sparse" means "with a reservation or refreservation". If I have a 6 disk raidz ashift=12 pool with a volblocksize=4k zvol for a VM's swap or volblocksize=8k for a database, I might want to waste space in exchange for the performance advantage of avoiding read-modify-write at the zvol level. Sparse zvols are already subject to failure if the pool fills up (and thus discouraged by the man page), so while the increased disk consumption might be a surprise, it's not violating any guarantee that ZFS made.

@Rudd-O
Contributor

Rudd-O commented Apr 25, 2012

Is there going to be a bug fix for this issue? I am being affected by this, to the point that in a pool with compression (1.7x) and dedupe (1.55x) enabled, the space used is about THE SAME as it was on the old NetApp Filer (which is outrageously big).

Six-disk RAIDZ2 pool here.

@Rudd-O
Contributor

Rudd-O commented Apr 25, 2012

Also I see a discrepancy between the total pool size in zpool list and zfs list (used + avail).

@behlendorf
Contributor

No work is currently planned to address this issue.

@pyavdr
Contributor

pyavdr commented Dec 13, 2012

@behlendorf
4K disks are mainstream now, even at the enterprise level. Vanished disk space (see also issue #1089) is a costly problem that discourages the use of 4K disks with raidz2, even without zvols. Inefficient use of disk space is a weak point for any filesystem, even ZFS. With larger disks (5TB and growing) this issue becomes severe, because small pools need a raidz2/raidz3 configuration, and other filesystems can do the job more cheaply without wasting disk space. I would like to see an efficient ZFS filesystem, so please plan to solve this in the coming year.

@fa2k

fa2k commented Apr 4, 2013

I thought that if a process wrote a 16K buffer to a zvol with volblocksize=4K, it would be considered a single block and spread over 4 drives if available. My testing shows otherwise: it seems to split the data into 4K blocks and use short stripes even when writing 16K buffers.

E.g. on a 5-disk raidz, create 3 test volumes, one with volblocksize=16K and two with volblocksize=4K, then use dd to write in 4K or 16K blocks:

zfs create -V 40G -o volblocksize=16K ypool/test_vb16_dd16
dd if=/dev/urandom of=/dev/zvol/ypool/test_vb16_dd16 bs=16K

zfs create -V 40G -o volblocksize=4K ypool/test_vb4_dd16
dd if=/dev/urandom of=/dev/zvol/ypool/test_vb4_dd16 bs=16K

zfs create -V 40G -o volblocksize=4K ypool/test_vb4_dd4
dd if=/dev/urandom of=/dev/zvol/ypool/test_vb4_dd4 bs=4K

zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
ypool/test_vb16_dd16 48.4G  2.56T  48.4G  -
ypool/test_vb4_dd16  65.9G  2.56T  65.9G  -
ypool/test_vb4_dd4   65.9G  2.56T  65.9G  -

This may be what is expected, but then the default volblocksize of 8K seems like a poor choice. I get 30% better random 4K read IOPS when using volblocksize=4K instead of 16K, so there is an argument for a small volblocksize, but a hybrid approach that combines large write buffers seems better if possible.
(Sorry if this was incomprehensible.)

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Apr 12, 2013
Previous patches have allowed you to set an increased ashift to
avoid doing 512b IO with 4k sector devices.  However, it was not
possible to set the ashift lower than the reported physical sector
size even when a smaller logical size was supported.  In practice,
there are several cases where setting a lower ashift is useful:

* Most modern drives now correctly report their physical sector
  size as 4k.  This causes zfs to correctly default to using a 4k
  sector size (ashift=12).  However, for some usage models this
  new default ashift value causes an unacceptable increase in
  space usage.  Filesystems with many small files may see the
  total available space reduced to 30-40% which is unacceptable.

* When replacing a drive in an existing pool which was created
  with ashift=9 a modern 4k sector drive cannot be used.  The
  'zpool replace' command will issue an error that the new drive
  has an 'incompatible sector alignment'.  However, by allowing
  the ashift to be manually specified as a smaller, non-optimal,
  value the device may still be safely used.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#1381
Closes openzfs#1328
Issue openzfs#967
Issue openzfs#548
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 29, 2013
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 30, 2013
@barrkel

barrkel commented Nov 14, 2013

Last weekend I created a 12x4T raidz2 array and streamed across the contents of my old Nexenta fs to test the viability of zfs on Linux.

Imagine my surprise when I noticed fs size jumped from 3.27T to 5.73T in the transition! The combo of 128k blocks, ashift=12 and raidz2 meant a 75% space overhead, almost entirely eliminating any space savings from using raidz2 rather than mirroring.

This is an issue.

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
@byteharmony

Testing today with the latest ZFS release on CentOS 6 shows the same space usage problem on raidz3 with 4K drives.

FYI
BK

@behlendorf behlendorf removed this from the 0.7.0 milestone Oct 6, 2014
@behlendorf
Contributor

This is just something people need to be aware of. ZoL behaves the same in this regard as all the other OpenZFS implementations. The only real difference is that ZoL is much more likely to default to ashift=12. However, the default ashift can always be overridden at pool creation time if this is an issue. Since no work is planned to change this behavior I'm closing the issue.
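
For example, overriding the default at pool creation time looks like this (device names are placeholders):

zpool create -o ashift=9 tank raidz2 sdb sdc sdd sde sdf sdg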

@NoAgendaIT

Even though this bug has been closed, could someone please share recommendations for creating an ext4 zvol on a pool that has either ashift=9 or ashift=12? Does the above inefficiency only affect pools that have raidz, or are regular mirrors also affected?

@isopix

isopix commented Nov 13, 2021

Is this still an issue for zfs 2.1.1?
Does compression fix that?
Is it only related to raidz2/2 and not original raid-z1?

@sabirovrinat85

Is this still an issue for zfs 2.1.1? Does compression fix that? Is it only related to raidz2/2 and not original raid-z1?

As I understand it, raidz1 has the same problem. I have copied some comments about this from somewhere (specifically about VMs in datasets and ZVOLs):

"This is the problem, volblocksize=8K. When using 8K, you will have many padding blocks. At a time you will write a 8K block on your pool.For a 6 disk pool(raidz2), this will be 8K / 4 data disk = 0.5 K. But for each disk, you can write at minimum 4K(ashift 12), so in reality you will write 4 blocks x 4K =16 K(so it is dubble). So from this perspective(space usage), you will need at least volblocksize=16K"

"please add -o volblocksize= while creating the volume.
If you have x + parity HDDs then
blocksize = 2 ^ floor(log2(x)) * 2 ^ ashift

If you have 16 disks with RAIDZ3 and ashift=12 => x=(16-3)=13 =>
floor(log2(13)) = 3 =>
blocksize = 2 ^ 3 * 2 ^ 12 =>
blocksize = 32k"
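
The quoted rule can be checked with a little shell arithmetic (illustration only; the pool geometry is the one from the quote):

disks=16 parity=3 ashift=12
x=$(( disks - parity ))                                               # 13 data disks
p=0; while [ $(( 1 << (p + 1) )) -le "$x" ]; do p=$(( p + 1 )); done  # floor(log2(13)) = 3
echo $(( (1 << p) * (1 << ashift) ))                                  # 8 * 4096 = 32768, i.e. 32k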

"I found this chart that showed that that the default 8k volblocksize was indeed a problem. For Raidz1 with ashift of 12 (4K LBA) you need atleast:
-for 3 discs a volblocksize of 4x LBA = 16K
-for 4 discs a volblocksize of 3x or 16x LBA = 12K (be aware, not 2^n) or 64K
-for 5 discs a volblocksize of 8x LBA = 32K
-for 6 discs a volblocksize of 5x or 8x LBA = 20K (be aware, not 2^n) or 32K"

"The key insight is that normally, datasets are used for files of varying sizes. As such, when you write a small 16k file, ZFS can use a small 16k record. recordsize is a limit on the max record size. Smaller records are allowed.
When storing large unitary VM images, all the data is in a few big files. Since those files are very large, they only use the defined max recordsize. Even when you only create a small 8k file on your VM, at the ZFS layer that's an 8k section of a 128k record of a 10+GB file. As such, you set recordsize smaller to prevent lots of read-modify-write overhead when editing those large records.
I usually use a 32k recordsize on my vm filesystems, as I want some ability to benefit from compression when using 4k and 8k sector sizes (ashift).

That said, to correctly compare zvols vs datasets, I suggest you test the following configurations:

zvol virtual machine: default zvol parameters, disk configured with cache=none (to bypass the pagecache, the hypervisor must issue O_DIRECT writes);

dataset virtual machine: set recordsize=8K, atime=off, xattr=off, use a raw file disk image with cache=writeback (note: datasets do not engage the Linux pagecache, nor do they support O_DIRECT - unless you are using zfs 0.8.x, where a "fake" support for direct writes was added)"
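
A sketch of how those two setups might be created (pool and dataset names and sizes are placeholders; the cache= settings are configured on the hypervisor side, not in ZFS):

zfs create -V 32G tank/vm1-disk                                         # zvol-backed VM, default parameters
zfs create -o recordsize=8K -o atime=off -o xattr=off tank/vm2-images   # dataset that will hold a raw disk image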
