
Defragmentation like on btrfs #4785

Closed

RubenKelevra opened this issue Jun 23, 2016 · 6 comments

Comments

@RubenKelevra

It would be nice to implement defragmentation support and a way to display the file fragmentation on a disk.

Currently it's only possible to view the fragmentation of the free space (how is this calculated?).
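For reference, the free-space fragmentation mentioned above is the pool-level fragmentation property; a minimal way to view it, assuming a placeholder pool name "tank":

```sh
# "tank" is a placeholder pool name; FRAG is the free-space fragmentation metric.
zpool list -o name,fragmentation tank
zpool get fragmentation tank
```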

@Sachiru

Sachiru commented Jun 23, 2016

As has been repeated multiple times, this requires block pointer rewrite and, as such, is very far from being done.



@ryao
Contributor

ryao commented Jun 23, 2016

You can see block pointers with zdb, which lets you see to what extent a file is fragmented. Run zdb -ddddd $dataset $objectID. The object ID is the inode number reported by the VFS; you can find it out via the stat command.
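For example (the dataset and file path below are placeholders, and stat -c %i assumes GNU coreutils):

```sh
# Find the object ID (inode number) of a file, then dump its block pointers with zdb.
stat -c %i /tank/data/disk.img      # prints the object ID, e.g. 12345
zdb -ddddd tank/data 12345          # the indirect block dump shows where each block lives
```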

Fragmentation is a normal aspect of filesystems. It is required for a CoW filesystem to function and is a fact of life for filesystems in general. The main exceptions would be tape archives and maybe log-based filesystems (provided you never unlink or append to a file). Fragmentation is a non-issue when the average IO size times the IOPS is what constrains the sequential throughput. The 128KB default record size should limit the extent to which fragmentation can affect performance, at the trade-off of read-modify-write IO amplification when your IOs are routinely smaller than that, such as under a database workload (where you will want a smaller recordsize).

The ill effects attributed to file fragmentation in ZFS are often ZFS's own anti-fragmentation measures taking effect when metaslabs reach 96% full, to prevent a bad situation from becoming worse. That bad situation is the formation of gang blocks, which satisfy the need for one large allocation by using several small allocations in its place. Gang blocks increase the IOs required to perform reads and writes whenever one must be used, and consume free space faster. When a metaslab reaches 96% full, the ZFS driver switches from first-fit allocation behavior to best-fit allocation behavior to minimize the number of gang blocks it uses. Best-fit allocation is extremely CPU intensive and can limit IOPS.
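If you want to see the switch-over threshold on your system, it corresponds to a module parameter in ZFS on Linux (the default of 4% free space is the same as the 96%-full figure above); whether it is exposed depends on your version:

```sh
# Read-only check of the first-fit/best-fit switch-over point, if the parameter is exposed.
cat /sys/module/zfs/parameters/metaslab_df_free_pct    # default: 4 (i.e. switch at 96% full)
```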

Also, ZFS has a feature called LBA weighting that prefers metaslabs on the outermost tracks (at low LBAs). This increases bandwidth while the pool is relatively empty, but causes those metaslabs to enter the best-fit allocator earlier and makes the decrease in performance when the pool is relatively full more apparent, because mostly only the innermost tracks are left to use. This means that when a pool composed of rotational disks is particularly full, performance of sequential writes and sequential reads of recently written data will be lower than it was when the pool was empty. It also has the effect of triggering best-fit behavior on the outermost tracks early, because those metaslabs exceed the 96% threshold sooner. The effects of LBA weighting as a pool fills are often misattributed to fragmentation.

LBA weighting can be disabled globally via the metaslab_lba_weighting_enabled kernel module parameter. It was automatically disabled on non-rotational devices by a commit last year:

fb40095

If you are willing to sacrifice some performance when a pool is empty for more consistent performance as a pool fills, you could disable LBA weighting at runtime via echo 0 > /sys/module/zfs/parameters/metaslab_lba_weighting_enabled. This affects all pools on a system. If you are not using an initramfs, you can likely have the kernel module loaded with the option by doing echo options zfs metaslab_lba_weighting_enabled=0 >> /etc/modprobe.d/zfs.conf. If you use an initramfs (or even if you don't), it might be possible to do the same by booting with the kernel command line parameter zfs.metaslab_lba_weighting_enabled=0.
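To collect the options above in one place (these are the same commands as in the text; the modprobe.d route only takes effect the next time the module is loaded):

```sh
# Runtime, affects all pools on the system:
echo 0 > /sys/module/zfs/parameters/metaslab_lba_weighting_enabled

# Persistent via module option (regenerate the initramfs afterwards if you use one):
echo "options zfs metaslab_lba_weighting_enabled=0" >> /etc/modprobe.d/zfs.conf

# Or on the kernel command line:
#   zfs.metaslab_lba_weighting_enabled=0
```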

@RubenKelevra
Author

Thanks for your very informative response @ryao. I'm using ZFS as storage for raw files for vservers. I added a cronjob which takes snapshots every two hours. The server exports the diff via send | receive to a backup storage.

Rolling back can easily be done if the snapshot is still on the main machines, but I have to copy the entire data back if it has been deleted on the main machines.

So the issue of fragmentation is very pronounced in this case: the disks are doing triple the IO compared to one year earlier, but the load is about the same.

So I think my use case is a pretty bad fit for the current approach, because the plain raw files get fragmented more and more, since the original chunks remain as snapshot data for fast recovery.

It would be nice if it were possible to just rewrite the current data to a new location in one go and leave the snapshot data lying fragmented on disk - nobody cares about that here.
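For reference, a minimal sketch of the kind of two-hourly snapshot-and-send job described above (the dataset, backup host, and state file are placeholders):

```sh
#!/bin/sh
# Take a timestamped snapshot and send the increment since the previous one
# to a backup host. "tank/vservers", "backuphost" and the state file are hypothetical.
DATASET=tank/vservers
NOW=$(date +%Y%m%d-%H%M)
PREV=$(cat /var/lib/zfs-backup/last-snapshot)

zfs snapshot "$DATASET@$NOW"
zfs send -i "$DATASET@$PREV" "$DATASET@$NOW" | ssh backuphost zfs receive -u backup/vservers
echo "$NOW" > /var/lib/zfs-backup/last-snapshot
```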

@ryao
Contributor

ryao commented Jun 23, 2016

@RubenKelevra I suggest reading this:

http://open-zfs.org/wiki/Performance_tuning

It has some tips. Things like read-modify-write overhead could be happening; read-modify-write is more often an issue than fragmentation. vserver does kernel virtualization, so a blanket tip like recordsize=4K, as I would give for a VM host, might not be the best here. It depends on the IO sizes that your applications are using, e.g. a PostgreSQL database would do best with recordsize=8K.
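For example, checking and adjusting the record size of a dataset (the dataset name is a placeholder; a changed recordsize only applies to newly written blocks):

```sh
zfs get recordsize tank/var/postgres     # default is 128K
zfs set recordsize=8K tank/var/postgres  # matches PostgreSQL's 8K page size
```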

@RubenKelevra
Author

Actually I use only one DB server per hypervisor, and /var got its own HDD which has an 8K recordsize :)
