btrfs bedup equivalent #2554

Closed
bjquinn opened this issue Jul 29, 2014 · 13 comments

Labels
Type: Feature Feature request or new feature

Comments

@bjquinn

bjquinn commented Jul 29, 2014

ZFS dedup is painfully RAM intensive. I'm wondering if there are some compromises to be made that might allow for some deduplication functionality without all of the overhead. Btrfs has bedup which simply does a scan of the filesystem looking for files with identical hashes, then compares actual files for the hash matches, and if the files are truly identical, it submits the file to be deduplicated. Not sure exactly how the internals of this work, but I assume it's a bit like how cp --reflink works.

Anyway, this doesn't have that kind of RAM overhead, and it uses a sqlite table to keep track of files it has already hashed. It's file-level only, not block-level, but it seems like this would be very useful for backups. Yes, snapshots and clones are helpful, but not when someone renames a file, or worse, a large tree of files -- if someone simply renames "PHOTO 2014" to "PHOTOS 2014", you're going to store an additional copy of that entire folder tree, which in my case is sometimes hundreds of GB of incompressible data. That's quite annoying.
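
For illustration, here is a minimal sketch of the kind of file-level scan described above -- hash every file, cache the hashes in a sqlite table keyed on path/size/mtime, and report groups of identical candidates for a later dedup step. This is hypothetical code in the spirit of bedup, not bedup itself:

```python
#!/usr/bin/env python3
# Hypothetical sketch of a bedup-style file-level scan (not bedup's code):
# hash regular files, cache hashes in sqlite keyed on (path, mtime, size),
# and report groups of identical files that a dedup step could then merge.
import hashlib, os, sqlite3, sys

def file_hash(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def scan(root, db_path="hashes.db"):
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS files
                  (path TEXT PRIMARY KEY, mtime REAL, size INTEGER, hash TEXT)""")
    by_hash = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path) or os.path.islink(path):
                continue
            st = os.stat(path)
            row = db.execute("SELECT mtime, size, hash FROM files WHERE path=?",
                             (path,)).fetchone()
            if row and row[0] == st.st_mtime and row[1] == st.st_size:
                digest = row[2]  # unchanged since last scan, reuse cached hash
            else:
                digest = file_hash(path)
                db.execute("INSERT OR REPLACE INTO files VALUES (?,?,?,?)",
                           (path, st.st_mtime, st.st_size, digest))
            by_hash.setdefault((st.st_size, digest), []).append(path)
    db.commit()
    # Candidate groups only; a real tool would byte-compare before deduplicating.
    return [paths for paths in by_hash.values() if len(paths) > 1]

if __name__ == "__main__":
    for group in scan(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(group)
```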

Is this something that's possible? I'd be happy to do some work on this, although I'd need some direction. I think I'd be reasonably capable on the userland side of the tool, but I wouldn't know where to start in terms of what hooks I'd need into ZoL or what changes might be required there.

@FransUrbo
Contributor

On Jul 29, 2014, at 11:57 PM, bjquinn wrote:

Btrfs has bedup which simply does a scan of the filesystem looking for files with identical hashes, then compares actual files for the hash matches, and if the files are truly identical, it submits the file to be deduplicated.

'to be deduplicated'?

Sounds very I/O intensive and time consuming... But is that a way to convert a non-deduped fs to a deduped fs?

I'm sure we could do that with a little coding. But on the other hand, we have 'cp' which is almost as good :)

But in either case, once a filesystem is deduped, it's going to be memory intensive, whether it's BTRFS or ZFS...

@bjquinn
Author

bjquinn commented Jul 29, 2014

Yeah, it is I/O intensive and time-consuming, but it's an offline, run-once type of process that you could schedule during downtime and/or run at low priority during the day. ZFS dedup generally brings the system to its knees all the time and can't be done on any large scale.

I'm open to suggestions, but I'm not sure how 'cp' solves the problem? We might solve it with 'cp --reflink', but AFAIK that isn't implemented on ZFS yet, and in any case it wouldn't work on read-only snapshots.

File-based dedup isn't memory intensive after the fact. You simply create a reflink (or some equivalent). It's like a hardlink, except that if someone DOES modify the file, it breaks the link and stores two copies. It's no more memory intensive than a hardlink, but it's better in many cases in the sense that you don't inadvertently change a file you didn't know you were changing.
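
As a point of reference, on filesystems that implement reflinks (btrfs, and later XFS; not ZFS at the time of this thread), `cp --reflink` boils down to a single FICLONE ioctl (BTRFS_IOC_CLONE on older kernels): the new file shares the source's blocks copy-on-write, and a later write to either file allocates fresh blocks instead of touching the shared ones. A hedged sketch:

```python
# Sketch of a reflink-style clone via the Linux FICLONE ioctl, i.e. roughly
# what `cp --reflink=always` does on filesystems that support it. The clone
# shares blocks copy-on-write: writing to either file afterwards allocates
# new blocks rather than modifying the shared ones.
import fcntl
import os

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>

def reflink(src_path, dst_path):
    # Both files must live on the same (reflink-capable) filesystem.
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        fcntl.ioctl(dst, FICLONE, src)  # raises EOPNOTSUPP if the fs can't reflink
    finally:
        os.close(src)
        os.close(dst)

# Example (hypothetical paths):
# reflink("PHOTOS 2014/img_0001.cr2", "backup/img_0001.cr2")
```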

@FransUrbo
Contributor

I'm open to suggestions, but I'm not sure how 'cp' solves the problem?

Well, I was talking about how to convert a non-deduped fs to a deduped one.

You create a new fs that has dedup=on, then copy (or move, if you're brave) the files and directories from the old filesystem to the new one, and then destroy the old filesystem once everything has been copied..

File based dedup isn't memory intensive after the fact.

File based dedup... ? As in "if we find a file that is exactly the same somewhere else, don't store the second file, just make a link to the first one"?

Doesn't sound anywhere near as efficient as block dedup.... Which might explain why btrfs does it - it's the filesystem for small boys, pretending they're men :D

ZFS is for the big boys, who know they're men - they have the money and the time to do it right....

Ok, so insults aside, everything you do boils down to a compromise. Either you make sure you can run on low-memory systems, or you go 'all in'. Sorry, but ZFS is designed from the ground up with a certain ... "baseline system" in mind. One that costs a fortune...

Now consider what happens if I use a ZVOL (basically a block device I can do whatever I want with - I don't know if btrfs has something similar - it's much like LVM, though).

I export this via iSCSI to a VM, which installs 'OS-Of-My-Choice' on it.

I then create a second ZVOL and export it, also via iSCSI, to a second VM, which ALSO installs 'OS-Of-My-Choice' on it.

Let's also say that these two installations are identical. It would mean that I could theoretically get away with 50% of the space.

However, and this is the tricky part, they are on a volume that's outside the base FS (ZFS in this case). In my thought experiment, those two ZVOLs would have an ext4 filesystem on them...

For file-based dedup to work, ZFS would have to know exactly how ext4 (and every other filesystem available!!) works so that it could 'go in' there and look for duplicated files....

Technically possible, but in reality IMpossible - no one in their right mind would volunteer to either do this or to maintain it afterwards...

@bjquinn
Author

bjquinn commented Jul 29, 2014

I'm a huge ZFS fan, no love lost for btrfs here. :)

I'm talking about data stored directly in ZFS filesystems. I understand that zvols exported and formatted as ext4 would lose some of the benefits. You can't get the benefits of a filesystem you're not using.

I think that file-based dedup would result in a low dedup ratio in many circumstances, but it would be very high in others. Take a backup server that keeps snapshots of each nightly backup. The example I gave in the OP was an end user who changes the name of a folder from "PHOTO 2014" to "PHOTOS 2014", and then I back up that folder on the backup server. ZFS doesn't know it's the same data, so it stores it a second time -- once in last night's snapshot as "PHOTO 2014", and once in tonight's snapshot as "PHOTOS 2014". Or they simply make a copy / duplicate of a large folder. In several real-world examples I've run into, that has burned hundreds of GB (incompressible!) with each rename or copy. It seems a bit difficult to train end users never to rename anything.

Perhaps if #405 were implemented and I used clones instead of snapshots, I could write a script that looked for duplicate files and created a reflink in place of the two duplicate files. Or is there another way this could be done?

End result is that ZFS block-based dedup isn't usable in 99.9% of cases. Even if offline file-based dedup resulted in only half the benefit of online block-based dedup, but was useful in 90% of cases instead of 0.1% of cases, wouldn't that be something to look into?

@DeHackEd
Contributor

There have been comments on how to do dedup with less memory usage. You need to understand what the differences between ZFS and BTRFS are.

With BTRFS you can instantly clone files with --reflink, as you said. ZFS doesn't have that. There are two ways a block can be referenced multiple times: dedup and snapshots. You can't "instantly" clone a file because doing it that way doesn't meet either criterion. In contrast, BTRFS seems to use some crazy layers of indirection and trees to let it do that.

Also, with ZFS, even if you had an offline mode for deduplication, snapshots are truly read-only. You can't even (un)apply dedup to them. That's going to be a pretty major killjoy.

Yes, dedup sucks. They shoehorned it into an existing system that wasn't quite up to the task and it has limitations because of that. Besides the "metadata vdev" idea I don't think it's going to get any better any time soon.

@bjquinn
Author

bjquinn commented Jul 29, 2014

@DeHackEd thanks for the response.

Yep, I understand that ZFS doesn't have --reflink. Didn't seem like they were saying it was impossible over there in #405, though. Maybe that's the tree I should be barking up, if nobody has any other suggestions.

I also understand that snapshots are read-only, end of story. However, in the backup scenario, if something like cp --reflink WERE available, I could simply take clones of each nightly backup, rather than snapshots, and then find identical files and reflink them (as long as the files weren't already in the exact same relative path, as those would already be stored efficiently if they were copied over on top of a previous backup with something like rsync --inplace --no-whole-file). This ruins zfs send/recv, of course, but even then it's certainly an option that might be viable depending on what you're doing.

I'll leave this open for a little while longer, but if the consensus is that reflink would be a prerequisite to any dedup solution, I'll start lurking round #405 and close this bug.

Thanks for everyone's input.

@ryao
Contributor

ryao commented Jul 30, 2014

@bjquinn Unlike btrfs snapshots, ZFS snapshots are immutable by design, so we cannot update them without rewriting history. That means any kind of offline approach will require some form of block pointer rewrite to update snapshots.

Putting something like that into place is non-trivial, but when we have it, we should be able to leverage it to batch updates to the deduplication table by having a small growing table that is occasionally merged back into the main table when its size reaches a certain threshold.

As a note to my future self or anyone who decides to tackle BPR, I suspect that the trick is to run BPR twice, repurpose the top bit of the vdev id in the DVA to indicate in which of the two runs a block was written, and keep an indirection table that maps old DVAs to new DVAs while the operation is in progress. We use that to do a full tree traversal in a manner similar to a scrub.

This BPR traversal will tear down the existing vdevs and create "new vdevs" on the same devices as the originals. The new vdevs inherit the spacemaps of the old vdevs, and teardown from the "old" vdevs deallocates space from both. New allocations while this is in progress will use the new vdevs. The indirection table is used to stitch the old tree to the new one so that existing lookups continue working.

We should do things twice to fully rewrite history, to the point where the intermediate DVAs (with the top bit of the vdev id set) are no longer present when we finish.

@bjquinn
Author

bjquinn commented Jul 30, 2014

Lol, all roads lead to BPR, don't they? Why don't we just close all the other bugs and open one BPR bug? :)

Well what about clones? With file based dedup and writable clones, couldn't we accomplish something that way? I suppose it might still depend on reflink support, but couldn't it be possible if we did have reflink support?

@ryao
Contributor

ryao commented Jul 31, 2014

@bjquinn Reflinks are issue #405. Directories in ZFS are name-value pairs. Adding reflinks to that is non-trivial. One idea that might work would be to replace the block pointers in the indirection tree with object identifiers and utilize ZAP to store mappings between the object id and a (reference,block pointer) tuple. That would involve a disk format change and would only apply to newly created files. Each block access would suffer an additional indirection. Making reflinks under this scheme would require duplicating the entire indirection tree and updating the corresponding ZAP entries in a single transaction group.
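
To make the shape of that proposal concrete, here is a toy model (plain Python dictionaries, not ZFS code or a real ZAP) of the indirection described above: files reference opaque object ids, a single map resolves each id to a (reference count, block pointer) pair, and a reflink copies the id list and bumps the counts:

```python
# Toy model of the indirection sketched above, for illustration only.
# "files" plays the role of per-file indirection trees holding object ids,
# and "blockmap" plays the role of the ZAP mapping id -> (refcount, pointer).
import itertools

_next_id = itertools.count()
blockmap = {}   # object id -> [refcount, "block pointer"]
files = {}      # filename  -> list of object ids (the indirection "tree")

def write_file(name, block_ptrs):
    ids = []
    for ptr in block_ptrs:
        bid = next(_next_id)
        blockmap[bid] = [1, ptr]
        ids.append(bid)
    files[name] = ids

def reflink(src, dst):
    # Duplicate the indirection list and take a reference on every block;
    # this is the "duplicate the tree, update the ZAP entries" step.
    files[dst] = list(files[src])
    for bid in files[dst]:
        blockmap[bid][0] += 1

def unlink(name):
    for bid in files.pop(name):
        blockmap[bid][0] -= 1
        if blockmap[bid][0] == 0:
            del blockmap[bid]   # block is only freed when no file references it

write_file("a", ["DVA#1", "DVA#2"])
reflink("a", "b")
unlink("a")
print(files, blockmap)  # "b" still resolves both blocks; refcounts are back to 1
```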

@behlendorf behlendorf added this to the 1.0.0 milestone Jul 31, 2014
@behlendorf behlendorf modified the milestone: 1.0.0 Nov 8, 2014
@rlaager
Member

rlaager commented Oct 1, 2016

The scope of this seems unclear and it seems unlikely to happen.

@the8472

the8472 commented Oct 1, 2016

The scope of this seems unclear

Well, the btrfs extent_same ioctl would basically do the job for deduping from userspace.
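
For context, the btrfs extent_same call was generalized in Linux 4.5 as the FIDEDUPERANGE ioctl: userspace proposes two ranges it believes are identical, and the kernel re-verifies them before making them share storage. A hedged sketch of driving it from userspace, which only works on filesystems that implement the hook (ZFS did not at the time):

```python
# Sketch of userspace-driven dedup through the FIDEDUPERANGE ioctl, the
# generic successor to btrfs's extent_same. The kernel byte-compares the
# ranges itself and only shares storage if they really are identical.
import fcntl
import os
import struct

FIDEDUPERANGE = 0xC0189436  # _IOWR(0x94, 54, struct file_dedupe_range)

def dedupe_whole_file(src_path, dst_path):
    src = os.open(src_path, os.O_RDONLY)
    # Read-only is enough if you own the destination file (or are privileged);
    # otherwise open it for writing.
    dst = os.open(dst_path, os.O_RDONLY)
    try:
        length = os.fstat(src).st_size
        # struct file_dedupe_range header + one file_dedupe_range_info entry
        req = struct.pack("QQHHI", 0, length, 1, 0, 0) + \
              struct.pack("qQQiI", dst, 0, 0, 0, 0)
        res = fcntl.ioctl(src, FIDEDUPERANGE, req)
        _, _, bytes_deduped, status, _ = struct.unpack("qQQiI", res[24:])
        # status: 0 = ranges were identical, 1 = they differed, <0 = -errno
        return bytes_deduped, status
    finally:
        os.close(src)
        os.close(dst)

# Real tools (duperemove, for example) loop over block-aligned chunks and
# check bytes_deduped, since a single call may be capped by the filesystem.
```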

@giox069

giox069 commented Jul 28, 2017

Just to add a note: NTFS, starting from Windows Server 2012, has offline deduplication. It runs automatically when the system is idle and is suspended when there is system load.
This saved a lot of space on some general-purpose file servers where my customers keep many office files, images, emails, and attachments. After 5-6 years of use, we found that 30-40% of the used space was duplicated, and automatic NTFS background deduplication saved us that space.
I tried to do the same with ZFS online deduplication, but RAM usage grew to more than 16GB. And RAM wasn't the only problem, but mount time too! It took a long time to remount a deduplicated disk after a reboot.
I think that offline deduplication for ZFS would be a very interesting feature.
For a deduplicating backup system, as asked for in the first post, which avoids big transfers after a simple directory rename, look at [borg backup](https://github.com/borgbackup).

@Zocker1999NET

Sorry to post something in an old thread like this one, but I want to add one correction before a newbie accidentally learns something wrong:

Renaming a folder does not (normally) create a copy of all the files inside; it just modifies the name of the directory. That should be true for most modern filesystems. Making a snapshot in ZFS and then diffing it shows that ZFS snapshots also understand that only a rename was done. So using ZFS snapshots and sending them incrementally to a target does not retransmit all the renamed files; it just retransmits the inodes of the renamed directory (or files, if their names were changed as well).

However, moving files between different partitions or datasets does require retransmitting them. Also, external backup tools might need to retransmit them, depending on how they detect the renames.
