
batch mode deduplication support (feature request/discussion) #1071

Open
nightwalk opened this issue Oct 28, 2012 · 8 comments
Labels
Type: Feature Feature request or new feature

Comments

@nightwalk

Mr. Yao seems to think that the current ZFS architecture might be close to having what would be required to support batch deduplication as well. The usefulness of the current inline deduplication method is highly limited due to the drastic I/O performance hit it causes, so I thought this might be worth exploring.

Just so we're all on the same page, batch deduplication involves writing the data some place temporarily, then coming back and deduplicating it later. In other words, the deduplication happens in the background so it has as close to zero impact on write performance as possible for userland.

Ideally, blocks would be written out to storage and another process would come back around and dedup 'dirty' blocks at intervals, or whenever system load (by some metric) drops below a threshold (a similar concept to ksm/ksmtuned for KVM). However, I suspect this approach would be more likely to break compatibility and would also be somewhat less than trivial to implement.
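To illustrate the idea, here is a purely toy userspace sketch (none of these names exist in ZFS; the load threshold, dirty-block queue, and dedup table are all made up): writes complete at full speed and only mark blocks dirty, while a background pass collapses duplicates when the 1-minute load average is low.

```python
# Toy sketch only -- not ZFS code. Models a background pass that deduplicates
# "dirty" (recently written) blocks once system load drops below a threshold,
# similar in spirit to how ksmtuned gates KSM.
import hashlib
import os
import time

LOAD_THRESHOLD = 1.0     # hypothetical: only dedup when the 1-min loadavg is below this
BLOCK_SIZE = 128 * 1024  # pretend recordsize

dirty_blocks = []        # block ids written since the last pass
block_store = {}         # block_id -> data   (stands in for on-disk blocks)
dedup_table = {}         # sha256 digest -> canonical block_id (a toy DDT)
block_map = {}           # block_id -> canonical block_id (stands in for pointer rewrite)

def write_block(block_id, data):
    """Foreground write path: store the block and mark it dirty; no dedup cost here."""
    block_store[block_id] = data
    dirty_blocks.append(block_id)

def dedup_pass():
    """Background pass: collapse dirty blocks that hash to an existing entry."""
    while dirty_blocks:
        block_id = dirty_blocks.pop()
        digest = hashlib.sha256(block_store[block_id]).digest()
        canonical = dedup_table.setdefault(digest, block_id)
        block_map[block_id] = canonical
        if canonical != block_id:
            del block_store[block_id]   # reclaim the duplicate copy

def background_loop():
    """Run the dedup pass only while the system is quiet."""
    while True:
        if dirty_blocks and os.getloadavg()[0] < LOAD_THRESHOLD:
            dedup_pass()
        time.sleep(5)
```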

Mr. Yao had some initial thoughts on the matter that might have some merit though. He seemed to think that it might be possible to use the ZIL for the temporary storage and do the actual deduplication prior to sending the data to its final resting place in the filesystem.

At least, I believe that was the gist of it. I'm sure he'll correct me if it's not quite right :)

@behlendorf
Contributor

So what @ryao suggests is pretty much what happens today with inline dedup. The dedup tables are updated asynchronously during the txg writeback. The fundamental motivation for doing it there is that I/O is expensive, so it's better to dedup the data once while it's already in memory. Of course, if your system doesn't have enough memory for the dedup tables, you end up performing I/O anyway and trashing performance.
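For anyone following along, the write-path cost roughly looks like this (a toy sketch, not the real ZFS code path; `allocate_block` and `ddt_lookup_on_disk` are made-up stand-ins): the lookup is cheap while the table fits in RAM, and turns into extra random I/O per write once it does not.

```python
# Conceptual sketch only: why inline dedup is cheap while the dedup table (DDT)
# fits in memory, and expensive once it spills to disk.
import hashlib

in_memory_ddt = {}   # digest -> {"block_id": ..., "refcount": ...}; toy stand-in for the DDT

def dedup_write(data, allocate_block, ddt_lookup_on_disk=None):
    """Write one block through an inline-dedup path.

    allocate_block(data) -> block_id  stands in for a normal allocation + write.
    ddt_lookup_on_disk(digest)        stands in for the random read needed when
                                      the DDT entry is not cached in RAM.
    """
    digest = hashlib.sha256(data).digest()

    entry = in_memory_ddt.get(digest)
    if entry is None and ddt_lookup_on_disk is not None:
        # The painful case: the table has outgrown RAM, so a write may now pay
        # an extra random I/O just to check for a match.
        entry = ddt_lookup_on_disk(digest)

    if entry is not None:
        entry["refcount"] += 1           # duplicate: no new data written
        return entry["block_id"]

    block_id = allocate_block(data)      # unique: write it and remember the digest
    in_memory_ddt[digest] = {"block_id": block_id, "refcount": 1}
    return block_id
```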

However, it seems to me that there is another logical place where dedup could be performed with minimal extra overhead, and that's during a scrub/resilver. Performing a lazy/batch dedup in the context of a scrub/resilver (see the sketch after the lists below) has a couple of advantages and disadvantages.

advantages:

  • It's asynchronous and can run as a background process with minimal impact to normal I/O.
  • A scrub/resilver must already pay the cost of reading every byte from disk and verifying the checksum, so this isn't wasted I/O.
  • Allowing the scrub thread to write new blocks means we could apply or remove dedup on a per-dataset basis. This would enable people to remove deduplication from a dataset or an entire pool online (though this could still be a slow process depending on your system resources).
  • Potentially, this could add some initial infrastructure to allow online defragmentation during scrub/resilver. However, that's a much harder nut to crack and it's still debatable how helpful a full defrag would be.
  • This could probably be done without changing the on disk format since we currently allow deduped and non-deduped data to coexist in the pool.

disadvantages:

  • A lazy dedup would mean no immediate space saving until the pool was scrubbed. This is probably entirely appropriate for a large number of workloads, but it means leaving free scratch space available on disk for new non-deduped writes.
  • This would be a long-term development item and would significantly diverge our scrub/resilver code from upstream. It would be best to do this in a way they would accept, to keep us in sync.
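To make the idea concrete, here's a rough toy sketch (purely illustrative; `verify_checksum` and `remap_block` are stand-ins, not real ZFS interfaces) of how a scrub pass could piggyback a lazy dedup on the reads it already performs:

```python
# Rough conceptual sketch, not the real scrub code: since a scrub already reads
# every block and verifies its checksum, a lazy dedup pass could reuse that read
# to populate a dedup table and remap duplicate blocks as it goes.
import hashlib

def scrub_with_lazy_dedup(blocks, verify_checksum, remap_block):
    """blocks: iterable of (block_id, data) the scrub visits.
    verify_checksum(block_id, data) -> bool  stands in for the normal scrub check.
    remap_block(block_id, canonical_id)      stands in for pointing the block at an
                                             existing identical copy and freeing it.
    """
    seen = {}   # digest -> canonical block_id, built up as the scrub walks the pool
    for block_id, data in blocks:
        if not verify_checksum(block_id, data):
            continue                      # the normal scrub repair path would run here
        digest = hashlib.sha256(data).digest()
        canonical = seen.setdefault(digest, block_id)
        if canonical != block_id:
            remap_block(block_id, canonical)   # duplicate found "for free" during the scrub
```

In the real thing the table would be the on-disk DDT and the remap would happen during txg writeback, but the point is that the expensive full read of the pool is already paid for by the scrub.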

@andys

andys commented Nov 8, 2012

IMHO, the biggest problem facing dedupe today is the extended lock-up time when doing a zfs destroy, and worse, the long mount time when rebooting in the middle of a destroy.

Even with gobs of RAM, a reboot while a pool is in the middle of destroying a big deduped filesystem can take hours, days, or even weeks if the disks are slow at random I/O. (I personally experienced a reboot that took over 2 weeks on a large-capacity system used for backups.)

I believe the solution in this area is twofold:

First, make zfs destroy a lazy, background process, which does not hold any locks and does not affect pool mount time.

Second, let us add a new vdev type: metadata. Get (say) two mirrored SSDs and add them to the pool as a "metadata" vdev, similar to how you would add a log or cache device currently.

All metadata would live on this vdev, giving a nice boost to normal (non-deduped) pool performance as well as getting rid of the problem of losing the cache on reboot.

@dajhorn
Contributor

dajhorn commented Nov 9, 2012

@andys, this improvement is already implemented upstream in Illumos. (See https://www.illumos.org/issues/2619)

ZoL will get it when the Feature Flags code is merged.

@andys

andys commented Nov 10, 2012

Cool. What do you think of my "dedicated metadata vdev" idea? I believe it has been implemented before by tegile.com - they added a vdev type called "meta" for their proprietary ZFS-based SAN.

@behlendorf removed this from the 1.0.0 milestone Oct 6, 2014
@maci0
Contributor

maci0 commented Oct 24, 2014

+1 for lazy dedup

@pavel-odintsov

Hello!

I tried to use online deduplication on a big 70 TB storage pool and ran into severe performance degradation. For my workload, though, there is no I/O at all during the night, so I could run deduplication and compression manually in that window.

But ZFS provides only online deduplication. It would be great if you added the ability to run deduplication in the background.

Thank you for your attention!

@koraa

koraa commented Sep 15, 2015

@kpande How do the penalties compare? How much of a penalty would you suffer with lazy dedup?

@mufunyo

mufunyo commented Jun 3, 2020

Just wanted to mention that this feature is still very much wanted. I love dedup because it allows lazy backups; just dump a system backup into the designated backup dataset which has dedup enabled, and if anything already exists in a prior backup, dedup will catch it.

However, with the current implementation I have to copy backups to a non-deduped dataset first and then slowly copy them over to the deduped dataset, periodically pausing once I/O gets clogged (or else any other application trying to access data from the same pool will block, great design btw). This works, but it requires a lot of manual intervention: pausing the copy, unpausing it again when system load returns to normal, and so on. It's sort of workable since I can keep an eye on it on my second monitor while doing other things, but it seems like a ridiculous amount of manual supervision just to ensure that a simple file copy doesn't become a pool-wide DoS.
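For what it's worth, the pause/unpause babysitting can be roughly scripted. Here's an illustrative sketch (the source/destination paths, load threshold, and chunk size are placeholders, and load average is only a crude proxy for I/O pressure):

```python
# Illustrative helper only: copy files into a dedup-enabled dataset in chunks,
# sleeping whenever the 1-minute load average suggests the pool is struggling.
# SRC, DST, and LOAD_THRESHOLD are placeholders to adapt to your system.
import os
import shutil
import time

SRC = "/backup-staging"        # non-deduped staging dataset (placeholder path)
DST = "/tank/backups"          # dedup=on dataset (placeholder path)
LOAD_THRESHOLD = 4.0           # crude stand-in for "I/O is getting clogged"
CHUNK = 16 * 1024 * 1024       # copy 16 MiB at a time so pauses take effect quickly

def throttled_copy(src_path, dst_path):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            while os.getloadavg()[0] > LOAD_THRESHOLD:
                time.sleep(30)             # back off until the system calms down
            buf = src.read(CHUNK)
            if not buf:
                break
            dst.write(buf)
    shutil.copystat(src_path, dst_path)

for root, _dirs, files in os.walk(SRC):
    rel = os.path.relpath(root, SRC)
    os.makedirs(os.path.join(DST, rel), exist_ok=True)
    for name in files:
        throttled_copy(os.path.join(root, name), os.path.join(DST, rel, name))
```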
