
batch mode deduplication support (feature request/discussion) #1071

Open
nightwalk opened this issue Oct 28, 2012 · 8 comments
Labels
Type: Feature Feature request or new feature

Comments

@nightwalk

Mr. Yao seems to think that the current ZFS architecture might be close to having what would be required to support batch deduplication as well. The usefulness of the current inline deduplication method is highly limited due to the drastic I/O performance hit it causes, so I thought this might be worth exploring.

Just so we're all on the same page, batch deduplication involves writing the data some place temporarily, then coming back and deduplicating it later. In other words, the deduplication happens in the background so it has as close to zero impact on write performance as possible for userland.

Ideally, blocks would be written out to storage and another process would come back around and dedup 'dirty' blocks at intervals, or whenever system load (by some metric) drops below a threshold (a similar concept to ksm/ksmtuned for KVM). However, I suspect this approach would be more likely to break compatibility and would also be somewhat less than trivial to implement.
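To illustrate the idea, here is a purely toy userspace sketch (none of these names exist in ZFS; the load threshold, dirty-block queue, and dedup table are all made up): writes complete at full speed and only mark blocks dirty, while a background pass collapses duplicates when the 1-minute load average is low.

```python
# Toy sketch only -- not ZFS code. Models a background pass that deduplicates
# "dirty" (recently written) blocks once system load drops below a threshold,
# similar in spirit to how ksmtuned gates KSM.
import hashlib
import os
import time

LOAD_THRESHOLD = 1.0     # hypothetical: only dedup when the 1-min loadavg is below this
BLOCK_SIZE = 128 * 1024  # pretend recordsize

dirty_blocks = []        # block ids written since the last pass
block_store = {}         # block_id -> data   (stands in for on-disk blocks)
dedup_table = {}         # sha256 digest -> canonical block_id (a toy DDT)
block_map = {}           # block_id -> canonical block_id (stands in for pointer rewrite)

def write_block(block_id, data):
    """Foreground write path: store the block and mark it dirty; no dedup cost here."""
    block_store[block_id] = data
    dirty_blocks.append(block_id)

def dedup_pass():
    """Background pass: collapse dirty blocks that hash to an existing entry."""
    while dirty_blocks:
        block_id = dirty_blocks.pop()
        digest = hashlib.sha256(block_store[block_id]).digest()
        canonical = dedup_table.setdefault(digest, block_id)
        block_map[block_id] = canonical
        if canonical != block_id:
            del block_store[block_id]   # reclaim the duplicate copy

def background_loop():
    """Run the dedup pass only while the system is quiet."""
    while True:
        if dirty_blocks and os.getloadavg()[0] < LOAD_THRESHOLD:
            dedup_pass()
        time.sleep(5)
```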

Mr. Yao had some initial thoughts on the matter that might have some merit though. He seemed to think that it might be possible to use the ZIL for the temporary storage and do the actual deduplication prior to sending the data to its final resting place in the filesystem.

At least, I believe that was the gist of it. I'm sure he'll correct me if it's not quite right :)

@behlendorf
Contributor

So what @ryao suggests is pretty much what happens today with inline dedup. The dedup tables are updated asynchronously during the txg writeback. The fundamental motivation for doing it there is that I/O is expensive, so it's better to dedup the data once while it's already in memory. Of course, if your system doesn't have enough memory for the dedup tables, you end up performing I/O anyway and trashing performance.
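For anyone following along, the write-path cost roughly looks like this (a toy sketch, not the real ZFS code path; `allocate_block` and `ddt_lookup_on_disk` are made-up stand-ins): the lookup is cheap while the table fits in RAM, and turns into extra random I/O per write once it does not.

```python
# Conceptual sketch only: why inline dedup is cheap while the dedup table (DDT)
# fits in memory, and expensive once it spills to disk.
import hashlib

in_memory_ddt = {}   # digest -> {"block_id": ..., "refcount": ...}; toy stand-in for the DDT

def dedup_write(data, allocate_block, ddt_lookup_on_disk=None):
    """Write one block through an inline-dedup path.

    allocate_block(data) -> block_id  stands in for a normal allocation + write.
    ddt_lookup_on_disk(digest)        stands in for the random read needed when
                                      the DDT entry is not cached in RAM.
    """
    digest = hashlib.sha256(data).digest()

    entry = in_memory_ddt.get(digest)
    if entry is None and ddt_lookup_on_disk is not None:
        # The painful case: the table has outgrown RAM, so a write may now pay
        # an extra random I/O just to check for a match.
        entry = ddt_lookup_on_disk(digest)

    if entry is not None:
        entry["refcount"] += 1           # duplicate: no new data written
        return entry["block_id"]

    block_id = allocate_block(data)      # unique: write it and remember the digest
    in_memory_ddt[digest] = {"block_id": block_id, "refcount": 1}
    return block_id
```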

However, it seems to me that there is another logical place where dedup could be performed with minimal extra overhead, and that's during a scrub/resilver. Performing a lazy/batch dedup in the context of a scrub/resilver (see the sketch after the lists below) has a couple of advantages and disadvantages.

advantages:

  • It's asynchronous and can run as a background process with minimal impact to normal I/O.
  • A scrub/resilver must already pay the cost of reading every byte from disk and verifying the checksum, so this isn't wasted I/O.
  • Allowing the scrub thread to write new blocks means we could apply or remove dedup on a per-dataset basis. This would enable people to remove deduplication from a dataset or an entire pool online (though this could still be a slow process depending on your system resources).
  • Potentially, this could add some initial infrastructure to allow online defragmentation during scrub/resilver. However, that's a much harder nut to crack and it's still debatable how helpful a full defrag would be.
  • This could probably be done without changing the on disk format since we currently allow deduped and non-deduped data to coexist in the pool.

disadvantages:

  • A lazy dedup would mean no immediate space saving until the pool was scrubbed. This is probably entirely appropriate for a large number of workloads, but it means leaving free scratch space available on disk for new non-deduped writes.
  • This would be a long-term development item and would significantly diverge our scrub/resilver code from upstream. It would be best to do this in a way they would accept, to keep us in sync.
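To make the idea concrete, here's a rough toy sketch (purely illustrative; `verify_checksum` and `remap_block` are stand-ins, not real ZFS interfaces) of how a scrub pass could piggyback a lazy dedup on the reads it already performs:

```python
# Rough conceptual sketch, not the real scrub code: since a scrub already reads
# every block and verifies its checksum, a lazy dedup pass could reuse that read
# to populate a dedup table and remap duplicate blocks as it goes.
import hashlib

def scrub_with_lazy_dedup(blocks, verify_checksum, remap_block):
    """blocks: iterable of (block_id, data) the scrub visits.
    verify_checksum(block_id, data) -> bool  stands in for the normal scrub check.
    remap_block(block_id, canonical_id)      stands in for pointing the block at an
                                             existing identical copy and freeing it.
    """
    seen = {}   # digest -> canonical block_id, built up as the scrub walks the pool
    for block_id, data in blocks:
        if not verify_checksum(block_id, data):
            continue                      # the normal scrub repair path would run here
        digest = hashlib.sha256(data).digest()
        canonical = seen.setdefault(digest, block_id)
        if canonical != block_id:
            remap_block(block_id, canonical)   # duplicate found "for free" during the scrub
```

In the real thing the table would be the on-disk DDT and the remap would happen during txg writeback, but the point is that the expensive full read of the pool is already paid for by the scrub.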

@andys

andys commented Nov 8, 2012

IMHO, the biggest problem facing dedupe today is the extended lock-up time when doing a zfs destroy, and worse, the long mount time when rebooting in the middle of a destroy.

Even with gobs of RAM, a reboot while a pool is in the middle of destroying a big deduped filesystem can take hours, days, or even weeks if the disks are slow at random I/O. (I personally experienced a reboot that took over 2 weeks on a large-capacity system used for backups.)

I believe the solution in this area is twofold:

First, make zfs destroy a lazy, background process, which does not hold any locks and does not affect pool mount time.

Second, let us add a new vdev type: metadata. Get (say) two mirrored SSDs and add them to the pool as a "metadata" vdev, similar to how you would add a log or cache device currently.

All metadata would live on this vdev, giving a nice boost to normal (non-deduped) pool performance as well as getting rid of the problem of losing the cache on reboot.

@dajhorn
Contributor

dajhorn commented Nov 9, 2012

@andys, this improvement is already implemented upstream in Illumos. (See https://www.illumos.org/issues/2619)

ZoL will get it when the Feature Flags code is merged.

@andys

andys commented Nov 10, 2012

Cool. What do you think of my "dedicated metadata vdev" idea? I believe it has been implemented before by tegile.com - they added a vdev type called "meta" for their proprietary ZFS-based SAN.

@behlendorf removed this from the 1.0.0 milestone Oct 6, 2014
@maci0
Contributor

maci0 commented Oct 24, 2014

+1 for lazy dedup

@pavel-odintsov

Hello!

I tried to use online deduplication on a big 70 TB storage pool and ran into severe performance degradation. For my workload, though, there is no I/O at all during the night, so I could run deduplication and compression manually in that window.

But ZFS provides only online deduplication. It would be great if you added the ability to run deduplication in the background.

Thank you for your attention!

@koraa

koraa commented Sep 15, 2015

@kpande How do the penalties compare? How much of a penalty would you suffer with lazy dedup?

@mufunyo

mufunyo commented Jun 3, 2020

Just wanted to mention that this feature is still very much wanted. I love dedup because it allows lazy backups; just dump a system backup into the designated backup dataset which has dedup enabled, and if anything already exists in a prior backup, dedup will catch it.

However, with the current implementation I have to copy backups to a non-deduped dataset first and then slowly copy them over to the deduped dataset, periodically pausing once I/O gets clogged (or else any other application trying to access data from the same pool will block, great design btw). This works, but it requires a lot of manual intervention: pausing the copy, unpausing it again when system load returns to normal, and so on. It's sort of workable since I can keep an eye on it on my second monitor while doing other things, but it seems like a ridiculous amount of manual supervision just to ensure that a simple file copy doesn't become a pool-wide DoS.
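For what it's worth, the pause/unpause babysitting can be roughly scripted. Here's an illustrative sketch (the source/destination paths, load threshold, and chunk size are placeholders, and load average is only a crude proxy for I/O pressure):

```python
# Illustrative helper only: copy files into a dedup-enabled dataset in chunks,
# sleeping whenever the 1-minute load average suggests the pool is struggling.
# SRC, DST, and LOAD_THRESHOLD are placeholders to adapt to your system.
import os
import shutil
import time

SRC = "/backup-staging"        # non-deduped staging dataset (placeholder path)
DST = "/tank/backups"          # dedup=on dataset (placeholder path)
LOAD_THRESHOLD = 4.0           # crude stand-in for "I/O is getting clogged"
CHUNK = 16 * 1024 * 1024       # copy 16 MiB at a time so pauses take effect quickly

def throttled_copy(src_path, dst_path):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            while os.getloadavg()[0] > LOAD_THRESHOLD:
                time.sleep(30)             # back off until the system calms down
            buf = src.read(CHUNK)
            if not buf:
                break
            dst.write(buf)
    shutil.copystat(src_path, dst_path)

for root, _dirs, files in os.walk(SRC):
    rel = os.path.relpath(root, SRC)
    os.makedirs(os.path.join(DST, rel), exist_ok=True)
    for name in files:
        throttled_copy(os.path.join(root, name), os.path.join(DST, rel, name))
```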
