
Lightweight Deduplication (for Copying) #13572

Open
Haravikk opened this issue Jun 19, 2022 · 14 comments
Labels
Type: Feature Feature request or new feature

Comments

@Haravikk

Haravikk commented Jun 19, 2022

Describe the feature you would like to see added to OpenZFS

Inspired by discussion on issue #13392, I would like to see a "lightweight" deduplication optimised for copying operations.

The basic idea is that instead of attempting to build a full dedup table for one or more entire datasets (which can require a lot of RAM), the "lightweight" table would only track blocks that have been recently read, with some options for tuning the amount of dedup data held in memory using this method.

Depending upon how well this "lightweight" dedup table is tuned, it should be possible for ZFS to eliminate large amounts of duplication as a result of copying within the pool, as any data that is written out shortly after reading (i.e- copied) should be detectable using this more limited dedup table.
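
To make the idea concrete, here is a very rough standalone C sketch of the sort of structure I mean; it is not OpenZFS code, and every name in it (`lwd_entry_t`, `lwd_note_read`, `lwd_lookup`) is made up for illustration. The table only remembers the checksums of recently read blocks, so its memory footprint is fixed and small, and a hit at write time means the writer could reference the existing on-disk block instead of allocating a new copy:

```c
/* Hypothetical sketch of a bounded "recently read" checksum table; not OpenZFS code. */
#include <stdint.h>
#include <string.h>

#define LWD_CAPACITY 4096   /* small, fixed table -> small, predictable RAM cost */

typedef struct lwd_entry {
	uint64_t cksum[4];      /* 256-bit block checksum (e.g. SHA-256) */
	uint64_t dva;           /* where the existing on-disk copy lives */
	uint64_t birth_txg;     /* needed to reconstruct a valid reference later */
	int      valid;
} lwd_entry_t;

static lwd_entry_t lwd_table[LWD_CAPACITY];

static unsigned
lwd_slot(const uint64_t cksum[4])
{
	return ((unsigned)(cksum[0] % LWD_CAPACITY));
}

/* Read path: remember a block we just read and verified. */
static void
lwd_note_read(const uint64_t cksum[4], uint64_t dva, uint64_t birth_txg)
{
	lwd_entry_t *e = &lwd_table[lwd_slot(cksum)];

	memcpy(e->cksum, cksum, sizeof (e->cksum));  /* overwrite: older entries simply fall out */
	e->dva = dva;
	e->birth_txg = birth_txg;
	e->valid = 1;
}

/* Write path: is this exact block already on disk and recently seen? */
static const lwd_entry_t *
lwd_lookup(const uint64_t cksum[4])
{
	const lwd_entry_t *e = &lwd_table[lwd_slot(cksum)];

	if (e->valid && memcmp(e->cksum, cksum, sizeof (e->cksum)) == 0)
		return (e);   /* caller may reference the existing block instead of writing */
	return (NULL);
}
```

Because new reads simply overwrite older slots, the table naturally forgets data that hasn't been read recently, which is exactly the behaviour described above.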

How will this feature improve OpenZFS?

In most cases full deduplication is overkill, requiring a large amount of RAM to enable. However, for many use cases the primary source of duplication is going to be copying within or between datasets, as well as typical read/write activity where files are written out with only some parts changed.

By focusing dedup on the internal copying use cases, it should be possible to allow dedup to be used with a much smaller memory impact, and for use cases where internal copying is the main source of duplication this ought to achieve much the same benefits as full deduplication; namely fast (metadata only) copying/moving of files within the same or related (same encryption root, similar settings) datasets, and reduced impact on capacity.

As with full deduplication this should also benefit partial copies, i.e- where only some of a file's records are copied while others are discarded, replaced or added. The main limitation of this "lightweight" deduplication is that data which has left the smaller deduplication table cannot be deduplicated, i.e- files that are opened but not written out until some time later, or files that are imported from an external source; however, if these are minority cases for a pool there will still be a benefit from enabling the feature.

Additional context

Firstly, there's almost certainly a better name for what I'm describing, but "lightweight" deduplication is the best I've come up with so far, since the idea is "deduplication of the most recently read files", i.e- deduplication of short-term copying/editing. I also quite like "temporal deduplication" because it sounds extra fancy.

The actual tuning of this dedup table is a little tricky, but I have three options I've thought of so far (a rough sketch of how they might share one mechanism follows the list):

  1. Tie the Dedup to ARC/L2ARC: Since the ARC/L2ARC already covers recently accessed data, my current preferred solution is to effectively generate a dedup table for the ARC's (and maybe also the L2ARC's) contents only. Unless a user is using a weirdly huge ARC/L2ARC the dedup table will still be a lot smaller than it would be for full deduplication, and all we really need is a way to look up ARC/L2ARC content using a hash. In this case the only setting required is something on the receiving dataset to indicate whether deduplication should be used where possible (e.g- deduplication=lightweight?), otherwise the dedup table is ignored and a copy occurs as normal. A possible secondary setting on the pool could determine whether the L2ARC is included?
  2. Deduplicate open (and recently opened) read streams: This would mean the dedup table scales with the amount of read activity (potential copying) on the enabled dataset(s), with a setting to determine how quickly space is reclaimed once a read stream is closed. This shouldn't be too quick, or copies may be missed, but the longer the cleanup takes, the bigger the maximum size will become. This would require the same target dataset setting as the above option, as well as a source dataset option to enable dedup of reads, and also possibly a per dataset setting to limit the amount of "recently open" data. While this method should be functional, it would also be the most complicated to configure and control the size of.
  3. Maximum Size: The simplest option, requiring a global setting to set the size limit for the "lightweight" dedup table as shared across all enabled datasets on the system (if any). This would leave performance/effectiveness to be determined by the user with more/less performance the larger/smaller the maximum size is.
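
As a sketch of how the three options might share one mechanism (none of these names or tunables exist in OpenZFS; this is purely illustration), they could all feed the same table and differ only in the rule that decides how long an entry may stay:

```c
/* Hypothetical retention policies for the "lightweight" dedup table; not OpenZFS code. */
#include <stdint.h>

typedef enum {
	LWD_POLICY_ARC,      /* option 1: entry lives while its block is in ARC (or L2ARC) */
	LWD_POLICY_STREAMS,  /* option 2: entry lives while its read stream is (recently) open */
	LWD_POLICY_MAXSIZE   /* option 3: entry lives until a global byte cap forces eviction */
} lwd_policy_t;

typedef struct lwd_tuning {
	lwd_policy_t policy;
	uint64_t     max_bytes;       /* only meaningful for LWD_POLICY_MAXSIZE */
	int          include_l2arc;   /* option 1 refinement: also count L2ARC residency */
} lwd_tuning_t;

/* Decide whether a table entry may be kept, given state supplied by the caller. */
static int
lwd_entry_retain(const lwd_tuning_t *t, int in_arc, int in_l2arc,
    int stream_recently_open, uint64_t table_bytes)
{
	switch (t->policy) {
	case LWD_POLICY_ARC:
		return (in_arc || (t->include_l2arc && in_l2arc));
	case LWD_POLICY_STREAMS:
		return (stream_recently_open);
	case LWD_POLICY_MAXSIZE:
		return (table_bytes <= t->max_bytes);
	}
	return (0);
}
```

The write-path lookup would be identical in all three cases; only the eviction rule differs.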

In all three cases, any enabled dedup devices could be used as normal for storing the actual table(s), though this would mainly be of benefit if your dedup devices aren't big enough to store whole dataset dedup tables.

@Haravikk Haravikk added the Type: Feature Feature request or new feature label Jun 19, 2022
@Haravikk Haravikk changed the title from Lightweight Deduplication (for Copying Only) to Lightweight Deduplication (for Copying) Jun 19, 2022
@CyberCr33p

We do web hosting and it's very common for people to need a "clone" of their website to run tests before upgrading their main website. With this feature users would not need to double their quota, as most of the space they use is taken up by their images.

@GregorKopka
Contributor

The idea of scanning the ARC MRU for dedup candidates (technically done through block cloning, to avoid the penalties and prerequisites of dedup) is a nice one.

While watching https://www.youtube.com/watch?v=qsE3R0Ysc8g I became unsure whether that is actually feasible though, as it might require finding the BP for the identified data block (which might be a problem, as there is no pointer from the data to the parent BP, AFAIK)...?

@amotin
Member

amotin commented Oct 3, 2023

ARC does not store the full block pointer, only the first DVA. ARC does not store checksums. Currently the dedup flag is required to be set in all pointers to the data to correctly do reference counting, so enabling lightweight dedup would mean that all blocks would still need to be marked as dedup, with the additional free cost, while the potential hit rate would be much lower.

@Haravikk
Author

Haravikk commented Oct 3, 2023

ARC does not store the full block pointer, only the first DVA. ARC does not store checksums. Currently the dedup flag is required to be set in all pointers to the data to correctly do reference counting, so enabling lightweight dedup would mean that all blocks would still need to be marked as dedup, with the additional free cost, while the potential hit rate would be much lower.

My thinking was that the pointer and checksum would be stored separately in a dedup table as normal; the connection to ARC/L2ARC is in how the dedup table is populated, i.e- when a record is loaded from a dataset and is eligible to go into the ARC it is also added to the dedup table (if eligible for that too); likewise, when an entry is discarded from ARC/L2ARC the corresponding entry is also discarded from the dedup table (if the space is needed; there's no harm in keeping it if not).

So they'd be separate but parallel, with the dedup table's content and size limited by what is sent to ARC/L2ARC.
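
As a sketch of that parallel wiring (hypothetical names only, not OpenZFS code, and keeping in mind amotin's point above that the ARC itself does not keep checksums, so the checksum would have to be captured on the read path while it is being verified), the table would only ever be touched from two places:

```c
/* Hypothetical hooks mirroring ARC insert/evict into a small checksum table; not OpenZFS code. */
#include <stdint.h>
#include <string.h>

#define LWD_SLOTS 1024

typedef struct lwd_slot {
	uint64_t cksum[4];   /* captured on the read path, while the checksum is verified */
	uint64_t dva;        /* first DVA of the on-disk copy */
	int      in_cache;   /* block still resident in ARC/L2ARC? */
	int      used;
} lwd_slot_t;

static lwd_slot_t lwd[LWD_SLOTS];

/* Read path: the block was verified and is going into the ARC, so mirror it here. */
static void
lwd_on_arc_insert(const uint64_t cksum[4], uint64_t dva)
{
	lwd_slot_t *s = &lwd[cksum[0] % LWD_SLOTS];

	memcpy(s->cksum, cksum, sizeof (s->cksum));
	s->dva = dva;
	s->in_cache = 1;
	s->used = 1;
}

/* ARC/L2ARC eviction: just mark the entry reclaimable; it only actually goes away
 * when the table itself needs the slot (no harm in keeping it if space allows). */
static void
lwd_on_arc_evict(const uint64_t cksum[4])
{
	lwd_slot_t *s = &lwd[cksum[0] % LWD_SLOTS];

	if (s->used && memcmp(s->cksum, cksum, sizeof (s->cksum)) == 0)
		s->in_cache = 0;
}
```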

@amotin
Member

amotin commented Oct 3, 2023

when a record is loaded from a dataset and is eligible to go into the ARC it is also added to the dedup table (if eligible for that too); likewise, when an entry is discarded from ARC/L2ARC the corresponding entry is also discarded from the dedup table.

So you propose to trash the dedup table on disk each time somebody merely reads anything into ARC or evicts anything from it. Not cool.

@Haravikk
Author

Haravikk commented Oct 3, 2023

So you propose to trash the dedup table on disk each time somebody merely reads anything into ARC or evicts anything from it.

Not even remotely, no? I'm not sure how you're getting from "add or remove to/from the dedup table" to "completely destroy the dedup table every single time for no reason"?

For an in-memory dedup table (which this feature request is more geared towards, since we're being size-conscious) that may mean it needs to be resized to fit, as it's hard to guess how big it needs to be, and it may need to shrink if empty space needs to be freed (system is low on memory). But we've had hash tables designed to do this for literally decades; that's a solved problem, and they usually resize by factors of two until they reach a stable size.
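
For illustration only, the power-of-two resize I'm referring to looks roughly like this generic open-addressing rehash (nothing ZFS-specific, and the names are made up):

```c
/* Generic power-of-two hash table resize with linear probing; illustration only. */
#include <stdint.h>
#include <stdlib.h>

typedef struct ht {
	uint64_t *keys;     /* 0 = empty slot; stored keys are assumed non-zero */
	size_t    nslots;   /* always a power of two */
	size_t    nused;
} ht_t;

static int
ht_rehash(ht_t *ht, size_t new_slots)
{
	uint64_t *nk = calloc(new_slots, sizeof (uint64_t));

	if (nk == NULL)
		return (-1);
	for (size_t i = 0; i < ht->nslots; i++) {
		uint64_t k = ht->keys[i];
		if (k == 0)
			continue;
		size_t j = k & (new_slots - 1);        /* power-of-two mask */
		while (nk[j] != 0)
			j = (j + 1) & (new_slots - 1); /* linear probing */
		nk[j] = k;
	}
	free(ht->keys);
	ht->keys = nk;
	ht->nslots = new_slots;
	return (0);
}

/* Grow at ~75% load; a shrink path would first discard stale entries
 * (e.g. anything no longer in ARC/L2ARC) and then rehash into nslots / 2. */
static int
ht_maybe_grow(ht_t *ht)
{
	if (ht->nused * 4 >= ht->nslots * 3)
		return (ht_rehash(ht, ht->nslots * 2));
	return (0);
}
```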

The in-memory deduplication case doesn't require anything on disk because the goal is to gain some dedup while minimising overhead, mostly aimed at copying/rewriting; the compromise is that some deduplication will be missed, but if the alternative is you don't have enough memory to run dedup at all then that doesn't matter.

In the case of a dedicated dedup device you either a) don't need this feature, or b) the table will just be the size of the device as normal. The important part is adding the entries as they appear; entries being removed can happen at any time once they're no longer in ARC/L2ARC. While the goal is to minimise dedup table size, if you're allocating a fixed-size device for it then that doesn't matter, though it's still useful to know which entries are no longer needed (as they can be cleared out first to make room for "active" ones). Strictly speaking this is true of in-memory as well, i.e- the dedup table's entries don't need to be removed immediately unless the table needs to shrink to reclaim memory.

@GregorKopka
Contributor

I think the point @amotin wants to make is: the DDT (both for classic dedup and the more efficient per-vdev ones used for block cloning) has to be persisted on disk, as it's not only there to identify identical blocks but is an integral part of pool-level free space accounting.
Discarding them would very quickly confront you with the dreaded 'restore pool from backup' message.

Also: he uses trashing in the specialized domain sense (not in the sense of opening up a garbage can), same as he does with dedup table, while you seem to attach different meanings to these words, which not only leads to problems in communication but likely also to difficulties understanding the perspective of those who see technical problems in implementing certain suggestions.

@Haravikk
Author

Haravikk commented Oct 4, 2023

it's not only to identify identical blocks but an integral part of pool-level free space accounting.

Then how does free space accounting work with deduplication disabled? Clearly it can, so this feature should only need to use the same method; the primary goal of this feature is to make it possible to enable limited dedup without the huge RAM requirements, nothing on disk should need to change except for pointers being created in place of duplicate copies.

Also: he uses trashing in the specialized domain

You say trashing but linked to thrashing? I'm well aware of what thrashing is, and clearly the intention of this feature is not to cause it.

@GregorKopka
Contributor

Should you want to know how ZFS space accounting works: feel free to read up on it, I don't feel like spoonfeeding.
Should you want to implement the feature you suggested: feel free to do that, you'll either figure out what I meant - or not.

As I seem to be unable to communicate with you in a way that transports information undamaged, and I don't feel like figuring out whether that is a problem of the sender or the receiver... Have a good day.

@KungFuJesus

I don't see what this would buy you that explicit reflink copies wouldn't, at least in the use case you're calling out. Sure, maybe if the data mutated after being copied (the same argument as clone versus dedup), but that seems pretty unlikely in your use case.

@amotin
Member

amotin commented Oct 5, 2023

Thinking more about this topic, I see a theoretical potential now, whether it is what was meant originally or not. ;) If we had sufficient information in memory (checksums in searchable structures, plus the parameters required to reconstruct the original block pointer), we could dynamically add records to the BRT table used for block cloning. Unlike the original dedup table, the BRT does not require modification of already existing pointers, and it does not need entries to be added before an actual checksum match is found, so it does not need updates when data is merely added to or evicted from the checksum table. But when we do detect a checksum match, we could just create a new BRT record for it.
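
Roughly, the write-path control flow would be something like the sketch below. All functions in it are placeholder stubs invented for illustration; in particular it does not show the real BRT interfaces:

```c
/* Control-flow sketch only; every helper here is a placeholder, not a real OpenZFS call. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct blk_loc { uint64_t vdev, offset, birth_txg; } blk_loc_t;

/* Placeholder stubs standing in for the in-memory checksum table and the BRT. */
static bool cksum_table_lookup(const uint64_t cksum[4], blk_loc_t *out) { (void) cksum; (void) out; return (false); }
static void cksum_table_remember(const uint64_t cksum[4], const blk_loc_t *loc) { (void) cksum; (void) loc; }
static void brt_reference_existing_block(const blk_loc_t *b) { printf("cloned block at offset %llu\n", (unsigned long long) b->offset); }
static blk_loc_t write_block_normally(const void *data, uint64_t len) { (void) data; (void) len; return ((blk_loc_t){ 0, 0, 0 }); }

void
write_with_lightweight_dedup(const void *data, uint64_t len, const uint64_t cksum[4])
{
	blk_loc_t existing;

	if (cksum_table_lookup(cksum, &existing)) {
		/* Match found only now: add a BRT-style reference to the block already
		 * on disk, as block cloning does; no dedup flag ever had to be set on
		 * the existing pointer, and nothing was written to a DDT in advance. */
		brt_reference_existing_block(&existing);
	} else {
		/* Miss: an ordinary write, then remember the checksum so a later copy
		 * of the same data can be caught while the entry is still in memory. */
		blk_loc_t loc = write_block_normally(data, len);
		cksum_table_remember(cksum, &loc);
	}
}
```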

I am not sure though how reasonable it is to link the hash table with ARC, unless one already has a significant portion of the required data, since that would limit the amount of dedupable data to the size of ARC, which is usually a very tiny fraction of all data on the pool. It would likely mean very low dedup rates. We could possibly keep checksum data in memory longer, as much as memory allows, after their blocks are evicted from ARC. Though it would take the system some time to accumulate a decent amount after a reboot.

Another worry I have is that the BRT used by block cloning is optimized for sequential operations, normally reducing the required BRT updates to very localized areas. The moment we start using it for hash-based deduplication, the amount of random updates may explode, requiring more BRT data to be kept in memory.

@Haravikk
Author

Haravikk commented Oct 5, 2023

@KungFuJesus:

I don't see what this would buy you that explicit reflink copies wouldn't

It wouldn't require you to explicitly reflink copies, for one. 😝

The vast majority of software and users aren't aware of reflinks, and they'll just copy the old-fashioned way unless the file system takes the decision away from them (which isn't always easy to do, I believe); the idea is to be able to capture that copying and deduplicate it automatically, but without the cost of an entire dataset deduplication table.

There will certainly be cases where whole disk deduplication will be superior (at added cost), and there will be cases where explicit reflinking will be better (e.g- on a file server where a user uploads a file identical to one that already exists), but that's part of the trade-off (to gain some deduplication without the high RAM cost, need for dedicated storage devices etc.).

With both features there'll still be room for proposals like deferred deduplication (i.e- a deduplicating scrub) to pick up anything that was missed.

@amotin:

I am not sure though how reasonable it is to link the hash table with ARC, unless one already has a significant portion of the required data, since that would limit the amount of dedupable data to the size of ARC

The idea behind tying it to ARC is to limit the size of any data structure used for looking up checksums; currently full deduplication requires a lot of RAM to really be usable, either that or a dedicated high speed dedup vdev.

But if the bulk of your deduplication comes from copying of files then the full deduplication table is overkill; all you need is to be able to recognise recently accessed records/blocks when they're either directly copied, or written out with partial changes (most/some records/blocks are the same) soon after being read.

Basically the idea is that the ARC should be a good enough view of what has been read recently to serve this purpose; I proposed a few alternatives but IMO tying to ARC is the most viable.

Currently deduplication is a feature that most setups shouldn't use, because the added requirements usually outweigh the benefits; the idea here is to add a more limited form that will be beneficial to more setups. If block cloning can be used to implement this more easily then I'm all for that.

@KungFuJesus

KungFuJesus commented Oct 5, 2023

What you are describing is basically following the block cloning PR exactly, though. It simply requires supplying an argument to the cp command to make use of the reflink-enabled syscalls. I think some Linux distributions are in fact turning the reflink option on by default in coreutils, and for FreeBSD it would be just as easy to do so for the BSD version of cp. BRT is trying to make a lot of the same compromises you are proposing, but as far as I understand it leverages known shared ranges fed into the copy_file_range syscall instead of trying to track cryptographically strong hash tables.
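
For reference, this is the interface in question; a minimal copy_file_range(2) example (Linux with glibc 2.27+ or FreeBSD 13+), which is what a reflink-capable cp ends up calling and what ZFS block cloning can service when both files live on the same pool:

```c
/* Minimal copy_file_range(2) example; build with: cc -o cfr cfr.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return (1);
	}

	int in = open(argv[1], O_RDONLY);
	if (in < 0) { perror("open src"); return (1); }

	struct stat st;
	if (fstat(in, &st) < 0) { perror("fstat"); return (1); }

	int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (out < 0) { perror("open dst"); return (1); }

	off_t remaining = st.st_size;
	while (remaining > 0) {
		/* NULL offsets: the kernel advances both file positions itself. */
		ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
		if (n < 0) { perror("copy_file_range"); return (1); }
		if (n == 0)
			break;
		remaining -= n;
	}

	close(in);
	close(out);
	return (0);
}
```

If the filesystem supports it, the kernel can satisfy this call by sharing blocks rather than copying any data.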

@Haravikk
Author

Haravikk commented Jan 14, 2024

What you are describing is basically following the block cloning PR exactly, though.

While they're certainly related features (both will result in ref-links as opposed to copies), they're not mutually exclusive; the block cloning PR requires all copying to go through specific file system calls. While that may well capture the bulk of copying (which will be fantastic), it's far from guaranteed, as plenty of software copies files as streams or via other methods that block cloning won't detect (it has no way to do so).

I see the two features as complementary, as deduplication can produce ref-links that block cloning will miss, and it still has the ability to perform non-copying deduplication as well (where it recognises identical blocks that aren't a result of copying). It just won't be as effective as running with a full deduplication table, but that's kind of the point (to get deduplication of commonly accessed data without the overhead of full deduplication).

Block-cloning is a feature that will hopefully become something you just turn on as standard (and that's assuming it isn't by default), since there's no reason not to unless you specifically want the extra copies (don't have any other redundancy?) but that seems pretty niche, and can be handled by cp --reflink=never just as easily.

But I'd still very much like the ability to run limited deduplication if I want to, as I can still see it benefiting cases where block cloning isn't catching enough copying, or you're handling a lot of incoming files that include duplicates. Long term a deduplicating scrub might be the better option, but that seems like it would require block pointer rewriting which is a blocker on a lot of similar features; unless/until we get that, features very much need to be all about setting records as they're created, rather than trying to change them later.

Deduplication could also allow for "cloning" between related datasets, such that copying or moving a file from one dataset to another could result in ref-links being created on the target dataset. Obviously for this the datasets would need to be compatible (same encryption root, and either comparable settings for compression and recordsize, or options to ignore these when cloning, i.e- prefer cloning to copying). I'm not sure how much of this existing deduplication does already, but I think it's a feature better positioned to handle it, as IIRC cp and related file system calls always consider different volumes to be unrelated, though I guess if ZFS can do some kind of check it might be able to handle it anyway somehow? Probably needs investigating in its own issue though.
