
enable KSM for ARC cache #14279

Open
devZer0 opened this issue Dec 12, 2022 · 2 comments
Labels
Type: Feature Feature request or new feature

Comments


devZer0 commented Dec 12, 2022

@Sachiru

For non-deduplicated datasets or filesystems, ARC cache retains full blocks
in memory even if they are duplicates of something else.

KSM (Kernel Samepage Merging, http://en.wikipedia.org/wiki/Kernel_SamePage_Merging_(KSM)) is supposed to optimize memory usage, especially for memory-heavy applications. Although it is true that blocks have variable sizes, they are still allocated as 4k pages in memory (IIRC), which can then be examined and deduplicated.

@behlendorf

This situation here will be considerably better in the 0.7.0 release.
ARC buffers are now compressed in memory and the ARC is better
about not keeping multiple copies of the same buffer.

I checked this, and it does not seem to apply to zfs-2.1.6.

I created 10 identical 100 MB files (test#.dat) with contents from /dev/urandom, dropped caches with "echo 3 > /proc/sys/vm/drop_caches", and read the files with "cat test*.dat > /dev/null".

Before reading, the ARC was <200 MB; after reading, it was at 1.2 GB.

That means the ARC is not able to detect that the files' contents are the same.

So, hereby I'm reopening #2772.

If the ARC has no internal deduplication, it should benefit from the kernel's standard memory deduplication feature.

RAM is a precious resource, and most systems have lots of unused CPU.
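The reproduction above can be sketched as a small script. The mountpoint in the usage comment is hypothetical, and reading arcstats or dropping caches requires root on a host with the OpenZFS module loaded:

```python
# Sketch of the reproduction above (hypothetical paths; reading arcstats
# and dropping caches require root on a host with OpenZFS loaded).
import os
import shutil

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

def arc_size():
    """Return the current ARC size in bytes from the OpenZFS kstats."""
    with open(ARCSTATS) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == "size":
                return int(fields[2])
    raise RuntimeError("no 'size' row in arcstats")

def make_identical_files(directory, count=10, size=100 * 1024 * 1024):
    """Write one file of `size` random bytes, then byte-identical copies."""
    first = os.path.join(directory, "test0.dat")
    with open("/dev/urandom", "rb") as rnd, open(first, "wb") as out:
        out.write(rnd.read(size))
    for i in range(1, count):
        shutil.copy(first, os.path.join(directory, f"test{i}.dat"))

def drop_caches():
    """Equivalent of 'echo 3 > /proc/sys/vm/drop_caches'."""
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def read_all(directory, count=10):
    """Stream every test file, pulling its blocks into the ARC."""
    for i in range(count):
        with open(os.path.join(directory, f"test{i}.dat"), "rb") as f:
            while f.read(1 << 20):
                pass

# Usage (on a ZFS dataset, e.g. mounted at /tank/test):
#   make_identical_files("/tank/test"); drop_caches()
#   before = arc_size(); read_all("/tank/test"); after = arc_size()
```

If the ARC deduplicated identical buffers, `after - before` would grow by roughly one file's worth of data rather than ten.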

@devZer0 devZer0 added the Type: Feature Feature request or new feature label Dec 12, 2022
@devZer0 devZer0 changed the title enable KSM for arc cache, i.e. make enable KSM for ARC cache Dec 12, 2022
ryao (Contributor) commented Dec 13, 2022

KSM is meant for anonymous pages of child processes that are not from files. As far as I know, the KSM code as designed cannot be applied to either the page cache or ARC, so it is not something that can be enabled.

The deduplication that @behlendorf mentioned was for cases where ZFS should know that the buffers are the same, such as when the files have the same places on disk in snapshots and you are looking at them from snapshots. It does not apply to identical files unless dedup=on was present when they were written.

Offhand, the way that I would expect KSM to work is that it periodically hashes the anonymous pages and stores those hashes in a data structure. Upon getting a hit, it will mark the page as CoW in both places, verify that the two are the same and then have one point to the other while increasing the page's reference counter. Implementing the idea in ARC would be non-trivial. I would expect getting it right to require significant effort spanning at least a year. :/
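As a rough illustration of the scan/merge cycle described above (a toy model only, not how the kernel implements KSM, which works on struct page, keeps red-black trees of candidate pages, and flips PTEs to copy-on-write):

```python
# Toy model of a KSM-style merge pass: hash pages, and on a hash hit
# verify the contents byte-for-byte before sharing one refcounted copy.
import hashlib

class Page:
    def __init__(self, data: bytes):
        self.data = data
        self.refcount = 1

def merge_pass(pages):
    """Collapse byte-identical Page objects into one shared, refcounted Page.

    Returns a list the same length as `pages`, where identical pages
    all point at a single canonical object.
    """
    by_hash = {}  # digest -> canonical Page
    out = []
    for page in pages:
        digest = hashlib.sha256(page.data).digest()
        canon = by_hash.get(digest)
        # Hashes can collide, so compare the actual bytes before merging,
        # just as KSM does a memcmp before sharing a page.
        if canon is not None and canon.data == page.data:
            canon.refcount += 1  # "share" by bumping the reference counter
            out.append(canon)
        else:
            by_hash[digest] = page
            out.append(page)
    return out
```

For example, ten identical 4 KiB pages collapse into one object with a refcount of 10; the hard part ryao points to is doing this safely against live, mutable ARC buffers rather than a static list.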


Haravikk commented May 9, 2023

I posted an issue suggesting "lightweight" deduplication and I wonder if it would cover this case?

Basically in issue #13572 my proposed/preferred solution is to allow deduplication to be enabled only for the contents of the ARC (and L2ARC), rather than for the entire contents of a dataset, to massively reduce the RAM impact of dedup.

The intention of that issue is to enable dedup for file copying, since this usually involves reading records into ARC (and thus the "lightweight" dedup table) shortly before writing out the new copies. Since they'd be in the dedup table, they would instead be written out as cloned blocks (reflinks) rather than full copies.

If that issue were implemented first, then the same basic mechanism could be used to dedup the contents of ARC, because there would be a dedup table there to be used. Basically if two identical records are loaded, they'd generate the same hash for the dedup table, and so one can be eliminated in ARC (or even retroactively eliminated on disk).
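A sketch of that idea, assuming records are deduplicated at insertion time by a hash-keyed table (all names here are hypothetical, not OpenZFS internals):

```python
# Toy hash-keyed cache in the spirit of the proposal above: identical
# records inserted into the "ARC" hash to the same entry, so only one
# buffer is kept and lookups for either file return the shared copy.
import hashlib

class DedupCache:
    def __init__(self):
        self.by_key = {}   # (dataset, object, offset) -> record digest
        self.by_hash = {}  # record digest -> (buffer, refcount)

    def insert(self, key, data: bytes):
        digest = hashlib.sha256(data).digest()
        if digest in self.by_hash:
            buf, refs = self.by_hash[digest]
            self.by_hash[digest] = (buf, refs + 1)  # share the buffer
        else:
            self.by_hash[digest] = (data, 1)
        self.by_key[key] = digest

    def lookup(self, key):
        digest = self.by_key.get(key)
        return None if digest is None else self.by_hash[digest][0]

    def unique_bytes(self):
        """Memory actually held, counting shared records only once."""
        return sum(len(buf) for buf, _ in self.by_hash.values())
```

In the 10-identical-files test from this issue, such a table would hold one record's worth of data per unique block instead of ten, and the same digests could later back reflink-style writes.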
