Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rsync hard-links to save space #102

Merged
merged 3 commits into from
Jul 29, 2022
Merged

Rsync hard-links to save space #102

merged 3 commits into from
Jul 29, 2022

Conversation

art-w
Copy link
Contributor

@art-w art-w commented Apr 27, 2022

Please note that I have NO idea what I'm doing: I'm working under the assumption that files are never modified in place in the store but always copied elsewhere first. If that ain't the case, well, please ignore and close this PR harshly!

I was hoping to save a bit of disk space in the obuilder store when using rsync:

/obuilder/store/result/ $ du -sh
2.1G	3438610c272ea59ea6c9b0cb93557cc430009e63f3d89d85416af6889fb94150
2.1G	ce0813553200cf888b0efa0cb2b2421dba8aaf4ba2c898d1a7dffc4483700ebd

Here ce0813 was created from 343861 by running sudo ln -f /usr/bin/opam-2.0 /usr/bin/opam. As a full build involves a dozen steps, the copy-everything is eating my disk alive... But by asking nicely, rsync could observe that files from ce0813 are identical to those in 343861 and create hard links to the originals rather than a real copy. This is obviously wrong if either can be updated in place later!

Regarding the rsync arguments:

  • --link-dest= is the path in which the original files will be discovered (and hard-linked to). When this argument is a relative path, it is interpreted as relative to the dst directory (which would be plain wrong here!)... Hopefully the paths are always absolutes, hence the cmdliner quick fix to enforce it. I tried relative paths for the rsync store to see if I was breaking existing functionality, but it was already crashing runc because it led to the wrong store path.
    => I'm not sure if the btrfs backend has the same limitation? (the doc seems to imply so)

  • --checksum may not be 100% required, but otherwise rsync might hard-link files even though they could be different as it only checks the filename, file size, modifications dates, ... and not the actual content.

Anyway, the result makes me sad. Hard-links are correctly created for files, but not directories: (because "stuff tends to break when your fs is not a tree")

/obuilder/store/result/ $ du -sh
2.1G	3438610c272ea59ea6c9b0cb93557cc430009e63f3d89d85416af6889fb94150
394M	ce0813553200cf888b0efa0cb2b2421dba8aaf4ba2c898d1a7dffc4483700ebd

"Everything is a file, but some files are more files than others."

@talex5 talex5 requested a review from patricoferris April 29, 2022 08:58
@talex5
Copy link
Contributor

talex5 commented Apr 29, 2022

Sounds like a good idea to me, but I'm not familiar with this backend.

@patricoferris
Copy link
Contributor

Thanks for this @art-w ! With @talex5's help I wrote the rsync backend (mainly for it's convenience and also for the macOS port of obuilder).

I'm working under the assumption that files are never modified in place in the store but always copied elsewhere first.

Yep, that is a correct assumption (or at least should be!). Anything in result/<hash> should be immutable. The rsync backend is based off of the btrfs backend and typically will:

  • Copy a cached build, say result/1234, to a new home in result-tmp, say result/5678.
  • The build, with runc, will then use the result-tmp during the build step.
  • Finally, if the build is successful, we do a rename of result-tmp to result, which if iiuc is where you have changed this to another copy to add the hardlinks and a delete of the result-tmp?

So, from a Linux perspective, I think this change is good. It does come at the added cost of doing a copy rather than a rename, but I think the potential disk space saving is worth it. The rsync backend isn't supposed to be fast. I'll follow up again soon after I rebase and try this with the experimental macOS port (see #87).

@art-w
Copy link
Contributor Author

art-w commented May 2, 2022

Thanks, your explanations does match my intuition! Yes the original mv result-tmp/xyz result/xyz becomes a two-step "cp" result-tmp/xyz result/xyz ; rm result-tmp/xyz and yes, the copy is much slower than a rename :/

(Out of curiosity, I'm going to run some tests without --checksum to see how expensive it is, but I don't think it's 100% safe to skip it in obuilder use-case.)

Copy link
Contributor

@patricoferris patricoferris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @art-w. I've tried this with quite a few example on linux and also on the macOS PR (see this branch which has that work rebased on this PR) and everything seems to be working just fine.

I did a small, not particularly scientific, test on the performance impact using this obuilder spec file:

((from ocaml/opam@sha256:dca2be63c27c3860560bd70001f94c39c32e8a22fdc270e5e77d297b665c871f)
 (run (shell "echo 1"))
 (run (shell "echo 2"))
 (run (shell "echo 3")))

I ran the spec file once with only the from section to get the image untarred into the build cache and then compared running the three echoes with the proposed changes and master. On the machine I was using, your PR took 4m20.703 and master took 1m36.558s.

There's a tradeoff here (as per usual) between space and time and I was thinking maybe it would be better to let the user decide whether or not to use hardlinks? I think an rsync store that has sometimes used hardlinks and other times hasn't would still be a working store although I'm not sure (I'm thinking here if you forgot to add the --use-hardlinks flag or whatever way we could expose that in the CLI). What do you think?

@art-w
Copy link
Contributor Author

art-w commented May 13, 2022

Thanks for testing it out on another platform! I like the idea of letting the user choose the tradeoff so I added the corresponding CLI flag: by default it keeps the original copy behavior, but it's safe to switch to hardlink (back and forth with copy, the store shouldn't care).

On your example, I get 7.9G in 3m25s for copy, 3.1G in 7m35s for hardlink ... and 3.1G in 4m20s for hardlink_unsafe (no checksum, which is the mode I would like to use when developing). Let me know if you would rather not have this third dangerous option!

@art-w art-w force-pushed the rsync-hard branch 2 times, most recently from 4a9edbe to b24bef1 Compare May 13, 2022 14:44
Copy link
Contributor

@patricoferris patricoferris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @art-w, I'm happy with these changes and it's a nice addition I think :))

@talex5 talex5 merged commit 79a7e02 into ocurrent:master Jul 29, 2022
tmcgilchrist added a commit to tmcgilchrist/opam-repository that referenced this pull request Nov 7, 2022
CHANGES:

- Add --fuse-path to allow selection of the path redirected by FUSE (@mtelvers ocurrent/obuilder#128, reviewed by @MisterDA )
- Pre-requisites for Windows support using docker for Windows (@MisterDA ocurrent/obuilder#116, reviewed by @tmcgilchrist)
- Additional tests and prerequistes for Windows support (@MisterDA ocurrent/obuilder#130, reviewed by @tmcgilchrist)
- Add support for Docker/Windows spec (@MisterDA ocurrent/obuilder#117, reviewed by @tmcgilchrist)
- Depend on Lwt.5.6.1 for bugfixes (@MisterDA ocurrent/obuilder#108, reviewed by @tmcgilchrist)

- Add macOS support (@patricoferris ocurrent/obuilder#87, reviewed by @tmcgilchrist @talex5 @kit-ty-kate)
- Enable macOS tests only on macOS (@MisterDA ocurrent/obuilder#126, reviewed by @tmcgilchrist)
- Dune 3.0 generates empty intf for executables (@MisterDA ocurrent/obuilder#111, reviewed by @talex5)
- Fix warnings and CI failure (@MisterDA ocurrent/obuilder#110, reviewed by @talex5)

- Expose store root and cmdliner term with non-required store (@MisterDA ocurrent/obuilder#119, reviewed by @tmcgilchrist)
- Expose Rsync_store module (@MisterDA ocurrent/obuilder#114, reviewed by @talex5)
- Rsync hard-links to save space (@art-w ocurrent/obuilder#102, reviewed by @patricoferris)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants