
Lockfiles bloat the Nixpkgs tarball #327064

Open
Atemu opened this issue Jul 14, 2024 · 23 comments
Labels
6.topic: architecture Relating to code and API architecture of Nixpkgs 6.topic: hygiene 9.needs: maintainer feedback significant Novel ideas, large API changes, notable refactorings, issues with RFC potential, etc.

Comments

@Atemu
Member

Atemu commented Jul 14, 2024

Introduction

The size of the Nixpkgs tarball places a burden on the internet connection and storage of every user. We should therefore strive to keep it small. Over the years that I've been contributing, it has more than doubled in size.

In #327063 I discovered the quite negative effect of Cargo.lock files in the Nixpkgs tree, with just ~300 packages bloating the compressed Nixpkgs tarball by ~6MiB.

Here I'd like to document the status quo of sizes of lockfiles found in Nixpkgs and other automatically generated files of significant size.

Methodology

  • ncdu --apparent-size on the nixos-24.05 tree (a046c12)
  • Manual look through the tree
  • Looked at everything where the directory is larger than your average Cargo.lock file (a few dozen KiB)
  • Only considered files that were obviously auto-generated
    • e.g. not kodi add-ons; they're all separate drvs updated separately
  • Compressed sizes were measured using gzip -9 < file | wc -c or tar -cf - files... | gzip -9 | wc -c
  • Lockfiles were either manually measured or using these commands:
    Amounts:
    $ for file in Cargo.lock composer.lock package-lock.json yarn.nix yarn.lock gemset.nix Gemfile.lock ; do echo -n "$file " ; fd -t f "^$file\$" | wc -l ; done
    
    Sizes:
    $ for file in Cargo.lock composer.lock package-lock.json yarn.nix yarn.lock gemset.nix Gemfile.lock ; do echo -n "$file " ; fd -t f "^$file\$" -x sh -c 'gzip -9 < {} | wc -c' | jq -s 'add' ; done
    

Results

Numbers for the lockfiles and patches are (total bytes) or (total bytes / number of files = average per file)

  • Lockfiles
    • Cargo.lock (5986458 / 316 = 18944.5)
    • composer.lock (185411 / 14 = 13243.6)
    • package-lock.json (923349 / 17 = 54314.6)
    • info.json (51904 / 2 = 25952) (electron)
    • yarn.nix (41356 / 1 = 41356)
    • yarn.lock (661464 / 5 = 132293)
    • gemset.nix (262092 / 141 = 1858.81)
    • Gemfile.lock (86498 / 138 = 626.797)
    • bazel_7 locks (105719 / 3 = 35239.7)
    • nuget deps.nix (489003 / 67 = 7298.55)
  • patches (2807106 / 3929 = 714.458)
    • Particularly large patches:
      • glibc patch
      • terraform-docs
  • hackage-packages (2435846)
  • node2nix
    • elm/packages (180222)
    • node-packages (843310)
    • netlify-cli (111438)
  • cran-packages (1473653)
  • lisp-modules (319315)
  • android-env (255090)
  • cuda-modules (109929)
  • tree-sitter/grammars/ (20964)
  • elisp-packages (1183087)
  • jetbrains/{brokenplugins,idea_maven_artefacts}.json (273277)
  • vim/plugins (174661)
  • vscode extensions (36921)
  • firefox-bin (33067)
  • libreoffice (364120)
  • kde (13013)
  • gnome/extensions.json (613383)
  • perl-packages.nix (2324210)

Notable non-generated files

For comparison and out of interest I also recorded the compressed sizes of notable files that were made by hand:

  • The almighty all-packages.nix (251060)
  • python-packages.nix (76738)
  • aliases.nix (24372)
  • haskell-modules/configuration-hackage2nix/broken.yaml (75753)
  • haskell-modules/configuration-common.nix (40859)
  • maintainer-list.nix (143647)
  • doc (317070)
  • lib (233078)
  • nixos (3595412)

Analysis

Lockfiles contribute greatly to the compressed Nixpkgs tarball size. In total, you can attribute 8793206 bytes ~= 8.4MiB out of the ~41MiB to lockfiles used in individual packages (~20%). The biggest offenders by far are Rust packages' Cargo.lock files, which are analysed in more detail in #327063.
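As a sanity check, the per-format totals from the Results section can be summed up (the figures land within rounding of the ~8.4MiB quoted above):

```python
# Per-format compressed lockfile totals from the Results section, in bytes.
lockfile_bytes = {
    "Cargo.lock": 5986458,
    "composer.lock": 185411,
    "package-lock.json": 923349,
    "info.json (electron)": 51904,
    "yarn.nix": 41356,
    "yarn.lock": 661464,
    "gemset.nix": 262092,
    "Gemfile.lock": 86498,
    "bazel_7 locks": 105719,
    "nuget deps.nix": 489003,
}

total = sum(lockfile_bytes.values())
print(total, f"{total / 2**20:.1f} MiB")  # ~8.4 MiB
```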

The worst offenders in terms of bytes per package are packages which lock their yarn dependencies, at ~130KiB/package. Fortunately these are rare, but they still add up to ~600KiB.
The next worst appears to be bazel_7, which single-handedly requires ~100KiB of compressed data.
Other notably bloated packages are those which have a package-lock.json, at ~50KiB/package, and electron's two info.json files, which combine to ~50KiB.

Patches also present a significant burden for compressed tarball size. Individually they're usually quite small, but they're very common, adding up to ~2.7MiB.

All automatically generated files discovered here (package lockfiles + set lock files) sum up to 19558712 Bytes ~= 18.6 MiB (compressed) which is about half the size of the Nixpkgs tarball.

Discussion

  • Should huge lockfiles continue to be allowed in Nixpkgs?
    • Sometimes they might be the only option?
  • Should we impose a Byte limit per package?
    • Some packages are clearly out of hand, requiring >100KiB each
    • If every package did that, the nixpkgs tarball would approach 10GiB in compressed size
    • Even if you think hundreds of KiB is fine, would it be okay for a single package to use multiple MiB? Multiple dozen MiB?

Solutions

There are a few measures that could be taken to reduce file size of generated files:

Summarise hashes (i.e. vendorHash)

Rather than hashing a bunch of objects individually, hash a reproducible record of all objects. This is already the status quo for e.g. buildGoModule.
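A hypothetical sketch of what this looks like (names, versions and hashes below are illustrative placeholders, not a real package):

```nix
# One vendorHash covers the entire vendored dependency tree,
# instead of one hash per dependency as in a lockfile.
buildGoModule rec {
  pname = "example";
  version = "1.0.0";

  src = fetchFromGitHub {
    owner = "example";
    repo = "example";
    rev = "v${version}";
    hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
  };

  # A single fixed-output hash for all of `go mod vendor`'s output.
  vendorHash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}
```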

Record less info

Some info is not strictly necessary for the lock files to function. For each elisp package, for instance, at least two commit IDs and two hashes are recorded. The commit IDs could probably be dropped entirely here, which would reduce the compressed file size by a third.

Fetch files rather than vendoring them

Oftentimes, files required for some derivation are available from an online source. Fetching the file rather than vendoring it into the Nixpkgs tree reduces the space required to a few dozen bytes (~32 bytes for the hash and a similar amount for the URL).
This is especially relevant for patches, as those are frequently available elsewhere. Use pkgs.fetchpatch2 in such cases.
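For example, instead of committing a patch file into the tree, a sketch like this fetches it by hash (the URL and hash here are made-up placeholders):

```nix
patches = [
  # Fetched at build time; only the URL and hash live in the Nixpkgs tree.
  (fetchpatch2 {
    url = "https://github.com/upstream/project/commit/0123456789abcdef.patch";
    hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
  })
];
```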

Lock an entire package set

Lockfiles usually represent a set of desired transitive dependency versions that some language-specific external SAT solver spat out. These are frequently duplicated because many separate packages use the same libraries but are often not exact duplicates due to differences in upstream-defined dependency constraints.

Instead, it is possible to record one large snapshot of the latest desirable versions of all packages in existence in some ecosystem and have dependent packages use the "one true version" instead of their externally locked versions.

It also provides efficiency gains as dependencies are only built once and brings us closer to what the purpose of a software distribution has traditionally been: Integrate one set of packages.

This approach is used quite successfully by e.g. haskellPackages, measuring in at just 133 bytes per package.

This is not feasible for all ecosystems, however: just the names of all 3330720 npm packages (no hashes) take up ~20MiB compressed, and the hashes would be at least another 100MiB. Though perhaps a subset approach could be used: only accepting packages into the auto-generated set that are depended upon at least once in Nixpkgs.
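The subset idea could look roughly like this (a hypothetical sketch; `all_packages` and `nixpkgs_deps` stand in for real data sources):

```python
def subset_lock(all_packages, nixpkgs_deps):
    """Keep only ecosystem packages that at least one Nixpkgs package depends on.

    all_packages: mapping of package name -> metadata (version, hash, ...)
    nixpkgs_deps: set of package names referenced anywhere in Nixpkgs
    """
    return {name: meta for name, meta in all_packages.items() if name in nixpkgs_deps}

# Toy data: a huge upstream registry, of which Nixpkgs only uses a fraction.
registry = {
    "left-pad": {"version": "1.3.0"},
    "lodash": {"version": "4.17.21"},
    "unused-pkg": {"version": "0.0.1"},
}
used = {"left-pad", "lodash"}

print(sorted(subset_lock(registry, used)))  # ['left-pad', 'lodash']
```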

Future work

  • Calculate and analyse bytes / package for package sets
  • Some lockfile formats were perhaps not recognised as such or aren't actually lockfiles

Amendments

Another solution: External lockfile repo

This is another solution I came up with after publishing and being exposed to some of the reasons why lockfiles are vendored. It often happens because upstream provides no lockfile itself, but one is necessary for the software to build reproducibly, which in our case often means to build at all.

A lockfile must:

  • match the revision of the package
  • remain available unchanged for at least as long as the matching package version (or in perpetuity)
  • be generated by a trusted source (Bad actors could easily use tampered lockfiles to facilitate supply-chain attacks)

Vendoring lockfiles into the Nixpkgs tree achieves all of these but it's not the only way to achieve that.

For such cases, it would alternatively be possible to store these 3rd-party generated lockfiles in a separate repository and merely fetch them from Nixpkgs. You'd fetch them individually, not as a whole, so the issue of size only affects build time closures which would have been affected either way. (The current issue of lockfiles is that they bloat Nixpkgs regardless of whether they are useful to the user or not.)
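A hypothetical sketch of what consuming such a repository could look like (the repository URL, package, and hash are made up for illustration):

```nix
# Instead of a vendored Cargo.lock in-tree, fetch it from a trusted
# external lockfile repository; only the URL and hash live in Nixpkgs.
cargoLock = fetchurl {
  url = "https://lockfiles.example.org/ripgrep/14.1.0/Cargo.lock";
  hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
};
```

Since the lockfile arrives as a fixed-output fetch, it is only downloaded when the package is actually built, not during evaluation.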

This solution would work in cases where lockfiles are only required as derivation inputs (not eval inputs) which I believe to cover most usages of vendored lockfiles in Nixpkgs.

This could even become a cross-distro effort as we surely are not the only distro which requires pre-made lockfiles in its packaging.

@Atemu Atemu added 6.topic: hygiene 9.needs: maintainer feedback significant Novel ideas, large API changes, notable refactorings, issues with RFC potential, etc. labels Jul 14, 2024
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/cargo-lock-considered-harmful/49047/2

@Atemu
Member Author

Atemu commented Jul 14, 2024

I have amended the OP with another possible solution.

@Aleksanaa Aleksanaa added the 6.topic: architecture Relating to code and API architecture of Nixpkgs label Jul 14, 2024
@Frontear
Member

I think for externalizing lockfiles, it'd be a good idea to actually determine which other distros do similar things to nixpkgs (vendor their own lockfiles for reproducibility) before committing to such an idea.

The main reason I say this is that I love the idea, but I think it could easily become a management nightmare when things are externalized this way, and it would really only be worth the effort if and only if it's actually maintained by a larger team outside of Nix, such as any of the aforementioned distros.

@chayleaf
Contributor

chayleaf commented Jul 17, 2024

Lockfiles are sadly a necessity whenever dependencies aren't pinned (and even then parsing lockfiles can be better than a FOD alternative).

IPFS for the external lockfile repo seems like it'd be a good fit? Just pin the files after merging the PRs. Of course, hosting them normally is an option as well, but all potential nixpkgs contributors will need upload access for WIP PRs.

The problem with external lockfile repo is that we'd have to completely ditch lockfile parsing (as it would require IFD) and switch to FODs, which may force us to rewrite some Nix code (currently, Gradle support does that, so it would be affected) and maintain more hashes. It still seems like the better option out of the two though.

@Atemu
Member Author

Atemu commented Jul 17, 2024

IPFS for the external lockfile repo seems like it'd be a good fit?

Interesting thought but the problem with IPFS remains that we need someone to pin the files or they will inevitably be lost.

Of course, hosting them normally is an option as well, but all potential nixpkgs contributors will need upload access for WIP PRs.

Anyone can create a PR. Ideally though, we wouldn't even let users upload lockfiles and rather have them be generated by some trusted infrastructure with users merely providing upstream versions they need to have a lockfile for. Remember, lockfiles are security-critical.
As for who should have access: while the process has been lost in the current turmoil, we can simply use the same set of "trusted users" that we have for Nixpkgs merge access, and that'd be fine once we have recovered as a community. We'd only need to do basic QA on code correctness, check whether something actually needs to be added, and prevent spam; the code users would provide in that repo should be very basic and simple.

The problem with external lockfile repo is that we'd have to completely ditch lockfile parsing

Given the performance issues of Cargo.lock parsing, my first impression of that would be that it's a good thing.

and switch to FODs, which may force us to rewrite some Nix code (currently, Gradle support does that, so it would be affected) and maintain more hashes.

Note that the need for this to happen exists on the time scale of months~years, not days~weeks.

Also, not all lockfiles must necessarily go, but there must be some sort of limit on how much of our "data budget" we use on them.

@ehmry
Contributor

ehmry commented Jul 20, 2024

I migrated Nim to lockfiles and it has fixed a lot of problems, but the lockfiles are only getting bigger. I'm in favor of deduplicating the contents of the lockfiles in a centralized place, but I think it would take special tooling that would be somewhat consistent across languages.

If we can make it clear that lockfiles and "supply-chain" security are one and the same, then maybe we can get funding for a solution, but now I see that the NGI budget is getting cut.

@Atemu

This comment was marked as off-topic.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/every-new-release-of-the-nixos-unstable-channel-leads-to-a-download-of-around-42mb/49747/2

@MagicRB
Contributor

MagicRB commented Jul 31, 2024

Just throwing an idea out here: what if we allowed "import from builtin", which would let us store lockfiles in a different repo, fetch them lazily, and still use them at eval time? It would still slow down eval, but not nearly as much as arbitrary IFD.

@adisbladis
Member

Summarise hashes (i.e. vendorHash)

I'd like to point out that this is a space-time trade-off.
Large FODs are more efficient from the nixpkgs point of view, but bloat binary caches.

This is also a negative for security. We have no insight into what a single hash represents in terms of dependency graph.

@Atemu
Member Author

Atemu commented Aug 1, 2024

I'd like to point out that this is a space-time trade-off.
Large FODs are more efficient from the nixpkgs point of view, but bloats binary caches.

That's a good point. I'd say that makes it a space-space trade-off though: Space in the tarball vs. space in the binary cache.

I consider space in the tarball to be a lot more precious as it affects each and every user because of the tarball's status as the source of all truth. The tarball size is also only one order of magnitude greater than the size of all lockfiles, making lockfiles a significant contributor to bloat.
Meanwhile the binary cache size only affects one entity, will always be gigantic, and is 5-6 orders of magnitude greater than all rust packages' vendor tarballs combined. Additionally, it could conceivably be deduplicated in the future in which case I'd expect the size of all vendor tarballs to deduplicate down to what they would require if represented by small FODs.

This is also a negative for security. We have no insight into what a single hash represents in terms of dependency graph.

You don't have such insights at eval time but, while convenient, that's not a necessity. You could just take a look at the dependency declaration file as well as the vendor tarball to figure out the "full" dependency graph.
Given that there will be at least some usages of "big FODs", tooling would have to be able to deal with that anyhow.

@adisbladis
Member

Meanwhile the binary cache size only affects one entity

This is not true. Binary cache size growth is a problem that cost some users dearly.
There are plenty of places in the world (I've lived in some) where unmetered internet connections are impossible to get and bandwidth is expensive.

You don't have such insights at eval time but, while convenient, that not a necessity.

It is a necessity to statically reason about the dependency graph. Sure, you can write tooling that inspects derivation outputs, but that's another level of tooling complexity, and it makes it very expensive to scan a package tree.

Additionally I've never seen a convincing overrides story for any FOD packager.

I feel like we are sacrificing way too much about what makes Nix good with these hacks.

@Atemu
Member Author

Atemu commented Aug 2, 2024

This is not true. Binary cache size growth is a problem that cost some users dearly.

Sure, but as I mentioned previously, "big" vendor FODs simply aren't a major contributor here. It's not uncommon for output paths to be a few orders of magnitude larger than "big" vendor FODs, and those change on every rebuild (x4 for all our platforms), while FODs only change on updates and are usually the same on any platform.

As also mentioned, optimisations for the binary cache that IMV are unavoidable going forward such as deduplication will reduce the difference between "big" FODs and lots of tiny FODs to almost nothing.

It's not a significant contributor to unsustainable growth currently and will likely even be less significant going forward; at the worst slightly less efficient than the alternative. I don't see a significant point to be had w.r.t. binary cache size.

There are plenty of places in the world (I've lived in some) where unmetered internet connections are impossible to get and bandwidth is expensive.

The "cost" of big FODs only hits you when you're building stuff yourself and in that case you'd have to compare the 15-30MiB to the rest of the inputDerivation which, for a typical rust package such as fd, is >1.6GiB. Using a "big" FOD or not would be as significant as a rounding error here.

It is a necessity to statically reason about the dependency graph.

We all use Nix for this reason; I feel you. I'd much prefer if we had a reasonably manageable package set à la haskellPackages instead of a separate subset package set for each drv, which is what the current lockfiles represent.

That'd allow for static reasoning as well as sustainable tarball size & eval time growth, but that's not the reality we live in: we have to choose one.

Given that the use-cases for reasoning about the entire source dependency graph (remember: this is source code, not build artifacts) are rather fringe and could be covered less elegantly through other methods, I see the trade-off in favour of abstaining from lockfiles.

Additionally I've never seen a convincing overrides story for any FOD packager.

At a theoretical level, I don't see how it'd be any different to a lockfile packager. You'd pass a new/updated/different lockfile in either case but you'd have to update the vendor hash with a FOD packager which is a slight overhead and a little inefficiency but not unreasonably so.

I feel like we are sacrificing way too much about what makes Nix good with these hacks.

I feel that both hacks sacrifice what makes Nix and Nixpkgs good; neither is ideal.

The best solution is and always will be to do our job as a distro and define one package set for all dependent packages to use, making any lockfile irrelevant. That's really hard work, of course.

@emilazy
Member

emilazy commented Aug 10, 2024

Linking #333702 here, which is Rust‐specific but which I hope can point to a better approach for language ecosystems in general.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/state-of-haskell-nix-ecosystem-2024/53740/8

@DD5HT

DD5HT commented Nov 13, 2024

I was wondering if switching the compression algorithm may be worth it.

I did some measurements myself on commit fa06fc60, and it looks like zstd can already provide ~10% smaller files. Maybe it's worth looking into that.

zstd -19

Cargo.lock 6413105
composer.lock 137284
package-lock.json 1056106
yarn.nix 38849
yarn.lock 463078
gemset.nix 260515
Gemfile.lock 85488

gzip -9


Cargo.lock 7482267
composer.lock 158142
package-lock.json 1251946
yarn.nix 41731
yarn.lock 498772
gemset.nix 272307
Gemfile.lock 88855

@Atemu
Member Author

Atemu commented Nov 13, 2024

We do not control GitHub's tarball compression.

The only other place where the size of lockfiles matters is git which also only supports gzip compression.

zstd or other means of compression are not relevant to this discussion.

@MagicRB
Contributor

MagicRB commented Nov 13, 2024

zstd or other means of compression are not relevant to this discussion.

Well, if nix could decompress and cache files, say builtins.fromJSON (builtins.decompress ./lock.json.zstd)

@MattSturgeon
Contributor

zstd or other means of compression are not relevant to this discussion.

Well, if nix could decompress and cache files, say builtins.fromJSON (builtins.decompress ./lock.json.zstd)

We'd then have the issue that we'd be committing binary files instead of text files.

Git is at its best when working with text, especially when resolving merge conflicts.

Although perhaps some specific lockfiles are already bad at avoiding merge collisions, so in those very specific edge cases we wouldn't be losing much by committing binary data...

@pbsds
Member

pbsds commented Nov 13, 2024

Decompressing lock files at eval time could wreak havoc on eval times.

@MagicRB
Contributor

MagicRB commented Nov 13, 2024

You could run a fast enough hash on it first and then use a cached copy if available.
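Outside of Nix, a rough sketch of that idea could look like this (the cache location and file layout are made up for illustration):

```python
import gzip
import hashlib
import json
import os
import tempfile

# Hypothetical cache location; a real implementation would use the Nix store.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "lockfile-cache")

def cached_parse(path):
    """Parse a gzip-compressed JSON lockfile, caching the parsed form.

    The cache key is a content hash of the compressed bytes, so an
    unchanged lockfile never has to be decompressed and parsed twice.
    """
    with open(path, "rb") as f:
        compressed = f.read()
    key = hashlib.sha256(compressed).hexdigest()
    cache_file = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)  # cache hit: skip decompression entirely
    parsed = json.loads(gzip.decompress(compressed))
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_file, "w") as f:
        json.dump(parsed, f)
    return parsed
```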

@pbsds
Member

pbsds commented Nov 13, 2024

We would then keep a copy of nixpkgs with its compressed artifacts in the store, alongside a bunch of decompressed lock files cached to the store with no gcroot, just to save a few kilobytes over the wire. In cppnix, the decompression could also require pausing eval like IFD currently does.

It makes more sense to me to switch the releases.nixos.org tarballs to zstd and discourage using GitHub tarballs.

@Pandapip1
Contributor

We'd then have the issue that we'd be committing binary files instead of text files. Git is at its best when working with text, especially when resolving merge conflicts.

Generally, Cargo.lock files are updated in tandem with the corresponding package's version, so this should generally not be an issue. However, the other drawbacks mentioned mean that this is nonetheless still a bad idea.
