Investigate packaging Rust crates separately #333702

Open · emilazy opened this issue Aug 10, 2024 · 31 comments

emilazy commented Aug 10, 2024

Right now, every Rust package is an ecosystem unto itself, with dependency versions being selected from each package’s upstream Cargo.lock file (or a vendored one if none is present). This stands in contrast to how many language ecosystems in Nixpkgs work, and has caused us problems:

  1. Cargo.lock considered harmful #327063 – keeping Cargo.lock files in the repository bloats its size, but gives us more static insight into the dependencies used by packages and avoids hacky FOD duplication.

  2. Rust 1.80.0 breaks some packages #332957 – in an ideal world, we could bump time once to the fixed version, rather than playing whack‐a‐mole with all the broken packages.

  3. This hasn’t happened yet, but I’m dreading dealing with all the pinned ffmpeg-sys-next dependencies when I upgrade the default FFmpeg to 7. In general, it’s just pretty painful for us to patch Rust dependencies in a way that it isn’t for many other language ecosystems.

I don’t think it’s practical for us to manually maintain a Rust package set like the Python one. However, I think we could do better here. The idea is that we could essentially have one big Cargo.lock file that pins versions of all crates used across the tree, and abandon both cargoHash and vendored Cargo.lock files. The hope is that this would give us the static advantages of vendored Cargo.lock files, let us reduce the number of versions of the same packages that we ship, and do treewide bumps of package versions with much less pain, while (I hope) still taking up less space in the repository than the status quo.

It wouldn’t be feasible to maintain exactly one version of every package. Many popular packages have incompatible major versions, and we may not want to keep a package exclusively on an older version just for some outdated software that pins an older version than most software is compatible with. However, I suspect we could vastly reduce the proliferation of alternate crate versions across the tree.

A downside would be that we would no longer be using the “known good” versions of packages from our various upstreams. For some packages with incompletely‐declared dependency ranges, this could result in broken builds or functionality. In those cases, we would still have the option to vendor a package‐specific Cargo.lock file. Note that this is how it works in most Linux distributions, so although we might package more Rust software than the average Linux distribution, these challenges aren’t unique to us.

This would not necessarily have to literally be one huge Cargo.lock file; we just need something we can turn into Cargo.lock files or a Cargo source to replace crates.io, as suggested in #327063 (comment). As @alyssais pointed out in the issue I linked, we don’t need the dependencies array, and as I pointed out, one single file is conflict‐prone. I suspect we would want something like a JSON or TOML‐based format with one file per package (or package version). That should be comparable in size and organization to our other language package sets, and minimize conflicts.
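For concreteness, here is a sketch of what one such per‐package file might carry, written as a Nix attrset purely for illustration; the file path, field names, and format are all hypothetical, and the real format could equally be TOML or JSON as suggested above.

```nix
# Hypothetical per-crate pin file, e.g. rust-crates/time/0.3.nix.
{
  pname = "time";
  version = "0.3.36";
  # checksum of the crates.io tarball, as in a Cargo.lock [[package]] entry
  hash = "sha256-..."; # placeholder
  # direct dependencies, referring to other pin files by SemVer-major slot
  dependencies = [ "deranged_0_3" "num-conv_0_1" "powerfmt_0_2" ];
}
```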

There are unsolved problems here, e.g.:

  1. We probably don’t want every bump of any Rust library to rebuild every Rust application. We’d need to figure out some way to narrow down the rebuilds to what’s required by each package. The best option I can currently think of is adapting the cargoHash‐style FOD stuff to the task of selecting the subset of our One Big Lock File that is present in the upstream Cargo.lock/Cargo.toml somehow. For instance, it may be acceptable if every crate version bump rebuilds Rust derivations across the tree that check against the src’s Cargo.toml and either succeed because the applicable versions remained constant, or fail because the hash of those versions no longer matches. We just need to be able to short‐circuit the actual builds.

  2. We’d need automation to manage the One Big Lock File. In particular, we’d want to be able to tell when a dependency bump is compatible with the version bounds in various packages so that we can decide between bumping vs. adding a new available version, and keep a set of crates that is consistent with the things we package. This could probably be as simple as just automating and unconditionally accepting SemVer‐compatible bumps, dealing with any fallout by hand, and trying SemVer‐incompatible bumps when we feel brave.
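As a minimal sketch of the compatibility classification that automation in item 2 would need, assuming Cargo’s caret semantics (for 0.x versions the minor number is the breaking component); this is illustrative only and ignores edge cases like 0.0.x:

```nix
# Classify two versions as SemVer-compatible or not under caret semantics.
let
  lib = import <nixpkgs/lib>;
  semverMajor = v:
    let parts = lib.splitString "." v;
    in if lib.head parts != "0"
       then lib.head parts                 # "1.2.3"  -> "1"
       else "0.${lib.elemAt parts 1}";     # "0.3.36" -> "0.3"
in
{
  # bumping 0.3.17 -> 0.3.36 stays within the same slot: compatible
  compatible = semverMajor "0.3.36" == semverMajor "0.3.17";  # true
  # bumping 0.3.36 -> 0.4.0 changes slot: needs a new crate version
  sameSlot = semverMajor "0.4.0" == semverMajor "0.3.36";     # false
}
```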

Ultimately, though, I think that the current status quo is causing a lot of problems, and that if we can successfully pursue this proposal, we’ll hopefully make all the groups here happier: the people who maintain Rust packages, the people who worry about the repository size and evaluation performance of Nixpkgs, and the people who worry about losing Nix’s static guarantees.

cc @matklad who suggested the source approach

cc @alyssais who said that we used to do this but stopped

cc @Atemu who opened the issue about Cargo.lock

emilazy commented Aug 10, 2024

I also forgot to mention that we could maybe share more actual crate builds between packages if we did this, though I don’t know how much that’d actually help or how tricky it’d be to set up the machinery for it.

@alyssais

My idea would be to have Rust programs depend on packages containing source code for their dependencies, like "time_0_3". We'd have one package for each semver boundary, so "time_0_3", "time_0_4", "time_1", etc. cargoSetupHook would learn to assemble all Rust source inputs into a vendor directory. We'd have a script that, given a package for a Rust program, generated the required packages and output the list of dependencies to paste into the expression for the Rust program. The source packages could have an updateScript, so updating Rust packages could be taken care of via r-ryantm.

I think this would be relatively straightforward to implement, and it's attractive because it makes Rust dependencies feel as much like normal dependencies as possible, even though they would just be source packages under the hood (because Cargo forces this).
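As a rough sketch of what an expression for a Rust program might then look like (none of this exists yet; the rustCrates set, the attribute names, and the crateSourceInputs argument are all made up for illustration):

```nix
{ rustPlatform, rustCrates }:

rustPlatform.buildRustPackage {
  pname = "my-tool";
  version = "1.0.0";
  src = ./.;

  # direct dependencies as source packages, one per semver boundary;
  # cargoSetupHook would assemble these (plus their transitive
  # dependencies) into a vendor directory before the build
  crateSourceInputs = with rustCrates; [ time_0_3 serde_1 clap_4 ];
}
```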

emilazy commented Aug 10, 2024

If we could make a scheme like that work, it’d be fantastic. I don’t want to add more generated Nix code in‐tree, or hand‐written boilerplate, than is necessary, though. Ideally we could generate non‐code spec files that Nix code then turns into packages that work like that. For instance, we could have a generator that turns a hand‐written src.nix into a rust-package.toml that lists its direct Cargo.toml dependencies and other information we might need, and a package.nix that just does { mkRustPackage }: mkRustPackage ./src.nix ./rust-package.toml, where mkRustPackage resolves those dependencies to the actual packages.
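Concretely, something like this (mkRustPackage and the spec file format are hypothetical):

```nix
# rust-package.toml, generated from src.nix, might record:
#
#   [dependencies]
#   time = "0.3"
#   serde = "1"
#
# leaving package.nix as nothing but:
{ mkRustPackage }: mkRustPackage ./src.nix ./rust-package.toml
```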

Mic92 commented Aug 10, 2024

Rather than one big lock file, sharding crates into smaller files by some name prefix may be both easier to review (i.e. editors and the GitHub UI don't go mad because of the size) and more efficient to store in Git (I remember @alyssais talking about some inefficiencies with all-packages.nix).

emilazy commented Aug 10, 2024

Right; my proposal in the original post (which I realize is tl;dr) was actually one file per package, to avoid Git conflicts. It’s only conceptually one big Cargo.lock :)

@eclairevoyant

more efficient to store in Git (I remember @alyssais talking about some inefficiencies with all-packages.nix).

Specifically, git stores files, not diffs/changes, so anything that reduces the file size in a given commit will help massively. (Identical files are deduped.)

Atemu commented Aug 10, 2024

Git is a lot smarter than that; files that are only slightly different are also deduped. In fact quite efficiently so: I have yet to find a more efficient storage format than a bare git repository for a bunch of large text files that differ slightly in a bunch of places.

Having one big file is actually quite efficient. I have not measured it, but I'd expect the per-file overhead (object ID, file name, mode) and the lack of Huffman coding across the files' contents to make many small files quite a lot less efficient than One Big File.

emilazy commented Aug 10, 2024

I think you might both be right and it depends on whether the objects have been packed or not yet? But I don’t know that much about Git’s storage layer, so my knowledge might be terribly out of date.

One single file doesn’t seem good for conflict handling, anyway. I think one file per package would be fine if it’s deduplicated across the whole tree, because after all Nixpkgs already consists basically entirely of files for individual packages.

The more I think about (my version of) @alyssais’ proposal the more I am tempted to try and implement it. It would make things very normal. The only question is whether it can be automated enough to be comparably seamless to the status quo.

Atemu commented Aug 10, 2024

You can basically always expect git objects to be packed when size is of importance.

As for conflict handling: I don't see why it'd be a factor, as the file should basically never be edited by hand, always by automation. We don't worry about conflicts in e.g. the Hackage packages file either.

emilazy commented Aug 10, 2024

It’d be edited by automation in independent PRs pretty frequently as crates are added to satisfy new dependencies in packages. That results in a lot of opportunity for conflicts because of mismatching diff context.

Anyway, the main problem is that we need to avoid changes to the locked package set rebuilding all Rust crates, which is hard without CA derivations. Alyssa’s approach avoids that in a very simple way.

@MostAwesomeDude

(Note: git does not deduplicate within blob objects. A pack of objects may have cross-object deduplication performed by compression, but this isn't part of the object model itself. Source: I recently read through stuff like upstream docs on the object model while implementing tools which don't use the porcelain.)

reckenrode commented Aug 10, 2024

I like what the Ruby ecosystem in nixpkgs does, which is to have version-independent builders for gems. There are some crates I have had to fix over and over across different packages because they need some help building on Darwin (e.g., #328588, #328598, #328593). It would be better if fixes like that only needed to be done once.

emilazy commented Aug 10, 2024

Yes, I am hoping that we can attach native libraries and other build instructions to the relevant crates. I am hopeful that if we deduplicate packages by SemVer major by default that will be all the sharing we need; we should carry as few incompatible versions as possible, and those are likely to differ enough that sharing is less of a concern.
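A sketch of what attaching such fixes could look like, in the spirit of the defaultCrateOverrides mechanism that buildRustCrate already uses; where exactly this would hook into the new scheme is an open question.

```nix
# Per-crate build fixes, written once and applied wherever the crate
# is used (shape borrowed from defaultCrateOverrides; illustrative).
{ pkg-config, openssl }:
{
  openssl-sys = attrs: {
    nativeBuildInputs = [ pkg-config ];
    buildInputs = [ openssl ];
  };
}
```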

I think my attempt to nerd‐snipe someone into working on this has successfully boomeranged back onto me…

Atemu commented Aug 11, 2024

Anyway, the main problem is that we need to avoid changes to the locked package set rebuilding all Rust crates, which is hard without CA derivations. Alyssa’s approach avoids that in a very simple way.

I don't quite get why non-update operations on the package set would cause rebuilds. When you add or remove packages from the set, the existing entries should stay the same.

Updates would of course need to be done in a staged manner like all of our package sets already do.

git does not deduplicate within blob objects.

We're talking about large text files here, not large binary files (Binary Large OBject). Those get deduped just fine. I've deduped a dozen or two ~30MiB files into a few MiBs using a git repo before where tar.xz, borg, bup and zpaq would all produce something on the order of 200MiB.

You should generally never store BLOBs in git unless they're really tiny and perhaps not even then. We should probably enforce this in Nixpkgs btw. but that's for another topic.

@jvanbruegge

I am sceptical of this. Using the Python package set is extremely annoying because I end up overriding it to lock dependencies to the versions packages need. So the approach hinted at earlier, with one package per minor version, would be much appreciated.

emilazy commented Aug 11, 2024

I don't quite get why non-update operations on the package set would cause rebuilds. When you add or remove packages from the set, the existing entries should stay the same.

Updates would of course need to be done in a staged manner like all of our package sets already do.

If we have one Cargo.lock file that we pass down to all Rust derivations, then any change to that lock file at all will rebuild all Rust derivations. Clearly that’s no good, so we need to filter down what parts of the lock file affect derivation hashes. If we had content‐addressed derivations, we could simply have a derivation that looks at the src’s Cargo.toml and narrows down the Cargo.lock as applicable, to use as an input to the actual build.

Unfortunately we don’t have content‐addressed derivations, so we have to do the narrowing at Nix evaluation time without access to the src. That means encoding some information redundant to the Cargo.toml files in Nix code.

In other words, we need explicit dependency lists of some kind. The trivial solution would be to just list every entry that would be relevant to the Cargo.lock file in each derivation, but that would be verbose and have a lot of churn. Since we’re having to specify dependencies anyway, that’s why I’ve warmed to @alyssais’ solution: each Rust derivation specifies its direct dependencies with major versions, which are themselves derivations that specify their direct dependencies, etc., and we achieve the desired end result without a global lock file. The result looks a lot like a typical package set, except source‐based (because Cargo) and largely automatically generated.
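A sketch of that recursive structure (mkCrate is hypothetical): each derivation names only its direct dependencies, so bumping a crate outside this subtree changes none of these hashes.

```nix
{ mkCrate }:
rec {
  powerfmt_0_2 = mkCrate {
    pname = "powerfmt"; version = "0.2.0"; deps = [ ];
  };
  deranged_0_3 = mkCrate {
    pname = "deranged"; version = "0.3.11"; deps = [ powerfmt_0_2 ];
  };
  time_0_3 = mkCrate {
    pname = "time"; version = "0.3.36"; deps = [ deranged_0_3 ];
  };
}
```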

We're talking about large text files here, not large binary files (Binary Large OBject). Those get deduped just fine. I've deduped a dozen or two ~30MiB files into a few MiBs using a git repo before where tar.xz, borg, bup and zpaq would all produce something on the order of 200MiB.

You should generally never store BLOBs in git unless they're really tiny and perhaps not even then. We should probably enforce this in Nixpkgs btw. but that's for another topic.

blob is the Git internals term for file content objects. FWIW, @MostAwesomeDude’s statements match my distant recollection of how Git works.

I am sceptical of this. Using the Python package set is extremely annoying because I end up overriding it to lock dependencies to the versions packages need. So the approach hinted at earlier, with one package per minor version, would be much appreciated.

The Python ecosystem is much worse about following SemVer and avoiding gratuitous breaking changes than Rust. Cargo assumes SemVer, and the convention in Rust is to just pin a minimum version and let Cargo automatically pick higher versions within that major bound. Rust developers are generally sticklers enough about breaking changes that this just works. The idea is that we would package one minor version of every SemVer‐major version we need (i.e. 0.1.*, 0.2.*, 1.*, 2.*, …).

There are still opportunities for Hyrum’s law issues when we pick versions that aren’t the exact ones pinned in upstream lock files, so we may end up having to package multiple minor versions of the same major version sometimes, but that will hopefully be rare enough that the small amount of manual intervention required won’t be too annoying.

Also, to be clear, I’m solely focused on in‐tree Nixpkgs use right now. It’s not (yet) my expectation that anyone outside of Nixpkgs would consume this package set rather than doing the same things they’d do now.

@SuperSandro2000

The biggest files in the git history are the generated node-packages and the hackage file.
We also need to avoid at all costs the pitfalls the nodePackages lock file had, namely:

  • it took 90+ minutes to generate on gigabit connections because of sequential downloads
  • merge conflicts happened all the time
  • every upgrade broke some packages, which were often ignored due to the above points

Many popular packages have incompatible major versions, and we may not want to keep a package exclusively on an older version just for some outdated software that pins an older version than most software is compatible with.

Or minor. Also, some people love to pin exact patch versions of crates for no real reason, which would make this more difficult than necessary. We could end up in a situation like in Python land, where version constraints are treated as recommendations if tests fail.

@gaykitty

One thing that I have not yet seen brought up is how backporting package updates to stable would work. I suppose the easiest answer is that it would only be done manually.

One reason I bring this up is because, while the Rust community is generally good about respecting semver, increasing the minimum required compiler version is often not considered a "breaking change". So any backport to stable also requires figuring out if dependencies can be built on the compiler version in stable. The rust-version key in Cargo.toml makes this easier if it's included.

emilazy commented Aug 17, 2024

The biggest files in the git history are the generated node-packages and the hackage file. We also need to avoid at all costs the pitfalls the nodePackages lock file had, namely:

  • it took 90+ minutes to generate on gigabit connections because of sequential downloads
  • merge conflicts happened all the time
  • every upgrade broke some packages, which were often ignored due to the above points

I think these issues can be avoided with the approach I intend to explore.

Or minor. Also, some people love to pin exact patch versions of crates for no real reason, which would make this more difficult than necessary. We could end up in a situation like in Python land, where version constraints are treated as recommendations if tests fail.

Right. The architecture I have in mind could support arbitrarily‐precise version requirements if needed, but it’d be good to avoid that where possible. I don’t yet have an idea of how much of a problem it’d be and whether we’d feel the desire to patch Cargo.toml files regularly. Keeping extra versions of packages should be cheap, however.

One thing that I have not yet seen brought up is how backporting package updates to stable would work. I suppose the easiest answer is that it would only be done manually.

One reason I bring this up is because, while the Rust community is generally good about respecting semver, increasing the minimum required compiler version is often not considered a "breaking change". So any backport to stable also requires figuring out if dependencies can be built on the compiler version in stable. The rust-version key in Cargo.toml makes this easier if it's included.

I hadn’t really thought about backports but I guess they should probably just be handled by running the automation from scratch on the release branch (and, yeah, making sure it takes MSRV into account).

emilazy commented Aug 17, 2024

I should say: I don’t expect the Rust package set to be small, necessarily. I’m sure it will still take up a meaningful portion of the repository, even as we will be able to get rid of the Cargo.lock files (which will help a lot in terms of making it scale and avoiding unsustainable levels of growth). The fact is that there is just a lot of Rust code out there and we package a sizeable portion of it. The manually‐maintained Python package set is also pretty huge, and after all, Nixpkgs is close to 100% package definitions by weight; package definitions are what it exists for!

The hope is that we will get a less redundant package set, deduplicating versions where possible instead of vendoring entirely separate lock files, while gaining insight into dependency trees for all packages rather than hiding them behind opaque FODs, and allowing the Rust package maintenance to scale better by being able to apply patches, version bumps, and build tweaks on a per‐crate basis. It remains to be seen exactly how it will pan out, but I am optimistic that we will get much more value out of it than we currently do from the space we spend on Cargo.lock files.

In any case, there will definitely not be one huge file prone to merge conflicts. In that sense, this issue is pretty badly named. (I just find One Big Lock File funny, and can’t think of a particularly good title.)

@workingjubilee

Please let us know if there are cargo or rustc features that could help with this situation. I know that in general cargo has been growing more features to handle the MSRV thing.

emilazy changed the title from “Investigate using One Big Lock File for Rust packages” to “Investigate packaging Rust crates separately” on Aug 31, 2024

micahcc commented Sep 5, 2024

How would this work in a nix shell? I do most of my rust development this way and it seems like there would need to be tooling to sync Cargo.toml to match nixpkgs?

(Several comments between @nbraud and @GoldsteinE were marked as off-topic.)

nbdd0121 commented Sep 18, 2024

You can’t dynamically link crates that are not prepared for this, since generics don’t work through shared objects. Even caching crates for static linking is hard due to the way Cargo’s feature unification works: every combination of feature flags in a crate or any of its dependencies results in a completely separate build, and since features are unified globally, many builds will end up with a lot of totally unique crates.

For some crates we should be able to enable a superset of features. This doesn't work for crates that have mutually exclusive features, though.
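To illustrate (mkCrate and these attribute names are hypothetical): the same crate version built under two different unified feature sets is two distinct build products, so little could be shared between them.

```nix
{ mkCrate }:
{
  serde_minimal = mkCrate {
    pname = "serde"; version = "1.0.210"; features = [ "std" ];
  };
  serde_with_derive = mkCrate {
    pname = "serde"; version = "1.0.210"; features = [ "std" "derive" ];
  };
}
```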

nbraud commented Sep 18, 2024

we have to do the narrowing at Nix evaluation time without access to the src.

Isn't that something dynamic derivations are meant to address? As far as I understand, that would let us process src as a build step to emit the actual derivation building the package.

I'm not very in-tune with nix-interpreter development, but checking the relevant issues it seems to be actively happening: the last necessary change is written but blocked on the resolution of another bug which it exposed.

(sorry for the repost, split it from the thread I tagged as off-topic)

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/state-of-haskell-nix-ecosystem-2024/53740/9

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/what-is-nixpkgs-preferred-programming-language/53848/33
