Autogenerated data in Nixpkgs #15480
In Nixpkgs we have autogenerated data in several places. This issue is a placeholder for discussions on autogenerated data in Nixpkgs.

Discussion on importing from derivation and how Hydra handles it: NixOS/nix#954
About @zimbatm's work on autogenerating Ruby metadata (#15007 (comment)): I could imagine we check all that data into a repo, but then generate archives which contain only a selection of it, e.g. only the last X versions, or only the last release per major release. That way the size of the archive should be reduced significantly.
What kind of use cases do you see for having a package inventory? Just collecting a list of (name, version) => hash could be useful. It can be used to detect whether upstream was tampered with, and as a developer, just the sha256 of the index is enough to get all of your packages' hashes. It's ~100 MB of download, which is still reasonable for developers; users should only get the pre-compiled binary.

My main motivation is to avoid talking to rubygems.org when converting a Gemfile.lock to a Nix expression, and having that list of hashes is probably enough. Then instead of building a gems.nix file I can write a Nix function that takes the rubygems hashes repo and the Gemfile.lock, and builds the gems.nix for me dynamically.
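Something like the following sketch could work; the file names, index layout, and example gems are made up for illustration, and the Gemfile.lock parsing is assumed to happen elsewhere:

```nix
{ fetchurl, lib }:

let
  # Hypothetical index shaped like { "nokogiri" = { "1.6.8" = "<sha256>"; }; ... },
  # generated once from rubygems.org and committed to its own repository.
  hashIndex = builtins.fromJSON (builtins.readFile ./rubygems-hashes.json);

  # Gems as a small parser would extract them from Gemfile.lock.
  gems = [
    { name = "rake"; version = "11.1.2"; }
    { name = "nokogiri"; version = "1.6.8"; }
  ];
in
# Build an attribute set of fetched gem sources, looking up each hash in the index
# instead of asking rubygems.org at generation time.
lib.listToAttrs (map (gem: {
  name = gem.name;
  value = fetchurl {
    url = "https://rubygems.org/gems/${gem.name}-${gem.version}.gem";
    sha256 = hashIndex.${gem.name}.${gem.version};
  };
}) gems)
```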
@zimbatm you're right, end-users shouldn't actually ever need to download this file because they would only use pre-built binaries. And as a developer you wouldn't want to have to download such massive files every couple of days either, I would think. About the different fields I think we can say that
When you said description, what type of description did you mean? A long text, or a one-liner like we use for the description field? I would imagine that if the description is a one-liner, the total size would not be 3.2 GB but more on the order of ~500 MB, judging from the size of the hashes.
Yeah I don't really know how to solve that problem. If we don't include all of the packages then the index becomes less useful. Projects often lag behind in terms of dependencies so it's not enough to just include the latest version of each package.
agreed
Just the one-liner. I processed the first 1440 packages, and extrapolating linearly from the disk usage, the full 800k would take 3.2 GB of disk. One compromise might be to keep only the values of the last published version per package.
Seems to be a fair compromise. In the case of PyPI there is only one description and license per package. I will also have a look at what the size of the file is going to be. Actually, I think that if we go this way, it might be a good idea to ask upstream whether they can create regular dumps of the database.
Do the PyPI and/or Ruby online databases allow fetching the database at a certain state, similar to what can be done with git repositories?
@bjornfor in the case of PyPI there are a couple of APIs, but no way to query for a specific state of the database. Generating a dump of their database seems to be on their to-do list, though.
@bjornfor in the case of rubygems, a single |
I checked the APIs for PyPI again and found that via XML-RPC it is possible to obtain the changes since a certain time or serial. So we just put the serial in the JSON and use it when updating: we check which new events occurred, and for each package in that set we retrieve the metadata and update the previous JSON file. With a cron job we can automate this and push the latest version to the git repo.
nice
Highly relevant: NixOS/nix#52
And #14897
I created a separate script to generate/update the JSON for Python. The source can be found at
The script pulls the
Note that for now I checked in only the first 500 packages, because I think some changes to the format are still needed, and it takes quite some time to retrieve the data for all 80,000 packages. Some open issues:
Next step is to create a NixOS module or just a service. I have no idea yet how to, though :-)
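A minimal sketch of how such a periodic updater could be wired up as a NixOS service plus systemd timer; the service name, script path, interpreter and daily schedule are all assumptions, not the actual implementation:

```nix
{ config, lib, pkgs, ... }:

{
  # Hypothetical oneshot service that refreshes the generated JSON and pushes it.
  systemd.services.pypi-metadata-update = {
    description = "Update autogenerated PyPI metadata";
    serviceConfig.Type = "oneshot";
    # update.py stands in for the generator script discussed above.
    script = ''
      ${pkgs.python3}/bin/python /var/lib/pypi-metadata/update.py
    '';
  };

  # Run it once a day; the schedule is an arbitrary choice for this sketch.
  systemd.timers.pypi-metadata-update = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = "daily";
      Persistent = true;
    };
  };
}
```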
In the rubygems2nix project I store one |
I'm 👎 on including all language metadata in the nixpkgs repo. I like the proposal from @gilligan (#14532) that only applications and their dependencies are kept in nixpkgs. What can/should be done is to have repos for particular languages which extend nixpkgs (e.g. nixpkgs-python, nixpkgs-haskell, nixpkgs-ruby). Then those repositories can be as big as those communities see fit.
Good idea. It would actually make the updating process slightly easier for me as well.
Saying we only keep some packages, not all, is tricky, because where do you draw the line? In this case you have a pretty well-defined line: only packages that are dependencies of applications. It sounds fair to me, and I would be interested in seeing what percentage of Python packages would go unused then. It could be a fair amount, but even so, we have many tiny dependencies that are hardly ever updated simply because nobody bothers with them. We have to find a solution for that. Some packages that should then no longer be in Nixpkgs are
You don't want the actual data in the repo, or you don't want it referenced from the repo?
I wouldn't split it along the lines of the (as you point out) poorly defined categories of "application" or otherwise. I'd just do as we've suggested elsewhere and make The main thing to figure out in this space is how |
Just to give an idea of how this will look: is there an example of Nix expressions that are imported from another git repo? Or is this a feature that isn't in Nix yet? It sounds very promising though.
@bobvanderlinden I imagine the same way you would usually "pin" the nixpkgs revision you are using in your project. I once wrote about this here
@garbas I think you're missing an |
So just to recap, it'll look something like:

```nix
{
  pythonPackages = import (nixpkgs.fetchFromGitHub {
    owner = "nixos";
    repo = "nixpkgs-python";
    rev = "d0dd1f61b33d64e29d8bc1372a94ef6a2fee76a9";
    sha256 = "71529cd6998a14f6482a78f078e1e868570f37b946ea4ba04820879479e3aa2c";
  }) {};
}
```

Also, regarding the above, see https://gist.github.com/edolstra/efcadfd240c2d8906348:
That said, it might still be possible to add names for the applications themselves in nixpkgs whereas the nixpkgs-python will have their own |
I've generated a JSON file per package from PyPI, and grouped the packages by their first letter.
1000 issues further, and here (#16005) is a branch where you can use buildPyPIPackage with the generated JSON.
Since cc5adac the hashes for KDE 5 are also stored outside of Nixpkgs, in https://github.com/ttuegel/nixpkgs-kde-qt
Another issue is that I started something similar for rubygems: https://github.com/zimbatm/nix-rubygems. I'm still in the process of fetching the hashes for all 800k of them.
I believe that's the essential problem: the Nix tooling isn't really designed for the possibility of requiring builds to happen during evaluation (downloads are just a specific type of build). If it's just a matter of downloading several megabytes and caching those in the Nix store, that could be made not too obtrusive in principle... Another example: such fetching currently always runs verbose, IIRC, even during queries or
I'm not entirely sure we are all trying to solve the same problem, so maybe we should start with that: what is it that you are trying to solve? Maybe a solution will appear from that, or we'll just realize we were all trying to solve different things.

As a Ruby developer in a mixed developer environment I cannot ask my colleagues to update the gemset.nix file. Whenever the Ruby dependencies get updated it also generates an even larger diff in the commits, which is not ideal for reviewing the code (on GitHub). So my idea was to generate the hashes of all the gems so I could generate the gemset.nix dynamically. Then all I have is a single default.nix in the repo root and an external list of rubygems that I can keep up to date independently.

I don't particularly care about the binary cache, so it doesn't matter too much if it gets out of sync with nixpkgs. An optimisation would be for nixpkgs to hold a couple of the expensive ones, like nokogiri which builds a C extension very slowly, but that can come later.
In the case of Python the goal is to have all of PyPI available because
In #16005 an archive is downloaded and the included |
I believe so. Any import from derivation shouldn't work, except perhaps when you already have the output.
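For reference, a minimal illustration of the "import from derivation" pattern under discussion; the generated content here is fabricated, and in the real setup a generator would run over a downloaded package index. The point is that evaluation cannot finish without first realising the derivation whose output is imported:

```nix
let
  pkgs = import <nixpkgs> {};

  # A derivation whose output is itself a Nix expression.
  generated = pkgs.runCommand "generated-packages.nix" {} ''
    echo '{ example = { version = "1.0"; }; }' > $out
  '';
in
# Importing it forces the build to happen at evaluation time.
import generated
```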
What if there were a channel per $lang repository? What would be the drawbacks of that approach? We could still copy one version of selected executables into nixpkgs (or develop tooling to automate that). We would probably lose the binary cache, but that doesn't matter too much for scripting languages, as the build step is non-existent or very short.
Archives of Nixpkgs consist only of a specific branch, and I think the idea was to deprecate channels.
At least with Python there are several packages that do take a long time to build and for which the binary cache is therefore important. We could choose to keep those in the main Nixpkgs, but I find such a line quite arbitrary, as I pointed out before.
The Nix manual states
and
This is with |
I've tested whether a Hydra would build packages that use |
@zimbatm is it perhaps an idea in your case to chain two repositories? One that has, per package, a file containing all relevant metadata, and another that contains the filenames and hashes of those files. You would then use the commit identifier in a URL like https://raw.githubusercontent.com/FRidh/srcs-pypi/12bc2a4cbcec42a867e4cc2fdef67e76b2c0b169/data/j/jupyter.json to get the gem.
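Roughly, chaining the two could look like the sketch below; the sha256 is a placeholder that, in the scheme described, would come from the second repository mapping file names to hashes:

```nix
{ fetchurl }:

let
  rev = "12bc2a4cbcec42a867e4cc2fdef67e76b2c0b169";

  # Per-package metadata file, pinned by commit and verified by the hash the
  # companion repository would provide (placeholder value here).
  jupyterMeta = builtins.fromJSON (builtins.readFile (fetchurl {
    url = "https://raw.githubusercontent.com/FRidh/srcs-pypi/${rev}/data/j/jupyter.json";
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000";
  }));
in
jupyterMeta
```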
I've just started something to convert rockspecs to Nix: https://github.com/teto/luarocks/tree/nix.
(from triage) I think NUR provides quite a good solution to this issue: have a NUR repository per auto-generated source. This could be compatible with the idea that nixpkgs only keeps the applications' dependencies (auto-imported), solving the size issue, while at the same time allowing a generated complete copy of the archive to co-exist. I really don't like the idea of chaining two repositories, even for applications: if the second repository becomes necessary in a lot of setups (just think how many packages require at least one Python package), then we might as well just put the two together; at least it would display the download at the correct time.
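For illustration, consuming such a per-ecosystem NUR repository would look roughly like this; the repository name (pypi-autogen) and package are hypothetical, and the tarball is fetched unpinned here for brevity:

```nix
let
  pkgs = import <nixpkgs> {};

  # Pull in the NUR namespace; individual user repositories live under nur.repos.
  nur = import (builtins.fetchTarball "https://github.com/nix-community/NUR/archive/master.tar.gz") {
    inherit pkgs;
  };
in
# Hypothetical auto-generated repository exposing PyPI-derived packages.
nur.repos.pypi-autogen.requests
```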
This issue has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/future-of-npm-packages-in-nixpkgs/14285/3