
Autogenerated data in Nixpkgs #15480
Open
FRidh opened this issue May 15, 2016 · 39 comments
Labels
2.status: stale · 6.topic: policy discussion

Comments

@FRidh
Member

FRidh commented May 15, 2016

In several places in Nixpkgs we have autogenerated data. This issue is a placeholder for discussions on autogenerated data in Nixpkgs.

Discussion on importing from derivation and how Hydra handles it: NixOS/nix#954

@FRidh
Member Author

FRidh commented May 15, 2016

About @zimbatm's work on autogenerating Ruby metadata (#15007 (comment)): I could imagine we check all that data into a repo, but then generate archives which contain a selection of it, e.g. only the last X versions, or only the last release per major release. That way the size of the archive should be reduced significantly.

@zimbatm
Member

zimbatm commented May 15, 2016

What kind of use cases do you see for having a package inventory?

Just collecting a map of (name, version) => hash could be useful. It can be used to detect whether upstream was manipulated, and as a developer, just the sha256 of the index is enough to verify all of your packages' hashes. It's a ~100 MB download, which is still reasonable for developers; users should only get the pre-compiled binaries.

My main motivation is to avoid talking to rubygems.org when converting a Gemfile.lock to a Nix expression, and having that list of hashes is probably enough. Then, instead of generating a gems.nix file, I can make a Nix function that takes the rubygems hashes repo and the Gemfile.lock, and builds the gems.nix for me dynamically.
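A minimal sketch of what such a function could look like, assuming a hypothetical hashes repo laid out as one JSON file per gem (data/<name>.json mapping version to sha256); the Gemfile.lock parser is stubbed out, since parsing that format in pure Nix is the hard part:

{ hashesRepo, gemfileLock }:

let
  # Parsing Gemfile.lock in pure Nix is the hard part; it is stubbed out here
  # with a fixed list of { name, version } pairs for illustration.
  parseGemfileLock = _lockFile: [
    { name = "nokogiri"; version = "1.6.8"; }
    { name = "rake"; version = "11.1.2"; }
  ];

  # Look each gem's hash up in the (hypothetical) per-gem JSON file.
  lookupHash = gem:
    (builtins.fromJSON
      (builtins.readFile "${hashesRepo}/data/${gem.name}.json")).${gem.version};
in
# Produce a gems.nix-style attribute set without checking one into the repo.
builtins.listToAttrs (map (gem: {
  inherit (gem) name;
  value = { inherit (gem) version; sha256 = lookupHash gem; };
}) (parseGemfileLock gemfileLock))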

@FRidh
Member Author

FRidh commented May 15, 2016

@zimbatm you're right, end-users shouldn't actually ever need to download this file, because they would use only pre-built binaries. And as a developer you don't want to have to download such massive files every couple of days either, I would think.

About the different fields I think we can say that

  • Hashes: Important
  • License: Good to have
  • Short description: Good to have
  • Homepage: Good to have
  • Long description: Nice to have, but not important and likely not interesting because of the size.

When you said description, what type of description did you mean? A long text, or a one-liner like we use for the description field? I would imagine that if the description is a one-liner, the total size would not be 3.2 GB but more on the order of ~500 MB, judging from the size of the hashes.

@zimbatm
Member

zimbatm commented May 15, 2016

And as a developer you don't want to have to download such massive files every couple of days either, I would think.

Yeah I don't really know how to solve that problem. If we don't include all of the packages then the index becomes less useful. Projects often lag behind in terms of dependencies so it's not enough to just include the latest version of each package.

About the different fields I think we can say that [snip]

agreed

When you said description, what type of description did you mean?

Just the one-liner. I processed the first 1440 packages, and extrapolating the disk usage linearly, the full 800k would be 3.2 GB of disk. One compromise might be to keep only the values of the last published version per package.

@FRidh
Member Author

FRidh commented May 15, 2016

One compromise might be to keep only the values of the last published version per package.

Seems to be a fair compromise. In the case of PyPI there is only one description and license per package. I will also have a look at what the size of the file is going to be.

Actually, if we were to go this way, I think it might be a good idea to ask upstream whether they can create regular dumps of the database.

@bjornfor
Contributor

Do the pypi and/or ruby online databases allow fetching the database at a certain state? Similarly to what can be done with git repositories?

@FRidh
Member Author

FRidh commented May 15, 2016

@bjornfor in the case of PyPI there are a couple of APIs, but no way to query for a specific state of the database. But generating a dump of their database seems to be on their TO-DO list:

TO-DO list
A big structured dump of all package meta-data.

@zimbatm
Member

zimbatm commented May 15, 2016

@bjornfor in the case of rubygems, a single 2.7 MB file, specs.4.8.gz, can be downloaded that contains a list of all the published gems. Gems are immutable once published; the only state that can change is when they get yanked from the repo. There is also an API endpoint that returns the last 50 published gems, but no "last published since …" endpoint.

@FRidh
Member Author

FRidh commented May 16, 2016

I checked the PyPI APIs again and found that via XML-RPC it is possible to obtain the changes since a certain time or serial:

changelog(since, with_ids=False)

Retrieve a list of four-tuples (name, version, timestamp, action), or five-tuples including the serial id if ids are requested, since the given timestamp. All timestamps are UTC values. The argument is a UTC integer seconds since the epoch.

changelog_since_serial(since_serial)

Retrieve a list of five-tuples (name, version, timestamp, action, serial) since the event identified by the given serial. All timestamps are UTC values.

So we just put the serial in the JSON and use that when updating: we check what new events occurred, and for each package in that set we retrieve the metadata and update the previous JSON file. With a cron job we can automate this and push the latest version to the git repo.

@zimbatm
Member

zimbatm commented May 16, 2016

nice

@copumpkin
Member

copumpkin commented May 16, 2016

Highly relevant: NixOS/nix#52

@zimbatm
Member

zimbatm commented May 16, 2016

And #14897

@FRidh
Member Author

FRidh commented May 17, 2016

I created a separate script to generate/update JSON for Python. The source can be found at
https://github.com/FRidh/srcs-pypi-update
The generated JSON is pushed to https://github.com/FRidh/srcs-pypi.

The script pulls srcs-pypi. If the JSON file is there, it uses the stored timestamp to determine which packages need to be updated. If there is no JSON, it retrieves data for all packages.

Note that for now I checked in only the first 500 packages, because I think some changes to the format are still needed, and it takes quite some time to retrieve the data for all 80,000 packages.

Some open issues:

  • What to do with fields for which data is not available, e.g. license? Right now I store null, but it might be better to just drop the attribute in that case.
  • The xmlrpc API offers two methods for creating a changelog, either with a serial number or a timestamp. For the initial run I need to pass a timestamp, but after that it is possible to use the serial instead. For now I'll stick with the timestamp.
  • Getting rid of whitespace would make the file smaller but less readable.

The next step is to create a NixOS module or just a service. I have no idea yet how to, though :-)

@zimbatm
Member

zimbatm commented May 17, 2016

In the rubygems2nix project I store one .json file per package-version-platform and then have a top-level default.nix which is just a function that takes "name" and "version" attributes and returns the parsed JSON. That way the content is lazily loaded.
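A minimal sketch of such a top-level default.nix, assuming a hypothetical ./data/<name>/<version>.json layout; because Nix evaluates lazily, only the requested file is ever read:

{ name, version }:

# Read and parse a single per-package JSON file on demand.
builtins.fromJSON (builtins.readFile (./data + "/${name}/${version}.json"))

Calling import ./. { name = "nokogiri"; version = "1.6.8"; } would then touch exactly one file on disk.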

@garbas
Member

garbas commented May 31, 2016

I'm 👎 on including all language metadata in the nixpkgs repo.

I like the proposal from @gilligan (#14532) that only applications and their dependencies are kept in nixpkgs.

What can/should be done is to have repos for particular languages which extend nixpkgs (e.g. nixpkgs-python, nixpkgs-haskell, nixpkgs-ruby). Those repositories can then be as big as those communities see fit.

@FRidh
Member Author

FRidh commented May 31, 2016

In the rubygems2nix project I store one .json file per package-version-platform and then have a top-level default.nix which is just a function that takes "name" and "version" attributes and returns the parsed JSON. That way the content is lazily loaded.

Good idea. It would actually make the updating process slightly easier for me as well.

I like the proposal from @gilligan (#14532) that only applications and their dependencies are kept in nixpkgs.

Saying we only keep some, not all, is tricky, because where do you draw the line? In this case you have a pretty well-defined line: only packages that are dependencies of applications. It sounds fair to me, and I would be interested in seeing what percentage of Python packages would then go unused.

It could be a fair amount, but even so, we have many tiny dependencies that are hardly ever updated because simply nobody bothers with them. We have to find a solution for that.

Some packages that should then no longer be in Nixpkgs are scipy, pandas and django.
But what about, say, jupyter? It could be considered an application, and it is also used with e.g. Haskell, but it has a great many dependencies.

I'm 👎 on including all language metadata in the nixpkgs repo.

Do you not want the actual data in the repo, or do you not want to reference it from the repo?
My idea is to have srcs-pypi, srcs-github and so on repos. Nixpkgs would reference those and would itself remain much smaller. The disadvantage here is that evaluation would likely require downloading a large file.

@copumpkin
Member

copumpkin commented May 31, 2016

I wouldn't split it along the lines of the (as you point out) poorly defined categories of "application" or otherwise. I'd just do as we've suggested elsewhere and make haskellPackages import from another repository, and same with pythonPackages and others. If your evaluation forces a value from one of those attribute sets, Nix transparently downloads and evaluates it. There could be some building during evaluation, but I don't think it's terrible.

The main thing to figure out in this space is how nix-env's name-based crap works with it. I never use it, so I don't know whether nix-env normally recurses into the language-specific packages, but if it does, that'll force a download of the language repositories once per channel update.

@bobvanderlinden
Member

Just to give an idea of how this will look: is there an example of Nix expressions that are imported from another git repo? Or is this a feature that isn't in Nix yet? It sounds very promising, though.

@garbas
Member

garbas commented Jun 2, 2016

@bobvanderlinden I imagine it would work the same way you usually "pin" the nixpkgs revision you are using in your project. I once wrote about this here.

@copumpkin
Member

@garbas I think you're missing an import (you fetchgit the repo but don't import the result; I might also use fetchFromGitHub these days) and a function application (to {}) in your example, but otherwise that's what I was thinking of.

@bobvanderlinden
Member

So just to recap, it'll look something like:

{
    pythonPackages = import (nixpkgs.fetchFromGitHub {
        owner = "nixos";
        repo = "nixpkgs-python";
        rev = "d0dd1f61b33d64e29d8bc1372a94ef6a2fee76a9";
        sha256 = "71529cd6998a14f6482a78f078e1e868570f37b946ea4ba04820879479e3aa2c";
    }) {};
}

And nixpkgs-python will be fully/partially automatically generated.

Also, regarding nix-env's name-based crap: this will probably get a less prominent role with the new Nix CLI.

See https://gist.github.com/edolstra/efcadfd240c2d8906348:

By default, packages are selected by attribute name, rather than the name attribute.

That said, it might still be possible to add names for the applications themselves in nixpkgs, whereas the packages in nixpkgs-python will have their own name attribute.

@FRidh
Member Author

FRidh commented Jun 5, 2016

I've generated a JSON file per package from PyPI, and grouped the packages by first letter.
See https://github.com/FRidh/srcs-pypi
The archive is 80 MB.

@FRidh
Member Author

FRidh commented Jun 5, 2016

1000 issues further, and here (#16005) is a branch where you can use buildPyPIPackage with the generated JSON.
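Going by that branch's description, usage could look roughly like the following sketch; the exact argument names of buildPyPIPackage are assumptions here, the point being that the source URL and hash come from the generated JSON rather than the call site:

{ buildPyPIPackage }:

buildPyPIPackage {
  # Only a name and version are given; the src and sha256 are looked up in
  # the generated per-package JSON (argument names are assumptions).
  pname = "jupyter";
  version = "1.0.0";
}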

@FRidh
Member Author

FRidh commented Jun 22, 2016

Since cc5adac the hashes for KDE 5 are also stored outside of Nixpkgs, in https://github.com/ttuegel/nixpkgs-kde-qt.
Should that repo be moved into the Nixpkgs organization, or can it be kept outside? I suppose the latter is a bit more flexible for @ttuegel.

@ttuegel
Member

ttuegel commented Jun 22, 2016

@FRidh That was reverted, see #16130. It is impossible to have generated Nix expressions in Nixpkgs.

@zimbatm
Member

zimbatm commented Jun 22, 2016

Another issue is that nix-env -qaP would have to download these external repos, wouldn't it?

I started something similar for rubygems: https://github.com/zimbatm/nix-rubygems. I'm still in the process of fetching the hashes of all 800k of them.

@vcunat
Member

vcunat commented Jun 22, 2016

I believe that's the essential problem: the Nix tooling isn't really designed for the possibility of requiring builds to happen during evaluation (downloads are just a specific type of build). If it's just a matter of downloading several megabytes and caching those in the Nix store, that could in principle be made not too obtrusive... Another example: such fetching currently always runs verbose, IIRC, even during queries or --dry-run.

@zimbatm
Member

zimbatm commented Jun 22, 2016

I'm not entirely sure we are all trying to solve the same problem, so maybe we should start with that: what is it that you are trying to solve? Maybe a solution will emerge from that, or we'll just realize we were all trying to solve different things.


As a Ruby developer in a mixed developer environment I cannot ask my colleagues to update the gemset.nix file. And whenever the Ruby dependencies get updated, it also generates an even larger diff in the commits, which is not ideal for reviewing the code (on GitHub).

So my idea was to collect the hashes of all the gems so I could generate the gemset.nix dynamically. Then all I have is a single default.nix in the repo root and an external list of rubygems that I can keep up to date independently. I don't particularly care about the binary cache, so it doesn't matter too much if it gets out of sync with nixpkgs. An optimisation would be for nixpkgs to hold a couple of the expensive ones, like nokogiri, which builds a C extension very slowly, but that can come later.

@FRidh
Member Author

FRidh commented Jun 23, 2016

In the case of Python the goal is to have all of PyPI available because

In #16005 an archive is downloaded and the included default.nix is used to select the right JSON file. That Nix function could also be moved out of the archive I guess. Would there still be a problem with restricted mode on Hydra in that case?

@vcunat
Member

vcunat commented Jun 23, 2016

Would there still be a problem with restricted mode on Hydra in that case?

I believe so. Any importing from derivation shouldn't work, except perhaps for the case where you already have the output.
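For reference, a minimal example of what "importing from derivation" means: the imported file is itself the output of a build, so evaluation has to pause and realise a derivation before it can continue, which is exactly what restricted evaluation refuses to do:

with import <nixpkgs> { };

# The imported .nix file does not exist until the runCommand derivation has
# been built, forcing a build during evaluation.
import (runCommand "generated.nix" { } ''
  echo '{ answer = 42; }' > $out
'')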

@zimbatm
Member

zimbatm commented Jun 23, 2016

What if there was a channel per $lang repository; what would be the drawbacks of that approach? We could still copy one version of selected executables to nixpkgs (or develop tooling to automate that). We would probably lose the binary cache, but that doesn't matter too much for scripting languages, as the build step is non-existent or very short.

@FRidh
Member Author

FRidh commented Jun 23, 2016

What if there was a channel per $lang repository; what would be the drawbacks of that approach?

Archives of Nixpkgs consist only of a specific branch, and I think the idea was to deprecate channels.

We would probably lose the binary cache but it doesn't matter too much for scripting languages as the build step is non-existent or very short.

At least with Python there are several packages that do take a long time to build, and for which the binary cache is therefore important. We could choose to keep those in the main Nixpkgs, but I find such a line quite arbitrary, as I pointed out before.

I believe so. Any importing from derivation shouldn't work, except perhaps for the case where you already have the output.

The Nix manual states

Nix has a new option restrict-eval that allows limiting what paths the Nix evaluator has access to. By passing --option restrict-eval true to Nix, the evaluator will throw an exception if an attempt is made to access any file outside of the Nix search path. This is primarily intended for Hydra to ensure that a Hydra jobset only refers to its declared inputs (and is therefore reproducible).

and

In Nix expressions, via the new builtin function fetchTarball:
with import (fetchTarball https://github.com/NixOS/nixpkgs-channels/archive/nixos-14.12.tar.gz) {};
...
(This is not allowed in restricted mode.)

This is with fetchTarball, which doesn't take a hash. Maybe it does work with fetchurl and fetchgit? I will push #16005 to a Hydra and see what happens.
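For comparison, a sketch of the hashed variant being asked about, with a placeholder sha256 and a purely illustrative rev; note the import itself still happens at evaluation time, so this is import from derivation either way:

with import <nixpkgs> { };

# fetchgit with a declared sha256 is a fixed-output derivation, unlike the
# unhashed fetchTarball quoted above.
import (fetchgit {
  url = "https://github.com/FRidh/srcs-pypi.git";
  rev = "12bc2a4cbcec42a867e4cc2fdef67e76b2c0b169";
  sha256 = "0000000000000000000000000000000000000000000000000000";  # placeholder
})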

@FRidh
Member Author

FRidh commented Jun 28, 2016

I've tested whether a Hydra would build packages that use buildPyPIPackage (#16005) and it indeed doesn't. That is a pity.

@FRidh
Member Author

FRidh commented Aug 13, 2016

@zimbatm perhaps an idea in your case would be to chain two repositories: one that has, per package, a file containing all relevant metadata, and another that contains the filenames and hashes of those files. You would then use the commit identifier in a URL like https://raw.githubusercontent.com/FRidh/srcs-pypi/12bc2a4cbcec42a867e4cc2fdef67e76b2c0b169/data/j/jupyter.json to get the gem.
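A sketch of fetching one such pinned per-package file in Nix; the sha256 is a placeholder, and reading the fetched file at evaluation time runs into the same import-from-derivation limitation discussed above:

{ fetchurl }:

let
  # Pin a specific commit of the data repo via raw.githubusercontent.com.
  rev = "12bc2a4cbcec42a867e4cc2fdef67e76b2c0b169";
in
builtins.fromJSON (builtins.readFile (fetchurl {
  url = "https://raw.githubusercontent.com/FRidh/srcs-pypi/${rev}/data/j/jupyter.json";
  sha256 = "0000000000000000000000000000000000000000000000000000";  # placeholder
}))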

@teto
Member

teto commented Sep 13, 2017

I've just started something to convert rockspecs to Nix: https://github.com/teto/luarocks/tree/nix.

@Ekleog
Member

Ekleog commented Oct 17, 2018

(from triage) I think NUR provides quite a good solution to this issue: have a NUR repository per auto-generated source. This could be compatible with the idea that nixpkgs only keeps the applications' dependencies (auto-imported), solving the size issue, while at the same time allowing a generated complete copy of the archive to co-exist.

I really don't like the idea of chaining two repositories, even for applications: if the second repository becomes necessary in a lot of setups (just think how many packages require at least one Python package), then we might as well just put the two together; at least it'd display the download at the correct time.

@stale

This comment has been minimized.

The stale bot added the 2.status: stale label on Jun 3, 2020.
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/future-of-npm-packages-in-nixpkgs/14285/3

The stale bot removed the 2.status: stale label on Jul 28, 2021.
@stale

stale bot commented Apr 30, 2022

I marked this as stale due to inactivity. → More info

The stale bot added the 2.status: stale label on Apr 30, 2022.