
Autogenerated data in Nixpkgs #15480
Open
FRidh opened this issue May 15, 2016 · 39 comments
Labels
2.status: stale · 6.topic: policy discussion

Comments

@FRidh
Member

FRidh commented May 15, 2016

In several places in Nixpkgs we have autogenerated data. This issue is a placeholder for discussions on autogenerated data in Nixpkgs.

Discussion on importing from derivation and how Hydra handles it: NixOS/nix#954

@FRidh
Member Author

FRidh commented May 15, 2016

About @zimbatm's work on autogenerating Ruby metadata (#15007 (comment)): I could imagine we check all that data into a repo, but then generate archives which contain a selection of it, e.g. only the last X versions, or only the last release per major release. That way the size of the archive should be reduced significantly.

@zimbatm
Member

zimbatm commented May 15, 2016

What kind of use cases do you see for having a package inventory?

Just collecting a map of (name, version) => hash could be useful. It can be used to detect whether upstream was manipulated, and as a developer, just the sha256 of the index is enough to verify all of your packages' hashes. It's a ~100 MB download, which is still reasonable for developers; users should only get the pre-compiled binaries.

My main motivation is to avoid talking to rubygems.org when converting a Gemfile.lock to a Nix expression, and having that list of hashes is probably enough. Then, instead of generating a gems.nix file, I can make a Nix function that takes the rubygems hashes repo and the Gemfile.lock, and builds the gems.nix for me dynamically.
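A minimal sketch of what such a function could look like, assuming a hypothetical hashes repo laid out as one JSON file per gem (data/<name>.json mapping version to sha256); the Gemfile.lock parser is stubbed out, since parsing that format in pure Nix is the hard part:

{ hashesRepo, gemfileLock }:

let
  # Parsing Gemfile.lock in pure Nix is the hard part; it is stubbed out here
  # with a fixed list of { name, version } pairs for illustration.
  parseGemfileLock = _lockFile: [
    { name = "nokogiri"; version = "1.6.8"; }
    { name = "rake"; version = "11.1.2"; }
  ];

  # Look each gem's hash up in the (hypothetical) per-gem JSON file.
  lookupHash = gem:
    (builtins.fromJSON
      (builtins.readFile "${hashesRepo}/data/${gem.name}.json")).${gem.version};
in
# Produce a gems.nix-style attribute set without checking one into the repo.
builtins.listToAttrs (map (gem: {
  inherit (gem) name;
  value = { inherit (gem) version; sha256 = lookupHash gem; };
}) (parseGemfileLock gemfileLock))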

@FRidh
Member Author

FRidh commented May 15, 2016

@zimbatm you're right, end-users shouldn't actually ever need to download this file, because they would use only pre-built binaries. And as a developer you don't want to have to download such massive files every couple of days either, I would think.

About the different fields I think we can say that

  • Hashes: Important
  • License: Good to have
  • Short description: Good to have
  • Homepage: Good to have
  • Long description: Nice to have, but not important and likely not interesting because of the size.

When you said description, what type of description did you mean? A long text, or a one-liner like we use for the description field? I would imagine that if the description is a one-liner, the total size would not be 3.2 GB but more on the order of ~500 MB, judging from the size of the hashes.

@zimbatm
Member

zimbatm commented May 15, 2016

And as a developer you don't want to have to download such massive files every couple of days either, I would think.

Yeah I don't really know how to solve that problem. If we don't include all of the packages then the index becomes less useful. Projects often lag behind in terms of dependencies so it's not enough to just include the latest version of each package.

About the different fields I think we can say that [snip]

agreed

When you said description, what type of description did you mean?

Just the one-liner. I processed the first 1440 packages, and extrapolating the disk usage linearly, the full 800k would be 3.2 GB of disk. One compromise might be to keep only the values of the last published version per package.

@FRidh
Member Author

FRidh commented May 15, 2016

One compromise might be to keep only the values of the last published version per package.

Seems to be a fair compromise. In the case of PyPI there is only one description and license per package. I will also have a look at what the size of the file is going to be.

Actually, if we were to go this way, I think it might be a good idea to ask upstream whether they can create regular dumps of the database.

@bjornfor
Contributor

Do the pypi and/or ruby online databases allow fetching the database at a certain state? Similarly to what can be done with git repositories?

@FRidh
Member Author

FRidh commented May 15, 2016

@bjornfor in the case of PyPI there are a couple of APIs, but no way to query for a specific state of the database. But generating a dump of their database seems to be on their TO-DO list:

TO-DO list
A big structured dump of all package meta-data.

@zimbatm
Member

zimbatm commented May 15, 2016

@bjornfor in the case of rubygems, a single 2.7 MB file, specs.4.8.gz, can be downloaded that contains a list of all the published gems. Gems are immutable once published; the only state that can change is when they get yanked from the repo. There is also an API endpoint that returns the last 50 published gems, but no "last published since …" endpoint.

@FRidh
Member Author

FRidh commented May 16, 2016

I checked the PyPI APIs again and found that via XML-RPC it is possible to obtain the changes since a certain time or serial:

changelog(since, with_ids=False)

Retrieve a list of four-tuples (name, version, timestamp, action), or five-tuples including the serial id if ids are requested, since the given timestamp. All timestamps are UTC values. The argument is a UTC integer seconds since the epoch.

changelog_since_serial(since_serial)

Retrieve a list of five-tuples (name, version, timestamp, action, serial) since the event identified by the given serial. All timestamps are UTC values.

So we just put the serial in the JSON and use that when updating: we check what new events occurred, and for each package in that set we retrieve the metadata and update the previous JSON file. With a cron job we can automate this and push the latest version to the git repo.

@zimbatm
Member

zimbatm commented May 16, 2016

nice

@copumpkin
Member

copumpkin commented May 16, 2016

Highly relevant: NixOS/nix#52

@zimbatm
Member

zimbatm commented May 16, 2016

And #14897

@FRidh
Member Author

FRidh commented May 17, 2016

I created a separate script to generate/update JSON for Python. The source can be found at
https://github.com/FRidh/srcs-pypi-update
The generated JSON is pushed to https://github.com/FRidh/srcs-pypi.

The script pulls srcs-pypi. If the JSON file is there, it uses the stored timestamp to determine which packages need to be updated. If there is no JSON, it retrieves data for all packages.

Note that for now I checked in only the first 500 packages, because I think some changes to the format are still needed, and it takes quite some time to retrieve the data for all 80,000 packages.

Some open issues:

  • What to do with fields for which data is not available, e.g. license? Right now I store null, but it might be better to just drop the attribute in that case.
  • The xmlrpc API offers two methods for creating a changelog, either with a serial number or a timestamp. For the initial run I need to pass a timestamp, but after that it is possible to use the serial instead. For now I'll stick with the timestamp.
  • Getting rid of whitespace would make the file smaller but less readable.

The next step is to create a NixOS module or just a service. I have no idea yet how to, though :-)

@zimbatm
Member

zimbatm commented May 17, 2016

In the rubygems2nix project I store one .json file per package-version-platform and then have a top-level default.nix which is just a function that takes "name" and "version" attributes and returns the parsed JSON. That way the content is lazily loaded.
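A minimal sketch of such a top-level default.nix, assuming a hypothetical ./data/<name>/<version>.json layout; because Nix evaluates lazily, only the requested file is ever read:

{ name, version }:

# Read and parse a single per-package JSON file on demand.
builtins.fromJSON (builtins.readFile (./data + "/${name}/${version}.json"))

Calling import ./. { name = "nokogiri"; version = "1.6.8"; } would then touch exactly one file on disk.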

@garbas
Member

garbas commented May 31, 2016

I'm 👎 on including all language metadata in the nixpkgs repo.

I like the proposal from @gilligan (#14532) that only applications and their dependencies are kept in nixpkgs.

What can/should be done is to have repos for particular languages which extend nixpkgs (e.g. nixpkgs-python, nixpkgs-haskell, nixpkgs-ruby). Those repositories can then be as big as those communities see fit.

@FRidh
Member Author

FRidh commented May 31, 2016

In the rubygems2nix project I store one .json file per package-version-platform and then have a top-level default.nix which is just a function that takes "name" and "version" attributes and returns the parsed JSON. That way the content is lazily loaded.

Good idea. It would actually make the updating process slightly easier for me as well.

I like the proposal from @gilligan (#14532) that only applications and their dependencies are kept in nixpkgs.

Saying we only keep some, not all, is tricky, because where do you draw the line? In this case you have a pretty well-defined line: only packages that are dependencies of applications. It sounds fair to me, and I would be interested in seeing what percentage of Python packages would then go unused.

It could be a fair amount, but even so, we have many tiny dependencies that are hardly ever updated because simply nobody bothers with them. We have to find a solution for that.

Some packages that should then no longer be in Nixpkgs are scipy, pandas and django.
But what about, say, jupyter? It could be considered an application, and it is also used with e.g. Haskell, but it has a great many dependencies.

I'm 👎 on including all language metadata in the nixpkgs repo.

Do you not want the actual data in the repo, or do you not want to reference it from the repo?
My idea is to have srcs-pypi, srcs-github and so on repos. Nixpkgs would reference those and would itself remain much smaller. The disadvantage here is that evaluation would likely require downloading a large file.

@copumpkin
Member

copumpkin commented May 31, 2016

I wouldn't split it along the lines of the (as you point out) poorly defined categories of "application" or otherwise. I'd just do as we've suggested elsewhere and make haskellPackages import from another repository, and same with pythonPackages and others. If your evaluation forces a value from one of those attribute sets, Nix transparently downloads and evaluates it. There could be some building during evaluation, but I don't think it's terrible.

The main thing to figure out in this space is how nix-env's name-based crap works with it. I never use it, so I don't know whether nix-env normally recurses into the language-specific packages, but if it does, that'll force a download of the language repositories once per channel update.

@bobvanderlinden
Member

Just to give an idea of how this will look: is there an example of Nix expressions that are imported from another git repo? Or is this a feature that isn't in Nix yet? It sounds very promising, though.

@garbas
Member

garbas commented Jun 2, 2016

@bobvanderlinden I imagine it would work the same way you usually "pin" the nixpkgs revision you are using in your project. I once wrote about this here.

@copumpkin
Member

@garbas I think you're missing an import (you fetchgit the repo but don't import the result; I might also use fetchFromGitHub these days) and a function application (to {}) in your example, but otherwise that's what I was thinking of.

@bobvanderlinden
Member

So just to recap, it'll look something like:

{
    pythonPackages = import (nixpkgs.fetchFromGitHub {
        owner = "nixos";
        repo = "nixpkgs-python";
        rev = "d0dd1f61b33d64e29d8bc1372a94ef6a2fee76a9";
        sha256 = "71529cd6998a14f6482a78f078e1e868570f37b946ea4ba04820879479e3aa2c";
    }) {};
}

And nixpkgs-python will be fully/partially automatically generated.

Also, regarding nix-env's name-based crap: this will probably get a less prominent role with the new Nix CLI.

See https://gist.github.com/edolstra/efcadfd240c2d8906348:

By default, packages are selected by attribute name, rather than the name attribute.

That said, it might still be possible to add names for the applications themselves in nixpkgs, whereas the packages in nixpkgs-python will have their own name attribute.

@FRidh
Member Author

FRidh commented Jun 5, 2016

I've generated a JSON file per package from PyPI, and grouped the packages by first letter.
See https://github.com/FRidh/srcs-pypi
The archive is 80 MB.

@FRidh
Member Author

FRidh commented Jun 5, 2016

1000 issues further, and here (#16005) is a branch where you can use buildPyPIPackage with the generated JSON.
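Going by that branch's description, usage could look roughly like the following sketch; the exact argument names of buildPyPIPackage are assumptions here, the point being that the source URL and hash come from the generated JSON rather than the call site:

{ buildPyPIPackage }:

buildPyPIPackage {
  # Only a name and version are given; the src and sha256 are looked up in
  # the generated per-package JSON (argument names are assumptions).
  pname = "jupyter";
  version = "1.0.0";
}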

@FRidh
Member Author

FRidh commented Jun 22, 2016

Since cc5adac the hashes for KDE 5 are also stored outside of Nixpkgs, in https://github.com/ttuegel/nixpkgs-kde-qt.
Should that repo be moved into the Nixpkgs organization, or can it be kept outside? I suppose the latter is a bit more flexible for @ttuegel.

@ttuegel
Member

ttuegel commented Jun 22, 2016

@FRidh That was reverted, see #16130. It is impossible to have generated Nix expressions in Nixpkgs.

@zimbatm
Member

zimbatm commented Jun 22, 2016

Another issue is that nix-env -qaP would have to download these external repos, wouldn't it?

I started something similar for rubygems: https://github.com/zimbatm/nix-rubygems. I'm still in the process of fetching the hashes of all 800k of them.

@vcunat
Member

vcunat commented Jun 22, 2016

I believe that's the essential problem: the Nix tooling isn't really designed for the possibility of requiring builds to happen during evaluation (downloads are just a specific type of build). If it's just a matter of downloading several megabytes and caching those in the Nix store, that could in principle be made not too obtrusive... Another example: such fetching currently always runs verbose, IIRC, even during queries or --dry-run.

@zimbatm
Member

zimbatm commented Jun 22, 2016

I'm not entirely sure we are all trying to solve the same problem, so maybe we should start with that: what is it that you are trying to solve? Maybe a solution will emerge from that, or we'll just realize we were all trying to solve different things.


As a Ruby developer in a mixed developer environment I cannot ask my colleagues to update the gemset.nix file. And whenever the Ruby dependencies get updated, it also generates an even larger diff in the commits, which is not ideal for reviewing the code (on GitHub).

So my idea was to collect the hashes of all the gems so I could generate the gemset.nix dynamically. Then all I have is a single default.nix in the repo root and an external list of rubygems that I can keep up to date independently. I don't particularly care about the binary cache, so it doesn't matter too much if it gets out of sync with nixpkgs. An optimisation would be for nixpkgs to hold a couple of the expensive ones, like nokogiri, which builds a C extension very slowly, but that can come later.

@FRidh
Member Author

FRidh commented Jun 23, 2016

In the case of Python the goal is to have all of PyPI available because

In #16005 an archive is downloaded and the included default.nix is used to select the right JSON file. That Nix function could also be moved out of the archive I guess. Would there still be a problem with restricted mode on Hydra in that case?

@vcunat
Member

vcunat commented Jun 23, 2016

Would there still be a problem with restricted mode on Hydra in that case?

I believe so. Any importing from derivation shouldn't work, except perhaps for the case where you already have the output.
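For reference, a minimal example of what "importing from derivation" means: the imported file is itself the output of a build, so evaluation has to pause and realise a derivation before it can continue, which is exactly what restricted evaluation refuses to do:

with import <nixpkgs> { };

# The imported .nix file does not exist until the runCommand derivation has
# been built, forcing a build during evaluation.
import (runCommand "generated.nix" { } ''
  echo '{ answer = 42; }' > $out
'')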

@zimbatm
Member

zimbatm commented Jun 23, 2016

What if there was a channel per $lang repository; what would be the drawbacks of that approach? We could still copy one version of selected executables to nixpkgs (or develop tooling to automate that). We would probably lose the binary cache, but that doesn't matter too much for scripting languages, as the build step is non-existent or very short.

@FRidh
Member Author

FRidh commented Jun 23, 2016

What if there was a channel per $lang repository; what would be the drawbacks of that approach?

Archives of Nixpkgs consist only of a specific branch, and I think the idea was to deprecate channels.

We would probably lose the binary cache but it doesn't matter too much for scripting languages as the build step is non-existent or very short.

At least with Python there are several packages that do take a long time to build, and for which the binary cache is therefore important. We could choose to keep those in the main Nixpkgs, but I find such a line quite arbitrary, as I pointed out before.

I believe so. Any importing from derivation shouldn't work, except perhaps for the case where you already have the output.

The Nix manual states

Nix has a new option restrict-eval that allows limiting what paths the Nix evaluator has access to. By passing --option restrict-eval true to Nix, the evaluator will throw an exception if an attempt is made to access any file outside of the Nix search path. This is primarily intended for Hydra to ensure that a Hydra jobset only refers to its declared inputs (and is therefore reproducible).

and

In Nix expressions, via the new builtin function fetchTarball:
with import (fetchTarball https://github.com/NixOS/nixpkgs-channels/archive/nixos-14.12.tar.gz) {};
...
(This is not allowed in restricted mode.)

This is with fetchTarball, which doesn't take a hash. Maybe it does work with fetchurl and fetchgit? I will push #16005 to a Hydra and see what happens.
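For comparison, a sketch of the hashed variant being asked about, with a placeholder sha256 and a purely illustrative rev; note the import itself still happens at evaluation time, so this is import from derivation either way:

with import <nixpkgs> { };

# fetchgit with a declared sha256 is a fixed-output derivation, unlike the
# unhashed fetchTarball quoted above.
import (fetchgit {
  url = "https://github.com/FRidh/srcs-pypi.git";
  rev = "12bc2a4cbcec42a867e4cc2fdef67e76b2c0b169";
  sha256 = "0000000000000000000000000000000000000000000000000000";  # placeholder
})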

@FRidh
Member Author

FRidh commented Jun 28, 2016

I've tested whether a Hydra would build packages that use buildPyPIPackage (#16005) and it indeed doesn't. That is a pity.

@FRidh
Member Author

FRidh commented Aug 13, 2016

@zimbatm perhaps an idea in your case would be to chain two repositories: one that has, per package, a file containing all relevant metadata, and another that contains the filenames and hashes of those files. You would then use the commit identifier in a URL like https://raw.githubusercontent.com/FRidh/srcs-pypi/12bc2a4cbcec42a867e4cc2fdef67e76b2c0b169/data/j/jupyter.json to get the gem.
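A sketch of fetching one such pinned per-package file in Nix; the sha256 is a placeholder, and reading the fetched file at evaluation time runs into the same import-from-derivation limitation discussed above:

{ fetchurl }:

let
  # Pin a specific commit of the data repo via raw.githubusercontent.com.
  rev = "12bc2a4cbcec42a867e4cc2fdef67e76b2c0b169";
in
builtins.fromJSON (builtins.readFile (fetchurl {
  url = "https://raw.githubusercontent.com/FRidh/srcs-pypi/${rev}/data/j/jupyter.json";
  sha256 = "0000000000000000000000000000000000000000000000000000";  # placeholder
}))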

@teto
Member

teto commented Sep 13, 2017

I've just started something to convert rockspecs to Nix: https://github.com/teto/luarocks/tree/nix.

@Ekleog
Member

Ekleog commented Oct 17, 2018

(from triage) I think NUR provides quite a good solution to this issue: have a NUR repository per auto-generated source. This could be compatible with the idea that nixpkgs only keeps the applications' dependencies (auto-imported), solving the size issue, while at the same time allowing a generated complete copy of the archive to co-exist.

I really don't like the idea of chaining two repositories, even for applications: if the second repository becomes necessary in a lot of setups (just think how many packages require at least one Python package), then we might as well just put the two together; at least it'd display the download at the correct time.

@stale

This comment has been minimized.

The stale bot added the 2.status: stale label on Jun 3, 2020.
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/future-of-npm-packages-in-nixpkgs/14285/3

The stale bot removed the 2.status: stale label on Jul 28, 2021.
@stale

stale bot commented Apr 30, 2022

I marked this as stale due to inactivity. → More info

The stale bot added the 2.status: stale label on Apr 30, 2022.