Evaluation error blocking nixos-unstable #143937

Closed
r-burns opened this issue Oct 31, 2021 · 20 comments
Labels
0.kind: bug (Something is broken), 1.severity: channel blocker (Blocks a channel)

Comments

@r-burns
Contributor

r-burns commented Oct 31, 2021

An eval error is failing hydra evals of nixos-unstable (trunk-combined), preventing channel updates:

https://hydra.nixos.org/jobset/nixos/trunk-combined#tabs-errors

The last successful eval was 2 days ago, on commit 2deb07f

The error is masked by a Hydra bug (NixOS/hydra#822) so the only error message we get is

error: unexpected EOF reading a line
r-burns added the 0.kind: bug and 1.severity: channel blocker labels Oct 31, 2021
@r-burns
Contributor Author

r-burns commented Oct 31, 2021

The error can be reproduced locally with nix-shell -p hydra-unstable --run 'hydra-eval-jobs -I . nixos/release-combined.nix --verbose'. I am currently bisecting on this.

@r-burns
Contributor Author

r-burns commented Oct 31, 2021

Dang, hydra-eval-jobs is super slow and I gotta get some sleep. Here's where I'm at if someone else wants to pick it up:

git bisect start
# bad: [c713c5d261be7849dab40f4c635abde82eab2ffd] Merge pull request #143910 from jtojnar/gpaste
git bisect bad c713c5d261be7849dab40f4c635abde82eab2ffd
# good: [2deb07f3ac4eeb5de1c12c4ba2911a2eb1f6ed61] Merge pull request #143289 from TredwellGit/electron_13
git bisect good 2deb07f3ac4eeb5de1c12c4ba2911a2eb1f6ed61
# bad: [b96ab960d3a2ef371355f149327975799d41ed67] vtun: remove
git bisect bad b96ab960d3a2ef371355f149327975799d41ed67
# bad: [28f111989d54f0bf5b8ea145ad2b8aac6b3003d7] python38Packages.scikit-hep-testdata: 0.4.9 -> 0.4.10
git bisect bad 28f111989d54f0bf5b8ea145ad2b8aac6b3003d7
# bad: [eddbc253f4dbff3a7f80dc157126b142e11fe717] Merge pull request #143304 from sheeaza/patch-1
git bisect bad eddbc253f4dbff3a7f80dc157126b142e11fe717
# bad: [19e71dfa871977759b6a3eff7479e4a07e77038b] gromacs: 2020.4 -> 2021.3, cuda, mpi cleanups, performance tunings
git bisect bad 19e71dfa871977759b6a3eff7479e4a07e77038b
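
For anyone picking this up, the log above does not need to be retyped. A minimal sketch, assuming the lines are saved to a file inside a nixpkgs checkout; the file name `bisect.log` and the helper name `resume_bisect` are made up for illustration. The `# bad:` and `# good:` comment lines are the same ones `git bisect log` emits, so they can stay in the file.

```shell
#!/usr/bin/env bash
# Sketch: restore a bisection from a saved log.
# `resume_bisect` is a hypothetical helper name, not a git command.
resume_bisect() {
  local log_file="$1"
  if [ ! -r "$log_file" ]; then
    echo "resume_bisect: cannot read '$log_file'" >&2
    return 2
  fi
  # Re-applies every recorded good/bad mark and checks out the next
  # revision that needs testing.
  git bisect replay "$log_file"
}

# Usage, from the root of a nixpkgs checkout:
#   resume_bisect bisect.log
#   git bisect log    # shows the restored state
```

After the replay, marking the checked-out revision with `git bisect good` or `git bisect bad` continues from where the log left off.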

@vcunat
Member

vcunat commented Oct 31, 2021

@r-burns: I suppose the messages from --verbose aren't helpful for identifying the problem?

@r-burns
Contributor Author

r-burns commented Oct 31, 2021

I had hoped they would be but I didn't see anything obvious

@vcunat
Member

vcunat commented Oct 31, 2021

It's morning here, so let me continue 😄

@collares
Member

The linked hydra logs include this:

error: value is a function while a set was expected

       at /nix/store/apli94zyi7haaygpj3xdy0qaiw7zx35x-source/lib/attrsets.nix:381:46:

          380|   */
          381|   zipAttrsWith = f: sets: zipAttrsWithNames (concatMap attrNames sets) f sets;
             |                                              ^
          382|   /* Like `zipAttrsWith' with `(name: values: values)' as the function.

It's somewhere in the middle of the log, so it is easily overlooked. Is there a way to run the job with the equivalent of --show-trace to see where the error comes from?
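
Because the message sits in the middle of a long log, one way to surface it is to save the output and search for `error:` lines with some context. A minimal sketch, assuming the output of the local reproduction command was written to a file; `find_eval_errors` and `LOG_CONTEXT` are names invented here, not Hydra or Nix options.

```shell
#!/usr/bin/env bash
# Sketch: print every "error:" line of a saved log with line numbers
# and a few lines of trailing context, so buried messages stand out.
# LOG_CONTEXT is a made-up knob for this sketch (default: 8 lines).
find_eval_errors() {
  local log_file="$1"
  local context="${LOG_CONTEXT:-8}"
  grep -n -A "$context" -E '^[[:space:]]*error:' "$log_file"
}

# Usage (eval.log is a hypothetical file name):
#   hydra-eval-jobs -I . nixos/release-combined.nix --verbose 2>&1 | tee eval.log
#   find_eval_errors eval.log
```

This only locates the message in the log; it does not add the origin information that a trace would give.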

@vcunat
Member

vcunat commented Oct 31, 2021

git bisect good 398a73ac980f2e89981d8d704d7e800bb7a9bfaf
Bisecting: 10 revisions left to test after this (roughly 4 steps)

It's slow as usual. The single step took 160 minutes, and I don't think it's a weak machine.

@vcunat
Member

vcunat commented Oct 31, 2021

I don't think I can do the bisection properly. On bffe6436 it ran relatively normally for about 80 minutes and then suddenly started to consume way too much RAM. When I killed it, it used 16G RSS + 17G swap (compressed in zram but still).

Well, I could assume that these explosions only happen in bad cases (or that the waste is bad in itself), though it's still quite annoying. If someone can do this with more RAM, it might work better.

@Artturin
Member

I can run it on @jonringer's 252G machine, just give me the command.

@vcunat
Member

vcunat commented Oct 31, 2021

This one should suffice:

hydra-eval-jobs -Inixpkgs=. nixos/release-combined.nix

@Artturin
Member

All the commands please, I haven't used bisect much.

@vcunat
Member

vcunat commented Oct 31, 2021

Ah, you first replay the log of git bisect commands from the previous comments. Then it depends... the automatic way would be git bisect run hydra-eval-jobs -Inixpkgs=. nixos/release-combined.nix

I often like to inspect the individual failures by hand (to make sure that it's the same error), but in this case we haven't found anything useful in the logs, so it's probably OK to leave it to that automatic git bisect run.
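
One caveat with an unattended `git bisect run`: it only looks at the command's exit status. Status 0 means good, 125 means skip, any other status from 1 to 127 means bad, and anything above 127 aborts the whole bisection. A process killed by a signal is reported by the shell as 128 plus the signal number, so an evaluation that gets killed for using too much memory would stop the run. A minimal wrapper sketch under those assumptions; treating killed runs as "skip" is a judgment call (they could just as well count as bad), and the helper names `classify_status` and `bisect_step` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: translate a raw exit status into one `git bisect run` accepts.
classify_status() {
  local status="$1"
  if [ "$status" -eq 0 ]; then
    echo 0      # evaluation succeeded: good
  elif [ "$status" -ge 128 ]; then
    echo 125    # killed by a signal: skip this revision (assumption)
  else
    echo 1      # ordinary failure: bad
  fi
}

# Run the given command and return the translated status.
bisect_step() {
  local status=0
  "$@" || status=$?
  return "$(classify_status "$status")"
}

# Usage inside a hypothetical wrapper script ./bisect-eval.sh, which is
# then started with `git bisect run ./bisect-eval.sh`:
#   bisect_step hydra-eval-jobs -Inixpkgs=. nixos/release-combined.nix
#   exit $?
```

This does not check that each failure is the same error, so inspecting a few steps by hand is still worthwhile.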

@collares
Member

collares commented Oct 31, 2021

https://github.com/NixOS/nixpkgs/blob/master/nixos/tests/ghostunnel.nix#L1 starts with

{ pkgs, ... }: import ./make-test-python.nix {

while every other test in this directory starts with something like

import ./make-test-python.nix ({ pkgs, ... }: {

Perhaps something in #140792 caused this difference to break things? That's the only thing in the bisected range that looks relevant. (Edit: To be more precise, lib/testing-python.nix:makeTest was changed in this range. This function is called in make-test-python.nix, which is called by ghostunnel.nix.)

cc @roberth who is the ghostunnel test maintainer.

I looked at ghostunnel.nix after running hydra-eval-jobs -I . nixos/release-combined.nix --verbose locally because it's the test that's evaluated immediately before the error I mentioned in #143937 (comment).

@vcunat
Member

vcunat commented Oct 31, 2021

I can't see that path being touched in the current bisection range. EDIT: but it's e.g. possible there are multiple errors.

@roberth
Member

roberth commented Oct 31, 2021

ghostunnel does seem unrelated but here's a fix #144001.

@roberth
Member

roberth commented Oct 31, 2021

I'm getting extreme memory usage from just the nextcloud tests. I've stopped evaluation at ~23GB.

@vcunat
Member

vcunat commented Oct 31, 2021

From my incomplete tests I (also) find it likely that at least one of the big problems came from PR #140792.

vcunat referenced this issue Oct 31, 2021
Make sure the all derivations referenced by the test script are
available on the nodes. Accessing these derivations works just fine
without this change when using 9p to mount the host's store, but when
an image is built (virtualisation.buildRootImage), the dependencies
need to be copied to the image. We don't want to copy the script
itself, though, since that would trigger unnecessary image rebuilds.
@r-burns
Contributor Author

r-burns commented Oct 31, 2021

Good morning everyone, I tried an unattended git bisect run overnight and it points to 329a446 as the breaking commit. Seems like there is some agreement there.

@roberth
Member

roberth commented Oct 31, 2021

Fix: #144014

@vcunat
Member

vcunat commented Oct 31, 2021

Evaluated: https://hydra.nixos.org/eval/1717845
