Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests hang with ghc 7.10 #148

Open
gelisam opened this issue Jul 5, 2015 · 36 comments
Open

Tests hang with ghc 7.10 #148

gelisam opened this issue Jul 5, 2015 · 36 comments

Comments

@gelisam
Copy link
Owner

gelisam commented Jul 5, 2015

Continuing from #147.

@gelisam gelisam added this to the 1.2 milestone Jul 5, 2015
@gelisam
Copy link
Owner Author

gelisam commented Jul 9, 2015

My last comment was that it hangs inside GHC.typecheckModule. I have annotated that part of the GHC source, and I can now say that it hangs at the following line inside that function:

safe    <- liftIO $ finalSafeMode (ms_hspp_opts ms) tc_gbl_env

I have no idea what any of those things (except liftIO) mean.

@gelisam
Copy link
Owner Author

gelisam commented Jul 9, 2015

The last commit which touched this line says "Fix GHC API with respect to safe haskell", so I guess this is the code which determines whether the code is "inferred safe", "trusted", etc.

@gelisam
Copy link
Owner Author

gelisam commented Jul 10, 2015

What? It didn't hang at the same line inside GHC.typecheckModule this time. I should run it a few times to see if the bug's behaviour looks non-deterministic.

@gelisam
Copy link
Owner Author

gelisam commented Jul 10, 2015

Nope, same line three times in a row. Did I misread the line number the first time? Or did the IO commands I added to pinpoint this line somehow change the behaviour?

Recompiling GHC takes forever, so I'm not really inclined to revert my changes to see if the number reverts from 860 to 868. I'm using the 7.10.2 release candidate, because it has a nice ./configure; make; make install which actually works on my machine, whereas compiling from the git repo didn't work for me in the past despite following the instructions. I think make install is recompiling both stage 1 and stage 2, which is why it's taking forever.

@gelisam
Copy link
Owner Author

gelisam commented Jul 10, 2015

I just timed it and "forever" is actually just 16 minutes. It's not that bad.

@gelisam
Copy link
Owner Author

gelisam commented Jul 10, 2015

Yup, looks like I simply misread the number. I reverted my changes, and the hang is still at 860:

(tc_gbl_env, rn_info)
      <- liftIO $ hscTypecheckRename hsc_env_tmp ms $
                     HsParsedModule { hpm_module = parsedSource pmod,
                                      hpm_src_files = pm_extra_src_files pmod,
                                      hpm_annotations = pm_annotations pmod }

@gelisam
Copy link
Owner Author

gelisam commented Jul 10, 2015

This makes no sense. hscTypecheckRename hangs on a return??

return (tc_result, rn_info)

@gelisam
Copy link
Owner Author

gelisam commented Jul 11, 2015

No, that would be silly. It doesn't hang anymore?

@gelisam
Copy link
Owner Author

gelisam commented Jul 11, 2015

Oh! hscTypecheckRename is an IO computation, but the return is in the Hsc monad. So the hang is probably at the end of runHsc.

@gelisam
Copy link
Owner Author

gelisam commented Jul 11, 2015

Indeed, in hangs inside runHsc, at this line:

printOrThrowWarnings (hsc_dflags hsc_env) w

I almost looked at the wrong line again, and now I know why: I'm debugging by inserting print statements, but since I don't know what's the equivalent of __LINE__ in Haskell, I used some vim macro magic to type in the number of the line at the time I insert the debug statement. Unfortunately as I later edit the code, those line numbers get slightly out of sync. So when my last debugging information talks about line N, I shouldn't assume that GHC hangs on line N+1, but that it hangs on the line after the debugging statement which contains the number N!

@gelisam
Copy link
Owner Author

gelisam commented Jul 11, 2015

printBagOfErrors dflags warns

@gelisam
Copy link
Owner Author

gelisam commented Jul 11, 2015

log_action dflags dflags sev s style (d $$ e)

On the second of four warnings.

@gelisam
Copy link
Owner Author

gelisam commented Jul 11, 2015

log_action is a DynFlags member, it could take a while to track down its source (it would be much easier if I could reproduce the problem in ghci). Can I obtain more information about this second warning? Can I print it out without causing a hang?

@gelisam
Copy link
Owner Author

gelisam commented Jul 11, 2015

src/Control/Monad/Trans/Uncertain.hs:19:21: Warning:
    In the use of type constructor or class ‘ErrorT’
    (imported from Control.Monad.Trans.Error):
    Deprecated: "Use Control.Monad.Trans.Except instead"

This message gets printed over and over (see #142). The hang is at the first occurrence of this particular message. The previous very similar message, about an earlier line, does not cause any hang:

src/Control/Monad/Trans/Uncertain.hs:8:1: Warning:
    Module ‘Control.Monad.Trans.Error’ is deprecated:
      Use Control.Monad.Trans.Except instead

@gelisam
Copy link
Owner Author

gelisam commented Jul 11, 2015

The hang is deterministic, in the sense that running the tests multiple times causes a hang at the exact same line, but it's also unstable, in the sense that the extra test calls to log_action I've added are causing the hang to occur at a later warning. It's an IO computation, so I guess that's not entirely surprising.

@gelisam
Copy link
Owner Author

gelisam commented Jul 12, 2015

I can reproduce the problem by running log_action many times until it hangs on the 77th attempt. I call it with canned values, except for the DynFlags input which is a huge data structure with 127 fields (!). By running my loop in all the functions from typecheckModule to printBagOfErrors, I was hoping to see the moment at which dflags or whatever piece of IO changes in a way which makes this loop hang, always at the 77th iteration. Instead, the hang now appears earlier, before typecheckModule is even called. Way before that: doctest is called on many files and hung in Uncertain.hs, it is now hanging in Cache.hs, the first file on which doctest is called. And the hang occurs on an early call which analyses the dependencies:

mods <- depanal [] False

That computation eventually calls printOrThrowWarnings, which runs the loop and hangs.

Next, I plan to add more loops between depanal and printOrThrowWarnings.

@gelisam
Copy link
Owner Author

gelisam commented Jul 12, 2015

The very first loop at the beginning of depanal caused a hang, on the 77th iteration as usual. Can I expose my hanging loop to doctest directly?

@gelisam
Copy link
Owner Author

gelisam commented Jul 12, 2015

Yes I can. Next, how much of doctest's setup can I skip while still reproducing the hang?

@gelisam
Copy link
Owner Author

gelisam commented Jul 12, 2015

How much can I remove? All of it.

@gelisam
Copy link
Owner Author

gelisam commented Jul 13, 2015

Very strange. The hang disappeared! I can't reproduce it at all.

I was happy to have such a minimal program which could reliably reproduce the hang, so I decided to write down the steps to reproduce it from scratch. To test those steps, I scrapped and re-created my sandbox... and now I can no longer reproduce the bug. I can't even compile the test suite: in order to follow the steps from TravisCI as closely as possible, I was using cabal-1.18, but when I clear the sandbox and try to build with that version and GHC 7.10.1, I encountered an error I have never seen before:

ghc no longer supports single-file style package databases (dist/package.conf.inplace) use 'ghc-pkg init' to create the database with the correct format.

Googling this error tells me that I should upgrade to a more recent version of cabal. When I do, the hang no longer occurs. This isn't entirely surprising: the hang was only occurring when using cabal to run the tests, so it makes sense that using a different version of cabal would hide the problem. But since the hang is occurring inside the latest version of GHC, not inside an old version of cabal, I don't think the conclusion is simply that cabal was broken in 1.18, is fixed in 1.22, and we don't need to worry anymore.

@gelisam
Copy link
Owner Author

gelisam commented Jul 16, 2015

The hang is real: Travis CI hangs on the tests when using GHC head. But I can no longer reproduce it on my machine (why??), so I'll have to postpone working on it.

@gelisam gelisam added the hard label Jul 19, 2015
@gelisam
Copy link
Owner Author

gelisam commented Jul 20, 2015

I can reproduce the problem on my linux VM, by following all the steps from the Travis CI including the apt-get steps:

$ sudo add-apt-repository -y ppa:hvr/ghc
$ sudo apt-get update
$ sudo apt-get install cabal-install-1.18 happy-1.19.3 ghc-head
$ export PATH="/opt/happy/1.19.3/bin:/opt/ghc/head/bin:$PATH"
$ cabal-1.18 update
$ cabal-1.18 install --only-dependencies --enable-tests --enable-benchmarks
$ cabal-1.18 configure --enable-tests --enable-benchmarks -v2
$ cabal-1.18 build
$ cabal-1.18 test
<hangs>

@gelisam
Copy link
Owner Author

gelisam commented Jul 20, 2015

I'd like to test with other versions of cabal and ghc, but:

  1. switching to cabal-1.22 doesn't work, because cabal test complains that I need to re-run the "configure" command, yet cabal configure doesn't help.
  2. switching to ghc-7.10.1 causes the error "ghc no longer supports single-file package databases" during the installation of packages, presumably because cabal-1.18 is too old to know about the newer format.
  3. switching to cabal-1.22 and to ghc-7.10.1 simultaneously, I get the error "setup: unrecognized flags: reference".

I think the latter error is because of my custom Setup.hs. Can I reproduce the hang without it?

@gelisam
Copy link
Owner Author

gelisam commented Jul 20, 2015

With the Simple setup method, cabal-1.18 and ghc-head, the cabal test terminates... and fails? doctest cannot find Hawk's runtime? Weird, it's installed to ~/.cabal. It's not hard to find!

@gelisam
Copy link
Owner Author

gelisam commented Jul 20, 2015

That's weird: I pass the path of the main source and autogenerated files to doctest, but not the path of the runtime. How did that ever work before? I guess the runtime was always already installed for the tests?

Come to think of it, the runtime does need to be installed in order for the tests to succeed, that's the entire point of the custom Setup.hs. Ah, right, I disabled it! So the runtime did not get installed to ~/.cabal after all.

@gelisam
Copy link
Owner Author

gelisam commented Jul 20, 2015

Adding an extra step cabal install before cabal test, the doctests now pass... but the HSpec tests now fail. And it's a real failure! With the binary we've just installed, running hawk [] fails because the type of the expression is ambiguous. Is this a new behaviour in ghc-7.10?

@gelisam
Copy link
Owner Author

gelisam commented Jul 20, 2015

Nope, works fine with ghc-7.10, only fails with ghc-head. We'll cross that bridge when we get there. I'll disable that part of the test for the purpose of investigating this bug.

@gelisam
Copy link
Owner Author

gelisam commented Jul 20, 2015

So, with ghc-head (7.11.20150717) and cabal-1.18.1.3, the tests pass when I use the simple build type, but fail when I use the custom build type. What is different in the two cases? Calling cabal test with -v2, I observe the following differences.

custom $ ghc-pkg init dist/package.conf.inplace

That's clearly related to the error when I try to use ghc 7.10! That's weird though, the error message clearly says that ghc no longer supports it. Maybe support was reintroduced in head?

simple $ ghc --info
simple $ ghc --info

Well that looks doubly useless.

simple $ ghc ... -package-name haskell-awk-1.1.1 ...
warning: -package-name is deprecated, use -this-package-key instead
custom $ ghc ... -this-package-key LcMR...HItnG

Why would the simple and custom methods differ on this? Are they using a different version of the Cabal library behind the scenes again?

simple $ ar -r dist/build/libHShaskell-awk-1.1.1.a ...
custom $ ar -r dist/build/objs-14144/libHShaskell-awk-1.1.1.a-LcMR...HItnH.a ...
custom $ strip dist/build/objs-14144/libHShaskell-awk-1.1.1.a-LcMR...HItnH.a --strip-unneeded

Another indication of a different version of Cabal, or a simple consequence of the -package-name difference?

simple $ ghc ...
             -package-name haskell-awk-1.1.1
             -no-auto-link-packages
             -package-db dist/package.conf.inplace
             ...
             -o dist/build/libHShaskell-awk-1.1.1-ghc7.11.20150717.so
custom $ ghc ...
             '-dynload deploy'
             -optl-Wl,-rpath,/home/gelisam/.cabal/lib/x86_64-linux-ghc-7.11.20150717/{HUnit-1.2.5.2,QuickCheck-2.8.1,...}
             -optl-Wl,-rpath,opt/ghc/head/lib/ghc-7.11.20150717/{base-4.8.2.0-HQfY...,bin-package-db-0.0.0.0-...,binary-0.7.5.0-...,...}
             -this-package-key LcMR...HItnG
             -no-auto-link-packages
             -package-db dist/package.conf.inplace
             ...
             -o dist/build/libHShaskell-awk-1.1.1-LcMR...HItnG-ghc7.11.20150717.so

Interestingly, both are using package.conf.inplace, even though only the simple build bothered to initialize this package database. Maybe cabal detected that it was already initialized? I didn't bother cleaning anything between the two runs.

simple $ ghc --info
simple $ ghc --info

Still useless.

simple $ ghc --make --no-link ... -package-id base.4.8.2.0-5a3f... ...
custom $ ghc --make --no-link ... -package-id base.4.8.2.0-5a3f... ... -package-id haskell-awk-1.1.1-inplace ...

Linking...
simple $ ghc --make ... -package-id base.4.8.2.0-5a3f... ...
custom $ ghc --make ... -package-id base.4.8.2.0-5a3f... ... -package-id haskell-awk-1.1.1-inplace ...

Straightforward consequence of the -package-name difference.

custom $ ./dist/setup test ...
...
installing LICENSE
...
installing library
...
installing executable
...
custom $ ghc --abi-hash ...
Registeding haskell-awk-1.1.1...
custom $ ghc-pkg update - --global --user

Hmm, the fact that ./dist/setup is only called in the custom mode is normal, but why would it only install stuff in that mode?

@gelisam
Copy link
Owner Author

gelisam commented Jul 25, 2015

Since running cabal test with the simple and custom modes ends up running different sets of commands, I'd like to simplify the problem by ignoring cabal and reproducing the problem using only the commands. Running cabal with -v2 lists many commands, but apparently not all of them, since when I run them, I get an error complaining about some packages which are not installed. I then tried to run cabal with -v3, and with those I am able to reproduce a hang, but not during the execution of the test suite! Instead, I get a hang during ghc-pkg update, a command with which I am not familiar.

@gelisam
Copy link
Owner Author

gelisam commented Jul 25, 2015

Aha! ghc-pkg update - did not hang, it was simply expecting input from stdin :)

Now, which input did it expect? Does -v3 say? edit: no.

@gelisam
Copy link
Owner Author

gelisam commented Jul 25, 2015

What's especially weird is that running the reference executable immediately after cabal test hangs on that very executable does not itself hang. So unless running reference somehow affects the state of something in such a way that the second run doesn't hang, it would seem that cabal runs reference in a special way which triggers the hang.

With -v3, we see the environment variables which reference is executed with. I tried adding those variables to my shell's environment and running reference, but it did not hang.

@gelisam
Copy link
Owner Author

gelisam commented Jul 25, 2015

I guess I'll have to modify the source of cabal itself. But cabal-1.18 doesn't build with ghc-head. Which means that using cabal-1.18 to build with ghc-head was probably a stretch already, but whatever anticated practices cabal-1.18 used, it shouldn't cause ghc-head to hang, should it?

@gelisam
Copy link
Owner Author

gelisam commented Jul 25, 2015

I'm trying to modify cabal, but it's tricky: the part I want to modify is actually in the Cabal library, not in the cabal-install executable, and even though I have linked cabal-install with my modified Cabal library, running cabal test still uses the unmodified Cabal library. Maybe because my custom cabal-install builds the Setup.hs file by linking it with the unmodified Cabal library?

@gelisam
Copy link
Owner Author

gelisam commented Jul 25, 2015

Before I get too far off track, here's how I reproduce the problem using a custom cabal-install and a custom Cabal library (which doesn't get used, but whatever). Earlier I got into a state from which I couldn't reproduce the problem anymore, and I don't want to make that mistake again.

$ rm -rf dist ~/.cabal ~/.ghc  # start from scratch, to make sure this is reproducible
$ cd ~/Cabal-1.18.1.3
$ export PATH="..."            # using the system's ghc-7.6.3
$ cabal-1.18 update            # using the system's cabal
$ cabal-1.18 install
$ cd ~/cabal-install-1.18.0.4
$ cabal-1.18 install
$ cd ~/hawk
$ export PATH="..."            # using the system's ghc-head
$ cabal install --only-dependencies --enable-tests  # using our custom cabal
$ cabal test                   # hangs

@gelisam
Copy link
Owner Author

gelisam commented Jul 25, 2015

I think the reason why my custom Cabal library isn't being used by Setup.hs is because of the switch from ghc-7.6.3 to ghc-head. Can I compile Cabal and cabal-install using ghc-head?

@gelisam
Copy link
Owner Author

gelisam commented Jul 25, 2015

Yes, via --allow-newer. Which didn't exist in cabal-1.18, so I had to edit Cabal's .cabal file. For Cabal, I also had to hide the die symbol which has appeared in a later versions of System.Exit. And for cabal-install, I had to keep the upper bound on zlib, and add an Applicative instance.

This worked! Setup.hs is now using my custom Cabal.

But it did not hang. Why not?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant