Split objects and a lurking bug #30
It is possible for the build system to kill it after encountering a problem elsewhere. However, none of this should be problematic. Shake (unlike Make) considers an operation to be complete only after the files have been produced and Shake has updated its database to mark the run successful. Therefore, the only possibility I can think of is that GHC succeeds but doesn't generate all the files. That would be weird. Perhaps GHC forgets to check the return code from splitting, or misses an error about it?
Thanks Neil, I see. Then the only explanation I have is that it may indeed be a GHC bug. I also noticed the following behaviour, which may be relevant: sometimes the build system terminates with an error, and I can see a leftover process still running. When this happens, I terminate it manually.
Failing to terminate nested processes, especially on Windows, is not unexpected or surprising. My code to terminate running processes mostly consists of "whack it, whack it a bit more, give up" - and there are lots of circumstances (GHC 7.2-compiled binaries with FFI, Cygwin shell scripts) which cause it to have to give up. See https://github.com/ndmitchell/shake/blob/master/src/General/Process.hs#L127 for this code in Shake.
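The escalation Neil describes can be sketched roughly as follows. This is a hypothetical simplification, not Shake's actual `General.Process` code; `whackIt` and `waitExit` are invented names:

```haskell
import Control.Concurrent (threadDelay)
import System.Process

-- Poll the process until it exits or the timeout (in milliseconds) elapses.
waitExit :: ProcessHandle -> Int -> IO Bool
waitExit ph ms
  | ms <= 0   = return False
  | otherwise = do
      code <- getProcessExitCode ph
      case code of
        Just _  -> return True
        Nothing -> threadDelay 50000 >> waitExit ph (ms - 50)

-- "Whack it, whack it a bit more, give up": ask the process to terminate,
-- wait a little, ask again, then stop caring about the outcome.
whackIt :: ProcessHandle -> IO Bool
whackIt ph = do
  terminateProcess ph
  done <- waitExit ph 1000
  if done
    then return True
    else do
      terminateProcess ph
      waitExit ph 2000   -- whether or not this works, we give up after this
```

As the thread notes, some child processes (FFI-heavy binaries, shell-script wrappers) simply ignore the termination signal, which is why the final step is "give up" rather than a guarantee.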
Ah, actually, I realised why we can have a stray process.
Aha, so it is not necessarily my fault. By the way, is there a nice way to tell Shake to stop? Ctrl-C doesn't seem to work nicely, as other processes keep running (this is especially annoying when a long compilation keeps going).
I think there are some Ctrl-C bugs in Shake, see ndmitchell/shake#169. At work we always spawn it through Cygwin, which itself has Ctrl-C bugs (often the terminal becomes unresponsive), so tracking this down hasn't been a priority.
Got hit by this bug again.
By the way, apparently I was wrong.
Sounds like a GHC bug. Even if Shake didn't record the command as completed, if it reran and the second time round GHC did nothing, then Shake would record that as completed and then everything would be broken.
@ndmitchell So far I could reproduce only the following abnormal behaviour of GHC. Take a minimal module:

```haskell
module Test (test) where

test :: Int
test = 0
```

I compile the above with `-split-objs`, delete some of the split objects, and recompiling does not restore them. This does look like a bug. Shall I submit a GHC ticket?
I guess so - it sounds like the recompilation checker skips the split objects, which would be a bug (maybe one @ezyang is going near?). However, I'm still trying to figure out exactly what went wrong. How did you get to the state where you had the object file, but not all the split objects? In my experiments, GHC creates the .hi file, then the split objects, then the .o file. If I delete any split objects it doesn't rebuild. Also worryingly, if I then go and corrupt the .o file (e.g. like might happen if there is a Ctrl-C while writing the file), it doesn't rebuild (shouldn't this be written somewhere temporary and then moved into place?).
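The parenthetical at the end alludes to the standard write-then-rename idiom. A minimal sketch, purely illustrative (`writeFileAtomic` is an invented helper, not something GHC or Shake exposes):

```haskell
import System.Directory (renameFile)

-- Write the output under a temporary name and rename it into place only
-- once it is complete, so an interruption mid-write cannot leave a
-- truncated or corrupt final file. On POSIX filesystems within one
-- directory the rename step is atomic.
writeFileAtomic :: FilePath -> String -> IO ()
writeFileAtomic path contents = do
  let tmp = path ++ ".tmp"   -- assumed naming convention for the scratch file
  writeFile tmp contents     -- a Ctrl-C here leaves only the .tmp behind
  renameFile tmp path        -- the real path only ever sees complete content
```

Under this scheme a reader of `path` either sees the old complete file or the new complete file, never a partial write.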
There are several levels at which GHC's recompilation checker operates.
Why does this work at all? The critical thing is that the interface recompilation check takes a boolean saying whether or not the source has been modified. If the object file is old, or doesn't exist, then regardless of the state of the interface file GHC concludes that the source code has changed and triggers a rebuild (which in turn triggers the rest of the pipeline to run). Now, is split objects buggy? If you are expecting the output of GHC to be the individual split objects, of course it is: GHC is not checking the right thing when doing its recompilation checks. But if GHC's output is supposed to be the merged (split) object file, then there is no problem. I don't actually know what the intended semantics of `-split-objs` is here.
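The object-file check described above boils down to a timestamp comparison. A hedged sketch of the idea, not GHC's actual code (`sourceModified` is an invented name):

```haskell
import System.Directory (doesFileExist, getModificationTime)

-- If the object file is missing or older than the source, conclude that
-- the source "has been modified" and rerun the whole pipeline, regardless
-- of what the interface (.hi) file says.
sourceModified :: FilePath -> FilePath -> IO Bool
sourceModified src obj = do
  haveObj <- doesFileExist obj
  if not haveObj
    then return True          -- no object at all: must rebuild
    else do
      tSrc <- getModificationTime src
      tObj <- getModificationTime obj
      return (tSrc > tObj)    -- stale object: must rebuild
```

The bug under discussion is precisely that only the merged `.o` participates in this check, so deleting files under `file_o_split` never flips the boolean.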
@ndmitchell, I have no idea how this happens. Current hypothesis: what if Shake starts to kill all spawned processes, quickly manages to kill the one responsible for creating split objects, but the main GHC process survives long enough to write the final object file?

@ezyang Thank you for the detailed comment!
I think we do expect this. The build system needs a way to verify that all split objects are generated in order to proceed with the rest of the build. The easiest way is to assume that when GHC terminates successfully the split objects are up to date. An alternative is to find out how many split objects are expected and do a post-GHC validity check, which seems unnecessarily complicated. I'll go ahead and submit the ticket. Maybe this is an easy fix.
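The rejected post-GHC validity check could look something like the sketch below, under the assumption that the expected count is known somehow, which is exactly the hard part (`splitObjectsComplete` is an invented name):

```haskell
import Data.List (isSuffixOf)
import System.Directory (listDirectory, createDirectoryIfMissing)

-- Count the .o files in a file_o_split directory and compare against an
-- externally supplied expected count. Obtaining that count up front is
-- the open problem that makes this approach unattractive: the number of
-- split objects depends on the contents of the compiled module.
splitObjectsComplete :: FilePath -> Int -> IO Bool
splitObjectsComplete splitDir expected = do
  files <- listDirectory splitDir
  return (length (filter (".o" `isSuffixOf`) files) == expected)
```

(The `createDirectoryIfMissing` import is only for the usage example.) In the `GHC/Show.hs` failure from this thread, such a check would have reported 192 objects against an expected 324.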
The ticket: https://ghc.haskell.org/trac/ghc/ticket/11315.
So, I just double-checked, and GHC does merge the split object files into a single object file. Why doesn't Shake depend on that file?
It does! However, we somehow end up in a problematic state where the single object file exists, yet some split objects are missing. Then, when we rerun the build, GHC doesn't want to restore the missing split objects, and the only way forward is to manually delete the complete object file to force GHC to restore things to a consistent state.
Actually, maybe the single object file is also incomplete in such situations: it may be just a merge of the split objects that have been created. I haven't thought about this before, but it could explain what happens. Next time the bug occurs I'll check this.
This all sounds very impossible. It should not be possible, even with hard kills, for GHC to fail to write out all of the split objects but still write an up-to-date object file; if any of the assembling fails, that should abort the entire pipeline. The relevant code, by the way, is in GHC's driver pipeline.
Thanks @ezyang. I don't have an explanation yet. I tried to reproduce this in a controlled situation, but no luck so far. One interesting scenario I observed: if I kill the Shake process, it fails to terminate the GHC processes that keep working on split objects; they eventually succeed, creating complete object files. However, I can restart Shake and then have two GHC processes running at the same time: the old one and the new one, working on the same object file. It looks like in this situation the old GHC gets stuck (and never terminates) while the new GHC erases the split objects, starts over, and eventually completes the job. With several generations of GHC processes running in parallel, we might be able to get an impossible-sounding outcome.
I came across a bug which is difficult to trigger and which manifests itself in a very mysterious way. It cost me two hours of debugging to catch.

When `ghc` is compiling a `file.hs`, it produces the object `file.o` and, after that, multiple smaller objects placed in the directory `file_o_split` (when executed with the flag `-split-objs`). If `ghc` is terminated during the creation of split objects, we may get only some split objects but not all of them. Since we don't know how many of them there should be, we can't check that the result is correct. Missing split objects later cause link errors elsewhere, which are difficult to understand because some split objects are there (in my case there were 192 split objects instead of 324 for the source file `GHC/Show.hs` in `base`).

I'm not yet sure why `ghc` was terminated prematurely in that particular case. Perhaps it crashed with a segfault (sometimes happens with `-j8`), or the build system killed it after it encountered a problem elsewhere. @ndmitchell, is the latter possible?

Let's discuss possible solutions here. Two ideas so far:

- Change `ghc` so that it first produces split objects and only after that generates the main object file. Then Shake will automatically take care of the rest.
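For illustration, the naming scheme described in the issue (`file.o` alongside `file_o_split`) can be captured as a tiny helper. This is a hypothetical function assuming the layout exactly as stated above, not part of GHC:

```haskell
import Data.List (isSuffixOf)

-- Map an object path to its -split-objs directory, per the layout
-- described in this issue: GHC/Show.o -> GHC/Show_o_split.
splitDirFor :: FilePath -> FilePath
splitDirFor obj
  | ".o" `isSuffixOf` obj = take (length obj - 2) obj ++ "_o_split"
  | otherwise             = obj ++ "_o_split"
```

A build system wanting to depend on the split pieces directly would need exactly this mapping, plus the split-object count that the thread concludes is not knowable in advance.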