This repository has been archived by the owner on Aug 2, 2020. It is now read-only.

Split objects and a lurking bug #30

Closed
snowleopard opened this issue Dec 23, 2015 · 21 comments

Comments

@snowleopard
Owner

I came across a bug that is difficult to trigger and manifests itself in a very mysterious way. It cost me two hours of debugging to catch.

When ghc is compiling a file.hs it produces the object file.o and after that multiple smaller objects placed in directory file_o_split (when executed with flag -split-objs). If ghc is terminated during the creation of split objects, we may get only some split objects but not all of them. Since we don't know how many of them there should be, we can't check that the result is correct. Missing split objects later cause link errors elsewhere, which are difficult to understand because some split objects are there (in my case there were 192 split objects instead of 324 for source file GHC/Show.hs in base).

I'm not yet sure why ghc was terminated prematurely in that particular case. Perhaps it crashed with a segfault (this sometimes happens with -j8), or the build system killed it after it encountered a problem elsewhere. @ndmitchell, is the latter possible?

Let's discuss possible solutions here.

Two ideas so far:

  • Change ghc so it first produces split objects and only after that generates the main object file. Then Shake will automatically take care of the rest.
  • Find out how many split objects should be generated and use this to check the correctness of the result.
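
A minimal sketch of the second idea, assuming the expected number of split objects is somehow already known (how to obtain it is exactly the open question) and assuming split objects carry the .o extension. The names countSplitObjects and splitObjectsComplete are hypothetical and are not part of GHC or Shake.

```haskell
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath (takeExtension)

-- Count the object files currently present in a split-object directory
-- such as "Show_o_split". A missing directory counts as zero.
-- Assumption: split objects use the ".o" extension.
countSplitObjects :: FilePath -> IO Int
countSplitObjects dir = do
  exists <- doesDirectoryExist dir
  if not exists
    then pure 0
    else length . filter ((== ".o") . takeExtension) <$> listDirectory dir

-- Compare the count on disk against an expected count. Where 'expected'
-- comes from is not specified here; it is simply a parameter.
splitObjectsComplete :: Int -> FilePath -> IO Bool
splitObjectsComplete expected dir = (== expected) <$> countSplitObjects dir
```

With the numbers from the case above, splitObjectsComplete 324 on the GHC/Show split directory would have returned False, since only 192 objects were present.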
@ndmitchell
Collaborator

It is possible for the build system to kill it after encountering a problem elsewhere. However, none of this should be problematic. Shake (unlike Make) considers an operation complete only after the files have been produced and Shake has updated its database to mark the run successful. So if ghc segfaults, or aborts strangely with a non-zero exit, there will be an exception, and Shake won't record that the operation ran at all. Neither of your ideas should be necessary; Shake ensures that property by design.

Therefore, the only possibility I can think of is that GHC succeeds but doesn't generate all the files. That would be weird. Perhaps GHC forgets to check the return code from splitting or misses an error about it?
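
The property described above (a failed action is never recorded as done) can be illustrated with a toy model. This is not Shake's implementation or API: BuildDb, newBuildDb, completedTargets and runRule are hypothetical names, and the "database" is just an in-memory list.

```haskell
import Control.Exception (SomeException, try)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for a build database: the targets whose actions
-- are known to have completed.
type BuildDb = IORef [FilePath]

newBuildDb :: IO BuildDb
newBuildDb = newIORef []

completedTargets :: BuildDb -> IO [FilePath]
completedTargets = readIORef

-- Run the action for a target and record the target only if the action
-- returned normally. Any exception (for example a non-zero exit turned
-- into an exception) leaves the database untouched.
runRule :: BuildDb -> FilePath -> IO () -> IO Bool
runRule db target action = do
  result <- try action :: IO (Either SomeException ())
  case result of
    Left _   -> pure False
    Right () -> do
      modifyIORef' db (target :)
      pure True
```

In this model the failure mode discussed here can only arise when the action itself returns normally without having produced all of its files.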

@snowleopard
Owner Author

Thanks Neil, I see. Then the only explanation I have is that it may indeed be a GHC bug.

I also noticed the following behaviour which may be relevant. Sometimes the build system terminates with the following error:

[...] cannot open output file inplace/bin/ghc-stage1.exe: Permission denied
collect2: ld returned 1 exit status

And I can see a ghc-stage1.exe process running in the system. Since I never run it myself it can only be started during a build, which means Shake didn't or couldn't terminate it. I never caught the moment when it actually happened, only after the fact when the above error occurred.

When this happens, I terminate the ghc-stage1.exe process, restart and things usually go smoothly. This is yet another possibility for GHC to be terminated prematurely.

@ndmitchell
Collaborator

Failing to terminate nested processes, especially on Windows, is not unexpected or surprising. My code to terminate running processes mostly consists of "whack it, whack it a bit more, give up" - and there are lots of circumstances (GHC 7.2 compiled binaries with FFI, Cygwin shell scripts) which cause it to have to give up. See https://github.com/ndmitchell/shake/blob/master/src/General/Process.hs#L127 for this code in Shake.

@snowleopard
Owner Author

Ah, actually, I realised why we can have a stray ghc-stage1.exe process: I sometimes kill the Shake process myself (sorry about that Neil!). So the above is probably unrelated to this issue.

@snowleopard
Owner Author

> Failing to terminate nested processes, especially on Windows, is not unexpected or surprising.

Aha, so it is not necessarily my fault.

By the way, is there a nice way to tell Shake to stop? Ctrl-C doesn't seem to work nicely as other processes keep running (this is especially annoying when a long configure script is running and I want to restart the build).

@ndmitchell
Collaborator

I think there are some Ctrl C bugs in Shake, see ndmitchell/shake#169. At work we always spawn it through Cygwin, which itself has Ctrl C bugs (often the terminal becomes unresponsive), so tracking this down hasn't been a priority.

@snowleopard
Owner Author

Got hit by this bug again. By the way, a complete rebuild with -B doesn't help, because ghc doesn't want to recompile the affected file (interface files are unchanged). Very annoying.

@snowleopard
Owner Author

By the way, apparently I was wrong: ghc first produces split objects and then the complete object file.

@ndmitchell
Collaborator

Sounds like a GHC bug. Even if Shake didn't record the command as completed, if it reran and the second time round GHC did nothing, then Shake would record that as completed and then everything would be broken.

@snowleopard
Owner Author

@ndmitchell So far I could reproduce only the following abnormal behaviour of ghc.

module Test (test) where

test :: Int
test = 0

I compile the above with ghc -split-objs Test.hs and get two split object files in Test_o_split. Then if I delete any of these files, or even the whole folder, and rerun ghc -split-objs Test.hs the compiler doesn't bother to restore split objects, pretending that everything is up-to-date. However, if I delete Test.o all object files are restored.

This does look like a bug. Shall I submit a GHC ticket?

@ndmitchell
Collaborator

I guess so - it sounds like the recompilation checker skips the split objects, which would be a bug (maybe one @ezyang is going near?). However, I'm still trying to figure out exactly what went wrong. How did you get to the state where you had the object file, but not all the split objects?

In my experiments, GHC creates the .hi file, then the split objects, then the .o file. If I delete any split objects it doesn't rebuild. Also worryingly, if I then go and corrupt the .o file (e.g. like might happen if there is a Ctrl-C while writing the file), it doesn't rebuild (shouldn't this be built somewhere and then mv'd?). I think the recompilation checker and avoidance is being a bit optimistic!
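
The "build somewhere and then mv" idea can be sketched as follows. This is only an illustration of the pattern, not what GHC does: writeFileAtomically and the ".tmp" suffix are hypothetical, and the rename is only expected to be atomic on POSIX file systems when both paths are on the same volume.

```haskell
import System.Directory (renameFile)

-- Write the contents to a temporary sibling file first, then rename it
-- over the target. If the process is interrupted during the write, the
-- target is either absent or still holds its previous contents.
writeFileAtomically :: FilePath -> String -> IO ()
writeFileAtomically target contents = do
  let tmp = target ++ ".tmp"  -- hypothetical temporary name
  writeFile tmp contents
  renameFile tmp target
```

An interrupted run may leave a stray temporary file behind, but it should not leave a truncated file under the final name.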

@ezyang

ezyang commented Dec 29, 2015

There are several levels at which GHC's recompilation checker operates.

  1. ghc --make does a compilation check that involves timestamps of hi and o files, and may skip the interface recompilation check entirely if it concludes that an object file is "stable".
  2. ghc -c has a disjoint compilation check in runPhase on Hsc, which SOLELY looks at the timestamp of the o file to determine if the source code has changed, which is something that the interface recompilation checker needs to know (if the source is modified, always recompile!)
  3. The interface recompilation check (that's the thing that lets GHC say "Compilation is NOT necessary") does NOT look at object files; its only job in life is to determine if the interface file is up-to-date. However, if the interface recompiler decides that recompilation is not necessary, the OBJECT is not rebuilt.

Why does this work at all? The critical thing is the fact that the interface recompilation check takes a boolean saying whether or not the source has been modified. If the object file is old, or doesn't exist, then regardless of the state of the interface file GHC concludes that the source code has changed and triggers a rebuild (which in turn triggers the rest of the pipeline to run).
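
As a rough model of the decision just described (not GHC's actual code): the names Decision, sourceModified and decide are hypothetical, and comparing the object timestamp against the source timestamp is an assumption about what "old" means.

```haskell
-- Simplified model of the recompilation decision. Not GHC's real logic.
data Decision = Recompile | Skip
  deriving (Eq, Show)

-- The source counts as modified when the object file is missing
-- (Nothing) or older than the source file.
sourceModified :: Ord t => t -> Maybe t -> Bool
sourceModified _       Nothing        = True
sourceModified srcTime (Just objTime) = objTime < srcTime

-- Arguments: source timestamp, object timestamp (if the object exists),
-- and whether the interface check considers the .hi file up to date.
decide :: Ord t => t -> Maybe t -> Bool -> Decision
decide srcTime objTime ifaceUpToDate
  | sourceModified srcTime objTime = Recompile
  | not ifaceUpToDate              = Recompile
  | otherwise                      = Skip
```

Nothing in this model mentions the split objects, which matches the observed behaviour: deleting them leaves every input to the decision unchanged, so the result stays Skip.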

Now, is split objects buggy? If you are expecting that the output of GHC is the individual split objects of course it is: GHC's not checking the right thing when doing recompilation checks (1)/(2). But if GHC's output is supposed to be the merged (split) objects, then there is no problem.

I don't actually know what the intended semantics of -split-objs is. It's not stated in the manual, and the fact that an A_o_split directory is created suggests that these files are not intended to be temporary.

@snowleopard
Owner Author

> However, I'm still trying to figure out exactly what went wrong. How did you get to the state where you had the object file, but not all the split objects?

@ndmitchell, I have no idea how this happens. Current hypothesis: what if Shake starts to kill all spawned processes, quickly manages to kill the one responsible for creating split objects, but ghc is more resilient and, thinking that splitting objects is now complete, manages to write the object file and terminate successfully? That's quite a convoluted scenario, but I don't have anything better at the moment.

@ezyang Thank you for the detailed comment!

> Now, is split objects buggy? If you are expecting that the output of GHC is the individual split objects of course it is

I think we do expect this. The build system needs a way to verify that all split objects are generated in order to proceed with the rest of the build. The easiest way is to assume that when GHC terminates successfully the split objects are up-to-date. An alternative is to find out how many split objects are expected and do a post-GHC validity check, which seems unnecessarily complicated.

I'll go ahead and submit the ticket. Maybe this is an easy fix.

@snowleopard
Owner Author

@ezyang

ezyang commented Dec 30, 2015

> I think we do expect this. The build system needs a way to verify that all split objects are generated in order to proceed with the rest of the build. The easiest way is to assume that when GHC terminates successfully the split objects are up-to-date. An alternative is to find out how many split objects are expected and do a post-GHC validity check, which seems unnecessarily complicated.

So, I just double checked, and GHC does merge the split object files into a single object file. Why doesn't Shake depend on that file?

@snowleopard
Owner Author

> Why doesn't Shake depend on that file?

It does! However, we somehow end up in a problematic state when the single object file exists, yet some split objects are missing. Then when we rerun the build, GHC doesn't want to restore missing split objects and the only way forward is to manually delete the complete object file to force GHC to restore things to a consistent state.
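
The manual workaround could be scripted roughly like this. It is a sketch only: splitDirFor and forceRebuild are hypothetical helpers, and the directory naming simply follows the file.o / file_o_split pattern seen in this thread.

```haskell
import Control.Monad (when)
import System.Directory (doesFileExist, removeFile)
import System.FilePath (dropExtension)

-- Derive the split-object directory from the object file name,
-- for example "GHC/Show.o" gives "GHC/Show_o_split".
splitDirFor :: FilePath -> FilePath
splitDirFor obj = dropExtension obj ++ "_o_split"

-- Delete the complete object file (if present) so that the next GHC run
-- no longer treats the module as up to date and regenerates everything.
forceRebuild :: FilePath -> IO ()
forceRebuild obj = do
  exists <- doesFileExist obj
  when exists (removeFile obj)
```

A build rule could call forceRebuild when the directory returned by splitDirFor looks incomplete, although deciding what "incomplete" means still requires knowing the expected number of split objects.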

@snowleopard
Owner Author

Actually, maybe the single object file is also incomplete in such situations: it may be just a merge of split objects that have been created. I haven't thought about this before, but this could explain what happens. Next time the bug occurs I'll check this.

@ezyang

ezyang commented Dec 30, 2015

This all sounds very impossible. It should not be possible, even with hard-kills, for GHC to manage to fail to write out all of the split objects BUT write an up-to-date object file; if any of the assembling fails, that should abort the entire pipeline.

The relevant code, by the way, is in compiler/main/DriverPipeline.hs; look for the SplitAs section in runPhase.

@snowleopard
Owner Author

Thanks @ezyang. I don't have an explanation yet. I tried to reproduce this in a controlled situation, but no luck so far.

One interesting scenario which I observed: if I kill the Shake process it fails to terminate GHC processes that keep working on split objects; they eventually succeed by creating complete object files. However, I can restart Shake and then we can have two GHC processes running at the same time, the old one and the new one working on the same object file. It looks like in this situation the old GHC gets stuck (and then never terminates) and the new GHC erases split objects, starts over, and eventually completes the job. However, with several generations of GHC processes running in parallel we might be able to get an impossible-sounding outcome.

@ezyang

ezyang commented Dec 30, 2015

OK:

  1. Two GHCs clobbering each other makes a good deal of sense, because the split object directory is shared.
  2. This split object directory is intended to be a user-visible output (Cabal must link the archive against the split objects, not the merged object), so indeed, it's just a bug in the recompilation checker.

@snowleopard
Owner Author

With --split-objs disabled by default I am no longer bothered by this. Since we can't do anything about this on our side anyway, I think we had better close this issue.
