Make Zinc compilations reproducible #333
Comments
I am generally in support of making an effort towards reproducible or referentially transparent builds. The jar files in sbt are made using the IO module; IO.jar is implemented here - https://github.com/sbt/io/blob/21ed6ec2a2e33fd88a5f013f643d04fda26de4de/io/src/main/scala/sbt/io/IO.scala#L446 Unless there is a significant performance cost, I think it would make sense to move towards generating deterministic jars.
Hey @raboof, thanks for opening a ticket, this is something I've devoted some thought to lately. I believe that the fixes to make compilation more reproducible need to be merged in sbt rather than Zinc. A good example that backs this statement up is Pants, which is already providing reproducible Scala compiles with the current Zinc (as of Zinc 1.0.x). If you want to pursue this further, I suggest you have a look at their source code. That being said, if we see that there is some core logic that is shared across reproducible build tools and can be merged upstream, I'm up for merging it. But in my opinion the best way to move forward is to make changes in the build tools first, and then abstract over the requirements to get reproducible compiles.
Thanks for reaffirming that this is a useful goal. I'll definitely check out how Pants solves this (though this is a spare-time thing for me so I can't really promise when I'll find significant time to spend on this ;) ).
@raboof Any help is appreciated, no matter when it comes. 👍
I had a look at pants - AFAICS they actually don't produce reproducible builds: I created a small test project with a Scala |
.. and sorting entries by name. This made building a minimal test application 'repeatable' when building on the same machine but clearing the target directory. Related to sbt/zinc#333
Correct: pants builds are not currently bitwise identical. zinc is one piece of that problem, but there are plenty of others. We're moving toward an execution model that will allow for bitwise reproducibility, but it's not clear yet how we will make that work in the presence of zinc, given that the zinc analysis is a sort of mutable state that needs to cycle back into each re-execution of the zinc process. Moving zinc analysis on disk to being entirely hash based rather than timestamp based would be one important piece here... also stabilizing the order of all collections in the analysis.
@stuhood Agreed that removing any part that relies on timestamps in Zinc is important, and that's why I opened #371 which will allow us to use file fingerprints for everything: class files, binaries, sources. I also agree that stabilizing the order of all collections is important too. In my opinion, there's room for compromise when it comes to reproducible builds. I think we should be optimizing for the use case where you clone a project and check out a branch. In those cases, you do want your build to be reproducible. But when it comes to your developer environment, it seems unproductive to ditch incremental compilation as a whole only to get reproducibility for errors that rarely happen. I think there's a good tradeoff here, and I don't mind giving up some of my reproducibility to get faster compile times. An idea that @smarter gave me last weekend is that we could test how good incremental compilation is now by creating a script that compiles every commit in a repository from scratch and then tries to detect where the incremental compilation is producing different class files than the batch compilation. I think that would be a good experiment to convince users how rare inconsistencies due to incremental compilations are. And if they are not rare, we can always work to make them disappear.
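The comparison step of such a script could look roughly like the following. This is only a sketch under assumptions: the class name `ClassDirDiff`, its methods, and the idea of pointing it at two output directories (one from a batch compile, one from an incremental compile) are hypothetical and not part of Zinc or sbt. It hashes every file under each directory with SHA-256 and reports the relative paths whose contents differ or that exist on only one side.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not an existing Zinc/sbt API.
final class ClassDirDiff {
    // Maps each file's path (relative to root, '/'-separated) to its SHA-256 hex digest.
    static Map<String, String> digests(Path root) {
        Map<String, String> out = new TreeMap<>();
        try (Stream<Path> walk = Files.walk(root)) {
            List<Path> files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            for (Path file : files) {
                MessageDigest sha = MessageDigest.getInstance("SHA-256");
                byte[] hash = sha.digest(Files.readAllBytes(file));
                String hex = String.format("%064x", new BigInteger(1, hash));
                out.put(root.relativize(file).toString().replace('\\', '/'), hex);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    // Relative paths whose digest differs, or that are present in only one directory.
    static Set<String> differing(Path batchDir, Path incrementalDir) {
        Map<String, String> batch = digests(batchDir);
        Map<String, String> incremental = digests(incrementalDir);
        Set<String> names = new TreeSet<>(batch.keySet());
        names.addAll(incremental.keySet());
        names.removeIf(n -> batch.containsKey(n) && batch.get(n).equals(incremental.get(n)));
        return names;
    }
}
```

An empty result from `differing` for a given commit would mean the incremental output matched the batch output byte for byte.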
I agree completely that ditching incremental compilation as a whole only to get reproducibility would not be good. I assume you mean "how rare inconsistencies due to incremental compilations are"? I'd really like to have fully-reproducible "published artifacts", but during development I don't care so much (and would prefer to have fast builds over reproducible ones :) )
That's a great idea.
In the absence of incremental compile, the primary portions that would be important to make reproducible would be the output classfiles, which are mostly scalac's responsibility. If zinc were to start jarring outputs on its own, either it or scalac would probably want to support an option to disable/zero file timestamps on written files.
I've set up something similar and we got tons of differences (I compared the CRCs of jar entries). After further examination it turns out that we get a lot of differences even in full builds of the same commit on different machines. Comparing classes using javap shows that:
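For anyone who wants to repeat the CRC comparison of jar entries, a minimal sketch follows. It is an assumption about how such a check could be written, not the setup actually used above; `JarCrcDiff` and its methods are hypothetical names. It reads the CRC-32 values that `java.util.zip.ZipFile` exposes for each entry and lists the entries that differ between two jars or exist in only one of them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for comparing two jars entry by entry.
final class JarCrcDiff {
    // Maps each non-directory entry name to the CRC-32 recorded in the archive.
    static Map<String, Long> crcs(Path jar) {
        Map<String, Long> result = new TreeMap<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory()) {
                    result.put(entry.getName(), entry.getCrc());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    // Entry names whose CRC differs, or that are present in only one jar.
    static Set<String> differing(Path left, Path right) {
        Map<String, Long> a = crcs(left);
        Map<String, Long> b = crcs(right);
        Set<String> names = new TreeSet<>(a.keySet());
        names.addAll(b.keySet());
        names.removeIf(n -> a.containsKey(n) && a.get(n).equals(b.get(n)));
        return names;
    }
}
```

Because only entry contents are compared, this ignores differences in entry timestamps and entry order, which have to be checked separately.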
Tree traversal is supposed to be deterministic. What we should ensure, though, is that all the compiler inputs are deterministic too. One example to achieve that is to sort the inputs. @romanowski Let me know if something comes up related to incremental compilation.
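Sorting the inputs could be as small as the sketch below. It is an illustration under assumptions, not how sbt or Zinc order sources: `SortedSources` and `stableOrder` are hypothetical names, and sorting by path relative to the project directory is one possible choice, made so that the order does not depend on where the checkout lives on disk.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper a build tool could apply before invoking the compiler.
final class SortedSources {
    // Orders sources by their '/'-separated path relative to baseDir,
    // so the result is independent of filesystem listing order and checkout location.
    static List<Path> stableOrder(Path baseDir, Collection<Path> sources) {
        List<Path> sorted = new ArrayList<>(sources);
        sorted.sort(Comparator.comparing(
            (Path p) -> baseDir.relativize(p).toString().replace('\\', '/')));
        return sorted;
    }
}
```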
@jvican this was on a full compile, not incremental. I don't believe that we currently sort the input files, but maybe this could be a -Y option, or the default. Specifically, these were 2 compilations from the same Git commit, but from different CI compilations (so different actual directories) and on Linux (I think). Some of the differences were in classes generated by macros, and others were from anonymous blocks. @romanowski also saw differences in the names of the methods generated for default argument access. I have not seen that, and we do have some special code for these default parameters in our plugin, so this may or may not be our issue. Anecdotally, the numbers generated for classes from macros were very high, so I think that the generation is using a name generator based on the macro name, not the full scope.
Yes, that was my impression from reading @romanowski's previous comment. Let me know if there are changes with incremental compilation, I would be interested in having a look. I don't think that sorting should go into the compiler. It's something the build tools should be doing, at least for now. In my opinion, we should encourage such a change in sbt and other "popular" build tools like Gradle / Maven.
I think this is how it's implemented in paradise. But I don't remember well, it's been a while since I read the sources. However, if this is happening, it means that the typechecker does not have deterministic traversals. Otherwise it wouldn't matter where the synthetic counter comes from.
@jvican for the macros - my point was that, as we are getting numbers related to the use of the macros, they cannot be deterministic for incremental compilation WRT full compilation, so this will need to be addressed separately. We are seeing numbers for the macro generation up to 100K, and these are macros that will only run once at a given location. WRT ordering and external tools: as I read the compiler, I don't think that the files to be compiled are compiled in the order that they are specified.
Not sure if the warnings are related, but making this a SortedSet would make this order determined
I wrote the comment, but I don't know what I can add. The issue was that somewhere in the resident compiler an iterator was created but not immediately used, and tests depended on the iteration order reflecting the state of unitbuf at the time of iteration instead of at the time of creation. Whether it still is so, I could not guess.
It'd be useful to be able to support 'reproducible' builds from sbt/zinc (in the https://reproducible-builds.org/ sense).
One current source of nondeterminism in generated artifacts is the fact that JarOutputStream and ZipOutputStream from java.util are used in IO.jar(), which will include timestamps in the generated jar file. Also I'm not sure the ordering of the files in the archive is deterministic.
There's generally 2 ways to make builds reproducible: generating the assets in a deterministic way, or post-processing them. I'd say generating the jars in a deterministic way would be the nicest approach and make it easiest to integrate into an sbt project.
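To make the "deterministic generation" option concrete, here is a minimal sketch of the idea, not a proposed patch to IO.jar(). The class name `DeterministicJar`, the `write` method, and the fixed date are assumptions chosen for illustration. It writes entries in sorted name order and gives every entry the same timestamp, using `ZipEntry.setTimeLocal` (available since Java 9) so the stored time does not go through a default-timezone conversion. The manifest that a real jar would carry is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.time.LocalDateTime;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch; a jar is a zip archive, so ZipOutputStream is enough here.
final class DeterministicJar {
    // Arbitrary fixed timestamp (an assumption); any constant inside the MS-DOS date range works.
    static final LocalDateTime FIXED_TIME = LocalDateTime.of(2010, 1, 1, 0, 0);

    // Writes the given name -> content map as a zip archive with stable order and timestamps.
    static byte[] write(Map<String, byte[]> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            // TreeMap iterates in sorted key order regardless of the caller's map type.
            for (Map.Entry<String, byte[]> e : new TreeMap<>(entries).entrySet()) {
                ZipEntry entry = new ZipEntry(e.getKey());
                entry.setTimeLocal(FIXED_TIME); // same timestamp for every entry
                zip.putNextEntry(entry);
                zip.write(e.getValue());
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }
}
```

With this approach, two calls given the same contents in a different insertion order should yield byte-identical archives on the same JVM. Whether different JDK versions or compression settings produce the same bytes is a separate question this sketch does not address.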
What would be a good place for extension points so this behavior can be overridden to be deterministic? Or would it even be acceptable to make this the default behavior?