Replace slow SHA-1 by xxHash for classpath hashes #371
Conversation
This commit replaces `SHA-1` with `xxHash`, an extremely fast non-cryptographic hash algorithm that is used extensively in production systems all over the world (see the website: http://cyan4973.github.io/xxHash/).

This new hashing function has been added because `MessageDigest` and, concretely, `MixedAnalyzingCompiler.config` showed up in some profiles consuming too much CPU time:

```
sbt.internal.inc.MixedAnalyzingCompiler$.$anonfun$makeConfig$1(File) 1213ms
```

The culprit is `Stamper.forHash`, which relies on `MessageDigest` to checksum the jars on the fly:

```
sbt.internal.inc.Stamper$.$anonfun$forHash$2(File) 1213ms -> FilterInputStream.java:107 java.security.DigestInputStream.read(byte[], int, int) 1146ms
```

However, it is not reasonable that this takes 1213ms for a classpath that is not that big (sbt's main module classpath). This is totally exaggerated, and it is a price that is paid every time we create a `MixedAnalyzingCompiler`, which is instantiated every time we invoke `compileIncrementally` in the `IncrementalCompilerImpl`; basically, in every compile iteration.

The benefit of this change is motivated by the following scenarios:

1. Users incrementally compile lots of times. Reducing the running time of every iteration makes incremental compilation start faster.
2. Big projects can have **huge** classpaths, which make incremental compilation significantly slower. This is especially true when every jar is big.

There are faster hashing functions than xxHash. However, given that xxHash is used in important projects and has been battle-tested, I've decided to use it for Zinc. Note that cryptographic hashes are not necessary to ensure that the classpath has or has not changed. It is safe to use `xxHash` since the collision rate is really low.
The new `Hash32` class is designed to wrap a 32-bit integer produced by the underlying hash algorithm. Instead of making it accept a `Long` (xxHash works at the 64-bit level), we produce an `Int` because `FileHash` needs an integer and we don't want to break that API. Truncating from 64 bits to 32 bits is safe; see http://fastcompression.blogspot.ch/2014/07/xxhash-wider-64-bits.html. This commit hence removes any reference to the previous `Hash`, which used hexadecimal values, and deprecates any method that assumes its existence, such as `getHash` in the `Stamp` Java interface. It also makes the corresponding changes in the underlying formats; the format change is binary compatible.
I forgot to note that Zinc also hashes source files. Source files will benefit from the faster checksum, though this is not really important since they aren't (or shouldn't be) a bottleneck. The real concern is classpath entries.
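For illustration, here is a minimal sketch of the approach being proposed, not the actual Zinc code: hash a classpath entry with xxHash and truncate the 64-bit result to an `Int` so it fits the existing `FileHash` API. It assumes the OpenHFT zero-allocation-hashing library (`net.openhft.hashing.LongHashFunction`), which exposes `hashBytes`; the helper name and chunk-folding scheme are made up for the example.

```scala
import java.io.{ BufferedInputStream, File, FileInputStream }
import net.openhft.hashing.LongHashFunction

// Hypothetical helper: checksum a jar with xxHash instead of MessageDigest/SHA-1.
def hashClasspathEntry(file: File): Int = {
  val xxHash = LongHashFunction.xx()
  val is = new BufferedInputStream(new FileInputStream(file))
  try {
    val buffer = new Array[Byte](8192)
    var acc = 0L
    var read = is.read(buffer)
    while (read >= 0) {
      // Hash each chunk and fold it into a running accumulator.
      acc = 31 * acc + xxHash.hashBytes(buffer, 0, read)
      read = is.read(buffer)
    }
    // Truncate the 64-bit value to 32 bits, which FileHash currently requires.
    acc.toInt
  } finally is.close()
}
```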
@@ -1,16 +0,0 @@
package sbt
@dwijnand This had to go since the old `Hash` does not exist anymore.
I made a quick informal experiment in the console. I ran both versions over 2600 jars from my ivy cache, which add up to 2.3GB. These are the results I get:

That's a 20x improvement over the status quo.
It would be good to see before and after numbers for the same scenario, rather than just the before numbers. I'm a bit concerned about using a 32-bit hash for this use case... if you're going to add a third-party dep, adding one that allows for larger outputs (at least 64 bits) would be good: almost anything should be faster than SHA-1.
Yeah, agreed. I cannot do this before I can compile the sbt codebase with the RC. I'll try then.
The current third-party dep is specialised for 64 bits; the only issue is that we truncate it. See the commit message to know why 😄. I go into the details there: in short, it's safe to truncate from 64 to 32 bits. The resources I've read on the Internet (and the blog post I link to in the commit) claim that the quality of the 32-bit version is excellent. By the way, as an interesting fact, this is the hash that Spark and MySQL use.
Two more notes here:
To address Stu's concerns about the length of the hash: I am going to try to change the API of
See discussion in sbt#371. Motivations for having 64-bit hashes for all the file hashes:

1. In a distributed environment, 32 bits are not sufficient to ensure that there are no collisions.
2. The quality of the 64-bit hash is supposed to be greater than that of the 32-bit one.
3. Previously, we were using `hashCode` as a 32-bit hash of the hexadecimal representation of a SHA-1 of some file contents. This is unsafe, and by switching to `long` we can also make the underlying file hash implementation 64-bit based.
@stuhood My last change makes the necessary changes to the |
@@ -25,7 +25,7 @@
*
* @return A valid string-based representation for logical equality, not referential equality.
*/
public int getValueId();
public long getValueId();
I'll reiterate that if you're going to change this, changing it to use `Array[Byte]` or `ByteBuffer` or protobuf's `ByteString` or something else that has dynamic length would be strongly preferable to using a fixed-width value. It allows you to change the implementation of the hash function without changing the interface (which is fine for a hash function, as it just causes misses). In the future, when you need more than 64 bits, it would not be necessary to change the API again.
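As a hypothetical sketch of what that suggestion could look like (the names are illustrative, not the actual xsbti API): exposing the value as bytes keeps the interface stable even if the hash width changes later.

```scala
import java.nio.ByteBuffer

// Illustrative only: a variable-width value id, so switching hash functions
// (or widening beyond 64 bits) does not require another interface change.
trait ValueId {
  def valueId: ByteBuffer
}
```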
Makes sense.
val acc = new Array[Long](2)
val buffer = new Array[Byte](BufferSize)
while (is.read(buffer) >= 0) {
  val checksumChunk = xxHash.hashBytes(buffer)
Can you include a reference to where you found this pattern?
Alternatively, using a hash library with a more usable interface would be great. Guava's is particularly good, and includes hashing of files.
Guava notoriously breaks binary compatibility all the time; it would be good not to depend on it in core sbt.
@stuhood I'm looking into a way of mixing hashes safely.
Note that Guava does not support xxHash. I like our current dependency because it is high-quality, self-contained, and pretty light.
The following commit removes the streaming by mmapping the contents of the files to be hashed, which is faster and better because no copying occurs between OS regions and user-space regions, meaning that the OS optimizes its access in virtual memory. This also means that subsequent executions of this method will be faster since the memory will already be mapped. Mapping my whole ivy cache takes 561ms the first time, and around ~350ms on subsequent runs.
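A minimal sketch of the mmap-based variant described above, assuming the OpenHFT `LongHashFunction` API; the actual commit may differ in details.

```scala
import java.io.File
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption
import net.openhft.hashing.LongHashFunction

// Map the whole file into virtual memory and hash the mapped region directly,
// avoiding a copy into a user-space buffer.
def hashMapped(file: File): Long = {
  val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
  try {
    val mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
    LongHashFunction.xx().hashBytes(mapped)
  } finally channel.close()
}
```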
This affects the `FileHash` and `Hash64` APIs. This will allow us to store the representation no matter what the hash is. It's friendlier to API changes in the future, as Stu has mentioned in the review. See https://github.com/sbt/zinc/pull/371/files for discussion.
One critical issue.
return this.hash[0];
}

public byte[] hash64() {
Needn't be named `hash64` anymore.
It cannot be called `hash` because it conflicts with the previous signature. I'll call it `hashBytes`.
}

public int hashCode() {
return 37 * (37 * (37 * (17 + "xsbti.compile.FileHash".hashCode()) + file().hashCode()) + hash64().hashCode());
Important: the `hashCode` and `equals` methods on arrays are not useful (you need to use the static `Arrays.equals()`, iirc), which is one reason to prefer ByteBuffers for this.
The other reason to prefer ByteBuffers is that they can act as views into other data, which can avoid copying it; also, protobuf ByteStrings can be "viewed" as ByteBuffers with `asReadOnlyByteBuffer`.
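A quick Scala illustration of that point: JVM arrays use identity-based `equals`/`hashCode`, so they make poor hash fields unless you go through `java.util.Arrays`, while `ByteBuffer` compares content.

```scala
val a = Array[Byte](1, 2, 3)
val b = Array[Byte](1, 2, 3)

a == b                                                        // false: reference equality
java.util.Arrays.equals(a, b)                                 // true: content equality
java.util.Arrays.hashCode(a) == java.util.Arrays.hashCode(b)  // true: content-based hash
java.nio.ByteBuffer.wrap(a) == java.nio.ByteBuffer.wrap(b)    // true: compares remaining bytes
```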
👍 We're already using `asReadOnlyByteBuffer` in the protobuf converters, but I agree `ByteBuffer`s are superior to `Array[Byte]`.
override def writeStamp: String = s"hash($hexHash)"
override def getValueId: Int = hexHash.hashCode()
override def getHash: Optional[String] = Optional.of(hexHash)
final class Hash64(val hash: Long) extends StampBase {
Is it possible to just hold the hash bytes here, and use their hashCode as the hashCode? Not sure which API constraints you're working with, but converting to 32 bits is necessary for JVM hashCode methods, while converting back and forth to `Long` values just shouldn't (ever) be necessary except in the text-based analysis format.
This is a good suggestion 👍. I'll try to see how it goes.
@@ -26,11 +27,17 @@ final class ProtobufReaders(mapper: ReadMapper) {
java.nio.file.Paths.get(path).toFile
}

private def readLong(bs: ByteString) = {
See above: it should just take the `Array[Byte]` directly here. Or better yet, use `ByteBuffer` in your public APIs and directly consume `asReadOnlyByteBuffer` here.
It's not possible to use `ByteBuffer` because `bytes` in protobuf relies on `ByteString`. We cannot change that data type.
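For reference, a hypothetical sketch of the conversion under discussion, assuming protobuf's `ByteString` API (`asReadOnlyByteBuffer`); this is not the exact Zinc code.

```scala
import com.google.protobuf.ByteString

// Read the first 8 bytes of the ByteString as a big-endian Long,
// viewing the bytes through a read-only ByteBuffer to avoid copying.
def readLong(bs: ByteString): Long =
  bs.asReadOnlyByteBuffer().getLong
```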
I'm assigning this to the 1.1.x milestone. Even though this change is not really intrusive, I would prefer it to land in 1.1.x instead of 1.0.0. I would like to advertise 1.1.0 as the release that focuses on making the incremental compiler reproducible.
Seems I'm hitting the same issue in a project with ~350 source files. No-op compile goes from 16 seconds (according to sbt, but I'm sure it's slower) to 37 seconds. A YourKit profile shows all the time in
Quoting the exact same part of the profile that I did in the initial description of this PR, we can see that it's the same culprit:
I still have to measure whether the impact here is just I/O or the hashing algorithm being used. I'll check this PR with my previous experiment to see if the running time does indeed decrease.
The true cause is this line:
i.e. this is all I/O (this is a backtrace). Using NIO to read the file in one operation should be significantly faster.
I remember seeing |
Make sure to swap between sampling and probe mode.
Thanks for working on this. Any chance this PR can get merged into 1.0.x rather than waiting for 1.1? Sbt is basically unusable right now for iterative development. Every |
It would be good to clarify what scenario we can expect to see a speedup in, in the form of a repeatable benchmark test.
@eed3si9n I've created extensive docs around this issue across the new multi-repo structure (which was confusing as hell) and even sent a POC with a fix. I gave up because there is no way to monkey-patch sbt with a custom implementation of any of this stuff. So I sent PRs to allow monkey-patching. Then I gave up. It is hard to overstate how much of a performance regression this is on anything but the most trivial of projects. A benchmark is trivial... apply this hash to the scala-library jar. A cached check, or, as Jorge points out, using the jar's header, is orders of magnitude faster.
Here's some more data for y'all: 7s on a no-change compile, with a big list of dependent jars. https://gist.github.com/dispalt/2201b25a6a3476f2f01188bf44b808a3
Given the importance of this change and how it's affecting the community, I'll be updating this PR and finding a solution to this hashing problem tomorrow.
Any hash is going to be too slow to use for this purpose... we need a quick check before that, based on OS metadata. Efforts to speed up the hash itself would still be greatly appreciated, as it's a constant overhead for all builds.
I agree. Hashing of some element will be needed, so it might as well be fast. Or does this PR introduce complexity baggage that will be hard to get rid of?
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache hashes for jars that have the same last-modified time across different compiler iterations. This is important because until now there was a significant overhead when running `compile` on multi-module builds that have gigantic classpaths. In this scenario, the previous algorithm computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with the status quo of classpath hashing. The most important one is the fact that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't make sense. The second one is that we don't need a SHA-1 checksum for the kind of checks we want to do; sbt#371 explains why. The third limitation with this check is that file hashes are implemented internally as `int`s, which is not enough to represent the richness of the checksum. My previous PR also tackles this problem, which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelizing this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.
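A rough sketch of the metadata-based cache described above (not the actual code in sbt#452); `computeHash` is a placeholder for the real hashing function, and the real patch also runs cache misses in parallel.

```scala
import java.io.File
import java.util.concurrent.ConcurrentHashMap

final case class CachedHash(lastModified: Long, hash: Int)

final class ClasspathHashCache {
  private val cache = new ConcurrentHashMap[File, CachedHash]()

  // Reuse the previous hash while the jar's last-modified time is unchanged;
  // otherwise recompute and remember it for the next compile iteration.
  def hash(file: File)(computeHash: File => Int): Int = {
    val modified = file.lastModified()
    val cached = cache.get(file)
    if (cached != null && cached.lastModified == modified) cached.hash
    else {
      val fresh = computeHash(file)
      cache.put(file, CachedHash(modified, fresh))
      fresh
    }
  }
}
```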
Done in #452. I'll follow up on this PR and change the algorithm and the API limitations imposed by the old Zinc 1.0 API in the future. So far, #452 should:
Numbers are on the pull request.