all: use SHA256 with SIMD instructions for higher performance and throughput #700

Open
4 tasks
odeke-em opened this issue Jun 7, 2022 · 14 comments

Comments

@odeke-em
Contributor

odeke-em commented Jun 7, 2022

In this repository, we heavily use the Go standard library's crypto/sha256. However, our friends at Minio maintain a Single Instruction Multiple Data (SIMD) package at https://github.com/minio/sha256-simd that promises 8X speedups when using AVX instructions. We should explore this.

Let's explore if performance radically improves and then plumb it in.

Kindly cc-ing my colleague @elias-orijtech


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@tac0turtle
Member

Is it okay to assign this to you and your team @odeke-em

@odeke-em
Contributor Author

odeke-em commented Jun 8, 2022

Is it okay to assign this to you and your team @odeke-em

Yes, please @marbar3778! We are working on it. I just need to find a machine with AVX512 so that we can produce benchmarks.

@odeke-em odeke-em self-assigned this Jun 8, 2022
@ValarDragon
Contributor

In support of using that library! Though I think it's probably advisable to turn off AVX512 via a build flag, given the SDK workload (https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/)

@itsdevbear

+1

@kirbyquerby

Also of interest for this issue are a number of occurrences of crypto.Sha256():

[screenshot: occurrences of crypto.Sha256()]

These come from what appears to be a helper function that wraps crypto/sha256:

https://github.com/cometbft/cometbft/blob/e9b91405b643b46b011865c4b7e1c1af0aa5c521/crypto/hash.go#L7-L11

We'd probably want to either replace these usages or update cometbft to use the SIMD library as well.

@tac0turtle
Member

Thanks for the insight. I would advocate for replacing the wrapped function, as we are trying to rely less on Comet.

@yihuang
Collaborator

yihuang commented Feb 16, 2023

The last time I checked, I didn't see much improvement on the dev machines I have at hand (an x86_64 Mac laptop and an arm64 Linux machine); on the Mac the stdlib is actually much faster. I just reran the benchmark with go1.20, and the results are as follows:

arm64 linux
~/sha256-simd $ go test -run=^$ -bench=. -benchmem ./ -count=1
goos: linux
goarch: arm64
pkg: github.com/minio/sha256-simd
BenchmarkHash/Generic/8Bytes-8         	 2184978	       549.6 ns/op	  14.56 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/64Bytes-8        	 1000000	      1064 ns/op	  60.17 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/1K-8             	  139132	      8623 ns/op	 118.76 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/8K-8             	   18447	     65101 ns/op	 125.83 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/1M-8             	     144	   8288227 ns/op	 126.51 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/5M-8             	      28	  41402281 ns/op	 126.63 MB/s	       3 B/op	       0 allocs/op
BenchmarkHash/Generic/10M-8            	      14	  82817517 ns/op	 126.61 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/8Bytes-8         	11930301	       100.6 ns/op	  79.55 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/64Bytes-8        	 7533750	       160.1 ns/op	 399.67 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/1K-8             	 1547152	       775.6 ns/op	1320.21 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/8K-8             	  224019	      5354 ns/op	1530.03 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/1M-8             	    1789	    670705 ns/op	1563.39 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/5M-8             	     356	   3352908 ns/op	1563.68 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/10M-8            	     178	   6706550 ns/op	1563.51 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/8Bytes-8        	11268408	       106.6 ns/op	  75.04 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/64Bytes-8       	 8466012	       141.9 ns/op	 450.98 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/1K-8            	 1586331	       756.2 ns/op	1354.14 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/8K-8            	  224902	      5335 ns/op	1535.60 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/1M-8            	    1789	    670623 ns/op	1563.58 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/5M-8            	     356	   3352907 ns/op	1563.68 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/10M-8           	     178	   6703876 ns/op	1564.13 MB/s	       0 B/op	       0 allocs/op
PASS
ok  	github.com/minio/sha256-simd	31.607s
amd64 mac
~/sha256-simd $ go test -run=^$ -bench=. -benchmem ./ -count=1
goos: darwin
goarch: amd64
pkg: github.com/minio/sha256-simd
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkHash/Generic/8Bytes-12 	 2982602	       410.3 ns/op	  19.50 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/64Bytes-12         	 1540022	       782.3 ns/op	  81.81 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/1K-12              	  193633	      6219 ns/op	 164.67 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/8K-12              	   20944	     49602 ns/op	 165.15 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/1M-12              	     202	   6051028 ns/op	 173.29 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/5M-12              	      37	  32201704 ns/op	 162.81 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/10M-12             	      16	  63400945 ns/op	 165.39 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/8Bytes-12         	 6060865	       188.0 ns/op	  42.56 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/64Bytes-12        	 3442257	       342.0 ns/op	 187.13 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/1K-12             	  493141	      2419 ns/op	 423.34 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/8K-12             	   66552	     18119 ns/op	 452.12 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/1M-12             	     512	   2310553 ns/op	 453.82 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/5M-12             	      99	  11535992 ns/op	 454.48 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/10M-12            	      44	  23383451 ns/op	 448.43 MB/s	       0 B/op	       0 allocs/op
PASS
ok  	github.com/minio/sha256-simd	20.488s

@kirbyquerby

@yihuang I did some digging and it looks like the Go standard library has support for ARM SHA extensions and AVX2, which could explain why GoStdlib and ArmSha2 have such similar performance (Generic falls so far behind because it's an implementation that doesn't use hardware acceleration).

sha256-simd advertises improved performance for processors with Intel SHA Extensions or AVX512, which the standard library doesn't have optimizations for.

I didn't see any improvements for cosmos-sdk benchmarks with the simd library on my workstation, which has Intel SHA Extensions (5950x), but I plan to also benchmark on a machine with AVX512.

@yihuang
Collaborator

yihuang commented Feb 16, 2023

Actually, the iavl library uses sha256 heavily, so it should have a bigger impact there.

@kirbyquerby

kirbyquerby commented Mar 1, 2023

I ran benchmarks for cosmos-sdk and iavl on machines with AVX512 and Intel SHA Extensions with and without using the SIMD library, and got these results: https://gist.github.com/kirbyquerby/6635113b003abdaeaa93618d4e6970a2

There didn't seem to be significant improvements (in many benchmarks, there's even a slowdown) for using the SIMD library in either cosmos-sdk or iavl.

@tac0turtle
Member

would be interesting to test this in iavl https://github.com/prysmaticlabs/gohashtree. see if there is any change

@yihuang
Collaborator

yihuang commented Mar 9, 2023

would be interesting to test this in iavl https://github.com/prysmaticlabs/gohashtree. see if there is any change

I can reproduce the Intel benchmark result on my Mac laptop: it's faster by 6x if you do at least 16 hashing operations in a batch.
But their API assumes the user always hashes 64 bytes into a 32-byte digest, so it can hard-code the padding block and do multiple hashes in parallel. For the iavl tree:

  • we don't have the fixed block to hard code
  • to exploit opportunities of parallel hashing, we need to change our ways of traversing the tree, for example, hashing all the leaf nodes first in a batch, then all the height=1 nodes, etc.

@robert-zaremba
Collaborator

Shall we close this issue and open a new one in IAVL if we want to dig further into gohashtree usage there?

@tac0turtle
Member

tac0turtle commented Mar 10, 2023

I'll transfer this issue there.

But their API assumes the user always hashes 64 bytes into a 32-byte digest

We can either modify our code or use a variation of their code.

@tac0turtle tac0turtle transferred this issue from cosmos/cosmos-sdk Mar 10, 2023
@tac0turtle tac0turtle moved this to 👀 To Do in Cosmos-SDK Nov 16, 2023