perf!: concurrent merge with optional debugging #61

mark-rushakoff · 2023-04-14T20:55:54Z

Background

The non-determinism tests in the SDK have been taking excessively long to run. Profiling the test showed that there are many calls to codec/types.NewInterfaceRegistry, which calls proto.MergedRegistry in this module. Prior to these optimizations, that method would be limited to a single CPU for about 5ms on my machine when not performing validations, or about 225ms with validations on; with the optimizations, that drops to about 1.5ms or 55ms respectively, on my machine.

While the simulations need to be cleaned up to not call that method so frequently, this was still worth optimizing because:

other tests will continue to instantiate a new merged registry as needed
the merged registry is currently required when starting simd either in client or server mode. The validations are currently off, but should probably be on as a sanity check -- shaving off 150+ milliseconds of CLI startup time is worth doing for faster automated recovery and for a better CLI experience.

This PR supersedes #56. It differs by offering cleaner control over child goroutines to process file descriptors; backfilling tests and benchmarks; and using an explicitly limited number of goroutines that reuse allocation-heavy values and process data arriving on channels, rather than one goroutine per file descriptor. Thank you to @mmsqe for the work in #56 that influenced this change.

API change

Calls to proto.MergedRegistry() do not need to change.

Existing calls to proto.MergedFileDescriptors() need to change to call one of proto.MergedGlobalFileDescriptors() or proto.MergedGlobalFileDescriptorsWithValidation(). I chose to move the existing debug parameter into an explicit part of the method name rather than a boolean argument. This will make call sites clear whether they will fail due to import path mismatches or incorrectly duplicated file descriptors, without the reader needing context on the literal true or false argument.

For finer control over a merge, proto.MergedFileDescriptors has been updated to accept arguments (globalFiles *protoregistry.Files, appFiles map[string][]byte). I'm not sure if this will be useful outside of tests, but it was necessary for testing and benchmarking these changes.

Performance data

In a draft PR on the SDK, this brings the time for test-sim-nondeterminism's test stage down to under 9 minutes: https://github.com/cosmos/cosmos-sdk/actions/runs/4724996838/jobs/8382891823. With the previous implementation, runs taking longer than 45 minutes were common.

Here are benchmarks comparing the previous and new implementations, against a representative but incomplete copy of the protobuf file descriptors as of cosmos/cosmos-sdk@c4489d9:

$ go test ./test/mergedregistry/ -cpu=1,2,4,8,10 -bench=Repres -benchtime=5s
goos: darwin
goarch: arm64
pkg: github.com/cosmos/gogoproto/test/mergedregistry
BenchmarkRepresentativeMergedFileDescriptors/sequential            	    1790	   3259606 ns/op	 1636177 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-2          	    2041	   2902895 ns/op	 1636184 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-4          	    2101	   2864636 ns/op	 1636200 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-8          	    2031	   2933869 ns/op	 1636229 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/sequential-10         	    2024	   2959004 ns/op	 1636226 B/op	   42145 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent            	    1783	   3364500 ns/op	 1637314 B/op	   42165 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-2          	    3027	   1978660 ns/op	 1694548 B/op	   42176 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-4          	    4794	   1301653 ns/op	 1808718 B/op	   42198 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-8          	    4257	   1356715 ns/op	 2030442 B/op	   42241 allocs/op
BenchmarkRepresentativeMergedFileDescriptors/concurrent-10         	    4261	   1396325 ns/op	 2135427 B/op	   42262 allocs/op

See these trace screenshots for some visualizations of the performance data. Note that the time scale differs in each trace, and that work is properly spread across all available procs.

And extract MergedFileDescriptors's default behavior to MergedGlobalFileDescriptors, and update MergedFileDescriptors to accept arguments. Now that MergedFileDescriptors is testable, backfill some tests for the error cases.

Add a basic benchmark for MergedGlobalFileDescriptors, and add a new package so that we can import jhump/protoreflect in order to dynamically create new file descriptors. Right now, this is a crude implementation that produces very simple file descriptors; but if we need to improve the test cases, we now have a framework in place.

And add tests and benchmarks. I'm slightly stuck on the benchmarks, in deserializing the "global" file descriptors from simd. Once I figure that out, we will have benchmarks representative of real load.

CI still uses go1.19.

This doesn't include every single global file, but it includes enough to make it a representative benchmark now.

Now that we are no longer using jhump/protoreflect for this, these tests can live in the same package as the other benchmarks. Also mark file descriptor json files as binary so that they no longer show up in diffs or git grep.

I was focused on benchmarks, so wasn't running this test -- which ended up breaking due to function renames.

And since I didn't include the benchmark data in a commit message earlier, this is what it looks like, before I remove the old style in the next commit. On my 10-core M1 Max machine, with a few different CPU values to exercise varying core counts: $ go test ./test/mergedregistry/ -cpu=1,2,4,8,10 -bench=Repres -benchtime=5s goos: darwin goarch: arm64 pkg: github.com/cosmos/gogoproto/test/mergedregistry BenchmarkRepresentativeMergedFileDescriptors/sequential 1790 3259606 ns/op 1636177 B/op 42145 allocs/op BenchmarkRepresentativeMergedFileDescriptors/sequential-2 2041 2902895 ns/op 1636184 B/op 42145 allocs/op BenchmarkRepresentativeMergedFileDescriptors/sequential-4 2101 2864636 ns/op 1636200 B/op 42145 allocs/op BenchmarkRepresentativeMergedFileDescriptors/sequential-8 2031 2933869 ns/op 1636229 B/op 42145 allocs/op BenchmarkRepresentativeMergedFileDescriptors/sequential-10 2024 2959004 ns/op 1636226 B/op 42145 allocs/op BenchmarkRepresentativeMergedFileDescriptors/concurrent 1783 3364500 ns/op 1637314 B/op 42165 allocs/op BenchmarkRepresentativeMergedFileDescriptors/concurrent-2 3027 1978660 ns/op 1694548 B/op 42176 allocs/op BenchmarkRepresentativeMergedFileDescriptors/concurrent-4 4794 1301653 ns/op 1808718 B/op 42198 allocs/op BenchmarkRepresentativeMergedFileDescriptors/concurrent-8 4257 1356715 ns/op 2030442 B/op 42241 allocs/op BenchmarkRepresentativeMergedFileDescriptors/concurrent-10 4261 1396325 ns/op 2135427 B/op 42262 allocs/op PASS A single core machine unsurprisingly would perform slightly better on the sequential approach. With two cores, the concurrent approach is measurably better. I'm slightly surprised to see throughput go lower at 8 and 10 cores. But this approach is still more than twice as fast in the 4-10 core case compared to the sequential approach, so I'm going to leave the GOMAXPROCS count as-is. This should still make a measurable improvement in the simulation tests for the SDK, and it should be much less of a bottleneck than the sequential approach.

Benchmarks, and manually run experiments using simd and other cosmos-sdk tests, have shown that the concurrent implementation is faster regardless of whether validation is included. This commit's parent can be used if any other performance comparison is needed later.

mmsqe · 2023-04-18T01:12:17Z

proto/merge.go

+	// GOMAXPROCS is the number of CPU cores available, by default.
+	// Respect that setting as the number of CPU-bound goroutines,
+	// and for channel sizes.
+	nProcs := runtime.GOMAXPROCS(0)


limit with CPU-bound is a good idea, but might need rename to mergeFileDescriptors

Good catch, thank you. Comments corrected in cf9db71.

Just a missed search and replace.

mmsqe

LGTM, thanks for the refactor.

So that we don't need a separate commit after #61 is merged.

julienrbrt

lgtm!

mark-rushakoff added 9 commits April 14, 2023 08:26

refactor: enable debug by default for MergedFileDescriptors

47904a0

And extract MergedFileDescriptors's default behavior to MergedGlobalFileDescriptors, and update MergedFileDescriptors to accept arguments. Now that MergedFileDescriptors is testable, backfill some tests for the error cases.

feat: add concurrent proto registry merge functions

0d6d88b

And add tests and benchmarks. I'm slightly stuck on the benchmarks, in deserializing the "global" file descriptors from simd. Once I figure that out, we will have benchmarks representative of real load.

chore: don't use bytes.Clone from go1.20

0bfdfd3

CI still uses go1.19.

test: fix representative benchmark to include global FD set

e6d1320

This doesn't include every single global file, but it includes enough to make it a representative benchmark now.

test: move representative benchmarks to proper test package

7b6c214

Now that we are no longer using jhump/protoreflect for this, these tests can live in the same package as the other benchmarks. Also mark file descriptor json files as binary so that they no longer show up in diffs or git grep.

test: fix test to use merge with validation

4169e5b

I was focused on benchmarks, so wasn't running this test -- which ended up breaking due to function renames.

mark-rushakoff changed the title ~~perf: concurrent merge with optional debugging~~ perf!: concurrent merge with optional debugging Apr 17, 2023

chore: update changelog

9d031c6

mark-rushakoff mentioned this pull request Apr 17, 2023

build: use cosmos/[email protected] for optimized registry merge cosmos/cosmos-sdk#15862

Merged

mark-rushakoff marked this pull request as ready for review April 17, 2023 19:52

mark-rushakoff requested a review from a team as a code owner April 17, 2023 19:52

mark-rushakoff mentioned this pull request Apr 17, 2023

handle all file descriptors in parallel #56

Closed

mmsqe reviewed Apr 18, 2023

View reviewed changes

chore: use correct function name in comments

cf9db71

Just a missed search and replace.

mmsqe approved these changes Apr 18, 2023

View reviewed changes

chore: update changelog for 1.4.8 release today

66dc75b

So that we don't need a separate commit after #61 is merged.

julienrbrt approved these changes Apr 18, 2023

View reviewed changes

mark-rushakoff merged commit a0d0171 into main Apr 18, 2023

mark-rushakoff deleted the mr/concurrent-merge-with-debug branch April 18, 2023 13:48

mark-rushakoff mentioned this pull request Apr 19, 2023

refactor!: use x/tx/signing in client.TxConfig cosmos/cosmos-sdk#15822

Merged

19 tasks

facundomedica restored the mr/concurrent-merge-with-debug branch April 21, 2023 19:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf!: concurrent merge with optional debugging #61

perf!: concurrent merge with optional debugging #61

mark-rushakoff commented Apr 14, 2023 •

edited

Loading

mmsqe Apr 18, 2023

mark-rushakoff Apr 18, 2023

mmsqe left a comment

julienrbrt left a comment

perf!: concurrent merge with optional debugging #61

perf!: concurrent merge with optional debugging #61

Conversation

mark-rushakoff commented Apr 14, 2023 • edited Loading

Background

API change

Performance data

mmsqe Apr 18, 2023

Choose a reason for hiding this comment

mark-rushakoff Apr 18, 2023

Choose a reason for hiding this comment

mmsqe left a comment

Choose a reason for hiding this comment

julienrbrt left a comment

Choose a reason for hiding this comment

mark-rushakoff commented Apr 14, 2023 •

edited

Loading