Profile Horizon #4022

2opremio · 2021-10-21T15:52:50Z

I don't think we have profiled the Time (CPU and Wallclock) and Memory performance of Horizon in a while.

We should try doing that and and improve any low hanging fruit we see. Particularly Horizon's ingestion pipeline.

For starters, this is a CPU profile I took from Horizon's staging pubnet deployment:
pprof.stellar-horizon.samples.cpu.001.pb.gz

2opremio · 2021-10-21T16:26:53Z

Unsurprisingly Horizon spends most of its CPU time decoding base64 and XDR. Most of which is spent making allocations:

2opremio · 2021-10-21T16:52:40Z

I think we can really improve decoding speed by:

Generating custom decoders as opposed to reflection
Using memory pools (for ledgers, transactions and operations)

2opremio · 2021-11-02T16:39:32Z

I took a quick look at the decoding object allocation and performance ... and I found that most (if not all) the intermediary buffers used by the xdr decoding library end up allocated in the heap.

Here are the top-most object allocating functions of Horizon:

If we ignore the calls from the reflect package we can see that calls from xdr.(*Decoder) account for ~15% of the objects allocated.

Let's take a a peek at DecodeInt():


pprof
stellar-horizon alloc_objects

      File: stellar-horizon
Type: alloc_objects
Time: Oct 22, 2021 at 1:30pm (CEST)
Active filters:
   focus=github\.com/stellar/go-xdr/xdr3\.\(\*Decoder\)\.DecodeInt
Showing nodes accounting for 8585334, 5.06% of 169782541 total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                           7110754 82.82% |   github.com/stellar/go-xdr/xdr3.(*Decoder).decodeUnion github.com/stellar/[email protected]/xdr3/decode.go:530
                                            753674  8.78% |   github.com/stellar/go-xdr/xdr3.(*Decoder).DecodeBool github.com/stellar/[email protected]/xdr3/decode.go:175
                                            720906  8.40% |   github.com/stellar/go-xdr/xdr3.(*Decoder).decode github.com/stellar/[email protected]/xdr3/decode.go:801
   8585334  5.06%  5.06%    8585334  5.06%                | github.com/stellar/go-xdr/xdr3.(*Decoder).DecodeInt github.com/stellar/[email protected]/xdr3/decode.go:105
----------------------------------------------------------+-------------

All allocations come from line 105, which corresponds to a var buf [4]byte array declaration:

func (d *Decoder) DecodeInt() (int32, int, error) {
	var buf [4]byte
	n, err := io.ReadFull(d.r, buf[:])

One would think that buf would be kept in the stack, but unfortunately that's not the case, due to the call to ReadFull using a slice from created from buf. From the escape analysis output of go 17.2

sweepey:xdr3 fons$ pwd
/Users/fons/go/pkg/mod/github.com/stellar/[email protected]/xdr3
sweepey:xdr3 fons$ go build -gcflags '-m -m -l' . |& grep decode.go
[...]
./decode.go:105:6: buf escapes to heap:
./decode.go:105:6:   flow: {heap} = &buf:
./decode.go:105:6:     from buf (address-of) at ./decode.go:106:32
./decode.go:105:6:     from buf[:] (slice) at ./decode.go:106:32
./decode.go:105:6:     from io.ReadFull(d.r, buf[:]) (call parameter) at ./decode.go:106:23
[...]

The same applies to the other decoding functions.

The problem is that the reader passed to io.ReadFull could keep a copy of buf around and thus Go decides to keep it in the heap ( Actually, I think that Go just lets the variable escape the stack due to the call, regardless of what the io.ReadFull actually does with it )

This means that, in order to dramatically reduce allocations we should change the Decoder's design so that it reads from a raw buffer (not requiring external IO calls causing stack escapes).

leighmcculloch · 2021-11-02T16:49:08Z

This is really interesting finding. 👏🏻 Combined with that finding we could also rewrite the xdrgen generated Go code to use the same pattern as stellar/xdrgen#65 does for encoding, and apply it to decoding so that no reflection is necessary.

I reduced decoding allocations by 1. Adding a scratch buffer to the Encoder/decoder. The scratch buffer is used as a temporary buffer. Before that, temporary buffers were allocated (in the heap since they escaped the stack) for basic type encoding/decoding (e.g. `DecodeInt()` `EncodeInt()` 2. Making `DecodeFixedOpaque()` decode in-place instead of allocating the result. Apart from reducing allocations, this removes the need of copying (as shown by `decodeFixedArray()`. The pre-existing (and admitedly very limited) benchmarks show a sizeable improvement: Before: ``` goos: darwin goarch: amd64 pkg: github.com/stellar/go-xdr/xdr3 cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz BenchmarkUnmarshal BenchmarkUnmarshal-8 3000 1039 ns/op 15.40 MB/s 152 B/op 11 allocs/op BenchmarkMarshal BenchmarkMarshal-8 3000 806.7 ns/op 19.83 MB/s 104 B/op 10 allocs/op PASS ``` After: ``` goos: darwin goarch: amd64 pkg: github.com/stellar/go-xdr/xdr3 cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz BenchmarkUnmarshal BenchmarkUnmarshal-8 3000 679.6 ns/op 23.54 MB/s 144 B/op 8 allocs/op BenchmarkMarshal BenchmarkMarshal-8 3000 609.1 ns/op 26.27 MB/s 104 B/op 8 allocs/op PASS ``` Both decoding and encoding both go down to 8 allocations from 11 (decoding) and 10 (encoding) allocations per operation. More context at stellar/go#4022 (comment)

2opremio · 2021-11-02T21:16:50Z

The problems I identified above are fixed by stellar/go-xdr#16

I reduced the allocation count by 1. Adding a scratch buffer to the Encoder/decoder. The scratch buffer is used as a temporary buffer. Before that, temporary buffers were allocated (in the heap since they escaped the stack) for basic type encoding/decoding (e.g. `DecodeInt()` `EncodeInt()` 2. Making `DecodeFixedOpaque()` decode in-place instead of allocating the result. Apart from reducing allocations, this removes the need of copying (as shown by `decodeFixedArray()`. The pre-existing (and admitedly very limited) benchmarks show a sizeable improvement: Before: ``` goos: darwin goarch: amd64 pkg: github.com/stellar/go-xdr/xdr3 cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz BenchmarkUnmarshal BenchmarkUnmarshal-8 3000 1039 ns/op 15.40 MB/s 152 B/op 11 allocs/op BenchmarkMarshal BenchmarkMarshal-8 3000 806.7 ns/op 19.83 MB/s 104 B/op 10 allocs/op PASS ``` After: ``` goos: darwin goarch: amd64 pkg: github.com/stellar/go-xdr/xdr3 cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz BenchmarkUnmarshal BenchmarkUnmarshal-8 3000 679.6 ns/op 23.54 MB/s 144 B/op 8 allocs/op BenchmarkMarshal BenchmarkMarshal-8 3000 609.1 ns/op 26.27 MB/s 104 B/op 8 allocs/op PASS ``` Both decoding and encoding both go down to 8 allocations from 11 (decoding) and 10 (encoding) allocations per operation. More context at stellar/go#4022 (comment)

I reduced the allocation count by 1. Adding a scratch buffer to the Encoder/decoder. The scratch buffer is used as a temporary buffer. Before that, temporary buffers were allocated (in the heap since they escaped the stack) for basic type encoding/decoding (e.g. `DecodeInt()` `EncodeInt()` 2. Making `DecodeFixedOpaque()` decode in-place instead of allocating the result. Apart from reducing allocations, this removes the need of copying (as shown by `decodeFixedArray()`. The pre-existing (and admitedly very limited) benchmarks show a sizeable improvement: Before: ``` goos: darwin goarch: amd64 pkg: github.com/stellar/go-xdr/xdr3 cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz BenchmarkUnmarshal BenchmarkUnmarshal-8 1445005 853.5 ns/op 18.75 MB/s 152 B/op 11 allocs/op BenchmarkMarshal BenchmarkMarshal-8 1591068 643.0 ns/op 24.88 MB/s 104 B/op 10 allocs/op PASS ``` After: ``` goos: darwin goarch: amd64 pkg: github.com/stellar/go-xdr/xdr3 cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz BenchmarkUnmarshal BenchmarkUnmarshal-8 1753954 661.4 ns/op 24.19 MB/s 144 B/op 8 allocs/op BenchmarkMarshal BenchmarkMarshal-8 2103310 537.1 ns/op 29.79 MB/s 104 B/op 8 allocs/op PASS ``` Both decoding and encoding both go down to 8 allocations from 11 (decoding) and 10 (encoding) allocations per operation. More context at stellar/go#4022 (comment)

2opremio · 2021-11-16T01:22:08Z

I think that after all the recent PRs we can consider XDR encoding and decoding sufficiently optimized for now.

If we decide to do more work in that regard, I think we should try to reduce decoding allocations, which will be hard but not impossible. For instance we could try to intern the Assets much like string interning

2opremio · 2021-11-16T12:16:16Z

Now, with all the codec improvements in place, I have profiled a staging captive core ingestion instance (for 240 seconds).

It seems that ingestion is dominated by the state verifier:

2opremio · 2021-11-16T16:14:55Z

While the CPU execution is dominated by the state verifier, Go's profiler only measures CPU time and not wallclock time.

Go's profiler also provides tracing, which AFAIU can be used to infer IO wait time

Finally https://github.com/felixge/fgprof combines both, providing wallclock profiling.

I am going to evaluate both in staging and see what I get. See #4079

2opremio · 2021-11-16T17:36:54Z

Here is the fgprof time profile: pprof.samples.time.001.pb.gz

Unfortunately, although it looks very different at the top level:

When focusing on RunAllProcessorsOnLedger, it looks very similar to what it looked before:

So I am not sure fgprof is working well. It seems that, for some reason, it's missing DB queries.

bartekn · 2021-11-16T17:44:43Z

Add added metrics to specific batch inserts in #4080 and it turns out that Exec takes 34.91/37.35 = 93% of all time spent in TransactionProcessor (sample of 280 pubnet ledgers). So there's definitely something about the query itself (or some preprocessing of the query in Exec).

2opremio · 2021-11-17T09:15:52Z

I think it would also be useful to add tracing to db queries in general

This golang issue includes some alternatives: golang/go#18080

bartekn · 2021-11-19T14:23:23Z

It looks like all the XDR improvements and #4087 made TransactionProcessor fast again. Should we close this or move to backlog?

2opremio · 2021-11-19T18:02:28Z

Move it to the backlog. I think we will need go keep working on this.

Also (without downplaying the improvements) the transaction's processor is still the bottleneck, right?

I still think there is a lot to be gained by improving the performance of the batch insert builder in general.

bartekn · 2021-12-20T14:25:10Z

What do you think about closing this issue and creating specific issues connected to performance improvements? This one is very general and I guess could live in the backlog forever.

ire-and-curses added the horizon label Oct 21, 2021

2opremio mentioned this issue Nov 2, 2021

Reduce encoding and decoding allocations stellar/go-xdr#16

Merged

2opremio mentioned this issue Nov 2, 2021

Add stellar/go-xdr#16 (reduction on XDR encoding/decoding allocations) #4042

Merged

2opremio self-assigned this Nov 9, 2021

2opremio mentioned this issue Nov 16, 2021

horizon: Add fgprof and trace internal profiling endpoints #4079

Closed

This was referenced Nov 22, 2021

support/db: Add benchmark for BatchInsertBuilder #4086

Merged

support/db: PoC: Use COPY instead of INSERT in BatchInsertBuilder #4094

Closed

2opremio removed their assignment Dec 1, 2021

mollykarcher closed this as completed Sep 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Profile Horizon #4022

Profile Horizon #4022

2opremio commented Oct 21, 2021 •

edited

Loading

2opremio commented Oct 21, 2021

2opremio commented Oct 21, 2021 •

edited

Loading

2opremio commented Nov 2, 2021

leighmcculloch commented Nov 2, 2021

2opremio commented Nov 2, 2021

2opremio commented Nov 16, 2021

2opremio commented Nov 16, 2021 •

edited

Loading

2opremio commented Nov 16, 2021

2opremio commented Nov 16, 2021

bartekn commented Nov 16, 2021

2opremio commented Nov 17, 2021

bartekn commented Nov 19, 2021

2opremio commented Nov 19, 2021

bartekn commented Dec 20, 2021

Profile Horizon #4022

Profile Horizon #4022

Comments

2opremio commented Oct 21, 2021 • edited Loading

2opremio commented Oct 21, 2021

2opremio commented Oct 21, 2021 • edited Loading

2opremio commented Nov 2, 2021

leighmcculloch commented Nov 2, 2021

2opremio commented Nov 2, 2021

2opremio commented Nov 16, 2021

2opremio commented Nov 16, 2021 • edited Loading

2opremio commented Nov 16, 2021

2opremio commented Nov 16, 2021

bartekn commented Nov 16, 2021

2opremio commented Nov 17, 2021

bartekn commented Nov 19, 2021

2opremio commented Nov 19, 2021

bartekn commented Dec 20, 2021

2opremio commented Oct 21, 2021 •

edited

Loading

2opremio commented Oct 21, 2021 •

edited

Loading

2opremio commented Nov 16, 2021 •

edited

Loading