-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: compiler built with PGO crashes occasionally on ppc64{le} #60368
Comments
Change https://go.dev/cl/497455 mentions this issue: |
My current repro approach:
i.e., rebuild the cmd/dist over and over until we get a crash ("goroutine" should appear in stack trace of any compiler crash). |
Reapples CL 495596, which was reverted at CL 496185. The x/tools failure, #60263, has been resolved. The ppc64 failures, #60368, have _not_ been resolved, but are believed to be specific to that port. This CL will make ppc64 flaky while the issue is investigated, but give more soak time on primary ports. Build the compiler with PGO. As go build -pgo=auto is enabled by default, we just need to store a profile in the compiler's directory. The profile is collected from building all std and cmd packages on Linux/AMD64 machine, using profile.sh. This improves the compiler speed. On Linux/AMD64, name old time/op new time/op delta Template 138ms ± 5% 136ms ± 4% -1.44% (p=0.005 n=36+39) Unicode 147ms ± 4% 140ms ± 4% -4.99% (p=0.000 n=40+39) GoTypes 780ms ± 3% 778ms ± 4% ~ (p=0.172 n=39+39) Compiler 105ms ± 5% 99ms ± 7% -5.64% (p=0.000 n=40+40) SSA 5.83s ± 6% 5.80s ± 6% ~ (p=0.556 n=40+40) Flate 89.0ms ± 5% 87.0ms ± 6% -2.18% (p=0.000 n=40+40) GoParser 172ms ± 4% 167ms ± 4% -2.72% (p=0.000 n=39+40) Reflect 333ms ± 4% 333ms ± 3% ~ (p=0.426 n=40+39) Tar 128ms ± 4% 126ms ± 4% -1.82% (p=0.000 n=39+39) XML 173ms ± 4% 170ms ± 4% -1.39% (p=0.000 n=39+40) [Geo mean] 253ms 248ms -2.13% The profile is pretty transferable. Using the same profile, we see a bigger win on Darwin/ARM64, name old time/op new time/op delta Template 71.0ms ± 2% 68.3ms ± 2% -3.90% (p=0.000 n=20+20) Unicode 71.8ms ± 2% 66.8ms ± 2% -6.90% (p=0.000 n=20+20) GoTypes 444ms ± 1% 428ms ± 1% -3.53% (p=0.000 n=19+20) Compiler 48.9ms ± 3% 45.6ms ± 3% -6.81% (p=0.000 n=20+20) SSA 3.25s ± 2% 3.09s ± 1% -5.03% (p=0.000 n=19+20) Flate 44.0ms ± 2% 42.3ms ± 2% -3.72% (p=0.000 n=19+20) GoParser 76.7ms ± 1% 73.5ms ± 1% -4.15% (p=0.000 n=18+19) Reflect 172ms ± 1% 165ms ± 1% -4.13% (p=0.000 n=20+19) Tar 63.1ms ± 1% 60.4ms ± 2% -4.24% (p=0.000 n=19+20) XML 83.2ms ± 2% 79.2ms ± 2% -4.79% (p=0.000 n=20+20) [Geo mean] 127ms 121ms -4.73% For #60368. Change-Id: I2cec0fc85e21c38d57ba6f0e5e90cde5d443ebd2 Reviewed-on: https://go-review.googlesource.com/c/go/+/497455 Reviewed-by: Austin Clements <[email protected]> Run-TryBot: Michael Pratt <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Bryan Mills <[email protected]>
Found new dashboard test flakes for:
2023-05-23 18:15 linux-ppc64le-buildlet go@b0f15b4a cmd/cgo/internal/testshared.TestTrivialExecutable (log)
|
We have done some experimenting with this. I found that if we build the default.pgo file natively instead of using the one built on amd64 we are not able to get the problem to occur. When using the default pgo file built on amd64 more inlining happens within cmd/compile/internal/ssagen.(*state).stmt and according to the stacktraces that seems like where many of the problems occur. I was going to attempt to force more aggressive inlining and see if that forces the problem. |
Found new dashboard test flakes for:
2023-05-24 01:03 linux-ppc64le-power10osu go@bdc5533f cmd/compile/internal/ssa [build] (log)
|
Found new dashboard test flakes for:
2023-05-24 03:36 linux-ppc64le-buildlet go@f0d575c2 cmd/cgo/internal/testshared.TestTrivialExecutable (log)
|
Found new dashboard test flakes for:
2023-05-24 21:17 linux-ppc64le-power9osu go@78143d96 cmd/link.TestMachOBuildVersion (log)
|
Found new dashboard test flakes for:
2023-05-24 17:54 linux-ppc64-sid-power10 benchmarks@c4c5b3d1 go@1ddab592 google.golang.org/protobuf/internal/impl [build] (log)
|
Found new dashboard test flakes for:
2023-05-24 22:39 linux-ppc64le-power9osu go@79a8997a cmd/internal/obj/x86.TestVexEvexPCrelative (log)
|
Found new dashboard test flakes for:
2023-05-24 23:56 linux-ppc64-sid-buildlet go@6824765b internal/unsafeheader [build] (log)
2023-05-24 23:56 linux-ppc64-sid-buildlet go@6824765b internal/coverage/rtcov [build] (log)
2023-05-24 23:56 linux-ppc64-sid-buildlet go@6824765b internal/goexperiment [build] (log)
|
Found new dashboard test flakes for:
2023-05-24 23:57 linux-ppc64-sid-power10 go@2a4f4fc5 (log)
2023-05-25 00:04 linux-ppc64-sid-power10 go@04c62893 (log)
2023-05-25 00:13 linux-ppc64-sid-power10 go@869da4a2 cmd/compile/internal/ssa.TestDebugLines_53456 (log)
|
@laboger I ran a binary search on the flake, and isolated it to a single PGO inline. you can compare two versions of ssa.html with (full recipe): First checkout experiment CL, build a known-good (no PGO-inlining) compiler, and save it:
Then you can build a bad compiler, or a good compiler, with output to a nearby directory (../bad or ../good):
Note that
I have not had a chance to dig into this yet, it seems likely that you might be able to make faster progress than me, unless something obvious leaps out at me. |
Found new dashboard test flakes for:
2023-05-26 17:03 linux-ppc64le-buildlet go@1ff89009 runtime.TestCgoPprofPIE (log)
|
@mknyszek suggested turning off async preemption, and so far we are almost an hour in with no failure. |
It's at 100 minutes, still running, I will leave it running. |
Found new dashboard test flakes for:
2023-05-26 22:45 linux-ppc64le-buildlet go@260a9b0a cmd/compile/internal/ssa [build] (log)
|
6 hours, no failures. I can get failures with async preemption turned off, darn it. The failure rate seems to depend on whether the machine is busy, maybe? It was low Friday night, I thought that seemed fishy, so I added a second gomote on the same box, and now it fails more often, e.g. with a plain compiler (no special gossahash setting)
I am not happy. Because our mechanical servants don't get the weekend off, I started a pair of searches, one with async preemption disabled, the other not, partly for the exercise of looking for any other failure points, so far none have appeared (i.e., the binary search, which is randomized, is repeating itself). |
I can get it to fail pretty consistently a few times while running index0.go this way:
It only fails if I set GOMAXPROCS=4, which is the number of processors available on the OSU builders. |
The code you are looking for is https://cs.opensource.google/go/go/+/master:src/cmd/compile/internal/inline/inl.go;l=987. tl;dr, of this function is that a function is noramlly inlinable if its cost is < 80 ( |
Found new dashboard test flakes for:
2023-07-10 19:19 linux-ppc64-sid-power10 go@07ede7a5 os/exec.TestConcurrentExec (log)
|
The new failure is unrelated. |
Removed from the "Test Flakes" project so watchflakes will not look at this. It's been long enough so I think we're confident that the original issue is fixed. |
That resulted in I have re-added this issue to the “Test Flakes” project, but added an upper bound on the date to prevent |
Bummer. Sorry about that. Thanks for the fix. |
https://go.dev/cl/495596 added a
default.pgo
profile forcmd/compile
, enabling a PGO build of the compiler (as long as-pgo=none
is not set explicitly).This caused a variety of crashes in the compiler on ppc64{le} builders:
2023-05-18T16:55:07-88f89d8/linux-ppc64le-power9osu
2023-05-18T13:41:27-33a601b/aix-ppc64
2023-05-18T12:52:14-7b0835d/aix-ppc64
2023-05-18T10:23:17-75add1c/aix-ppc64
2023-05-18T09:16:07-774f602/linux-ppc64-sid-power10
2023-05-18T09:16:07-774f602/linux-ppc64le-buildlet
2023-05-18T09:15:25-27906bb/aix-ppc64
2023-05-18T09:15:25-27906bb/linux-ppc64-sid-power10
2023-05-18T01:40:37-6ed8474/linux-ppc64-sid-buildlet
2023-05-18T01:40:37-6ed8474/linux-ppc64-sid-power10
2023-05-18T00:35:53-956d31e/linux-ppc64le-buildlet
2023-05-17T22:11:31-0b86a04/linux-ppc64le-buildlet
2023-05-17T21:53:11-c426c87/linux-ppc64-sid-buildlet
2023-05-17T21:53:11-c426c87/linux-ppc64le-power10osu
2023-05-17T21:44:30-2693ade/linux-ppc64-sid-power10
That CL also caused #60263. Since it caused several issues, the CL was reverted in https://go.dev/cl/496185. #60263 has since been fixed.
Given these failures are all on ppc64{le}, @dr2chase and I suspect that they are due to a bad ppc64-specific optimization (SSA rule, e.g.) that is tickled by the additional inlining caused by PGO.
I have had some success reproducing these crashes. Running all.bash in a loop on three
linux-ppc64-sid-power10
builders concurrently withGOGC=5
usually gets me a failure in <30 minutes. Not stellar, but I think workable.We should then be able to bisect down to a bad function with GOSSAHASH applied in inlineCostOK to enable/disable PGO-based inlining. (Also set
-d=pgodevirtualize=0
to disable PGO-based devirtualization, which was submitted in https://go.dev/cl/492436, after the bad CL above was reverted).Given that there is a path forward to debugging ppc64, and we'd like more soak time on the primary ports, I intend to resubmit https://go.dev/cl/495596, which will make ppc64 flaky until this issue is resolved. (GOARCH=ppc64{le} could also temporarily change the default of -pgo from auto to none if necessary).
cc @golang/ppc64 @dr2chase @aclements @cherrymui
The text was updated successfully, but these errors were encountered: