Illegal instruction crash in lighthouse cli on some CPUs #416
Comments
The solution to this would be for us to use our own machines for building. I'm confused as to why you're using GitHub for this. @dboreham what's stopping us from building this off GitHub? |
This is happening in CI jobs for ipld-eth-server, which hasn't been migrated to Gitea, but if we have CI infra ready on Gitea I can migrate it. |
We have the Gitea CI infra, but Gitea broke Actions for mirrored repos in the release we're running. |
As for the root cause, I wonder if the Lighthouse people never implemented their fix? (If they had, we wouldn't see the illegal instruction exception: we're using their only container image for lcli, built only a few weeks ago, and according to their commit history the problem was fixed in 2020.) Here: https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/lcli/Dockerfile#L8 We could build the lcli container ourselves with the flag turned on and see if that fixes the problem. |
It's passed by default into the build for the lcli image: https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/.github/workflows/docker.yml#L145 It's also part of the target suffix when building the lighthouse image, which is a bit convoluted: https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/.github/workflows/docker.yml#L78 |
Presumably they've introduced another version of the problem somehow? We should be running the right binary already. |
Yeah it doesn't quite make sense. I was speculating it could be an issue of shared libraries, since we are running lcli on the non-portable Fortunately this doesn't happen super often, so we could probably punt on this until there's more clarity. |
What makes you suspect we're not using the portable lighthouse container? |
Presumably it happens when GH picks a runner that has a funky CPU for the job. |
Right, that's what I figured
From my reading of their docker actions, the |
Ah, so we want "not modern" to fix this problem. I was thinking the other way around. All pretty frustrating since it's due to someone not understanding how to ship a binary that selects its vectorized code according to the CPU it's actually running on. If "modern" has worked for us so far, perhaps you're right in aiming to avoid running on an old CPU in CI as the optimal solution. Back to Gitea... |
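The runtime selection mentioned in the comment above can be sketched roughly as follows. This is a minimal illustration in Python, not how Lighthouse or its dependencies are actually built: the function names are hypothetical, and treating `adx` and `bmi2` as the features that gate the fast path is an assumption made for the example. The flag names themselves are ones Linux reports in `/proc/cpuinfo` on x86-64.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found (e.g. non-x86 or non-Linux)


# Hypothetical requirement for the optimized code path (assumption for this sketch).
REQUIRED_FOR_FAST_PATH = {"adx", "bmi2"}


def select_implementation(flags):
    """Pick 'optimized' only when every required flag is present."""
    return "optimized" if REQUIRED_FOR_FAST_PATH <= flags else "portable"


def detect_implementation(path="/proc/cpuinfo"):
    """Choose an implementation for the CPU this process is running on."""
    try:
        with open(path) as f:
            return select_implementation(parse_cpu_flags(f.read()))
    except OSError:
        # If the CPU cannot be inspected, fall back to the safe choice.
        return "portable"


if __name__ == "__main__":
    print(detect_implementation())
```

The point of the design is that the decision is made on the machine that runs the binary, not the machine that built it, and that the fallback is always the portable path.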
The fix for the Gitea Actions bug in mirrored repos has been merged into the 1.19 branch, so we can deploy (and get CI working again on git.vdb.to). |
lol this is from our Gitea CI just now:
|
CPU in the case above is: https://www.intel.com/content/www/us/en/products/sku/64589/intel-xeon-processor-e52667-15m-cache-2-90-ghz-8-00-gts-intel-qpi/specifications.html I will see if I can figure out exactly what is going on, since the lcli binary is supposedly compiled for a generic x86-64 target. |
This is now happening consistently on our (resurrected) CI infra, e.g.: https://git.vdb.to/cerc-io/stack-orchestrator/actions/runs/276 I'm going to see if I can get a VM on the same host to debug on. |
cerc/fixturenet-eth-lighthouse
Reproduced by extracting the
|
I built the container myself, to check if there is some build provenance issue with the image. Still fails:
|
So the problem is that the "portable" container build is in fact not portable. |
It turns out that their
The lighthouse (not lcli) build has targets of the form xxxx-portable, which likely are implemented properly. The result is that there is no such thing as a portable/non-modern lcli build. |
Filed bug against lighthouse: sigp/lighthouse#4370 |
The most obvious solution is for us to build our own lcli container image, which we need to do for ARM support anyway. |
With this fix, the container is built with portable binaries:
|
Now that we're building our own lcli container, this issue should be resolved. Closing. |
The build script for this image is intermittently hitting SIGILL on GitHub Actions runners when running lcli insecure-validators. Likely related to sigp/lighthouse#1395 and, if so, should be fixed by using the portable Docker images instead of the native ones.
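Because the failure is intermittent, it can help for a build script to report a SIGILL explicitly instead of a bare non-zero exit. The sketch below is one way to do that in Python; `describe_exit` and `run_and_check` are hypothetical helper names, not part of any tool mentioned in this issue. It relies on two conventions: Python's `subprocess` reports death by signal N as return code -N, and shells (and typically container runtimes) report it as 128 + N.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Map a return code to a signal name, or None for an ordinary exit."""
    if returncode < 0:
        signum = -returncode        # subprocess convention: killed by signal N -> -N
    elif returncode > 128:
        signum = returncode - 128   # shell convention: 128 + N (a program could
                                    # also exit with this code on its own)
    else:
        return None
    try:
        return signal.Signals(signum).name
    except ValueError:
        return None                 # not a known signal number


def run_and_check(cmd):
    """Run cmd and return (returncode, signal name or None)."""
    result = subprocess.run(cmd)
    return result.returncode, describe_exit(result.returncode)


if __name__ == "__main__":
    code, sig = run_and_check(sys.argv[1:])
    if sig == "SIGILL":
        print("command hit an illegal instruction; check the CPU of this runner")
    sys.exit(1 if code else 0)
```

Wrapping the lcli invocation from the build script in such a helper would make it clear in the CI log whether a failed job was the illegal-instruction case or something else.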