
Illegal instruction crash in lighthouse cli on some CPUs #416

Closed
roysc opened this issue May 29, 2023 · 26 comments · Fixed by #414

@roysc
Collaborator

roysc commented May 29, 2023

The build script for this image is intermittently hitting SIGILL on GitHub Actions runners when running lcli insecure-validators.
This is likely related to sigp/lighthouse#1395; if so, it should be fixed by using the portable Docker images instead of the native ones.

@roysc roysc self-assigned this May 29, 2023
@AFDudley

The solution to this would be for us to use our own machines for building; I'm confused as to why you're using GitHub for this. @dboreham, what's stopping us from building this off GitHub?

@roysc
Collaborator Author

roysc commented May 29, 2023

This is happening in CI jobs for ipld-eth-server, which hasn't been migrated to Gitea, but if we have CI infra ready on Gitea I can migrate it.

@dboreham
Collaborator

We have the Gitea CI infra, but Gitea broke Actions for mirrored repos in the release we're running.
The fix was merged to main yesterday. I'm working on making a patched build for Shane to deploy, since running mystery-meat main-branch code doesn't seem like a good idea.

@dboreham
Collaborator

dboreham commented May 29, 2023

As for the root cause, I wonder if the Lighthouse people never implemented their fix? (Because if they had, we wouldn't see the illegal instruction exception: we're using their only container image for lcli, built only a few weeks ago, and according to their commit history the problem was fixed in 2020.)

Here: https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/lcli/Dockerfile#L8
the Dockerfile provides for building a binary that runs on all x86 CPUs, but I can find no evidence that they actually set that flag (the default seems to be not to build the portable binary).

We could build the lcli container ourselves with the flag turned on and see if that fixes the problem.
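If we do try that, the build would presumably look something like this, passing the PORTABLE build arg the Dockerfile declares (the image tag here is illustrative):

$ git clone https://github.com/sigp/lighthouse.git && cd lighthouse
$ docker build -f lcli/Dockerfile --build-arg PORTABLE=true -t lcli-portable .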

@dboreham
Collaborator

Presumably they've introduced another version of the problem somehow? We should be running the right binary already.

@roysc
Collaborator Author

roysc commented May 29, 2023

Yeah, it doesn't quite make sense. I was speculating it could be an issue with shared libraries, since we are running lcli in the non-portable lighthouse container, but Rust binaries are statically linked by default.
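A quick way to check the linking on the binary extracted from the image (the local path is an assumption) would be:

$ file ./lcli
$ ldd ./lcli

A musl build would report "statically linked"; the default glibc build links libc dynamically but still compiles the Rust crates, including their SIMD code, into the binary.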

Fortunately this doesn't happen super often, so we could probably punt on this until there's more clarity.

@dboreham
Collaborator

What makes you suspect we're not using the portable lighthouse container?
This is the container we're using: https://github.com/cerc-io/stack-orchestrator/blob/main/app/data/container-build/cerc-lighthouse/Dockerfile#L1

@dboreham
Collaborator

Fortunately this doesn't happen super often, so we could probably punt on this until there's more clarity.

Presumably it happens when GH picks a runner that has a funky CPU for the job.

@roysc
Collaborator Author

roysc commented May 29, 2023

Presumably it happens when GH picks a runner that has a funky CPU for the job.

Right, that's what I figured

What makes you suspect we're not using the portable lighthouse container?

From my reading of their Docker actions, the -modern suffix is used for the non-portable builds: https://github.com/sigp/lighthouse/blob/c547a11b0da48db6fdd03bca2c6ce2448bbcc3a9/.github/workflows/docker.yml#L96

@dboreham
Collaborator

Ah, so we want "not modern" to fix this problem. I was thinking the other way around. It's all pretty frustrating, since it comes down to someone not understanding how to ship a binary that selects its vectorized code according to the CPU it's actually running on.

If "modern" has worked for us so far, perhaps you're right that avoiding old CPUs in CI is the optimal solution. Back to Gitea...

@dboreham
Collaborator

dboreham commented May 30, 2023

The fix for the Gitea "Actions broken for mirrored repos" bug has been merged into the 1.19 branch, so we can deploy (and get CI working again on git.vdb.to).

cerc-io/hosting#40

@dboreham
Collaborator

lol this is from our Gitea CI just now:

./build_cl.sh: line 48:    63 Illegal instruction     (core dumped) lcli insecure-validators --count $VALIDATOR_COUNT --base-dir $DATADIR --node-count $BN_COUNT
make: *** [Makefile:9: genesis-cl] Error 132

@dboreham
Collaborator

The CPU in the case above is this one: https://www.intel.com/content/www/us/en/products/sku/64589/intel-xeon-processor-e52667-15m-cache-2-90-ghz-8-00-gts-intel-qpi/specifications.html
It has only AVX, not AVX2.

I will see if I can figure out exactly what is going on, since the lcli binary is supposedly compiled for a generic x86-64 target.
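One way to confirm the mismatch would be to compare what the CPU advertises against what the binary contains (the binary path and the instruction list are illustrative, not from the build):

$ grep -o -w -E 'avx2?|bmi2|adx' /proc/cpuinfo | sort -u
$ objdump -d ./lcli | grep -c -E 'vpbroadcast|vpermq|mulx'

A non-zero count from the second command on this AVX-only host would explain the SIGILL.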

@roysc roysc reopened this May 31, 2023
@dboreham
Collaborator

This is now happening consistently on our (resurrected) CI infra, e.g.: https://git.vdb.to/cerc-io/stack-orchestrator/actions/runs/276
Going to see if I can get a VM on the same host to debug on.

@dboreham dboreham changed the title from "Sporadic crash when building cerc/fixturenet-eth-lighthouse" to "Illegal instruction crash in lighthouse cli on some CPUs" on Jun 1, 2023
@dboreham
Collaborator

dboreham commented Jun 4, 2023

Reproduced by extracting the lcli binary from the latest container image then running it with this command on an Atom CPU:

$  ./lcli insecure-validators --count 4 --base-dir $(pwd) --node-count 4
Validator 1
Illegal instruction (core dumped)

@dboreham
Collaborator

dboreham commented Jun 4, 2023

I built the container myself, to check whether there is some build provenance issue with the image. It still fails:

$ ./lcli-test insecure-validators --count 4 --base-dir $(pwd) --node-count 4
Validator 1
Illegal instruction (core dumped)

@dboreham
Collaborator

dboreham commented Jun 4, 2023

So the problem is that the "portable" container build is in fact not portable.

@dboreham
Collaborator

dboreham commented Jun 4, 2023

It turns out that their PORTABLE flag is not implemented:

$ grep -R PORTABLE .
./.github/workflows/docker.yml:                      --build-arg PORTABLE=true \
./lcli/Dockerfile:ARG PORTABLE
./lcli/Dockerfile:ENV PORTABLE $PORTABLE

The lighthouse (not lcli) build has targets of the form xxxx-portable, which are likely implemented properly.

The result is that there is no such thing as a portable/non-modern lcli build.
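Given the xxxx-portable pattern, a genuinely portable lcli build would presumably need wiring along these lines (a sketch only, not their actual Makefile; FEATURES is the variable the lighthouse build uses to select the portable feature):

# hypothetical target mirroring the lighthouse xxxx-portable pattern
install-lcli-portable:
	FEATURES=portable $(MAKE) install-lcli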

@dboreham
Collaborator

dboreham commented Jun 4, 2023

Filed bug against lighthouse: sigp/lighthouse#4370

@dboreham
Collaborator

dboreham commented Jun 4, 2023

The most obvious solution is for us to build our own lcli container image, which we need to do for ARM support anyway.

@dboreham
Collaborator

dboreham commented Jun 4, 2023

With this fix, the container is built with portable binaries:

diff --git a/lcli/Dockerfile b/lcli/Dockerfile
index 98f33f215..485ebfd5b 100644
--- a/lcli/Dockerfile
+++ b/lcli/Dockerfile
@@ -6,7 +6,7 @@ RUN apt-get update && apt-get -y upgrade && apt-get install -y cmake libclang-de
 COPY . lighthouse
 ARG PORTABLE
 ENV PORTABLE $PORTABLE
-RUN cd lighthouse && make install-lcli
+RUN if [ "$PORTABLE" = true ] ; then export FEATURES=portable; fi && cd lighthouse && make install-lcli

 FROM ubuntu:22.04
 RUN apt-get update && apt-get -y upgrade && apt-get clean && rm -rf /var/lib/apt/lists/*
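With this change, the earlier reproduction should pass on the AVX-only machine. A sketch of the check after rebuilding the image with --build-arg PORTABLE=true (the container/tag names and the in-image binary path are assumptions):

$ docker create --name lcli-tmp lcli-portable lcli
$ docker cp lcli-tmp:/usr/local/bin/lcli ./lcli-portable && docker rm lcli-tmp
$ ./lcli-portable insecure-validators --count 4 --base-dir $(pwd) --node-count 4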

@dboreham
Collaborator

Now that we're building our own lcli container, this issue should be resolved. Closing.
