Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lake segmentation fault #2632

Closed
1 task done
yangky11 opened this issue Oct 6, 2023 · 19 comments · Fixed by #2896
Closed
1 task done

Lake segmentation fault #2632

yangky11 opened this issue Oct 6, 2023 · 19 comments · Fixed by #2896
Labels
bug Something isn't working Lake Lake related issue

Comments

@yangky11
Copy link

yangky11 commented Oct 6, 2023

Prerequisites

  • Put an X between the brackets on this line if you have done all of the following:
    • Check that your issue is not already filed.
    • Reduce the issue to a minimal, self-contained, reproducible test case. Avoid dependencies to mathlib4 or std4.

Description

I get segmentation faults when running any lake commands, including lake build and lake clean. The problem only happens when all of the following three conditions are met:

  1. The package depends on LeanInfer or is LeanInfer itself.
  2. The package uses leanprover/lean4:v4.2.0-rc1 (No problem with leanprover/lean4:v4.0.0).
  3. The package has been built or partially built (No problem with the first run of lake build).

Steps to Reproduce

  1. Check out the LeanInfer-demo branch of lean4-example.
  2. Edit lean4-example's lean-toolchain file to use leanprover/lean4:v4.2.0-rc1.
  3. lake build
  4. Either wait lake build to finish or just kill it in the middle. Then run lake build for a second time and you'll see Segmentation fault (core dumped). In this state, running any lake command will run into segmentation faults, unless you remove the lakefile.olean of LeanInfer.

Versions

Lean (version 4.2.0-rc1, commit a62d2fd49796, Release)

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
@yangky11 yangky11 added the bug Something isn't working label Oct 6, 2023
@yangky11
Copy link
Author

yangky11 commented Oct 6, 2023

Hi @tydeu , FYI.

@tydeu tydeu added the Lake Lake related issue label Oct 6, 2023
@tydeu
Copy link
Member

tydeu commented Oct 6, 2023

@yangky11 Most segfaults with Lake generally come from an out-of-date lakefile.olean, Since the problem is fixed when you rm -rf lake-packages, the LeanInfer lakefile.olean is likely out-of-date. Try running rm lake-packages/LeanInfer/lakefile.olean and seeing if that fixes the segfault. The bug here is likely that something is amiss in Lake's out-of-date checking of the lakefile.olean.

When you say edit the lean-toolchain file, do you mean of the LeanInfer package itself or of the workspace (i.e., the package which depends on it)? The former could certainly produce such an error.

@yangky11
Copy link
Author

yangky11 commented Oct 6, 2023

@tydeu

Try running rm lake-packages/LeanInfer/lakefile.olean

Yes, it has a similar effect to rm -rf lake-packages. It fixes the segfault for one time, but the segfault will appear again in future lake build.

When you say edit the lean-toolchain file, do you mean of the LeanInfer package itself or of the workspace (i.e., the package which depends on it)? The former could certainly produce such an error.

I was talking about lean4-example's lean-toolchain.

@tydeu
Copy link
Member

tydeu commented Oct 6, 2023

@yangky11

Yes, it has a similar effect to rm -rf lake-packages. It fixes the segfault for one time, but the segfault will appear again in future lake build.

Are you saying that deleting the LeanInfer's lakefile.olean fixes the segfault for the next lake build, but that subsequent lake builds, even without further modifying the lean-toolchain, still produce segfaults? If so, that is very unexpected and quite confusing. I will need to spend some time figuring out what might be going on there.

@yangky11
Copy link
Author

yangky11 commented Oct 7, 2023

Are you saying that deleting the LeanInfer's lakefile.olean fixes the segfault for the next lake build, but that subsequent lake builds, even without further modifying the lean-toolchain, still produce segfaults?

@tydeu Yes, exactly. I'm not sure if this has anything to do with cloud release.

@kim-em
Copy link
Collaborator

kim-em commented Oct 15, 2023

Just confirming that I can reproduce the segfault per the instructions.

@tydeu
Copy link
Member

tydeu commented Oct 15, 2023

The segfault in the Reservoir testbed for LeanInfer implies that this bug applies to LeanInfer directly. That means the likely source of this problem is likely the lakefile.olean for std4, aesop, or LeanInfer.

@yangky11 The testbed also flagged a separate problem which is that the aesop-llm branch and the manifest revision no longer exist in aesop (as the changes were ported to master). I don't think this could be causing the segfault, but as I am stumped as to what is who knows?

@yangky11
Copy link
Author

yangky11 commented Oct 15, 2023

@tydeu

The segfault in the Reservoir testbed for LeanInfer implies that this bug applies to LeanInfer directly.

Yes, I just verified that the bug applies to both LeanInfer itself and its dependencies.

The testbed also flagged a separate problem which is that the aesop-llm branch and the manifest revision no longer exist in aesop

This is just outdated lake-manifest.json. I just fixed this problem, but it seems to have nothing to do with segfaults.

@yangky11
Copy link
Author

The problem seems to be LeanInfer's lakefile.olean. When working with LeanInfer directly (instead of its dependencies), removing lakefile.olean will fix the segfault (for one time).

@yangky11
Copy link
Author

Since lean4:v4.2.0-rc2 has been out, I tested using lean4:v4.2.0-rc2 and the problem still persists.

@tydeu
Copy link
Member

tydeu commented Oct 16, 2023

@yangky11 Yean. Unfortunately, debugging segfault issues in Lean is rather hard. There are a lot of moving pieces to this one and there is nothing very clearly amiss that is a likely root cause. I have been thinking about it a lot, but I do not have an answer yet. One thing that could help would be to see if we cannot reproduce a segfault like in a toy setup (i.e. a minimal working example for this bug would be great).

@yangky11
Copy link
Author

@tydeu Thanks! I'll try to simplify LeanInfer to create a minimal example.

@yangky11
Copy link
Author

yangky11 commented Oct 17, 2023

@tydeu This is a minimal example: https://github.com/lean-dojo/LeanInfer/tree/segfault.

This is how to reproduce the error:

(base) kaiyu@tensorlab-DGX:~/tmp$ git clone https://github.com/lean-dojo/LeanInfer
Cloning into 'LeanInfer'...
remote: Enumerating objects: 906, done.
remote: Counting objects: 100% (324/324), done.
remote: Compressing objects: 100% (158/158), done.
remote: Total 906 (delta 177), reused 295 (delta 156), pack-reused 582
Receiving objects: 100% (906/906), 201.55 KiB | 2.04 MiB/s, done.
Resolving deltas: 100% (516/516), done.
(base) kaiyu@tensorlab-DGX:~/tmp$ cd LeanInfer/ 
(base) kaiyu@tensorlab-DGX:~/tmp/LeanInfer$ git checkout segfault
Branch 'segfault' set up to track remote branch 'segfault' from 'origin'.
Switched to a new branch 'segfault'
(base) kaiyu@tensorlab-DGX:~/tmp/LeanInfer$ lake build
info: downloading component 'lean'
179.4 MiB / 179.4 MiB (100 %)  67.6 MiB/s ETA:   0 s
info: installing component 'lean'
[0/1] Building LeanInfer
(base) kaiyu@tensorlab-DGX:~/tmp/LeanInfer$ lake clean
Segmentation fault (core dumped)

It looks like the error is related to how we check arm64 vs. x64 in lakefile.lean.

@tydeu
Copy link
Member

tydeu commented Oct 17, 2023

@yangky11 Ah, that is what I feared it might be. I imagine the problem stems from how we are gluing the lakefile environment to the imported environment. I think we may need to wait for @Kha's expertise to debug this (as he wrote that glue code). However, I may give investigating it myself a try (or work on a workaround).

@kim-em
Copy link
Collaborator

kim-em commented Oct 17, 2023

@tydeu could you point to the code that does this gluing? I'm not sure I can help, but as I've been gluing environments together recently for the REPL and lean4checker I would at least be interested.

@tydeu
Copy link
Member

tydeu commented Oct 17, 2023

@semorrison The glue code is:

/- We could import the olean together with its imports using one
`importModules` call, but that would mean we pay for a full
`finalizeImports` each time, which is linear in the number of imported
constants and extension entries (in fact, it is paid two more times: when
marking the `Environment` object as multi-threaded, and when releasing
it). As most lakefiles use the same set of imports, we instead construct
an `Environment` for it only once, and then apply the lakefile contents
on top of it like the elaborator would. Thus the non-shared, part of the
`Environment` is very small. -/
let (mod, _) ← readModuleData olean
let mut env ← importModulesUsingCache mod.imports leanOpts 1024
-- Apply constants (does not go through the kernel, so order is irrelevant)
env := mod.constants.foldl add env
-- Apply extension entries (`PersistentEnvExtension.addEntryFn` is pure and
-- does not have access to the whole environment, so no dependency worries
-- here either)
let extDescrs ← persistentEnvExtensionsRef.get
let extNameIdx ← mkExtNameMap 0
for (extName, ents) in mod.entries do
if let some entryIdx := extNameIdx.find? extName then
for ent in ents do
env := extDescrs[entryIdx]!.addEntry env ent
return env

@kim-em
Copy link
Collaborator

kim-em commented Oct 18, 2023

Ooh, okay that is a scarier part of the API than I have previously used. :-)

@yangky11
Copy link
Author

yangky11 commented Nov 5, 2023

Hi @tydeu, I see a PR related to this issue has been merged. Has the problem been fixed? Will it be part of the next release?

@tydeu
Copy link
Member

tydeu commented Nov 5, 2023

@yangky11 run_io is part of the latest release candidate (e.g., v4.3.0-rc1) and can be used instead of the elab currently used by LeanInfer and thereby avoid this error. Thus, this bug should be fixed for LeanInfer! 😄 (elab itself stills segfaults, though.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Lake Lake related issue
Projects
None yet
3 participants