
Investigate ghc.exe: getMBlocks: VirtualAlloc MEM_COMMIT failed: The paging file is too small for this operation to complete CI failures #1961

Closed
RyanGlScott opened this issue Oct 15, 2023 · 2 comments · Fixed by #1962
Labels: performance (Issues that involve or include performance problems), tooling: CI (Issues involving CI/CD scripts or processes)

Comments

@RyanGlScott (Contributor)

The SAWScript.Interpreter file is one of the largest in the SAW codebase, currently clocking in at 4765 lines of code. What's more, most of the file is taken up by the monolithic primitives table, which consists of thousands of string literals documenting all of the SAWScript commands. While no single entry in primitives likely taxes compile times much in isolation, each little bit adds up every time a new SAWScript command is added.

Things are finally at a tipping point, however, as the amount of RAM required to compile SAWScript.Interpreter is causing the CI to crash. More specifically, in pull request #1958, which adds two commands to primitives, the Windows GHC 8.10.7 CI job has been observed to consistently fail with this error:

[94 of 94] Compiling SAWScript.Interpreter ( src\SAWScript\Interpreter.hs, D:\a\saw-script\saw-script\dist-newstyle\build\x86_64-windows\ghc-8.10.7\saw-script-1.0.0.99\build\SAWScript\Interpreter.o )
ghc.exe: getMBlocks: VirtualAlloc MEM_COMMIT failed: The paging file is too small for this operation to complete.

Ouch. It is curious that this error only happens in the Windows GHC 8.10.7 job (and in none of the other Windows jobs), but either way, it suggests that the current status quo is fragile. Googling for this error message suggests that there is not much we can do besides using less RAM (and indeed, the GitHub Actions CI machines are somewhat constrained on RAM).

One thing that we could do quite easily to reduce RAM requirements is to split up the primitives table a bit. It currently contains every SAWScript definition, but we could divide it into smaller subtables. As a starting point, we could have separate tables for the LLVM-related commands, JVM-related commands, MIR-related commands, Heapster-related commands, etc., and primitives would then consist of the union of all these tables. This would likely have a big impact on maximum RAM requirements without too much effort, and it is worth a try before considering more drastic measures, such as putting the commands' documentation in external files (as suggested here).
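To make the idea concrete, here is a minimal sketch of the proposed split. The Primitive type, the Map-based representation, and the doc strings below are hypothetical stand-ins for whatever SAWScript.Interpreter actually uses; the point is only that primitives becomes the union of smaller per-backend tables, each of which could live in its own module:

```haskell
import           Data.Map (Map)
import qualified Data.Map as Map

-- Hypothetical stand-in for the real entry type in SAWScript.Interpreter.
newtype Primitive = Primitive { primDoc :: String }

-- Each backend gets its own, much smaller table (placeholder entries shown).
llvmPrimitives, jvmPrimitives, mirPrimitives :: Map String Primitive
llvmPrimitives = Map.fromList
  [ ("llvm_verify", Primitive "Verify an LLVM function against a specification.") ]
jvmPrimitives = Map.fromList
  [ ("jvm_verify", Primitive "Verify a JVM method against a specification.") ]
mirPrimitives = Map.fromList
  [ ("mir_verify", Primitive "Verify a MIR function against a specification.") ]

-- The full table is just the union of the subtables, so no single module
-- has to compile thousands of string literals at once.
primitives :: Map String Primitive
primitives = Map.unions [llvmPrimitives, jvmPrimitives, mirPrimitives]
```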

@RyanGlScott added the performance (Issues that involve or include performance problems) and tech debt (Issues that document or involve technical debt) labels on Oct 15, 2023
@RyanGlScott changed the title from "Reduce RAM requirements for compiling primitives table in SAWScript.Interpreter" to "Investigate ghc.exe: getMBlocks: VirtualAlloc MEM_COMMIT failed: The paging file is too small for this operation to complete CI failures" on Oct 15, 2023
@RyanGlScott (Contributor, Author)

Upon closer inspection, I'm not entirely sure that SAWScript.Interpreter is to blame here. It turns out that the CI on the master branch of SAW is already triggering this error, as seen here:

[86 of 94] Compiling SAWScript.Builtins ( src\SAWScript\Builtins.hs, D:\a\saw-script\saw-script\dist-newstyle\build\x86_64-windows\ghc-8.10.7\saw-script-1.0.0.99\build\SAWScript\Builtins.o )
ghc.exe: getMBlocks: VirtualAlloc MEM_COMMIT failed: The paging file is too small for this operation to complete.

This time, the error arises in a different module. Urk.

Why doesn't the master branch's CI fail as a result of this? It's because the CI retries `cabal build` twice in case of spurious errors, and it just so happens that if you retry `cabal build` after this error on the master branch, GHC somehow succeeds on the second attempt. For whatever reason, this does not happen in #1958, where retrying `cabal build` does not get the build unstuck. Very bizarre.

In light of this, it's much less clear to me what is going on.
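For reference, the retry behavior described above amounts to roughly the following. The CI itself does this in shell; the Haskell sketch below, with a hypothetical retryBuild helper, is just an illustration of the retry-on-failure semantics (up to three total attempts, stopping at the first success):

```haskell
import System.Exit (ExitCode (..), exitWith)
import System.Process (system)

-- Attempt `cabal build` up to n times, returning as soon as one attempt
-- succeeds; the final attempt's exit code is propagated otherwise. On
-- master the second attempt happened to succeed after the VirtualAlloc
-- error; on #1958 every attempt failed the same way.
retryBuild :: Int -> IO ExitCode
retryBuild n
  | n <= 1    = system "cabal build"
  | otherwise = do
      code <- system "cabal build"
      case code of
        ExitSuccess   -> pure ExitSuccess
        ExitFailure _ -> retryBuild (n - 1)

main :: IO ()
main = retryBuild 3 >>= exitWith
```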

@RyanGlScott (Contributor, Author)

A plausible explanation for why the Windows GHC 9.2 and 9.4 jobs do not suffer from this issue is that those versions of GHC use Windows' large address space allocator (see GHC#12576), whereas GHC 8.10 does not. While it is still a mystery to me why this class of build error has only recently started to appear in the Windows GHC 8.10 jobs, it is unclear whether it is worth the effort to debug further, given that GHC 9.2+ has a more principled way of allocating large address spaces.

In light of this, I propose that we:

  • Change the CI so that the SAW binary artifacts are produced using GHC 9.2 rather than 8.10.
  • Mark the Windows GHC 8.10 job as allowed to fail for the time being, as it will no longer be critical for that job to succeed.
  • When we update SAW to support GHC 9.6 (and therefore move our GHC support window), we can finally drop all vestiges of GHC 8.10 from the CI.

@RyanGlScott added the tooling: CI (Issues involving CI/CD scripts or processes) label and removed the tech debt (Issues that document or involve technical debt) label on Oct 15, 2023
RyanGlScott added a commit that referenced this issue Oct 16, 2023
Resolves #1961, in the sense that the issue is now worked around.
RyanGlScott added a commit that referenced this issue Oct 16, 2023
The logic for retrying the `cabal build` command up to three times in a row upon
failure was apparently added to work around macOS-related dylib issues, but it
is unclear (and unlikely) whether these issues still exist. Moreover, this
logic has the distinct disadvantage of potentially masking more serious
issues, such as the ones observed in #1961.

In principle, `cabal build` should not be an especially flaky part of the CI—it
should either work 100% of the time or fail 100% of the time. Let's just remove
the `retry cabal build` logic to make it easier to notice issues with `cabal
build` in the future. We can always revisit this choice later if need be.
@mergify bot closed this as completed in #1962 on Oct 16, 2023