Investigate ghc.exe: getMBlocks: VirtualAlloc MEM_COMMIT failed: The paging file is too small for this operation to complete
CI failures
#1961
Upon a closer look, I'm not entirely sure that
This time, the error arises in a different module. Urk. In light of this, it's much less clear to me what is going on.
A plausible explanation for why the Windows GHC 9.2 and 9.4 jobs do not suffer from this issue is that those versions of GHC use Windows' large address space allocator (see GHC#12576), whereas GHC 8.10 does not. While it is still a mystery to me why this class of build error has only recently started to appear in Windows GHC 8.10 jobs, it is unclear whether it is worth the effort to debug this further, given that GHC 9.2+ has a more principled way of allocating large address spaces. In light of this, I propose that we:
Resolves #1961, in the sense that the issue is now worked around.
The logic for retrying the `cabal build` command three consecutive times upon failure was apparently added to work around macOS-related dylib issues, but it is unclear whether these issues still exist (and unlikely that they do). Moreover, this logic has the distinct disadvantage of potentially masking more serious issues, such as those observed in #1961. In principle, `cabal build` should not be an especially flaky part of the CI: it should either work 100% of the time or fail 100% of the time. Let's just remove the `retry cabal build` logic to make it easier to notice issues with `cabal build` in the future. We can always revisit this choice later if need be.
The `SAWScript.Interpreter` file is one of the largest in the SAW codebase, currently clocking in at 4765 lines of code. What's more, most of this file consists of the monolithic `primitives` table, which consists of thousands of string literals that document all of the SAWScript commands. While no single entry in `primitives` is likely that taxing to compile times in isolation, each little bit does add up every time a new SAWScript command is added.

Things are finally at a tipping point, however, as the amount of RAM required to compile `SAWScript.Interpreter` is causing the CI to crash. More specifically, in pull request #1958, which adds two commands to `primitives`, the Windows GHC 8.10.7 CI job has been observed to fail consistently with this error:

> ghc.exe: getMBlocks: VirtualAlloc MEM_COMMIT failed: The paging file is too small for this operation to complete

Ouch. While it is curious that this error only happens with the Windows GHC 8.10.7 job (and none of the other Windows jobs), it suggests that the current status quo is fragile. Googling for this error message suggests that there is not much we can do besides using less RAM (and indeed, the GitHub Actions CI machines are somewhat constrained on RAM).
One thing that we could quite easily do to reduce RAM requirements is to split up the `primitives` table a bit. It currently contains every SAWScript definition, but we could divide it into smaller subtables. As a starting point, we could have separate tables for the LLVM-related commands, JVM-related commands, MIR-related commands, Heapster-related commands, etc., and then `primitives` would consist of the `union` of all these tables. This would likely have a big impact on maximum RAM requirements without too much effort, and it is worth a try before considering more drastic measures, such as putting the commands' documentation in external files (as suggested here).