Compiler: enable parallel codegen with MT #14748

Merged


ysbaddaden
Contributor

@ysbaddaden ysbaddaden commented Jun 25, 2024

Implements parallel codegen of object files when MT is enabled in the compiler (-Dpreview_mt).

It only affects codegen for compilations with more than one compilation unit (module), that is, when none of --single-module, --release or --cross-compile is specified. This behavior is identical to the fork-based codegen.

Advantages:

  • allows parallel codegen on Windows (untested);
  • no need to fork many processes;
  • no repeated GC collections in each forked process;
  • a simple Channel to distribute work efficiently (no need for IPC).
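As a rough illustration of the Channel-based distribution mentioned above, here is a minimal sketch. It is not the code from this pull request: `FakeUnit` and `codegen_in_parallel` are hypothetical names, and the "codegen" step only builds a file name. The feeder pushes units into a buffered channel and each worker fiber pulls from it until the channel is closed.

```crystal
# Hypothetical stand-in for a compilation unit (the compiler's real type is richer).
record FakeUnit, name : String

# Hypothetical helper: hands units to `n_workers` fibers through a single Channel
# and collects one result per unit. `n_workers` must be at least 1.
def codegen_in_parallel(units : Array(FakeUnit), n_workers : Int32) : Array(String)
  jobs = Channel(FakeUnit).new(n_workers)   # buffered, so the feeder blocks less often
  results = Channel(String).new(units.size) # large enough that workers never block on send

  n_workers.times do
    spawn do
      # receive? returns nil once the channel is closed and drained
      while unit = jobs.receive?
        results.send "#{unit.name}.o" # placeholder for the real codegen work
      end
    end
  end

  units.each { |unit| jobs.send unit }
  jobs.close

  # Completion order is not guaranteed, especially with -Dpreview_mt
  Array(String).new(units.size) { results.receive }
end

units = %w[main string array].map { |name| FakeUnit.new(name) }
puts codegen_in_parallel(units, 2).sort.join(", ")
```

Without -Dpreview_mt the fibers in this sketch all run on one thread, so it only demonstrates the distribution pattern, not an actual speedup.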

The main points are increased portability and simpler logic, despite having to take care of LLVM thread safety quirks (see comments).

Issues:

  1. The threads arg actually sets the number of fibers, not threads, which is confusing and problematic: increasing threads but not CRYSTAL_WORKERS will lead to more fibers than threads, with several fibers being scheduled on the same thread, which won't bring any improvement.

    In fact CRYSTAL_WORKERS defaults to 4, when threads defaulted to 8. With this patch it defaults to CRYSTAL_WORKERS, so MT can end up being slower if we don't specify CRYSTAL_WORKERS=8.

  2. This is still not as efficient as it could be. The main fiber (that feeds the worker fibers) can get blocked by a worker fiber doing codegen, leading the other workers to starve. This is easily noticeable when compiling with -O1 for example.

Both issues will be fixable with RFC 2 where we can start an explicit context to run the worker fibers or start N isolated contexts (maybe a better idea). Until then, one should increase CRYSTAL_WORKERS.
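To make the default described in issue 1 concrete, here is a small sketch of deriving a worker count from CRYSTAL_WORKERS with a fallback of 4. The helper name `worker_count` and the validation rules are assumptions for illustration; this is not how the compiler or the runtime actually parses the variable.

```crystal
# Hypothetical helper: turns the raw value of CRYSTAL_WORKERS into a worker count.
# Falls back to `default` (4, as stated above) when the value is missing,
# not an integer, or not positive.
def worker_count(raw : String?, default : Int32 = 4) : Int32
  count = raw.try(&.to_i?)
  if count && count > 0
    count
  else
    default
  end
end

puts worker_count(ENV["CRYSTAL_WORKERS"]?)
```

With this kind of default, a machine where the fork-based codegen used 8 processes would run only 4 workers unless CRYSTAL_WORKERS=8 is exported.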

Supersedes #14227 and doesn't segfault (so far) with LLVM 12 or LLVM 18.1 🤞

TODO:

  • wait for Compiler: refactor codegen #14760
  • cleanup
  • rename the method as mt_parallel(units, n_threads)
  • figure out thread safety of LLVM legacy pass manager (it's thread unsafe 💥)
  • consider increasing the channel size (until we can use ExecutionContext)
  • consider a CRYSTAL_CONFIG_WORKERS to configure the default number of workers at compile time instead of the hardcoded 4 (in a distinct PR)

@straight-shoota
Member

This looks great. But it also seems to be a mix of different changes. Could we extract the independent refactorings (such as extracting sequential_codegen and fork_codegen, memoization of some methods) to their own PRs?

@ysbaddaden ysbaddaden marked this pull request as draft June 28, 2024 12:49
@ysbaddaden ysbaddaden force-pushed the feature/compiler-mt-codegen-2 branch from 3a08d9e to ac91f7c Compare July 2, 2024 11:26
@ysbaddaden
Contributor Author

Rebased on top of #14760.

straight-shoota pushed a commit that referenced this pull request Aug 6, 2024
Refactors `Crystal::Compiler`:

1. extracts `#sequential_codegen`, `#parallel_codegen` and `#fork_codegen` methods;
2. merges `#codegen_many_units` into `#codegen` directly;
3. stops collecting reused units: `#fork_codegen` now updates `CompilationUnit#reused_compilation_unit?` state as reported by the forked processes, and `#print_codegen_stats` now counts & filters the reused units.

Prerequisite for #14748 that will introduce `#mt_codegen`.
When compiled with -Dpreview_mt the compiler will take advantage of the
MT environment to codegen the compilation units in parallel, avoiding
fork (that's not supported with MT) and allowing parallel codegen on
Windows.
@ysbaddaden ysbaddaden force-pushed the feature/compiler-mt-codegen-2 branch from 5303d59 to 96b6f77 Compare September 3, 2024 09:26
@ysbaddaden
Contributor Author

Rebased on master, which now includes #14760 (prerequisite); ready for review.

@ysbaddaden ysbaddaden marked this pull request as ready for review September 3, 2024 09:28
@beta-ziliani
Member

consider a CRYSTAL_CONFIG_WORKERS to configure the default number of workers at compile time instead of the hardcoded 4 (in a distinct PR)

Isn't this addressed (using CRYSTAL_WORKERS)? Or is it something different?

@ysbaddaden
Contributor Author

@beta-ziliani this could be a compile time ENV to change the default number of threads/schedulers. It's tangential to this pull request.

@straight-shoota straight-shoota added this to the 1.14.0 milestone Sep 19, 2024
@Fryguy
Contributor

Fryguy commented Sep 19, 2024

The RFC 2 link in the OP points to a nonexistent URL. I think it was supposed to be crystal-lang/rfcs#2?

@straight-shoota straight-shoota merged commit c74f6bc into crystal-lang:master Sep 21, 2024
65 checks passed
@ysbaddaden ysbaddaden deleted the feature/compiler-mt-codegen-2 branch September 22, 2024 08:11
@crysbot

crysbot commented Nov 25, 2024

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/why-cant-use-multi-core-when-compile-an-application-use-crystal-compiler/7435/2

@zw963
Contributor

zw963 commented Nov 25, 2024

Hi, I don't quite understand. How do I use this new feature in 1.14.0?

  1. Should we build the compiler itself with -Dpreview_mt to enable this feature?
  2. Or do we build the compiler as usual, and passing -Dpreview_mt when building an app enables it?

Thanks

@ysbaddaden
Contributor Author

  1. Yes.
  2. No.
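The reason behind these answers is that -Dpreview_mt is a compile-time flag of the program being built, so it has to be passed when building the compiler binary itself. A minimal sketch of how any Crystal program can check the flag it was built with (the method name `built_with_preview_mt?` is made up for this example):

```crystal
# Hypothetical helper: reports whether *this* program was compiled with -Dpreview_mt.
# The macro branch is resolved at compile time, so the answer is baked into the binary.
def built_with_preview_mt? : Bool
  {% if flag?(:preview_mt) %}
    true
  {% else %}
    false
  {% end %}
end

puts "preview_mt enabled: #{built_with_preview_mt?}"
```

Passing -Dpreview_mt to `crystal build` for an application changes this result for the application only; the compiler doing the build keeps whatever flags it was itself compiled with.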

@crysbot

crysbot commented Nov 25, 2024

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/why-cant-use-multi-core-when-compile-an-application-use-crystal-compiler/7435/6

@zw963
Contributor

zw963 commented Nov 26, 2024

Hi, I ran some tests and there seems to be no performance improvement. Steps to reproduce:

  1. Install the prebuilt Crystal 1.14.0 using asdf:
 ╰──➤ $ cr version
Crystal 1.14.0 [dacd97bcc] (2024-10-09)

LLVM: 18.1.6
Default target: x86_64-unknown-linux-gnu
  2. Build the Crystal compiler myself with preview_mt enabled:
FLAGS='--no-debug -Dpreview_mt' LDFLAGS='-s' make crystal
 ╰──➤ $ cr version
Crystal 1.14.0 [dacd97bcc] (2024-10-09)

LLVM: 18.1.8
Default target: x86_64-pc-linux-gnu
  3. Copy the same project twice; build one copy with compiler 1 and the other with compiler 2.

  4. Always delete all cached files in ~/.cache/crystal before each of the following steps.

  5. Build with compiler 1:

 ╰──➤ $ time crystal build src/procodile.cr

real    0m2.872s
user    0m5.200s
sys     0m1.399s
  6. Build with compiler 2:
 ╰──➤ $ export CRYSTAL_WORKERS=4

 ╰──➤ $ time crystal build src/procodile.cr

real    0m3.223s
user    0m5.126s
sys     0m0.928s

The latter is even slower. Did I do something wrong?

Thanks.

@crysbot

crysbot commented Nov 26, 2024

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/why-cant-use-multi-core-when-compile-an-application-use-crystal-compiler/7435/8

@Sija
Contributor

Sija commented Nov 26, 2024

@zw963 From the OP:

In fact CRYSTAL_WORKERS defaults to 4, when threads defaulted to 8. With this patch it defaults to CRYSTAL_WORKERS, so MT can end up being slower if we don't specify CRYSTAL_WORKERS=8.

@zw963
Contributor

zw963 commented Nov 26, 2024

Okay, I saw a small performance improvement when building one of my web projects.

old:

 ╰──➤ $ time crystal build src/college.cr

real    0m43.939s
user    0m48.787s
sys     0m8.349s

new:

 ╰──➤ $ CRYSTAL_WORKERS=8 time crystal build src/college.cr

real    0m39.098s
user    0m48.551s
sys     0m4.828s

The time was reduced by about 10%, and the reduction comes almost entirely from sys time. I guess the larger the project, the more noticeable the effect.

BTW: I don't see multiple cores being used even with parallel codegen enabled. Maybe this stage is so quick that there is no chance to see it in htop?
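For reference, the relative reductions can be recomputed from the `time` output of the two college.cr builds above. This is only arithmetic on the reported numbers; the variable and method names are made up for the example.

```crystal
# Relative reduction in percent between two timings given in seconds.
def reduction_percent(before : Float64, after : Float64) : Float64
  (before - after) / before * 100
end

# Timings copied from the two runs above (seconds).
old_run = {real: 43.939, user: 48.787, sys: 8.349}
new_run = {real: 39.098, user: 48.551, sys: 4.828}

puts "real: #{reduction_percent(old_run[:real], new_run[:real]).round(1)}%"
puts "user: #{reduction_percent(old_run[:user], new_run[:user]).round(1)}%"
puts "sys:  #{reduction_percent(old_run[:sys], new_run[:sys]).round(1)}%"
```

This gives about 11% less real time, under 1% less user time and about 42% less sys time, which matches the observation that user time is essentially unchanged.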
