build extensions in parallel #1882
@crcrpar Could you help review this PR? This reduces APEX build time a lot.
README.md
Outdated
@@ -130,7 +130,7 @@ CUDA and C++ extensions via
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext --cuda_ext --parallel 4" ./
Like the `--threads` option, this would increase CPU memory usage, so could you add separate example commands for `--threads` and `--parallel`?
Updated README.md
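The updated README text is not reproduced in this thread. As a rough sketch of what separate example commands could look like, the helper below only prints command lines assembled from flags quoted in this conversation (`--parallel` and `NVCC_APPEND_FLAGS="--threads N"`). `apex_build_cmd` is a hypothetical name used for illustration; it is not part of APEX or pip, and it runs nothing.

```bash
# Hypothetical dry-run helper: prints a build command, it does not execute it.
#   $1 = value for --parallel (empty string to omit the option)
#   $2 = value for nvcc --threads (empty string to omit the variable)
apex_build_cmd() {
    local parallel=$1 threads=$2
    local opts="--cpp_ext --cuda_ext"
    local prefix=""
    if [ -n "$parallel" ]; then
        opts="$opts --parallel $parallel"
    fi
    if [ -n "$threads" ]; then
        # Scoped to the one command, so nvcc is not affected globally.
        prefix="NVCC_APPEND_FLAGS=\"--threads $threads\" "
    fi
    printf '%spip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=%s" ./\n' "$prefix" "$opts"
}

# Extensions built in parallel only.
apex_build_cmd 4 ""
# nvcc threads only.
apex_build_cmd "" 4
```

Keeping the two options in separate commands makes it easier to see which one is responsible for any increase in CPU memory usage.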
1ef7547 to d9c3507
8e833de to f1ee5b1
README.md
Outdated
@@ -135,6 +135,13 @@ pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation -
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

To reduce the build time of APEX, parallel building can be enhanced via
```bash
export NVCC_APPEND_FLAGS="--threads 4"
I'd suggest not exporting this environment variable, since it affects nvcc globally.
Moved it to a temporary environment scope.
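The difference between exporting the variable and scoping it to a single command can be illustrated with a minimal sketch. The `sh -c` child process here is only a stand-in for the build command and is an assumption for illustration; it simply reports the value of `NVCC_APPEND_FLAGS` it inherits.

```bash
# Stand-in for the build: a child process that reports the flags it inherits.
report='echo "${NVCC_APPEND_FLAGS:-unset}"'

unset NVCC_APPEND_FLAGS

# Temporary scope: the assignment applies to this one child process only.
scoped=$(NVCC_APPEND_FLAGS="--threads 4" sh -c "$report")
after_scoped=$(sh -c "$report")

# Exported: every later child process started from this shell inherits it.
export NVCC_APPEND_FLAGS="--threads 4"
after_export=$(sh -c "$report")
unset NVCC_APPEND_FLAGS

echo "scoped:       $scoped"
echo "after scoped: $after_scoped"
echo "after export: $after_export"
```

With the prefix form, later nvcc invocations in the same shell are left untouched, which is the concern raised in the review.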
Thank you for implementing this nice option
Previous behaviour: extensions are built one at a time, even though the files within a single extension are compiled in parallel.
This pull request adds support for parallel building of multiple extensions. Benchmark results show:
- `--parallel 4`
- `--parallel 16`
- `NVCC_APPEND_FLAGS="--threads 8"`
- `--parallel 4`, `NVCC_APPEND_FLAGS="--threads 8"`
- `--parallel 16`, `NVCC_APPEND_FLAGS="--threads 8"`

- `NVCC_APPEND_FLAGS="--threads 8"`
- `--parallel 16`, `NVCC_APPEND_FLAGS="--threads 8"`
Memory usage is shown below. The "mem used" values are obtained using the `free` command, and background memory usage is included.
- `--parallel 16`
- `NVCC_APPEND_FLAGS="--threads 8"`
- `--parallel 16`, `NVCC_APPEND_FLAGS="--threads 8"`
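The thread does not say how often `free` was sampled or with which options. One possible way to collect a peak "mem used" figure is sketched below, assuming the procps `free -m` layout in which the `Mem:` row carries the used value in its third column. `mem_used_mib` and `peak_mem_used` are hypothetical helper names. Because `free` reports system-wide usage, the result includes background memory, as noted above.

```bash
# Extract the "used" column (third field) of the "Mem:" row from `free -m` output.
mem_used_mib() {
    awk '/^Mem:/ { print $3 }'
}

# Poll `free -m` while the process with the given PID is alive and print
# the highest "used" value seen (in MiB). $2 is the polling interval in seconds.
peak_mem_used() {
    local pid=$1 interval=${2:-1} peak=0 used
    while kill -0 "$pid" 2>/dev/null; do
        used=$(free -m | mem_used_mib)
        if [ "$used" -gt "$peak" ]; then
            peak=$used
        fi
        sleep "$interval"
    done
    echo "$peak"
}
```

A build could then be started in the background and watched with `peak_mem_used $!`.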
Image: nvcr.io/nvidia/pytorch:25.01-py3 (or other images with the same CUDA version and TORCH_CUDA_ARCH_LIST)
cmdline:
time NVCC_APPEND_FLAGS="--threads 8" pip wheel -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext --distributed_adam --distributed_lamb --cuda_ext --permutation_search --bnp --xentropy --focal_loss --group_norm --index_mul_2d --deprecated_fused_adam --deprecated_fused_lamb --fast_layer_norm --fmha --fast_multihead_attn --transducer --cudnn_gbn --peer_memory --nccl_p2p --fast_bottleneck --fused_conv_bias_relu --nccl_allocator --gpu_direct_storage --parallel 16" ./