[WIP] Always pass `-x cu` to nvcc #1047
Conversation
Otherwise we get warnings about unused imports on Windows.
This is a bit of a hack. The integration tests under `src/test/`, specifically `test_server_compile`, started timing out at some point after the Windows builds were broken. It's not obvious to me why it should be this particular test and not others. But we're running on Windows, and presumably on some kind of VM in CI, so it seems worth bumping up the timeout here.
Tests on appveyor timeout before the server starts.
BLAKE3 is designed to be a very high performance cryptographic hash. The BLAKE3 team has shown 8.5x higher single-thread performance than SHA-512 on modern server hardware (AWS `c5.metal`). This change did not result in a significant improvement to my observed local build times, but newer hardware may see a meaningful improvement. Signed-off-by: George Hahn <[email protected]>
This is used by Chromium, for example.
Even though Actions aren't supported on the main repo, they can still be supported on personal repos, and people might proactively fix their Windows bustage if GitHub sends them emails about it. So let's keep Actions, but turn off things we haven't enabled on Travis yet.
Co-authored-by: Bert Belder <[email protected]>
Removes a chunk from the readme regarding a false positive about the rustc-wrapper entry not being used; the underlying issue has been fixed since Cargo 1.40.0. Cargo issue: rust-lang/cargo#7745
Co-authored-by: Bernhard Schuster <[email protected]>
Because the configuration is merged from both the environment and the configuration file, it's possible to forget about overriding variables related to one of the backends (e.g. by setting `SCCACHE_REDIS`). To account for that and not have to explicitly list/remember all of the supported env vars, we just don't inherit the sccache-related environment at all when running this test.
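The "don't inherit any sccache-related environment" approach can be sketched as a small filter over the inherited environment. This is a hypothetical helper (not sccache's actual test code), assuming all relevant variables share the `SCCACHE_` prefix:

```rust
use std::collections::HashMap;

/// Keep only environment variables that are NOT sccache-related.
/// Hypothetical helper: filtering by prefix avoids having to list
/// every backend variable (SCCACHE_REDIS, SCCACHE_BUCKET, ...).
fn scrubbed_env(vars: impl Iterator<Item = (String, String)>) -> HashMap<String, String> {
    vars.filter(|(k, _)| !k.starts_with("SCCACHE_")).collect()
}

fn main() {
    let vars = vec![
        ("PATH".to_string(), "/usr/bin".to_string()),
        ("SCCACHE_REDIS".to_string(), "redis://localhost".to_string()),
        ("SCCACHE_DIR".to_string(), "/tmp/cache".to_string()),
    ];
    let env = scrubbed_env(vars.into_iter());
    // Only the non-sccache variable survives the scrub.
    assert!(env.contains_key("PATH"));
    assert!(!env.contains_key("SCCACHE_REDIS"));
    assert_eq!(env.len(), 1);
}
```

A real test would start the child with `std::process::Command`, call `env_clear()`, and re-add only the filtered variables via `envs()` so the server under test starts from a clean slate.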
I needed a way to make hashes CWD-dependent, and it feels ugly to use other variable names for that purpose.
…eate custom gcc+nvcc toolchain tgz
Some tasks are failing (clippy, rustfmt, etc.); could you please fix these issues?
@sylvestre sorry for the delay over the holidays. My original goal for this PR was to make sccache always pass `-x cu` to nvcc. I'm fairly close with the full fix -- I think I just need to ensure more things are packaged into the nvcc dist toolchain. If you don't mind, I'd like to keep this PR open as a draft/work-in-progress so I can keep pushing things and testing in CI 🙏.
Sounds good to me, go for it Paul.
This is tantalizing 😛. I would also like to make sccache-dist support nvcc nicely. @trxcllnt do you mind explaining what the gaps are between this PR and the "full fix"? If you don't have time to push this one over the finish line I might have some time to try.
@suo Yeah, I'd love some help. I did some more exploring since I posted last, so I'll try to describe my current thoughts in detail. I don't think it's worth building from this branch anymore, since it doesn't have any of the work I describe below and is so far behind.

Disclaimer: I only know what's publicly available in the docs (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html) and presentations (https://on-demand.gputechconf.com/gtc/2013/presentations/S3185-Building-GPU-Compilers-libNVVM.pdf). I don't have any special knowledge of nvcc internals or roadmap. I'm sure there are use cases/edge cases of which I'm not aware.

The main issue for sccache-dist is that nvcc is a sort of compiler launcher, not the compiler itself. The sccache client expects to be able to preprocess a file to compute a hash, then send the preprocessed file contents (plus a compiler toolchain) to a worker and compile the preprocessed file. Unfortunately, compiling preprocessed input is not a supported nvcc run mode.

Executing `nvcc <args> --dryrun` prints the list of sub-compiler invocations to the host compiler and NVIDIA compilers. Here's an example of the sub-compiler invocations generated and executed by nvcc:

```shell
# Safe to run w/o creating `/tmp/x.cu` input file due to --dryrun
/usr/local/cuda/bin/nvcc \
  --generate-code=arch=compute_60,code=[sm_60] \
  --generate-code=arch=compute_70,code=[sm_70] \
  --generate-code=arch=compute_75,code=[compute_75,sm_75] \
  --generate-code=arch=compute_80,code=[compute_80,sm_80] \
  --generate-code=arch=compute_86,code=[compute_86,sm_86] \
  -c /tmp/x.cu -o /tmp/x.cu.o -DSCCACHE_TEST_DEFINE \
  --dryrun

#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda/bin/..
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:
#$ PATH=/usr/local/cuda/bin/../nvvm/bin:/usr/local/cuda/bin:/usr/local/cuda/bin:/usr/local/cuda/nvvm/bin:/home/ptaylor/.nvm/versions/node/v16.15.1/bin:/home/ptaylor/.cargo/bin:/home/ptaylor/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/ptaylor/.fzf/bin:/home/ptaylor/.bin:/home/ptaylor/.local/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../targets/x86_64-linux/include"
#$ LIBRARIES= "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -D__CUDA_ARCH__=860 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-11_x.compute_86.cpp1.ii"
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed -arch compute_86 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0003a542_00000000-3_x.fatbin.c" -tused --gen_module_id_file --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.c" --stub_file_name "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.gpu" "/tmp/tmpxft_0003a542_00000000-11_x.compute_86.cpp1.ii" -o "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.ptx"
#$ gcc -D__CUDA_ARCH__=600 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-12_x.compute_60.cpp1.ii"
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed -arch compute_60 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0003a542_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-10_x.compute_60.cudafe1.c" --stub_file_name "/tmp/tmpxft_0003a542_00000000-10_x.compute_60.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0003a542_00000000-10_x.compute_60.cudafe1.gpu" "/tmp/tmpxft_0003a542_00000000-12_x.compute_60.cpp1.ii" -o "/tmp/tmpxft_0003a542_00000000-10_x.compute_60.ptx"
#$ ptxas -arch=sm_60 -m64 "/tmp/tmpxft_0003a542_00000000-10_x.compute_60.ptx" -o "/tmp/tmpxft_0003a542_00000000-13_x.compute_60.cubin"
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-14_x.compute_70.cpp1.ii"
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed -arch compute_70 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0003a542_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-9_x.compute_70.cudafe1.c" --stub_file_name "/tmp/tmpxft_0003a542_00000000-9_x.compute_70.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0003a542_00000000-9_x.compute_70.cudafe1.gpu" "/tmp/tmpxft_0003a542_00000000-14_x.compute_70.cpp1.ii" -o "/tmp/tmpxft_0003a542_00000000-9_x.compute_70.ptx"
#$ ptxas -arch=sm_70 -m64 "/tmp/tmpxft_0003a542_00000000-9_x.compute_70.ptx" -o "/tmp/tmpxft_0003a542_00000000-15_x.compute_70.cubin"
#$ gcc -D__CUDA_ARCH__=750 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-16_x.compute_75.cpp1.ii"
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed -arch compute_75 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0003a542_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-8_x.compute_75.cudafe1.c" --stub_file_name "/tmp/tmpxft_0003a542_00000000-8_x.compute_75.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0003a542_00000000-8_x.compute_75.cudafe1.gpu" "/tmp/tmpxft_0003a542_00000000-16_x.compute_75.cpp1.ii" -o "/tmp/tmpxft_0003a542_00000000-8_x.compute_75.ptx"
#$ ptxas -arch=sm_75 -m64 "/tmp/tmpxft_0003a542_00000000-8_x.compute_75.ptx" -o "/tmp/tmpxft_0003a542_00000000-17_x.compute_75.sm_75.cubin"
#$ gcc -D__CUDA_ARCH__=800 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-18_x.compute_80.cpp1.ii"
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed -arch compute_80 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0003a542_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-7_x.compute_80.cudafe1.c" --stub_file_name "/tmp/tmpxft_0003a542_00000000-7_x.compute_80.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0003a542_00000000-7_x.compute_80.cudafe1.gpu" "/tmp/tmpxft_0003a542_00000000-18_x.compute_80.cpp1.ii" -o "/tmp/tmpxft_0003a542_00000000-7_x.compute_80.ptx"
#$ ptxas -arch=sm_80 -m64 "/tmp/tmpxft_0003a542_00000000-7_x.compute_80.ptx" -o "/tmp/tmpxft_0003a542_00000000-19_x.compute_80.sm_80.cubin"
#$ ptxas -arch=sm_86 -m64 "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.ptx" -o "/tmp/tmpxft_0003a542_00000000-20_x.compute_86.sm_86.cubin"
#$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=60,file=/tmp/tmpxft_0003a542_00000000-13_x.compute_60.cubin" "--image3=kind=elf,sm=70,file=/tmp/tmpxft_0003a542_00000000-15_x.compute_70.cubin" "--image3=kind=ptx,sm=75,file=/tmp/tmpxft_0003a542_00000000-8_x.compute_75.ptx" "--image3=kind=elf,sm=75,file=/tmp/tmpxft_0003a542_00000000-17_x.compute_75.sm_75.cubin" "--image3=kind=ptx,sm=80,file=/tmp/tmpxft_0003a542_00000000-7_x.compute_80.ptx" "--image3=kind=elf,sm=80,file=/tmp/tmpxft_0003a542_00000000-19_x.compute_80.sm_80.cubin" "--image3=kind=ptx,sm=86,file=/tmp/tmpxft_0003a542_00000000-6_x.compute_86.ptx" "--image3=kind=elf,sm=86,file=/tmp/tmpxft_0003a542_00000000-20_x.compute_86.sm_86.cubin" --embedded-fatbin="/tmp/tmpxft_0003a542_00000000-3_x.fatbin.c"
#$ rm /tmp/tmpxft_0003a542_00000000-3_x.fatbin
#$ gcc -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-5_x.cpp4.ii"
#$ cudafe++ --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.cpp" --stub_file_name "tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.stub.c" --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" "/tmp/tmpxft_0003a542_00000000-5_x.cpp4.ii"
#$ gcc -D__CUDA_ARCH__=860 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -m64 "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.cpp" -o "/tmp/x.cu.o"
```

I formatted the output above to highlight the compiler phases (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory__cuda-compilation-from-cu-to-executable).

Device-side compilation

nvcc executes the following steps for each GPU arch:

1. host compiler preprocessor invocation (`gcc -E`)
2. `cicc` on the result of step 1 to produce an intermediate PTX assembly `.ptx` file
3. `ptxas` on the result of step 2 to produce a device code binary `.cubin` (for a single GPU arch)

And finally, a call to `fatbinary` to link all the `.cubin` files into a `.fatbin`.

Host-side compilation

The last three lines of the output:

1. host compiler preprocessor invocation again (`gcc -E`)
2. `cudafe++` to embed the device-side fatbin into the result of step 1
3. a host compiler invocation to compile the host `.cpp` from step 2 to an object `.o` file

sccache-dist modifications

Here's a rough outline of what we'd need to do for sccache-dist:

1. The sccache client runs `nvcc -E` (like it does today) to compute the compile hash for cache lookups
2. If no cached object exists, it runs `nvcc <original-args> --dryrun` to produce the host/device sub-compiler commands
3. The sccache client runs each `<host-compiler> -E` invocation (steps 1 above) and saves the output in the payload sent to the sccache-dist worker
4. It sends each preprocessed file, the `cicc`/`ptxas`/`fatbinary`/`cudafe++`/`<host compiler>` sub-compiler commands, and the minimal nvcc toolchain to the sccache-dist worker
5. The sccache-dist worker executes each sub-compiler command and ultimately generates the final `.o` object file

One caveat may be supporting the `--threads` option (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-guiding-compiler-driver-threads), since that allows nvcc to compile multiple architectures in parallel. We may need to ignore that flag when we send all compile jobs to one worker, or (ideally) send each `cicc` + `ptxas` pair to separate workers, then perform the final host-linker step once they're done.

Wow, thank you for the detailed and very helpful response! This is definitely enough for me to start on. I'll have some free time in the coming weeks, so hopefully we can get it done.
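The `--dryrun` output nvcc prints is line-oriented: each line carries a `#$ ` prefix and is either an environment assignment or a sub-compiler command. A minimal sketch of the parsing that step 2 of the outline implies (hypothetical types, not sccache's actual code, and ignoring quoting edge cases):

```rust
/// One entry from `nvcc --dryrun` output: either an environment
/// assignment (e.g. `_NVVM_BRANCH_=nvvm`) or a sub-compiler command
/// line (e.g. `gcc -E ...`, `cicc ...`, `ptxas ...`).
#[derive(Debug, PartialEq)]
enum DryrunLine {
    EnvVar { name: String, value: String },
    Command(String),
}

/// Parse the `#$ `-prefixed lines nvcc prints under --dryrun.
/// Sketch only: commands such as `gcc -D__CUDA_ARCH__=860 ...` also
/// contain `=`, so a line only counts as an assignment when the text
/// before the first `=` is a single space-free name.
fn parse_dryrun(output: &str) -> Vec<DryrunLine> {
    output
        .lines()
        .filter_map(|l| l.strip_prefix("#$ "))
        .map(|l| match l.split_once('=') {
            Some((name, value)) if !name.is_empty() && !name.contains(' ') => DryrunLine::EnvVar {
                name: name.to_string(),
                value: value.to_string(),
            },
            _ => DryrunLine::Command(l.to_string()),
        })
        .collect()
}

fn main() {
    let out = "#$ _NVVM_BRANCH_=nvvm\n#$ gcc -D__CUDA_ARCH__=860 -E /tmp/x.cu\n#$ rm /tmp/x.fatbin\n";
    let lines = parse_dryrun(out);
    assert_eq!(lines.len(), 3);
    assert!(matches!(lines[0], DryrunLine::EnvVar { .. }));
    assert!(matches!(lines[1], DryrunLine::Command(_)));
    assert!(matches!(lines[2], DryrunLine::Command(_)));
}
```

A scheduler along the lines of the outline would then replay the assignments into each sub-command's environment and dispatch the command entries to workers in order.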
Paul's outline is an amazing starting point. If my memory is correct, the code paths are slightly different when compiling directly to SASS (`sm_XY`), since you don't embed the PTX. I also can't remember whether `-rdc` affects the compilation process. This isn't to take away from the effort, but we should be aware that whatever we design will most likely need changes to support all the complicated use cases.
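The SASS-versus-PTX distinction is visible in the `--generate-code` flags themselves: a `code=[sm_XY]` list requests SASS only, while `code=[compute_XY,sm_XY]` also embeds forward-compatible PTX (matching the `kind=ptx` entries fatbinary received only for those arches in the dry-run output). A hypothetical check, not sccache code:

```rust
/// Given one --generate-code value, report whether a PTX image is
/// embedded in the fatbin (any `compute_XY` entry in the code=[...]
/// list) in addition to SASS (`sm_XY` entries). Illustration only;
/// real nvcc flag syntax has more variants than handled here.
fn embeds_ptx(generate_code: &str) -> bool {
    generate_code
        .split("code=[")
        .nth(1)
        .and_then(|rest| rest.split(']').next())
        .map(|list| list.split(',').any(|c| c.trim().starts_with("compute_")))
        .unwrap_or(false)
}

fn main() {
    // SASS only: no PTX fallback is embedded for this arch.
    assert!(!embeds_ptx("arch=compute_60,code=[sm_60]"));
    // PTX + SASS: forward-compatible PTX is embedded too.
    assert!(embeds_ptx("arch=compute_75,code=[compute_75,sm_75]"));
}
```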
Sorry, I made a mistake; could you please resend/recreate it if you still want to land it?
@sylvestre no worries, it's safe to close this PR.
This PR ensures the `-x cu` language flag is not modified when constructing an nvcc compile string. When `rewrite_includes_only` is false, the `-x cu` argument is transformed to `-x cu-cpp-output`, causing nvcc to error.

Curiously, this only seems to show up when attempting a distributed compilation with `sccache-dist`. I have not encountered this issue doing local-only `sccache` builds. Does `rewrite_includes_only` only affect sccache-dist servers (or is it set to false in the server job request)?

cc: @robertmaynard
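As a sketch of the fix's intent (hypothetical helper, not the actual sccache implementation): gcc-style languages have a `*-cpp-output` form for already-preprocessed input, but nvcc has no working `cu-cpp-output` mode, so the `cu` language must pass through unchanged when rewriting `-x` for a preprocessed compile:

```rust
/// Map a -x language to the form used when compiling preprocessed
/// output. Hypothetical sketch: `c` and `c++` have gcc-recognized
/// *-cpp-output variants, but nvcc cannot consume preprocessed CUDA
/// input, so `cu` is deliberately left untouched.
fn preprocessed_language(lang: &str) -> &str {
    match lang {
        "c" => "cpp-output",
        "c++" => "c++-cpp-output",
        // The bug this PR addresses: transforming this to
        // "cu-cpp-output" makes nvcc error out.
        "cu" => "cu",
        other => other,
    }
}

fn main() {
    assert_eq!(preprocessed_language("c++"), "c++-cpp-output");
    assert_eq!(preprocessed_language("c"), "cpp-output");
    assert_eq!(preprocessed_language("cu"), "cu");
}
```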