[WIP] Always pass -x cu to nvcc #1047

Closed
wants to merge 1,103 commits into from

Conversation

trxcllnt
Contributor

@trxcllnt trxcllnt commented Sep 27, 2021

This PR ensures the -x cu language flag is not modified when constructing an nvcc compile string. When rewrite_includes_only is false, the -x cu argument is transformed to -x cu-cpp-output, causing nvcc to error with:

nvcc fatal   : Value 'cu-cpp-output' is not defined for option 'x'

Curiously, this only seems to show up when attempting a distributed compilation with sccache-dist. I have not encountered this issue doing local-only sccache builds. Does rewrite_includes_only affect only sccache-dist servers (or is it set to false in the server job request)?

cc: @robertmaynard

edge90 and others added 30 commits March 8, 2020 16:57
Otherwise we get warnings about unused imports on Windows.
This is a bit of a hack.  The integration tests under `src/test/`,
specifically `test_server_compile` have started timing out at some point
after the Windows builds were broken.  It's not obvious to me why it
should be this particular test, and not other tests.  But we're running
on Windows, and presumably running on some kind of VM in CI, so it seems
worth bumping up the timeout here.
Tests on appveyor timeout before the server starts.
BLAKE3 is designed to be a very high performance cryptographic hash. The
BLAKE3 team has shown 8.5x higher single-thread performance than SHA-512
on modern server hardware (AWS `c5.metal`). This change did not result
in a significant improvement to my observed local build times, but
newer hardware may see a meaningful improvement.

Signed-off-by: George Hahn <[email protected]>
This is used by chromium for example.
Even though Actions aren't supported on the main repo, they can still be
supported on personal repos, and people might proactively fix their
Windows bustage if GitHub sends them emails about it.  So let's keep
Actions, but turn off things we haven't enabled on Travis yet.
Removes a chunk from the readme regarding a false positive
rustc-wrapper entry not being used, which is closed since 1.40.0 .
Cargo issue: rust-lang/cargo#7745
omid and others added 16 commits November 14, 2021 12:27
Because the configuration is merged from both the environment and the
configuration file, it's possible to forget about overriding variables
related to one of the backends (e.g. by setting `SCCACHE_REDIS`). To
account for that and not have to explicitly list/remember all of the
supported env vars, we just don't inherit the sccache-related
environment at all when running this test.
I needed a way to make hashes CWD-dependent,
and it feels ugly to use other variables names for that purpose.
@sylvestre
Collaborator

Some tasks are failing (clippy, rustfmt, etc)
and some lines aren't covered by tests:
https://codecov.io/gh/mozilla/sccache/commit/8024345a32442df3cf9a8f21a8be6b5620fe7aea/

could you please fix these issues?
thanks

@trxcllnt
Contributor Author

trxcllnt commented Jan 27, 2022

@sylvestre sorry for the delay over the holidays. My original goal for this PR was to make sccache-dist work with nvcc. After diving in, I discovered some structural limitations in nvcc that mean the original small fix is insufficient.

I'm fairly close with the full fix -- I think I just need to ensure more things are packaged into the nvcc dist toolchain. If you don't mind, I'd like to keep this PR open as a draft/work-in-progress so I can keep pushing things and testing in CI 🙏.

@mitchhentges
Contributor

mitchhentges commented Feb 16, 2022

Sounds good to me, go for it Paul.
Would you mind explicitly marking this as Draft/WIP so that it is removed from the review queue?
(I'm assuming that published PRs can be re-drafted, and that they still trigger CI. If not, then if you wouldn't mind using [WIP] instead of [FIX] - because "FIX" could be interpreted as a makeshift "this is a bugfix" label - that would be 👍).

@trxcllnt trxcllnt changed the title [FIX] Always pass -x cu to nvcc [WIP] Always pass -x cu to nvcc Feb 18, 2022
@suo

suo commented Jul 6, 2022

I'm fairly close with the full fix -- I think I just need to ensure more things are packaged into the nvcc dist toolchain.

This is tantalizing 😛. I would also like to make sccache-dist support nvcc nicely. @trxcllnt do you mind explaining what the gaps are between this PR and the "full fix"? If you don't have time to push this one over the finish line I might have some time to try.

@trxcllnt
Contributor Author

trxcllnt commented Jul 6, 2022

@suo Yeah, I'd love some help. I did some more exploring since I posted last, so I'll try to describe my current thoughts in detail. I don't think it's worth building from this branch anymore since it doesn't have any of the work I describe below and is so far behind [email protected].

Disclaimer: I only know what's publicly available in docs and presentations. I don't have any special knowledge of nvcc internals or roadmap. I'm sure there are use cases and edge cases of which I'm not aware.

The main issue for sccache-dist is that nvcc is a sort of compiler-launcher, not the compiler itself. The sccache client expects to be able to preprocess a file to compute a hash, then send the preprocessed file contents (plus a compiler toolchain) to a worker and compile the preprocessed file. Unfortunately, compiling preprocessed input is not a supported nvcc run mode.

Executing nvcc <args> produces a list of sub-compiler invocations to the host compiler and NVIDIA compilers.

Here's an example of the sub-compiler invocations generated and executed by nvcc:
# Safe to run w/o creating `/tmp/x.cu` input file due to --dryrun
/usr/local/cuda/bin/nvcc \
    --generate-code=arch=compute_60,code=[sm_60] \
    --generate-code=arch=compute_70,code=[sm_70] \
    --generate-code=arch=compute_75,code=[compute_75,sm_75] \
    --generate-code=arch=compute_80,code=[compute_80,sm_80] \
    --generate-code=arch=compute_86,code=[compute_86,sm_86] \
    -c /tmp/x.cu -o /tmp/x.cu.o -DSCCACHE_TEST_DEFINE \
    --dryrun
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda/bin/..
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:
#$ PATH=/usr/local/cuda/bin/../nvvm/bin:/usr/local/cuda/bin:/usr/local/cuda/bin:/usr/local/cuda/nvvm/bin:/home/ptaylor/.nvm/versions/node/v16.15.1/bin:/home/ptaylor/.cargo/bin:/home/ptaylor/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/ptaylor/.fzf/bin:/home/ptaylor/.bin:/home/ptaylor/.local/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../targets/x86_64-linux/include"
#$ LIBRARIES= "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=

#$ gcc -D__CUDA_ARCH__=860 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-11_x.compute_86.cpp1.ii"
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed -arch compute_86 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0003a542_00000000-3_x.fatbin.c" -tused --gen_module_id_file --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.c" --stub_file_name "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.gpu" "/tmp/tmpxft_0003a542_00000000-11_x.compute_86.cpp1.ii" -o "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.ptx"

#$ gcc -D__CUDA_ARCH__=600 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-12_x.compute_60.cpp1.ii"
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed -arch compute_60 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0003a542_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-10_x.compute_60.cudafe1.c" --stub_file_name "/tmp/tmpxft_0003a542_00000000-10_x.compute_60.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0003a542_00000000-10_x.compute_60.cudafe1.gpu" "/tmp/tmpxft_0003a542_00000000-12_x.compute_60.cpp1.ii" -o "/tmp/tmpxft_0003a542_00000000-10_x.compute_60.ptx"
#$ ptxas -arch=sm_60 -m64 "/tmp/tmpxft_0003a542_00000000-10_x.compute_60.ptx" -o "/tmp/tmpxft_0003a542_00000000-13_x.compute_60.cubin"

#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-14_x.compute_70.cpp1.ii"
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed -arch compute_70 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0003a542_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-9_x.compute_70.cudafe1.c" --stub_file_name "/tmp/tmpxft_0003a542_00000000-9_x.compute_70.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0003a542_00000000-9_x.compute_70.cudafe1.gpu" "/tmp/tmpxft_0003a542_00000000-14_x.compute_70.cpp1.ii" -o "/tmp/tmpxft_0003a542_00000000-9_x.compute_70.ptx"
#$ ptxas -arch=sm_70 -m64 "/tmp/tmpxft_0003a542_00000000-9_x.compute_70.ptx" -o "/tmp/tmpxft_0003a542_00000000-15_x.compute_70.cubin"

#$ gcc -D__CUDA_ARCH__=750 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-16_x.compute_75.cpp1.ii"
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed -arch compute_75 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0003a542_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-8_x.compute_75.cudafe1.c" --stub_file_name "/tmp/tmpxft_0003a542_00000000-8_x.compute_75.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0003a542_00000000-8_x.compute_75.cudafe1.gpu" "/tmp/tmpxft_0003a542_00000000-16_x.compute_75.cpp1.ii" -o "/tmp/tmpxft_0003a542_00000000-8_x.compute_75.ptx"
#$ ptxas -arch=sm_75 -m64 "/tmp/tmpxft_0003a542_00000000-8_x.compute_75.ptx" -o "/tmp/tmpxft_0003a542_00000000-17_x.compute_75.sm_75.cubin"

#$ gcc -D__CUDA_ARCH__=800 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-18_x.compute_80.cpp1.ii"
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed -arch compute_80 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0003a542_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-7_x.compute_80.cudafe1.c" --stub_file_name "/tmp/tmpxft_0003a542_00000000-7_x.compute_80.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0003a542_00000000-7_x.compute_80.cudafe1.gpu" "/tmp/tmpxft_0003a542_00000000-18_x.compute_80.cpp1.ii" -o "/tmp/tmpxft_0003a542_00000000-7_x.compute_80.ptx"
#$ ptxas -arch=sm_80 -m64 "/tmp/tmpxft_0003a542_00000000-7_x.compute_80.ptx" -o "/tmp/tmpxft_0003a542_00000000-19_x.compute_80.sm_80.cubin"

#$ ptxas -arch=sm_86 -m64 "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.ptx" -o "/tmp/tmpxft_0003a542_00000000-20_x.compute_86.sm_86.cubin"
#$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=60,file=/tmp/tmpxft_0003a542_00000000-13_x.compute_60.cubin" "--image3=kind=elf,sm=70,file=/tmp/tmpxft_0003a542_00000000-15_x.compute_70.cubin" "--image3=kind=ptx,sm=75,file=/tmp/tmpxft_0003a542_00000000-8_x.compute_75.ptx" "--image3=kind=elf,sm=75,file=/tmp/tmpxft_0003a542_00000000-17_x.compute_75.sm_75.cubin" "--image3=kind=ptx,sm=80,file=/tmp/tmpxft_0003a542_00000000-7_x.compute_80.ptx" "--image3=kind=elf,sm=80,file=/tmp/tmpxft_0003a542_00000000-19_x.compute_80.sm_80.cubin" "--image3=kind=ptx,sm=86,file=/tmp/tmpxft_0003a542_00000000-6_x.compute_86.ptx" "--image3=kind=elf,sm=86,file=/tmp/tmpxft_0003a542_00000000-20_x.compute_86.sm_86.cubin" --embedded-fatbin="/tmp/tmpxft_0003a542_00000000-3_x.fatbin.c"
#$ rm /tmp/tmpxft_0003a542_00000000-3_x.fatbin

#$ gcc -D__CUDA_ARCH_LIST__=600,700,750,800,860 -E -x c++ -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D "SCCACHE_TEST_DEFINE" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/x.cu" -o "/tmp/tmpxft_0003a542_00000000-5_x.cpp4.ii"
#$ cudafe++ --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name "/tmp/x.cu" --orig_src_path_name "/tmp/x.cu" --allow_managed --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.cpp" --stub_file_name "tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.stub.c" --module_id_file_name "/tmp/tmpxft_0003a542_00000000-4_x.module_id" "/tmp/tmpxft_0003a542_00000000-5_x.cpp4.ii"
#$ gcc -D__CUDA_ARCH__=860 -D__CUDA_ARCH_LIST__=600,700,750,800,860 -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -m64 "/tmp/tmpxft_0003a542_00000000-6_x.compute_86.cudafe1.cpp" -o "/tmp/x.cu.o"

I formatted the output above to highlight the compiler phases.
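For anyone who wants to experiment with this output programmatically, here is a minimal Python sketch of how the `--dryrun` text could be split into environment assignments and sub-compiler commands. It assumes nvcc prints one `#$ `-prefixed entry per line; the function name `parse_dryrun` and the shortened sample text are hypothetical, not part of sccache or nvcc.

```python
import re
import shlex

# An entry is treated as an environment assignment when it starts with NAME=.
ENV_RE = re.compile(r"^([A-Za-z_]\w*)=\s*(.*)$")

def parse_dryrun(text):
    """Split `nvcc --dryrun` text into (env dict, list of argv lists).

    Hypothetical helper: assumes one `#$ `-prefixed entry per line.
    """
    env, commands = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("#$ "):
            continue  # ignore anything that is not a dry-run entry
        line = line[3:].strip()
        match = ENV_RE.match(line)
        if match:
            # Later assignments overwrite earlier ones (e.g. _TARGET_DIR_).
            env[match.group(1)] = match.group(2)
        else:
            commands.append(shlex.split(line))
    return env, commands

# Shortened, made-up sample in the same shape as the output above.
SAMPLE = """\
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ _HERE_=/usr/local/cuda/bin
#$ gcc -E -x c++ -D__CUDACC__ "/tmp/x.cu" -o "/tmp/x.cpp1.ii"
#$ ptxas -arch=sm_60 -m64 "/tmp/x.ptx" -o "/tmp/x.cubin"
"""

env, commands = parse_dryrun(SAMPLE)
for argv in commands:
    print(argv[0], "->", argv[-1])
```

Using `shlex.split` keeps quoted arguments such as `--cicc-cmdline="..."` together as a single argv entry, which matters if the commands are later re-executed elsewhere.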

Device-side compilation

nvcc executes the following steps for each GPU arch:

  1. host compiler preprocessor invocation (gcc -E)
  2. cicc on the result of step 1 to produce an intermediate PTX assembly .ptx file
  3. ptxas on the result of step 2 to produce a device code binary .cubin (for a single GPU arch)

And finally, a call to fatbinary to bundle all the .cubin files (plus the .ptx for any arch whose PTX is embedded) into a .fatbin.
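The per-arch grouping described here can be recovered from the command list itself. The sketch below is a rough Python illustration, assuming the flag spellings seen in the output above (`-D__CUDA_ARCH__=NNN` on the preprocessor call, `-arch compute_NN` for cicc, `-arch=sm_NN` for ptxas); the helper names and the abbreviated command list are hypothetical.

```python
import os
import re
from collections import defaultdict

def device_arch(argv):
    """Return the GPU arch ('60', '86', ...) a device-side command targets, else None."""
    tool = os.path.basename(argv[0])
    joined = " ".join(argv)
    if tool == "cicc":
        match = re.search(r"-arch compute_(\d+)", joined)
        return match.group(1) if match else None
    if tool == "ptxas":
        match = re.search(r"-arch=sm_(\d+)", joined)
        return match.group(1) if match else None
    if "-E" in argv:
        # Assumption: a preprocessor run defining __CUDA_ARCH__=860 belongs to arch 86.
        match = re.search(r"-D__CUDA_ARCH__=(\d+)", joined)
        return str(int(match.group(1)) // 10) if match else None
    return None

def group_by_arch(commands):
    """Map each arch to the ordered list of tools that nvcc runs for it."""
    stages = defaultdict(list)
    for argv in commands:
        arch = device_arch(argv)
        if arch is not None:
            stages[arch].append(os.path.basename(argv[0]))
    return dict(stages)

# Abbreviated commands in the same order as the dry-run output.
COMMANDS = [
    ["gcc", "-D__CUDA_ARCH__=860", "-E", "/tmp/x.cu", "-o", "/tmp/x.86.ii"],
    ["cicc", "-arch", "compute_86", "/tmp/x.86.ii", "-o", "/tmp/x.86.ptx"],
    ["gcc", "-D__CUDA_ARCH__=600", "-E", "/tmp/x.cu", "-o", "/tmp/x.60.ii"],
    ["cicc", "-arch", "compute_60", "/tmp/x.60.ii", "-o", "/tmp/x.60.ptx"],
    ["ptxas", "-arch=sm_60", "/tmp/x.60.ptx", "-o", "/tmp/x.60.cubin"],
    ["ptxas", "-arch=sm_86", "/tmp/x.86.ptx", "-o", "/tmp/x.86.cubin"],
    ["fatbinary", "-64", "--embedded-fatbin=/tmp/x.fatbin.c"],
    ["gcc", "-E", "/tmp/x.cu", "-o", "/tmp/x.cpp4.ii"],
    ["cudafe++", "--m64", "/tmp/x.cpp4.ii"],
    ["gcc", "-D__CUDA_ARCH__=860", "-c", "/tmp/x.cudafe1.cpp", "-o", "/tmp/x.cu.o"],
]

groups = group_by_arch(COMMANDS)
print(groups)
```

The fatbinary call and the three host-side commands are deliberately left out of the groups, since they are not tied to a single arch's preprocess/cicc/ptxas chain.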

Host-side compilation

The last three commands in the output:

  1. host compiler preprocessor invocation again (gcc -E)
  2. cudafe++ to embed the device-side's fatbin into the result of step 1
  3. a host compiler invocation to compile the host .cpp from step 2 to an object .o file

sccache-dist modifications

Here's a rough outline of what we'd need to do for sccache-dist:

  1. sccache client runs nvcc -E (like it does today) to compute the compile hash for cache lookups
  2. If no cached object exists, run nvcc <original-args> --dryrun to produce the host/device sub-compiler commands
  3. sccache client runs each <host-compiler> -E invocation (step 1 of the device-side and host-side phases above) and saves the output in the payload sent to the sccache-dist worker
  4. Send each preprocessed file, the cicc/ptxas/fatbinary/cudafe++/<host compiler> sub-compiler commands, and the minimal nvcc toolchain to the sccache-dist worker
  5. The sccache-dist worker executes each sub-compiler command and ultimately generates the final .o object file
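Steps 3 through 5 amount to partitioning the dry-run commands into those the client runs locally and those shipped to the worker. A minimal Python sketch of that split follows. It is only an illustration of the outline, not how sccache implements it; `split_plan` and the abbreviated command list are hypothetical, and it assumes that any command carrying a bare `-E` is a host-compiler preprocessor run.

```python
import os

def split_plan(commands):
    """Partition sub-compiler commands into (client-side, worker-side) lists.

    Hypothetical helper: commands containing a bare -E are treated as
    preprocessor runs that stay on the client; everything else is remote.
    """
    local, remote = [], []
    for argv in commands:
        (local if "-E" in argv else remote).append(argv)
    return local, remote

# Abbreviated single-arch command list in dry-run order.
COMMANDS = [
    ["gcc", "-D__CUDA_ARCH__=860", "-E", "/tmp/x.cu", "-o", "/tmp/x.cpp1.ii"],
    ["cicc", "-arch", "compute_86", "/tmp/x.cpp1.ii", "-o", "/tmp/x.ptx"],
    ["ptxas", "-arch=sm_86", "/tmp/x.ptx", "-o", "/tmp/x.cubin"],
    ["fatbinary", "-64", "--embedded-fatbin=/tmp/x.fatbin.c"],
    ["gcc", "-E", "/tmp/x.cu", "-o", "/tmp/x.cpp4.ii"],
    ["cudafe++", "--m64", "/tmp/x.cpp4.ii"],
    ["gcc", "-c", "/tmp/x.cudafe1.cpp", "-o", "/tmp/x.cu.o"],
]

local, remote = split_plan(COMMANDS)
# Files the client would need to include in the payload for the worker.
preprocessed = [argv[argv.index("-o") + 1] for argv in local]
remote_tools = [os.path.basename(argv[0]) for argv in remote]
print(preprocessed)
print(remote_tools)
```

Order is preserved within each list, so the worker-side commands can still run in the sequence nvcc would have used.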

One caveat may be supporting the --threads= option, since that allows nvcc to compile multiple architectures in parallel. We may need to ignore that flag when we send all compile jobs to one worker, or (ideally) send each cicc + ptxas pair to separate workers then perform the final host-linker step once they're done.
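If the simpler option is taken (ignoring the flag when everything goes to one worker), the argument list would need the threads option filtered out. Below is a small Python sketch of such a filter. The function name is hypothetical, and the accepted spellings (`--threads N`, `--threads=N`, `-t N`, `-t=N`) are my assumption about how the option can be written; check the nvcc documentation for the authoritative list.

```python
def strip_threads(args):
    """Return a copy of args without the nvcc threads option.

    Hypothetical helper. Assumed spellings: --threads N, --threads=N, -t N, -t=N.
    """
    out = []
    skip_next = False
    for arg in args:
        if skip_next:
            skip_next = False  # this token is the value of the previous flag
            continue
        if arg in ("--threads", "-t"):
            skip_next = True
            continue
        if arg.startswith("--threads=") or arg.startswith("-t="):
            continue
        out.append(arg)
    return out

original = ["nvcc", "--threads", "4", "-c", "/tmp/x.cu", "-o", "/tmp/x.cu.o"]
print(strip_threads(original))
```

Flags that merely begin with `-t`, such as the `-tused` argument passed to cicc in the output above, are left untouched.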

@suo

suo commented Jul 7, 2022 via email

@robertmaynard
Collaborator

robertmaynard commented Jul 8, 2022

Paul's outline is an amazing starting point. If my memory is correct, the code paths are slightly different when compiling directly to SASS (sm_XY), since you don't embed the PTX.

I also can't remember if -rdc affects the compilation process.

This isn't to take away from the effort, but we should be aware that whatever we design will most likely need changes to support all the complicated use cases.

@sylvestre
Collaborator

Sorry, I made a mistake. Could you please resend/recreate it if you still want to land it?

@trxcllnt
Contributor Author

trxcllnt commented Jan 9, 2023

@sylvestre no worries, it's safe to close this PR.

@trxcllnt
Contributor Author

@suo fyi, I have added nvcc support for sccache-dist in #2247.

@trxcllnt trxcllnt deleted the fix/invalid-nvcc-lang branch October 3, 2024 20:17