CUDA build rules may fail for patch releases #44626

Closed
fwyzard opened this issue Apr 4, 2024 · 10 comments · Fixed by cms-sw/cmsdist#9115

@fwyzard
Contributor

fwyzard commented Apr 4, 2024

While testing #44622 we found that the current CUDA build rules for device code can fail in a patch release when a package needs a device static library from another package.

Copying from this comment:

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-4509d9/38609/summary.html
COMMIT: 775a4bc
CMSSW: CMSSW_14_1_X_2024-04-04-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44622/38609/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found a compilation error when building:

Entering library rule at src/HeterogeneousTest/CUDAKernel/plugins
>> Compiling src/HeterogeneousTest/CUDAKernel/plugins/CUDATestKernelAdditionAlgo.cu
>> Compiling edm plugin src/HeterogeneousTest/CUDAKernel/plugins/CUDATestKernelAdditionModule.cc
>> Cuda Device Link tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/HeterogeneousTestCUDAKernelPlugins_cudadlink.o
nvlink error : Undefined reference to '_ZN3cms8cudatest13add_vectors_fEPKfS2_Pfm' in '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_1_X_2024-04-04-1100/static/el8_amd64_gcc12/libHeterogeneousTestCUDAKernel_nv.a:DeviceAdditionKernel.cu_nv.o' (target: sm_60)
gmake: *** [tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/HeterogeneousTestCUDAKernelPlugins_cudadlink.o] Error 255
>> Building edm plugin tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/libHeterogeneousTestCUDAKernelPlugins.so
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02831/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/bin/../lib/gcc/x86_64-redhat-linux-gnu/12.3.1/../../../../x86_64-redhat-linux-gnu/bin/ld.bfd: cannot find tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/HeterogeneousTestCUDAKernelPlugins_cudadlink.o: No such file or directory
collect2: error: ld returned 1 exit status
gmake: *** [tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/libHeterogeneousTestCUDAKernelPlugins.so] Error 1
Leaving library rule at src/HeterogeneousTest/CUDAKernel/plugins

@smuzaffar the error seems to point to a problem with our CUDA build rules.

If I add the unmodified package HeterogeneousTest/CUDADevice to the local area, the build succeeds.

Comparing the failing and working commands, the failing one uses

/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02831/el8_amd64_gcc12/external/cuda/12.4.0-db00bd44f20c40655446378926308f3f/bin/nvcc \
    -dlink \
    -L/data/user/fwyzard/CMSSW_14_1_X_2024-04-04-1100/static/el8_amd64_gcc12 \
    -L/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_14_1_X_2024-04-04-1100/static/el8_amd64_gcc12 \
    -L/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02831/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-04-03-2300/static/el8_amd64_gcc12 \
    -lHeterogeneousTestCUDAKernel_nv \
    -L/data/user/fwyzard/CMSSW_14_1_X_2024-04-04-1100/biglib/el8_amd64_gcc12 \
    -L/data/user/fwyzard/CMSSW_14_1_X_2024-04-04-1100/lib/el8_amd64_gcc12 \
    ...

while the working one uses

/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02831/el8_amd64_gcc12/external/cuda/12.4.0-db00bd44f20c40655446378926308f3f/bin/nvcc \
    -dlink \
    -L/data/user/fwyzard/CMSSW_14_1_X_2024-04-04-1100/static/el8_amd64_gcc12 \
    -L/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_14_1_X_2024-04-04-1100/static/el8_amd64_gcc12 \
    -L/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02831/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-04-03-2300/static/el8_amd64_gcc12 \
    -lHeterogeneousTestCUDAKernel_nv \
    -lHeterogeneousTestCUDADevice_nv \
    -L/data/user/fwyzard/CMSSW_14_1_X_2024-04-04-1100/biglib/el8_amd64_gcc12 \
    -L/data/user/fwyzard/CMSSW_14_1_X_2024-04-04-1100/lib/el8_amd64_gcc12 \
    ...

The difference is that the second adds

    -lHeterogeneousTestCUDADevice_nv \
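For longer link lines the same comparison can be done mechanically. The following is only a sketch: `missing_link_flags` is a hypothetical helper (not part of SCRAM or the CMSSW tools), and the two command strings are shortened stand-ins for the full nvcc commands shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the -l flags that appear in the second
# (working) link command but not in the first (failing) one.
# Each argument is a complete command line passed as a single string.
missing_link_flags() {
    # Split the failing command on whitespace and keep only its -l flags.
    failing_flags=$(printf '%s\n' "$1" | tr -s '[:space:]' '\n' | grep '^-l' | sort -u)
    # Do the same for the working command, then drop every flag that
    # also occurs in the failing command (exact, whole-line match).
    printf '%s\n' "$2" | tr -s '[:space:]' '\n' | grep '^-l' | sort -u |
        grep -vxF -e "$failing_flags"
}

# Shortened stand-ins for the two commands; the -L paths are placeholders.
failing='nvcc -dlink -L/some/static -lHeterogeneousTestCUDAKernel_nv'
working='nvcc -dlink -L/some/static -lHeterogeneousTestCUDAKernel_nv -lHeterogeneousTestCUDADevice_nv'

missing_link_flags "$failing" "$working"
# prints: -lHeterogeneousTestCUDADevice_nv
```

The function prints nothing (and returns a non-zero status from the final `grep`) when the working command has no extra `-l` flags.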

In fact $CMSSW_BASE/static/$SCRAM_ARCH/libHeterogeneousTestCUDADevice_nv.a exists (after adding the package locally and building it), while $CMSSW_RELEASE_BASE/static/$SCRAM_ARCH/libHeterogeneousTestCUDADevice_nv.a does not exist.

Is that because the base release (CMSSW_14_1_X_2024-04-04-1100) is a patch release?
$CMSSW_FULL_RELEASE_BASE/static/$SCRAM_ARCH/libHeterogeneousTestCUDADevice_nv.a does exist.

Maybe the CUDA build rules need to be updated to look under $CMSSW_FULL_RELEASE_BASE/static for the static libraries?
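To check where a given device library actually lives, something along these lines could be run after cmsenv. It is a minimal sketch, not the real build rule: `find_static_lib` is a hypothetical helper, and the search order (local area, then patch release, then full release) is an assumption made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the first existing copy of a static library,
# searching the local area, the (patch) release and the full release.
# The search order is an assumption, not taken from the SCRAM build rules.
find_static_lib() {
    wanted_lib="$1"
    for area in "$CMSSW_BASE" "$CMSSW_RELEASE_BASE" "$CMSSW_FULL_RELEASE_BASE"; do
        # Skip environment variables that are unset or empty.
        [ -n "$area" ] || continue
        if [ -f "$area/static/$SCRAM_ARCH/$wanted_lib" ]; then
            printf '%s\n' "$area/static/$SCRAM_ARCH/$wanted_lib"
            return 0
        fi
    done
    return 1
}

find_static_lib libHeterogeneousTestCUDADevice_nv.a || echo "not found in any area"
```

In the situation described above, this would be expected to print the path under $CMSSW_FULL_RELEASE_BASE when the package is not checked out locally, since that is the only one of the three areas where the library was found.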

@fwyzard
Contributor Author

fwyzard commented Apr 4, 2024

assign core, heterogeneous

@cmsbuild
Contributor

cmsbuild commented Apr 4, 2024

New categories assigned: core,heterogeneous

@Dr15Jones,@fwyzard,@makortel,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cmsbuild commented Apr 4, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Apr 4, 2024

A new Issue was created by @fwyzard.

@smuzaffar, @rappoccio, @sextonkennedy, @antoniovilela, @makortel, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Apr 5, 2024

A simple way to reproduce the problem (any recent patch release should work) is

cmsrel CMSSW_13_3_2_patch1
cd CMSSW_13_3_2_patch1/src
cmsenv
git cms-init
git cms-addpkg HeterogeneousTest/CUDAKernel
scram b

which results in

Entering library rule at HeterogeneousTest/CUDAKernel
>> Compiling  /data/user/fwyzard/CMSSW_13_3_2_patch1/src/HeterogeneousTest/CUDAKernel/src/DeviceAdditionKernel.cu
>> Cuda Device Code Obj tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/src/HeterogeneousTestCUDAKernel/DeviceAdditionKernel.cu_nv.o 
>> Cuda Device Code library tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/src/HeterogeneousTestCUDAKernel/libHeterogeneousTestCUDAKernel_nv.a 
Copying tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/src/HeterogeneousTestCUDAKernel/libHeterogeneousTestCUDAKernel_nv.a to productstore area:
>> Cuda Device Link tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/HeterogeneousTestCUDAKernelPlugins_cudadlink.o 
nvlink error   : Undefined reference to '_ZN3cms8cudatest13add_vectors_fEPKfS2_Pfm' in '/data/user/fwyzard/CMSSW_13_3_2_patch1/static/el8_amd64_gcc12/libHeterogeneousTestCUDAKernel_nv.a:DeviceAdditionKernel.cu_nv.o' (target: sm_60)
gmake: *** [config/SCRAM/GMake/Makefile.rules:1803: tmp/el8_amd64_gcc12/src/HeterogeneousTest/CUDAKernel/plugins/HeterogeneousTestCUDAKernelPlugins/HeterogeneousTestCUDAKernelPlugins_cudadlink.o] Error 255

@smuzaffar
Contributor

@fwyzard cms-sw/cmsdist#9115 (which contains the change cms-sw/cmssw-config@31ea067) should fix this for 14.1.X. I will backport the change to 14.0.

@fwyzard
Contributor Author

fwyzard commented Apr 5, 2024

@smuzaffar great, many thanks for the quick fix!

@fwyzard
Contributor Author

fwyzard commented Apr 5, 2024

+heterogeneous

@makortel
Contributor

makortel commented Apr 5, 2024

+core

@cmsbuild
Contributor

cmsbuild commented Apr 5, 2024

This issue is fully signed and ready to be closed.
