Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roll back to CUDA 12.1.1 [gcc 12] #8527

Conversation

smuzaffar
Copy link
Contributor

Reverts #8515

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for branch IB/CMSSW_13_2_X/g12.

@cmsbuild, @smuzaffar, @aandvalenzuela, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

Pull request #8527 was updated.

@smuzaffar
Copy link
Contributor Author

test parameters:

  • full_cmssw = true
  • enable = gpu

@smuzaffar
Copy link
Contributor Author

please test

@fwyzard
Copy link
Contributor

fwyzard commented May 30, 2023

Do we gain anything from CUDA 12.1 over 12.0 ?
If not, I would prefer to stay with 12.0 and keep the --src-in-ptx option.

@smuzaffar
Copy link
Contributor Author

@fwyzard , I am just testing that if cuda 12.1.1 and no --src-in-ptx flag allow us to move forward with gcc12/cuda12. Note that cuda 12.0.1 with gcc12 has build errors

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/el8_amd64_gcc12/CMSSW_13_2_X_2023-05-29-1100/RecoLocalTracker/SiStripClusterizer

gcc/x86_64-redhat-linux-gnu/12.2.1/include/avx512fp16intrin.h(38): error: vector_size attribute requires an arithmetic or enum type

@cmsbuild
Copy link
Contributor

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02c814/32849/summary.html
COMMIT: c852f8a
CMSSW: CMSSW_13_2_X_2023-05-29-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8527/32849/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02c814/32849/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02c814/32849/git-merge-result

Build

I found compilation error when building:

>> Compiling edm plugin /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_2_X_2023-05-29-1100/src/EventFilter/HGCalRawToDigi/plugins/HFNoseRawToDigiFake.cc
>> Compiling edm plugin /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_2_X_2023-05-29-1100/src/EventFilter/HGCalRawToDigi/plugins/HGCalRawToDigi.cc
>> Compiling edm plugin /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_2_X_2023-05-29-1100/src/EventFilter/HGCalRawToDigi/plugins/HGCalRawToDigiFake.cc
>> Compiling edm plugin /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_2_X_2023-05-29-1100/src/EventFilter/HGCalRawToDigi/plugins/HGCalSlinkEmulator.cc
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_2_X_2023-05-29-1100/src/EventFilter/HGCalRawToDigi/plugins/HGCalRawToDigi.cc: In member function 'virtual void HGCalRawToDigi::produce(edm::Event&, const edm::EventSetup&)':
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_2_X_2023-05-29-1100/src/EventFilter/HGCalRawToDigi/plugins/HGCalRawToDigi.cc:68:37: error: comparing the result of pointer addition '(ptr + ((sizetype)i))' and NULL [-Werror=address]
   68 |       data_32bit.emplace_back(((ptr + i) ? ((*(ptr + i) & 0xff) << 0) : 0) +
      |                                ~~~~~^~~~
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_2_X_2023-05-29-1100/src/EventFilter/HGCalRawToDigi/plugins/HGCalRawToDigi.cc:69:41: error: comparing the result of pointer addition '(ptr + (((sizetype)i) + 1))' and NULL [-Werror=address]
   69 |                               ((ptr + i + 1) ? ((*(ptr + i + 1) & 0xff) << 8) : 0) +
      |                                ~~~~~~~~~^~~~


@smuzaffar
Copy link
Contributor Author

smuzaffar commented May 30, 2023

even removing --source-in-ptx does not work for cuda12/gcc12 due to error [a] (which is there for both cuda 12.0.1 and 12.1.1)

[a]

>> Compiling  RecoLocalTracker/SiStripClusterizer/plugins/SiStripRawToClusterGPUKernel.cu
gcc/12.2.1/lib/gcc/x86_64-redhat-linux-gnu/12.2.1/include/avx512fp16intrin.h(38): error: vector_size attribute requires an arithmetic or enum type
                  __v8hf __attribute__ ((__vector_size__ (16)));

@fwyzard
Copy link
Contributor

fwyzard commented May 30, 2023

hi @smuzaffar , as a workaround you can add

#undef _Float16

right after

#include <cub/cub.cuh>

in RecoLocalTracker/SiStripClusterizer/plugins/SiStripRawToClusterGPUKernel.cu.

While this should be safe for us (we don't use _Float16 in SiStripRawToClusterGPUKernel.cu) I do not know if this can have any consequences in general.

I'll open a bug report with NVIDIA.

@fwyzard
Copy link
Contributor

fwyzard commented May 31, 2023

@fwyzard
Copy link
Contributor

fwyzard commented May 31, 2023

After feedback from NVIDIA, the workaround should be safe, see cms-sw/cmssw#41819 .

In the meantime, a fix is being prepared upstream, see NVIDIA/libcudacxx#476 .

@smuzaffar
Copy link
Contributor Author

thanks a lot @fwyzard for following this up. As you like to stay with cuda 12.0.1 with --source-in-ptx so I am closing this PR

@smuzaffar smuzaffar closed this May 31, 2023
@jrhemstad
Copy link

@fwyzard I wanted to go out of my way to thank you for providing such an excellent bug triage and reproducer in https://github.com/fwyzard/nvidia_bug_4139266.

We had been trying to track this down from other bug reports and it would have taken me forever to figure out the _Float16 macro conflict.

If only every bug report and reproducer was this good 😉.

@smuzaffar smuzaffar deleted the revert-8515-IB/CMSSW_13_2_X/g12_CUDA_12.0.1 branch June 1, 2023 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants