Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ROCm to version 6.1.0 #9143

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Apr 17, 2024

Add amd-smi and ROCProfiler binaries and libraries.

@fwyzard
Copy link
Contributor Author

fwyzard commented Apr 17, 2024

please test

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @fwyzard for branch IB/CMSSW_14_1_X/master.

@iarspider, @smuzaffar, @aandvalenzuela can you please review it and eventually sign? Thanks.
@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 17, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-200d6f/38900/summary.html
COMMIT: f76bc04
CMSSW: CMSSW_14_1_X_2024-04-17-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9143/38900/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation warning when building: See details on the summary page.

Add amd-smi and ROCProfiler binaries and libraries.
@fwyzard fwyzard force-pushed the IB/CMSSW_14_1_X/master_rocm_6.1.0 branch from f76bc04 to f1b8c12 Compare April 17, 2024 20:18
@fwyzard
Copy link
Contributor Author

fwyzard commented Apr 17, 2024

please test

@cmsbuild
Copy link
Contributor

Pull request #9143 was updated.

@cmsbuild
Copy link
Contributor

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-200d6f/38907/summary.html
COMMIT: f1b8c12
CMSSW: CMSSW_14_1_X_2024-04-17-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9143/38907/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-200d6f/38907/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-200d6f/38907/git-merge-result

Build

I found compilation error when building:

>> Leaving Package HeterogeneousCore/ROCmServices
>> Package HeterogeneousCore/ROCmServices built
Entering library rule at src/HeterogeneousCore/ROCmServices/plugins
>> Compiling edm plugin src/HeterogeneousCore/ROCmServices/plugins/ROCmMonitoringService.cc
>> Compiling edm plugin src/HeterogeneousCore/ROCmServices/plugins/ROCmService.cc
src/HeterogeneousCore/ROCmServices/plugins/ROCmService.cc:9:10: fatal error: rocm_version.h: No such file or directory
    9 | #include 
      |          ^~~~~~~~~~~~~~~~
compilation terminated.
gmake: *** [tmp/el8_amd64_gcc12/src/HeterogeneousCore/ROCmServices/plugins/HeterogeneousCoreROCmServicesPlugins/ROCmService.cc.o] Error 1
>> Building edm plugin tmp/el8_amd64_gcc12/src/HeterogeneousCore/ROCmServices/plugins/HeterogeneousCoreROCmServicesPlugins/libHeterogeneousCoreROCmServicesPlugins.so


@fwyzard
Copy link
Contributor Author

fwyzard commented Apr 18, 2024

please test with cms-sw/cmssw#44777

@cmsbuild
Copy link
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-200d6f/38939/summary.html
COMMIT: f1b8c12
CMSSW: CMSSW_14_1_X_2024-04-18-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9143/38939/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-200d6f/38939/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-200d6f/38939/git-merge-result

Unit Tests

I found 1 errors in the following unit tests:

---> test runtestPhysicsToolsPatAlgos had ERRORS

Comparison Summary

Summary:

@fwyzard
Copy link
Contributor Author

fwyzard commented Apr 19, 2024

runtestPhysicsToolsPatAlgos has been failing for weeks.

@smuzaffar do you think it would be possible to teach the bot to ignore failures that are already present in the base release ?

@smuzaffar
Copy link
Contributor

runtestPhysicsToolsPatAlgos has been failing for weeks.

@smuzaffar do you think it would be possible to teach the bot to ignore failures that are already present in the base release ?

we had discussed this in core sw meeting. One reason to not do this is that sometimes a PR is suppose to fix a failure but if it does not provide the right fix (and test still fails) then bot is going to ignore the error ( as we told it to ignore IB errors) and mark it as passed.

@smuzaffar
Copy link
Contributor

+externals

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_14_1_X/master IBs (test failures were overridden). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)
Notice This PR was tested with additional Pull Request(s), please also merge them if necessary: cms-sw/cmssw#44777

@fwyzard
Copy link
Contributor Author

fwyzard commented Apr 19, 2024

One reason to not do this is that sometimes a PR is suppose to fix a failure but if it does not provide the right fix (and test still fails) then bot is going to ignore the error ( as we told it to ignore IB errors) and mark it as passed.

What about ignoring the tests that are expected to fail (because they fail in the IB), but report if they actually pass ?
This way if a PR is supposed to fix a test, it should be easy to see that it does.

@smuzaffar
Copy link
Contributor

but if a PR is suppose to fix a test but it still fails during PR tests then bot will ignore the failure and mark PR test passed ... right?

@fwyzard
Copy link
Contributor Author

fwyzard commented Apr 19, 2024

but if a PR is suppose to fix a test but it still fails during PR tests then bot will ignore the failure and mark PR test passed ... right?

In such case I would expect the person that submitted the PR specifically to fix a test, the relevant L2, or the release manager, to check if the bot reports the test being fixed.

@smuzaffar
Copy link
Contributor

In such case I would expect the person that submitted the PR specifically to fix a test, the relevant L2, or the release manager, to check if the bot reports the test being fixed.

this is a big expectation :-) I will see if we can come up with reasonable solution

@cmsbuild
Copy link
Contributor

REMINDER @rappoccio, @sextonkennedy, @antoniovilela: This PR was tested with cms-sw/cmssw#44777, please check if they should be merged together

@fwyzard
Copy link
Contributor Author

fwyzard commented Apr 22, 2024

@cms-sw/orp-l2 could you merge this and cms-sw/cmssw#44777 for the next IB ?

@rappoccio
Copy link

+1

@cmsbuild cmsbuild merged commit 9e5e5f9 into cms-sw:IB/CMSSW_14_1_X/master Apr 22, 2024
9 of 10 checks passed
@fwyzard
Copy link
Contributor Author

fwyzard commented Apr 22, 2024

thanks

@fwyzard fwyzard deleted the IB/CMSSW_14_1_X/master_rocm_6.1.0 branch April 22, 2024 17:41
@makortel
Copy link
Contributor

Given cms-sw/cmssw#44821 maybe we should revert this update before building 14_1_0_pre3? @cms-sw/orp-l2

@fwyzard
Copy link
Contributor Author

fwyzard commented Apr 23, 2024

OK for me to revert.

The main reason for the update was to fix the linker warnings about executable stack, but we can live with them.

Does cms-sw/cmssw#44821 mean that we have to start building HIP/ROCm from sources ?

@makortel
Copy link
Contributor

Does cms-sw/cmssw#44821 mean that we have to start building HIP/ROCm from sources ?

That is certainly one way forward, but I'm not sure if we understand the full picture well enough to judge if that is the best (or only?) way forward. (e.g. I still wonder if there is something (non-)LTO-specific in this failure mode, or if the set of so-far-impacted IBs is just "luck")

@smuzaffar
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants