[riscv] Update pytorch to version 2.4.0 #9293

Merged
merged 10 commits into from
Jul 25, 2024

Conversation

iarspider
Contributor

I will remove unused patches later, once tests pass.

@iarspider
Contributor Author

please test

@cmsbuild
Contributor

A new Pull Request was created by @iarspider for branch IB/CMSSW_14_1_X/master.

@aandvalenzuela, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Jul 10, 2024

cms-bot internal usage

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fd80f0/40328/summary.html
COMMIT: 69254bb
CMSSW: CMSSW_14_1_X_2024-07-10-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9293/40328/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:


-- Added CUDA NVCC flags for: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90
-- Found Torch: /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.3.1-6771a7c6e591586d225837d0ec8eb1c4/lib/libtorch.so  
-- Configuring incomplete, errors occurred!
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.BeUGR9 (%build)


RPM build errors:
line 37: It's not recommended to have unversioned Obsoletes: Obsoletes: external+pytorch-scatter+2.1.2-259c490acc9d0c047893f5c74b3cb7a4
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.BeUGR9 (%build)


@iarspider
Contributor Author

please test

@cmsbuild
Contributor

Pull request #9293 was updated.

@smuzaffar
Contributor

@iarspider, please check the pytorch logs; it looks like ROCm was not enabled properly. Either check the package documentation for what is required to enable it, or go through the cmake configuration. Maybe we are missing some environment variables that point to the ROCm distribution?

-- Looking for cgesdd_
-- Looking for cgesdd_ - found
-- Found a library with LAPACK API (open).
disabling ROCM because NOT USE_ROCM is set
-- MIOpen not found. Compiling without MIOpen support
disabling MKLDNN because USE_MKLDNN is not set
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found

@cmsbuild
Contributor

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fd80f0/40338/summary.html
COMMIT: e5fb08f
CMSSW: CMSSW_14_1_X_2024-07-10-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9293/40338/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found compilation error when building:

                 from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.3.1-6771a7c6e591586d225837d0ec8eb1c4/include/torch/csrc/autograd/autograd.h:3,
                 from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.3.1-6771a7c6e591586d225837d0ec8eb1c4/include/torch/csrc/api/include/torch/autograd.h:3,
                 from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.3.1-6771a7c6e591586d225837d0ec8eb1c4/include/torch/csrc/api/include/torch/all.h:7,
                 from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.3.1-6771a7c6e591586d225837d0ec8eb1c4/include/torch/csrc/api/include/torch/torch.h:3,
                 from src/PhysicsTools/PythonAnalysis/test/testTorch.cc:2:
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.3.1-6771a7c6e591586d225837d0ec8eb1c4/include/c10/util/typeid.h:311:1: error: missing braces around initializer for 'std::__array_traits<unsigned char, 38>::_Type' {aka 'unsigned char [38]'} [-Werror=missing-braces]
  311 | };
      | ^
In file included from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.3.1-6771a7c6e591586d225837d0ec8eb1c4/include/ATen/core/Dict.h:8,
                 from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.3.1-6771a7c6e591586d225837d0ec8eb1c4/include/ATen/core/ivalue_inl.h:8,
                 from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/pytorch/2.3.1-6771a7c6e591586d225837d0ec8eb1c4/include/ATen/core/ivalue.h:1555,


@iarspider
Contributor Author

Looks like you really can't use both CUDA and ROCm at the same time: https://github.com/pytorch/pytorch/blob/v2.3.1/aten/CMakeLists.txt#L71-L73 .

@smuzaffar
Contributor

smuzaffar commented Jul 11, 2024

Looks like you really can't use both CUDA and ROCm at the same time: https://github.com/pytorch/pytorch/blob/v2.3.1/aten/CMakeLists.txt#L71-L73 .

Ah, OK. By the way, this build has both CUDA and ROCm set to ON, so any idea why cmake did not fail with this error?

message(FATAL_ERROR "Both CUDA and ROCm are enabled and found. PyTorch can only be built with either of them. Please turn one off by using either USE_CUDA=OFF or USE_ROCM=OFF.")

Something must have disabled ROCm before cmake tried to configure aten.

@iarspider
Contributor Author

iarspider commented Jul 11, 2024

any idea why cmake did not fail with this error

Yes: it didn't find our ROCm installation (at least two environment variables need to be set).
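The exchange above can be illustrated with a minimal Python sketch. It is a toy model, not PyTorch's actual CMake code: the function name `configure_gpu_backends` and its parameters are hypothetical, and the only rule taken from the thread is that the fatal error fires when both backends are enabled and found. The sketch assumes that a backend that is requested but not located is simply switched off, which matches the "disabling ROCM" line in the log.

```python
def configure_gpu_backends(use_cuda, use_rocm, cuda_found, rocm_found):
    """Toy model of the configure-time decision discussed in this thread.

    Hypothetical helper, not PyTorch's CMake logic: it only encodes the
    quoted rule (fatal error when both backends are enabled AND found).
    """
    # Assumption: a backend that was requested but not located is
    # switched off quietly instead of aborting the configuration.
    cuda = use_cuda and cuda_found
    rocm = use_rocm and rocm_found
    if cuda and rocm:
        raise RuntimeError(
            "Both CUDA and ROCm are enabled and found. "
            "PyTorch can only be built with either of them."
        )
    return {"cuda": cuda, "rocm": rocm}


# Both backends requested, but the ROCm installation is not discovered:
# no error is raised and the build continues with CUDA only.
print(configure_gpu_backends(True, True, cuda_found=True, rocm_found=False))
```

Under this model, the build in question never reached the fatal error because ROCm had already dropped out before the mutual-exclusion check.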

@iarspider
Contributor Author

please test

@cmsbuild
Contributor

Pull request #9293 was updated.

@smuzaffar
Contributor

@iarspider, can you please open a separate PR to build pytorch with ROCm (i.e. disable CUDA and enable ROCm)?

@iarspider
Contributor Author

@smuzaffar will do.

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fd80f0/40595/summary.html
COMMIT: b821aff
CMSSW: CMSSW_14_1_X_2024-07-24-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9293/40595/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

patching file caffe2/CMakeLists.txt
Hunk #1 succeeded at 1413 (offset 50 lines).
patching file cmake/Dependencies.cmake
Hunk #1 succeeded at 1528 (offset -304 lines).
Hunk #2 succeeded at 1543 (offset -304 lines).
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.wzveUd (%prep)


RPM build errors:
line 37: It's not recommended to have unversioned Obsoletes: Obsoletes: external+pytorch_x86-64-v3+2.4.0-ed778d12fab7788e92d195fd7dd9a2a4
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.wzveUd (%prep)


@cmsbuild
Contributor

Pull request #9293 was updated.

@smuzaffar
Contributor

test parameters:

@smuzaffar
Contributor

please test

@smuzaffar
Contributor

please test for el9_amd64_gcc12

@smuzaffar
Contributor

please test for el8_aarch64_gcc12

@smuzaffar
Contributor

please test for el9_amd64_gcc13
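The architecture strings in these test requests appear to follow an `<os>_<cpu>_<compiler>` pattern. The Python sketch below splits such a comment into its parts. It is only an illustration inferred from the examples in this thread: `parse_test_command` is a hypothetical helper, and this is not how cms-bot itself parses commands.

```python
import re

# Assumed layout, inferred from the strings in this thread:
# <os><version>_<cpu>_<compiler><version>, e.g. el8_aarch64_gcc12.
ARCH_RE = re.compile(
    r"(?P<os>[a-z]+\d+)_(?P<cpu>[a-z0-9]+)_(?P<compiler>[a-z]+\d+)"
)


def parse_test_command(comment):
    """Return the parts of a 'please test [for <arch>]' comment, or None."""
    command = re.fullmatch(r"please test(?: for (\S+))?", comment.strip())
    if command is None:
        return None  # not a test request
    arch = command.group(1)
    if arch is None:
        return {"arch": None}  # no explicit architecture requested
    parts = ARCH_RE.fullmatch(arch)
    if parts is None:
        raise ValueError(f"unrecognised architecture string: {arch!r}")
    return {"arch": arch, **parts.groupdict()}


for text in ("please test", "please test for el8_aarch64_gcc12"):
    print(parse_test_command(text))
```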

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fd80f0/40606/summary.html
COMMIT: 69849e4
CMSSW: CMSSW_14_1_X_2024-07-23-1100/el9_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9293/40606/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

Patch #3 (pytorch-cuda-12_4):
+ patch --no-backup-if-mismatch -f -p1 --fuzz=0
patching file aten/src/ATen/core/boxing/impl/boxing.h
Hunk #1 FAILED at 38.
1 out of 1 hunk FAILED -- saving rejects to file aten/src/ATen/core/boxing/impl/boxing.h.rej
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.AUekCg (%prep)

RPM build warnings:
line 37: It's not recommended to have unversioned Obsoletes: Obsoletes: external+pytorch+2.4.0-35478613216697e41ba7cd6a3fbcfa71

RPM build errors:


@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fd80f0/40607/summary.html
COMMIT: 69849e4
CMSSW: CMSSW_14_1_X_2024-07-23-2300/el8_aarch64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9293/40607/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test TestIOPoolInputNoParentDictionary had ERRORS

@smuzaffar
Contributor

+externals

looks good

@smuzaffar smuzaffar merged commit e2ad566 into IB/CMSSW_14_1_X/master Jul 25, 2024
12 of 16 checks passed
@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_14_1_X/master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @rappoccio, @sextonkennedy, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fd80f0/40598/summary.html
COMMIT: 69849e4
CMSSW: CMSSW_14_1_X_2024-07-24-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9293/40598/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test TestIOPoolInputNoParentDictionary had ERRORS

Comparison Summary

Summary:

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fd80f0/40605/summary.html
COMMIT: 69849e4
CMSSW: CMSSW_14_1_X_2024-07-23-2300/el9_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9293/40605/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test TestIOPoolInputNoParentDictionary had ERRORS

Comparison Summary

Summary:

  • You potentially added 1289 lines to the logs
  • Reco comparison results: 56002 differences found in the comparisons
  • DQMHistoTests: Total files compared: 44
  • DQMHistoTests: Total histograms compared: 3325254
  • DQMHistoTests: Total failures: 352179
  • DQMHistoTests: Total nulls: 278
  • DQMHistoTests: Total successes: 2972777
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.224 KiB (43 files compared)
  • DQMHistoSizes: changed ( 10224.0 ): -0.192 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 13034.0 ): -0.255 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 250202.181 ): -0.048 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 25202.0 ): 0.719 KiB SiStrip/MechanicalView
  • Checked 191 log files, 161 edm output root files, 44 DQM output files
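
As a quick sanity check on the DQMHistoTests figures above, the Python sketch below adds up the reported categories and computes the failure fraction. It assumes that failures, nulls, successes and skipped are disjoint categories that together make up the total; the report does not state this, although the numbers are consistent with it.

```python
# Counts copied from the DQMHistoTests lines of the comparison summary.
dqm = {
    "total": 3325254,
    "failures": 352179,
    "nulls": 278,
    "successes": 2972777,
    "skipped": 20,
}

# Assumption: the four categories partition the compared histograms.
accounted = sum(dqm[key] for key in ("failures", "nulls", "successes", "skipped"))
print("categories add up to total:", accounted == dqm["total"])

failure_fraction = dqm["failures"] / dqm["total"]
print(f"failure fraction: {failure_fraction:.2%}")
```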

@smuzaffar smuzaffar deleted the pytorch-2.3.1 branch July 26, 2024 08:58