Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First PyTorch tests for TorchScript inference CPU/CUDA #43475

Merged
merged 1 commit into from
Aug 1, 2024

Conversation

valsdav
Copy link
Contributor

@valsdav valsdav commented Dec 1, 2023

PR description:

This PR adds a new subpackage PhysicsTools/PyTorch, including the first tests for TorchScript
inference on CPU and CUDA.

The PR is built on top of #41162 and cmsdist PR cms-sw/cmsdist#8388, which is adding C++ PyTorch library support.

PR validation:

A simple model is exported in TorchScript and saved to disk. The model is loaded in cmssw and runned on CPU and CUDA.
The code to generate the model is included in the PR, but cannot be run directly in CMSSW, as we are not including the Python interface to PyTorch in cmsdist for the moment.

@smuzaffar @makortel @iarspider @wpmccormack

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 1, 2023

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43475/38037

  • This PR adds an extra 32KB to repository

  • Found files with invalid states:

    • PhysicsTools/PythonAnalysis/test/test_torch.py:
    • PhysicsTools/PyTorch/test/testTorchSimpleDnnCuda.cc:
  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 1, 2023

A new Pull Request was created by @valsdav (Davide Valsecchi) for master.

It involves the following packages:

  • PhysicsTools/PyTorch (****)
  • PhysicsTools/PythonAnalysis (analysis)

The following packages do not have a category, yet:

PhysicsTools/PyTorch
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@cmsbuild, @tvami can you please review it and eventually sign? Thanks.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

}
}

std::string testBasePyTorch::cmsswPath(std::string path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be feasible to use edm::FileInPath instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smuzaffar Should this binary file be placed in cms-data?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid adding binary files to cmssw. It is better to move it to cms-data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm changing this to use the cmsml docker image to generate the binary files for tests on the fly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This header file seems to be very tied to the testTorchSimpleDnn.cc. Would it be feasible to just include the contents of the header in the source file? Or is the header expected to be used by multiple source files in the future?

Same question for testBaseCUDA.h and testTorchSimpleDnnCUDA.cc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to follow the same approach as in the TensorFlow tests https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/TensorFlow/test/testBaseCUDA.h and I'm planning to add more tests with a similar structure

// runModel("/data/user/dvalsecc/simple_dnn.pt", cuda);

// return 0;
// }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this commented out code still useful?

// runModel("/data/user/dvalsecc/simple_dnn.pt", cuda);

// return 0;
// }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this commented out code still useful?

Comment on lines 137 to 145
<bin name="testTorchTimeSeries" file="time_serie_prediction.cpp">
<use name="pytorch"/>
<use name="pytorch-cuda"/>
</bin>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smuzaffar Does this test binary definition need any of the <iftool name="cuda">, or does the binary get ignored implicitly when cuda is not available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests came from the @iarspider PR #41162
We can move them in the new PyTorch subpackage I think.

#include "FWCore/Utilities/interface/Exception.h"
#include "FWCore/Utilities/interface/ResourceInformation.h"
#include "HeterogeneousCore/CUDAServices/interface/CUDAInterface.h"
#include "HeterogeneousCore/CUDAUtilities/interface/requireDevices.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This #include seems to be unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is used to test the presence of devices if (!cms::cudatest::testDevices())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I see the call to cms::cudatest::testDevices() is in testTorchSimpleDnnCUDA.cc, so this #include should be moved there.

On the other hand, that check is not really needed, because (nowadays) scram b runtests will skip the test if the node has no GPU. And in the case of USER_UNIT_TESTS=cuda scram b runtests we want the test program to fail. Therefore, I'd suggest to either remove the call to cms::cudatest::testDevices(), or if you want more clear error message in case of devices not being available, use the not testDevices() to print the clearer error message and fail the test program.

<use name="FWCore/ServiceRegistry"/>
<use name="FWCore/Utilities"/>
<use name="HeterogeneousCore/CUDAServices"/>
<use name="HeterogeneousCore/CUDAUtilities"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the dependence on CUDAUtilities is not really needed.

@makortel
Copy link
Contributor

makortel commented Dec 1, 2023

test parameters:

@makortel
Copy link
Contributor

makortel commented Dec 1, 2023

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 2, 2023

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c749ee/36272/summary.html
COMMIT: 590cdd2
CMSSW: CMSSW_14_0_X_2023-12-01-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/43475/36272/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c749ee/36272/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c749ee/36272/git-merge-result

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 48 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 1588
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 38152
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@fwyzard
Copy link
Contributor

fwyzard commented Dec 2, 2023

@valsdav coukd you add a README.md file for the new package, with a short description and the instructions for re-creating the model binary ?

@valsdav
Copy link
Contributor Author

valsdav commented Dec 4, 2023

@valsdav coukd you add a README.md file for the new package, with a short description and the instructions for re-creating the model binary ?

Hi @fwyzard sure, I will add a README and also improve the generation of the model binaries using apptainer and the cms-ml docker image for that.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 8, 2023

Pull request #43475 was updated. @wpmccormack, @valsdav, @tvami can you please check and sign again.

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c749ee/40662/summary.html
COMMIT: 2d2c5ff
CMSSW: CMSSW_14_1_X_2024-07-28-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/43475/40662/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 6
  • DQMHistoTests: Total histograms compared: 37022
  • DQMHistoTests: Total failures: 1307
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 35715
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 5 files compared)
  • Checked 20 log files, 25 edm output root files, 6 DQM output files
  • TriggerResults: no differences found

@valsdav
Copy link
Contributor Author

valsdav commented Jul 30, 2024

+ml

@tvami
Copy link
Contributor

tvami commented Jul 30, 2024

+1

  • PR adds tests for PyTorch
  • tests pass

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @mandrenguyen, @antoniovilela, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@valsdav
Copy link
Contributor Author

valsdav commented Jul 30, 2024

Just wanted to add that a README with docs and a more refined user-facing interface for PyTorch models will be added in a later PR. We wanted to merge this one adding basic tests for the pytorch inference functionality.

@mandrenguyen
Copy link
Contributor

assign core
It's not immediately clear that all Matti's comments were addressed so let's let core sign, and then hopefully we can merge this long-standing PR.

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

makortel commented Aug 1, 2024

+Core

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 1, 2024

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit a00ad1a into cms-sw:master Aug 1, 2024
15 checks passed
@mandrenguyen
Copy link
Contributor

@makortel
Copy link
Contributor

makortel commented Aug 5, 2024

Are these unit test failures coming from this PR?

Yes. The cause is in

===== Test "testTorchSimpleDnn" ====
Running .cmd: apptainer exec -B /data/cmsbld/jenkins_a/workspace/ib-run-qa/CMSSW_14_1_X_2024-08-03-1100  /cvmfs/unpacked.cern.ch/registry.hub.docker.com/cmsml/cmsml:3.11  python /data/cmsbld/jenkins_a/workspace/ib-run-qa/CMSSW_14_1_X_2024-08-03-1100/src/PhysicsTools/PyTorch/test/create_simple_dnn.py /data/cmsbld/jenkins_a/workspace/ib-run-qa/CMSSW_14_1_X_2024-08-03-1100/test/el8_aarch64_gcc12/52de-5310-c164-e523
FATAL:   image targets 'amd64', cannot run on 'arm64'

i.e. we either need the container for aarch64, or restrict these tests to be run only on amd64.

Other tests in this package show a printout

[W803 13:27:40.025787660 CUDAFunctions.cpp:108] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (function operator())

Does PyTorch try to initialize CUDA every time? Is there a way to control the CUDA initialization explicitly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants