Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable TF XLA, add tensorflow-xla-runtime. #9005

Merged

Conversation

riga
Copy link
Contributor

@riga riga commented Feb 12, 2024

Hi!

This PR enables TensorFlow XLA and adds a new tool tensorflow-xla-runtime, which provides the library needed by AOT compiled ML models.

Changes in particular:

  • Enable XLA flag
  • Add tf2xla_supported_ops to bazel targets, providing an executable that lists XLA and AOT compatible TF ops for the current env.
  • Added tensorflow-xla-runtime spec, which is based on tensorflow-source which contains everything necessary to compile the runtime via cmake.
  • Updates some python packages needed for some ML workflows.
    • wrapt is downgraded to from 1.15.0 to 1.14.1 since 1.15 introduces a bug in TF's SavedModel interface (issue linked next to NO_AUTO_UPDATE instruction)

The PR is required by cms-sw/cmssw#43941.

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @riga (Marcel Rieger) for branch IB/CMSSW_14_1_X/master.

@cmsbuild, @aandvalenzuela, @smuzaffar, @iarspider can you please review it and eventually sign? Thanks.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 12, 2024

cms-bot internal usage

@@ -1,2 +1,2 @@
%define vectorized_packages zlib fastjet tensorflow-sources tensorflow OpenBLAS rivet gbl
%define vectorized_packages zlib fastjet tensorflow-sources tensorflow tensorflow-xla-runtime OpenBLAS rivet gbl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry, I forgot to mention, please also add scram-tools.file/tools/tensorflow-xla-runtime/vectorized.tmpl file same as https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/scram-tools.file/tools/tensorflow/vectorized.tmpl except that the tool name should be tensorflow-xla-runtime_@TOOL_VECTORIZATION@

@cmsbuild
Copy link
Contributor

Pull request #9005 was updated.

export CPATH="${CPATH}:%{i}/tensorflow/include/third_party/eigen3"

pushd tensorflow/xla_aot_runtime_src
cmake . -DCMAKE_CXX_FLAGS="-fPIC"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to properly build for multi-vectorization please add -msse3 here e.g. -DCMAKE_CXX_FLAGS="-fPIC -msse3"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 👍

@smuzaffar
Copy link
Contributor

please test

lets see the external build step

@cmsbuild
Copy link
Contributor

Pull request #9005 was updated.

@@ -0,0 +1,32 @@
### RPM external tensorflow-xla-runtime 2.12.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga, we need tensorflow-xla-runtime to be part of cmssw dependency chain, so please add it in https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/cmssw-tool-conf.spec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 👍

@cmsbuild
Copy link
Contributor

Pull request #9005 was updated.

@smuzaffar
Copy link
Contributor

please test

@@ -1,7 +1,7 @@
Requires: py3-tensorboard py3-opt-einsum py3-tensorflow-estimator py3-wrapt py3-google-pasta py3-scipy
Requires: tensorflow-sources
%define PipPreBuildPy PIPFILE=${TENSORFLOW_SOURCES_ROOT}/tensorflow-%{realversion}-cp%{cms_python3_major_minor}-cp%{cms_python3_major_minor}-linux_%{_arch}.whl
%define PipPostBuild rm -f %{i}/bin/tensorboard* ; ls %{i}/bin/* | xargs -i mv '{}' '{}3'
%define PipPostBuild rm -f %{i}/bin/tensorboard* ; rm -r %{i}/lib/python%{cms_python3_major_minor_version}/site-packages/tensorflow/xla_aot_runtime_src ; ls %{i}/bin/* | xargs -i mv '{}' '{}3'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga any specific reason to delete it? Looks like this is what you get out of tensorflow-sources wheel file. May be if we do not delete it here then you can just add dependency on py3-tensorflow and copy $PY3_TENSORFLOW_ROOT/$PYTHON3_LIB_SITE_PACKAGES/tensorflow/xla_aot_runtime_src to build tensorflow-xla-runtime ... right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. We changed this in f77258c.

export CPATH="${CPATH}:%{i}/tensorflow/include/third_party/eigen3"

pushd tensorflow/xla_aot_runtime_src
cmake . -DCMAKE_CXX_FLAGS="-fPIC -msse3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can add -DBUILD_SHARED_LIBS=OFF to disable shared library and directly build the archive lib. In this case you can avoid running gcc -shared .... command

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also please add -DCMAKE_CXX_STANDARD=%{cms_cxx_standard} just to make sure to use the correct c++std. Ypu need to include cpp-standard.file too see https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/tbb.spec#L2C12-L2C24

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both added in f77258c.


%install

mv tensorflow/include %{i}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga, the contents of tensorflow/include are same as the one in $PY3_TENSORFLOW_ROOT/$PYTHON3_LIB_SITE_PACKAGES/tensorflow/include . I would suggest that instead of duplicating this why not we get the include from py3-tensorflow ( we can add an extra toolfile so that these headers are accessible in cmssw)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In f77258c we changed the dependency of tensorflow-xla-runtime from tensorflow-sources to py3-tensorflow and added a separate tool file py3-tensorflow/tensorflow-includes.xml that we used in the toolfile of the xla runtime.


mkdir -p %{i}/lib/archive
mv tensorflow/xla_aot_runtime_src/libtf_xla_runtime.a %{i}/lib/archive/libtf_xla_runtime-static.a
mv tensorflow/xla_aot_runtime_src/libtf_xla_runtime.so %{i}/lib/libtf_xla_runtime.so
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the shared lib?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for now. We removed it in f77258c.

@cmsbuild
Copy link
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-656fa9/37379/summary.html
COMMIT: cc33bb3
CMSSW: CMSSW_14_1_X_2024-02-12-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9005/37379/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 5 errors in the following unit tests:

---> test testTFConstSession had ERRORS
---> test testTFGraphLoading had ERRORS
---> test testTFThreadPools had ERRORS
and more ...

Comparison Summary

Summary:

  • You potentially removed 92 lines from the logs
  • Reco comparison results: 54 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3247438
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3247413
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 19061.808 KiB( 47 files compared)
  • DQMHistoSizes: changed ( 23234.0,... ): 14102.535 KiB TrackerPhase2OTStub/Stubs
  • DQMHistoSizes: changed ( 23234.0,... ): -14102.286 KiB SiOuterTracker/Stubs
  • DQMHistoSizes: changed ( 23234.0,... ): 14027.626 KiB TrackerPhase2TTCluster/Clusters
  • DQMHistoSizes: changed ( 23234.0,... ): -14027.360 KiB SiOuterTracker/Clusters
  • DQMHistoSizes: changed ( 23234.0,... ): 3178.168 KiB TrackerPhase2OTStubV/Stub_RZ
  • DQMHistoSizes: changed ( 23234.0,... ): -25.679 KiB SiOuterTrackerV/Tracks
  • DQMHistoSizes: changed ( 23234.0,... ): 20.667 KiB TrackerPhase2OTL1TrackV/Resolution
  • DQMHistoSizes: changed ( 23234.0,... ): 16.476 KiB TrackerPhase2OTL1Track/Tracks
  • DQMHistoSizes: changed ( 23234.0,... ): -16.163 KiB SiOuterTracker/Tracks
  • DQMHistoSizes: changed ( 23234.0,... ): 2.961 KiB TrackerPhase2OTL1TrackV/Efficiency
  • DQMHistoSizes: changed ( 23234.0 ): ...
  • Checked 200 log files, 161 edm output root files, 48 DQM output files
  • TriggerResults: found differences in 5 / 46 workflows

@cmsbuild
Copy link
Contributor

-1

Failed Tests: AddOn
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-656fa9/38038/summary.html
COMMIT: 12f9496
CMSSW: CMSSW_14_1_X_2024-03-11-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9005/38038/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-656fa9/38038/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-656fa9/38038/git-merge-result

AddOn Tests

  • unknown
AddOnTest might have timed out: FAILED -  secs

Comparison Summary

Summary:

  • You potentially added 23 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 205 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3297383
  • DQMHistoTests: Total failures: 1367
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3295996
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: found differences in 5 / 46 workflows

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-656fa9/38038/summary.html
COMMIT: 12f9496
CMSSW: CMSSW_14_1_X_2024-03-11-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9005/38038/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-656fa9/38038/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-656fa9/38038/git-merge-result

Comparison Summary

Summary:

  • You potentially added 23 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 205 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3297383
  • DQMHistoTests: Total failures: 1367
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3295996
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: found differences in 5 / 46 workflows

@cmsbuild
Copy link
Contributor

REMINDER @antoniovilela, @rappoccio, @sextonkennedy: This PR was tested with cms-sw/cmssw#43941, please check if they should be merged together

@cmsbuild
Copy link
Contributor

Pull request #9005 was updated.

@smuzaffar
Copy link
Contributor

@riga, tensorflow-xla-runtime contains only an archive library. Enabling this tool for multi-archs/vactorizations builds is not going to work. CMSSW multi-arch env only works for shared library. So I would suggest to remove https://github.com/cms-sw/cmsdist/pull/9005/files#diff-56c31d38da773f5fab39a62d89ad676228cf6d73cbee7a0d2265d052671ed984 change.

@makortel
Copy link
Contributor

tensorflow-xla-runtime contains only an archive library. Enabling this tool for multi-archs/vactorizations builds is not going to work. CMSSW multi-arch env only works for shared library.

Does this mean that in order to fully utilize the multi-arch build (i.e. wider vector instructions) for the AOT-compiled models we need some further development? (like making tensorflow-xla-runtime to produce a shared library, or extending CMSSW multiarch build rules to work with archive libraries, or something else)

@smuzaffar
Copy link
Contributor

yes @makortel

extending CMSSW multiarch build rules to work with archive libraries

currently the multi-arch build works by linking the shared lib and then loading the correct micro-arch library at runtime. For archive libs we need to change build rules to use the correct micro-arch library at build time. It is doable but need some development

making tensorflow-xla-runtime to produce a shared library

I do not know if this will work for xla cmssw build. @riga , remind me why we have this archive libs instead of shared.

@riga
Copy link
Contributor Author

riga commented Mar 13, 2024

@smuzaffar The shared library had some symbols missing when linked to certain AOT models. However, we did not fully understand whether this is the intended behavior and the library is only meant to be used as a archive, or whether there is a way to build it correctly. This is unfortunately not documented.

If having only the static archive is a show stopper, we can look deeper in the *.so build.

@riga, tensorflow-xla-runtime contains only an archive library. Enabling this tool for multi-archs/vactorizations builds is not going to work. CMSSW multi-arch env only works for shared library. So I would suggest to remove https://github.com/cms-sw/cmsdist/pull/9005/files#diff-56c31d38da773f5fab39a62d89ad676228cf6d73cbee7a0d2265d052671ed984 change.

Should this be reverted in any case?

@makortel
Copy link
Contributor

If having only the static archive is a show stopper, we can look deeper in the *.so build.

My take is that it is not a showstopper, but something that probably should be worked on in the (near) future.

@riga
Copy link
Contributor Author

riga commented Mar 13, 2024

My take is that it is not a showstopper, but something that probably should be worked on in the (near) future.

Ok. Since we plan a second PR with the production workflow for the compilation of models anyway, we would work on the shared library build then.

@cmsbuild
Copy link
Contributor

Pull request #9005 was updated.

@rappoccio
Copy link

+1

Tests are already successful in the sibling PR in cmssw.

cms-sw/cmssw#43941

@rappoccio
Copy link

merge

@cmsbuild cmsbuild merged commit 71a9f35 into cms-sw:IB/CMSSW_14_1_X/master Mar 14, 2024
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants