Move dictionary declarations of Alpaka serial backend data products to the host classes_def.xml #40792

Merged (1 commit) on Feb 21, 2023


makortel
Contributor

PR description:

This PR moves the dictionary declarations of Alpaka serial backend data products to the host classes_def.xml. The "serial" data products are intended to be consumable by non-Alpaka EDModules as well, so having all of their definitions in the main library is more consistent. See more discussion in #40690 (comment).

I noticed though that without the classes_serial_def.xml and classes_serial.h files the build fails with

>> Building LCG reflex dict from header file src/DataFormats/PortableTestObjects/src/alpaka/classes_serial.h
Error: Cannot find header /build/mkortela/coresw/CMSSW_13_0_0_pre4//src/DataFormats/PortableTestObjects/src/alpaka/classes_serial.h: cannot inline it.
Error: rootcling: cannot open linkdef file src/DataFormats/PortableTestObjects/src/alpaka/classes_serial_def.xml
gmake: *** [config/SCRAM/GMake/Makefile.rules:1740: tmp/slc7_amd64_gcc11/src/DataFormats/PortableTestObjects/src/alpaka/DataFormatsPortableTestObjectsSerialSync/a/DataFormatsPortableTestObjectsSerialSync_xr.cc] Error 1
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

and just to get started I included both with dummy content.

PR validation:

Unit tests pass (on a machine without GPU)

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-40792/34220

@cmsbuild
Contributor

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

  • DataFormats/PortableTestObjects (heterogeneous)

@cmsbuild, @makortel, @fwyzard can you please review it and eventually sign? Thanks.
@missirol, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor Author

hold

(maybe I should have opened this as draft...)

@makortel
Contributor Author

enable gpu

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

Pull request has been put on hold by @makortel
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@makortel
Contributor Author

@smuzaffar Would it be feasible to modify the Alpaka library build rules such that the dictionaries for the serial backend are generated only if the classes_serial.h or classes_serial_def.xml exist?

Alternatively we could decide that we'd never generate dictionaries for the serial backend, but I'd be tempted to keep the option in case some unexpected use case comes up.
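To illustrate the condition being asked about, here is a minimal Python sketch. It is not SCRAM's actual build logic (which lives in Makefile rules); the helper name `has_dictionary_sources` is hypothetical, and requiring both files rather than either one is an assumption based on the rootcling errors quoted in the PR description.

```python
from pathlib import Path
import tempfile


def has_dictionary_sources(alpaka_src_dir, backend):
    """Return True only if both dictionary inputs exist for the given backend.

    Hypothetical helper: looks for classes_<backend>.h and
    classes_<backend>_def.xml inside the package's src/alpaka directory.
    """
    src = Path(alpaka_src_dir)
    header = src / f"classes_{backend}.h"
    selection = src / f"classes_{backend}_def.xml"
    return header.is_file() and selection.is_file()


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of a real CMSSW area.
    with tempfile.TemporaryDirectory() as tmp:
        print(has_dictionary_sources(tmp, "serial"))
        (Path(tmp) / "classes_serial.h").write_text("")
        (Path(tmp) / "classes_serial_def.xml").write_text("")
        print(has_dictionary_sources(tmp, "serial"))
```

A build rule following this idea would skip dictionary generation for a backend when the check returns False, instead of failing in rootcling.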

@makortel
Contributor Author

(@fwyzard I'll rebase this PR on top of the commit of #40789 later)

@fwyzard
Contributor

fwyzard commented Feb 16, 2023

> @smuzaffar Would it be feasible to modify the Alpaka library build rules such that the dictionaries for the serial backend are generated only if the classes_serial.h or classes_serial_def.xml exist?

What happens today if classes_serial.h and classes_serial_def.xml do not exist?

In fact, what happens if classes_cuda.h and classes_cuda_def.xml do not exist?

@makortel
Contributor Author

> What happens today if classes_serial.h and classes_serial_def.xml do not exist?

The error I mentioned in the PR description:

>> Building LCG reflex dict from header file src/DataFormats/PortableTestObjects/src/alpaka/classes_serial.h
Error: Cannot find header /build/mkortela/coresw/CMSSW_13_0_0_pre4//src/DataFormats/PortableTestObjects/src/alpaka/classes_serial.h: cannot inline it.
Error: rootcling: cannot open linkdef file src/DataFormats/PortableTestObjects/src/alpaka/classes_serial_def.xml
gmake: *** [config/SCRAM/GMake/Makefile.rules:1740: tmp/slc7_amd64_gcc11/src/DataFormats/PortableTestObjects/src/alpaka/DataFormatsPortableTestObjectsSerialSync/a/DataFormatsPortableTestObjectsSerialSync_xr.cc] Error 1
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

> In fact, what happens if classes_cuda.h and classes_cuda_def.xml do not exist?

Apparently the build completes successfully whether or not classes_serial.h and classes_serial_def.xml exist, at least if I run scram b clean in between. Without the clean, I got a build failure:

>> Building LCG reflex dict from header file src/DataFormats/PortableTestObjects/src/alpaka/classes_cuda.h
Error: Cannot find header /build/mkortela/coresw/CMSSW_13_0_0_pre4//src/DataFormats/PortableTestObjects/src/alpaka/classes_cuda.h: cannot inline it.
Error: rootcling: cannot open linkdef file src/DataFormats/PortableTestObjects/src/alpaka/classes_cuda_def.xml
gmake: *** [config/SCRAM/GMake/Makefile.rules:1740: tmp/slc7_amd64_gcc11/src/DataFormats/PortableTestObjects/src/alpaka/DataFormatsPortableTestObjectsCudaAsync/a/DataFormatsPortableTestObjectsCudaAsync_xr.cc] Error 1

I thought I had run scram b clean before testing the removal of classes_serial_def.xml and classes_serial.h, but I'll test more.

@makortel
Contributor Author

So what actually happens in my test: when I remove classes_serial_def.xml and classes_serial.h and run scram b clean, the build of DataFormats/PortableTestObjects itself works fine.

But the build later gets stuck in edmWriteConfigs -p /build/mkortela/coresw/CMSSW_13_0_0_pre4/tmp/slc7_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableSerialSync/libHeterogeneousCoreAlpakaTestPluginsPortableSerialSync.so. The edmWriteConfigs process itself is stuck in

#0  0x00007fc7df13154d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fc7df12ce9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007fc7df12cd68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fc7e01dd51e in DebugPrint(char const*, ...) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#4  0x00007fc7e01dd7ed in DefaultErrorHandler(int, bool, char const*, char const*) ()
   from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#5  0x00007fc7e0296c25 in ErrorHandler () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#6  0x00007fc7e01f013c in TObject::Warning(char const*, char const*, ...) const ()
   from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#7  0x00007fc7d8562da1 in TCling::ReadRootmapFile(char const*, TCling::TUniqueString*) ()
   from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCling.so
#8  0x00007fc7d8597d23 in TCling::LoadLibraryMap(char const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCling.so
#9  0x00007fc7d8598bf7 in TCling::Initialize() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCling.so
#10 0x00007fc7e01b1fed in TROOT::InitInterpreter() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#11 0x00007fc7e01b233f in ROOT::Internal::GetROOT2() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#12 0x00007fc7e01dc04e in TEnv::Getvalue(char const*) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#13 0x00007fc7e01dc949 in TEnv::GetValue(char const*, char const*) const ()
   from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#14 0x00007fc7e01ddb18 in DefaultErrorHandler(int, bool, char const*, char const*) ()
   from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#15 0x00007fc7e0296c25 in ErrorHandler () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#16 0x00007fc7e02975a8 in Warning(char const*, char const*, ...) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#17 0x00007fc7e02c851f in ROOT::TGenericClassInfo::TGenericClassInfo(char const*, int, char const*, int, std::type_info const&, ROOT::Internal::TInitBehavior const*, TClass* (*)(), TVirtualIsAProxy*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/external/slc7_amd64_gcc11/lib/libCore.so
#18 0x00007fc7dc052989 in ROOT::GenerateInitInstanceLocal(edm::Wrapper<PortableHostCollection<portabletest::TestSoALayout<128ul, false> > > const*) [clone .constprop.0] ()
   from /build/mkortela/coresw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libDataFormatsPortableTestObjects.so
#19 0x00007fc7dc05240c in _GLOBAL__sub_I_DataFormatsPortableTestObjects_xr.cc ()
   from /build/mkortela/coresw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libDataFormatsPortableTestObjects.so
#20 0x00007fc7e0eab9c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#21 0x00007fc7e0eb059e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#22 0x00007fc7e0eab7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#23 0x00007fc7e0eafb8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#24 0x00007fc7dfa8afab in dlopen_doit () from /lib64/libdl.so.2
#25 0x00007fc7e0eab7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#26 0x00007fc7dfa8b5ad in _dlerror_run () from /lib64/libdl.so.2
#27 0x00007fc7dfa8b041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#28 0x00007fc7e0ee2af6 in edmplugin::SharedLibrary::SharedLibrary(std::filesystem::__cxx11::path const&) ()
   from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libFWCorePluginManager.so

which looks the same as in root-project/root#11383.

ldd /build/mkortela/coresw/CMSSW_13_0_0_pre4/tmp/slc7_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableSerialSync/libHeterogeneousCoreAlpakaTestPluginsPortableSerialSync.so shows

        libHeterogeneousCoreAlpakaTest.so => /build/mkortela/coresw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libHeterogeneousCoreAlpakaTest.so (0x00007fdd39425000)
        libHeterogeneousCoreAlpakaTestSerialSync.so => /build/mkortela/coresw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libHeterogeneousCoreAlpakaTestSerialSync.so (0x00007fdd39420000)
        libHeterogeneousCoreAlpakaCore.so => /build/mkortela/coresw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libHeterogeneousCoreAlpakaCore.so (0x00007fdd39419000)
        libHeterogeneousCoreAlpakaCoreSerialSync.so => /build/mkortela/coresw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libHeterogeneousCoreAlpakaCoreSerialSync.so (0x00007fdd3940a000)
        libDataFormatsPortableTestObjects.so => /build/mkortela/coresw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libDataFormatsPortableTestObjects.so (0x00007fdd393fd000)
        libDataFormatsPortableTestObjectsSerialSync.so => /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libDataFormatsPortableTestObjectsSerialSync.so (0x00007fdd393f1000)

I guess the dependence on libDataFormatsPortableTestObjectsSerialSync.so is a result of DataFormats/PortableTestObjects/BuildFile.xml containing <flags ALPAKA_BACKENDS="1"/>, i.e. the package produces libDataFormatsPortableTestObjects<backend>.so shared objects that a dependent package's library is linked against (as happens here). And now I'm trying to make the SerialSync case special (by removing the library for data formats).
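A quick way to spot this kind of stale dependency is to scan the ldd output for libraries that resolve outside the local build area. The following is a minimal Python sketch; the function name `libs_outside_area` is hypothetical, and the sample lines are copied from the ldd output above.

```python
import re

# Matches ldd lines of the form "libX.so => /path/to/libX.so (0x...)".
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def libs_outside_area(ldd_output, local_prefix):
    """Return (name, path) pairs for libraries not resolved under local_prefix."""
    outside = []
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if match and not match.group(2).startswith(local_prefix):
            outside.append((match.group(1), match.group(2)))
    return outside


SAMPLE = """\
        libDataFormatsPortableTestObjects.so => /build/mkortela/coresw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libDataFormatsPortableTestObjects.so (0x00007fdd393fd000)
        libDataFormatsPortableTestObjectsSerialSync.so => /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_0_pre4/lib/slc7_amd64_gcc11/libDataFormatsPortableTestObjectsSerialSync.so (0x00007fdd393f1000)
"""

if __name__ == "__main__":
    for name, path in libs_outside_area(SAMPLE, "/build/mkortela/coresw/"):
        print(name, "->", path)
```

In the sample, only the SerialSync library resolves to the release area under /cvmfs, which matches the situation described above.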

@smuzaffar
Contributor

@makortel @fwyzard, <flags ALPAKA_BACKENDS="1"/> enables all the default backends selected by the project via config/Self.xml. You can replace it with

<flags ALPAKA_BACKENDS="cuda rocm"/>

to build only the cuda and rocm backends, or

<flags ALPAKA_BACKENDS="serial"/>

to build only the serial backend. So if the pair of classes_serial files is missing, it is better to update the BuildFile to not build the serial backend.
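The flag semantics described above can be sketched in Python as follows. This is only an illustration, not SCRAM's implementation: the function name `select_backends` is hypothetical, and the default backend list passed in stands in for whatever the project selects via config/Self.xml. Values other than "1" or a space-separated list of backend names are not covered here.

```python
def select_backends(flag_value, project_defaults):
    """Resolve an ALPAKA_BACKENDS flag value to a list of backend names.

    "1" means all project default backends; any other value is treated
    as a space-separated list of backend names.
    """
    value = flag_value.strip()
    if value == "1":
        return list(project_defaults)
    return value.split()


if __name__ == "__main__":
    # Assumed defaults, for illustration only.
    defaults = ["serial", "cuda", "rocm"]
    print(select_backends("1", defaults))
    print(select_backends("cuda rocm", defaults))
    print(select_backends("serial", defaults))
```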

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-32b74b/30672/summary.html
COMMIT: f31b550
CMSSW: CMSSW_13_1_X_2023-02-16-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/40792/30672/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 1 lines from the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3556272
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3556250
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19862
  • DQMHistoTests: Total failures: 227
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19635
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: found differences in 2 / 3 workflows

@cmsbuild
Contributor

Pull request #40792 was updated. @cmsbuild, @makortel, @fwyzard can you please check and sign again.

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-32b74b/30752/summary.html
COMMIT: a71fc2e
CMSSW: CMSSW_13_1_X_2023-02-20-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/40792/30752/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 1203 lines to the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3529029
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3529001
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • You potentially added 72 lines to the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19862
  • DQMHistoTests: Total failures: 52
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19810
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: found differences in 2 / 3 workflows

@makortel
Contributor Author

+heterogeneous

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1

@perrotta
Contributor

ping bot

@cmsbuild cmsbuild merged commit 60313e2 into cms-sw:master Feb 21, 2023
@dan131riley

We're getting warnings (including in the test jobs for this PR):

Warning in <TInterpreter::ReadRootmapFile>: class  PortableHostCollection<portabletest::TestSoALayout<128,false> > found in libDataFormatsPortableTestObjectsSerialSync.so  is already in libDataFormatsPortableTestObjects.so 
Warning in <TInterpreter::ReadRootmapFile>: class  PortableHostCollection<portabletest::TestSoALayout<128ul,false> > found in libDataFormatsPortableTestObjectsSerialSync.so  is already in libDataFormatsPortableTestObjects.so 
Warning in <TInterpreter::ReadRootmapFile>: class  edm::Wrapper<PortableHostCollection<portabletest::TestSoALayout<128,false> > > found in libDataFormatsPortableTestObjectsSerialSync.so  is already in libDataFormatsPortableTestObjects.so 
Warning in <TInterpreter::ReadRootmapFile>: class  edm::Wrapper<PortableHostCollection<portabletest::TestSoALayout<128ul,false> > > found in libDataFormatsPortableTestObjectsSerialSync.so  is already in libDataFormatsPortableTestObjects.so 
Warning in <TInterpreter::ReadRootmapFile>: class  edm::Wrapper<portabletest::TestHostCollection> found in libDataFormatsPortableTestObjectsSerialSync.so  is already in libDataFormatsPortableTestObjects.so 
Warning in <TInterpreter::ReadRootmapFile>: class  portabletest::TestHostCollection found in libDataFormatsPortableTestObjectsSerialSync.so  is already in libDataFormatsPortableTestObjects.so 

and unit test failures:

Test name: testStandalone::writeAndReadFile
uncaught exception of type std::exception (or derived).
- An exception of category 'FatalRootError' occurred while
   [0] Constructing service of type InitRootHandlers
   Additional Info:
      [a] Fatal Root Error: @SUB=TInterpreter::ReadRootmapFile
class  PortableHostCollection<portabletest::TestSoALayout<128,false> > found in libDataFormatsPortableTestObjectsSerialSync.so  is already in libDataFormatsPortableTestObjects.so 

that appear to be related to this PR?
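For triaging logs like these, the duplicate-dictionary warnings can be extracted with a small script. This is a minimal Python sketch; the function name `parse_duplicate_dictionaries` is hypothetical, and the regular expression is written against the message format shown above rather than taken from any ROOT specification.

```python
import re

# Matches "class  <name> found in <libA>  is already in <libB>".
_DUPLICATE = re.compile(
    r"class\s+(?P<cls>.+?)\s+found in\s+(?P<new>\S+)\s+is already in\s+(?P<old>\S+)"
)


def parse_duplicate_dictionaries(log_text):
    """Return (class name, second library, first library) for each warning line."""
    found = []
    for line in log_text.splitlines():
        match = _DUPLICATE.search(line)
        if match:
            found.append((match.group("cls"), match.group("new"), match.group("old")))
    return found


SAMPLE = (
    "Warning in <TInterpreter::ReadRootmapFile>: class  "
    "edm::Wrapper<portabletest::TestHostCollection> found in "
    "libDataFormatsPortableTestObjectsSerialSync.so  is already in "
    "libDataFormatsPortableTestObjects.so \n"
)

if __name__ == "__main__":
    for cls, new_lib, old_lib in parse_duplicate_dictionaries(SAMPLE):
        print(f"{cls}: {new_lib} duplicates {old_lib}")
```

Grouping the results by library pair would show at a glance that every duplicate here involves the same two libraries.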

@makortel
Contributor Author

I'd expect those to disappear in a full build (the latest IB is still a patch build).

@smuzaffar This PR removed the libDataFormatsPortableTestObjectsSerialSync.so (and only that of all the libDataFormatsPortableTestObjects*.so). Just guessing, but could it be that this corner case escapes the "poisoned library" setup?

@smuzaffar
Contributor

smuzaffar commented Feb 22, 2023

> I'd expect those to disappear in a full build (the latest IB is still a patch build).

right, a full build should fix this

> @smuzaffar This PR removed the libDataFormatsPortableTestObjectsSerialSync.so (and only that of all the libDataFormatsPortableTestObjects*.so). Just guessing, but could it be that this corner case escapes the "poisoned library" setup?

Although we have protection for edm plugins (via the poison plugin cache), we do not poison the shared libraries themselves. I think we can fix this as well by creating poison libs, but I am not sure how ROOT will behave (it might still go through LD_LIBRARY_PATH and load the shared library from the release area). Note that these warnings come from ROOT and not from the CMS plugin system.

@makortel makortel deleted the serialDictionary branch February 22, 2023 09:37
@makortel
Contributor Author

Ah, the poison libraries were only for plugins. I started to wonder because, when we have moved dictionaries around in the past, I didn't recall any unit tests failing because of that (only the IB duplicate dictionary checker complaining until the next full build).
