Parallelized offline vertexing reconstruction #46663

Open: wants to merge 5 commits into base: master

@cericeci cericeci commented Nov 11, 2024

PR description:

This PR contains a first implementation of the parallelized offline vertexing using Alpaka. We intend this as a first PR to request feedback and comments. It includes contributions from @alexstrel that were squashed in during the porting to 14_2.

The content of the PR was previously discussed on:
https://indico.cern.ch/event/1442046/#4-status-of-offline-vertexing

PR validation:

When running the code-checks there are several warnings such as:

Processing tmp/el9_amd64_gcc12/code-checks/RecoVertex/PrimaryVertexProducer_Alpaka/plugins/alpaka/PrimaryVertexProducer_Alpaka.cc.yaml
Deleting: No Diagnostics found

I'm not sure about their origin. Otherwise the code compiles properly and (local) validation works. The added code should not touch any other packages.

@cmsbuild
Contributor

cmsbuild commented Nov 11, 2024

cms-bot internal usage

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46663/42589

  • Found files with invalid states:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/testAlpaka.root:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/testCPU_noPU.py:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/testCPU_PU200.py:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/prevs/testCPU_PU0.root:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/prevs/testAlpaka.root:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/testPrimaryVertexProducer_Alpaka.py:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/testPrimaryVertexProducer_Alpaka_PU200.py:

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cericeci
Author

Sorry, it seems that when I rebased this to 14_2 there was a mix-up on my side with the validation branch. Is it OK to push to this branch, or should I close this PR and open a new one from the proper branch?

@makortel
Contributor

You can push to the same branch.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46663/42721

  • Found files with invalid states:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/testAlpaka.root:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/testCPU_noPU.py:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/testCPU_PU200.py:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/prevs/testCPU_PU0.root:
    • RecoVertex/PrimaryVertexProducer_Alpaka/plugins/alpaka/PortableBeamSpotSoAProducer.cc:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/prevs/testAlpaka.root:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/testPrimaryVertexProducer_Alpaka.py:
    • RecoVertex/PrimaryVertexProducer_Alpaka/test/testPrimaryVertexProducer_Alpaka_PU200.py:

@cmsbuild
Contributor

A new Pull Request was created by @cericeci for master.

It involves the following packages:

  • DataFormats/PortableVertex (****)
  • RecoVertex/PrimaryVertexProducer (reconstruction)
  • RecoVertex/PrimaryVertexProducer_Alpaka (****)

The following packages do not have a category, yet:

DataFormats/PortableVertex
RecoVertex/PrimaryVertexProducer_Alpaka
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @dgulhan, @fabiocos, @martinamalberti, @missirol, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cericeci
Author

Thanks @makortel. As I had to rebase the proper branch to 14_2 anyhow, I ended up cherry-picking the changes by hand and updating this one. It now contains the correct code.

@jfernan2
Contributor

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor

mmusich commented Nov 20, 2024

RecoVertex/PrimaryVertexProducer_Alpaka (****)

do we need a new subsystem? Elsewhere alpaka-specific code was put in an alpaka sub-folder, for example RecoLocalTracker/SiPixelClusterizer/plugins/alpaka or RecoTracker/PixelSeeding/plugins/alpaka

Contributor

@fwyzard fwyzard left a comment

First round of comments, concerning only the DataFormats package(s)

Contributor

please remove this file

Comment on lines +11 to +12
#include <Eigen/Core>
#include <Eigen/Dense>
Contributor

please include these only once, before the SoA classes

Suggested change
- #include <Eigen/Core>
- #include <Eigen/Dense>

Comment on lines +16 to +17
using VertexToTrack = Eigen::Vector<float, 1024>;
using VertexToTrackInt = Eigen::Vector<int, 1024>;
Contributor

Does this mean that the code expects up to 1024 tracks per vertex ?
Does this value appear anywhere else in the code ?
If that is the case, could you define a named constant for it and use it here and everywhere else the same value is needed ?

This should make it easier to update the code if later it turns out that we need a different value.
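
A minimal sketch of the named-constant approach requested here. The names `kMaxTracksPerVertex` and `kMaxVertices` are hypothetical (they do not come from the PR), and `std::array` stands in for `Eigen::Vector` only so that the snippet compiles without Eigen; in the PR the aliases would keep their Eigen types.

```cpp
#include <array>
#include <cstddef>

namespace portablevertex {

  // Hypothetical constant names, not taken from the PR.
  inline constexpr std::size_t kMaxTracksPerVertex = 1024;  // capacity of each per-vertex track list
  inline constexpr std::size_t kMaxVertices = 512;          // "512 is the max vertex allowed"

  // std::array is a stand-in for Eigen::Vector<T, N> to keep the sketch dependency-free.
  using VertexToTrack = std::array<float, kMaxTracksPerVertex>;
  using VertexToTrackInt = std::array<int, kMaxTracksPerVertex>;
  using TrackToVertex = std::array<float, kMaxVertices>;

  // Weights and indices are filled in parallel, so their capacities must agree.
  static_assert(std::tuple_size<VertexToTrack>::value == std::tuple_size<VertexToTrackInt>::value,
                "track weights and track indices must have the same capacity");

}  // namespace portablevertex
```

With this layout, loop bounds and buffer sizes elsewhere in the package can refer to the same two constants, so a later change of capacity touches a single line.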


using VertexSoA = VertexSoALayout<>;

using TrackToVertex = Eigen::Vector<float, 512>; // 512 is the max vertex allowed
Contributor

See above, same for 512 here.

Contributor

Could you split this into three files, one per SoA, and name them after each SoA ?

Contributor

Then I guess the same should be done for VertexHostCollection.h and VertexDeviceCollection.h.
Opinions ?

Contributor

I want to first understand if TrackSoALayout and ClusterParams are needed as event data formats, or if they are really only internal details of the PrimaryVertexProducer_Alpaka.

If kept here, on a first thought I'd be in favor of splitting into a file per SoA.

Contributor

I'm wondering if the content of this new package could instead go into DataFormats/VertexSoA ?

Contributor

I concur.

#include <Eigen/Core>
#include <Eigen/Dense>

namespace portablevertex {
Contributor

Do we actually need to introduce a new namespace ?
Would it be a problem to have VertexSoA etc in the global namespace ?

<lcgdict>
<class name="alpaka_cuda_async::portablevertex::VertexDeviceCollection" persistent="false"/>
<class name="edm::DeviceProduct<alpaka_cuda_async::portablevertex::VertexDeviceCollection>" persistent="false"/>
<class name="edm::Wrapper<edm::DeviceProduct<alpaka_cuda_async::portablevertex::VertexDeviceCollection>>" persistent="false"/>
Contributor

Just to be easier on the eyes, could you add an empty line between each group of collection, device product and wrapper ?

Comment on lines +10 to +20
<read
sourceClass="portablevertex::VertexHostCollection"
targetClass="portablevertex::VertexHostCollection"
version="[1-]"
source="portablevertex::VertexSoA layout_;"
target="buffer_,layout_,view_"
embed="false">
<![CDATA[
portablevertex::VertexHostCollection::ROOTReadStreamer(newObj, onfile.layout_);
]]>
</read>
Contributor

Please replace the use of explicit read rules with the macro-based approach.

See for example the definitions for portabletest::TestHostCollection in DataFormats/PortableTestObjects/src/classes_def.xml and DataFormats/PortableTestObjects/src/classes.cc

Contributor

These changes seem unnecessary ?
If you could avoid touching this file, it would make the impact of the PR smaller.

@fwyzard
Contributor

fwyzard commented Nov 20, 2024

This PR contains a first implementation of the parallelized offline vertexing using Alpaka. We intend this as a first PR to request feedback and comments. It includes contributions from @alexstrel that were squashed in during the porting to 14_2.

Thanks @cericeci !

Given that the commit history does not seem very relevant, could you squash all commits into a single one, and add

Co-authored-by: Alexei Strelchenko <[email protected]>

at the bottom of the commit message ?

@makortel
Contributor

RecoVertex/PrimaryVertexProducer_Alpaka (****)

do we need a new subsystem? Elsewhere alpaka-specific code was put in an alpaka sub-folder, for example RecoLocalTracker/SiPixelClusterizer/plugins/alpaka or RecoTracker/PixelSeeding/plugins/alpaka

In this case the compilation time is very long (IIRC tens of minutes, for reasons not yet fully understood), so a separate package could make sense. The set of dependencies is also somewhat (about 50 %) different.

Contributor

@makortel makortel left a comment

From a first pass (without looking at the kernel code very deeply, and I probably won't)

Comment on lines +39 to +40
SOA_EIGEN_COLUMN(VertexToTrackInt, track_id),
SOA_EIGEN_COLUMN(VertexToTrack, track_weight),
Contributor

Do I understand correctly that each vertex element holds weight and index to a track for up to 1024 tracks (i.e. every vertex uses 8 kB memory for the weight and index)?
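
The arithmetic behind the 8 kB figure can be checked at compile time. This is a sketch under two assumptions: `float` and `int` are 4 bytes each, and the collection is allocated for the maximum of 512 vertices quoted in the `TrackToVertex` comment (the thread does not confirm how many vertices are actually allocated). The constant and function names are illustrative only.

```cpp
#include <cstddef>

// Values quoted in the review: 1024 tracks per vertex, at most 512 vertices.
constexpr std::size_t kTracksPerVertex = 1024;
constexpr std::size_t kMaxVertices = 512;

// Each vertex stores one float weight and one int index per associated track.
constexpr std::size_t bytesPerVertex() { return kTracksPerVertex * (sizeof(float) + sizeof(int)); }

// Assumption: the SoA is sized for the maximum number of vertices.
constexpr std::size_t bytesForAllVertices() { return kMaxVertices * bytesPerVertex(); }

static_assert(sizeof(float) == 4 && sizeof(int) == 4, "sketch assumes 4-byte float and int");
static_assert(bytesPerVertex() == 8 * 1024, "8 kB per vertex for weights plus indices");
static_assert(bytesForAllVertices() == 4 * 1024 * 1024, "4 MB if 512 vertices are allocated");
```

Under these assumptions the two columns alone take 4 MB per event, which is why the capacity constants are worth reviewing.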

Comment on lines +3 to +17
<use name="alpaka"/>
<use name="fmt"/>
<use name="DataFormats/PortableVertex"/>
<use name="DataFormats/TrackReco"/>
<use name="DataFormats/VertexReco"/>
<use name="FWCore/Framework"/>
<use name="FWCore/MessageLogger"/>
<use name="FWCore/ParameterSet"/>
<use name="FWCore/Utilities"/>
<use name="HeterogeneousCore/CUDACore"/>
<use name="HeterogeneousCore/AlpakaTest"/>
<use name="HeterogeneousCore/AlpakaCore"/>
<use name="HeterogeneousCore/AlpakaInterface"/>
<use name="TrackingTools/Records"/>
<use name="TrackingTools/TransientTrack"/>
Contributor

This list should be driven by what SoAToRecoVertexProducer depends on. From a quick look it should be

Suggested change
- <use name="alpaka"/>
- <use name="fmt"/>
- <use name="DataFormats/PortableVertex"/>
- <use name="DataFormats/TrackReco"/>
- <use name="DataFormats/VertexReco"/>
- <use name="FWCore/Framework"/>
- <use name="FWCore/MessageLogger"/>
- <use name="FWCore/ParameterSet"/>
- <use name="FWCore/Utilities"/>
- <use name="HeterogeneousCore/CUDACore"/>
- <use name="HeterogeneousCore/AlpakaTest"/>
- <use name="HeterogeneousCore/AlpakaCore"/>
- <use name="HeterogeneousCore/AlpakaInterface"/>
- <use name="TrackingTools/Records"/>
- <use name="TrackingTools/TransientTrack"/>
+ <use name="DataFormats/Math"/>
+ <use name="DataFormats/PortableVertex"/>
+ <use name="DataFormats/TrackReco"/>
+ <use name="DataFormats/VertexReco"/>
+ <use name="FWCore/Framework"/>
+ <use name="FWCore/ParameterSet"/>
+ <use name="FWCore/Utilities"/>

<use name="FWCore/Framework"/>
<use name="FWCore/ParameterSet"/>
<use name="FWCore/Utilities"/>
<use name="HeterogeneousCore/CUDACore"/>
Contributor

This dependence should not be needed

Suggested change
- <use name="HeterogeneousCore/CUDACore"/>

#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/EventSetup.h"
#include "HeterogeneousCore/AlpakaInterface/interface/config.h"
Contributor

Should not be needed

Suggested change
- #include "HeterogeneousCore/AlpakaInterface/interface/config.h"

// Finally, add references to the reco::Track used for building it
for (int iT=0; iT < hostVertexView[iV].ntracks(); iT++) {
int new_itrack = hostVertexView[iV].track_id()[iT];
reco::TrackRef ref(tracks, new_itrack);
Contributor

Not sure where to put this comment, but it should be ensured in some way that the ProductID of the edm::Handle<reco::TrackCollection> is the same as the one that was used for hostVertexView.
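
One way to picture the requested check is sketched below. Everything here is an assumption made for illustration: `TrackCollectionId` is a stand-in for `edm::ProductID` reduced to two indices, and `sameCollection` and `requireSameCollection` are hypothetical helpers. How the ID of the track collection used to fill the SoA would reach the converter is left open, as in the review comment.

```cpp
#include <stdexcept>

// Stand-in for edm::ProductID, reduced to the two indices this sketch compares.
struct TrackCollectionId {
  unsigned short processIndex;
  unsigned short productIndex;
};

// True when both IDs designate the same track collection.
constexpr bool sameCollection(TrackCollectionId a, TrackCollectionId b) {
  return a.processIndex == b.processIndex && a.productIndex == b.productIndex;
}

// Hypothetical guard, meant to run before any track reference is built
// from the track indices stored in the vertex SoA.
inline void requireSameCollection(TrackCollectionId usedForSoA, TrackCollectionId fromHandle) {
  if (!sameCollection(usedForSoA, fromHandle)) {
    throw std::logic_error("track indices in the vertex SoA refer to a different track collection");
  }
}
```

Failing early in this way avoids silently building references into the wrong collection when the two producers are configured with different input tags.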

@@ -0,0 +1,166 @@
#include "DataFormats/PortableVertex/interface/alpaka/VertexDeviceCollection.h"
#include "DataFormats/PortableVertex/interface/VertexHostCollection.h"
#include "DataFormats/BeamSpot/interface/BeamSpotHost.h"
Contributor

I believe this should be

Suggested change
- #include "DataFormats/BeamSpot/interface/BeamSpotHost.h"
+ #include "DataFormats/BeamSpot/interface/alpaka/BeamSpotDevice.h"

@@ -0,0 +1,166 @@
#include "DataFormats/PortableVertex/interface/alpaka/VertexDeviceCollection.h"
#include "DataFormats/PortableVertex/interface/VertexHostCollection.h"
Contributor

This should not be needed

Suggested change
- #include "DataFormats/PortableVertex/interface/VertexHostCollection.h"

Comment on lines +12 to +15
#include "DataFormats/VertexReco/interface/VertexFwd.h"
#include "DataFormats/TrackReco/interface/TrackFwd.h"
#include "DataFormats/TrackReco/interface/Track.h"
#include "DataFormats/VertexReco/interface/Vertex.h"
Contributor

Are these four headers needed? I didn't see any obvious use for them

Suggested change
- #include "DataFormats/VertexReco/interface/VertexFwd.h"
- #include "DataFormats/TrackReco/interface/TrackFwd.h"
- #include "DataFormats/TrackReco/interface/Track.h"
- #include "DataFormats/VertexReco/interface/Vertex.h"


cmsRun testPrimaryVertexProducer_Alpaka.py --backend $i

python compareAlgos.py testAlpaka.root testCPU.root
Contributor

This script alone does not seem to be very useful. The files

  • ../PrimaryVertexProducer/test/testPrimaryVertexProducer_CPU.py
  • testPrimaryVertexProducer_Alpaka.py
  • compareAlgos.py

Are these files perhaps just missing from the PR?


using TrackSoA = TrackSoALayout<>;

GENERATE_SOA_LAYOUT(ClusterParams,
Contributor

It looks like the ClusterParams is used only internally in PrimaryVertexProducer_Alpaka, and is not put into the Event. Then it would be better defined in RecoVertex/PrimaryVertexProducer_Alpaka/plugins/alpaka.

@cmsbuild
Contributor

+1

Size: This PR adds an extra 56KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f5bc58/42979/summary.html
COMMIT: 86c21bd
CMSSW: CMSSW_14_2_X_2024-11-20-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/46663/42979/install.sh to create a dev area with all the needed externals and cmssw changes.

@jfernan2
Contributor

enable profiling

@fwyzard
Contributor

fwyzard commented Nov 21, 2024

do we need a new subsystem? Elsewhere alpaka-specific code was put in an alpaka sub-folder, for example RecoLocalTracker/SiPixelClusterizer/plugins/alpaka or RecoTracker/PixelSeeding/plugins/alpaka

In this case the compilation time is very long (IIRC tens of minutes, for reasons not yet fully understood), so a separate package could make sense. The set of dependencies is also somewhat (about 50 %) different.

Either way the name PrimaryVertexProducer_Alpaka is not good.
If this needs to be a separate package, I would suggest PortablePrimaryVertexProducer.

@makortel
Contributor

Either way the name PrimaryVertexProducer_Alpaka is not good.
If this needs to be a separate package, I would suggest PortablePrimaryVertexProducer.

That I fully agree with (and kind of already commented in #46663 (comment)). In case of a separate package, I think the package name should match the (main) producer name. We have some existing use of a Portable postfix for EDModule names, but no existing use of a Portable prefix.

@makortel
Contributor

@smuzaffar Do we have any monitoring of the compilation time of a PR?

@makortel
Contributor

enable profiling

@jfernan2 Just curious, why profiling? As the PR stands presently, the added code is not part of any configuration (but perhaps that should change).

@smuzaffar
Contributor

@smuzaffar Do we have any monitoring of the compilation time of a PR?

If the PR job is still in Jenkins then yes, we can find the time it took for compilation; otherwise we can only get the total time it took to run the PR test job.

By the way, compilation time can vary depending on what other packages were checked out.

@makortel
Contributor

@smuzaffar Would it be feasible to run the scram b -v -k -j 16 part e.g. through /usr/bin/time to have something in the compilation log? I understand the caveats, I'm mostly thinking of monitoring (not even catching) of "outrageously long" compilation times when those are known beforehand (at least for now).

(by the way, why do I see >> Compiling edm plugin src/RecoVertex/PrimaryVertexProducer/plugins/PrimaryVertexProducer.cc three times in the build log?)

@smuzaffar
Contributor

@makortel, sure, I can add /usr/bin/time for scram build.
Yes, the multiple >> Compiling edm plugin src/RecoVertex/PrimaryVertexProducer/plugins messages seem incorrect. I am looking into it.

@smuzaffar
Contributor

@makortel, OK, I think I know why we have duplicate compilation messages. The bot runs

COMPILATION_CMD="scram b vclean && BUILD_LOG=yes $USER_FLAGS scram b ${BUILD_VERBOSE} -k -j ${NCPU}"
eval $COMPILATION_CMD

so we get two compilation messages. One directly from scram and 2nd from the eval itself. scram actually compiles the sources once but due to the use of eval we see duplicates.

The third message is coming from https://github.com/cms-sw/cms-bot/blob/master/pr_testing/test_multiple_prs.sh#L1149 where we try to get any log messages which were in the tmp log files but were not printed (in case of build failures).

I will update the bot to avoid using eval.

@makortel
Contributor

Thanks @smuzaffar!

@jfernan2
Contributor

enable profiling

@jfernan2 Just curious, why profiling? As the PR stands presently, the added code is not part of any configuration (but perhaps that should change).

You are right @makortel, I did not realize there is no configuration added in the PR. I was eager to find a candidate PR to test the new profiling script. I am sorry.

@jfernan2
Contributor

jfernan2 commented Nov 21, 2024

enable none

@smuzaffar
Contributor

smuzaffar commented Nov 21, 2024

so we get two compilation messages. One directly from scram and 2nd from the eval itself. scram actually compiles the sources once but due to the use of eval we see duplicates.

Ah, it's not eval but the BUILD_LOG=yes flag, which the build rule uses to capture all logs in a file first and print them once the product compilation is done. There was a bug where, with BUILD_LOG=yes, these log files were sent to stdout again at the end of compilation.

cms-sw/cmsdist#9526 should fix the duplicate build messages

iEvent.getHandle(recoTrackToken_)
.product(); // Note that we need reco::Tracks for building the track Reference vector inside the reco::Vertex

// This is an annoying conversion as the vertex expects a transient track here, which is a dataformat which we otherwise bypass
Contributor

What does this comment apply to ?
Where are TransientTracks used in this producer ?

@cmsbuild
Contributor

Milestone for this pull request has been moved to CMSSW_15_0_X. Please open a backport if it should also go in to CMSSW_14_2_X.

@cmsbuild cmsbuild modified the milestones: CMSSW_14_2_X, CMSSW_15_0_X Nov 22, 2024

7 participants