Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alpaka vs CUDA DQM compare modules for pixel tracks objects #43964

Closed
wants to merge 2 commits into from

Conversation

borzari
Copy link
Contributor

@borzari borzari commented Feb 14, 2024

PR description:

This implements DQM modules for comparing pixel tracks objects reconstructed using Alpaka and CUDA. These modules can be used for any possible comparisons using pixel tracks: CUDA CPU vs CUDA GPU, Alpaka Serial vs Alpaka Device and Alpaka vs CUDA (both Serial/CPU and Device/GPU). A new process modifier chain, alpakaCUDAValidation, was added to turn on both reconstructions and the Alpaka vs CUDA DQM modules when needed.

A new workflow offset, .5041, was added to run DQM with all the possible comparisons using pixel tracks objects. A new customization to run Alpaka vs CUDA comparisons in HLT was also created.

These changes will be back ported to CMSSW_14_0_X for the special validation of the Alpaka vs CUDA reconstruction.

PR validation:

The modifications were tested using wfs 12434.403 (Alpaka validation), 12434.5041 (Alpaka vs CUDA validation), and 12434.503 (CUDA validation). All comparisons follow the expectations.

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 14, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43964/38859

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @borzari for master.

It involves the following packages:

  • Configuration/ProcessModifiers (operations)
  • Configuration/PyReleaseValidation (upgrade, pdmv)
  • DQM/SiPixelHeterogeneous (dqm)
  • DataFormats/TrackSoA (reconstruction, heterogeneous)
  • DataFormats/TrackingRecHitSoA (reconstruction, heterogeneous)
  • HLTrigger/Configuration (hlt)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)
  • RecoTracker/Configuration (reconstruction)
  • RecoTracker/PixelSeeding (reconstruction)
  • RecoTracker/PixelTrackFitting (reconstruction)
  • RecoTracker/PixelVertexFinding (reconstruction)
  • RecoVertex/BeamSpotProducer (alca, reconstruction)

@mmusich, @syuvivida, @AdrianoDee, @saumyaphor4252, @miquork, @nothingface0, @perrotta, @tjavaid, @antoniovilela, @fabiocos, @cmsbuild, @sunilUIET, @mandrenguyen, @davidlange6, @fwyzard, @Martin-Grunewald, @makortel, @rappoccio, @consuegs, @rvenditti, @jfernan2, @antoniovagnerini, @srimanob, @subirsarkar can you please review it and eventually sign? Thanks.
@idebruyn, @mmusich, @GiacomoSguazzoni, @mroguljic, @yuanchao, @JanFSchulte, @francescobrivio, @jandrea, @rsreds, @missirol, @silviodonato, @VinInn, @dgulhan, @dkotlins, @fabiocos, @tvami, @fioriNTU, @felicepantaleo, @VourMa, @Martin-Grunewald, @makortel, @gpetruc, @mtosi, @slomeo, @tocheng, @threus, @ferencek, @rovere this is something you requested to watch as well.
@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich
Copy link
Contributor

mmusich commented Feb 14, 2024

-hlt

fragment.hltSiPixelRecHitsSoACompareGPUvsCPU = cms.EDProducer( "SiPixelPhase1CompareRecHitsSoA",
pixelHitsSrcCPU = cms.InputTag( "hltSiPixelRecHitsSoA@cpu" ),
pixelHitsSrcGPU = cms.InputTag( "hltSiPixelRecHitsSoA@cuda" ),
fragment.hltSiPixelRecHitsSoACompareGPUvsCPU = cms.EDProducer( "SiPixelPhase1CompareRecHits",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are not supposed to touch these files directly.
Please provide a suitable customization function in https://github.com/cms-sw/cmssw/blob/master/HLTrigger/Configuration/python/customizeHLTforCMSSW.py that modifies these configurations.
STORM will take care of migrating these files centrally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Marco: we're on that!
(there was a miss-communication between me and Breno).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mmusich. Yes, this was my mistake. I was just talking to @AdrianoDee about it and he explained that I should create a customization instead of changing fragments directly. I'll remove the last commit and make the customization

Copy link
Contributor

@mmusich mmusich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Martin-Grunewald
Copy link
Contributor

-1
No modification of the HLT menu dumps in HLTrigger/Configuration. AFAIK all ALPKA modifications are to be coded in the ALPAKA customisation routine in HLTrigger/Configuration/python/customizeHLTforAlpka.py .

@Martin-Grunewald
Copy link
Contributor

Martin-Grunewald commented Feb 14, 2024

Also, it is unclear why DQM comparisons affect the HLT menu; I think it needs to be discussed how many modules we add and execute for this comparison to the HLT menu. Will this also need to run at P5, or only for offline workflows? If the former, it needs to be assessed for timing. If the latter, it definitely does NOT go in the HLT menus in HLTrigger/Configuration, and has to remain a DQM specific customisation routine called in a cmsDriver DQM step and put in some DQM subsystem.

@AdrianoDee
Copy link
Contributor

Also, it is unclear why DQM comparisons affect the HLT menu; I think it needs to be discussed how many modules we add and execute for this comparison to the HLT menu. Will thus also need to run at P5 or only for offline workflows? If the former, it needs to be assessed for timing. If the latter, it definitely does NOT go in the menus and has to remain a DQM specific customisation routine called in a cmsDriver DQM step.

This is because there's a path for CPUvsGPU checks DQM_PixelReconstruction_v that uses the comparison modules. In this case we are not adding anything, just changed the module to be able in general to do CUDAvsCUDA, CUDAvsAlpaka and AlpakaVSAlpaka comparisons.

@Martin-Grunewald
Copy link
Contributor

Only one path in the HLT menu is modified?
In any case, if you modify a path run in the HLT menu at P5, you need to file a CMSHLT JIRA request with the details.

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

@borzari this branch has now conflicts to resolve. Notice that the current HLT menu in release as integrated in #44733 runs alpaka for pixel local reconstruction + tracking + vertexing by default.

@borzari
Copy link
Contributor Author

borzari commented Apr 17, 2024

@borzari this branch has now conflicts to resolve. Notice that the current HLT menu in release as integrated in #44733 runs alpaka for pixel local reconstruction + tracking + vertexing by default.

Hi @mmusich. I was working on fixing the conflicts and matching the AlpakavsCUDA customizations with the Alpaka ones, and just finished some local tests using wf 12434.5041. You can see the modifications here. Before force pushing the changes to the PR branch, I have a question: should I modify HLTriggerOffline/Common/python/customizeHLTforAlpakavsCUDA.py to be as similar as possible to HLTrigger/Configuration/python/customizeHLTforAlpaka.py? I am asking because I see some modules that might not be used for the Alpaka vs CUDA comparisons, but I am not sure if I should keep (and make CUDA counterparts) of those.

@mmusich
Copy link
Contributor

mmusich commented Apr 17, 2024

I am asking because I see some modules that might not be used for the Alpaka vs CUDA comparisons, but I am not sure if I should keep (and make CUDA counterparts) of those.

I think it would still be beneficial if this could be discussed in a wider forum (e.g. a Joint TRK DPG / POG meeting).

@slava77
Copy link
Contributor

slava77 commented May 8, 2024

@mmusich
was the issue related to SoA persistency #44700 ?

Do I understand correctly that the issue is related to

'keep *RecHit*_hltSiPixelRecHitsSoASerialSync_*_*',
'keep *RecHit*_hltSiPixelRecHitsSoA_*_*',

Does it mean that the track SoAs can be persisted OK?
Would this reduce the problem to monitor only to siPixelPhase1CompareRecHitsSoAAlpaka?
Is the old style equivalent for hit monitoring in the HLT process also the slowest or is it the others?
Perhaps we could start with a mix-and-match solution

@mmusich
Copy link
Contributor

mmusich commented May 8, 2024

@slava77

was the issue related to SoA persistency #44700 ?

yes.

Does it mean that the track SoAs can be persisted OK?

no, please read carefully #44700 (comment).

This seems to come from these keep statements:

'keep *RecHit*_hltSiPixelRecHitsSoASerialSync_*_*',
'keep *RecHit*_hltSiPixelRecHitsSoA_*_*',

reverting that, one encounters a different error related to the hltPixelTracksSoA*:

----- Begin Fatal Exception 10-Apr-2024 14:01:59 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
   Additional Info:
      [a] Fatal Root Error: @SUB=TStreamerInfo::Build
reco::TrackSoA<pixelTopology::Phase1>::Layout<128,false>, unknown type: cms::alpakatools::OneToManyAssocSequential<unsigned int,32769,163840>* hitIndices_
----- End Fatal Exception -------------------------------------------------

the only product we can safely persists are pixel vertices.

@mmusich
Copy link
Contributor

mmusich commented May 8, 2024

Is the old style equivalent for hit monitoring in the HLT process also the slowest or is it the others?

I don't have measurements, but judging from the multiplicities, I think it's a reasonable assumption.

Perhaps we could start with a mix-and-match solution

do you care to expand?

@slava77
Copy link
Contributor

slava77 commented May 8, 2024

Perhaps we could start with a mix-and-match solution

do you care to expand?

I meant to drop the DQM monitors/modules from the HLT table for parts that can be persisted and run those in the client.

@mmusich
Copy link
Contributor

mmusich commented May 8, 2024

I meant to drop the DQM monitors/modules from the HLT table for parts that can be persisted and run those in the client.

as mentioned we can do it (only) for the vertices for the time being. If @cms-sw/tracking-pog-l2 prefers to do that, I think we are set to do it, but please do chime in in the ticket https://its.cern.ch/jira/browse/CMSHLT-3147 clearly explaining your preference and motivations.

@slava77
Copy link
Contributor

slava77 commented May 8, 2024

as mentioned we can do it (only) for the vertices for the time being.

I'm guessing that the resource gain will be small using just vertices (unless there's a record that this code is significantly slow).
I was more motivated to propose a mix when most of what's needed can be moved to the DQM client.

@mmusich
Copy link
Contributor

mmusich commented Jun 11, 2024

@borzari can you clarify on the status / plans for this PR?

@borzari
Copy link
Contributor Author

borzari commented Jun 11, 2024

@borzari can you clarify on the status / plans for this PR?

@mmusich I am closing this PR, as most of the changes are not needed anymore. I am going to open a new PR just with changes to the Alpaka DQM modules, and some others that were to be included in this PR (I'll list them in the new PR to make it clear), to be included in CMSSW_14_1_X only. I was making some tests with the changes and had a few issues, that I solved now, but it took a while. I am opening the new PR later today and will tag you to check it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants