Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[13.2.X] Fixed channel decoding for the timeout error in SiPixelRawToClusterGPUKernel #42631

Merged
merged 2 commits into from
Sep 5, 2023

Conversation

sroychow
Copy link
Contributor

PR description:

Backport of #42618

PR validation:

Code compiles.

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 22, 2023

A new Pull Request was created by @sroychow (Suvankar Roy Chowdhury) for CMSSW_13_2_X.

It involves the following packages:

  • RecoLocalTracker/SiPixelClusterizer (reconstruction)

@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@mtosi, @VourMa, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @mroguljic, @missirol, @dkotlins, @ferencek, @gpetruc, @mmusich, @threus, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @antoniovilela, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@sroychow
Copy link
Contributor Author

test parameters:

  • enable = gpu
  • workflows_gpu = 11634.503, 11634.507, 11634.583, 11634.587, 11634.593
  • relvals_opt_gpu = --what upgrade,standard,highstats,pileup,generator,extendedgen,production,ged,machine,premix,nano

@sroychow
Copy link
Contributor Author

@cmsbuild please test

@sroychow
Copy link
Contributor Author

type bug-fix

@mmusich
Copy link
Contributor

mmusich commented Aug 23, 2023

tests appear to be stuck.

@mandrenguyen
Copy link
Contributor

backport of #42618

@mandrenguyen
Copy link
Contributor

please abort

@mandrenguyen
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-64b28e/34470/summary.html
COMMIT: a2b7125
CMSSW: CMSSW_13_2_X_2023-08-24-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/42631/34470/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 5 lines to the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3196410
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3196382
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 43 differences found in the comparisons
  • DQMHistoTests: Total files compared: 8
  • DQMHistoTests: Total histograms compared: 228916
  • DQMHistoTests: Total failures: 2857
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 226059
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 7 files compared)
  • Checked 28 log files, 35 edm output root files, 8 DQM output files
  • TriggerResults: no differences found

@gparida
Copy link
Contributor

gparida commented Aug 25, 2023

Hi everyone,

The CPU-GPU comparisons including the changes in PR #42618 are summarized in this spreadsheet.

I used the release CMSSW_13_0_X_2023-08-21-1100 and the latest GRUN menu ( /dev/CMSSW_13_0_0/GRun/V153) to run on Run 2023 EphemeralHLTPhysics Dataset. It reduces the differences in DoubleEle* paths by about 7-8% from what was seen after the fix cms-sw/cmssw#42010 (The spreadsheet for that is available here).

For example:
After the PR 42010:

Total Accepted CPU Accepted GPU Gained Lost Change Percentage Change trigger
915722 221 246 28 -3 31 14.0271 HLT_DoubleEle9p5_eta1p22_mMax6_v4

Now (After the PR 42618):

1158651 235 235 5 -5 10 4.255 HLT_DoubleEle9p5_eta1p22_mMax6_v4

Similarly for the other DoubleEle* paths.

@mmusich
Copy link
Contributor

mmusich commented Aug 25, 2023

@gparida

For example:
After the PR 42010:

915722 221 246 28 -3 31 14.0271 HLT_DoubleEle9p5_eta1p22_mMax6_v4

Similarly for the other DoubleEle* paths.

Please put titles to columns for ease of reading the results

@missirol
Copy link
Contributor

@cms-sw/reconstruction-l2 @cms-sw/trk-dpg-l2 @mmusich

The one check done by TSG (#42631 (comment)) suggests that this PR goes in the right direction, in terms of reducing CPU-vs-GPU differences. Do you suggest to do more validation, or can this PR be integrated ?

@mmusich
Copy link
Contributor

mmusich commented Aug 28, 2023

Do you suggest to do more validation, or can this PR be integrated ?

well, it would be good to show event by event comparisons of the content of theedm::DetSetVector<SiPixelRawDataError> on CPU vs on GPU and show that is indeed changed in the right direction. I was hoping to see it in the matrix tests, but in MC there are no FED errors simulated.
Do we have any workflows running CPU vs GPU validation on data?

@mandrenguyen
Copy link
Contributor

From the reco side I don't see any issue, although perhaps someone else should be signing.
Taking note of the comment here it seems prudent to follow-up, perhaps with a quick discussion at the ORP, before merging this into a production release.

@Martin-Grunewald
Copy link
Contributor

@mmusich and all, can we proceed with this in view that the PR reduces the GPU-CPU differences?

@mmusich
Copy link
Contributor

mmusich commented Sep 4, 2023

can we proceed with this in view that the PR reduces the GPU-CPU differences?

assuming the analysis presented in #42631 (comment) was carried out correctly, I would say yes. I still think it would be nice if the PR proponent (@sroychow) could substantiate the validation at the level of the pixel FED errors unbalance.

@malbouis
Copy link
Contributor

malbouis commented Sep 4, 2023

urgent

  • from the JoinOps meeting today, HLT is requesting to have this PR included in the next 13_2_X release build.

@cmsbuild cmsbuild added the urgent label Sep 4, 2023
@antoniovilela
Copy link
Contributor

@cms-sw/reconstruction-l2 Matthew, OK from your side to proceed (this will enter 13_2_3)?

@mandrenguyen
Copy link
Contributor

Can we address #42631 (comment)
?

@antoniovilela
Copy link
Contributor

Can we address #42631 (comment) ?

Please comment.
@sroychow @cms-sw/trk-dpg-l2 (@cms-sw/hlt-l2 @cms-sw/ppd-l2)

@sroychow
Copy link
Contributor Author

sroychow commented Sep 4, 2023

I am running some tests with wf introduced in #42674. I'll share the results once the jobs finish.

@mandrenguyen
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 5, 2023

This pull request is fully signed and it will be integrated in one of the next CMSSW_13_2_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_13_3_X is complete. This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@antoniovilela
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 2630874 into cms-sw:CMSSW_13_2_X Sep 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants