Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix try/catch in Vx3DHLTAnalyzer #39285

Merged
merged 1 commit into from
Sep 9, 2022

Conversation

francescobrivio
Copy link
Contributor

PR description:

In #38687 a try/catch statement was added in the DQM client Vx3DHLTAnalyzer to protect the client against Minuit2 FatalRootError, but the return codes were not set properly, as reported in #38687 (comment).
This PR provides a bugfix.

PR validation:

Code compiles + run the unitTest.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Not a backport.
Backports to 12_5_X and 12_4_X will be provided soon.

FYI @dinardo @mmusich

@francescobrivio
Copy link
Contributor Author

type bug-fix

@francescobrivio
Copy link
Contributor Author

test parameters:

  • addpkg = DQM/Integration

@mmusich
Copy link
Contributor

mmusich commented Sep 1, 2022

Code compiles + run the unitTest.

was this tested against the problematic data leading to the crash online?

@francescobrivio
Copy link
Contributor Author

francescobrivio commented Sep 1, 2022

Code compiles + run the unitTest.

was this tested against the problematic data leading to the crash online?

I tried running it on the RAW data of the problematic run, but I observed no crash...I don't think we still have the streamer files for that run...

As substitute I increased manually the VtxError parameter in order to make the same crash appear (FatalRootError: @sub=Minuit2) and it is catched correctly, but yes I agree it's not the same thing...

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 1, 2022

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39285/31947

  • This PR adds an extra 24KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 1, 2022

A new Pull Request was created by @francescobrivio for master.

It involves the following packages:

  • DQM/BeamMonitor (dqm, db)

@malbouis, @pmandrik, @emanueleusai, @ahmad3213, @tvami, @cmsbuild, @jfernan2, @saumyaphor4252, @ggovi, @francescobrivio, @micsucmed, @rvenditti can you please review it and eventually sign? Thanks.
@mmusich this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@mmusich
Copy link
Contributor

mmusich commented Sep 1, 2022

I tried running it on the RAW data of the problematic run, but I observed no crash...I don't think we still have the streamer files for that run...

according to #38687 (comment) there are streamer files, or did I misunderstand?

@francescobrivio
Copy link
Contributor Author

francescobrivio commented Sep 1, 2022

I tried running it on the RAW data of the problematic run, but I observed no crash...I don't think we still have the streamer files for that run...

according to #38687 (comment) there are streamer files, or did I misunderstand?

please see my answer to that post, the misunderstanding is from the DQM-DOC side 😄
bottomline, at the moment, we do not have streamer files to reproduce the issue.

@francescobrivio
Copy link
Contributor Author

ok I'm in contact with DQM-DOC (@vicha-w) to get hold of the needed streamer files. I will update you once I can run a test on them.

@mmusich
Copy link
Contributor

mmusich commented Sep 1, 2022

@cmsbuild, please test

  • I guess, testing this in the meanwhile won't harm.

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 1, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6a15fe/27269/summary.html
COMMIT: ee5b0ec
CMSSW: CMSSW_12_6_X_2022-08-31-2300/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39285/27269/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3620448
  • DQMHistoTests: Total failures: 16
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3620409
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 50 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@francescobrivio
Copy link
Contributor Author

francescobrivio commented Sep 5, 2022

ok I'm in contact with DQM-DOC (@vicha-w) to get hold of the needed streamer files. I will update you once I can run a test on them.

I tried running on LSs 1-2-3 of the problematic run (357900) but I could not reproduce the crash originally observed by DQM experts (which I attach here for reference: beampixel_crash_run357900.txt).

The only way I have to mimic this error is to enlarge the VtxError parameter so I get the Minuit failure which is correctly catch-ed by the try/catch that I had already implemented in my previous PR, but of course this is not the real crash observed online.

Anyway this is a "bug-fix" because in the original try/catch I added the return codes were not correctly propagated...what do you think @mmusich?

@syuvivida
Copy link
Contributor

@francescobrivio
I just did a test by running the beampixel client locally at lxplus. The CMSSW is built in CMSSW_12_4_7. The input files are the same one copied from Vicha, (DQM-DOC last week): /afs/cern.ch/work/s/syu/run357900_stream/run357900. If I ran each lumi sections (1, 2, 3) separately in 3 jobs, the job finished OK. But if I ran all 3 lumisections together (one after the other in the same job), I started seeing a crash. Hope this Log357900.txt helps. It seems similar to the text file from the original DQM crash?

@mmusich
Copy link
Contributor

mmusich commented Sep 6, 2022

@syuvivida

I just did a test by running the beampixel client locally at lxplus. The CMSSW is built in CMSSW_12_4_7.

does this test include the changes proposed in this PR?

@syuvivida
Copy link
Contributor

syuvivida commented Sep 6, 2022

@mmusich
My original idea was to try to see if we can reproduce the crash under the same condition when the first crash happened in run 357900 on August 22. At that time, DQM only ran CMSSW_12_4_7 without this PR. Therefore, I didn't test this PR.

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 7, 2022

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39285/32039

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 7, 2022

Pull request #39285 was updated. @malbouis, @pmandrik, @emanueleusai, @ahmad3213, @tvami, @jfernan2, @saumyaphor4252, @ggovi, @francescobrivio, @micsucmed, @rvenditti can you please check and sign again.

@francescobrivio
Copy link
Contributor Author

%MSG-w DAClusterizerinZ_vect:  PrimaryVertexProducer:pixelVertices  07-Sep-2022 17:23:09 CEST Run: 357900 Event: 3204845
rejected track t_tkwt 1.67157e-24
%MSG

all of these warnings with ludicrous track weights are indicating that the DA is getting in input some tracks with very large IP significance, indications that they are not correctly associated to the right vertex. Perhaps it is worth checking how the DA vertexing is setup for this application.

I agree with you, but I'd prefer to leave this to a future PR since it is addressing a different issue.
Do you agree Marco?

@mmusich
Copy link
Contributor

mmusich commented Sep 7, 2022

I agree with you, but I'd prefer to leave this to a future PR since it is addressing a different issue.
Do you agree Marco?

It might be that the Minuit exception is linked to the vertices given in input, but sure let's deal with that later as this PR offers a safety net against crashes at P5.
Opening a gitHub issue to remind us about it would be good though.

@francescobrivio
Copy link
Contributor Author

It might be that the Minuit exception is linked to the vertices given in input

Good point!

Opening a gitHub issue to remind us about it would be good though.

Please see #39339

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 7, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6a15fe/27406/summary.html
COMMIT: 2de5b5f
CMSSW: CMSSW_12_6_X_2022-09-07-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/39285/27406/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3618326
  • DQMHistoTests: Total failures: 11
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3618293
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@tvami
Copy link
Contributor

tvami commented Sep 7, 2022

+db

  • trusting private results from the author

@emanueleusai
Copy link
Member

+1

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 7, 2022

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@francescobrivio
Copy link
Contributor Author

+db

* trusting private results from the author

and also: #39341 (comment) :D

@perrotta
Copy link
Contributor

perrotta commented Sep 9, 2022

+1

@cmsbuild cmsbuild merged commit 9cac6a1 into cms-sw:master Sep 9, 2022
@francescobrivio francescobrivio deleted the alca-fix_Vx3DHLTAnalyzer branch September 14, 2022 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants