Add protection vs Minuit2 Fatal Root Error and update VxErrCorr in Vx3DHLTAnalyzer #38687

Merged on Jul 13, 2022 (2 commits)

francescobrivio
Contributor

PR description:

In recent fills (>= 7920), the beampixel DQM client has been crashing with the following error:

An exception of category 'FatalRootError' occurred while
   [0] Processing global end LuminosityBlock run: 355443 luminosityBlock: 2
   [1] Calling method for module Vx3DHLTAnalyzer/'pixelVertexDQM'
   Additional Info:
      [a] Fatal Root Error: @SUB=Minuit2
VariableMetricBuilder Initial matrix not pos.def.
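
The log text refers to a matrix that is not positive definite. As a generic illustration of what that property means (this is not how Minuit2 itself performs the check), a symmetric matrix is positive definite exactly when a Cholesky factorization succeeds with strictly positive pivots. The helper name `isPositiveDefinite` and the row-major layout below are assumptions made for this sketch only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ROOT or CMSSW): returns true if the
// symmetric n x n matrix `a` (row-major storage) is positive definite,
// by attempting a Cholesky factorization A = L * L^T.
bool isPositiveDefinite(const std::vector<double>& a, std::size_t n) {
  std::vector<double> l(n * n, 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j <= i; ++j) {
      double sum = a[i * n + j];
      for (std::size_t k = 0; k < j; ++k) {
        sum -= l[i * n + k] * l[j * n + k];
      }
      if (i == j) {
        // A non-positive (or NaN) pivot means the matrix is not pos. def.
        if (!(sum > 0.0)) return false;
        l[i * n + i] = std::sqrt(sum);
      } else {
        l[i * n + j] = sum / l[j * n + j];
      }
    }
  }
  return true;
}
```

For example, the matrix with rows (2, 1) and (1, 2) passes this check, while the one with rows (1, 2) and (2, 1) fails because it has a negative eigenvalue.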

This PR:

  • updates the VxErrCorr parameter in Vx3DHLTAnalyzer from 1.2 to 1.0, similarly to the errorScale of the beam client, which was recently fixed in #38632 (Update errorScale for BeamSpot Legacy DQM client).
  • adds try/catch protections in Vx3DHLTAnalyzer.cc to avoid crashing the whole client when the Minuit2 minimization raises a Fatal Error
  • changes couts to edm::Log* calls in Vx3DHLTAnalyzer

PR validation:

Code compiles.
Ran privately on FEVT events: the fit now converges. If it fails, the Fatal Error is caught and reported via LogError, which avoids crashing the beampixel client.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Not a backport, but 12_4_X and 12_3_X backports will be provided soon

FYI @dinardo

@francescobrivio
Contributor Author

@cmsbuild please test

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38687/30988

  • This PR adds an extra 24KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @francescobrivio for master.

It involves the following packages:

  • DQM/BeamMonitor (dqm, db)
  • DQM/Integration (dqm)

@malbouis, @pmandrik, @emanueleusai, @ahmad3213, @tvami, @jfernan2, @ggovi, @francescobrivio, @micsucmed, @rvenditti can you please review it and eventually sign? Thanks.
@mmusich, @threus, @batinkov, @battibass this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy, @rappoccio you are the release manager for this.

-  Gauss3D->Minimize();
+  try {
+    Gauss3D->Minimize();
+  } catch (...) {
Contributor

I think this construct is forbidden, or at least is flagged as such by the S/A

Contributor Author

Following the coding rules I can try something like:

if (!Gauss3D->Minimize()) {
   throw cms::Exception("FitFailed") << "Vx3DHLTAnalyzer \tInitial matrix not pos. def";
}

Would this be better?
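
The proposal above can be sketched in a self-contained form. `MinimizerLike` and `FitFailed` are hypothetical stand-ins introduced here for the real minimizer and for `cms::Exception`; only the shape of a `Minimize()` call returning a `bool` is taken from the snippet.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the minimizer used in the analyzer; only the
// bool-returning Minimize() shape is taken from the snippet above.
struct MinimizerLike {
  bool willConverge;
  bool Minimize() const { return willConverge; }
};

// Hypothetical stand-in for cms::Exception with a "FitFailed" category.
class FitFailed : public std::runtime_error {
public:
  explicit FitFailed(const std::string& msg)
      : std::runtime_error("FitFailed: " + msg) {}
};

// Turns a failed return status into a typed exception.
void minimizeOrThrow(const MinimizerLike& fitter) {
  if (!fitter.Minimize()) {
    throw FitFailed("Vx3DHLTAnalyzer: minimization did not converge");
  }
}
```

A status check of this kind only helps when `Minimize()` returns normally. In the log quoted in the PR description the failure surfaced as an exception, so a typed catch is still needed around the call.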

@mmusich
Contributor

mmusich commented Jul 11, 2022

Catching cms::Exception or std exceptions is allowed. What is not allowed is using catch(...).
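
A minimal sketch of that distinction, using only the standard library: the handler names a concrete exception type and takes it by const reference, so the message can be reported rather than silently swallowed. `runFit` and `guardedFit` are hypothetical names, and `std::cerr` stands in for the `edm::LogError` call used in the PR.

```cpp
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a fit call that reports failure by throwing.
void runFit(bool fail) {
  if (fail) throw std::runtime_error("Initial matrix not pos.def.");
}

// Catches a named exception type by const reference and reports it,
// instead of swallowing everything with catch (...).
// Returns the logged message, or an empty string if the fit succeeded.
std::string guardedFit(bool fail) {
  try {
    runFit(fail);
  } catch (const std::exception& e) {
    std::string msg = std::string("Caught fit exception: ") + e.what();
    std::cerr << msg << '\n';  // the PR uses edm::LogError here
    return msg;
  }
  return {};
}
```

Because the handler is typed, anything that is not derived from `std::exception` still propagates, which keeps unexpected failures visible.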

Contributor Author

addressed in 76fa9de

@dinardo
Contributor

dinardo commented Jul 11, 2022

Hi guys,
all I know is that if we don't catch the exception, the code crashes.
And somehow the exception is generated inside ROOT.

@dinardo
Contributor

dinardo commented Jul 11, 2022

Ok cool.
What should we write then?

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bb6b2a/26146/summary.html
COMMIT: 60e5e71
CMSSW: CMSSW_12_5_X_2022-07-11-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38687/26146/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3655970
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3655940
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@francescobrivio
Contributor Author

@cmsbuild please test

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38687/31001

  • This PR adds an extra 24KB to repository

@cmsbuild
Contributor

Pull request #38687 was updated. @malbouis, @pmandrik, @emanueleusai, @ahmad3213, @tvami, @jfernan2, @ggovi, @francescobrivio, @micsucmed, @rvenditti can you please check and sign again.

@tvami
Contributor

tvami commented Jul 12, 2022

type bugfix

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bb6b2a/26171/summary.html
COMMIT: 76fa9de
CMSSW: CMSSW_12_5_X_2022-07-12-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38687/26171/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3653734
  • DQMHistoTests: Total failures: 19
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3653692
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@tvami
Contributor

tvami commented Jul 12, 2022

+db

  • tests pass
  • PR is according to the description

@emanueleusai
Member

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1

@cmsbuild cmsbuild merged commit 071e67f into cms-sw:master Jul 13, 2022
@francescobrivio francescobrivio deleted the alca-fix_beamPixel branch July 19, 2022 13:19
try {
  Gauss3D->Minimize();
} catch (cms::Exception& er) {
  edm::LogError("Vx3DHLTAnalyzer") << "\tCaught Minuit2 exception: " << er.what();
@mmusich
Contributor

mmusich commented Aug 23, 2022

Given the problems recently observed (again) with the online DQM client (see log), after catching the exception and logging the error, shouldn't the function return with an error state instead of carrying on?
I am not sure the rest of the computation makes sense if the fit fails.
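
One way to read this suggestion, as a sketch only: return a nonzero status from the handler so that nothing after the fit runs on an unconverged result. `minimize`, `fitAndPostProcess`, the status values, and the doubling step are all hypothetical placeholders, not the analyzer's actual code.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the minimization step; returns a fitted value.
double minimize(bool fail) {
  if (fail) throw std::runtime_error("fit failed");
  return 0.5;
}

// Returns 0 on success and a nonzero status if the fit threw, leaving
// `fitted` untouched so no downstream code sees a half-finished result.
int fitAndPostProcess(bool fail, double& fitted) {
  double raw = 0.0;
  try {
    raw = minimize(fail);
  } catch (const std::exception&) {
    // Log the error here, then stop instead of carrying on.
    return 1;
  }
  fitted = 2.0 * raw;  // placeholder for the post-fit computation
  return 0;
}
```

The caller can then test the status and skip filling histograms or publishing results for that luminosity block when the fit did not succeed.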

Contributor

@vicha-w, is there a way to reproduce offline the failure reported in the data-taking Mattermost channel, e.g. by making available the input streamer files that give rise to the issue?

Contributor Author

see #39285

Hi @mmusich. Sorry I didn't reach out. I have given LS files that caused the error in beamspot clients to @francescobrivio this morning.

Contributor Author

Hi @vicha-w sorry there is some confusion here: the streamer files you gave me this morning are NOT related to this issue, they are for different studies.
