Instabilities in 11634.911 (DD4Hep) workflow comparisons #35109
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign geometry |
New categories assigned: geometry @Dr15Jones,@cvuosalo,@civanch,@ianna,@mdhildreth,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Observed in #35068 (comment) and #34995 (comment) |
Here is another occurrence https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cf3e63/18431/summary.html |
@cvuosalo, is the problem back, or is it another one? |
The instability appears to be random and rare. It is strange that wf 11634.912 does not show it. The difference between the two workflows is that 11634.911 runs the algorithms and calculates the reco geometry, while 11634.912 reads the already calculated algorithm results and reco geometry out of the DB. |
I ran workflow 11634.911 thirty times in CMSSW_12_1_X_2021-09-20-1100 with identical results each time. It appears the instability has gone away. |
This issue is fully signed and ready to be closed. |
On the other hand the comparison differences have appeared rather rarely. |
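To put a rough number on how much thirty identical runs can tell us about a rare effect, here is a minimal sketch. It assumes the runs are independent and that the instability has a fixed per-run probability `p`; neither assumption is established in this thread. The function names are made up for illustration.

```python
def prob_all_clean(p: float, n_runs: int) -> float:
    """Probability that n_runs independent runs all come out identical,
    given a per-run probability p of showing the instability (assumed constant)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n_runs < 0:
        raise ValueError("n_runs must be non-negative")
    return (1.0 - p) ** n_runs


def max_rate_consistent_with_clean_runs(n_runs: int, confidence: float = 0.95) -> float:
    """Largest per-run probability p that n_runs clean runs cannot exclude
    at the given confidence level.

    Solves (1 - p)**n_runs = 1 - confidence for p.
    """
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_runs)


# Thirty clean runs, as in the test reported above.
bound = max_rate_consistent_with_clean_runs(30)
print(f"per-run rate not excluded at 95% CL: {bound:.3f}")

# A hypothetical 1% per-run rate would usually survive thirty runs unnoticed.
print(f"P(30 clean runs | p = 0.01) = {prob_all_clean(0.01, 30):.2f}")
```

Under these assumptions, thirty clean runs only exclude per-run rates above roughly 10%, so a rarer instability would be fully compatible with that test.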
Here is another instance: #36222 (comment). Could we re-open the issue (and keep it open for a longer time)? |
@makortel, I cannot; maybe you can reopen it? |
I don't have the power. I'm not sure @qliphy / @perrotta have, or if we need @smuzaffar. |
Wow: I have the power! |
Let's record here that the tests in #41273 (comment) showed 5932 differences in the DQM comparisons of 11634.911 (that being the only phase Run-{1,2,3} workflow showing differences). Running the tests a second time did not show any differences. The differences seemed to be across the board (i.e. not localized to a few subsystems). |
Let's record here that the tests in #41522 (comment) showed 4822 differences in the DQM comparisons of 23634.911 across the board. |
For the record, something similar happened in #41533: 47459 differences in the DQM comparisons of wf |
@cms-sw/geometry-l2 Should we open a new issue to record these instabilities or reopen this one? |
Strange to me that #41541 (comment) reports exactly the same: 47459 differences in the DQM comparisons of wf |
And another one in #41532 (comment), 4822 differences in workflow 23634.911. |
One more in #41504 (comment) |
Another one in #41852 (comment), 5582 differences in workflow 11634.911 |
(reopening the issue) |
Another one in #43041 (comment), 6123 differences in workflow 11634.911. The CPU model was the same ( |
To note here that #43439 is removing 11634.911 from the short matrix, after which we would not see these instabilities anymore in PR tests. |
Let me know if you think it is preferable to keep it just to have this "constant reminder" of the issue or if it is something that we can leave to IB tests. |
From my point of view, keeping this issue does not help much, even if we likely have a problem with 11634.911, which is being taken out of everyday testing. |
Good question. PR tests (including the short matrix) should be about ensuring the PRs behave as expected, and therefore I think using PR tests to stress-test reproducibility is likely not the best way. If there is no other use for 11634.911 in short matrix (@cms-sw/geometry-l2 could you comment?), I'd be in favor of dropping 11634.911 from the short matrix. Unfortunately IBs themselves don't provide any facilities for inspecting workflow results. @smuzaffar Maybe we should think about something here, at least for select workflows? (not really optimal, but maybe better than (mis)using PR tests?) |
Just to note that in the end #43439 kept 11634.911 |
Hi @makortel |
Do we know how the issue got resolved? Or is it just not occurring anymore? |
The workflow in question is Run-3, right? As DD4hep is run by default in Run-3 workflows (.911 = .0 for Run-3), I think we don't see any instabilities anymore. Am I missing a reason why we should keep investigating the Run-3 DD4hep workflow? |
From the history, the frequency seems to have been one occurrence every 1-4 months (although I suspect not all L2s report those). Earlier comments suggest that .911 and .0 are different, with .911 reading the geometry from XML and .0 from the DB. |
Ah, you are right. .911 is the XML version, and .912 (which is the .0 default now) is the DB version. Do we need to monitor XML when we use the DB? I mean, we don't run Run-1 or Run-2 XML (DDD) anymore, so we never know whether there is an issue there or not. |
We've observed differences in the DD4Hep workflow 11634.911 comparisons in tests of a few PRs that should not affect results of the DD4Hep workflow. This issue is to collect pointers to those comparisons.