Non-reproducibility in DeepTau in 1325.81 #32628
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction, xpog |
more recently in #32622 (comment) |
Clicking through your messages, the difference always looks the same (unless the image was copied). Do I understand correctly, and does it fit your observations so far, that if there is an irreproducibility, it has only two outcomes, i.e. the observed diff and the reference? |
@kandrosov @lwezenbe fyi and in case you have any ideas spontaneously |
I believe we have so far seen the difference twice (cms-sw/cms-bot#1456 (comment) and #32622 (comment)), and indeed the difference looks the same in both (the images were not copied between the two PRs). It could be that there are only two possible outcomes, but with so few occurrences it is hard to say. |
Looking at the logs for #32622 (comment)
@smuzaffar @mrodozov do you know if there is a way to find out which nodes were used to run the runTheMatrix jobs (and what their architectures were)? |
@slava77 , both baseline and PR tests ran on cmsbuild machines
All these VMs are identical and support Do we see the differences if we run multiple times using the same IB? |
It's odd, I thought that the warning message from TF was showing the flags present on the node where it's executed. The warning for 08-1100 set of tests showed that the PR test node did not have |
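One way to check which SIMD instruction sets a given node advertises, independently of the TF warning, is to read the `flags` line of `/proc/cpuinfo`. The following is a minimal sketch, assuming an x86 Linux node; the helper names `parse_cpu_flags` and `simd_summary` and the selection of flags in `SIMD_FLAGS` are illustrative choices, not part of cms-bot or CMSSW.

```python
from pathlib import Path

# Flags of interest, spelled as they appear in /proc/cpuinfo on x86 Linux.
# The selection is an assumption made for this sketch.
SIMD_FLAGS = ("sse4_2", "avx", "avx2", "fma", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from /proc/cpuinfo-style text (first CPU entry)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def simd_summary(cpuinfo_text):
    """Map each flag in SIMD_FLAGS to True/False for the given cpuinfo text."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in SIMD_FLAGS}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # only present on Linux
        print(simd_summary(cpuinfo.read_text()))
```

Running this in both the baseline and the PR test job and storing the result next to the logs would show directly whether the two jobs saw the same instruction sets.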
Here is another example #32782 (comment). If still relevant, the TF warning line is
|
AddressSanitizer reports a |
Here is another example #32947 (comment) |
@swozniewski @kandrosov @lwezenbe @mbluj |
I'm not aware of any news from TauPOG side about this. From this one and linked threads, the consensus seemed to be that the issue is related to dependencies between TF and hardware, so I didn't feel we could do much about it. |
What are the different hardware configurations that lead to different outcomes through TF? Can this be reproduced "stand-alone" (i.e. without CMSSW)? |
There is one more case in #33706. Curiously, this time the change also shows up in particleNetMD, which is based on ONNX. @hqucms |
@slava77 Yes, ONNXRuntime has different kernels for AVX and AVX2. |
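Different kernels can give slightly different results for the same inputs because floating-point addition is not associative, so any change in accumulation order (or in the use of fused multiply-add) can change the last bits of a sum. The sketch below only illustrates that effect in plain Python; `lane_sum` and the lane counts 4 and 8 are hypothetical and do not reproduce the actual ONNXRuntime AVX or AVX2 kernels.

```python
import math


def lane_sum(values, lanes):
    """Sum values with `lanes` independent accumulators, then combine them.

    This mimics the accumulation order of a vectorized loop; it is an
    illustration only, not a model of any specific kernel.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# Floating-point addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# The same data summed with two different accumulation orders.
values = [0.1 * (i % 7) - 0.3 for i in range(1000)]
narrow = lane_sum(values, 4)
wide = lane_sum(values, 8)
print(narrow, wide, narrow - wide)
```

The two sums agree to within rounding but need not be bit-identical, which is the kind of small difference that can then propagate through a network and show up in a bin-by-bin comparison.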
Showing up again in 1325.81, 136.731 in #35216
Dumping the pairDiscri names:
|
IIUC, this issue is in the state of a "known feature" now. The differences appear somewhat regularly, depending on whether the build machines use AVX or AVX2 for the baseline or the reference. |
I wanted to explicitly put the discriminator names and workflows here so they can be found with a GitHub issue search. In #35216, it took me a while to be sure all the differences really come from this ONNX feature. |
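To speed up this kind of check, the names of the differing quantities in a comparison could be split into those matching known-irreproducible discriminators and everything else. This is a minimal sketch: the substrings in `KNOWN_PATTERNS` are assumptions based on the names mentioned in this thread, `split_known` is a hypothetical helper, and the second name in the usage example is made up for illustration.

```python
# Substrings assumed to identify quantities affected by this issue.
KNOWN_PATTERNS = ("DeepTau", "particleNetMD")


def split_known(diff_names, patterns=KNOWN_PATTERNS):
    """Split differing quantity names into (known, other) lists."""
    known, other = [], []
    for name in diff_names:
        if any(p in name for p in patterns):
            known.append(name)
        else:
            other.append(name)
    return known, other


if __name__ == "__main__":
    diffs = [
        "nanoaodFlatTable_tauTable__DQM_obj_floats__rawDeepTau2017v2p1VSmu",
        "someOtherTable__DQM_obj_floats__pt",  # hypothetical name
    ]
    known, other = split_known(diffs)
    print("known:", known)
    print("needs a closer look:", other)
```

Anything left in the second list would still need to be inspected by hand.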
related to #36552 |
Shows up in reco comparison of 1325.81 for
nanoaodFlatTable_tauTable__DQM_obj_floats__rawDeepTau2017v2p1VSmu
Noticed first in cms-sw/cms-bot#1456 (comment)