Non-reproducibility in DeepTau in 1325.81 #32628

Open
makortel opened this issue Jan 11, 2021 · 24 comments

Comments

@makortel
Contributor

This shows up in the reco comparison of workflow 1325.81 for nanoaodFlatTable_tauTable__DQM_obj_floats__rawDeepTau2017v2p1VSmu
image

Noticed first in cms-sw/cms-bot#1456 (comment)

@cmsbuild
Contributor

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor Author

assign reconstruction, xpog

@cmsbuild
Contributor

New categories assigned: xpog,reconstruction

@slava77,@fgolf,@mariadalfonso,@perrotta,@jpata,@gouskos you have been requested to review this Pull request/Issue and eventually sign? Thanks

@slava77
Contributor

slava77 commented Jan 11, 2021

Noticed first in cms-sw/cms-bot#1456 (comment)

more recently in #32622 (comment)

@slava77
Contributor

slava77 commented Jan 11, 2021

@swozniewski @mbluj

@swozniewski
Contributor

Clicking through your messages, the difference always looks the same (unless the image was copied). Do I understand correctly and does it fit your observations so far that if there is an irreproducibility, it has only two outcomes, i.e. the observed diff and the reference?

@swozniewski
Contributor

@kandrosov @lwezenbe fyi and in case you have any ideas spontaneously

@makortel
Contributor Author

Do I understand correctly and does it fit your observations so far that if there is an irreproducibility, it has only two outcomes, i.e. the observed diff and the reference?

I believe we have so far seen the difference twice (cms-sw/cms-bot#1456 (comment) and #32622 (comment)), and indeed the difference looks the same in both (the images were not copied between the two PRs). It could be that there are only two possible outcomes, but with so few occurrences it is hard to say.

@slava77
Contributor

slava77 commented Jan 11, 2021

Looking at the logs for #32622 (comment), the
tensorflow/core/platform/cpu_feature_guard.cc... message, which IIRC is an informational note that the CPU capabilities are not fully utilized, differs between the two:

@smuzaffar @mrodozov do you know if there is a way to find out which nodes were used to run the runTheMatrix jobs (and what their architectures were)?

@smuzaffar
Contributor

@slava77, both the baseline and the PR tests ran on cmsbuild machines.

All these VMs are identical and support SSE4.1 SSE4.2 AVX AVX2 FMA. If the issue is with TF, then it could be the VM where the TF external was built. We do have an old physical machine, vocms0315, without AVX2, and that might have been used to build TF.

Do we see the differences if we run multiple times using the same IB?

@slava77
Contributor

slava77 commented Jan 12, 2021

All these VMs are identical and support SSE4.1 SSE4.2 AVX AVX2 FMA. If the issue is with TF, then it could be the VM where the TF external was built. We do have an old physical machine, vocms0315, without AVX2, and that might have been used to build TF.

It's odd, I thought that the warning message from TF was showing the flags present on the node where it's executed. The warning for 08-1100 set of tests showed that the PR test node did not have AVX2 FMA. I'm not sure I understand how the TF build can affect anything, since it's supposedly the same in the PR and baseline cases with differences made using 08-1100.
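
For checking directly which of these instruction sets a node advertises, independent of the TF message, something like the following could be run on the node. It is a minimal sketch, assuming a Linux x86 node where `/proc/cpuinfo` is readable; the helper names (`cpu_flags`, `simd_report`, `read_node_flags`) are hypothetical and not part of cms-bot or CMSSW.

```python
# Hypothetical helpers: report which SIMD flags a Linux node advertises.
# Flag spellings follow /proc/cpuinfo on x86 (e.g. "sse4_1", not "SSE4.1").
WANTED = ("sse4_1", "sse4_2", "avx", "avx2", "fma")


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_report(cpuinfo_text):
    """Map each instruction set of interest to True/False for this node."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in WANTED}


def read_node_flags(path="/proc/cpuinfo"):
    """Read the local cpuinfo file (assumed location) and build the report."""
    with open(path) as handle:
        return simd_report(handle.read())
```

Printing `read_node_flags()` at the start of a test job would record the node's capabilities next to the comparison results.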

@makortel
Contributor Author

makortel commented Feb 2, 2021

Here is another example #32782 (comment).
image

If still relevant, the TF warning line is

2021-02-02 19:41:40.193960: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
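
To compare this line between the baseline and PR logs without reading them by eye, the instruction list can be pulled out with a regular expression. This is a minimal sketch that assumes the message keeps the format shown above; `tf_instruction_sets` is a hypothetical helper name, and the code makes no claim about whether the list describes the binary or the node.

```python
import re

# Capture everything after the final colon of the cpu_feature_guard message.
_PATTERN = re.compile(
    r"cpu_feature_guard\.cc:\d+\].*?CPU instructions[^:]*:\s*(.*)$"
)


def tf_instruction_sets(log_line):
    """Return the instruction sets named in a cpu_feature_guard log line."""
    match = _PATTERN.search(log_line)
    if match is None:
        return set()
    return set(match.group(1).split())
```

Taking the set difference of the results for the baseline and PR log lines then shows which instruction sets appear in only one of the two.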

@makortel
Contributor Author

makortel commented Feb 6, 2021

AddressSanitizer reports a stack-buffer-overflow in DeepTauId::fillGrids() (see #32837), could that be the cause for this non-reproducibility? (answer: "no")

@makortel
Contributor Author

Here is another example #32947 (comment)
image

@mariadalfonso
Contributor

@swozniewski @kandrosov @lwezenbe @mbluj
Is there a better understanding of this?

@swozniewski
Contributor

I'm not aware of any news from TauPOG side about this. From this one and linked threads, it seemed to consolidate that the issue is related to dependencies between TF and hardware, so I didn't feel we can do much about it.

@vlimant
Contributor

vlimant commented May 12, 2021

Which hardware differences lead to different outcomes through TF? Can this be reproduced "stand-alone" (i.e. without CMSSW)?

@mrodozov
Contributor

See #33180 and #33442, if it helps, although they are not strictly related to this workflow.

@slava77
Contributor

slava77 commented May 17, 2021

there is one more case in #33706
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-87155f/15037/summary.html

Curiously, this time the change also shows up in particleNetMD, which is based on ONNX.
Based on the Jenkins details, the baseline here was running on cmsbuild73 (SSE4.1 SSE4.2 AVX AVX2 FMA), vs. the PR test on cms-vocms0315 (SSE4.1 SSE4.2 AVX).

@hqucms
is the ONNX in this case making a distinction between AVX and AVX2/FMA and running different methods?

@hqucms
Contributor

hqucms commented May 18, 2021

is the ONNX in this case making a distinction between AVX and AVX2/FMA and running different methods?

@slava77 Yes, ONNXRuntime has different kernels for AVX and AVX2.
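
As an illustration of why kernels with and without FMA can give slightly different numbers, here is a minimal sketch in plain Python. It is not ONNXRuntime's code: the fused multiply-add is emulated with exact rational arithmetic, the function names are hypothetical, and doubles are used for convenience where the networks would typically use single precision.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    """Round after the multiply and again after the add (no FMA)."""
    return a * b + c


def emulated_fma(a, b, c):
    """Compute a*b + c exactly, then round once, as a fused multiply-add does."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Inputs chosen so that the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 when the multiply is rounded on its own.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(mul_then_add(a, b, c))
print(emulated_fma(a, b, c))
```

The first result is exactly zero, while the single-rounding version keeps the residual of -2^-60. Differences of this kind, accumulated over many operations, are one plausible way for identical inputs to give slightly different discriminator values on AVX and AVX2/FMA nodes.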

@jpata
Contributor

jpata commented Sep 16, 2021

Showing up again in 1325.81, 136.731 in #35216

all_OldVSNew_TTbar13nanoEDM106Xv1in2017wf1325p81
  nanoaodFlatTable_fatJetTable__DQM_obj_floats__particleNetMD_QCD_100.png
  nanoaodFlatTable_tauTable__DQM_obj_floats__rawDeepTau2017v2p1VSmu_442.png
all_mini_OldVSNew_RunSinglePh2016Bwf136p731
  patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__73__second.png
  patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__73__second285.png
  patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__100__second312.png
  patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__103__second315.png

Dumping the pairDiscri names:

73 pfParticleNetJetTags:probWcq
100 pfParticleNetDiscriminatorsJetTags:HccvsQCD
103 pfParticleNetDiscriminatorsJetTags:ZbbvsQCD
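
For matching the plot names to discriminators in future comparisons, the index embedded in the file name can be looked up in such a dump. A minimal sketch, assuming the `pairDiscriVector__<index>__` naming seen above; the mapping holds only the three entries listed here, and `discriminator_for` is a hypothetical helper.

```python
import re

# Index -> discriminator name, copied from the dump above (not a full list).
PAIR_DISCRI_NAMES = {
    73: "pfParticleNetJetTags:probWcq",
    100: "pfParticleNetDiscriminatorsJetTags:HccvsQCD",
    103: "pfParticleNetDiscriminatorsJetTags:ZbbvsQCD",
}

_INDEX = re.compile(r"pairDiscriVector__(\d+)__")


def discriminator_for(plot_name):
    """Return the discriminator name for a comparison plot, or None if unknown."""
    match = _INDEX.search(plot_name)
    if match is None:
        return None
    return PAIR_DISCRI_NAMES.get(int(match.group(1)))
```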

@slava77
Contributor

slava77 commented Sep 16, 2021

Showing up again in 1325.81, 136.731 in #35216

IIUC, this issue is in a state of a "known feature" now. The differences appear somewhat regularly, depending on the build machines using AVX or AVX2 for baseline or the reference.

@jpata
Contributor

jpata commented Sep 16, 2021

I wanted to explicitly put the discriminator names and workflows here so they can be found with a github issue search. In #35216, it took me a bit to be sure all the differences are really from this ONNX feature.

@vlimant
Contributor

vlimant commented Nov 2, 2022

related to #36552


10 participants