significant slow-down of tensorflow on non-AVX machine(s) #33442
A new Issue was created by @slava77 Slava Krutelyov. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core, reconstruction |
New categories assigned: core,reconstruction @Dr15Jones,@smuzaffar,@slava77,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@smuzaffar @mrodozov can you comment? |
FYI @riga @mialiu149 |
in the head of 11_3_X we are using TF 2.4.1 and based on https://github.com/cms-externals/tensorflow/blob/cms/v2.4.1/tensorflow/workspace.bzl it has
So, to correct the initial assumption about SSE4.1, the logic is actually about AVX:

```cpp
if (mayiuse(avx512_mic)) {
    return jit_avx512_common_gemm_f32(transa, transb,
    ...
} else if (mayiuse(avx)) {
    ...
    return gemm_driver(transa, transb, bias ? "C" : NULL, M, N, K, alpha,
    ...
} else {
    return ref_gemm<float>(transa, transb,
```
|
right, so TF is not using oneDNN from direct deps since we don't have it anywhere else.
let me find the latest logs of TF to check it. |
let the bot build it and we can check what bazel is doing |
BTW, do we have a debug build for our externals and CMSSW? |
we have 'a' debug build |
The DBG build is |
after reading this:
also before that I checked the cache in the build directory and only one of the two tar files' hashes was there: |
if I understand correctly, the same problem will be present on ARM and Power. |
The library itself will be the same version. The problem might not be the same, though, as the gemm implementations on ARM and PPC employ different SIMD gimmicks than on x86. Could be worse :D |
@gartung @smuzaffar @mrodozov |
making a piechart/timing measurement should be enough |
enable profiling |
@slava77 , currently no. The profiling job only runs if we have profiling enabled for the IB. As we currently run profiling only for the production arch, the bot is not going to run profiling for this PR |
You would need to run the run-pr-profiling job specifically on the ARM and/or PPC node. |
of course one can manually run run-pr-profiling. @gartung, does PR profiling need anything from IB profiling? |
The Jenkins profiling jobs are set to run on nodes matching profiling label, ie vocms011. |
I was mostly interested in a manual request to run. |
Let me start a job for last 12.0.X IB |
but now I realized that I've asked for the wrong workflow number; I was supposed to ask for 11834.21 (it has pileup and matches what we run in IBs). @smuzaffar |
@slava77 , ok restarted for both aarch64 and ppc64le. |
@slava77 , profiling is not available for ppc64le https://cmssdt.cern.ch/circles/web/piechart.php?local=false&dataset=CMSSW_12_0_X_2021-05-04-2300%2Fslc7_ppc64le_gcc9%2F11834.21%2Fstep4_PAT_PU.resources&resource=time_thread&colours=default&groups=packages&threshold=0 For aarch64, it is still running. Last time it timed out after 12 hours |
the fraction of
Regarding running on aarch64: I still do not see the output with aarch in the regular place for piecharts.
I suppose this is still an issue. Do we have a way to produce timing charts for other arches? |
PPC is on its way for production (physics validation and operational testing are going on at Marconi100), and tests on ARM HPC(s) should be starting in the near future. |
The pull request profiling script is set up to use the production arch for the release it will be merged into.
You can manually trigger the profiling for a pull request and specify an alternate arch from what is automatically scheduled.
Originally from https://mattermost.web.cern.ch/cms-o-and-c/pl/zrtbufg8zbb9jgspeuxef183rc
I learned that TF inference is much slower on an older AMD compared to Intel.
Intel Broadwell: https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.int34/133
AMD Opteron 6128 https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.wn36/29
both are running the same inputs in a slightly older release where I had input data and where igprof was still working fine
one example call to `mkldnn_sgemm` has a very large difference in the two cases, about a factor of 1000 less on Intel (look at % total): https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.int34/2651
vs
https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.wn36/30
[From @makortel ] Some slowdown was observed e.g. in https://mathematica.stackexchange.com/questions/64645/mkl-on-intel-vs-amd
I have a suspicion that we are using https://github.com/oneapi-src/oneDNN/blob/v1.0.4/src/cpu/gemm/gemm.cpp
Here, `mkldnn_sgemm` calls `extended_sgemm`, which in turn makes a choice between `gemm_driver` [igprof cost 0.02%] and `ref_gemm<float>` [igprof cost 30%].
If that's correct, then my analysis is that `mkldnn_sgemm` is common in both cases and it's really just this method's implementation that differs, selected by the SSE4.1 flag. Then the difference in speed is close to a factor of 1000, which does not look reasonable. A better understanding of what we actually compile here would help to confirm. (It may be straightforward to modify the code and confirm more clearly that `ref_gemm` is really so slow.)
Goals towards resolving the issue: