Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove old cuda #283

Merged
merged 4 commits into from
Nov 9, 2020
Merged

Remove old cuda #283

merged 4 commits into from
Nov 9, 2020

Conversation

srlantz
Copy link
Collaborator

@srlantz srlantz commented Oct 30, 2020

This PR removes a large amount of deprecated GPU code, over 6000 lines. The bulk of it is in 44 files, mostly .cu and related header files, that are now deleted. (Of course they are not really deleted; they can be retrieved at any time from the repository by pointing to a prior commit.)

@osschar already looked through the diffs and even contributed the revised GenMul.pm, and he gives his approval.

Benchmark and validation plots are here: https://slantz.web.cern.ch/slantz/remove-old-cuda/

Merging this PR will resolve issue #281.

@srlantz srlantz requested a review from dan131riley October 30, 2020 01:37
Copy link
Collaborator

@dan131riley dan131riley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still some __CUDACC__ code in from-root/Math. Otherwise it looks fine. I don't feel strongly about fixing the from-root files (which we really ought to get rid of anyway).

@srlantz
Copy link
Collaborator Author

srlantz commented Oct 30, 2020

There is still some __CUDACC__ code in from-root/Math. Otherwise it looks fine. I don't feel strongly about fixing the from-root files (which we really ought to get rid of anyway).

@dan131riley I saw this and decided to leave those places alone, on the assumption that if we are borrowing code more-or-less directly from ROOT, then it is probably best to leave things intact.

However, if you believe these places are insertions made by Matthieu, then I'll gladly remove them. It wasn't easy for me to tell, given that __CUDACC__ is defined simply through the act of compiling with nvcc and isn't one of the custom symbols defined by our group.

In the end, if the from-root code is also headed for the dustbin, then I guess it doesn't make much difference one way or the other.

@dan131riley
Copy link
Collaborator

@srlantz The __CUDACC__ sections were definitely added by Matthieu, they aren't in the original source. Getting rid of from-math is more of a longer-term project, for now I think we should get rid of the added CUDA sections, best to stick as close to the original source as we can.

@srlantz
Copy link
Collaborator Author

srlantz commented Oct 30, 2020

@srlantz The __CUDACC__ sections were definitely added by Matthieu, they aren't in the original source. Getting rid of from-math is more of a longer-term project, for now I think we should get rid of the added CUDA sections, best to stick as close to the original source as we can.

Done. I don't think it's necessary to re-run the benchmarks for this little commit.

But - while I'm thinking about it, I should mention that previously I had to run the benchmark/validation script a couple of times before I finally got clean data for this PR. The first run showed some wonky data points in the MEIF plots (I have seen this issue before). The second run had a bunch of segfaults mixed in, then the plots for SIMVAL_MTV were missing when it finished. The third time worked just fine. Maybe it helped to "make distclean" prior to the third run? - though that shouldn't be essential. In any case, I don't think the above issues with running the benchmarks have anything to do with my last 2 PRs.

@srlantz
Copy link
Collaborator Author

srlantz commented Nov 5, 2020

This is a follow-up to my previous comment, where I noted the the benchmark/validation script had two misfires before it ran correctly. Here are more details on what went wrong in each case.

  1. The bad points in the MEIF timing plots happened when nths = nevs. The strangely slow times returned to their expected values upon re-running. This behavior was previously observed when running the benchmarks for PR comment out all omp refs but simd, change icc(gcc) flag to -q(f)openm… #279 (removing OpenMP dependence). If one looks at the MEIF portion of the benchmark script, one sees that the outer loop is over threads, and the inner loop is over events. Therefore, each of the bad points is actually the last point to be run for a given number of threads, so thread affinity (if any) should already be established. However: these points are also the first ones to be run for a given number of events; might this indicate that the newly-read event data are (sometimes) local to the wrong threads?

  2. The segfaults observed in the second bad run appear to stem from a segfault of mkFit, which is followed by other segfaults when plots and analyses of data are being generated--presumably because some data are missing. Again, the validation worked fine upon re-running. I don't know why identical runs of mkFit would segfault on one occasion but run to completion on another occasion. Below are the snippets from the output of the benchmark script, beginning with the first segfault.

make[1]: warning:  Clock skew detected.  Your build may be incomplete.
make[1]: Leaving directory `/home/users/slantz/srlantz-mkFit/mkFit'
make: warning:  Clock skew detected.  Your build may be incomplete.
SKL-SP_CMSSW_TTbar_PU50_CMSSW: SIMVAL [nTH:64, nVU:16int, nEV:32]
./val_scripts/validation-cmssw-benchmarks.sh: line 121: 2326722 Segmentation fault      (core dumped) ${bExe} &>log_${oBase}_NVU${maxvu}int_NTH${maxth}_NEV${maxev}_${vN}.txt
Crashed on CMD: ./mkFit/mkFit --silent --cmssw-n2seeds --num-thr 64 --num-thr-ev 32 --input-file /data2//pu50-ccc-hs.bin --num-events 500 --remove-dup --sim-val-for-cmssw --try-to-save-sim-info --read-cmssw-tracks --mtv-like-val --build-cmssw
hadd Target file: valtree.root

\/\/\/\/\/\/\/\/\/\/\/\/\/\

Computing observables for: SKL-SP_CMSSW_TTbar_PU50 CMSSW SIMVAL

Processing plotting/runValidation.C("_SKL-SP_CMSSW_TTbar_PU50_CMSSW_SIMVAL",0)...
Info in <TUnixSystem::ACLiC>: creating shared library /home/users/slantz/srlantz-mkFit/plotting/PlotValidation_cpp.so
Computing Efficiency, Inefficiency, and Duplicate Rate ...

 *** Break *** segmentation violation
Crashed on CMD: root -b -q -l plotting/runValidation.C("_SKL-SP_CMSSW_TTbar_PU50_CMSSW_SIMVAL",0)
Computing observables for: SKL-SP_CMSSW_TTbar_PU50 STD SIMVAL

Processing plotting/runValidation.C("_SKL-SP_CMSSW_TTbar_PU50_STD_SIMVAL",0)...
Computing Efficiency, Inefficiency, and Duplicate Rate ...
Computing Fake Rate, <nHits/track>, and kinematic diffs to Sim tracks ...
Printing Totals ...
--------Track Reconstruction Summary--------
nEvents: 500 nSimTracks/evt: 7119.51 nRecoTracks/evt: 605.502
++++++++++++++++++++++++++++++++++++++++++++

\/\/\/\/\/\/\/\/\/\/\/\/\/\

Computing observables for: SKL-SP_CMSSW_TTbar_PU50 CE SIMVAL

Processing plotting/runValidation.C("_SKL-SP_CMSSW_TTbar_PU50_CE_SIMVAL",0)...
Computing Efficiency, Inefficiency, and Duplicate Rate ...

\/\/\/\/\/\/\/\/\/\/\/\/\/\

Overlaying histograms for: SKL-SP_CMSSW_TTbar_PU50 

Processing plotting/makeValidation.C("SKL-SP_CMSSW_TTbar_PU50","_SIMVAL",0,"forPR")...
Info in <TUnixSystem::ACLiC>: creating shared library /home/users/slantz/srlantz-mkFit/plotting/StackValidation_cpp.so

 *** Break *** segmentation violation
Computing observables for: SKL-SP_CMSSW_TTbar_PU50 CMSSW SIMVALSEED

Processing plotting/runValidation.C("_SKL-SP_CMSSW_TTbar_PU50_CMSSW_SIMVALSEED",0)...
Computing Efficiency, Inefficiency, and Duplicate Rate ...
Computing Fake Rate, <nHits/track>, and kinematic diffs to Sim tracks ...
Printing Totals ...
--------Track Reconstruction Summary--------
nEvents: 500 nSimTracks/evt: 7119.51 nRecoTracks/evt: 605.502
++++++++++++++++++++++++++++++++++++++++++++

\/\/\/\/\/\/\/\/\/\/\/\/\/\

Moving plots and text files locally to remove-old-cuda
mv: cannot stat ‘validation_SKL-SP_CMSSW_TTbar_PU50_CMSSW_SIMVAL/totals_validation_SKL-SP_CMSSW_TTbar_PU50_CMSSW_SIMVAL.txt’: No such file or directory

[many .png files also reported as missing at this point]

@dan131riley dan131riley merged commit 10c2aaf into trackreco:devel Nov 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants