Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Reduce calls to cudaEventRecord() via the caching allocators #412

Open
wants to merge 5 commits into
base: CMSSW_11_0_X_Patatrack
Choose a base branch
from

Conversation

makortel
Copy link

PR description:

This PR adds overloads for the caching allocators to allocate memory without device-side ownership. These overloads do not take the CUDA stream as an argument. Deallocating such memory blocks from host makes them free immediately without calling cudaEventRecord() (according to VTune cudaEventRecord() had the second-highest total waiting time for locking the mutex in CUDA API).

I changed all unique_ptr's in the CUDADataFormats (that are used in the pixel tracking workflow) that are owned by the data format class to use these overloads. It works because the data format objects are destructed only after all relevant work in their CUDA streams have finished.

This work was done during the NERSC-9 GPU hackathon at Cray offices. On Cori GPU node (V100) I got 14 % (20 % on 2 GPUs) increase in throughput for 2018D JetHT data.

The PR is RFC for two reasons

  • I'm unsure if the current interface is sufficient for distinguishing whether or not the allocation has device-side ownership
  • I'm looking forward to see the impact on performance on T4

PR validation:

Profiling workflow runs, unit tests run.

@VinInn
Copy link

VinInn commented Nov 22, 2019

We could try to see if the "new" callback mechanism is faster. In principle we would NOT be able to call cudaFree anymore: do we really need that?

@makortel
Copy link
Author

We could try to see if the "new" callback mechanism is faster.

Do you mean using a callback instead of an event to signal from device to host that the device releases the ownership?

I'd expect the events to be more performant than callback, because both cudaEventRecord() and cudaStreamAddCallback() lock the mutex, and my impression is that the callback needs "more resources" when executed compared to the setting events state. On the other hand we would gain by not having to call cudaEventQuery() to check the state (don't know if that one locks the mutex or not).

But yeah, could be tested.

In principle we would NOT be able to call cudaFree anymore: do we really need that?

The cudaFree() code path would be called within the DeviceFree() call, i.e. it would not be called from the callback (I believe it would be sufficient for the callback to just set an atomic boolean). (and we are not using that code path anyway)

@VinInn
Copy link

VinInn commented Nov 22, 2019

Do you mean using a callback instead of an event to signal from device to host that the device releases the ownership?

Indeed, my understanding is that it is just a sort of " cpu kernel"

ok for the cudaFree., understood.

@makortel
Copy link
Author

Do you mean using a callback instead of an event to signal from device to host that the device releases the ownership?

Indeed, my understanding is that it is just a sort of " cpu kernel"

Ah, you refer specifically to the cudaLaunchHostFunc()? Could try that one as well, here it would even suffice (callback not being called in case of device errors won't block the host execution).

@VinInn
Copy link

VinInn commented Nov 22, 2019

Indeed.
In the cuda doc of callback one can read
"This function(cuStreamAddCallback) is slated for eventual deprecation and removal. If you do not require the callback to execute in case of a device error, consider using cuLaunchHostFunc. "

@makortel
Copy link
Author

Right. cudaLaunchHostFunc() just has a "nasty" feature

Note that, in constrast to cuStreamAddCallback, the function will not be called in the event of an error in the CUDA context.

which is a guarantee we need to avoid starting the framework. Ok, it can be worked around if/when we need to go there (I suppose best idea so far is to have a separate CPU thread checking the health of the CUDA runtime e.g. once a second). I had an e-mail thread with Andreas Hehn, @fwyzard, @felicepantaleo during the hackathon in July.

@fwyzard
Copy link

fwyzard commented Nov 26, 2019

Validation summary

Reference release CMSSW_11_0_0_pre11 at 5b0a828
Development branch CMSSW_11_0_X_Patatrack at 614ee0b
Testing PRs:

Validation plots

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

Throughput plots

/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53

scan-136.86452.png
zoom-136.86452.png

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

Logs

The full log is available at https://patatrack.web.cern.ch/patatrack/validation/pulls/40866e04e975c9068c435d73175657560da6d695/log .

@fwyzard
Copy link

fwyzard commented Dec 3, 2019

@makortel low priority, could you fix the conflicts ?

@makortel makortel force-pushed the allocatorReduceEvent_v2 branch from 9704d00 to 6197223 Compare December 13, 2019 14:32
@makortel
Copy link
Author

Rebased on top of CMSSW_11_0_X_Patatrack (merge of #413).

@makortel makortel force-pushed the allocatorReduceEvent_v2 branch from 6197223 to c7a5d59 Compare January 22, 2020 22:24
@makortel
Copy link
Author

Rebased on top of CMSSW_11_0_0_Patatrack.

@fwyzard fwyzard added this to the CMSSW_11_1_0_Patatrack milestone Jan 22, 2020
@makortel
Copy link
Author

Needs to be rebased after #449 even if there technically there are not merge conflicts.

@makortel
Copy link
Author

To reduce the confusion between make_unique variants that take and don't take a CUDA stream argument, @fwyzard suggested to highlight the intended use

  • Event/EventSetup products (without CUDA stream)
  • temporary data structures (with CUDA stream)

Better ideas are still welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants