use a dynamic buffer for CA cells components, adjust allocator growing factor to reduce memory used #509
Conversation
// Smallest bin, corresponds to binGrowth^minBin bytes (min_bin in cub::CachingDeviceAllocator)
constexpr unsigned int minBin = 1;
constexpr unsigned int minBin = 8;
so, the smallest bin is now 256 (instead of 8) bytes ...
(which makes sense, I don't think cudaMalloc actually returns memory chunks smaller than 256 bytes, since in all the tests I ran it looks like the memory is always aligned at least to that)
// Largest bin, corresponds to binGrowth^maxBin bytes (max_bin in cub::CachingDeviceAllocator). Note that unlike in cub, allocations larger than binGrowth^maxBin are set to fail.
constexpr unsigned int maxBin = 10;
constexpr unsigned int maxBin = 30;
... and the largest is 1 GB (as before) ?
@@ -13,11 +13,11 @@ namespace cms::cuda::allocator {
  // Use caching or not
  constexpr bool useCaching = true;
  // Growth factor (bin_growth in cub::CachingDeviceAllocator)
  constexpr unsigned int binGrowth = 8;
  constexpr unsigned int binGrowth = 2;
Makes sense.
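For reference, here is a minimal host-side sketch (not the actual cms::cuda::allocator code; the binSize helper is purely illustrative) of how the bin sizes follow binGrowth^bin with the new values:

```cpp
#include <cstdio>

// Values from the diff above.
constexpr unsigned int binGrowth = 2;
constexpr unsigned int minBin = 8;   // smallest bin: 2^8  = 256 bytes
constexpr unsigned int maxBin = 30;  // largest bin:  2^30 = 1 GiB

// binGrowth^bin, computed with integer arithmetic.
constexpr unsigned long long binSize(unsigned int bin) {
  unsigned long long size = 1;
  for (unsigned int i = 0; i < bin; ++i)
    size *= binGrowth;
  return size;
}

int main() {
  std::printf("smallest bin: %llu bytes\n", binSize(minBin));  // 256
  std::printf("largest bin:  %llu bytes\n", binSize(maxBin));  // 1073741824
}
```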
T* __restrict__ co,
template <typename VT, typename T>
__host__ __device__ __forceinline__ void blockPrefixScan(VT const* ci,
                                                          VT* co,
Is VT supposed to be either T or volatile T?
Yes, at least in this context.
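A simplified analogue (not the real blockPrefixScan) may make that concrete: with a separate VT, the same template can be instantiated with plain or volatile element types, while T keeps the non-volatile type of the workspace:

```cpp
#include <cstdint>

// Simplified stand-in for blockPrefixScan: only the signature pattern matters here.
template <typename VT, typename T>
__device__ inline void copyLike(VT const* ci, VT* co, uint32_t size, T* ws) {
  for (uint32_t i = 0; i < size; ++i)
    co[i] = ci[i];
  (void)ws;  // workspace unused in this sketch
}

__global__ void example(int const* in, int* out,
                        volatile int const* vin, volatile int* vout,
                        uint32_t n, int* ws) {
  copyLike(in, out, n, ws);    // VT deduced as int
  copyLike(vin, vout, n, ws);  // VT deduced as volatile int
}
```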
@@ -144,6 +144,9 @@ void CAHitNtupletGeneratorKernelsGPU::launchKernels(HitsOnCPU const &hh, TkSoA *
  cudaDeviceSynchronize();
  cudaCheck(cudaGetLastError());
#endif

  // free space asap
  // device_isOuterHitOfCell_.reset();
is this the change that didn't make any difference ?
Yes, I thought I had committed the one with the "reset"; will test again.
Validation summary
Reference release CMSSW_11_1_0 at b7ad279
Validation plots
/RelValTTbar_14TeV/CMSSW_11_1_0_pre8-PU_111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValZMM_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValZEE_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
Throughput plots
/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
At least, the pixel part looks good ...
The memcheck failures are not related to this PR, are they?
No, they are in the ECAL-only and/or HCAL-only workflows.
For comparison, here is the memory usage of running N jobs, each with 4 threads/streams, on Run 3 MC TTbar, 1000 events, with pixel triplets:
Note: 28 MB are used by MPS.
Adjust the growth factor in the caching allocators to use more granular bins, reducing the memory wasted by the allocations. Use a dynamic buffer for CA cells components. Fix a possible data race in the prefix scan.
Nice! Just to clarify, is the job pixel-only, or does it also include ECAL and/or HCAL?
That was pixel-only, with triplets. We don't have matrix workflows that combine all three...
Thanks (that's what I thought, but wanted to make sure).
The main objective of this PR is to revive the dynamic buffer for the CA cells components that was left out of last year's main merge because of crashes (see 6ec0bc7#diff-80b2ae8844f1bd61dff8c97dda310263R78-L70).
I took the opportunity to enlarge the buffers a bit to reduce overflows in large events.
I also took the opportunity to fix a possible data race that was left unresolved in prefixScan.
I used the code pattern in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions
On a V100 the crash can be triggered pretty easily; after the fix no more crashes were observed.
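For context, this is roughly the pattern from that section of the programming guide, reduced to a per-block sum rather than the actual prefix-scan kernel (the kernel and variable names below are illustrative): each block publishes its partial result, issues a __threadfence() so the write is visible device-wide, and only the block that receives the last ticket from the atomic counter combines the partials:

```cpp
__device__ unsigned int count = 0;

// Reduced illustration of the memory-fence pattern; one partial result per block.
__global__ void sumPartials(float const* in, volatile float* partial, float* result,
                            unsigned int numBlocks) {
  __shared__ bool isLastBlockDone;

  if (threadIdx.x == 0) {
    // publish this block's partial result
    partial[blockIdx.x] = in[blockIdx.x];
    // make the write visible to the other blocks before signalling completion
    __threadfence();
    // the block that gets the last ticket is the one allowed to combine the partials
    unsigned int ticket = atomicInc(&count, numBlocks);
    isLastBlockDone = (ticket == numBlocks - 1);
  }
  __syncthreads();

  if (isLastBlockDone && threadIdx.x == 0) {
    float total = 0.f;
    for (unsigned int i = 0; i < numBlocks; ++i)
      total += partial[i];  // safe: the fence above guarantees the writes are visible
    *result = total;
    count = 0;  // reset for the next launch
  }
}
```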
Finally (after much struggle), the growth factor in the allocator has been reduced to the minimum (2), which seems to fit all kinds of data allocation patterns and reduces the actual memory used by a factor of 2.
With 16 threads on 5K events of the usual lumi section,
this PR tops out at 1.4 GB of memory, while the current release (with growth factor 2 as well) tops out at 2.4 GB.
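A back-of-the-envelope illustration of why the finer bins waste less memory (the 200 MB request below is hypothetical, not taken from the workflow): a request is rounded up to the next power of the growth factor, so coarser bins can pad the returned block by a much larger amount:

```cpp
#include <cstdio>

// Round a request up to the next power of the growth factor (ignoring minBin/maxBin).
unsigned long long roundUpToBin(unsigned long long request, unsigned int growth) {
  unsigned long long bin = 1;
  while (bin < request)
    bin *= growth;
  return bin;
}

int main() {
  unsigned long long request = 200ull << 20;  // hypothetical 200 MB buffer
  std::printf("growth 8: rounded to %llu MB\n", roundUpToBin(request, 8) >> 20);  // 1024 MB
  std::printf("growth 2: rounded to %llu MB\n", roundUpToBin(request, 2) >> 20);  // 256 MB
}
```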
The benchmark for quadruplets on a T4 with 8 threads is at 1050 Hz, while the current release barely reaches 900 Hz (with MPS).
Running multiple jobs on a T4 for triplets:
Running 4 times over 5000 events with 4 jobs, each with 5 threads, 5 streams and 1 GPU
this PR: 707.8 ± 7.2 ev/s
reference: 543.5 ± 1.4 ev/s
And on a V100:
this PR: 1547.1 ± 1.6 ev/s
reference: 1380.1 ± 1.0 ev/s
On a V100 with hyperthreads:
Running 4 times over 5000 events with 8 jobs, each with 5 threads, 5 streams and 1 GPU
this PR: 1616.6 ± 1.0 ev/s
reference: 1429.2 ± 1.1 ev/s
Purely technical. No regression is expected besides a minor one due to reduced overflows.