
use a dynamic buffer for CA cells components, adjust allocator growing factor to reduce memory used #509

Merged
8 commits merged into cms-patatrack:CMSSW_11_1_X_Patatrack on Jul 15, 2020

Conversation

VinInn

@VinInn VinInn commented Jul 13, 2020

The main objective of this PR is to revive the dynamic buffer for CA cell components that was left out of last year's main merge because of crashes (see 6ec0bc7#diff-80b2ae8844f1bd61dff8c97dda310263R78-L70).

I took the opportunity to enlarge the buffers a bit, to reduce overflows in large events.

I also fixed a possible data race that had been left unresolved in prefixScan, using the code pattern from https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions.
On a V100 the crash can be triggered quite easily; after the fix, no more crashes were observed.
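For illustration only, the fence pattern referenced above boils down to "publish the payload, fence, then raise a flag". This is a plain-C++11 analogue with made-up names, not the PR's actual CUDA code (where the fence would be `__threadfence()`):

```cpp
#include <atomic>
#include <thread>

// Illustrative host-side analogue of the CUDA memory-fence pattern:
// the producer publishes a payload, fences, then raises a flag; the
// consumer spins on the flag, fences, then reads the payload. Without
// the fences, the payload read could race the payload write.
static int partialSum = 0;              // non-atomic payload
static std::atomic<bool> ready{false};  // publication flag

void producer() {
  partialSum = 42;                                      // write payload
  std::atomic_thread_fence(std::memory_order_release);  // ~ __threadfence()
  ready.store(true, std::memory_order_relaxed);         // raise flag
}

int consumer() {
  while (!ready.load(std::memory_order_relaxed)) {}     // spin on flag
  std::atomic_thread_fence(std::memory_order_acquire);  // pair with release
  return partialSum;                                    // payload now visible
}
```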

Finally (after much struggle), the growing factor in the allocator has been reduced to the minimum (2) that seems to fit all kinds of data allocation patterns, reducing the actual memory used by a factor of 2.

With 16 threads on 5K events of the usual lumi section, this PR tops out at 1.4 GB of memory, while the current release (also with growing factor 2) tops out at 2.4 GB.

The quadruplet benchmark on a T4 with 8 threads runs at 1050 Hz, while the current release barely reaches 900 Hz (with MPS).

Running multiple jobs for triplets, 4 times over 5000 events with 4 jobs, each with 5 threads, 5 streams and 1 GPU:

on T4:
this PR: 707.8 ± 7.2 ev/s
reference: 543.5 ± 1.4 ev/s
on V100:
this PR: 1547.1 ± 1.6 ev/s
reference: 1380.1 ± 1.0 ev/s
on V100 with hyperthreads (8 jobs, each with 5 threads, 5 streams and 1 GPU):
this PR: 1616.6 ± 1.0 ev/s
reference: 1429.2 ± 1.1 ev/s

Purely technical. No regressions expected, besides minor ones due to reduced overflows.

// Smallest bin, corresponds to binGrowth^minBin bytes (min_bin in cub::CachingDeviceAllocator)
constexpr unsigned int minBin = 1;
constexpr unsigned int minBin = 8;

@fwyzard fwyzard Jul 13, 2020


so, the smallest bin is now 256 (instead of 8) bytes ...


(which makes sense, I don't think cudaMalloc actually returns memory chunks smaller than 256 bytes, since in all the tests I ran it looks like the memory is always aligned at least to that)

// Largest bin, corresponds to binGrowth^maxBin bytes (max_bin in cub::CachingDeviceAllocator). Note that unlike in cub, allocations larger than binGrowth^maxBin are set to fail.
constexpr unsigned int maxBin = 10;
constexpr unsigned int maxBin = 30;

@fwyzard fwyzard Jul 13, 2020


... and the largest is 1 GB (as before) ?

@@ -13,11 +13,11 @@ namespace cms::cuda::allocator {
// Use caching or not
constexpr bool useCaching = true;
// Growth factor (bin_growth in cub::CachingDeviceAllocator)
constexpr unsigned int binGrowth = 8;
constexpr unsigned int binGrowth = 2;

Makes sense.
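Taken together, the three parameters above define the bin ladder: bin i serves allocations up to binGrowth^i bytes, for minBin ≤ i ≤ maxBin. A quick standalone sketch of what the new values imply (the `ipow` helper is illustrative, not from the PR):

```cpp
#include <cstdint>

// Values copied from this diff: the caching allocator's bin i serves
// allocations up to binGrowth^i bytes, for minBin <= i <= maxBin.
constexpr unsigned int binGrowth = 2;
constexpr unsigned int minBin = 8;   // smallest bin: 2^8  = 256 bytes
constexpr unsigned int maxBin = 30;  // largest bin:  2^30 = 1 GiB

// Illustrative helper: integer power by recursion.
constexpr std::uint64_t ipow(std::uint64_t b, unsigned int e) {
  return e == 0 ? 1 : b * ipow(b, e - 1);
}

static_assert(ipow(binGrowth, minBin) == 256, "smallest bin is 256 B");
static_assert(ipow(binGrowth, maxBin) == (1ull << 30), "largest bin is 1 GiB");
```

With the old binGrowth = 8 the ladder jumped by a factor 8 per bin, so a request could waste up to 8x its size in internal fragmentation; with a factor 2 the worst-case waste drops to 2x, consistent with the factor-2 memory reduction quoted in the PR description.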

T* __restrict__ co,
template <typename VT, typename T>
__host__ __device__ __forceinline__ void blockPrefixScan(VT const* ci,
VT* co,

is VT supposed to be either T or volatile T ?

Author


yes, at least in this context
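As a minimal illustration (hypothetical names, host-only C++), splitting the element type into a separate VT parameter lets the same template bind the buffers as either T or volatile T:

```cpp
// Hypothetical sketch, not the PR's code: like blockPrefixScan above,
// the input/output buffers use VT while the workspace uses T, so the
// same template accepts both plain and volatile-qualified buffers.
template <typename VT, typename T>
void copyOne(VT const* ci, VT* co, T* /*ws*/) {
  co[0] = ci[0];
}

// VT = T:
//   int in[] = {7}, out[] = {0};
//   copyOne<int, int>(in, out, static_cast<int*>(nullptr));
// VT = volatile T:
//   volatile int vin[] = {7}, vout[] = {0};
//   copyOne<volatile int, int>(vin, vout, static_cast<int*>(nullptr));
```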

@@ -144,6 +144,9 @@ void CAHitNtupletGeneratorKernelsGPU::launchKernels(HitsOnCPU const &hh, TkSoA *
cudaDeviceSynchronize();
cudaCheck(cudaGetLastError());
#endif

// free space asap
// device_isOuterHitOfCell_.reset();

is this the change that didn't make any difference ?

Author


yes, I thought I had committed the version with the "reset"; will test again

@fwyzard

fwyzard commented Jul 13, 2020

Validation summary

Reference release CMSSW_11_1_0 at b7ad279
Development branch cms-patatrack/CMSSW_11_1_X_Patatrack at 1aeedb3
Testing PRs:

Validation plots

/RelValTTbar_14TeV/CMSSW_11_1_0_pre8-PU_111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 11634.5
  • tracking validation plots and summary for workflow 11634.501
  • tracking validation plots and summary for workflow 11634.502

/RelValZMM_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 11634.5
  • tracking validation plots and summary for workflow 11634.501
  • tracking validation plots and summary for workflow 11634.502

/RelValZEE_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 11634.5
  • tracking validation plots and summary for workflow 11634.501
  • tracking validation plots and summary for workflow 11634.502

Throughput plots

/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53

scan-136.885502.png
zoom-136.885502.png
scan-136.885522.png
zoom-136.885522.png

logs and nvprof/nvvp profiles

/RelValTTbar_14TeV/CMSSW_11_1_0_pre8-PU_111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 11634.5
  • development release, workflow 11634.5
  • development release, workflow 11634.501
  • development release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 11634.511
  • development release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • development release, workflow 11634.521
  • development release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • cuda-memcheck --tool initcheck (report, log) found 958624 errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • development release, workflow 136.885512
  • development release, workflow 136.885522
  • testing release, workflow 11634.5
  • testing release, workflow 11634.501
  • testing release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 11634.511
  • testing release, workflow 11634.512
    • step3.py: log
    • profile.py: log, profile and summary are missing, see the full log for more information
    • ⚠️ cuda-memcheck --tool initcheck did not run
    • ⚠️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all did not run
    • ⚠️ cuda-memcheck --tool synccheck did not run
  • testing release, workflow 11634.521
  • testing release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • cuda-memcheck --tool initcheck (report, log) found 924064 errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • testing release, workflow 136.885502
  • testing release, workflow 136.885512
  • testing release, workflow 136.885522

/RelValZMM_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 11634.5
  • development release, workflow 11634.5
  • development release, workflow 11634.501
  • development release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 11634.511
  • development release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • development release, workflow 11634.521
  • development release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • cuda-memcheck --tool initcheck (report, log) found 47120 errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • development release, workflow 136.885512
  • development release, workflow 136.885522
  • testing release, workflow 11634.5
  • testing release, workflow 11634.501
  • testing release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 11634.511
  • testing release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • testing release, workflow 11634.521
  • testing release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • cuda-memcheck --tool initcheck (report, log) found 45824 errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.885502
  • testing release, workflow 136.885512
  • testing release, workflow 136.885522

/RelValZEE_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 11634.5
  • development release, workflow 11634.5
  • development release, workflow 11634.501
  • development release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 11634.511
  • development release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • development release, workflow 11634.521
  • development release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • cuda-memcheck --tool initcheck (report, log) found 25968 errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • development release, workflow 136.885512
  • development release, workflow 136.885522
  • testing release, workflow 11634.5
  • testing release, workflow 11634.501
  • testing release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 11634.511
  • testing release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • testing release, workflow 11634.521
  • testing release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • cuda-memcheck --tool initcheck (report, log) found 26400 errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.885502
  • testing release, workflow 136.885512
  • testing release, workflow 136.885522

Logs

The full log is available at https://patatrack.web.cern.ch/patatrack/validation/pulls/4bdacc448b68f2ca3504723cfec3494356ef6598/log .

@fwyzard

fwyzard commented Jul 14, 2020

At least, the pixel part looks good ...

@VinInn

VinInn commented Jul 14, 2020

The memcheck failures are not related to this PR, are they?

@fwyzard

fwyzard commented Jul 14, 2020

The memcheck failures are not related to this PR, are they?

No, they are in the ECAL-only and/or HCAL-only workflows.

@fwyzard

fwyzard commented Jul 15, 2020

For comparison, here is the memory usage of

  • CMSSW_11_1_0_Patatrack
  • only the changes to the caching allocator
  • the full PR

running N jobs each with 4 threads/streams, on Run 3 MC TTbar 1000 events, with pixel triplets:

jobs   11.1.0   alloc. only   #509    (memory in MB)
 10     11006        9260     5548
 11     11921       10187     6097
 12     12996       10990     6638
 13     14039       11779     7347
 14     14898       12570     7870
 15       n/a       13471     8487
 16       n/a       14108     9026
 17       n/a       15043     9397
 18       n/a         n/a     9828
 19       n/a         n/a    10029
 20       n/a         n/a    10812

Note: 28 MB are used by MPS

@fwyzard fwyzard merged commit a1d9d5c into cms-patatrack:CMSSW_11_1_X_Patatrack Jul 15, 2020
fwyzard pushed a commit that referenced this pull request Jul 15, 2020
Adjust the growth factor in the caching allocators to use more granular bins, reducing the memory wasted by the allocations.

Use a dynamic buffer for CA cells components.

Fix a possible data race in the prefix scan.
@makortel

running N jobs each with 4 threads/streams, on Run 3 MC TTbar 1000 events, with pixel triplets:

jobs   11.1.0   alloc. only   #509
 14     14898       12570     7870
 17       n/a       15043     9397
 20       n/a         n/a    10812

Nice! Just to clarify, is the job pixel-only, or does it include also ECAL and/or HCAL?

@fwyzard

fwyzard commented Jul 15, 2020

That was pixel-only, with triplets.

We don't have matrix workflows that combine all three...

@makortel

Thanks (that's what I thought, but wanted to make sure).
