use a dynamic buffer for CA cells components, adjust allocator growing factor to reduce memory used #509
Conversation
// Smallest bin, corresponds to binGrowth^minBin bytes (min_bin in cub::CachingDeviceAllocator)
constexpr unsigned int minBin = 1;
constexpr unsigned int minBin = 8;
so, the smallest bin is now 256 (instead of 8) bytes ...
(which makes sense, I don't think cudaMalloc actually returns memory chunks smaller than 256 bytes, since in all the tests I ran it looks like the memory is always aligned at least to that)
// Largest bin, corresponds to binGrowth^maxBin bytes (max_bin in cub::CachingDeviceAllocator). Note that unlike in cub, allocations larger than binGrowth^maxBin are set to fail.
constexpr unsigned int maxBin = 10;
constexpr unsigned int maxBin = 30;
... and the largest is 1 GB (as before) ?
@@ -13,11 +13,11 @@ namespace cms::cuda::allocator {
  // Use caching or not
  constexpr bool useCaching = true;
  // Growth factor (bin_growth in cub::CachingDeviceAllocator)
  constexpr unsigned int binGrowth = 8;
  constexpr unsigned int binGrowth = 2;
Makes sense.
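For reference, here is a minimal host-side sketch (not the actual cms::cuda::allocator code; the binSize helper is purely illustrative) of how the bin sizes follow binGrowth^bin with the new values:

```cpp
#include <cstdio>

// Values from the diff above.
constexpr unsigned int binGrowth = 2;
constexpr unsigned int minBin = 8;   // smallest bin: 2^8  = 256 bytes
constexpr unsigned int maxBin = 30;  // largest bin:  2^30 = 1 GiB

// binGrowth^bin, computed with integer arithmetic.
constexpr unsigned long long binSize(unsigned int bin) {
  unsigned long long size = 1;
  for (unsigned int i = 0; i < bin; ++i)
    size *= binGrowth;
  return size;
}

int main() {
  std::printf("smallest bin: %llu bytes\n", binSize(minBin));  // 256
  std::printf("largest bin:  %llu bytes\n", binSize(maxBin));  // 1073741824
}
```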
T* __restrict__ co,
template <typename VT, typename T>
__host__ __device__ __forceinline__ void blockPrefixScan(VT const* ci,
                                                          VT* co,
Is VT supposed to be either T or volatile T?
Yes, at least in this context.
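A simplified analogue (not the real blockPrefixScan) may make that concrete: with a separate VT, the same template can be instantiated with plain or volatile element types, while T keeps the non-volatile type of the workspace:

```cpp
#include <cstdint>

// Simplified stand-in for blockPrefixScan: only the signature pattern matters here.
template <typename VT, typename T>
__device__ inline void copyLike(VT const* ci, VT* co, uint32_t size, T* ws) {
  for (uint32_t i = 0; i < size; ++i)
    co[i] = ci[i];
  (void)ws;  // workspace unused in this sketch
}

__global__ void example(int const* in, int* out,
                        volatile int const* vin, volatile int* vout,
                        uint32_t n, int* ws) {
  copyLike(in, out, n, ws);    // VT deduced as int
  copyLike(vin, vout, n, ws);  // VT deduced as volatile int
}
```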
@@ -144,6 +144,9 @@ void CAHitNtupletGeneratorKernelsGPU::launchKernels(HitsOnCPU const &hh, TkSoA *
  cudaDeviceSynchronize();
  cudaCheck(cudaGetLastError());
#endif

  // free space asap
  // device_isOuterHitOfCell_.reset();
is this the change that didn't make any difference ?
Yes, I thought I had committed the one with the "reset"; will test again.
Validation summary
Reference release CMSSW_11_1_0 at b7ad279
Validation plots
/RelValTTbar_14TeV/CMSSW_11_1_0_pre8-PU_111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValZMM_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValZEE_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
Throughput plots
/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
At least, the pixel part looks good ...
The memcheck failures are not related to this PR, are they?
No, they are in the ECAL-only and/or HCAL-only workflows.
For comparison, here is the memory usage of running N jobs, each with 4 threads/streams, on Run 3 MC TTbar, 1000 events, with pixel triplets:
Note: 28 MB are used by MPS.
Adjust the growth factor in the caching allocators to use more granular bins, reducing the memory wasted by the allocations. Use a dynamic buffer for CA cells components. Fix a possible data race in the prefix scan.
Nice! Just to clarify, is the job pixel-only, or does it also include ECAL and/or HCAL?
That was pixel-only, with triplets. We don't have matrix workflows that combine all three...
Thanks (that's what I thought, but wanted to make sure).
The main objective of this PR is to revive the dynamic buffer for the CA cells components that was left out of last year's main merge because of crashes (see 6ec0bc7#diff-80b2ae8844f1bd61dff8c97dda310263R78-L70).
I took the opportunity to enlarge the buffers a bit to reduce overflows in large events.
I also took the opportunity to fix a possible data race that was left unresolved in prefixScan.
I used the code pattern in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions
On a V100 the crash can be triggered pretty easily; after the fix no more crashes were observed.
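For context, this is roughly the pattern from that section of the programming guide, reduced to a per-block sum rather than the actual prefix-scan kernel (the kernel and variable names below are illustrative): each block publishes its partial result, issues a __threadfence() so the write is visible device-wide, and only the block that receives the last ticket from the atomic counter combines the partials:

```cpp
__device__ unsigned int count = 0;

// Reduced illustration of the memory-fence pattern; one partial result per block.
__global__ void sumPartials(float const* in, volatile float* partial, float* result,
                            unsigned int numBlocks) {
  __shared__ bool isLastBlockDone;

  if (threadIdx.x == 0) {
    // publish this block's partial result
    partial[blockIdx.x] = in[blockIdx.x];
    // make the write visible to the other blocks before signalling completion
    __threadfence();
    // the block that gets the last ticket is the one allowed to combine the partials
    unsigned int ticket = atomicInc(&count, numBlocks);
    isLastBlockDone = (ticket == numBlocks - 1);
  }
  __syncthreads();

  if (isLastBlockDone && threadIdx.x == 0) {
    float total = 0.f;
    for (unsigned int i = 0; i < numBlocks; ++i)
      total += partial[i];  // safe: the fence above guarantees the writes are visible
    *result = total;
    count = 0;  // reset for the next launch
  }
}
```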
Finally (after much struggle), the growth factor in the allocator has been reduced to the minimum (2), which seems to fit all kinds of data allocation patterns and reduces the actual memory used by a factor of 2.
With 16 threads on 5K events of the usual lumi section,
this PR tops out at 1.4 GB of memory, while the current release (with growth factor 2 as well) tops out at 2.4 GB.
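A back-of-the-envelope illustration of why the finer bins waste less memory (the 200 MB request below is hypothetical, not taken from the workflow): a request is rounded up to the next power of the growth factor, so coarser bins can pad the returned block by a much larger amount:

```cpp
#include <cstdio>

// Round a request up to the next power of the growth factor (ignoring minBin/maxBin).
unsigned long long roundUpToBin(unsigned long long request, unsigned int growth) {
  unsigned long long bin = 1;
  while (bin < request)
    bin *= growth;
  return bin;
}

int main() {
  unsigned long long request = 200ull << 20;  // hypothetical 200 MB buffer
  std::printf("growth 8: rounded to %llu MB\n", roundUpToBin(request, 8) >> 20);  // 1024 MB
  std::printf("growth 2: rounded to %llu MB\n", roundUpToBin(request, 2) >> 20);  // 256 MB
}
```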
The benchmark for quadruplets on a T4 with 8 threads is at 1050 Hz, while the current release barely reaches 900 Hz (with MPS).
Running multiple jobs on a T4 for triplets:
Running 4 times over 5000 events with 4 jobs, each with 5 threads, 5 streams and 1 GPU
this PR: 707.8 ± 7.2 ev/s
reference: 543.5 ± 1.4 ev/s
And on a V100:
this PR: 1547.1 ± 1.6 ev/s
reference: 1380.1 ± 1.0 ev/s
On a V100 with hyperthreads:
Running 4 times over 5000 events with 8 jobs, each with 5 threads, 5 streams and 1 GPU
this PR: 1616.6 ± 1.0 ev/s
reference: 1429.2 ± 1.1 ev/s
Purely technical. No regression is expected besides a minor one due to reduced overflows.