[cudadev] Improve caching allocator performance #218
Comments
IIUC this will make the caching allocator very tightly bound to the CMS-specific "CUDA framework", right?
The caching allocator would get a technical dependence on The Together these would allow
Like @fwyzard, I am worried about extending the scope of the allocator. I would prefer a hierarchy of concerns, with base classes solving limited problems and higher-level classes building on top of them (although this is not always achievable). There are a few costly features right now in the allocator:
A run of
We can see from this that the caching allocator does a good job of caching (only about 687 device + 122 host real allocations (for cudaMalloc + cudaHostAlloc or about 781 from event creation) for 410k frees (from the We could imagine replacing (Regarding the events, I extrapolated that we have 7 CUDA events per CMS event due to the EDM (using
I'm concerned, as you noted as well, that doing At the time of cms-patatrack/cmssw#412 I actually tried to use My motivation to move the CUDA event to be given in the allocation call stems from
OK, I see: we explicitly create the event and all users (here, the allocation) ride the same event, pushing the optimization beyond memory allocation. I now have a few SoAs in place, and the memory allocation is consolidated, so we should get a similar effect (I first targeted the places where the most consolidation can be obtained). I ran current master (d78674b) and the currently rebased https://github.com/ericcano/pixeltrack-standalone/tree/macroBasedSoA (ericcano@b1fd2d0). I still get similar statistics in the reference:
With the SoA in, the number of calls to cudaEventRecord drops significantly:
Yet the global performance is the same (16 threads, 24 streams, 10k events, no transfers, no validation). The reference gets 887 events/s, while SoA gets 891 events/s. The global cost of All the green blocks in the following plot are We can see the neighboring thread remaining locked during this streak: Tooltips confirm both are working on device memory and hence working with the same mutex. Finally, looking from the system call perspective, we do not see many pthread_mutex_lock calls in either case. This is still the leading cause of wait:
With SoA:
I would interpret this low number as a sign that the system calls were short enough to go under nsys's sampling radar. An As a conclusion, it seems CUDA's (and Linux's) internal optimizations are already working around our repetitive calling of
The generalization of the caching allocator in #216 makes it easier to make various improvements to the caching allocator. #211 (comment) shows a measurement suggesting that the mutex in the caching allocator is the bottleneck (my studies ~2 years ago pointed more to the mutex in CUDA, but things seem to have evolved). This PR is to discuss improvement ideas, with a(n ordered) plan shown below:
1. ScopedContext and the caching allocator: have the ScopedContext pass a SharedEventPtr to the caching allocator (evolution of ideas in "[RFC] Reduce calls to cudaEventRecord() via the caching allocators" cmssw#412 and "[RFC] Add make_device_unique() functions to ScopedContextBase" cmssw#487)
2. Replace multiset with nested vectors (device, bin) for (much) faster lookup (from "[cudadev] Generalize caching allocator" #216 (comment))
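To make the nested-vector idea concrete, here is a minimal sketch of a free-block cache indexed directly by (device, bin). It is not the cudadev implementation: `CachedBlock`, `NestedFreeList`, and all member names are hypothetical, bins are assumed to be geometric (bin `b` holds blocks of up to `growth^b` bytes), and the CUDA event and stream bookkeeping that a real caching allocator needs before reusing a block is left out. It assumes C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical types for illustration only; not the cudadev API.
struct CachedBlock {
  void* ptr;
  std::size_t bytes;
};

class NestedFreeList {
public:
  NestedFreeList(int nDevices, unsigned minBin, unsigned maxBin, std::size_t growth)
      : minBin_(minBin),
        maxBin_(maxBin),
        growth_(growth),
        free_(nDevices, std::vector<std::vector<CachedBlock>>(maxBin - minBin + 1)) {}

  // Smallest bin in [minBin, maxBin] whose size growth^bin can hold `bytes`;
  // empty if the request is larger than the largest bin.
  std::optional<unsigned> binFor(std::size_t bytes) const {
    unsigned bin = 0;
    std::size_t size = 1;
    while (bin < minBin_) {
      size *= growth_;
      ++bin;
    }
    while (size < bytes) {
      if (bin == maxBin_)
        return std::nullopt;
      size *= growth_;
      ++bin;
    }
    return bin;
  }

  // Return a block to the cache; false if it does not fit any bin.
  bool put(int device, CachedBlock block) {
    assert(device >= 0 && device < static_cast<int>(free_.size()));
    auto bin = binFor(block.bytes);
    if (!bin)
      return false;
    free_[device][*bin - minBin_].push_back(block);
    return true;
  }

  // Direct index into [device][bin]; no ordered-container search is needed.
  std::optional<CachedBlock> tryGet(int device, std::size_t bytes) {
    assert(device >= 0 && device < static_cast<int>(free_.size()));
    auto bin = binFor(bytes);
    if (!bin)
      return std::nullopt;
    auto& blocks = free_[device][*bin - minBin_];
    if (blocks.empty())
      return std::nullopt;
    CachedBlock block = blocks.back();
    blocks.pop_back();
    return block;
  }

private:
  unsigned minBin_;
  unsigned maxBin_;
  std::size_t growth_;
  std::vector<std::vector<std::vector<CachedBlock>>> free_;  // [device][bin]
};
```

Compared with a `std::multiset` ordered by (device, bin), where each lookup is a logarithmic search through tree nodes, the lookup here is two vector indexings followed by a `pop_back`. The sketch does nothing about locking: in a multi-threaded allocator the container would still need to be protected, for example by the existing mutex.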