Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a flag to disable the caching for the allocators #205

Merged

Conversation

makortel
Copy link

Following the discussion in #172, this PR adds a flag to CUDAService to disable the caching for the allocators (to demonstrate how bad event-by-event cudaMalloc+cudaFree would be). Default behavior is unchanged, the caching can be disabled with process.CUDAService.allocator.enabled = False.

@fwyzard fwyzard force-pushed the CMSSW_10_4_X_Patatrack branch from 6110cf4 to b8334a2 Compare November 29, 2018 09:44
@fwyzard fwyzard changed the base branch from CMSSW_10_4_X_Patatrack to CMSSW_10_4_X November 29, 2018 12:59
@fwyzard fwyzard changed the base branch from CMSSW_10_4_X to CMSSW_10_4_X_Patatrack November 29, 2018 12:59
@fwyzard
Copy link

fwyzard commented Nov 29, 2018

Something doesn't look right if I disable the allocator:

Begin processing the 1st record. Run 321177, Event 187588887, LumiSection 142 on stream 5 at 29-Nov-2018 14:06:41.870 CET
Begin processing the 101st record. Run 321177, Event 187660569, LumiSection 142 on stream 2 at 29-Nov-2018 14:06:49.634 CET
Begin processing the 201st record. Run 321177, Event 188192711, LumiSection 142 on stream 6 at 29-Nov-2018 14:06:50.127 CET
/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/RecoLocalTracker/SiPixelRecHits/plugins/PixelRecHits.cu, line 159: cudaErrorInvalidConfiguration: invalid configuration argument


A fatal system signal has occurred: abort signal
...

Enabling some debug messages I get

...
hitbuilder: 24 clusters in module 1701. will write at 5547
hitbuilder: 8 clusters in module 101. will write at 561
launching getHits kernel for 801 blocks
hitbuilder: 9 clusters in module 1601. will write at 4851
hitbuilder: 4 clusters in module 1001. will write at 1121
hitbuilder: 16 clusters in module 1201. will write at 1426
hitbuilder: 9 clusters in module 1301. will write at 2118
hitbuilder: 15 clusters in module 1801. will write at 6666
hitbuilder: 6 clusters in module 1501. will write at 3888
hitbuilder: 8 clusters in module 1701. will write at 5787
hitbuilder: 9 clusters in module 1401. will write at 2960
launching getHits kernel for 295 blocks
launching getHits kernel for 0 blocks
/mnt/home/fwyzard/patatrack/CMSSW_10_4_0_pre2_Patatrack/src/RecoLocalTracker/SiPixelRecHits/plugins/PixelRecHits.cu, line 159: cudaErrorInvalidConfiguration: invalid configuration argument

@makortel
Copy link
Author

Right, because the deallocation doesn't work that simply (whoops). I mean, with the caching enabled, one can destruct device::unique_ptr even if the device memory is still being used (it won't be re-used until a CUDA event recorded on the CUDA stream at the point of unique_ptr destructor has become occurred). Need to think a bit more...

@fwyzard
Copy link

fwyzard commented Nov 29, 2018

OK, if I follow what you mean, I think I would prefer the allocator not to "hide" this...

@makortel
Copy link
Author

@fwyzard So you'd want the "freed" device memory to be available for everyone immediately after the free?

@fwyzard
Copy link

fwyzard commented Nov 30, 2018

No, what I mean is that I would like to use of the allocator not to change the malloc/free semantic.

That is, once a block is returned to the allocator, it should be considered "available" for reuse, and if some (GPU ?) code uses the memory after the (CPU ?) code has returned it, bad things can happen.

Of course, as long as there is enough free memory, the allocator can hold on to the block, and try to give it back only to the same CUDA stream that used it last (or whatever caching we do).

But, the user code should not be allowed to rely on this behaviour for correctness, only for speed optimisations.

Does it make sense ?

@fwyzard
Copy link

fwyzard commented Nov 30, 2018

Summary of the chat with @makortel regarding the behaviour of the caching allocator, after looking at the code for the cub::CachingDeviceAllocator.

For large memory chunks (bigger than the largest bin):

  • allocations are synchronous, using cudaMalloc
  • deallocations are synchronous, using cudaFree

For small memory chunks (up to the size of the largest bin):

  • allocations can be synchronous, using cudaMalloc or memory returned to the pool from an idle CUDA stream, or asynchronous, reusing memory "freed" in the current CUDA stream
  • deallocations are asynchronous, marking the memory are reusable by the current CUDA stream

Since work within each CUDA stream is serialised, it is possible to do something along the lines of (pseudocode by @makortel):

dev_mem = allocate(1024, stream);
kernel<<<1,1, 0, stream>>>(dev_mem);
cudaMemcpyAsync(host_mem, dev_mem, stream);
free(dev_mem);
dev_mem2 = allocate(1024, stream);
kernel2<<<1,1, 0, stream>>>(dev_mem2);

Here free(dev_mem) marks the chunk used by dev_mem as "available within the stream stream".
dev_mem2 is likely to receive a pointer to the same chunk.
Launching kernel2 using dev_mem2 is assumed to be fine, because the operations associated to the CUDA stream stream are serial:

  • execute kernel that possibly writes to the chunk via dev_mem;
  • read the memory via dev_mem;
  • execute kernel2 that possibly writes to the chunk via dev_mem2.

If the allocator is replaced by direct calls to cudaMalloc/cudaFree, the behaviour changes, and the assumption is no longer valid.

@fwyzard

This comment has been minimized.

@makortel

This comment has been minimized.

@fwyzard

This comment has been minimized.

@fwyzard
Copy link

fwyzard commented Dec 2, 2018

I think we should settle on the semantic we want, and then update the allocator to make it consistent with it. I can think of three options:

  1. synchronous: memory operations on the device are synchronous with the host: memory is allocated and deallocated immediately (à la cudaMalloc()/cudaFree())
  2. asynchronous: memory operations on the device are asynchronous from the host, and synchronous with a CUDA stream (or other constructs like CUDA graphs)
  3. lazy: memory allocations on the device are synchronous with the host (cudaMalloc()), while memory deallocations are asynchronous from the host, and synchronous with a CUDA stream.

The cub::CachingDeviceAllocator is doing a mixture of all three: 1. for large memory chunks, 2. for small memory chunks recycled from the pool, 3. for small memory chunks when the allocation pool is grown.

I suspect that what we want for temporary buffers is more along the lines of 2, to avoid issuing a synchronisation every time.

@fwyzard
Copy link

fwyzard commented Dec 2, 2018

Few more thing to consider:

  • do we want to be able to allocate and deallocate memory from device code ?
  • do we want to be able to allocate memory on the host, and deallocate it from the device(or the opposite) ?
  • do we want to support managed memory ?

@makortel
Copy link
Author

makortel commented Dec 3, 2018

@fwyzard I agree with your assessment in #205 (comment). On your questions

  • do we want to support managed memory ?

I was thinking to add a caching managed allocator and try that out.

(should we move the discussion not being exactly on "simple way to disable caching for testing" back to #138 ?)

@fwyzard
Copy link

fwyzard commented Dec 3, 2018

Sure... I'll copy there my comments.

@makortel makortel force-pushed the cudaServiceAllocNoCache branch from 1d2607f to ace925f Compare January 2, 2019 23:28
@makortel
Copy link
Author

makortel commented Jan 2, 2019

Rebased on top of 10_4_0_pre3_Patatrack.

With handful of tests (with multiple threads, EDM streams, CUDA devices) I have not been able to reproduce #205 (comment).

@fwyzard fwyzard added this to the CMSSW_10_4_X_Patatrack milestone Jan 8, 2019
@fwyzard
Copy link

fwyzard commented Jan 9, 2019

Indeed, I have not run into any problems testing today, neither on a pair of P100 nor V100, neither with the "profile" workflow on data nor with the "step3" workflow on TTbar.

@fwyzard
Copy link

fwyzard commented Jan 9, 2019

On a pair of P100:

$ multirun.py profile.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
  1904.3 ±   1.5 ev/s (4000 events, 98.9% overlap)
  1903.8 ±   1.3 ev/s (4000 events, 98.6% overlap)
  1904.5 ±   1.7 ev/s (4000 events, 98.8% overlap)
  1904.8 ±   1.7 ev/s (4000 events, 98.8% overlap)

$ multirun.py profile-no-allocator.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
   508.9 ±   0.2 ev/s (4000 events, 99.8% overlap)
   510.3 ±   0.2 ev/s (4000 events, 99.6% overlap)
   510.2 ±   0.2 ev/s (4000 events, 99.9% overlap)
   507.6 ±   0.2 ev/s (4000 events, 99.9% overlap)

$ multirun.py step3.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
     3.3 ±   0.0 ev/s (300 events, 99.2% overlap)
     3.3 ±   0.0 ev/s (300 events, 98.9% overlap)
     3.3 ±   0.0 ev/s (300 events, 98.6% overlap)
     3.3 ±   0.0 ev/s (300 events, 99.9% overlap)

$ multirun.py step3-no-allocator.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
     3.3 ±   0.0 ev/s (300 events, 99.0% overlap)
     3.3 ±   0.0 ev/s (300 events, 98.8% overlap)
     3.3 ±   0.0 ev/s (300 events, 99.0% overlap)
     3.3 ±   0.0 ev/s (300 events, 99.0% overlap)

On a pair of V100:

$ multirun.py profile.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
  2814.7 ±   2.0 ev/s (4000 events, 99.4% overlap)
  2818.8 ±   2.4 ev/s (4000 events, 99.6% overlap)
  2814.9 ±   3.1 ev/s (4000 events, 99.5% overlap)
  2814.1 ±   3.0 ev/s (4000 events, 99.1% overlap)

$ multirun.py profile-no-allocator.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
   518.5 ±   0.2 ev/s (4000 events, 99.3% overlap)
   516.0 ±   0.2 ev/s (4000 events, 99.0% overlap)
   518.7 ±   0.2 ev/s (4000 events, 99.7% overlap)
   508.9 ±   0.1 ev/s (4000 events, 98.8% overlap)

$ multirun.py step3.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
     3.8 ±   0.0 ev/s (300 events, 99.6% overlap)
     3.8 ±   0.0 ev/s (300 events, 99.6% overlap)
     3.9 ±   0.0 ev/s (300 events, 98.3% overlap)
     3.8 ±   0.0 ev/s (300 events, 99.4% overlap)

$ multirun.py step3-no-allocator.py
Warming up
Running 4 times over 4200 events with 2 jobs, each with 16 threads, 8 streams and 2 GPUs
     3.8 ±   0.0 ev/s (300 events, 99.8% overlap)
     3.9 ±   0.0 ev/s (300 events, 98.7% overlap)
     3.8 ±   0.0 ev/s (300 events, 98.7% overlap)
     3.8 ±   0.0 ev/s (300 events, 99.7% overlap)

@fwyzard fwyzard merged commit 59fe318 into cms-patatrack:CMSSW_10_4_X_Patatrack Jan 9, 2019
@fwyzard fwyzard added fixed and removed later labels Jan 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants