Incorporating CUDA unified memory into Caffe #2775
So I did a few GEMM operations using unified memory, and it greatly decreases performance. I multiplied two 8196x8196 matrices; the operation took about 300ms using UVA and ~45s using unified memory. It seems like it's a no-go. It may still help Caffe users on workstations process images that would be too large for workstation-grade GPUs, but it's far too slow for research purposes. That still leaves the question of memory management on GPUs. I'll keep working on a solution...
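For reference, a benchmark along these lines can be sketched roughly as below; the 8192 matrix size, the choice of cublasSgemm, and the event timing are illustrative, not the exact code I ran:

```cpp
// Rough benchmark sketch: time one SGEMM with plain device memory
// (cudaMalloc + explicit copies) versus managed memory (cudaMallocManaged).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

static float time_sgemm(int n, bool use_managed) {
  size_t count = size_t(n) * n;
  size_t bytes = count * sizeof(float);
  std::vector<float> host(count, 1.0f);

  float *A, *B, *C;
  if (use_managed) {
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    // Make sure the device is idle before the host touches managed
    // memory (required on pre-Pascal GPUs), then initialize directly.
    cudaDeviceSynchronize();
    for (size_t i = 0; i < count; ++i) { A[i] = B[i] = 1.0f; }
  } else {
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemcpy(A, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B, host.data(), bytes, cudaMemcpyHostToDevice);
  }

  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, A, n, B, n, &beta, C, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  cublasDestroy(handle);
  cudaFree(A); cudaFree(B); cudaFree(C);
  cudaEventDestroy(start); cudaEventDestroy(stop);
  return ms;
}

int main() {
  printf("device memory:  %.1f ms\n", time_sgemm(8192, false));
  printf("managed memory: %.1f ms\n", time_sgemm(8192, true));
  return 0;
}
```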
@jmerkow while we were aware of unified memory (UM), we decided to adopt only unified addressing (UVA), since Caffe is already careful about its allocation and communication and -- as you have confirmed -- is much faster than UM, at least out of the box. For memory management, there are of course more sophisticated mechanisms than what we have now (lazy allocation and on-demand communication, plus the special case of data layer prefetching), but many use cases should be covered by a memory pool such as #2738.
Note that as a workaround one can divide an input image / volume and partially compute a net on each fragment, then collect these fragments on the host to form an intermediate feature map, and finally compute the rest of the net on the device. This is obviously hands-on, and not dynamic as you requested, but it can be done through the current interface if need be. It could be sped up by doing async host-device transfers while computing the net on a given fragment, as sketched below. Closing, since this addresses the question of UM, but please open a further issue for any other memory proposals you might have.
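To make the overlap idea concrete, here is a rough sketch; the process_fragment kernel, the two-stream layout, and the fragment sizes are made up for illustration and are not part of Caffe's interface:

```cpp
// Sketch of overlapping host<->device copies with per-fragment compute
// using two CUDA streams. process_fragment stands in for running part
// of a net on one fragment.
#include <cuda_runtime.h>

__global__ void process_fragment(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;  // placeholder for real per-fragment work
}

int main() {
  const int kFragments = 8;
  const int kFragSize = 1 << 20;          // elements per fragment (illustrative)
  const size_t bytes = kFragSize * sizeof(float);

  // Pinned host memory is required for truly asynchronous copies.
  float *h_in, *h_out;
  cudaMallocHost(&h_in, kFragments * bytes);
  cudaMallocHost(&h_out, kFragments * bytes);

  float *d_in[2], *d_out[2];
  cudaStream_t stream[2];
  for (int s = 0; s < 2; ++s) {
    cudaMalloc(&d_in[s], bytes);
    cudaMalloc(&d_out[s], bytes);
    cudaStreamCreate(&stream[s]);
  }

  for (int f = 0; f < kFragments; ++f) {
    int s = f % 2;  // alternate streams so the copy for fragment f+1
                    // overlaps the compute for fragment f
    cudaMemcpyAsync(d_in[s], h_in + size_t(f) * kFragSize, bytes,
                    cudaMemcpyHostToDevice, stream[s]);
    process_fragment<<<(kFragSize + 255) / 256, 256, 0, stream[s]>>>(
        d_in[s], d_out[s], kFragSize);
    cudaMemcpyAsync(h_out + size_t(f) * kFragSize, d_out[s], bytes,
                    cudaMemcpyDeviceToHost, stream[s]);
  }
  cudaDeviceSynchronize();  // all fragments are now collected on the host

  for (int s = 0; s < 2; ++s) {
    cudaFree(d_in[s]); cudaFree(d_out[s]); cudaStreamDestroy(stream[s]);
  }
  cudaFreeHost(h_in); cudaFreeHost(h_out);
  return 0;
}
```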
@jmerkow, you might want to take a look at this pull request: NVIDIA#11.
Thank you for the responses. However, this is abysmally slow, since it requires numerous host->device and device->host memory copies; these are by far the slowest part of the process. It would also require deallocating the bottom blobs from the GPU, and to my knowledge there is currently no way to do this. In addition, it looks like most (if not all) of the memory loaded on the GPU persists until the entire network is deallocated. Is this due to the lazy allocation? Is there a particular need to keep this data on the GPU? Do you see any problem with freeing up GPU memory after a layer's computation?
This is now a reality in CUDA 8. Are we likely to see this?
Hello,
I have been working on a number of development efforts to extend Caffe to 3D and ND. With n-dimensional images, memory management on GPUs becomes a central issue; even on K40s, one quickly runs out of memory for moderately sized images. Though I only have one PR publicly available here, a lot of my effort has gone towards managing memory on the GPU.
Currently, I am using shared blobs to store temporary items on the GPU (similar to these PRs: #2016, #2009), and I can train on very large images (Cx512x512x512), but this requires explicit copying to/from the GPU and CPU, which is slow and doesn't scale to lower-memory GPUs such as those in workstations. Ideally, this management would happen dynamically to minimize computation time.
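For illustration, the shared temporary-buffer idea boils down to something like the sketch below; the class and usage names are made up for the example, not the actual shared-blob code in my branches:

```cpp
// Hypothetical shared device scratch buffer: grows to the largest
// requested size and is reused across layers, so only one temporary
// allocation lives on the GPU at a time.
#include <cuda_runtime.h>
#include <cstddef>

class SharedScratch {
 public:
  // Returns a device pointer with at least `bytes` of space,
  // reallocating only when the request exceeds the current capacity.
  void* get(size_t bytes) {
    if (bytes > capacity_) {
      if (ptr_) cudaFree(ptr_);
      cudaMalloc(&ptr_, bytes);
      capacity_ = bytes;
    }
    return ptr_;
  }
  ~SharedScratch() { if (ptr_) cudaFree(ptr_); }

 private:
  void* ptr_ = nullptr;
  size_t capacity_ = 0;
};

// Usage: every layer borrows the same scratch space for its temporaries.
// SharedScratch scratch;                               // shared across layers
// float* col_buf = static_cast<float*>(scratch.get(col_bytes));
// ... im2col / gemm using col_buf ...
```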
Recently, CUDA's unified memory (UM) system was brought to my attention; it has been part of the CUDA SDK since version 6 (link). It seems like a large improvement towards simplifying device abstraction in Caffe, to accompany development in recent PRs: device abstraction, de-duplication of forward/backward, and OpenCL.
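For anyone unfamiliar, the appeal is that a single managed allocation is visible to both host and device, so the explicit to/from copies above disappear. A minimal sketch (note that on pre-Pascal hardware the runtime migrates pages at kernel launch and requires a synchronize before the host touches the data again):

```cpp
// Minimal unified-memory sketch: one pointer, no explicit
// cudaMemcpy between host and device.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float s) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= s;
}

int main() {
  const int n = 1 << 20;
  float* data = nullptr;
  cudaMallocManaged(&data, n * sizeof(float));  // visible to host and device

  for (int i = 0; i < n; ++i) data[i] = 1.0f;   // host writes directly

  scale<<<(n + 255) / 256, 256>>>(data, n, 3.0f);
  cudaDeviceSynchronize();                      // required before host access

  printf("data[0] = %f\n", data[0]);            // host reads directly
  cudaFree(data);
  return 0;
}
```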
My question is: is there a particular reason UM was not incorporated when Caffe moved to UVA in this PR? It may simply be that UM was only about 6 months old at the time, and development on UVA probably began before UM was common knowledge (or while it was possibly still buggy), or it could be a larger architectural issue.
If there is no particular reason why UM was avoided, I am interested in contributing to Caffe in this regard (i.e. using UM in blobs), but this would involve a major change to blobs and synced memory. For example, the gpu/cpu_data functions would need major changes or might disappear completely. Due to the scope of the changes, I wanted to get a second opinion on this and would love tips/discussion from the main dev team to help guide development.
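For discussion purposes only, here is a rough sketch of what a UM-backed synced memory might look like; the method names mirror Caffe's existing accessors, but the implementation is hypothetical rather than a proposal-ready patch:

```cpp
// Hypothetical unified-memory-backed SyncedMemory: a single managed
// allocation replaces the separate cpu_ptr_/gpu_ptr_ pair, and the
// head-state tracking (HEAD_AT_CPU / HEAD_AT_GPU) largely goes away.
#include <cuda_runtime.h>
#include <cstddef>

class UnifiedSyncedMemory {
 public:
  explicit UnifiedSyncedMemory(size_t size) : size_(size) {}
  ~UnifiedSyncedMemory() { if (ptr_) cudaFree(ptr_); }

  // Both accessors return the same managed pointer; the cpu variants
  // synchronize so the host can safely touch pages after GPU work.
  const void* cpu_data() { alloc(); cudaDeviceSynchronize(); return ptr_; }
  const void* gpu_data() { alloc(); return ptr_; }
  void* mutable_cpu_data() { alloc(); cudaDeviceSynchronize(); return ptr_; }
  void* mutable_gpu_data() { alloc(); return ptr_; }
  size_t size() const { return size_; }

 private:
  void alloc() {  // lazy allocation, as in the existing SyncedMemory
    if (!ptr_) cudaMallocManaged(&ptr_, size_);
  }
  void* ptr_ = nullptr;
  size_t size_;
};
```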
--Jameson